How to get the href value (which changes based on session) from the "Inspect Element" - html

Requirement:
I need to get the data from a URL into S3 and build a data pipeline around it. I am trying to fetch the dataset using Python's requests.get().
However, the URL changes every few minutes, which makes it difficult to fetch the data with requests.get() without manually editing the Python script each time.
My Approach:
I am able to get the data into S3 and build the data pipeline. However, since the URL is dynamic, I am stuck there.
Code:
import csv
import boto3
import requests

def lambda_handler(event, context):
    # The session ID embedded in this URL changes every few minutes
    response = requests.get(
        "https://calcat.covid19.ca.gov/cacovidmodels/_w_d58f724d/session/d37479fcceb6b02080b6b7c23dd05046/download/dl.custom?w=d58f724d",
        verify=False,
    )
    # print(response.content)
    s3_client = boto3.client('s3', region_name='us-west-2')
    with open('/tmp/data.csv', 'w') as datacsvfile:
        writer = csv.writer(datacsvfile, delimiter=",")
        # decode_unicode=True yields str lines (not bytes), so they can be
        # split on ',' and written to the text-mode CSV file
        for line in response.iter_lines(decode_unicode=True):
            writer.writerows([line.strip().split(',')])
Query:
How can I get the href from the "Inspect element" panel, store it in a variable, and build the dynamic URL to make a REST call to? Below is the "Inspect element" markup I am referring to:
This href only appears in "Inspect element" and not in "View Source", which suggests it is injected by JavaScript at runtime.
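Since the href is generated at runtime, one approach (a minimal sketch, not a verified solution) is to let a real browser render the page and read the href from the live DOM with Selenium. The CSS selector below is a placeholder; substitute the selector for the anchor you see in "Inspect element", since that snippet is not shown above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import requests

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://calcat.covid19.ca.gov/cacovidmodels/')

# Placeholder selector: replace with the anchor from your "Inspect element" snippet
link = driver.find_element_by_css_selector('a.shiny-download-link')
download_url = link.get_attribute('href')  # fresh session URL on every page load
driver.quit()

response = requests.get(download_url, verify=False)

Because download_url is read from the rendered page on each run, the requests.get() call no longer needs manual edits when the session changes.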

Related

Export Response based on Name from Chrome Network tab in DevTools

Is it possible to somehow export the Response information from the Network tab of Chrome DevTools with Python?
Thank you.
You can do this by right-clicking on the request to be exported; there will be a "Save all as HAR with content" option. Alternatively, directly copy the string that the request returns.
If you're using Python, then you can get the same information that you see in the Response tab using the requests library:
>>> import requests
>>> response = requests.get("http://www.google.com")
>>> print(response.content)
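If you went the HAR route, note that a HAR export is plain JSON, so Python can read it directly. A minimal sketch, assuming a file exported as 'network.har' (the filename is arbitrary); response bodies are present only if you chose "Save all as HAR with content":

import json

with open('network.har', encoding='utf-8') as f:
    har = json.load(f)

# Each entry pairs a request with its recorded response
for entry in har['log']['entries']:
    url = entry['request']['url']
    body = entry['response']['content'].get('text', '')
    print(url, len(body))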

How can I handle my request being blocked by the website?

I am trying to grab the data table at url = 'https://quotes.wsj.com/index/HK/XHKG/HSI/historical-prices/download?num_rows=150&range_days=150&endDate=02/29/2020'. If I click the link in a browser, a CSV file is downloaded, so the link is correct. However, response.content contains a message saying "The request could not be satisfied. Request blocked." Can the server distinguish a manual click from a Python request and block the latter? Is there any way to work around it?
My code:
import requests
url = 'https://quotes.wsj.com/index/HK/XHKG/HSI/historical-prices/download?num_rows=150&range_days=150&endDate=02/29/2020'
response = requests.get(url)
print(response.content)
open('wsj.csv', 'wb').write(response.content)
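Yes, servers can tell clients apart by their request headers; the "The request could not be satisfied" page is typically produced by a CDN that rejects the default python-requests User-Agent. A common first workaround (a sketch, not guaranteed to defeat every block) is to send a browser-like User-Agent header:

import requests

url = 'https://quotes.wsj.com/index/HK/XHKG/HSI/historical-prices/download?num_rows=150&range_days=150&endDate=02/29/2020'
# Browser-like User-Agent; many CDN filters reject the python-requests default
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
if response.ok:
    with open('wsj.csv', 'wb') as f:
        f.write(response.content)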

Why is xpath's extract() returning an empty list for the href attribute of an anchor element?

Why do I get an empty list when trying to extract the href attribute of the anchor tag located on the following url: https://www.udemy.com/courses/search/?src=ukw&q=accounting using scrapy?
This is my code to extract the <a></a> element located inside the list-view-course-card--course-card-wrapper--TJ6ET class:
response.xpath("//div[#class='list-view-course-card--course-card-wrapper--TJ6ET']/a/#href").extract()
This site makes API calls to retrieve all the data.
You can use the scrapy shell to see the response that the site is returning.
scrapy shell 'https://www.udemy.com/courses/search/?src=ukw&q=accounting' and then view(response).
The data you are looking for is available from the following API call:
'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'. However, if you try to access this link directly, you will get a JSON object saying that you do not have permission to perform this action. How did I find this link? Load the URL in your browser, go to the Network tab of your developer tools, and look for XHR requests.
The following spider will first make a request to the primary link and then make a request to the api call.
You will have to parse the json object that was returned to obtain your data. If you want to scale this spider for more products, you might want to look for a pattern in the structure of the api call.
import scrapy

class UdemySpider(scrapy.Spider):
    name = 'udemy'
    newurl = 'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'

    def start_requests(self):
        urls = ['https://www.udemy.com/courses/search/?src=ukw&q=accounting']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.api_call)

    def api_call(self, response):
        print("Working on second page")
        yield scrapy.Request(url=self.newurl, callback=self.parse)

    def parse(self, response):
        # code to parse json object
        pass
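To finish the spider, parse() has to decode the JSON body. A minimal sketch of that method follows; the 'results', 'title', and 'url' keys are assumptions about the payload, so print response.text first to confirm the actual structure:

import json

# inside UdemySpider:
def parse(self, response):
    data = json.loads(response.text)
    for course in data.get('results', []):  # 'results' is an assumed key
        yield {
            'title': course.get('title'),  # assumed field names
            'url': course.get('url'),
        }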

failing to retrieve text from html using lxml and xpath

I'm working on a second-hand house pricing project, so I need to scrape information from one of the largest second-hand house trading platforms in China. Here's my problem: the info on the page and the corresponding element, as seen with Chrome's 'inspect' function, are as follows:
My code:
>>> from lxml import etree
>>> import requests
>>> url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>> r = requests.get(url)
>>> tree = etree.HTML(r.text)  # build the tree before querying it
>>> xiaoqu_avg_price = tree.xpath('//*[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>> xiaoqu_avg_price
[]
The returned empty list is not what I want (ideally it should be 73648). Furthermore, I viewed the page's HTML source code, which shows:
So what should I do to get what I want? And what does resblockCard mean? Thanks.
This site, like many others, uses AJAX to populate its content. If you make a similar request yourself, you can get the desired value in JSON format.
import requests
url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648
Note the two groups of numbers in the request URL. The first group comes from the original page URL; the second you can find under a script tag in the original page source: resblockId:'1111027378082'.
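To avoid hard-coding the second number, you could pull resblockId out of the page source first. A minimal sketch, assuming the script tag still embeds it in the quoted resblockId:'...' form shown above:

import re
import requests

page = requests.get('http://bj.lianjia.com/chengjiao/101101498110.html').text
# Pattern assumed from the snippet quoted above: resblockId:'1111027378082'
match = re.search(r"resblockId:'(\d+)'", page)
if match:
    rid = match.group(1)
    api_url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=' + rid
    print(requests.get(api_url).json()['data']['resblock']['unitPrice'])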
That XPath query is not working as expected because you are running it against the source code of the page as it is served by the server, not as it looks on a rendered browser page.
One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.
from selenium import webdriver
from lxml import html
driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close() # or quit() if there are no more pages to scrape
tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()
The above returns 73648 元/㎡.

Find element on webpage with Python 2.7 requests

I need to check whether an element is on a page; more specifically, I am trying to locate the element that indicates my proxy IP is banned in my requests session.
So far I have tried:
import requests
from lxml import html
page = requests.get("http://www.adidas.com/us")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//span[@id='ctl00_mainContentPlaceHolder_lblInvalidRequest']")
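The snippet stops just short of the actual presence test. Since xpath() returns a list, a minimal completion is to check its truthiness:

# An empty list means the element was not found on the page
if html_element:
    print('Invalid-request element found; this proxy looks banned')
else:
    print('Element not found; the page rendered normally')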