Failing to retrieve text from HTML using lxml and XPath

I'm working on a second-hand house pricing project, so I need to scrape information from one of the largest second-hand house trading platforms in China. Here's my problem: the info on the page and the corresponding element, found with Chrome's 'inspect' function, are as follows:
my code:
>>> from lxml import etree
>>> import requests
>>> url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>> r = requests.get(url)
>>> tree = etree.HTML(r.text)
>>> xiaoqu_avg_price = tree.xpath('//*[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>> xiaoqu_avg_price
[]
The returned empty list is not desirable (ideally it should be 73648). Furthermore, I viewed its HTML source code, which shows:
So what should I do to get what I want? And what does resblockCard mean? Thanks.

This site, like many others, uses AJAX to populate content. If you make a similar request, you can get the desired value as JSON.
import requests
url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648
Note the two groups of numbers in the request URL. The first group comes from the original page URL; the second can be found under a script tag in the original page source: resblockId:'1111027378082'.
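A minimal sketch of automating that lookup, assuming the resblockId still appears in the page source in exactly that resblockId:'...' form (the regex below is an assumption, not something the site guarantees):
import re
import requests

page_url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
house_id = page_url.rstrip('/').split('/')[-1].replace('.html', '')  # first group of numbers, from the page URL

source = requests.get(page_url).text
match = re.search(r"resblockId\s*:\s*'(\d+)'", source)  # second group, from the script tag
if match:
    ajax_url = 'http://bj.lianjia.com/chengjiao/resblock?hid={}&rid={}'.format(house_id, match.group(1))
    data = requests.get(ajax_url).json()
    print(data['data']['resblock']['unitPrice'])  # 73648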

That XPath query is not working as expected because you are running it against the page source as served by the server, not against the page as it is rendered in a browser.
One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.
from selenium import webdriver
from lxml import html
driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close() # or quit() if there are no more pages to scrape
tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()
The above returns 73648 元/㎡.
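PhantomJS has since been deprecated; the same approach works with headless Chrome. A sketch, assuming a recent Selenium and a chromedriver available on your PATH:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import html

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.quit()

tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
print(price[0].strip() if price else 'element not found')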

Related

Xpath not getting any data

I am trying to retrieve data from this rates website, but I just can't get anything. I have tried this same format for a different website and it works fine, so I'm not sure what is going on here.
import requests
from requests_html import HTMLSession
#get usd to gbp rate
session = HTMLSession()
response = session.get('https://www.xe.com/currencyconverter/convert/?Amount=1&From=USD&To=GBP')
rate = response.html.xpath('//span[@class="converterresult-toAmount"]/*')
print('USD to GBP Rate:',rate)
#(output): USD to GBP Rate: []
The HTML content returned from that URL is minimal, and does not contain the content you are attempting to target with XPath.
It appears that the content you are trying to scrape is rendered client-side by JavaScript (React). In the HTML that is returned, there is a comment:
<!-- WARNING: Automated extraction of rates is prohibited under the Terms of Use. -->
So, I don't think they want you extracting data from the pages with scripts and automation.

How to get the href value (which changes based on session) from the "Inspect Element"

Requirement:
To get the data from a URL into S3 and build a data pipeline. I am trying to get the dataset using python requests.get().
However, the URL changes every few minutes, and it becomes difficult to fetch the data using requests.get() without manual changes to the python script.
My Approach:
I am able to get the data to S3 and build the data pipeline. However, the URL is dynamic, so I am stuck there.
Code:
import json
import requests
import boto3
import csv
import re
def lambda_handler(event, context):
    response = requests.get("https://calcat.covid19.ca.gov/cacovidmodels/_w_d58f724d/session/d37479fcceb6b02080b6b7c23dd05046/download/dl.custom?w=d58f724d", verify=False)
    # print(response.content)
    s3_client = boto3.client('s3', region_name='us-west-2')
    with open('/tmp/data.csv', 'w') as datacsvfile:
        writer = csv.writer(datacsvfile, delimiter=",")
        for line in response.iter_lines():
            writer.writerows([line.strip().split(',')])
Query:
How can I get the href from the "Inspect element" section, store it in a variable, and build a dynamic URL to make a REST call to? Below is the "Inspect element" code I am referring to:
This href only appears in the "inspect element" and not in the "View Source Code".

Why is xpath's extract() returning an empty list for the href attribute of an anchor element?

Why do I get an empty list when trying to extract the href attribute of the anchor tag located on the following url: https://www.udemy.com/courses/search/?src=ukw&q=accounting using scrapy?
This is my code to extract the <a></a> element located inside the list-view-course-card--course-card-wrapper--TJ6ET class:
response.xpath("//div[@class='list-view-course-card--course-card-wrapper--TJ6ET']/a/@href").extract()
This site makes API calls to retrieve all the data.
You can use the scrapy shell to see the response that the site is returning.
scrapy shell 'https://www.udemy.com/courses/search/?src=ukw&q=accounting' and then view(response).
The data you are looking for is available at the following API call:
'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'. However, if you try to access this link directly, you will get a JSON object saying that you do not have permission to perform this action. How did I find this link? Load the URL in your browser, go to the Network tab in your developer tools, and look for XHR requests.
The following spider will first make a request to the primary link and then make a request to the api call.
You will have to parse the JSON object that is returned to obtain your data (a sketch of such a parse callback follows the spider below). If you want to scale this spider to more products, look for a pattern in the structure of the API call.
import scrapy

class UdemySpider(scrapy.Spider):
    name = 'udemy'
    newurl = 'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'

    def start_requests(self):
        urls = ['https://www.udemy.com/courses/search/?src=ukw&q=accounting']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.api_call)

    def api_call(self, response):
        print("Working on second page")
        yield scrapy.Request(url=self.newurl, callback=self.parse)

    def parse(self, response):
        # code to parse json object
        pass
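One possible way to fill in that parse callback, as a sketch only: the field names in Udemy's JSON ('results', 'title', 'url') are assumptions here and should be checked against the actual payload you see in the Network tab.
    def parse(self, response):
        import json  # or place at module level
        data = json.loads(response.text)
        # 'results', 'title' and 'url' are assumed keys; inspect the real response first
        for course in data.get('results', []):
            yield {
                'title': course.get('title'),
                'url': course.get('url'),
            }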

Find element on webpage with Python 2.7 requests

I need to see whether an element is on a page; more specifically, I am trying to locate the element that indicates my proxy IP is banned in my requests session.
So far I have tried:
import requests
from lxml import html
page = requests.get("http://www.adidas.com/us")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//span[@id='ctl00_mainContentPlaceHolder_lblInvalidRequest']")
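Since xpath() returns a list, a minimal presence check on that result could look like this (whether the ban page actually carries that span id is an assumption taken from the question):
is_banned = len(html_element) > 0  # non-empty list means the element was found
print('Proxy appears to be banned' if is_banned else 'No ban marker found')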

Where do I find the Google Places API Client Library for Python?

It's not under the supported libraries here:
https://developers.google.com/api-client-library/python/reference/supported_apis
Is it just not available with Python? If not, what language is it available for?
Andre's answer points you to the right place to reference the API. Since your question was Python-specific, allow me to show you a basic approach to building your search URL in Python. This example will get you all the way to searching content just a few minutes after you sign up for Google's free API key.
ACCESS_TOKEN = <Get one of these following the directions on the places page>
import urllib

def build_URL(search_text='', types_text=''):
    base_url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'  # Can change json to xml to change output type
    key_string = '?key=' + ACCESS_TOKEN  # First thing after the base_url starts with ? instead of &
    query_string = '&query=' + urllib.quote(search_text)
    sensor_string = '&sensor=false'  # Presumably you are not getting location from device GPS
    type_string = ''
    if types_text != '':
        type_string = '&types=' + urllib.quote(types_text)  # More on types: https://developers.google.com/places/documentation/supported_types
    url = base_url + key_string + query_string + sensor_string + type_string
    return url

print(build_URL(search_text='Your search string here'))
This code will build and print a URL that searches for whatever you put in the last line in place of "Your search string here". You need to build one of these URLs for each search. In this case I've printed it so you can copy and paste it into your browser's address bar, which will return (in the browser) the same JSON text object your program will get when it submits that URL. I recommend using the Python requests library to fetch it within your program, which you can do simply by taking the returned URL and doing this:
response = requests.get(url)
Next you need to parse the returned JSON, which you can do with the json library (see json.loads, for example). After running the response through json.loads you will have a plain Python dictionary with all your results. You can also paste the return (e.g. from the browser or a saved file) into an online JSON viewer to understand the structure while you write the code that accesses the dictionary json.loads produces.
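A short sketch of that parsing step, using the url returned by build_URL above; the 'results', 'name' and 'formatted_address' keys follow the Places text search response format, but treat them as assumptions and check the actual payload:
import json
import requests

response = requests.get(url)
data = json.loads(response.text)  # equivalently: data = response.json()
for place in data.get('results', []):
    print(place.get('name'), '-', place.get('formatted_address'))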
Please feel free to post more questions if part of this isn't clear.
Somebody has written a wrapper for the API: https://github.com/slimkrazy/python-google-places
Basically it's just HTTP with JSON responses. It's easier to access through JavaScript but it's just as easy to use urllib and the json library to connect to the API.
Ezekiel's answer worked great for me and all of the credit goes to him. I had to change his code in order for it to work with python3. Below is the code I used:
import urllib.parse

def build_URL(search_text='', types_text=''):
    base_url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'
    key_string = '?key=' + ACCESS_TOKEN
    query_string = '&query=' + urllib.parse.quote(search_text)
    type_string = ''
    if types_text != '':
        type_string = '&types=' + urllib.parse.quote(types_text)
    url = base_url + key_string + query_string + type_string
    return url
The changes: urllib.quote was replaced with urllib.parse.quote, and the sensor parameter was removed because Google has deprecated it.