Why is xpath's extract() returning an empty list for the href attribute of an anchor element?

Why do I get an empty list when trying to extract the href attribute of the anchor tag located on the following URL: https://www.udemy.com/courses/search/?src=ukw&q=accounting using Scrapy?
This is my code to extract the <a></a> element located inside the list-view-course-card--course-card-wrapper--TJ6ET class:
response.xpath("//div[@class='list-view-course-card--course-card-wrapper--TJ6ET']/a/@href").extract()

This site makes API calls to retrieve all the data.
You can use the scrapy shell to see the response that the site is returning.
scrapy shell 'https://www.udemy.com/courses/search/?src=ukw&q=accounting' and then view(response).
The data you are looking for is available at the following API call:
'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'. However, if you try to access this link directly, you will get a JSON object saying that you do not have permission to perform this action. How did I find this link? Load the URL in your browser, open the Network tab in your developer tools, and look for XHR requests.
The following spider will first make a request to the primary link and then make a request to the api call.
You will have to parse the json object that was returned to obtain your data. If you want to scale this spider for more products, you might want to look for a pattern in the structure of the api call.
import scrapy

class UdemySpider(scrapy.Spider):
    name = 'udemy'
    newurl = 'https://www.udemy.com/api-2.0/search-courses/?fields[locale]=simple_english_title&src=ukw&q=accounting'

    def start_requests(self):
        urls = ['https://www.udemy.com/courses/search/?src=ukw&q=accounting']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.api_call)

    def api_call(self, response):
        print("Working on second page")
        yield scrapy.Request(url=self.newurl, callback=self.parse)

    def parse(self, response):
        # code to parse the JSON object
        pass
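As a rough sketch of that last step: assuming the API keeps its course list under a results key with title and url fields (assumptions on my part; check the actual JSON in your browser's Network tab and adjust the keys), the parse callback might look like this:

import json  # add at the top of the spider module

    def parse(self, response):
        data = json.loads(response.body)
        # 'results', 'title' and 'url' are assumed key names -- verify them
        # against the real response before relying on this
        for course in data.get('results', []):
            yield {
                'title': course.get('title'),
                'url': course.get('url'),
            }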

Related

Flask - endpoint returning dictionary instead of list

Whenever I create an endpoint to return a list, it returns a number-keyed dictionary instead.
from flask import Flask

app = Flask(__name__)

@app.route('/somelist')
def somelist():
    return ['a', 'b', 'c']
When I go to view the endpoint, I get a dictionary like this: {0: 'a', 1: 'b', 2: 'c'}
What's going on here? I just want a list!
It is not a dictionary. The 0:, for example, is not part of the response you've returned.
The browser JSON extension you are using is showing the list along with its indices.
Instead, use curl or Postman and inspect the raw response without the browser's response parsing, or click on the Raw Data tab.
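If you want to confirm this from Python rather than the browser, a minimal check with the requests library (assuming the app is running locally on Flask's default port) looks like this:

import requests

resp = requests.get('http://localhost:5000/somelist')
print(resp.text)    # the raw body: ["a","b","c"]
print(resp.json())  # parsed back into a Python list: ['a', 'b', 'c']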

How to read UTM tags and redirect url in react

I have a URL with UTM tags as below. When a user clicks/hits the source URL below, I would like to read the UTM tags and redirect to another URL (the target).
Does anyone have a documentation link for reading UTM tags and redirecting the URL in React?
Example:
Source
https://www.CustomDomain.com/?utm_source=linkedin&utm_medium=email&utm_campaign=sale&utm_id=123&utm_term=job&utm_content=ad
Target
https://www.CustomDomain.com/dashbord
With the latest react-router-dom v6, you can use a new hook named useSearchParams. Use it to get the query params, and you can then store them in localStorage:
const [searchParams, setSearchParams] = useSearchParams();
searchParams.get("utm_source"); // similar for the rest of query params
With React Router v4, you need to use this.props.location.search in class components, or the useLocation hook in functional components, and parse the query parameters either yourself or with a package.

failing to retrieve text from html using lxml and xpath

I'm working on a second-hand house pricing project, so I need to scrape information from one of the largest second-hand house trading platforms in China. Here's my problem: the info on the page and the corresponding element, found using Chrome's 'inspect' function, are as follows:
My code:
>>> from lxml import etree
>>> import requests
>>> url = 'http://bj.lianjia.com/chengjiao/101101498110.html'
>>> r = requests.get(url)
>>> tree = etree.HTML(r.text)
>>> xiaoqu_avg_price = tree.xpath('//*[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')
>>> xiaoqu_avg_price
[]
The returned empty list is not what I want (ideally it should be 73648). Furthermore, I viewed the page's HTML source code, and the rendered value is not there.
So what should I do to get what I want? And what does resblockCard mean? Thanks.
This site, like many others, uses AJAX to populate content. If you make a similar request, you can get the desired value in JSON format.
import requests
url = 'http://bj.lianjia.com/chengjiao/resblock?hid=101101498110&rid=1111027378082'
# Get json response
response = requests.get(url).json()
print(response['data']['resblock']['unitPrice'])
# 73648
Note the two groups of numbers in the request URL. The first group comes from the original page URL; the second you can find under a script tag in the original page source: resblockId:'1111027378082'.
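If you would rather build that request URL programmatically than copy the id by hand, here is a minimal sketch; the resblockId:'...' regex is an assumption based on the snippet above, so verify it against the live page source:

import re
import requests

page_id = '101101498110'
source = requests.get('http://bj.lianjia.com/chengjiao/%s.html' % page_id).text

# Pull the resblock id out of the inline script tag
match = re.search(r"resblockId:'(\d+)'", source)
if match:
    ajax_url = ('http://bj.lianjia.com/chengjiao/resblock?hid=%s&rid=%s'
                % (page_id, match.group(1)))
    data = requests.get(ajax_url).json()
    print(data['data']['resblock']['unitPrice'])  # 73648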
That XPath query is not working as expected because you are running it against the source code of the page as it is served by the server, not as it looks on a rendered browser page.
One solution for this is to use Selenium in conjunction with PhantomJS or some other browser driver, which will run the JavaScript on that page and render it for you.
from selenium import webdriver
from lxml import html
driver = webdriver.PhantomJS(executable_path="<path to>/phantomjs.exe")
driver.get('http://bj.lianjia.com/chengjiao/101101498110.html')
source = driver.page_source
driver.close() # or quit() if there are no more pages to scrape
tree = html.fromstring(source)
price = tree.xpath('//div[@id="resblockCardContainer"]/div/div/div[2]/div/div[1]/span/text()')[0].strip()
The above returns 73648 元/㎡.

Scraping Json Data from a REST Api

I am learning Firebase with Android and I need a database to play with. This is the JSON request URL: https://yts.ag/api/v2/list_movies.json.
It contains a list of around 5000 movies that I need. So I searched around the internet and found a tool called Scrapy, but I have no idea how to use it with a REST API. Any help is appreciated.
First you'll need to follow the Scrapy Tutorial to create a scrapy project, and then your spider can be as simple as this:
import json

from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://yts.ag/api/v2/list_movies.json']

    def parse(self, response):
        json_response = json.loads(response.body)
        for movie in json_response['data']['movies']:
            yield Request(movie['url'], callback=self.parse_movie)

    def parse_movie(self, response):
        # work with every movie response
        yield {'url': response.url}
Very easy. Follow the tutorial and start from the URL of your REST endpoint. In your parse() or parse_item() function, use json.loads(response.body) to load the JSON document. Since Scrapy can now yield dicts, your code might be as simple as:
import json
...
def parse(self, response):
    return json.loads(response.body)
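As a side note, newer Scrapy versions (2.2+) expose a response.json() helper, so the same idea works without importing json yourself:

def parse(self, response):
    # response.json() parses the response body as JSON (Scrapy 2.2+)
    return response.json()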
Here's a slightly more advanced use case.

Where do I find the Google Places API Client Library for Python?

It's not under the supported libraries here:
https://developers.google.com/api-client-library/python/reference/supported_apis
Is it just not available with Python? If not, what language is it available for?
Andre's answer points you at the right place to reference the API. Since your question was Python-specific, allow me to show you a basic approach to building your search URL in Python. This example will get you all the way to search content just a few minutes after you sign up for Google's free API key.
ACCESS_TOKEN = <Get one of these following the directions on the places page>

import urllib

def build_URL(search_text='', types_text=''):
    base_url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'  # Can change json to xml to change output type
    key_string = '?key=' + ACCESS_TOKEN  # First thing after the base_url starts with ? instead of &
    query_string = '&query=' + urllib.quote(search_text)
    sensor_string = '&sensor=false'  # Presumably you are not getting location from device GPS
    type_string = ''
    if types_text != '':
        type_string = '&types=' + urllib.quote(types_text)  # More on types: https://developers.google.com/places/documentation/supported_types
    url = base_url + key_string + query_string + sensor_string + type_string
    return url

print(build_URL(search_text='Your search string here'))
This code will build and print a URL searching for whatever you put in the last line (replacing "Your search string here"). You need to build one of those URLs for each search. In this case I've printed it so you can copy and paste it into your browser address bar; the browser will show the same JSON text object your program will get when it submits that URL. I recommend using the Python requests library to fetch it from within your program, which you can do by taking the returned URL and calling:
response = requests.get(url)
Next you need to parse the returned JSON, which you can do with the json library (see json.loads, for example). After running the response through json.loads you will have a nice Python dictionary with all your results. You can also paste the return (e.g. from the browser or a saved file) into an online JSON viewer to understand the structure while you write the code that accesses the dictionary json.loads produces.
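Putting those pieces together, here is a minimal end-to-end sketch. The results and name keys follow the Places Text Search response format, but double-check them against a live response:

import requests

url = build_URL(search_text='pizza in New York')
response = requests.get(url)
data = response.json()  # shortcut for json.loads(response.text)

# Each entry in 'results' is one place matching the search
for place in data.get('results', []):
    print(place['name'])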
Please feel free to post more questions if part of this isn't clear.
Somebody has written a wrapper for the API: https://github.com/slimkrazy/python-google-places
Basically it's just HTTP with JSON responses. It's easier to access through JavaScript but it's just as easy to use urllib and the json library to connect to the API.
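For example, a bare-bones request using only the standard library (Python 3 here; YOUR_API_KEY is a placeholder):

import json
import urllib.parse
import urllib.request

url = ('https://maps.googleapis.com/maps/api/place/textsearch/json'
       '?key=YOUR_API_KEY&query=' + urllib.parse.quote('coffee in Seattle'))
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode('utf-8'))
print(data.get('status'))  # 'OK' on success, e.g. 'REQUEST_DENIED' otherwise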
Ezekiel's answer worked great for me and all of the credit goes to him. I had to change his code in order for it to work with python3. Below is the code I used:
import urllib.parse

def build_URL(search_text='', types_text=''):
    base_url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'
    key_string = '?key=' + ACCESS_TOKEN
    query_string = '&query=' + urllib.parse.quote(search_text)
    type_string = ''
    if types_text != '':
        type_string = '&types=' + urllib.parse.quote(types_text)
    url = base_url + key_string + query_string + type_string
    return url
The changes: urllib.quote was changed to urllib.parse.quote, and the sensor parameter was removed because Google has deprecated it.