How to efficiently crawl a website using Scrapy - html

I am trying to scrape a real estate website using Scrapy and PyCharm, and failing miserably.
Desired Results:
Scrape 1 base URL (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/) and 5 different internal URLs (https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/{i}-r/), where i = 1, 2, 3, 4, 5.
Crawl all pages of each internal URL, or do it from the base URL.
Get all href links, crawl each href link, and get the span tag data from inside each one.
Scrape around 5,000-7,000 unique listings as efficiently and as fast as possible.
Output data into a CSV file while keeping Cyrillic characters.
Note: I have attempted web-scraping using BeautifulSoup, but it took me around 1-2 minutes per listing and around 2-3 hours to scrape all listings with a for loop. A community member suggested Scrapy as a faster option. I'm unsure whether the speed comes from its data pipelines or from being able to make concurrent requests.
Any and all help is greatly appreciated.
Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.
Current Scrapy code: This is what I have so far. When I run scrapy crawl unegui_apts I cannot seem to get the results I want. I'm so lost.
# -*- coding: utf-8 -*-
# Import libraries
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
    ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]
            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request

        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

Your code has a number of issues:
The start_urls list contains invalid links
You defined your user_agent string in the headers dictionary but you are not using it when yielding requests
Your xpath selectors are incorrect
The next_url selector is incorrect, hence it does not yield new requests for the next pages
I have updated your code to fix the issues above as follows:
import scrapy
from scrapy.crawler import CrawlerProcess


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()

            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}

        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()
You run the above spider by executing the command python <filename>.py, since you are running a standalone script and not a full-blown project.
Sample CSV results are as shown in the image below. You will need to clean up the data using pipelines and the Scrapy Item class; see the docs for more details.
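As a rough sketch of that cleanup step (the field names follow the spider above; the whitespace stripping is only illustrative, not something the site requires), an Item plus a small pipeline could look like this:

# A minimal sketch of a Scrapy Item and a cleanup pipeline. Field names follow
# the spider above; the stripping logic is illustrative only.
import scrapy


class ApartmentItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()


class CleanupPipeline:
    def process_item(self, item, spider):
        # Works with the plain dicts the spider currently yields as well as
        # Item instances: strip stray whitespace from every string field.
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = value.strip()
        return item

To enable the pipeline in a standalone script, add it to ITEM_PIPELINES inside custom_settings (e.g. {'__main__.CleanupPipeline': 300} when everything lives in one file). Setting FEED_EXPORT_ENCODING to 'utf-8-sig' there as well should help keep the Cyrillic text readable when the CSV is opened in Excel.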

Related

Converting multiple json objects to a single dataframe/csv

I'm new to Python.
I'd like to know how to run the same process as the code below for multiple URLs.
# Code 1, which works perfectly
import pandas as pd
import requests

url = 'https://toyama.com.br/wp-json/wp/v2/assistencia?local=914&ramo=&_embed&per_page=100'
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

df = pd.read_json(url)
resp = requests.get(url, headers=header)
pandas_data_frame1 = df['acf'].apply(pd.Series)
pandas_data_frame1.to_csv('teste2.CSV', encoding='utf-8-sig')
# Code 2, which is not working as intended (multiple URLs; note that some URLs exist and others do not, and I need to deal with that)
import json

url1 = ['https://toyama.com.br/wp-json/wp/v2/assistencia?local=914&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=800&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=933&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=844&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=806&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=1207&ramo=&_embed&per_page=100']
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

for links in url1:
    df = pd.read_json(links)
    resp1 = requests.get(links, headers=header)
    data = json.loads(resp1.text)
    for d in data:
        pandas_data_frame1 = df['acf'].apply(pd.Series)
        pandas_data_frame1.to_csv('teste2.CSV', encoding='utf-8-sig')
# unfortunately this only saves the content of the last link, 'https://toyama.com.br/wp-json/wp/v2/assistencia?local=1207&ramo=&_embed&per_page=100'
What I need is a CSV where the JSON keys are columns, exactly like in code 1.
Kind regards!
Your code is working fine for most of the process. You are loading the data into your workspace and creating a dataframe with the information you need.
Nevertheless, you are overwriting the CSV file every time you read a new link. That is the reason your code saves only the last link's information.
I believe there are many ways to solve it. One simple strategy is to insert a counter that tells the code when to just process the information as you did, and when to join the dataframes into a single dataframe.
The code:
# Imports needed for this snippet
import json

import pandas as pd
import requests

# Links for scraping web data
url1 = ['https://toyama.com.br/wp-json/wp/v2/assistencia?local=914&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=800&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=933&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=844&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=806&ramo=&_embed&per_page=100',
        'https://toyama.com.br/wp-json/wp/v2/assistencia?local=1207&ramo=&_embed&per_page=100']
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Creating a counter to tell the code when to join dataframes.
# For the first link we just create the dataframe; for the others we join them into a single dataframe.
cont = 0

# Scraping the data
for links in url1:
    cont += 1
    # Printing which URL the code is reading
    print('loop:' + str(cont))
    df = pd.read_json(links)
    resp1 = requests.get(links, headers=header)
    data = json.loads(resp1.text)
    # First dataframe processing
    if cont == 1:
        for d in data:
            complete_df = df['acf'].apply(pd.Series)
    # Other dataframes processing
    else:
        for d in data:
            others_df = df['acf'].apply(pd.Series)
            complete_df = pd.concat([complete_df, others_df])

# Removing duplicates from the dataframe. I am not sure why, but apparently the code reads a few JSON files more than once.
complete_df = complete_df.drop_duplicates()

# Saving the CSV file
complete_df.to_csv('teste2.CSV', encoding='utf-8-sig')
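For reference, a more compact sketch of the same idea (it reuses the url1 list from above): collect each dataframe in a list and concatenate once at the end. pd.read_json already fetches and parses each link, so the separate requests/json calls can be dropped:

frames = []
for link in url1:
    # read_json fetches the URL and parses the JSON into a dataframe directly
    df = pd.read_json(link)
    frames.append(df['acf'].apply(pd.Series))

complete_df = pd.concat(frames, ignore_index=True).drop_duplicates()
complete_df.to_csv('teste2.CSV', encoding='utf-8-sig')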
I hope this may help you.

Soup.find_all returning an empty list

I am trying to scrape a player stats table for NBA stats using requests and BeautifulSoup, but the response I am getting is not the same as what I see using "Inspect Element".
The div containing this table has the class attribute class="nba-stat-table__overflow". However, whenever I run the following code I get an empty list:
table = soup.find_all('div',attrs={'class="nba-stat-table__overflow'})
Here is my full code:
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
url = 'https://stats.nba.com/players/boxscores/?Season=2018-19&SeasonType=Regular%20Season'
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
table = soup.find_all('div',attrs={'class="nba-stat-table__overflow'})
Basically, the page is loaded via JavaScript, so the bs4 and requests modules will not be able to render the JavaScript on the fly.
You should use the selenium or requests_html modules to render the JS. But I noticed that the website uses an API which can be used to fetch the data directly, so I've called it and extracted the data.
Check my previous answer, which explains how to fetch the API.
import requests
import pandas as pd

params = {
    "Counter": "1000",
    "DateFrom": "",
    "DateTo": "",
    "Direction": "DESC",
    "LeagueID": "00",
    "PlayerOrTeam": "P",
    "Season": "2018-19",
    "SeasonType": "Regular Season",
    "Sorter": "DATE"
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
    "x-nba-stats-origin": "stats",
    "x-nba-stats-token": "true",
    "Referer": "https://stats.nba.com/players/boxscores/?Season=2018-19&SeasonType=Regular%20Season"
}


def main(url):
    r = requests.get(url, params=params, headers=headers).json()
    goal = []
    for item in r['resultSets']:
        df = pd.DataFrame(item['rowSet'], columns=item['headers'])
        goal.append(df)
    new = pd.concat(goal)
    print(new)
    new.to_csv("data.csv", index=False)


main("https://stats.nba.com/stats/leaguegamelog")
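If you do prefer rendering the page instead of calling the API, a minimal Selenium sketch would look roughly like this (it assumes chromedriver is installed and on your PATH, and that the class name from the question is still what the live site uses):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://stats.nba.com/players/boxscores/?Season=2018-19&SeasonType=Regular%20Season')
time.sleep(5)  # crude wait so the JavaScript has time to populate the table

# Parse the rendered HTML rather than the raw response
soup = BeautifulSoup(driver.page_source, 'html.parser')
tables = soup.find_all('div', attrs={'class': 'nba-stat-table__overflow'})
driver.quit()

print(len(tables))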

web scrape save to specific json in python, bs4

I have the following Python code:
import requests
import json
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.3; nl-nl; SAMSUNG GT-I9505 Build/JSS15J) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.5 Chrome/28.0.1500.94 Mobile Safari/537.36'}

chapter = 0
while chapter < 3:
    url = 'http://www.komikgue.com/manga/one-piece/{chapter}/'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    mangas = soup.find_all('img', class_="img-responsive")
    chapter += 1

    def get_manga_details(manga):
        src = manga.find('img', class_="img-responsive").find("img")["src"]
        alt = manga.find('img', class_="img-responsive").find("img")["alt"]
        return {
            "chapter": chapter,
            "src": src, "alt": alt
        }

    all_mangas = [get_manga_details(manga) for manga in mangas]

    with open("manga.json", "w") as write_file:
        json.dump(all_mangas, write_file)
        print("Success")
This code runs in cmd but produces empty output, which is wrong. Please teach me.
I want it to be:
{
"chapter": "number": 1[
{
"src": "here", "alt" : "here",
"src": "here", "alt" : "here"
}]
}
Please guide me
There are a lot of things wrong with your code. First, the URL you are trying to access returns a 404; you need to rjust the chapter number with leading zeroes. Second, your logic and loops don't make much sense, like defining your function and lists inside the loop and then expecting the output to contain all the chapters. Moreover, you're calling BeautifulSoup's find function again inside your function, which is not needed; you can access the attributes directly.
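For example, rjust pads the chapter number on the left with zeroes so the URL matches the site's format:

# str.rjust pads on the left to the given width, e.g. 1 -> '001', 12 -> '012'
print(str(1).rjust(3, '0'))   # 001
print(str(12).rjust(3, '0'))  # 012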
See my code below; it works on my machine:
import requests
import json
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.3; nl-nl; SAMSUNG GT-I9505 Build/JSS15J) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.5 Chrome/28.0.1500.94 Mobile Safari/537.36'}
chapter = 1
allmangas = []


def get_manga_details(manga, i):
    print(manga)
    src = manga["src"]
    alt = manga["alt"]
    return {
        "number": i,
        "src": src, "alt": alt
    }


while chapter < 3:
    url = 'http://www.komikgue.com/manga/one-piece/' + str(chapter).rjust(3, '0')
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    mangas = soup.find_all('img', class_="img-responsive")
    print(mangas)
    allmangas.append({'chapter': chapter,
                      'data': [get_manga_details(manga, i) for i, manga in enumerate(mangas[:-1])]})
    chapter += 1

with open("manga.json", "w") as write_file:
    json.dump(allmangas, write_file)
    print("Success")

Assigning BeautifulSoup indexed values (HTML links and text) to a pandas HTML DataFrame

The following code retrieves images and HTML links from a webpage and stores the values in a BeautifulSoup index. I am now using pandas to create an output HTML table for those images and links. I have managed to populate cells manually by calling a specific index value, but I can't seem to find a way to add each indexed image and HTML text to the pandas DataFrame so that all the indexed values are displayed in the table. How could I do this?
from bs4 import BeautifulSoup
import requests
import numpy as np
from pandas import *
import pandas as pd

pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('max_colwidth', 500)

from IPython.display import HTML

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4729&siteid=1"

# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

####################################

title_clean = soup.find('title')
print(title_clean)

image_links = [x['data-img'] for x in soup.find_all('a', rel='popover')]
for link in image_links:
    print(link)

image_links_0 = image_links[0]
print(image_links_0)

mytags = []
tags = soup.find_all('td', width='41%')
for tag in tags:
    image_text = tag.find('h5').text
    mytags.append(image_text)
    print(image_text)

for i in range(len(mytags)):
    mytags[i]

mytags_0 = mytags[0]
image_links_0 = image_links[0]

#df = DataFrame({'foo1' : 'test',
df = DataFrame({'foo1': '<img src="' + image_links_0 + '"/><p>' + mytags_0 + '</p>',
                'foo2': '' + mytags_0 + '',
                'foo3': mytags_0,
                'foo4': np.random.randn(2)})

print(df)

HTML(df.to_html('filename.html', escape=False))

print(tag)
This is the correct way to do it.
If you need any help with storing it and making HTML out of it, I'll be happy to provide a solution for that as well. Take care!
Update: everything is included: comments, scraping, writing to a file, and creating tags with BeautifulSoup.
from bs4 import BeautifulSoup
import requests

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4729&siteid=1"

# add header
mozila_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64)\
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
headers = {'User-Agent': mozila_agent}

r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

############################################################

the_whole_table = soup.find('table', width='97%')

datalist = []

for tr in the_whole_table.find_all('tr')[1:]:
    # you want to start from the 1st item not the 0th, so [1:]
    # because the first is the thead, i.e. Lot no, Picture, Lot Title...
    index_num = tr.find('td', width='8%')
    picture_link = index_num.next_sibling.a['data-img']
    text_info = tr.find('td', width='41%')
    current_bid = tr.find('td', width='13%')
    time_left = tr.find('td', width='19%')
    datalist.append([index_num.text, picture_link,
                     text_info.text, current_bid.text, time_left.text])
    # for pic do ... print(picture_link) as for partial text only first 20
    # characters

df = ['Index Number', 'Picture', 'Informational text',
      'Current BID', 'Time Left now']

theads = BeautifulSoup('<table border="1"></table>', 'lxml')
thekeys = BeautifulSoup('<thead></thead>', 'html.parser')

for i in df:
    tag = theads.new_tag('th')
    tag.append(i)
    thekeys.thead.append(tag)

theads.table.append(thekeys)

###############################################################
# The code above will initiate a table;
# after that, the for loop will create and populate the first row (thead)

for i in datalist:
    thedata = BeautifulSoup('<tr></tr>', 'html.parser')
    # we loop through the data we collected
    for j in i:
        if j.startswith('https'):
            img_tag = theads.new_tag('img', src=j, height='50', width='50')
            td_tag = theads.new_tag('td')
            td_tag.append(img_tag)
            thedata.tr.append(td_tag)
        else:
            tag = theads.new_tag('td')
            tag.append(j)
            thedata.tr.append(tag)
    theads.table.append(thedata)

with open('asdf.html', 'w+') as f:
    f.write(theads.prettify())

# each of these, if you print them, gives you information that you can store
# we use `.prettify()` as we can't write a BeautifulSoup object into a file.
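For comparison, a pandas-based sketch closer to the question's original approach (it reuses the datalist built in the loop above; the column names and output file name are just illustrative):

import pandas as pd

# Illustrative column names; `datalist` rows come from the scraping loop above.
columns = ['Index Number', 'Picture', 'Informational text', 'Current BID', 'Time Left now']
rows = [[idx, '<img src="{}" height="50" width="50"/>'.format(pic), text, bid, left]
        for idx, pic, text, bid, left in datalist]

table_df = pd.DataFrame(rows, columns=columns)
# escape=False keeps the <img> tags as real HTML; index=False drops the row numbers.
table_df.to_html('asdf_pandas.html', escape=False, index=False)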

Python Scraper - Request Post Function Not Returning Correct Page

I am working on my first website scraper and have run into another issue. Below is my code. The website that is returned is the main page, not the specific page for the parcel number I searched.
Am I using the wrong HTML class to identify the search function? Or is there something missing in the Python code? Any help would be much appreciated.
from bs4 import BeautifulSoup
import requests

web_page = 'https://mcassessor.maricopa.gov/index.php'
web_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
payload = {'homeSearchField': '10218779'}
response = requests.post(web_page, data=payload, headers=web_header)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
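One way to narrow this down (a debugging sketch, not a confirmed fix; the form's real action URL and field names have to be read from the live page rather than assumed) is to fetch the page first, list every form's action and input names, and then POST to the action URL you find with the field name the form actually uses:

from bs4 import BeautifulSoup
import requests

web_page = 'https://mcassessor.maricopa.gov/index.php'
web_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

page = requests.get(web_page, headers=web_header)
soup = BeautifulSoup(page.content, 'html.parser')

# Print every form's submit target and its input names. If nothing useful shows
# up, the search is likely driven by JavaScript/XHR, and the browser's network
# tab will show the real request to reproduce.
for form in soup.find_all('form'):
    print('action:', form.get('action'), 'method:', form.get('method'))
    for field in form.find_all(['input', 'select']):
        print('  field name:', field.get('name'))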