Hi
I'm trying to scrape some data from a website. The data is displayed in a chart (currency prices over several years).
I was able to find the XHR request and the API link for the JSON data, but when I open the response (in the network tab or in a new tab), the data is not completely displayed, even though the chart renders all of it.
The API link:
I searched for the problem and found a post which says that dev-tools truncates long network responses. I tried the suggested solution, but the same problem still happens.
I also tried to use wget to download the response, but that didn't help; the same issue appeared.
I'm opening the link in a separate tab in the Brave browser (I also tried Firefox).
I don't know what the problem is.
Can you please help me?
You can scrape that API endpoint (with Python) like below:
import requests
import pandas as pd
pd.set_option('display.max_columns', None, 'display.max_colwidth', None)
headers = {
    'content-type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
r = requests.get('https://sy-exchange-rates-iwi3arxhhq-uc.a.run.app/api/rates?from=Thu,%2031%20Mar%202011%2021:00:00%20GMT&to=Mon,%2023%20Jan%202023%2002:03:23%20GMT&name=USD&source=liranews.info,sp-today.com,dei-sy.com&city=damascus', headers=headers)
df = pd.json_normalize(r.json())
print(df)
Result in terminal:
timestamp source city name buy sell
0 2021-01-03T07:00:00Z sp-today.com damascus USD 2855 2880
1 2021-01-03T07:00:00Z liranews.info damascus USD 2855 2880
2 2021-01-03T07:00:00Z dei-sy.com damascus USD 2845 2855
3 2021-01-03T08:00:00Z sp-today.com damascus USD 2855 2880
4 2021-01-03T08:00:00Z liranews.info damascus USD 2855 2880
... ... ... ... ... ... ...
50772 2023-01-22T22:00:00Z liranews.info damascus USD 6625 6685
50773 2023-01-22T23:00:00Z sp-today.com damascus USD 6625 6685
50774 2023-01-22T23:00:00Z liranews.info damascus USD 6625 6685
50775 2023-01-23T01:00:00Z sp-today.com damascus USD 6625 6685
50776 2023-01-23T01:00:00Z liranews.info damascus USD 6625 6685
50777 rows × 6 columns
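If you just want to inspect the full response without the dev-tools truncation, you could also dump it to a file (a minimal sketch, reusing r and df from the code above; the file names are arbitrary):
import json

# write the complete API response to disk so nothing is truncated
with open('rates.json', 'w', encoding='utf-8') as f:
    json.dump(r.json(), f, indent=2)

# or keep the tabular version for later analysis
df.to_csv('rates.csv', index=False)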
I am facing an issue while trying to scrape information from a website using the requests.get method. The information I receive from the website is inconsistent and doesn't match the actual data displayed on the website.
As an example, I have tried to scrape the size of an apartment located at the following link: https://www.sreality.cz/en/detail/sale/flat/2+kt/havlickuv-brod-havlickuv-brod-stromovka/3574729052. The size of the apartment is displayed as 54 square meters on the website, but when I use the requests.get method, the result shows 43 square meters instead of 54.
Apartment size on the webpage
Apartment size from the inspect code
Result in vscode
I have attached screenshots of the apartment size displayed on the website and the result in my Visual Studio Code for reference. The code I used for this is given below:
import requests
test = requests.get("https://www.sreality.cz/api/cs/v2/estates/3574729052?tms=1676140494143").json()
test["items"][8]
I am unable to find a solution to this issue and would greatly appreciate any help or guidance. If there is anything wrong with the format of my post, please let me know and I will make the necessary changes. Thank you in advance.
Here is one way to get the information you're after:
import requests
import pandas as pd
pd.set_option('display.max_columns', None, 'display.max_colwidth', None)
headers = {
    'accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
url = 'https://www.sreality.cz/en/detail/sale/flat/2+kt/havlickuv-brod-havlickuv-brod-stromovka/3574729052'
property_id = url.split('/')[-1]
api_url = f'https://www.sreality.cz/api/en/v2/estates/{property_id}'
s.get(url)  # visit the listing page first so the session picks up any cookies the API might expect
df = pd.json_normalize(s.get(api_url).json()['items'])
df = df[['name', 'value']]
print(df)
Result in terminal:
name value
0 Total price 3 905 742
1 Update Yesterday
2 ID 3574729052
3 Building Brick
4 Property status Under construction
5 Ownership Personal
6 Property location Quiet part of municipality
7 Floor 3. floor of total 5 including 1 underground
8 Usable area 54
9 Balcony 4
10 Cellar 2
11 Sales commencement date 25.04.2022
12 Water [{'name': 'Water', 'value': 'District water supply'}]
13 Electricity [{'name': 'Electricity', 'value': '230 V'}]
14 Transportation [{'name': 'Transportation', 'value': 'Train'}, {'name': 'Transportation', 'value': 'Road'}, {'name': 'Transportation', 'value': 'Urban public transportation'}, {'name': 'Transportation', 'value': 'Bus'}]
15 Road [{'name': 'Road', 'value': 'Asphalt'}]
16 Barrier-free access True
17 Furnished False
18 Elevator True
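If you only need the apartment size, you can filter that dataframe (a small sketch based on the result above):
usable_area = df.loc[df['name'] == 'Usable area', 'value'].iloc[0]
print(usable_area)  # 54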
The file I am requesting is JSON with padding (JSONP). I have written some simple code to remove the padding, but by stringing together multiple JSON strings the result is not correctly formatted and I get root-element errors.
I am checking the output of the Python program with an online JSON formatter and validator website. I am a learner, so please bear with my inexperience. All help is appreciated.
import json
import re
import requests
payload = {}
headers = {}
for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    response = requests.request("GET", url, headers=headers, data=payload)

    strip = 'sectionaltimes_callback'
    string = response.text
    repl = ''
    result = re.sub(strip, repl, string)
    print(result)
This is one way of obtaining the data you're looking for:
import requests
import json
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
           'Accept-Language': 'en-US,en;q=0.5'}

for race in range(1, 3):
    url = f"https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-{race}.json?callback=sectionaltimes_callback"
    r = requests.get(url, headers=headers)
    # strip the 'sectionaltimes_callback(' prefix and the trailing ')' to get valid JSON
    json_obj = json.loads(r.text.split('sectionaltimes_callback(')[1].rsplit(')', 1)[0])
    df = pd.DataFrame(json_obj['Horses'])
    print(df)
This returns (prints out in the terminal) a dataframe for each race:
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Resumes. Showed pace to lead well off the rail... 1 1st Crossing the Abbey 2 /horses/crossing-the-abbey //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Hughes /trainers/tim-hughes C.Williams ... 32.84 11.43 57.4 68.2 65.3 68.9 400m 63.3 0.8
1 Same sire as Katy's Daughter out of dual stake... 2 2nd Khulaasa 5 /horses/khulaasa //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Oliver ... 0 32.61 11.29 56.6 68.4 66.0 69.2 700m 63.4 1.2
2 Trialled nicely before pleasing debut in what ... 3 3rd Graceful Star 4 /horses/graceful-star //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes A.Mallyon ... 0 33.10 11.56 56.9 67.4 64.8 68.5 400m 62.8 4.4
3 Sat second at debut, hampered at the 700m then... 4 4th Carnina 1 /horses/carnina //s3-ap-southeast-2.amazonaws.com/racevic.silk... T.Busuttin & N.Young /trainers/trent-busuttin B.Mertens ... +1 33.30 11.80 56.9 68.2 63.9 68.9 400m 62.7 3.0
4 $75k yearling by a Magic Millions winner out o... 5 5th Mirette 7 /horses/mirette //s3-ap-southeast-2.amazonaws.com/racevic.silk... A.Alexander /trainers/archie-alexander J.Childs ... 0 33.53 11.89 57.0 67.9 63.5 68.5 700m 62.5 3.8
5 $95k yearling by same sire as Pinot out of a s... 6 6th Dark Confidant 3 /horses/dark-confidant //s3-ap-southeast-2.amazonaws.com/racevic.silk... D. & B.Hayes & T.Dabernig /trainers/david-hayes D.Dunn ... +2 33.74 11.91 56.4 67.1 63.3 68.8 700m 61.9 5.0
6 Same sire as Vega Magic out of imported stakes... 7 7th La Celestina 6 /horses/la-celestina //s3-ap-southeast-2.amazonaws.com/racevic.silk... D.R.Brideoake /trainers/david-brideoake D.M.Lane ... +1 34.46 12.27 57.5 67.3 61.4 68.2 700m 61.7 0.8
7 rows × 29 columns
Comment FinalPosition FinalPositionAbbreviation FullName SaddleNumber HorseUrl SilkUrl Trainer TrainerUrl Jockey ... DistanceVarToWinner SixHundredMetresTime TwoHundredMetresTime Early Mid Late OverallPeakSpeed PeakSpeedLocation OverallAvgSpeed DistanceFromRail
0 Game in defeat both runs this campaign. Better... 1 1st Wise Hero 2 /horses/wise-hero //s3-ap-southeast-2.amazonaws.com/racevic.silk... J.W.Price /trainers/john-price S.M.Thornton ... 33.13 11.43 55.4 62.7 65.5 68.2 300m 61.7 0.7
1 Two runs since racing wide over this trip at C... 2 2nd Just Hifalutin 5 /horses/just-hifalutin //s3-ap-southeast-2.amazonaws.com/racevic.silk... E.Jusufovic /trainers/enver-jusufovic L.Currie ... +3 32.75 11.37 53.1 63.8 65.8 68.5 400m 61.7 3.3
2 Did a bit of early work at Seymour and was not... 3 3rd King Kohei 10 /horses/king-kohei //s3-ap-southeast-2.amazonaws.com/racevic.silk... Michael & Luke Cerchi /trainers/mick-cerchi
[...]
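If you prefer to keep the raw JSON rather than build a dataframe, one alternative sketch (sticking with the re module from the original attempt) is to strip the whole callback wrapper in one go and then parse it:
import re
import json
import requests

url = "https://s3-ap-southeast-2.amazonaws.com/racevic.static/2018-01-01/flemington/sectionaltimes/race-1.json?callback=sectionaltimes_callback"
r = requests.get(url)

# remove the 'sectionaltimes_callback(' prefix and the trailing ')' (plus optional ';'), leaving plain JSON
m = re.fullmatch(r'\s*sectionaltimes_callback\((.*)\)\s*;?\s*', r.text, flags=re.S)
data = json.loads(m.group(1))
print(list(data.keys()))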
I'm trying to learn how to scrape components from a website, specifically this one: https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load
Following guidance from the internet, I collected several important elements, such as the class
"article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible"
and HTML elements like th and tb, and tried to get their specific content using this code:
import requests
from bs4 import BeautifulSoup
URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
teapot_loads = results.find_all("table", class_="article-table sortable mw-collapsible jquery-tablesorter mw-made-collapsible")
for teapot_loads in teapot_loads:
    table_head_element = teapot_loads.find("th", class_="headerSort")
    print(table_head_element)
    print()
I seem to have written the correct element (th) and the correct class name ("headerSort"), but the program doesn't return anything, even though it raises no errors. What did I do wrong?
You can debug your code to see what went wrong, and where. One such debugging effort is below: we keep only one class in the table search, and then print out the full class list of the elements actually found:
import requests
from bs4 import BeautifulSoup
URL = "https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load"
page = requests.get(URL)
#print(page.text)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="mw-content-text")
# print(results)
teapot_loads = results.find_all("table", class_="article-table")
for teapot_load in teapot_loads:
    print(teapot_load.get_attribute_list('class'))
    table_head_element = teapot_load.find("th", class_="headerSort")
    print(table_head_element)
This will print out (besides the element you want) the table's class list as seen by requests/BeautifulSoup: ['article-table', 'sortable', 'mw-collapsible']. After the original HTML (with those original classes) loads in the page, the JavaScript on the page kicks in and adds new classes to the table. Since you are searching for elements by those dynamically added classes, your search finds nothing.
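You can confirm this by checking the raw HTML that requests receives (a quick check reusing page from the snippet above):
print('article-table' in page.text)        # True  - present in the static HTML
print('jquery-tablesorter' in page.text)   # False - added later by the page's JavaScript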
Nonetheless, here is a more elegant way of obtaining that table:
import pandas as pd
url = 'https://genshin-impact.fandom.com/wiki/Serenitea_Pot/Load'
dfs = pd.read_html(url)
print(dfs[1])
This will return a dataframe with that table:
    Image                                             Name  Adeptal Energy  Load  ReducedLoad  Ratio
0     NaN          "A Bloatty Floatty's Dream of the Sky"               60    65           47   0.92
1     NaN                    "A Guide in the Summer Woods"              60    35           24   1.71
2     NaN                "A Messenger in the Summer Woods"              60    35           24   1.71
3     NaN  "A Portrait of Paimon, the Greatest Companion"              90    35           24   2.57
4     NaN                       "A Seat in the Wilderness"              20    50           50   0.40
5     NaN                      "Ballad-Spinning Windwheel"              90   185          185   0.49
6     NaN                             "Between Nine Steps"              30   550          550   0.05
[...]
Documentation for bs4 (BeautifulSoup) can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Also, docs for pandas.read_html: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
I have pulled data from an API and I'm looping through everything to find key/value pairs that contain a URL. I am creating a separate list of these URLs; what I need to do is follow each link, grab the contents of the page (it will just be a paragraph of text), pull those contents back into the array/list, and of course loop through the remaining URLs. Do I need to use Selenium or BS4, and how do I loop through the URLs and pull the page contents into my array/list?
json looks like this:
{
    "merchandiseData": [
        {
            "clientID": 3003,
            "name": "Yasir Carter",
            "phone": "(758) 564-5345",
            "email": "leo.vivamus#pedenec.net",
            "address": "P.O. Box 881, 2723 Elementum, St.",
            "postalZip": "DX2I 2LD",
            "numberrange": 10,
            "name1": "Harlan Mccarty",
            "constant": ".com",
            "text": "deserunt",
            "url": "https://www.deserunt.com",
            "text": "https://www."
        },
    ]
}
Code thus far:
import requests
import json
import pandas as pd
import sqlalchemy as sq
import time
from datetime import datetime, timedelta
from flatten_json import flatten

# read file
with open('_files/TestFile2.json', 'r') as f:
    file_contents = json.load(f)

allThis = []
for x in file_contents['merchandiseData']:
    holdAllThis = {
        'client_id': x['clientID'],
        'client_description_link': x['url']
    }
    allThis.append(holdAllThis)
    print(holdAllThis['client_id'], holdAllThis['client_description_link'])

print(allThis)
Maybe using the JSON posted at https://github.com/webdevr712/python_follow_links and pandas:
import pandas as pd
import requests
# function mostly borrowed from https://stackoverflow.com/a/24519419/9192284
def site_response(link):
    try:
        r = requests.get(link, headers=headers)
        # Consider any status other than 2xx an error
        if not r.status_code // 100 == 2:
            return "Error: {}".format(r)
        return r.reason
    except requests.exceptions.RequestException as e:
        # A serious problem happened, like an SSLError or InvalidURL
        return "Error: {}".format(e)

# set headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}
# url for downloading the json file
url = 'https://raw.githubusercontent.com/webdevr712/python_follow_links/main/merchData.json'
# get the json into a dataframe
df = pd.read_json(url)
df = pd.DataFrame(df['merchandiseData'].values.tolist())
# new column to store the response from running the site_response() function for each string in the 'url' column
df['site_response'] = df.apply(lambda x: site_response(x['url']), axis=1)
# print('OK responses:')
# print(df[df['site_response'].str.contains('OK')])
# output
print('\n\nAll responses:')
print(df[['url', 'site_response']])
Output:
All responses:
url site_response
0 https://www.deserunt.com Error: HTTPSConnectionPool(host='www.deserunt....
1 https://www.aliquip.com Error: HTTPSConnectionPool(host='www.aliquip.c...
2 https://www.sed.net Error: <Response [406]>
3 https://www.ad.net OK
4 https://www.Excepteur.edu Error: HTTPSConnectionPool(host='www.excepteur...
Full frame output:
clientID name phone \
0 3003 Yasir Carter (758) 564-5345
1 3103 Elaine Mccullough 1-265-168-1287
2 3203 Vanna Elliott (113) 485-7272
3 3303 Adrienne Holden 1-146-431-3745
4 3403 Freya Vang (858) 195-4886
email \
0 leo.vivamus#pedenec.net
1 sodales#enimcondimentum.net
2 elit.a#dui.org
3 lacus.quisque#magnapraesentinterdum.co.uk
4 diam.dictum#velmauris.net
address postalZip numberrange \
0 P.O. Box 881, 2723 Elementum, St. DX2I 2LD 10
1 7529 Dui. St. 24768-76452 9
2 Ap #368-6127 Lacinia Av. 6200 5
3 Ap #522-3209 Euismod St. 66746 3
4 P.O. Box 159, 416 Dui Ave 158425 4
name1 constant text url \
0 Harlan Mccarty .com https://www. https://www.deserunt.com
1 Kaseem Petersen .com https://www. https://www.aliquip.com
2 Kennan Holloway .net https://www. https://www.sed.net
3 Octavia Lambert .net https://www. https://www.ad.net
4 Kitra Maynard .edu https://www. https://www.Excepteur.edu
site_response
0 Error: HTTPSConnectionPool(host='www.deserunt....
1 Error: HTTPSConnectionPool(host='www.aliquip.c...
2 Error: <Response [406]>
3 OK
4 Error: HTTPSConnectionPool(host='www.excepteur...
From there you can move on to scraping each site that returns 'OK', using Selenium (if required; you could check that with another function) or BS4, etc.
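For example, here is a rough sketch of that follow-up step with requests and BeautifulSoup, reusing headers and df from the code above (the page_text helper is just illustrative, and it assumes you only want the visible text of each page that answered 'OK'):
from bs4 import BeautifulSoup

def page_text(link):
    try:
        r = requests.get(link, headers=headers, timeout=10)
        r.raise_for_status()
        # return the visible text of the page as a single string
        return BeautifulSoup(r.text, 'html.parser').get_text(' ', strip=True)
    except requests.exceptions.RequestException as e:
        return "Error: {}".format(e)

# only fetch content for the rows that responded OK
ok = df[df['site_response'] == 'OK'].copy()
ok['page_content'] = ok['url'].apply(page_text)
print(ok[['url', 'page_content']])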
I am using find_all from the Beautiful Soup library to parse the HTML text.
Code:
from requests import get
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
response = get(URL, headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
len(html_soup)
This is returning only 20 items even though the page shows 250 results. What am I doing wrong here?
Try this (it takes all 291 results):
from selenium import webdriver
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
driver.get(URL)
driver.maximize_window()
PAUSE_TIME = 2
lh = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh

articles = driver.find_elements_by_css_selector('.css-h7k7mr')
for article in articles:
    print(article.text)
    print('-' * 80)
driver.close()
prints:
₹45.11 L
EMI starts at ₹28.13 K
3 BHK Apartment
Bachupally, Nizampet, Hyderabad
Build Up Area
1556 sq.ft
Avg. Price
₹2.90 K/sq.ft
Special Highlights
24x7 Security
Badminton Court
Cycling & Jogging Track
Gated Community
3 BHK Apartment available for sale in Bachapally,hyderabad,beside Mama Medical College, Nizampet, Hyderabad. Available amenities are: Gym, Swimming pool, Garden, Kids area, Sports facility, Lift. Apartment has 3 bedroom, 2 bathroom.
Read more
M Srikanth
Housing Prime Agent
Contact
--------------------------------------------------------------------------------
₹37.96 L - 62.05 L
EMI starts at ₹23.67 K
Bhuvanteza Evk Aura
Marketed by Sri Avani Infra Projects
Kollur, Hyderabad
Configurations
2, 3 BHK Apartments
Possession Starts
Nov, 2022
Avg. Price
₹3.65 K/sq.ft
Real estate developer Bhuvanteza Infrastructures has launched prime housing project Evk Aura in Kollur, Hyderabad. The project is offering beautiful and comfortable 2 and 3 BHK apartments for sale. Built-up area for 2 BHK apartments is in the range of 1040 to 1185 sq ft. and for 3 BHK apartments it is 1700 sq ft. Amenities which are required for a comfortable living will be available in the complex, they are car parking, club house, swimming pool, children play area, power backup and others. Developer Bhuvanteza Infrastructures can be contacted for owning an apartment in Evk Aura. Kollur is a ...
Read more
SA
Sri Avani Infra Projects
Seller
Contact
--------------------------------------------------------------------------------
and so on....
Note on Selenium: you need selenium and geckodriver, and in this code geckodriver is expected to be at c:/program/geckodriver.exe.
You're not reading it right: there are 250 results in total, but only 20 are shown in the initial HTML, and that's why you get 20 in Python.
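If you want to see that lazy loading in action, here is a small sketch (reusing the Selenium setup and the '.css-h7k7mr' card selector from the answer above; the exact counts may vary):
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get("https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t")

# cards present in the initially rendered page - roughly the 20 you are seeing
print(len(driver.find_elements_by_css_selector('.css-h7k7mr')))

# scroll to the bottom a few times so the remaining results are loaded
for _ in range(20):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# cards present after scrolling - should approach the full result count
print(len(driver.find_elements_by_css_selector('.css-h7k7mr')))
driver.quit()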