requests.get changes the content of the website? (Webscraping) - json

I am facing an issue while trying to scrape information from a website using the requests.get method. The information I receive from the website is inconsistent and doesn't match the actual data displayed on the website.
As an example, I have tried to scrape the size of an apartment located at the following link: https://www.sreality.cz/en/detail/sale/flat/2+kt/havlickuv-brod-havlickuv-brod-stromovka/3574729052. The size of the apartment is displayed as 54 square meters on the website, but when I use the requests.get method, the result shows 43 square meters instead of 54.
[Screenshots: apartment size on the webpage, the same value in the page inspector, and the result in VS Code]
I have attached screenshots of the apartment size displayed on the website and the result in my Visual Studio Code for reference. The code I used for this is given below:
import requests
test = requests.get("https://www.sreality.cz/api/cs/v2/estates/3574729052?tms=1676140494143").json()
test["items"][8]
I am unable to find a solution to this issue and would greatly appreciate any help or guidance. If there is anything wrong with the format of my post, please let me know and I will make the necessary changes. Thank you in advance.

Here is one way to get the information you're after. The mismatch is most likely the site serving slightly altered values to requests that don't look like a real browser session; using a requests.Session with browser-like headers and loading the listing page first (to pick up session cookies) before calling the API returns the values shown on the page:
import requests
import pandas as pd
pd.set_option('display.max_columns', None, 'display.max_colwidth', None)
headers = {
    'accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
url = 'https://www.sreality.cz/en/detail/sale/flat/2+kt/havlickuv-brod-havlickuv-brod-stromovka/3574729052'
property_id = url.split('/')[-1]
api_url = f'https://www.sreality.cz/api/en/v2/estates/{property_id}'
s.get(url)
df = pd.json_normalize(s.get(api_url).json()['items'])
df = df[['name', 'value']]
print(df)
Result in terminal:
name value
0 Total price 3 905 742
1 Update Yesterday
2 ID 3574729052
3 Building Brick
4 Property status Under construction
5 Ownership Personal
6 Property location Quiet part of municipality
7 Floor 3. floor of total 5 including 1 underground
8 Usable area 54
9 Balcony 4
10 Cellar 2
11 Sales commencement date 25.04.2022
12 Water [{'name': 'Water', 'value': 'District water supply'}]
13 Electricity [{'name': 'Electricity', 'value': '230 V'}]
14 Transportation [{'name': 'Transportation', 'value': 'Train'}, {'name': 'Transportation', 'value': 'Road'}, {'name': 'Transportation', 'value': 'Urban public transportation'}, {'name': 'Transportation', 'value': 'Bus'}]
15 Road [{'name': 'Road', 'value': 'Asphalt'}]
16 Barrier-free access True
17 Furnished False
18 Elevator True
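If you only want the floor area itself out of that frame, a small follow-up (using the same df built above, where each row has a 'name' and a 'value' column) could be:
usable_area = df.loc[df['name'] == 'Usable area', 'value'].iloc[0]
print(usable_area)  # 54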

Related

How do I loop through a list of URLs, follow each link, and pull content into an array

I have pulled data from an API and I am looping through everything to find key/value pairs that contain a URL. I am creating a separate list of those URLs; what I need to do is follow each link, grab the contents of that page (it will just be a paragraph of text), pull it back into the array/list, and of course loop through the remaining URLs. Do I need to use Selenium or BS4, and how do I loop through and pull the page contents into my array/list?
The JSON looks like this:
{
  "merchandiseData": [
    {
      "clientID": 3003,
      "name": "Yasir Carter",
      "phone": "(758) 564-5345",
      "email": "leo.vivamus#pedenec.net",
      "address": "P.O. Box 881, 2723 Elementum, St.",
      "postalZip": "DX2I 2LD",
      "numberrange": 10,
      "name1": "Harlan Mccarty",
      "constant": ".com",
      "text": "deserunt",
      "url": "https://www.deserunt.com",
      "text": "https://www."
    },
  ]
}
Code thus far:
import requests
import json
import pandas as pd
import sqlalchemy as sq
import time
from datetime import datetime, timedelta
from flatten_json import flatten

# read file
with open('_files/TestFile2.json', 'r') as f:
    file_contents = json.load(f)

allThis = []
for x in file_contents['merchandiseData']:
    holdAllThis = {
        'client_id': x['clientID'],
        'client_description_link': x['url']
    }
    allThis.append(holdAllThis)
    print(holdAllThis['client_id'], holdAllThis['client_description_link'])
print(allThis)
Maybe try using the JSON posted at https://github.com/webdevr712/python_follow_links together with pandas:
import pandas as pd
import requests

# function mostly borrowed from https://stackoverflow.com/a/24519419/9192284
def site_response(link):
    try:
        r = requests.get(link, headers=headers)
        # Consider any status other than 2xx an error
        if not r.status_code // 100 == 2:
            return "Error: {}".format(r)
        return r.reason
    except requests.exceptions.RequestException as e:
        # A serious problem happened, like an SSLError or InvalidURL
        return "Error: {}".format(e)

# set headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}

# url for downloading the json file
url = 'https://raw.githubusercontent.com/webdevr712/python_follow_links/main/merchData.json'

# get the json into a dataframe
df = pd.read_json(url)
df = pd.DataFrame(df['merchandiseData'].values.tolist())

# new column to store the response from running the site_response() function for each string in the 'url' column
df['site_response'] = df.apply(lambda x: site_response(x['url']), axis=1)

# print('OK responses:')
# print(df[df['site_response'].str.contains('OK')])

# output
print('\n\nAll responses:')
print(df[['url', 'site_response']])
Output:
All responses:
url site_response
0 https://www.deserunt.com Error: HTTPSConnectionPool(host='www.deserunt....
1 https://www.aliquip.com Error: HTTPSConnectionPool(host='www.aliquip.c...
2 https://www.sed.net Error: <Response [406]>
3 https://www.ad.net OK
4 https://www.Excepteur.edu Error: HTTPSConnectionPool(host='www.excepteur...
Full frame output:
clientID name phone \
0 3003 Yasir Carter (758) 564-5345
1 3103 Elaine Mccullough 1-265-168-1287
2 3203 Vanna Elliott (113) 485-7272
3 3303 Adrienne Holden 1-146-431-3745
4 3403 Freya Vang (858) 195-4886
email \
0 leo.vivamus#pedenec.net
1 sodales#enimcondimentum.net
2 elit.a#dui.org
3 lacus.quisque#magnapraesentinterdum.co.uk
4 diam.dictum#velmauris.net
address postalZip numberrange \
0 P.O. Box 881, 2723 Elementum, St. DX2I 2LD 10
1 7529 Dui. St. 24768-76452 9
2 Ap #368-6127 Lacinia Av. 6200 5
3 Ap #522-3209 Euismod St. 66746 3
4 P.O. Box 159, 416 Dui Ave 158425 4
name1 constant text url \
0 Harlan Mccarty .com https://www. https://www.deserunt.com
1 Kaseem Petersen .com https://www. https://www.aliquip.com
2 Kennan Holloway .net https://www. https://www.sed.net
3 Octavia Lambert .net https://www. https://www.ad.net
4 Kitra Maynard .edu https://www. https://www.Excepteur.edu
site_response
0 Error: HTTPSConnectionPool(host='www.deserunt....
1 Error: HTTPSConnectionPool(host='www.aliquip.c...
2 Error: <Response [406]>
3 OK
4 Error: HTTPSConnectionPool(host='www.excepteur...
From there you can move on to scraping each site that returns 'OK' and use Selenium (if required - you could check with another function) or BS4 etc.
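As a rough sketch of that next step with requests and BeautifulSoup (assuming the pages are plain HTML and that the first paragraph of text is what you want; the tag choice here is only illustrative):
from bs4 import BeautifulSoup

page_texts = []  # the list/array of page contents the question asks for
for link in df.loc[df['site_response'] == 'OK', 'url']:
    resp = requests.get(link, headers=headers)
    soup = BeautifulSoup(resp.text, 'html.parser')
    first_p = soup.find('p')  # grab the first paragraph, if there is one
    page_texts.append(first_p.get_text(strip=True) if first_p else '')
print(page_texts)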

Beautiful Soup - find_all function is returning only 20 items from the page. The actual results are around 250

I am using find_all from the Beautiful Soup library to parse the HTML text.
Code:
from requests import get
from bs4 import BeautifulSoup

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
response = get(URL, headers=headers)
html_soup = BeautifulSoup(response.text, 'lxml')
len(html_soup)
This is returning only 20 items even though the page shows 250 results. What am I doing wrong here?
Try this (it takes all 291 results):
from selenium import webdriver
import time

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
URL = "https://housing.com/in/buy/searches/M1Pmp1mc1ak4wflhbs_735yq6kvim3c7hqz_3g8uxzo18sqqdcuwU2yr9t"
driver.get(URL)
driver.maximize_window()

PAUSE_TIME = 2
lh = driver.execute_script("return document.body.scrollHeight")

# keep scrolling to the bottom until the page height stops growing
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(PAUSE_TIME)
    nh = driver.execute_script("return document.body.scrollHeight")
    if nh == lh:
        break
    lh = nh

articles = driver.find_elements_by_css_selector('.css-h7k7mr')
for article in articles:
    print(article.text)
    print('-' * 80)

driver.close()
prints:
₹45.11 L
EMI starts at ₹28.13 K
3 BHK Apartment
Bachupally, Nizampet, Hyderabad
Build Up Area
1556 sq.ft
Avg. Price
₹2.90 K/sq.ft
Special Highlights
24x7 Security
Badminton Court
Cycling & Jogging Track
Gated Community
3 BHK Apartment available for sale in Bachapally,hyderabad,beside Mama Medical College, Nizampet, Hyderabad. Available amenities are: Gym, Swimming pool, Garden, Kids area, Sports facility, Lift. Apartment has 3 bedroom, 2 bathroom.
Read more
M Srikanth
Housing Prime Agent
Contact
--------------------------------------------------------------------------------
₹37.96 L - 62.05 L
EMI starts at ₹23.67 K
Bhuvanteza Evk Aura
Marketed by Sri Avani Infra Projects
Kollur, Hyderabad
Configurations
2, 3 BHK Apartments
Possession Starts
Nov, 2022
Avg. Price
₹3.65 K/sq.ft
Real estate developer Bhuvanteza Infrastructures has launched prime housing project Evk Aura in Kollur, Hyderabad. The project is offering beautiful and comfortable 2 and 3 BHK apartments for sale. Built-up area for 2 BHK apartments is in the range of 1040 to 1185 sq ft. and for 3 BHK apartments it is 1700 sq ft. Amenities which are required for a comfortable living will be available in the complex, they are car parking, club house, swimming pool, children play area, power backup and others. Developer Bhuvanteza Infrastructures can be contacted for owning an apartment in Evk Aura. Kollur is a ...
Read more
SA
Sri Avani Infra Projects
Seller
Contact
--------------------------------------------------------------------------------
and so on....
Note on Selenium: you need Selenium and geckodriver installed, and in this code geckodriver is expected to be at c:/program/geckodriver.exe.
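Also note that on Selenium 4 and newer the find_elements_by_* helpers are deprecated and have since been removed; the equivalent selector call (a small adjustment to the code above) would be:
from selenium.webdriver.common.by import By

articles = driver.find_elements(By.CSS_SELECTOR, '.css-h7k7mr')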
You're not reading it right: there are 250 results in total, but only 20 are shown on the initial page load, which is why you get 20 in Python.

Webscrape using BeautifulSoup to Dataframe

This is the HTML code:
<div class="wp-block-atomic-blocks-ab-accordion ab-block-accordion ab-font-size-18"><details><summary class="ab-accordion-title"><strong>American Samoa</strong></summary><div class="ab-accordion-text">
<ul><li><strong>American Samoa Department of Health Travel Advisory</strong></li><li>March 2, 2020—Governor Moliga <a rel="noreferrer noopener" href="https://www.rnz.co.nz/international/pacific-news/410783/american-samoa-establishes-govt-taskforce-to-plan-for-coronavirus" target="_blank">appointed</a> a government taskforce to provide a plan for preparation and response to the covid-19 coronavirus. </li></ul>
<ul><li>March 25, 2020 – The Governor issued an Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health. The order requires the immediate and comprehensive enforcement by the Commissioner of Public Safety, Director of Health, Attorney General, and other agency leaders.
<ul>
<li>Business are also required to provide necessary supplies to the public and are prohibited from price gouging.</li>
</ul>
</li></ul>
</div></details></div>
I want to extract the state, date, and text, and add them to a dataframe with these three columns:
State: American Samoa
Date: 2020-03-25
Text: The Governor Executive Order 001 recognizing the Declared Public Health Emergency and State of Emergency, and imminent threat to public health
My code so far:
soup = bs4.BeautifulSoup(data)
for tag in soup.find_all("summary"):
    print("{0}: {1}".format(tag.name, tag.text))
for tag1 in soup.find_all("li"):
    #print(type(tag1))
    ln = tag1.text
    dt = (ln.split(' – ')[0])
    dt = (dt.split('—')[0])
    #txt = ln.split(' – ')[1]
    print(dt)
Need help with:
1. How do I get the text only up to a "."? I don't need the entire text.
2. How do I add a new row to the dataframe as I loop through? (I have only attached part of the source code of the webpage.)
Appreciate your help!
As a start I have added the code below. Unfortunately the web page is not uniform in its use of HTML lists: some ul elements contain nested uls, others don't. This code is not perfect, but it is a starting point; for example, American Samoa has an absolute mess of nested ul elements, so it only appears once in the df.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'lxml')

rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    ul = detail.find('ul')
    for li in ul.find_all('li', recursive=False):
        # Three types of hyphen are used on this webpage
        split = re.split('(?:-|–|—)', li.text, 1)
        if len(split) == 2:
            rows_list.append([state.text, split[0], split[1]])
        else:
            print("Error", li.text)

df = pd.DataFrame(rows_list)
with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', -1):
    print(df)
It creates and prints a data frame with 547 rows and prints some error messages for text it can not split. You will have to work out exactly which data you need and how to tweak the code to suit your purpose.
You can use 'html.parser' if you don't have 'lxml' installed.
UPDATED
Another approach is to use regex to match any string beginning with a date:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
}
# You need to specify User Agent headers or else you get a 403
data = requests.get("https://www.nga.org/coronavirus-state-actions-all/", headers=HEADERS).text
soup = BeautifulSoup(data, 'html.parser')

rows_list = []
for detail in soup.find_all("details"):
    state = detail.find('summary')
    for li in detail.find_all('li'):
        p = re.compile(r'(\s*(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s*(\d{1,2}),*\s*(\d{4}))', re.IGNORECASE)
        m = re.match(p, li.text)
        if m:
            rows_list.append([state.text, m.group(0), m.string.replace(m.group(0), '')])
        else:
            print("Error", li.text)

df = pd.DataFrame(rows_list)
df.to_csv('out.csv')
This gives far more records: 4,785. Again, it is a starting point; some data still gets missed, but far less. It writes the data to a csv file, out.csv.
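For the first part of the original question (keeping only the text up to the first "."), one option on top of either dataframe above, assuming the extracted text sits in column 2, is:
# keep only the part of the text column before the first period
df[2] = df[2].str.split('.', n=1).str[0]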

How to retrieve data from JSON

I retrieved a dataset from a news API in JSON format. I want to extract the news description from the JSON data.
This is my code:
import requests
import json
url = ('http://newsapi.org/v2/top-headlines?'
'country=us&'
'apiKey=608bf565c67f4d99994c08d74db82f54')
response = requests.get(url)
di=response.json()
di = json.dumps(di)
for di['articles'] in di:
print(article['title'])
The dataset looks like this:
{'status': 'ok',
'totalResults': 38,
'articles': [
{'source':
{'id': 'the-washington-post',
'name': 'The Washington Post'},
'author': 'Derek Hawkins, Marisa Iati',
'title': 'Coronavirus updates: Texas, Florida and Arizona officials say early reopenings fueled an explosion of cases - The Washington Post',
'description': 'Local officials in states with surging coronavirus cases issued dire warnings Sunday about the spread of infections, saying the virus was rapidly outpacing containment efforts.',
'url': 'https://www.washingtonpost.com/nation/2020/07/05/coronavirus-update-us/',
'urlToImage': 'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/K3UMAKF6OMI6VF6BNTYRN77CNQ.jpg&w=1440',
'publishedAt': '2020-07-05T18:32:44Z',
'content': 'Here are some significant developments:\r\n<ul><li>The rolling seven-day average for daily new cases in the United States reached a record high for the 27th day in a row, climbing to 48,606 on Sunday, … [+5333 chars]'}]}
Please guide me with this!
There are a few corrections needed in your code. The code below should work; note that I have removed the API key in this answer, so make sure you add yours before testing.
import requests
import json

url = ('http://newsapi.org/v2/top-headlines?'
       'country=us&'
       'apiKey=<API KEY>')
response = requests.get(url)
di = response.json()

# You don't need to dump json that is already in json format
# di = json.dumps(di)

# your loop is not correctly defined, below is the correct way to do it
for article in di['articles']:
    print(article['title'])
The response JSON looks like this:
{'status': 'ok',
'totalResults': 38,
'articles': [
{'source':
{'id': 'the-washington-post',
'name': 'The Washington Post'},
'author': 'Derek Hawkins, Marisa Iati',
'title': 'Coronavirus updates: Texas, Florida and Arizona officials say early reopenings fueled an explosion of cases - The Washington Post',
'description': 'Local officials in states with surging coronavirus cases issued dire warnings Sunday about the spread of infections, saying the virus was rapidly outpacing containment efforts.',
'url': 'https://www.washingtonpost.com/nation/2020/07/05/coronavirus-update-us/',
'urlToImage': 'https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/K3UMAKF6OMI6VF6BNTYRN77CNQ.jpg&w=1440',
'publishedAt': '2020-07-05T18:32:44Z',
'content': 'Here are some significant developments:\r\n<ul><li>The rolling seven-day average for daily new cases in the United States reached a record high for the 27th day in a row, climbing to 48,606 on Sunday, … [+5333 chars]'}]}
Code:
di = response.json()  # Understand that 'di' is of type 'dictionary', key-value pairs
for i in di["articles"]:
    print(i["description"])
"articles" is one of the keys of dictionary di, It's corresponding value is of type list. "description" , which you are looking is part of this list (value of "articles"). Further list contains the dictionary (key-value pair).You can access from key - description

Recursively Slicing JSON into Dataframe Columns

I have a dataframe with a column containing JSON, where one record looks like this:
player_feedback
{'player': '1b87a117-09ef-41e2-8710-6bc144760a74',
'feedback': [{'answer': [{'id': '1-6gaincareerinfo', 'content': 'To gain career information'},
{'id': '1-5proveskills', 'content': 'Opportunity to prove skills by competing '},
{'id': '1-1diff', 'content': 'Try something different'}], 'question': 1},
{'answer': [{'id': '2-2skilldev', 'content': 'Skill development'}], 'question': 2},
{'answer': [{'id': '3-6exploit', 'content': 'Exploitation'},
{'id': '3-1forensics', 'content': 'Forensics'}], 'question': 3},
{'answer': 'verygood', 'question': 4},
{'answer': 'poor', 'question': 5}, ... ... ,
{'answer': 'verygood', 'question': 15}]}
Here are the first 5 rows of the data.
I want to convert this column to separate columns like -
player Question 1 Question 2 ... Question 15
1b87a117-09ef-41e2-8710-6bc144760a74 To gain career information, Skill development verygood
Opportunity to prove skills by competing,
Try something different
I started with -
df_survey_responses['player_feedback'].apply(ast.literal_eval).values.tolist()
but that only gets me the player id in a separate field and the feedback in another. As far as I can tell, JSONNormalize would also give me a similar result. How can I do this recursively to get my desired result, or is there a better way to do this?
Thanks!
You can use a JSON flattener like this one:
def flatten_json(nested_json):
    """
    Flatten json object with nested keys into a single level.
    Args:
        nested_json: A nested json object.
    Returns:
        The flattened json object if successful, None otherwise.
    """
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
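For a single record (one cell of your column) you could get a frame like the one below with something along these lines; this assumes the cell is stored as a string, as the ast.literal_eval in your own attempt suggests:
import ast
import pandas as pd

record = ast.literal_eval(df_survey_responses['player_feedback'].iloc[0])
print(pd.DataFrame(flatten_json(record), index=[0]).T)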
Which gives dataframes that look like this:
0
player 34a8eb8a-056f-4568-88dc-8736056819a3
feedback_0_answer_0_id 1-5proveskills
feedback_0_answer_0_content Opportunity to prove skills by competing
feedback_0_question 1
feedback_1_answer_0_id 2-1networking
feedback_1_answer_0_content Networking
feedback_1_answer_1_id 2-2skilldev
feedback_1_answer_1_content Skill development
feedback_1_question 2
feedback_2_answer_0_id 3-5boottoroot
feedback_2_answer_0_content Boot2root
feedback_2_answer_1_id 3-6exploit
feedback_2_answer_1_content Exploitation
feedback_2_question 3
feedback_3_answer good
feedback_3_question 4
feedback_4_answer good
feedback_4_question 5
feedback_5_answer selfchose
feedback_5_question 6
feedback_6_answer pairs
feedback_6_question 7
feedback_7_answer_0_id 7-persistence
feedback_7_answer_0_content Persistence
feedback_7_question 8
feedback_8_answer social
feedback_8_question 9
feedback_9_answer training
feedback_9_question 10
feedback_10_answer yes
feedback_10_question 11
feedback_11_answer yes
feedback_11_question 12
feedback_12_answer yes
feedback_12_question 13
feedback_13_answer yes
feedback_13_question 14
feedback_14_answer verygood
feedback_14_question 15
feedback_15_answer yes
feedback_15_question 16
feedback_16_answer yes
feedback_16_question 17
feedback_17_answer It would be good to have more exploitation one...
feedback_17_question 18
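If the goal is really the wide "Question N" layout from the question rather than the fully flattened keys, a sketch that skips the flattener and builds each row directly (again assuming the cells are strings parsed with ast.literal_eval; adjust if they are already dicts) could look like this:
import ast
import pandas as pd

def feedback_to_row(raw):
    # raw is the string stored in the 'player_feedback' column
    record = ast.literal_eval(raw)
    row = {'player': record['player']}
    for item in record['feedback']:
        answer = item['answer']
        if isinstance(answer, list):
            # multiple-choice answers: join the human-readable contents
            answer = ', '.join(a['content'] for a in answer)
        row['Question {}'.format(item['question'])] = answer
    return row

wide = pd.DataFrame(
    [feedback_to_row(r) for r in df_survey_responses['player_feedback']]
)
print(wide)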