Downloading data from a multi-page website with the requests method in Python - json

I have this API documentation for the website http://json-homework.task-sss.krasilnikov.spb.ru/docs/9f66a575a6cfaaf7e43177317461d057 (which is only in Russian, unfortunately, but I'll try to explain), and I need to import the data about the group members from it. The issue is that the page parameter returns only 5 members, and when you increase the page number it only returns the next 5 members, without adding them to the previous five. Here is my code:
import pandas as pd
import requests as rq
import json
from pandas.io.json import json_normalize
url='http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page=1'
data=rq.get(url)
data1=json.loads(data.text)
data1=json_normalize(json.loads(data.text)["response"])
data1
and here is what my output looks like:
By entering bigger and bigger numbers, I also found out that the last part of the data is on page 41, i.e. I need to get the data from pages 1 to 41. How can I include all the pages in my code? Maybe it is possible with some loop or something like that, I don't know...

According to the API documentation, there is no parameter to control how many users are fetched per page, so you will have to get them 5 at a time, and since there are 41 pages you can just loop through the URLs.
import requests as rq
import json

all_users = []
for page in range(1, 42):
    url = f'http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={page}'
    data = rq.get(url)
    all_users.append(json.loads(data.text)["response"])
The above implementation will of course not check for any API throttling, i.e. the API may return unexpected data if too many requests are made in a very short time, which you can mitigate with some well-placed delays.
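To combine the loop with such a delay and end up with a single DataFrame (as in the json_normalize snippet from the question), a minimal sketch could look like this; the 0.5-second pause is an arbitrary choice, and it assumes "response" holds the list of members on each page, which is what the question's own json_normalize call suggests:
import time

import pandas as pd
import requests as rq

base = ('http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers'
        '?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={}')

all_users = []
for page in range(1, 42):
    resp = rq.get(base.format(page))
    # each page carries the next 5 members in "response"; collect them all in one flat list
    all_users.extend(resp.json()["response"])
    time.sleep(0.5)  # arbitrary delay to stay clear of throttling

df = pd.json_normalize(all_users)  # pd.json_normalize supersedes the deprecated pandas.io.json import
print(df.shape)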

Related

Why is my xpath parser only scraping the first dozen elements in a table?

I'm learning how to interact with websites via python and I've been following this tutorial.
I wanted to know how xpath worked which led me here.
I've made some adjustments to the code from the original site to select elements only if multiple conditions are met.
However, with both the original code from the tutorial and my own update I seem to be grabbing only a small section of the body. I was under the impression that xpath('//tbody/tr')[:10] would grab the entire body and allow me to interact with 10 (in this case) columns.
At the time of writing, when I run the code it returns 2 hits, while if I do the same thing with a manual copy and paste to Excel I get 94 hits.
Why am I only able to parse a few lines from the table and not all lines?
import requests
from lxml.html import fromstring

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

proxies = get_proxies()
print(proxies)
I figured it out: [:10] refers not to columns but to rows.
Changing 10 to 20, 50 or 100 checks 10, 20, 50 or 100 rows respectively.
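So if the goal is to check every row rather than only the first ten, a minimal sketch is to make the slice optional; the limit parameter here is a name made up for illustration, not something from the tutorial:
import requests
from lxml.html import fromstring

def get_proxies(limit=None):
    # limit=None scans every row; pass an int to scan only the first `limit` rows
    parser = fromstring(requests.get('https://free-proxy-list.net/').text)
    rows = parser.xpath('//tbody/tr')
    proxies = set()
    for i in (rows if limit is None else rows[:limit]):
        if i.xpath('.//td[7][contains(text(),"yes")]') and i.xpath('.//td[5][contains(text(),"elite")]'):
            proxies.add(":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]]))
    return proxies

print(len(get_proxies()))    # all rows
print(len(get_proxies(10)))  # first 10 rows only, as in the original code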

Web scraping an "onclick" object table on a website with python

I am trying to scrape the data for this link: page.
If you click the up arrow you will notice the highlighted days in the month sections. Clicking on a highlighted day brings up a table with the initiated tenders for that day. All I need to do is get the data in each table for each highlighted day in the calendar. There might be one or more tenders (up to a maximum of 7) per day.
Table appears on click
I have done some web scraping with bs4; however, I think this is a job for selenium (please correct me if I am wrong), with which I am not very familiar.
So far, I have managed to find the arrow element by XPATH to navigate around the calendar and show me more months. After that I tried clicking on a random day (in the code below I clicked on 30.03.2020), upon which an HTML element called "tenders-table cloned" appears in the HTML on inspect. The element name stays the same no matter what day you click on.
I am pretty stuck now; I have tried to select, iterate over and/or print what is inside that table, but it either says the object is not iterable or is None.
from selenium import webdriver

chrome_path = r"C:\Users\<name>\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.ibex.bg/bg/данни-за-пазара/централизиран-пазар-за-двустранни-договори/търговски-календар/")

driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[1]/div/i""").click()
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[2]/div[1]/div[3]/table/tbody/tr[6]/td[1]""").click()
Please advise on how I can proceed to extract the data from the table pop-up.
Please try the solution below.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.maximize_window()
wait = WebDriverWait(driver, 20)

elemnt = wait.until(EC.presence_of_element_located((By.XPATH, "//body/div[@id='wrapper']/div[@id='content']/div[@class='tenders']/div[@class='form-group']/div[1]/div[1]//i")))
elemnt.click()

elemnt1 = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='form-group']//div[1]//div[3]//table[1]//tbody[1]//tr[6]//td[1]")))
elemnt1.click()

lists = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table[@class='tenders-table cloned']")))
for element in lists:
    print(element.text)
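If you need the individual cells rather than one blob of text, a possible follow-up, building on the lists variable above and assuming each tender is one row of that table, is:
for table in lists:
    for row in table.find_elements_by_xpath(".//tr"):
        # collect the text of every cell in this row
        cells = [td.text for td in row.find_elements_by_xpath(".//td")]
        print(cells)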
Well, I see no reason to use selenium for a case like this, as it will only slow your task down.
The website is loaded with a JavaScript event which renders its data dynamically once the page loads.
The requests library will not be able to render JavaScript on the fly, so you could use selenium or requests_html, and indeed there are a lot of modules that can do that.
But we do have another option on the table: track where the data is rendered from. I was able to locate the XHR request which is used to retrieve the data from the back-end API and render it on the user's side.
You can find the XHR request by opening Developer-Tools, checking the Network tab and inspecting the XHR/JS requests made, depending on the type of call (such as fetch).
import requests
import json

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    print(json.dumps(r, indent=4))  # to see it in a nice format.
    print(r.keys())

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Because I am just a lazy coder, I would do it this way:
import requests
import re
import pandas as pd
import ast
from datetime import datetime

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    matches = set(re.findall(r"tender_date': '([^']*)'", str(r)))
    sort = sorted(matches, key=lambda k: datetime.strptime(k, '%d.%m.%Y'))
    print(f"Available Dates: {sort}")
    opa = re.findall(r"({\'id.*?})", str(r))
    convert = [ast.literal_eval(x) for x in opa]
    df = pd.DataFrame(convert)
    print(df)
    df.to_csv("data.csv", index=False)

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Output: view-online
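Since the regex/ast round-trip is somewhat fragile, an alternative sketch is to walk the parsed JSON directly and collect every dict that has a 'tender_date' key; the exact shape of the response isn't documented, so this deliberately makes no assumption about the nesting:
import requests
import pandas as pd

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def collect_tenders(node, found):
    # recursively gather every dict that looks like a tender record
    if isinstance(node, dict):
        if 'tender_date' in node:
            found.append(node)
        for value in node.values():
            collect_tenders(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_tenders(item, found)
    return found

r = requests.post("http://www.ibex.bg/ajax/tenders_ajax.php", data=data).json()
df = pd.DataFrame(collect_tenders(r, []))
print(df.head())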

How do I have beautiful soup read in the html fully? Possibly selenium issue?

I am trying to have some practice with beautiful soup, web scraping and python but I am struggling with getting this data from certain tags. I am trying to go through multiple pages of data on cars.com.
So when I read in the HTML, the tags I need are
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
because the page number is between them, and in order to loop through the website's pages I need to know the max number of pages.
from bs4 import BeautifulSoup
import requests
url = 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042'
#
source = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&zc=21042').text
source = requests.get(url).content
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())
link = soup.find(word_ = "cars-shop-srp-pagination")# linkNext = link.find('a')
print(link)
When I go through the output, the only thing I see for "cars-shop-srp-pagination" is
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
when I need to see all of the code inside them; specifically, I want to get to:
*"<li ng-if="showLast"> <a class="js-last-page" ng-click="goToPage($event, numberOfPages)">50</a> </li>"*
Remember that BeautifulSoup only parses through the HTML/XML code that you give it. If the page number isn't in your captured HTML code in the first place, then that's a problem with capturing the code properly, not with BeautifulSoup. Unfortunately, I think that this data is dynamically generated.
I found a work-around, though. Notice that at the top of the search results, the page says "(some number of cars) matches near you". For example:
<div class="matchcount">
<span class="filter-count">1,711</span>
<span class="filter-text"> matches near you</span>
You could capture this number, then divide by the number of results per page being displayed. In fact, this latter number can be passed into the URL. Note that you have to round up to the nearest integer, to catch the search results that show up on the final page. Also, any commas in numbers over 999 have to be removed from the string before you can int it.
from bs4 import BeautifulSoup
import urllib2
import math
perpage = 100
url = 'https://www.cars.com/for-sale/searchresults.action/'
url += '?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=%d' % perpage
url += '&prMx=25000&searchSource=PAGINATION&sort=relevance&zc=21042'
response = urllib2.urlopen(url)
source = response.read()
soup = BeautifulSoup(source, 'lxml')
count_tag = soup.find('span', {'class' : 'filter-count'})
count = int(count_tag.text.replace(',',''))
pages = int(math.ceil(1.0* count / perpage))
print(pages)
One catch to this however is that if the search isn't refined enough, the website will say something like "Over 30 thousand matches", which is not an integer.
Also, I was getting a 503 response from requests.get(), so I switched to using urllib2 to get the HTML.
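With pages computed, a minimal sketch of the paging loop, sticking to the urllib2 approach from this answer and reusing its filter-count trick, could be:
import math
import urllib2
from bs4 import BeautifulSoup

perpage = 100
base = ('https://www.cars.com/for-sale/searchresults.action/'
        '?dealerType=all&mdId=58767&mkId=20089&page=%d&perPage=%d'
        '&prMx=25000&searchSource=PAGINATION&sort=relevance&zc=21042')

# first request just to work out how many pages there are
first = BeautifulSoup(urllib2.urlopen(base % (1, perpage)).read(), 'lxml')
count = int(first.find('span', {'class': 'filter-count'}).text.replace(',', ''))
pages = int(math.ceil(1.0 * count / perpage))

soups = [first]
for page in range(2, pages + 1):
    # fetch and parse each remaining results page
    soups.append(BeautifulSoup(urllib2.urlopen(base % (page, perpage)).read(), 'lxml'))

print(len(soups))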
All that info (number of results, number of pages, results per page) is stored in a JavaScript dictionary within the returned content. You can simply regex out that object and parse it with json. Note that the url is a query string and you can alter the results-per-page count in it. So, after doing an initial request to determine how many results there are, you can do the calculations and make any other changes. Note that you may also be able to use json throughout and not BeautifulSoup. Though I think there would be a limit (perhaps the 20) on what you can grab from each page as shown below, so it is probably better to go with 100 results per page, make the initial request, regex out the info and, if there are more than 100 results, loop over the pages, altering the url, to collect the rest of the results.
I don't think, regardless of the number of pages indicated/calculated, that you can actually go beyond page 50.
import requests
import re
import json

# the page embeds a "digitalData = {...};" JavaScript object holding the search metadata
p = re.compile(r'digitalData = (.*?);')
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042')
data = json.loads(p.findall(r.text)[0])
num_results_returned = data['page']['search']['numResultsReturned']
total_num_pages = data['page']['search']['totalNumPages']
num_results_on_page = data['page']['search']['numResultsOnPage']
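A sketch of the loop described above (initial request at 100 results per page, then walk the remaining pages, keeping the page-50 cap in mind) could look like this:
import requests
import re
import json

p = re.compile(r'digitalData = (.*?);')
base = ('https://www.cars.com/for-sale/searchresults.action/'
        '?dealerType=all&mkId=20089&page={}&perPage=100&prMx=25000&rd=99999'
        '&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042')

def get_search_data(page):
    # pull the embedded digitalData object out of the page source
    html = requests.get(base.format(page)).text
    return json.loads(p.findall(html)[0])

first = get_search_data(1)
total_pages = min(first['page']['search']['totalNumPages'], 50)  # hard cap at page 50

all_pages = [first]
for page in range(2, total_pages + 1):
    all_pages.append(get_search_data(page))

print(len(all_pages))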

Mediawiki bot to find out the changes of the last hours

Asking for:
I want to use a bot (MediaWiki API or direct database access) that finds out the changes to all pages in the last few hours. I need a diff and want to process the text afterwards. A pointer in the right direction (diff of the last edits) is enough. I know how to iterate over pages.
Background:
I want to process the diff to find new download links added as part of a template call. For this step I don't need help.
You can use pywikibot. This should be a (maybe) good start:
from pywikibot import Site
from datetime import timedelta

my_site = Site(language, family)  # Or set a default site with the pywikibot config file
# From this object you may get the changes by timestamp; pywikibot.Timestamp objects are datetime objects as well.
current_time = my_site.getcurrenttime()
my_site.recentchanges(start=current_time, end=current_time - timedelta(hours=6))
You can iterate over the pages which changed in the last X (here 6) hours and, for each one:
from pywikibot import Page

current_page = Page(my_site, page_title)  # page title as written in the recentchanges return object
revisions = current_page.revisions(total=some_number, content=True)
After that you can use some of the internal APIs to get the diff between two revisions:
from pywikibot.diff import PatchManager
PatchManager(first_rev_text, second_rev_text).print_hunks()  # print_hunks is for interactive changes, but you can work with any of the internal APIs here (that might not be simple).
That's not the complete code you need, but you can work from here and add some more logic to solve your case.
Good luck:)
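Putting the pieces together, a rough end-to-end sketch might look like this; the revision .text access and the exact shape of the recentchanges entries vary a bit between pywikibot versions, so treat those details as assumptions to verify against your install:
from datetime import timedelta

import pywikibot
from pywikibot.diff import PatchManager

site = pywikibot.Site('en', 'wikipedia')  # example project; use your own language/family
now = site.getcurrenttime()

# recentchanges yields dicts describing each change, including the page title
for change in site.recentchanges(start=now, end=now - timedelta(hours=6)):
    page = pywikibot.Page(site, change['title'])
    # newest two revisions, with content loaded
    revs = list(page.revisions(total=2, content=True))
    if len(revs) < 2:
        continue  # brand-new page or only one revision available
    # revisions come newest first; .text is assumed to hold the wikitext
    PatchManager(revs[1].text, revs[0].text).print_hunks()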

Google Places Search Dynamic Data Loading

Hi, I am trying to search Google Places and the basic structure is fine:
https://plus.google.com/local/Leeds,%20United%20Kingdom/s/car%20rental
But for loading additional data you scroll down the page. I have used Fiddler to check the request but cannot identify where any of the data being posted comes from.
Does anyone know how to simply load or start on page 2/3? Or even load more than 10 at once?
This will load results from 30 to 40; it responds in a format that I can't recognize, but it does what you want. You must be more specific if you want further assistance. Offset is the offset of the query, so 30 gets from 30 to 40, 100 gets from 100 to 110, etc.
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import requests
url = "https://plus.google.com/_/local/searchresultsonly"
offset = "30"
data = {
    "f.req": '[[[],[],"car rental","Leeds, United Kingdom",[],["Leeds, West Yorkshire, UK",[0,0,0,0]],null,0],[{}]]'.format(offset),
}
print requests.post(url, data).text