I am trying to scrape the data from this link: page.
If you click the up arrow, you will notice the highlighted days in the month sections. Clicking on a highlighted day brings up a table with the tenders initiated on that day. All I need to do is get the data in each table for each highlighted day in the calendar. There may be one or more tenders (up to a maximum of 7) per day.
Table appears on click
I have done some web scraping with bs4, but I think this is a job for Selenium (please correct me if I am wrong), which I am not very familiar with.
So far, I have managed to find the arrow element by XPath to navigate around the calendar and show more months. After that, I try clicking on a random day (in the code below I clicked on 30.03.2020), upon which an HTML element with the class "tenders-table cloned" appears when inspecting the page. The class name stays the same no matter which day you click on.
I am pretty stuck now. I have tried to select, iterate over, and print what is inside that table, but it either says the object is not iterable or that it is None.
from selenium import webdriver

chrome_path = r"C:\Users\<name>\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.ibex.bg/bg/данни-за-пазара/централизиран-пазар-за-двустранни-договори/търговски-календар/")
# click the arrow to expand the calendar, then click a highlighted day (30.03.2020)
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[1]/div/i""").click()
driver.find_element_by_xpath("""//*[@id="content"]/div[3]/div/div[2]/div[1]/div[3]/table/tbody/tr[6]/td[1]""").click()
Please advise how I can proceed to extract the data from the table pop-up.
Please try the solution below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.maximize_window()
wait = WebDriverWait(driver, 20)
# expand the calendar, then click the highlighted day (30.03.2020 in the question)
element = wait.until(EC.presence_of_element_located((By.XPATH, "//body/div[@id='wrapper']/div[@id='content']/div[@class='tenders']/div[@class='form-group']/div[1]/div[1]//i")))
element.click()
element1 = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='form-group']//div[1]//div[3]//table[1]//tbody[1]//tr[6]//td[1]")))
element1.click()
# print the text of the popup table
lists = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//table[@class='tenders-table cloned']")))
for element in lists:
    print(element.text)
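If you need the rows and cells separately rather than one block of text, here is a small follow-up sketch, assuming the popup keeps the "tenders-table cloned" class and ordinary tr/td markup (which I have not verified for every day):

table = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.tenders-table.cloned")))
rows = []
for tr in table.find_elements(By.CSS_SELECTOR, "tbody tr"):
    cells = [td.text.strip() for td in tr.find_elements(By.TAG_NAME, "td")]
    if cells:  # skip header/empty rows
        rows.append(cells)
print(rows)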
Well, I see no reason to use Selenium for a case like this, as it will only slow down your task.
The website uses JavaScript to render its data dynamically once the page loads.
The requests library cannot render JavaScript on the fly, so you could use selenium or requests_html; indeed, there are a lot of modules that can do that.
But we have another option: track where the data is rendered from. I was able to locate the XHR request that retrieves the data from the back-end API and renders it on the user's side.
You can find the XHR request by opening Developer Tools, checking the Network tab, and looking at the XHR/JS requests made, depending on the type of call (such as fetch).
import requests
import json

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    print(json.dumps(r, indent=4))  # to see it in nice format.
    print(r.keys())

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Because I am just a lazy coder, I will do it this way:
import requests
import re
import pandas as pd
import ast
from datetime import datetime

data = {
    'from': '2020-1-01',
    'to': '2020-3-01'
}

def main(url):
    r = requests.post(url, data=data).json()
    matches = set(re.findall(r"tender_date': '([^']*)'", str(r)))
    sort = sorted(matches, key=lambda k: datetime.strptime(k, '%d.%m.%Y'))
    print(f"Available Dates: {sort}")
    opa = re.findall(r"({\'id.*?})", str(r))
    convert = [ast.literal_eval(x) for x in opa]
    df = pd.DataFrame(convert)
    print(df)
    df.to_csv("data.csv", index=False)

main("http://www.ibex.bg/ajax/tenders_ajax.php")
Output: view-online
Related
I have the API documentation of the website http://json-homework.task-sss.krasilnikov.spb.ru/docs/9f66a575a6cfaaf7e43177317461d057 (which is only in Russian, unfortunately, but I'll try to explain). I am supposed to import the data about the group members from there, but the issue is that the page parameter returns only 5 members, and when you increase the page number, it returns only the next 5 members instead of adding them to the previous five. Here is my code:
import pandas as pd
import requests as rq
import json
from pandas.io.json import json_normalize
url='http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page=1'
data=rq.get(url)
data1=json.loads(data.text)
data1=json_normalize(json.loads(data.text)["response"])
data1
and here is what my output looks like:
By entering bigger and bigger numbers, I also found out that the last part of the data is on page 41, i.e. I need to get the data from pages 1 to 41. How can I include all the pages in my code? Maybe it is possible with some kind of loop, I don't know...
According to the API documentation, there is no parameter to control how many users are returned per page, so you will have to get them 5 at a time, and since there are 41 pages you can just loop through the URLs.
import requests as rq
import json

all_users = []
for page in range(1, 42):
    url = f'http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={page}'
    data = rq.get(url)
    # each "response" is a list of 5 members; append keeps them grouped by page
    all_users.append(json.loads(data.text)["response"])
The above implementation does not, of course, account for any API throttling, i.e. the API may return unexpected data if too many requests are made in a very short period, which you can mitigate with some well-placed delays.
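For instance, here is a minimal version of the same loop with a pause between requests; the 1-second delay is an arbitrary choice, not something the API documents:

import time
import requests as rq
import json

all_users = []
for page in range(1, 42):
    url = f'http://json-homework.task-sss.krasilnikov.spb.ru/api/groups/getmembers?api_key=9f66a575a6cfaaf7e43177317461d057&group_id=4508123&page={page}'
    data = rq.get(url)
    all_users.append(json.loads(data.text)["response"])
    time.sleep(1)  # arbitrary pause to stay gentle with the API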
I am trying to get some practice with Beautiful Soup, web scraping and Python, but I am struggling to get the data out of certain tags. I am trying to go through multiple pages of data on cars.com.
So when I read in the HTML, the tags I need are
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
because the page number is between them, and in order to loop through the website's pages, I need to know the maximum number of pages.
from bs4 import BeautifulSoup
import requests

url = 'https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042'

source = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&zc=21042').text
source = requests.get(url).content
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())

link = soup.find(word_ = "cars-shop-srp-pagination")
# linkNext = link.find('a')
print(link)
When I go through the output, the only thing I see for "cars-shop-srp-pagination" is:
<cars-shop-srp-pagination>
</cars-shop-srp-pagination>
when I need to see all of the code inside them; specifically, I want to get to:
*"<li ng-if="showLast"> <a class="js-last-page" ng-click="goToPage($event, numberOfPages)">50</a> </li>"*
Remember that BeautifulSoup only parses the HTML/XML code that you give it. If the page number isn't in the HTML you captured in the first place, that's a problem with capturing the code properly, not with BeautifulSoup. Unfortunately, I think this data is dynamically generated.
I found a work-around, though. Notice that at the top of the search results, the page says "(some number of cars) matches near you". For example:
<div class="matchcount">
<span class="filter-count">1,711</span>
<span class="filter-text"> matches near you</span>
You could capture this number, then divide by the number of results displayed per page. In fact, this latter number can be passed in the URL. Note that you have to round up to the nearest integer to catch the search results that show up on the final page. Also, any commas in numbers over 999 have to be removed from the string before you can int() it.
from bs4 import BeautifulSoup
import urllib2
import math
perpage = 100
url = 'https://www.cars.com/for-sale/searchresults.action/'
url += '?dealerType=all&mdId=58767&mkId=20089&page=1&perPage=%d' % perpage
url += '&prMx=25000&searchSource=PAGINATION&sort=relevance&zc=21042'
response = urllib2.urlopen(url)
source = response.read()
soup = BeautifulSoup(source, 'lxml')
count_tag = soup.find('span', {'class' : 'filter-count'})
count = int(count_tag.text.replace(',',''))
pages = int(math.ceil(1.0* count / perpage))
print(pages)
One catch to this, however, is that if the search isn't refined enough, the website will say something like "Over 30 thousand matches", which is not an integer.
Also, I was getting a 503 response from requests.get(), so I switched to using urllib2 to get the HTML.
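One way to guard against that case is to check that the captured text really is a number before doing the page arithmetic. A small sketch, continuing from the variables above:

count_text = count_tag.text.replace(',', '')
if count_text.isdigit():
    count = int(count_text)
    pages = int(math.ceil(1.0 * count / perpage))
else:
    # e.g. "Over 30 thousand matches" -- refine the search filters and retry
    pages = None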
All of that info (number of results, number of pages, results per page) is stored in a JavaScript dictionary within the returned content. You can simply regex out that object and parse it with json. Note that the URL is a query string, and you can alter the results-per-page count in it. So, after an initial request to determine how many results there are, you can do the maths and adjust the URL accordingly. You may also be able to use json throughout and skip BeautifulSoup. There would probably be a limit (perhaps 20) on what you can grab per page as shown below, so it is likely better to request 100 results per page, make the initial request, regex out the info, and, if there are more than 100 results, loop over the remaining pages, altering the URL each time, to collect the rest.
I don't think, regardless of the number of pages indicated/calculated, that you can actually go beyond page 50.
import requests
import re
import json

# the page embeds its search metadata in a "digitalData = {...};" script variable
p = re.compile(r'digitalData = (.*?);')
r = requests.get('https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mkId=20089&page=1&perPage=20&prMx=25000&rd=99999&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042')
data = json.loads(p.findall(r.text)[0])
num_results_returned = data['page']['search']['numResultsReturned']
total_num_pages = data['page']['search']['totalNumPages']
num_results_on_page = data['page']['search']['numResultsOnPage']
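Here is a sketch of the loop described above, reusing the query string from the question but with perPage raised to 100, and capping at page 50 since the site does not seem to serve results beyond that; parsing each page's listings is left to whatever approach you prefer:

import requests
import re
import json

p = re.compile(r'digitalData = (.*?);')
base = ('https://www.cars.com/for-sale/searchresults.action/'
        '?dealerType=all&mkId=20089&perPage=100&prMx=25000&rd=99999'
        '&searchSource=GN_REFINEMENT&sort=relevance&stkTypId=28881&zc=21042&page={}')

first = requests.get(base.format(1)).text
search_info = json.loads(p.findall(first)[0])['page']['search']
total_pages = min(search_info['totalNumPages'], 50)  # page 50 appears to be the hard limit

pages_html = [first]
for page in range(2, total_pages + 1):
    pages_html.append(requests.get(base.format(page)).text)
# parse each entry of pages_html (e.g. with BeautifulSoup or another regex) for the listings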
Recently I started using Dash for data visualization. I'm analyzing stock data using the Quandl API, but I am unable to get multiple dashboards, each with a dropdown listing the options of its dataset, using a for loop like this:
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import pandas as pd
import quandl
import plotly.graph_objs as go
import auth

api_key = auth.key

def easy_analysis(quandl_datasets):
    try:
        for dataset in quandl_datasets:
            df = quandl.get(dataset, authtoken=api_key)
            df = df.reset_index()
            app = dash.Dash(__name__)
            app.layout = html.Div([
                html.H3(dataset),
                dcc.Dropdown(
                    id=dataset,
                    options=[{'label': s, 'value': s} for s in df.columns[1:]],
                    value=['Open'],
                    multi=True
                ),
                dcc.Graph(id='dataset' + str(dataset))
            ])

            @app.callback(
                Output('dataset' + str(dataset), 'figure'),
                [Input(dataset, 'value')]
            )
            def draw_graph(dataset):
                graphs = []
                for column in dataset:
                    graphs.append(go.Scatter(
                        x=list(df.Date),
                        y=list(df[column]),
                        name=str(column),
                        mode='lines'
                    ))
                return {'data': graphs}

        app.run_server(debug=True)
    except Exception as e:
        print(str(e))

easy_analysis(['NSE/KOTAKNIFTY', 'NSE/ZENSARTECH', 'NSE/BSLGOLDETF'])
The output I expected was multiple dashboards with all the dropdown options, one after the other. But what I got was a single dashboard for the last item in the list passed to easy_analysis():
easy_analysis(['NSE/KOTAKNIFTY','NSE/ZENSARTECH','NSE/BSLGOLDETF']) considered only 'NSE/BSLGOLDETF'.
What am I supposed to do to fix this and get a dashboard for each dataset in the list? I also checked the Dash User Guide, but could not find what I was looking for.
However, when passed a list with only one dataset, the code works fine with the for loop and the graph changes according to the option selected in the dropdown.
The code is here.
The code does not work because you are redefining a Dash app at each iteration of the for loop.
Even if you have three datasets, you need to define the Dash app and its layout only once.
You can make three requests to the Quandl API and, if possible, save everything in the same pandas DataFrame.
One question is whether you want to display all dropdowns and graphs (i.e. a dropdown + graph for each Quandl dataset) or only one dropdown and one graph. I would suggest starting with the first approach, because it is much easier; a rough sketch follows. For the second approach you can have a look at this solution.
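Here is a minimal sketch of that first approach, assuming the same three Quandl codes; the component ids, the per-dataset layout blocks, and the default-argument trick in the callback are illustrative choices rather than the only way to do it:

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import quandl
import plotly.graph_objs as go
import auth

api_key = auth.key
datasets = ['NSE/KOTAKNIFTY', 'NSE/ZENSARTECH', 'NSE/BSLGOLDETF']

# fetch every dataset up front, keyed by its Quandl code
frames = {code: quandl.get(code, authtoken=api_key).reset_index() for code in datasets}

app = dash.Dash(__name__)

# one H3 + dropdown + graph block per dataset, all in a single layout
blocks = []
for code in datasets:
    df = frames[code]
    blocks.append(html.Div([
        html.H3(code),
        dcc.Dropdown(
            id='dropdown-' + code,
            options=[{'label': s, 'value': s} for s in df.columns[1:]],
            value=['Open'],
            multi=True
        ),
        dcc.Graph(id='graph-' + code)
    ]))
app.layout = html.Div(blocks)

# one callback per dataset; the default argument pins the current code to each function
for code in datasets:
    @app.callback(Output('graph-' + code, 'figure'),
                  [Input('dropdown-' + code, 'value')])
    def draw_graph(columns, code=code):
        df = frames[code]
        return {'data': [go.Scatter(x=list(df.Date), y=list(df[col]),
                                    name=str(col), mode='lines')
                         for col in columns]}

if __name__ == '__main__':
    app.run_server(debug=True)

The key point is that dash.Dash() and app.layout are created exactly once; each dataset only contributes a block to the layout plus its own callback.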
I wrote a very simple function to pull data from the ESPN API and display it in default/index. However, default/index is a blank page.
At this point I'm not even trying to parse the JSON - I just want to see something in my browser.
default.py:
import urllib2
import json

# espn_uri being pulled from models/db.py
def index():
    r = urllib2.Request(espn_uri)
    opener = urllib2.build_opener()
    f = opener.open(r)
    status = json.load(f)
    return dict(status)
default/index.html:
{{status}}
Thank you!
Try: return dict(status=status)
return dict(status) works because status is itself a dict, and dict(status) just copies it. But it probably has no key named status, or at least nothing interesting.
And yes, you need =.
As JLundell advises, first return paired data via the dictionary:
return dict(my_status=status)
Second, as you've worked out, use the following in index.html to access the returned value rather than the local variable. Make sure you use the equals sign here or nothing will display:
{{=my_status}}
When it comes to JSON, you can return the data using
return my_status.json()
Several other options are available to return data as a list, or to return HTML.
Finally, I recommend that you make use of jQuery and AJAX ($.ajax), so that the AJAX return value can be easily assigned to a JS object. This will also allow you to handle success or errors in the form of JS functions.
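Putting the two answers together, a minimal corrected controller might look like the sketch below (keeping the original urllib2 fetch; espn_uri still comes from models/db.py), with {{=my_status}} in views/default/index.html:

import urllib2
import json

def index():
    # fetch the ESPN feed and hand it to the view under an explicit key
    r = urllib2.Request(espn_uri)
    f = urllib2.build_opener().open(r)
    status = json.load(f)
    return dict(my_status=status)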
I want to be able to download the full histories of a few thousand articles from http://en.wikipedia.org/wiki/Special:Export, and I am looking for a programmatic approach to automate it.
I started the following in Python, but it doesn't produce any useful result:
query = "http://en.wikipedia.org/w/index.api?title=Special:Export&pages=%s&history=1&action=submit" % 'Page_title_here'
f = urllib.urlopen(query)
s = f.read()
Any suggestions?
Drop the list of pages you want to download into the pages array and this should work. Run the script and it will print the XML file. Note that Wikipedia seems to block the urllib user agent, but I don't see anything on the pages suggesting that automated downloading is disallowed. Use at your own risk.
You can also add 'curonly':1 to the dictionary to fetch only the current version.
#!/usr/bin/python
import urllib

# spoof the user agent; Wikipedia appears to block urllib's default one
class AppURLopener(urllib.FancyURLopener):
    version = "WikiDownloader"

urllib._urlopener = AppURLopener()

query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"
pages = ['Canada']

data = {'catname': '', 'wpDownload': 1, 'pages': "\n".join(pages)}
data = urllib.urlencode(data)
f = urllib.urlopen(query, data)
s = f.read()
print(s)
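If you are fetching a few thousand pages, a small extension of the same approach might batch the titles and write each response to disk. A rough sketch; the batch size of 50, the example titles, and the file naming are arbitrary choices, not anything the export endpoint requires:

#!/usr/bin/python
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "WikiDownloader"

urllib._urlopener = AppURLopener()
query = "http://en.wikipedia.org/w/index.php?title=Special:Export&action=submit"

titles = ['Canada', 'Physics']  # replace with your few thousand article titles
batch_size = 50  # arbitrary; keeps each POST body a manageable size

for i in range(0, len(titles), batch_size):
    batch = titles[i:i + batch_size]
    data = urllib.urlencode({'catname': '', 'wpDownload': 1, 'pages': "\n".join(batch)})
    xml = urllib.urlopen(query, data).read()
    # one XML dump per batch
    with open('export_%04d.xml' % (i // batch_size), 'w') as out:
        out.write(xml)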