I am trying to consume the ContextualWeb News API. The endpoint is described here:
https://rapidapi.com/contextualwebsearch/api/web-search
Here is the request snippet in Python as described on RapidAPI:
response = unirest.get("https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/NewsSearchAPI?autoCorrect=true&pageNumber=1&pageSize=10&q=Taylor+Swift&safeSearch=false",
    headers={
        "X-RapidAPI-Host": "contextualwebsearch-websearch-v1.p.rapidapi.com",
        "X-RapidAPI-Key": "XXXXXX"
    }
)
How do I send the request and parse the response? Can you provide a complete code example for the News API?
Use Python 3.x for the code below. Below is a complete example where I pass the string "Taylor Swift" and parse the response. Let me know if you get stuck anywhere.
import requests  # install from: http://docs.python-requests.org/en/master/

# Replace the following string value with your valid X-RapidAPI-Key.
Your_X_RapidAPI_Key = "XXXXXXXXXXXXXXXXXXX"

# The query parameters: (update according to your search query)
q = "Taylor%20Swift"  # the search query (URL-encoded)
pageNumber = 1  # the number of the requested page
pageSize = 10  # the size of a page
autoCorrect = True  # automatically correct spelling
safeSearch = False  # filter results for adult content

response = requests.get(
    "https://contextualwebsearch-websearch-v1.p.rapidapi.com/api/Search/NewsSearchAPI?q={}&pageNumber={}&pageSize={}&autoCorrect={}&safeSearch={}".format(
        q, pageNumber, pageSize, autoCorrect, safeSearch),
    headers={
        "X-RapidAPI-Host": "contextualwebsearch-websearch-v1.p.rapidapi.com",
        "X-RapidAPI-Key": Your_X_RapidAPI_Key
    }
).json()

# Get the number of items returned
totalCount = response["totalCount"]

# Get the list of most frequent searches related to the input search query
relatedSearch = response["relatedSearch"]

# Go over each resulting item
for webPage in response["value"]:

    # Get the web page metadata
    url = webPage["url"]
    title = webPage["title"]
    description = webPage["description"]
    keywords = webPage["keywords"]
    provider = webPage["provider"]["name"]
    datePublished = webPage["datePublished"]

    # Get the web page image (if it exists)
    imageUrl = webPage["image"]["url"]
    imageHeight = webPage["image"]["height"]
    imageWidth = webPage["image"]["width"]
    thumbnail = webPage["image"]["thumbnail"]
    thumbnailHeight = webPage["image"]["thumbnailHeight"]

    # An example: output the web page url, title and published date:
    print("Url: %s. Title: %s. Published Date: %s." % (url, title, datePublished))
I want to make a script that prints the links of Bing search results to the console. The problem is that when I run the script there is no output. I believe the website thinks I am a bot?
from bs4 import BeautifulSoup
import requests

search = input("search for:")
params = {"q": "search"}
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("Li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
You need to use the search variable instead of the string "search". You also have a typo in your script: li should be lower case.
Change these lines:
params = {"q": "search"}
.......
links = results.find_all("Li", {"class": "b_algo"})
To this:
params = {"q": search}
........
links = results.find_all("li", {"class": "b_algo"})
Note that some queries don't return anything. "crossword" has results, but "peanut" does not. The result page structure may be different based on the query.
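Putting both fixes together, a corrected version of the script from the question might look like this (the page structure is the one assumed in the question, and, as noted, some queries may still return an empty result list):

from bs4 import BeautifulSoup
import requests

search = input("search for:")
params = {"q": search}  # pass the variable, not the literal string "search"
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("li", {"class": "b_algo"})  # lower-case "li"
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)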
There are two issues in this code:
search is a variable name, so it should not be wrapped in quotes. Change it to:
params = {"q": search}
When you include the variable name inside quotes while building the link, it becomes a static string. For a dynamic link you could do it as below:
r = requests.get("http://www.bing.com/"+search, params=params)
After making these two changes, if you still do not get any output, check whether you are using the correct tag in the results variable.
I am trying to scrape review data on booking.com that sits inside a <ul> tag with class="review_list". There are 10 reviews, each inside an <li> with class="review_list_new_item_block".
Here is a picture of the data inside the first <li> tag:
But I noticed that I can't scrape most of the data inside these <ul> and <li> tags, although I always use the same logic for the xpaths. I tried, for example, the following xpaths to scrape the title, text, language, review date and stay date:
title = response.xpath('//h3[@class="c-review-block__title"]/text()').extract()
#title = response.xpath('//div[@class="c-review-block__row"]//h3/text()')
text = response.xpath('//span[@class="c-review__prefix c-review__prefix--color-green"]/span[2]/text()').extract()
lang = response.xpath('//span[@class="c-review__prefix c-review__prefix--color-green"]/span[2]/@lang').extract()
reviewdate = response.xpath('//span[@class="c-review-block__date"]/text()').extract()
staydate = response.xpath('//div[@class="c-review-block__room-info__name"]/div/span/text()').extract()
Only the xpaths for these two items worked:
author = response.xpath('//span[@class="bui-avatar-block__title"]/text()').extract()
authorcountry = response.xpath('//span[@class="bui-avatar-block__subtitle"]/text()').extract()
Do you have any suggestions? Is it an issue with the way I am using the xpaths, or does booking.com have restrictions on this part of the HTML code? Thank you in advance!
My script:
import scrapy

class BookingSpider(scrapy.Spider):
    name = 'booking-spider'
    allowed_domains = ['booking.com']

    # start with the page of all countries
    start_urls = [
        'https://www.booking.com/country.de.html?aid=356980;label=gog235jc-1DCAIoLDgcSAdYA2gsiAEBmAEHuAEHyAEP2AED6AEB-AECiAIBqAIDuAK7q7DyBcACAQ;sid=8de61678ac61d10a89c13a3941fd3dcd'
    ]

    # get country page
    def parse(self, response):
        for countryurl in response.xpath('normalize-space(//a[contains(text(),"Schweiz")]/@href)'):
            url = response.urljoin(countryurl.extract())
            yield scrapy.Request(url, callback=self.parse_country)

    # get page of all hotels in a country
    def parse_country(self, response):
        for hotelsurl in response.xpath('normalize-space(//a[@class="bui-button bui-button--secondary"]/@href)'):
            url = response.urljoin(hotelsurl.extract())
            yield scrapy.Request(url, callback=self.parse_allhotels)

    # get page of one hotel
    def parse_allhotels(self, response):
        for hotelurl in response.xpath('normalize-space(//a[@class="hotel_name_link url"]/@href)'):
            url = response.urljoin(hotelurl.extract())
            yield scrapy.Request(url, callback=self.parse_hotelpage)
        next_page = response.xpath('//a[contains(@class,"paging-next") and contains(@title,"Nächste Seite")]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_allhotels)

    # get review page of this hotel
    def parse_hotelpage(self, response):
        reviewsurl = response.xpath('//a[@class="hp_nav_reviews_link toggle_review track_review_link_zh"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        new_url = url.replace('blockdisplay4', 'tab-reviews')
        yield scrapy.Request(new_url, callback=self.parse_reviews, dont_filter=True)

    # parse its reviews
    def parse_reviews(self, response):
        author = response.xpath('//span[@class="bui-avatar-block__title"]/text()').extract()
        authorcountry = response.xpath('//span[@class="bui-avatar-block__subtitle"]/text()').extract()
        title = response.xpath('//div[@class="c-review-block"]//div[@class="c-review-block__row"]//h3/text()').extract()
        print(title)
You can try the below xpaths.
title:
//div[@class='c-review-block']//div[@class="c-review-block__row"]//h3/text()
text (includes both great and poor text):
//div[@class='c-review-block']//div[@class='c-review-block__row'][3]//text()
review date:
//div[@class='c-review-block']//div[@class='c-review-block__row']//span[@class="c-review-block__date"]/text()
stay date:
//div[@class='c-review-block']//div[@class='c-review-block__room-info']//span[@class="c-review-block__date"]/text()
subtitle:
//div[@class='c-review-block']//span[@class="bui-avatar-block__subtitle"]/text()
You have to get the review nodes by using //div[@class='c-review-block'] and then iterate through all the nodes to get the details. If you are iterating through each review, you just have to replace //div[@class='c-review-block'] with . so that the xpaths are evaluated in the context of that review, as in the sketch below.
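A minimal sketch of that per-review iteration inside the spider's parse_reviews callback could look like the following; the field names in the yielded dict are illustrative, and the class names are the ones quoted above:

def parse_reviews(self, response):
    # One selector per review block; query inside it with relative ".//" xpaths.
    for review in response.xpath('//div[@class="c-review-block"]'):
        yield {
            'author': review.xpath('.//span[@class="bui-avatar-block__title"]/text()').extract_first(),
            'authorcountry': review.xpath('.//span[@class="bui-avatar-block__subtitle"]/text()').extract_first(),
            'title': review.xpath('.//div[@class="c-review-block__row"]//h3/text()').extract_first(),
            'reviewdate': review.xpath('.//span[@class="c-review-block__date"]/text()').extract_first(),
        }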
So I'm aiming to scrape two tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate - after using the search bar to iterate this over a list of license codes. I haven't included the loop fully yet, but I added it at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in two different formats, so I have to scrape them separately. As the Product Data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of it. The harder bit is extracting the Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text using the class function, but I need to have it in tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV; if you have any suggestions, they would be much appreciated, thank you!! Any other general tips to improve my code would be great too, as I am new to web scraping.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import pandas as pd

#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")

url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url

r = requests.get(new_url)
soup = BeautifulSoup(r.content, 'lxml')

table = soup.find_all('table')[0]
df, = pd.read_html(str(table))

certificate = soup.find(class_='certificatecl').text
##certificate1 = pd.read_html(str(certificate))

driver.quit()

df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want, but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
[Desired output screenshot]
EDIT
So whilst this is now working for the Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However, currently it is only printing five copies of the product data for the final license code, which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']

df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))

        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about them in the context of scraping.
import requests
import csv
from bs4 import BeautifulSoup

# This is all you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

# Function to parse a single data page based on a single input code.
def get_data_by_code(code):
    # Parameters to build the POST request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST request to obtain the page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse the html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable. Contains the text of the DIV that follows the first
    # LABEL tag found with text="Status". Which is the status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Return the found data as a list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here the output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        # Write the list of values to the file as a single row.
        writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in the Chrome dev tools "Network" tab to get a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run Chrome to click the "Search" button; you need to forge the request generated by that click. The same goes for any form and any AJAX call.
well... you should sharpen your skills (:
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))

        ### HERE'S THE PROBLEM:
        # The "soup" variable is declared inside the "get_data_by_code" function,
        # so you can't use it in the outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line into the
        # definition of "get_data_by_code" and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    # "row" is the list of values the function already builds (code, status, dates, standard)
    return dict(row=row, table=table)
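To tie that together, the outer loop could then consume the returned dictionary and write both files in one pass. A rough sketch under that assumption, keeping the question's pandas approach for the product tables ("codes" and "get_data_by_code" are the ones defined above):

import csv
import pandas as pd

df3 = pd.DataFrame()
with open('Certificate_Data.csv', 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        result = get_data_by_code(code)            # dict with "row" and "table"
        writer.writerow(result['row'])             # certificate data -> csv row
        df1, = pd.read_html(str(result['table']))  # product data table for this code
        df3 = df3.append(df1)

df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')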
I've been trying to upload a file using the Box v2 API with requests.
So far I've had little luck, though. Maybe someone here can help me see what I'm actually doing wrong.
file_name = "%s%s" % (slugify(sync_file.description), file_suffix)
file_handle = open(settings.MEDIA_ROOT + str(sync_file.document), 'rb')
folder_id = str(sync_file.patient.box_patient_folder_id)
r = requests.post(
files_url,
headers=headers,
files={
file_name: file_handle,
"folder_id": folder_id,
},
)
My authentication works, because I'm creating a folder just before that, using the same data.
A response looks something like this:
{
    u'status': 404,
    u'code': u'not_found',
    u'help_url': u'http://developers.box.com/docs/#errors',
    u'request_id': u'77019510950608f791a0c1',
    u'message': u'Not Found',
    u'type': u'error'
}
Maybe someone on here ran into a similar issue.
You need to pass two Python dictionaries, files and data. files is {uniqFileName: openFileObj}, and data is {uniqFileName: filename}. Below is the upload method from my Box class. And remember to add a final entry in data: 'folder_id': destination_id.
def uploadFiles(self, ufiles, folid):
    '''Uploads 1 or more files in the ufiles list of tuples containing
    (src fullpath, dest name). folid is the id of the folder to
    upload to.'''
    furl = URL2 + 'files/data'
    data, files = {}, {}
    for i, v in enumerate(ufiles):
        ff = v[0]
        fn = v[1]
        # copy to a new, renamed file in the tmp folder if necessary
        # (can't find a way to do this with the api)
        if os.path.basename(ff) != fn:
            dest = os.path.join(TMP, fn)
            shutil.copy2(ff, dest)
            ff = dest
        f = open(ff, 'rb')
        k = 'filename' + str(i)
        data[k] = fn
        files[k] = f
    data['folder_id'] = folid
    res = self.session.post(furl, files=files, data=data)
    for k in files:
        files[k].close()
    return res.status_code
Here is a sample call:
destFol = '406600304'
ret = box.uploadFiles((('c:/1temp/hc.zip', 'hz.zip'),), destFol)
Like I said, the above function is a method of a class, with an instance attr that holds a requests session. But you can use requests.post instead of self.session.post, and it will work just the same. Just remember to add the headers with your apikey and token if you do it outside a session.
According to the documentation, you are supposed to be able to rename the file by giving it a new name in the data dict. But I can't make this work except by copying the src file to a temp dir with the desired name and uploading that. It's a bit of a hack, but it works.
good luck,
Mike
As someone requested my implementation, I figured I would put it out here for anyone trying to achieve something similar.
files_url = "%s/files/content" % (settings.BOX_API_HOST)
headers = {"Authorization": "BoxAuth api_key=%s&auth_token=%s" %
(settings.BOX_API_KEY, self.doctor.box_auth_token)
}
file_root, file_suffix = os.path.splitext(str(self.document))
filename = "%s%s" % (slugify(self.description), file_suffix)
files = {
'filename1': open(settings.MEDIA_ROOT + str(self.document), 'rb'),
}
data = {
'filename1': filename,
'folder_id': str(self.patient.get_box_folder()),
}
r = requests.post(files_url,
headers=headers,
files=files,
data=data)
file_response = simplejson.loads(r.text)
try:
if int(file_response['entries'][0]['id']) > 0:
box_file_id = int(file_response['entries'][0]['id'])
#Update the name of file
file_update_url = "%s/files/%s" % (settings.BOX_API_HOST, box_file_id)
data_update = {"name": filename}
file_update = requests.put(file_update_url,
data=simplejson.dumps(data_update),
headers=headers)
LocalDocument.objects.filter(id=self.id).update(box_file_id=box_file_id)
except:
pass
So in essence, I needed to send the file, retrieve the ID of the newly uploaded file, and send another request to Box to set the correct name. Personally, I don't like it either, but it works for me and I haven't been able to find any other implementation that does the correct naming from the get-go.
Hope someone can benefit from this snippet.
My solution, using requests:
def upload_to_box(folder_id, auth_token, file_out):
    headers = {'Authorization': BOX_AUTH.format(auth_token)}
    url = 'https://api.box.com/2.0/files/content'
    files = {'filename': (new_file_name, open(file_out, 'rb'))}
    data = {'folder_id': folder_id}
    response = requests.post(url, params=data, files=files, headers=headers)
It would be nice if you could specify the new_copy parameter but there's nothing documented for it and it doesn't seem to work.
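For context, the snippet above leaves BOX_AUTH and new_file_name undefined, so here is a hypothetical way they might be filled in before calling it (placeholder values only; the header format mirrors the other answers in this thread, and the folder id is the example used earlier):

# Assumed globals the function above relies on (placeholders, not real credentials):
BOX_AUTH = 'BoxAuth api_key=YOUR_API_KEY&auth_token={}'  # same header style as the other answers
new_file_name = 'hz.zip'  # name the uploaded file should get in Box

# Example call, reusing the folder id from the earlier sample call:
upload_to_box(folder_id='406600304', auth_token='YOUR_AUTH_TOKEN', file_out='c:/1temp/hc.zip')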
In the API, I'm generating an XML response by hitting the URL with request data in params. It contains some fields which have HTML content and tags. The content is saved correctly in the DB, but when the response is generated the tags are encoded, which is expected, as we need to skip those fields while parsing. I would like to know how I can implement CDATA in order to skip the specific fields while parsing.
def generate_mobile_api_success_response(status_code, format, request_id, content = nil)
  format_type_method, options_hash, content_type = get_format_method(format)
  data = { "request_id" => request_id, "status" => status_code, "message" => status_message(status_code) }
  data["data"] = content unless content.blank?
  data = generate_data_format(format, data)
  resp = [status_code, { "Content-Type" => content_type, "request_id" => request_id }, data.send(format_type_method, options_hash)]
  generate_active_controller_response_format(resp)
  resp
end
The content passed is a params hash, and the format is XML. resp contains the following data when I tried to print it; the detailed_description tag contains the encoded data:
[201, {"request_id"=>"b425bce0-307d-012f-3e68-042b2b8686e6", "Content-Type"=>"application/xml"}, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<response>\n <data>\n <fine_print_line_3>line 3</fine_print_line_3>\n <available_start_date>2012-02-02T06:00:00+05:30</available_start_date>\n <status>inactive</status>\n <highlight_line_2>gfgf</highlight_line_2>\n <original_price>50.00</original_price>\n <category_id>bc210bb0-52b7-012e-8896-12313b077c61</category_id>\n <available_end_date>2012-03-25T00:00:00+05:30</available_end_date>\n <expiry_date>2012-08-25T00:00:00+05:30</expiry_date>\n <highlight_line_3></highlight_line_3>\n <product_service>food</product_service>\n <created_at>2012-02-03T15:43:56+05:30</created_at>\n <detailed_description><b>this is the testing detailed</b> </detailed_description>...
I would gladly post some extra code if required.
So your data is in XML and you would like to know the content of these fields?
Use Nokogiri:
xml = data.send(format_type_method, options_hash)
doc = Nokogiri::XML(xml)
start_date = doc.at_xpath("//available_start_date").content
p start_date #=> "2012-02-02T06:00:00+05:30"
Once you have your fields in variables you can do whatever you want with them.