Via the urllib and re modules I am attempting to scrape a web page for its text content. I'm following a guide by "SentDex" on YouTube, found here ( https://www.youtube.com/watch?v=GEshegZzt3M ), and the documentation on the official Python site to cobble together a quick solution. The information that comes back has plenty of HTML markup and special characters that I am trying to remove. My end result is successful, but I feel it is a hard-coded solution, only useful for this one scenario.
The code is as follows:
import re
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://someUrl.com/dir/doc.html"  # Target URL
values = {'s': 'basics',
          'submit': 'search'}  # Parameters to send with the request
data = urllib.parse.urlencode(values)  # Encode the parameters as a query string
data = data.encode('utf-8')  # Encode that string as UTF-8 bytes
req = urllib.request.Request(url, data)  # Arrange the request parameters
resp = urllib.request.urlopen(req)  # Fetch the document from that URL
respData = resp.read()  # Read the content into a variable

# BS4 method
soup = BeautifulSoup(respData, 'html.parser')
text = soup.find_all("p")
# end BS4

# re method
text = re.findall(r"<p>(.*?)</p>", str(respData))  # Get all paragraph tag contents
text = str(text)  # Convert the list to a string
# end re

conds = ["<b>", "</b>", "<i>", "</i>", "\\", "[", "]", "\'"]  # Things to remove from text
for case in conds:  # For each of those things
    text = text.replace(case, "")  # Remove it, i.e. replace it with nothing
Are there more effective ways to achieve the end goal of removing all "markup" from a string, beyond explicitly defining each condition?
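For what it's worth, with BeautifulSoup already in play, soup.get_text() strips all markup in one call. The standard library's html.parser offers a similar route without any third-party dependency. A minimal sketch (the sample HTML string is made up for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content, discarding every tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)

html = "<p>Hello <b>bold</b> and <i>italic</i> world</p>"
extractor = TextExtractor()
extractor.feed(html)
print(extractor.get_text())  # Hello bold and italic world
```

This handles any tag, not just the ones listed in conds, because the parser distinguishes markup from text structurally rather than by string matching.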
Related
The following code goes over the 10 pages of JSON returned by GET requests to the URL below, and checks how many records satisfy the condition that bloodPressureDiastole is between the specified limits. It does the job, but I was wondering whether there is a better or cleaner way to achieve this in Python.
import urllib.request
import urllib.parse
import json

baseUrl = 'https://jsonmock.hackerrank.com/api/medical_records?page='
count = 0
for i in range(1, 11):
    url = baseUrl + str(i)
    f = urllib.request.urlopen(url)
    response = f.read().decode('utf-8')
    response = json.loads(response)
    lowerlimit = 110
    upperlimit = 120
    for elem in response['data']:
        bd = elem['vitals']['bloodPressureDiastole']
        if bd >= lowerlimit and bd <= upperlimit:
            count = count + 1
print(count)
There is no access through fields to JSON content, because json.loads gives you a dict object (see the translation table here). It provides access via the __getitem__ method (dict[key]) instead of __getattr__ (object.field), since keys may be any hashable objects, not only strings. Moreover, even strings cannot serve as fields if they start with digits or clash with built-in dictionary methods.
Despite this, you can define your own custom class implementing the desired behaviour for acceptable key names. json.loads has an argument object_hook, to which you may pass any callable object (a function or a class) that takes a dict as its sole argument (not only the resulting one, but every dict in the JSON, recursively) and returns something as the result. If your JSONs match some template, you can define a class with predefined fields for the JSON content, and even with methods, in order to get a robust Python object as part of your domain logic.
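To see object_hook's recursive behaviour in isolation, here is a toy example unrelated to the API above (the tagging function is made up purely for illustration):

```python
import json

# object_hook is called once for every dict decoded from the JSON,
# innermost dicts first; whatever it returns replaces that dict.
def tag_dict(d):
    d["_seen"] = True
    return d

doc = json.loads('{"outer": {"inner": 1}}', object_hook=tag_dict)
print(doc)  # {'outer': {'inner': 1, '_seen': True}, '_seen': True}
```

Note that both the nested dict and the top-level one were passed through the hook; this is what lets a class like the one below apply to the whole JSON tree.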
For instance, let's implement the access through fields. I get the JSON content from the response.json method of requests, but it takes the same arguments as the one in the json package. Comments in the code contain remarks about how to make your code more Pythonic.
from collections import Counter
from requests import get

class CustomJSON(dict):
    def __getattr__(self, key):
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value

LOWER_LIMIT = 110  # Constants should be in uppercase.
UPPER_LIMIT = 120

base_url = 'https://jsonmock.hackerrank.com/api/medical_records'
# It is better to use special tools for handling URLs
# in order to avoid possible exceptions in the future.
# By the way, your option could look clearer with f-strings,
# which can put values from variables (not only) in place:
# url = f'https://jsonmock.hackerrank.com/api/medical_records?page={i}'

counter = Counter(normal_pressure=0)
# It might be left as it was. This option is useful
# in case of additionally counting any other information.

for page_number in range(1, 11):
    records = get(
        base_url, params={"page": page_number}
    ).json(object_hook=CustomJSON)
    # Python has a pile of libraries for handling URL requests & responses.
    # urllib is a standard library rewritten from scratch for Python 3.
    # However, there is a more featured (connection pooling, redirections,
    # proxies, SSL verification &c.) & convenient third-party
    # (this is the only disadvantage) library: urllib3.
    # Based on it, requests provides an easier, more convenient & friendlier
    # way to work with URL requests. So I highly recommend using it
    # unless you are aiming for complex connections & URL processing.
    for elem in records.data:
        if LOWER_LIMIT <= elem.vitals.bloodPressureDiastole <= UPPER_LIMIT:
            counter["normal_pressure"] += 1

print(counter)
I'm still a beginner with Scrapy, but this problem really got me scratching my head. I've got a webstore from which I need to extract data. The data is all on one page, but most of the time incomplete. It always has a name, but not always an amount or a description. It's structured in repeating classes like this. Note that this example has all three datafields filled.
I need:
The product name, located in <h4 class="mod-article-tile__title">
The product amount, located in <span class="price__unit">
The product description, located in <div class="mod-article-tile__info">
I managed to extract the data I need like this:
import pprint
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'aldi'
    allowed_domains = ['aldi.nl']
    base_url = 'https://www.aldi.nl/onze-producten/a-merken.html'
    start_urls = ['https://www.aldi.nl/onze-producten/a-merken.html']

    def parse(self, response):
        products = response.xpath('//*[@class="mod-article-tile__content"]').extract()
        name = response.xpath('//*[@class="mod-article-tile__title"]/text()').extract()
        amount = response.xpath('//*[@class="price price--50 price--right mod-article-tile__price"]/text()').extract()
        info = response.xpath('//*[@class="mod-article-tile__info"]/p/text()').extract()
        i = 0
        for product in products:
            pprint.pprint(name[i] + " : " + amount[i] + ", " + info[i])
            i += 1
However, this doesn't take incomplete data into account. So now since not all lists have the same length, an IndexError is thrown, and the data isn't assigned correctly. I tried parsing it using product, but I can't use xpath on it afterwards because it's a string.
So, is there a way to use xpath on the string result, or another way to extract the data from product? Or should I rather look into checking if the parsed data is empty, and insert empty data there?
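For reference, the usual pattern is to keep each product as a selector object rather than a string: drop the .extract() on products and run relative queries (product.xpath('.//...')) inside the loop, so a missing field comes back empty for that one product instead of shifting all the lists. The same idea, sketched with the standard library's ElementTree on a simplified, made-up and well-formed snippet (a real page would need an HTML parser like Scrapy's selectors):

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the repeating product tiles.
snippet = """
<root>
  <div class="tile"><h4>Apple</h4><span>1.99</span></div>
  <div class="tile"><h4>Pear</h4></div>
</root>
"""

root = ET.fromstring(snippet)
products = []
for tile in root.findall('.//div[@class="tile"]'):
    # Relative lookups: a missing field yields None for this tile only,
    # so the other tiles stay correctly aligned.
    name = tile.find('h4')
    amount = tile.find('span')
    products.append({
        "name": name.text if name is not None else None,
        "amount": amount.text if amount is not None else None,
    })

print(products)
# [{'name': 'Apple', 'amount': '1.99'}, {'name': 'Pear', 'amount': None}]
```

Because each lookup is scoped to one tile, incomplete tiles produce None values instead of an IndexError.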
Oh, and also, I can't seem to remove the pesky \n\t's that appear everywhere. I tried
def clean_string(self, string):
    result = string.replace('\\n', '')
    result = result.replace('\\t', '')
    return result.strip()
But it didn't do the trick. Anyone able to drop a hint to resolve that?
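One likely explanation, if the function really searches for '\\n': that is the two-character sequence backslash-n, which never matches a real newline character ('\n'). Splitting on whitespace and re-joining sidesteps the issue entirely, since str.split() with no argument collapses every run of spaces, tabs, and newlines. A small sketch (the sample string is made up):

```python
def clean_string(string):
    # '\n' and '\t' are single characters; '\\n' would only match
    # a literal backslash followed by the letter n.
    return " ".join(string.split())

raw = "\n\t\tAldi product name\n\t"
print(clean_string(raw))  # Aldi product name
```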
I am trying to extract the text part from requests that I made through the grequests library, but I am unable to figure out how to do so.
If I were using the requests library, I would do:
r = requests.get('http://www.google.com')
htmls.append(r.text)
Now, using grequests, I only get a list of response objects, not their text.
rs = (grequests.get(u) for u in urls)
result = grequests.map(rs)
What I've tried
result = grequests.map(rs.text)
I get an error using above piece of code AttributeError: 'generator' object has no attribute 'text'
My desired output is a list of html text where response code is 200 else the value should be None.
How can I achieve that?
Desired Output:
response_code = [<Response [200]>,<Response [404]>,<Response [200]>]
htmls = ['html1', None, 'html2']
You can use something like this:
rs = (grequests.get(u) for u in urls)
responses = grequests.map(rs)
text = list(map(lambda d: d.text if d else None, responses))
print(text)
What you get back from grequests.map is a list of response objects, and you can then process that data with the built-in map function.
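The `if d else None` guard matters because grequests.map puts None in place of requests that failed. The same post-processing can be sketched without any network, using made-up stand-in response objects:

```python
class FakeResponse:
    """Stand-in for a response object exposing a .text attribute."""
    def __init__(self, text):
        self.text = text

# grequests.map returns None in place of failed requests;
# here we simulate one success, one failure, one success.
responses = [FakeResponse("html1"), None, FakeResponse("html2")]
htmls = [r.text if r else None for r in responses]
print(htmls)  # ['html1', None, 'html2']
```

This matches the desired output above: the HTML text where the request succeeded, None where it did not.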
I am using an API to pull data from a URL; however, the API has a pagination limit. It goes like:
Page (default is 1; it's the page number you want to retrieve)
Per_page (default is 100; it's the maximum number of results returned in the response (max = 500))
I have a script with which I can get the results of a single page, but I want to automate this: loop through all the pages (at 500 results per_page) and load the data into a JSON file.
Here is my code that can get 500 results per_page:
import json
import pprint
import requests

url = "https://my_api.com/v1/users?per_page=500"
header = {"Authorization": "Bearer <my_api_token>"}

s = requests.Session()
s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
resp = s.get(url, headers=header, verify=False)
raw = resp.json()
for x in raw:
    print(x)
The output is 500 but is there a way to keep going and pull the results starting from where it left off? Or even go by page and get all the data per page until there's no data in a page?
It would be helpful if you presented a sample response from your API.
If the API is equipped properly, there will be a next property in a given response that leads you to the next page.
You can then keep calling the API with the link given in the next recursively. On the last page, there will be no next in the Link header.
resp.links["next"]["url"] will give you the URL to the next page.
For example, the GitHub API has next, last, first, and prev properties.
To put it into code, first, you need to turn your code into functions.
Given that there is a maximum of 500 results per page, it implies you are extracting a list of data of some sort from the API. Often, these data are returned in a list somewhere inside raw.
For now, let's assume you want to extract all elements inside a list at raw.get('data').
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500

def compose_url():
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + "1"
    )

def get_result(url=None):
    if url is None:
        url_get = compose_url()
    else:
        url_get = url
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    resp = s.get(url_get, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    if "next" not in resp.links:
        # We are at the last page, return the data
        return data
    # Otherwise, recursively get results from the next URL
    return data + get_result(resp.links["next"]["url"])  # concat lists

def main():
    # Driver function
    data = get_result()
    # Then you can print the data or save it to a file

if __name__ == "__main__":
    # Now run the driver function
    main()
However, if there isn't a proper Link header, I see 2 solutions:
(1) recursion and (2) loop.
I'll demonstrate recursion.
As you have mentioned, when API responses are paginated, i.e. when there is a limit on the maximum number of results per page, there is often a query parameter (a page number or a start index of some sort) to indicate which "page" you are querying, so we'll utilize the page_number parameter in the code.
The logic is:
Given an HTTP response, if there are fewer than 500 results, there are no more pages; return the results.
If there are 500 results in a given response, there is probably another page, so we advance page_number by 1, recurse (by calling the function itself), and concatenate with the previous results.
import requests

header = {"Authorization": "Bearer <my_api_token>"}
results_per_page = 500

def compose_url(results_per_page, current_page_number):
    return (
        "https://my_api.com/v1/users"
        + "?per_page="
        + str(results_per_page)
        + "&page_number="
        + str(current_page_number)
    )

def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    # If the length of data is smaller than results_per_page (500 of them),
    # that means there are no more pages
    if len(data) < results_per_page:
        return data
    # Otherwise, advance the page number and do a recursion
    return data + get_result(current_page_number + 1)  # concat lists

def main():
    # Driver function
    data = get_result(1)
    # Then you can print the data or save it to a file

if __name__ == "__main__":
    # Now run the driver function
    main()
If you truly want to store the raw responses, you can. However, you'll still need to check the number of results in a given response. The logic is similar. If a given raw contains 500 results, it means there is probably another page. We advance the page number by 1 and do a recursion.
Let's still assume raw.get('data') is the list whose length is the number of results.
Because JSON/dictionary files cannot be simply concatenated, you can store raw (which is a dictionary) of each page into a list of raws. You can then parse and synthesize the data in whatever way you want.
Use the following get_result function:
def get_result(current_page_number):
    s = requests.Session()
    s.proxies = {"http": "<my_proxies>", "https": "<my_proxies>"}
    url = compose_url(results_per_page, current_page_number)
    resp = s.get(url, headers=header, verify=False)
    # You may also want to check the status code
    if resp.status_code != 200:
        raise Exception(resp.status_code)
    raw = resp.json()  # of type dict
    data = raw.get("data")  # of type list
    if len(data) == results_per_page:
        return [raw] + get_result(current_page_number + 1)  # concat lists
    return [raw]  # wrap raw in a list object on the fly
As for the loop method, the logic is similar to recursion. Essentially, you will call the get_result() function a number of times, collect the results, and break early when the furthest page contains less than 500 results.
If you know the total number of results in advance, you can simply run the loop a predetermined number of times.
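To sketch the loop variant without touching a real API, the page fetch can be replaced by a stub (fetch_page below is made up; a real version would issue the HTTP request as in get_result above):

```python
results_per_page = 5  # kept small so the demo is easy to trace

def fetch_page(page_number):
    """Stand-in for the HTTP call: pretend the API holds 12 records."""
    records = list(range(12))
    start = (page_number - 1) * results_per_page
    return records[start:start + results_per_page]

def get_all_results():
    data = []
    page_number = 1
    while True:
        page = fetch_page(page_number)
        data.extend(page)
        # A page shorter than the maximum means it was the last page.
        if len(page) < results_per_page:
            break
        page_number += 1
    return data

print(len(get_all_results()))  # 12
```

The break condition mirrors the recursion's base case: stop as soon as a page comes back with fewer than results_per_page entries.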
Do you follow? Do you have any further questions?
(I'm a little confused by what you mean by "load it into a JSON file". Do you mean saving the final raw results into a JSON file? Or are you referring to the .json() method in resp.json()? In that case, you don't need import json to do resp.json(); the .json() method on resp is actually part of the requests module.)
On a bonus point, you can make your HTTP requests asynchronous, but this is slightly beyond the scope of your original question.
P.s. I'm happy to learn what other solutions, perhaps more elegant ones, that people use.
Second day of web scraping with Python. I am trying to pull a substring out of a string. I wrote the following Python code using BeautifulSoup:
containers = page_soup.findAll("li", {"class": "grid-tile "})
container_test = containers[7]
product_container = container_test.findAll("div", {"class": "product-swatches"})
product = product_container[0].findAll("li")
product[0].a.img.get("data-price")
This outputs the following:
'{"saleprice":"$39.90","price":""}'
How do I print out saleprice and price separately? Result should look like:
saleprice = $39.90
price = ""
Use the json module - specifically, the loads method, which parses the JSON-formatted strings common on websites.
>>> import json
>>> string = '{"saleprice":"$39.90","price":""}'
>>> json_data = json.loads(string)
>>> sale_price = json_data['saleprice']
>>> price = json_data['price']
>>> print(sale_price, price)
(u'$39.90', u'')
The u preceding the string indicates that the string is unicode, which is well explained here.
Additionally, you could use ast.literal_eval, as the string is formatted like a normal Python dictionary. That process would be:
import ast
string = '{"saleprice":"$39.90","price":""}'
dict_representation_of_string = ast.literal_eval(string)
print(dict_representation_of_string.keys())
>>> ['price', 'saleprice']
This link should help:
Convert a String representation of a Dictionary to a dictionary?
import ast

BSoutput = '{"saleprice":"$39.90","price":""}'
testing = ast.literal_eval(BSoutput)
saleprice = testing['saleprice']
price = testing['price']
print("saleprice = " + saleprice)