Second day of web scraping with Python. I am trying to pull a substring out of a string. I wrote the following Python code using BeautifulSoup:
containers = page_soup.findAll("li",{"class":"grid-tile "})
container_test = containers[7]
product_container = container_test.findAll("div",{"class":"product-swatches"})
product = product_container[0].findAll("li")
product[0].a.img.get("data-price")
This outputs the following:
'{"saleprice":"$39.90","price":""}'
How do I print out saleprice and price separately? Result should look like:
saleprice = $39.90
price = ""
Use the json module - specifically, the loads method, which parses JSON-formatted strings like the one above (these are common on websites).
>>> import json
>>> string = '{"saleprice":"$39.90","price":""}'
>>> json_data = json.loads(string)
>>> sale_price = json_data['saleprice']
>>> price = json_data['price']
>>> print(sale_price, price)
(u'$39.90', u'')
The u preceding the string indicates that the string is unicode, which is well explained here.
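A quick illustration of that difference (nothing here is specific to your page): under Python 2, json.loads returns unicode objects, while under Python 3 every str is already Unicode, so the u prefix disappears:
import json

data = json.loads('{"saleprice":"$39.90","price":""}')

# Python 2 prints <type 'unicode'>; Python 3 prints <class 'str'>.
print(type(data['saleprice']))
print(data['saleprice'])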
Additionally, you could use ast.literal_eval, as the string is formatted like a normal Python dictionary. That process would be:
import ast

string = '{"saleprice":"$39.90","price":""}'
dict_representation_of_string = ast.literal_eval(string)
print(dict_representation_of_string.keys())
# ['price', 'saleprice']
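Note that ast.literal_eval only works here because this particular string happens to be valid Python literal syntax as well. JSON containing true, false or null is not a Python literal, so json.loads is the safer default; a small illustration (the instock/price payload is made up for the example):
import ast
import json

json_string = '{"instock": true, "price": null}'

print(json.loads(json_string))  # {'instock': True, 'price': None}
# ast.literal_eval(json_string) would raise ValueError here,
# because true and null are not Python literals.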
This link should be able to help:
Convert a String representation of a Dictionary to a dictionary?
import ast

BSoutput = '{"saleprice":"$39.90","price":""}'
testing = ast.literal_eval(BSoutput)
saleprice = testing['saleprice']
price = testing['price']
print("saleprice = " + saleprice)
print("price = " + price)
The following code goes over the 10 pages of JSON returned by a GET request to the URL below and checks how many records satisfy the condition that bloodPressureDiastole is between the specified limits. It does the job, but I was wondering if there is a better or cleaner way to achieve this in Python.
import urllib.request
import urllib.parse
import json

baseUrl = 'https://jsonmock.hackerrank.com/api/medical_records?page='
count = 0

for i in range(1, 11):
    url = baseUrl + str(i)
    f = urllib.request.urlopen(url)
    response = f.read().decode('utf-8')
    response = json.loads(response)
    lowerlimit = 110
    upperlimit = 120
    for elem in response['data']:
        bd = elem['vitals']['bloodPressureDiastole']
        if bd >= lowerlimit and bd <= upperlimit:
            count = count + 1

print(count)
There is no attribute-style access to JSON content because json.loads gives you a plain dict (see the translation table here). A dict provides access via the __getitem__ method (dict[key]) rather than __getattr__ (object.field), since keys may be any hashable objects, not only strings. Moreover, even string keys cannot serve as attribute names if they start with digits or clash with built-in dictionary methods.
Despite this, you can define your own class that provides the desired behaviour for acceptable key names. json.loads has an object_hook argument that accepts any callable (a function or a class) which takes a dict as its sole argument (not only the top-level result but every nested object in the JSON, recursively) and returns something to use in its place. If your JSON follows a fixed template, you can define a class with predefined fields for the JSON content, and even with methods, so that you end up with a proper Python object that is part of your domain logic.
For instance, let's implement attribute access. I get the JSON content from the response.json method of requests, but it takes the same arguments as the json package. The comments in the code contain remarks on how to make your code more Pythonic.
from collections import Counter

from requests import get


class CustomJSON(dict):
    def __getattr__(self, key):
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value


LOWER_LIMIT = 110  # Constants should be in uppercase.
UPPER_LIMIT = 120

base_url = 'https://jsonmock.hackerrank.com/api/medical_records'
# It is better to use dedicated tools for building URLs
# in order to avoid possible exceptions in the future.
# By the way, your version could look clearer with f-strings,
# which interpolate values from variables (and more) in place:
# url = f'https://jsonmock.hackerrank.com/api/medical_records?page={i}'

counter = Counter(normal_pressure=0)
# A plain integer would also do. A Counter is useful
# in case you need to count any additional information later.

for page_number in range(1, 11):
    records = get(
        base_url, params={"page": page_number}  # params puts page into the query string (?page=N)
    ).json(object_hook=CustomJSON)
    # Python has a pile of libraries for handling URL requests and responses.
    # urllib is a standard library rewritten from scratch for Python 3.
    # However, there is a more featured (connection pooling, redirects, proxies,
    # SSL verification, etc.) and more convenient third-party library: urllib3.
    # Built on top of it, requests provides an even easier and friendlier way
    # to make URL requests, so I highly recommend it
    # unless you are aiming for complex connections and URL processing.
    for elem in records.data:
        if LOWER_LIMIT <= elem.vitals.bloodPressureDiastole <= UPPER_LIMIT:
            counter["normal_pressure"] += 1

print(counter)
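For comparison, if attribute access is not important to you, the same count can be written without the helper class at all (a minimal sketch using plain dict access and sum over a generator expression):
import requests

BASE_URL = 'https://jsonmock.hackerrank.com/api/medical_records'
LOWER_LIMIT, UPPER_LIMIT = 110, 120

count = 0
for page_number in range(1, 11):
    data = requests.get(BASE_URL, params={"page": page_number}).json()
    # True counts as 1, so summing the comparisons counts matching records.
    count += sum(
        LOWER_LIMIT <= record['vitals']['bloodPressureDiastole'] <= UPPER_LIMIT
        for record in data['data']
    )

print(count)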
I'm still a beginner with Scrapy, but this problem really got me scratching my head. I've got a webstore from which I need to extract data. The data is all on one page, but most of the time incomplete. It always has a name, but not always an amount or a description. It's structured in repeating classes like this. Note that this example has all three datafields filled.
I need:
The product name, located in <h4 class="mod-article-tile__title">
The product amount, located in <span class="price__unit">
The product description, located in <div class="mod-article-tile__info">
I managed to extract the data I need like this:
import pprint

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'aldi'
    allowed_domains = ['aldi.nl']
    base_url = 'https://www.aldi.nl/onze-producten/a-merken.html'
    start_urls = ['https://www.aldi.nl/onze-producten/a-merken.html']

    def parse(self, response):
        products = response.xpath('//*[@class="mod-article-tile__content"]').extract()
        name = response.xpath('//*[@class="mod-article-tile__title"]/text()').extract()
        amount = response.xpath('//*[@class="price price--50 price--right mod-article-tile__price"]/text()').extract()
        info = response.xpath('//*[@class="mod-article-tile__info"]/p/text()').extract()
        i = 0
        for product in products:
            pprint.pprint(name[i] + " : " + amount[i] + ", " + info[i])
            i += 1
However, this doesn't take incomplete data into account. Since not all lists have the same length, an IndexError is thrown and the data isn't assigned correctly. I tried working with product instead, but I can't use XPath on it afterwards because it's a string.
So, is there a way to use xpath on the string result, or another way to extract the data from product? Or should I rather look into checking if the parsed data is empty, and insert empty data there?
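For reference, one direction that seems promising (a sketch I haven't verified against the live page, with class names taken from above) is to keep the product selectors as selector objects and run relative XPath queries on each one instead of extracting strings first:
import scrapy


class BasicSpider(scrapy.Spider):
    name = 'aldi'
    allowed_domains = ['aldi.nl']
    start_urls = ['https://www.aldi.nl/onze-producten/a-merken.html']

    def parse(self, response):
        # Iterate over selector objects, not extracted strings, so each
        # product tile keeps its own name/amount/info together.
        for product in response.xpath('//*[@class="mod-article-tile__content"]'):
            name = product.xpath('.//*[@class="mod-article-tile__title"]/text()').extract_first()
            amount = product.xpath('.//*[@class="price__unit"]/text()').extract_first()
            info = product.xpath('.//*[@class="mod-article-tile__info"]/p/text()').extract_first()
            # extract_first() returns None when a field is missing, so
            # incomplete tiles no longer shift the fields out of alignment.
            yield {'name': name, 'amount': amount, 'info': info}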
Oh, and also, I can't seem to remove the pesky \n\t's that appear everywhere. I tried
def clean_string(self, string):
    result = string.replace('\\n', '')
    result = result.replace('\\t', '')
    return result.strip()
But it didn't do the trick. Anyone able to drop a hint to resolve that?
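(For what it's worth, a hunch I still need to test: the extracted text probably contains real newline and tab characters rather than the literal backslash sequences '\\n' and '\\t' that clean_string replaces, so collapsing whitespace directly may be all it needs to do:)
def clean_string(self, string):
    # Collapse runs of whitespace (spaces, actual newlines and tabs)
    # into single spaces and trim the ends.
    return ' '.join(string.split())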
I am getting an error when I try to "flatten" JSON into a DataFrame; I believe it is because some of the cells contain NaN. What is the best way to handle this?
The error I get is "AttributeError: 'float' object has no attribute 'keys'".
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
client = MongoClient('mongodb://localhost:27017/')
#Import Counterparties
counterpartydb = client.counterparties
cptylist = counterpartydb.counterparties
cptylists = pd.DataFrame(list(cptylist.find()))
details = pd.DataFrame(list(cptylists['details']))
CurRating = pd.DataFrame(list(cptylists['currentRating']))
Since MongoDB is schemaless, there will sometimes be null values in a response. You can iterate over the results and check whether each value is None.
cptylists = pd.DataFrame(list(cptylist.find()))

creditRating = []
for rating in cptylists['creditRating']:
    if rating['creditRating'] is not None:
        creditRating.append(rating['creditRating'])
    else:
        creditRating.append('No value in database')

creditRating = pd.DataFrame(creditRating)
The list comprehension version of this would be something like:
if 'creditRating' in cptylists:
    creditRating = pd.DataFrame([k for k in (cptylists['creditRating'] or [])])
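If the AttributeError comes from NaN floats rather than None values, a similar guard can check whether each value is actually a dict before flattening (a sketch along the same lines, reusing the collection names from above):
import pandas as pd
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
cptylist = client.counterparties.counterparties

cptylists = pd.DataFrame(list(cptylist.find()))

# NaN entries are floats and have no .keys(), which is what raises
# "AttributeError: 'float' object has no attribute 'keys'", so replace
# anything that is not a dict with an empty dict before flattening.
details = pd.DataFrame(
    [d if isinstance(d, dict) else {} for d in cptylists['details']]
)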
Via the modules urllib and re I am attempting to scrape a web page for its text content. I'm following a guide provided by "SentDex" on YouTube, found here (https://www.youtube.com/watch?v=GEshegZzt3M), and the documentation on the official Python site to cobble together a quick solution. The information that comes back has plenty of HTML markup and special characters that I am trying to remove. My end result is successful, but I feel that it is a hard-coded solution only useful for this one scenario.
The code is as follows:
import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = "http://someUrl.com/dir/doc.html"  # Target URL
values = {'s': 'basics',
          'submit': 'search'}  # Set parameters for later use

data = urllib.parse.urlencode(values)  # Really not sure...
data = data.encode('utf-8')  # set to UTF-8
req = urllib.request.Request(url, data)  # Arrange the request parameters
resp = urllib.request.urlopen(req)  # Get the document's contents matching that data type from that URL
respData = resp.read()  # read the content into a variable

# BS4 method
soup = BeautifulSoup(respData, 'html.parser')
text = soup.find_all("p")
# end BS4

# re method
text = re.findall(r"<p>(.*?)</p>", str(respData))  # get all paragraph tag contents
text = str(text)  # convert it to a string
# end re

conds = ["<b>", "</b>", "<i>", "</i>", "\\", "[", "]", "\'"]  # things to remove from text
for case in conds:  # for each of those things
    text = text.replace(case, "")  # remove string AKA replace with nothing
Is there a more effective way to achieve the end goal of removing all markup from a string than explicitly listing each thing to strip?
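One more generic route I'm considering (a sketch that reuses the respData variable from above) is to let BeautifulSoup drop the markup itself with get_text(), instead of listing tags by hand:
from bs4 import BeautifulSoup

soup = BeautifulSoup(respData, 'html.parser')

# get_text() strips every tag inside each paragraph, so there is no
# per-tag replace list to maintain.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
text = "\n".join(paragraphs)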
I'm trying to get a TermsResponse object from a solrj QueryResponse object, but it doesn't seem to be working. I'm using scala, but I would be happy with a working java example too.
First I set up the term vector query, which looks to be working:
val solrurl = "http://localhost:8983/solr"
val server= new HttpSolrServer( solrurl )
val query = new SolrQuery
query.setRequestHandler("/tvrh")
query.set("fl", "text")
query.set("tv.all", true)
query.setQuery("uid:" + id)
val response = server.query(query)
The query returns a QueryResponse object whose toString looks to be a JSON object. This object includes the term vector information (terms, frequency, etc . . .) as part of the JSON object.
But when I do this I always get a null object:
val termsResponse = Option(response.getTermsResponse)
Is this function deprecated?
If so what is the best way to retrieve the structure from QueryResponse? Convert to JSON? Some other sources point to using response.get("termVector") but that seems to be deprecated.
Any ideas?
Thanks
I have been using a simple Java object for this with the following configuration.
// Adding terms for 2-word phrases
SolrQuery qterms = new SolrQuery();
qterms.setTerms(true);
qterms.setRequestHandler("/terms");
qterms.setTermsLimit(20);
qterms.addTermsField("PhraseIndx2");
qterms.setTermsMinCount(20);

QueryResponse response = solr.query(query);
SolrDocumentList results = response.getResults();

// queryresponse: get all terms from the 2-phrase field
System.out.println("printing the terms from queryresponse: \n");
QueryResponse resTerms = solr.query(qterms);
TermsResponse termResp = resTerms.getTermsResponse();
List<Term> terms = termResp.getTerms("PhraseIndx2");
System.out.print(terms.size());