BeautifulSoup scraping Google reviews (Google Places) - HTML

I am attempting to scrape user reviews from Google Places (the API only returns the 5 most helpful reviews). I am using BeautifulSoup to retrieve 4 pieces of information:
1) Name of the reviewer
2) When the review was written
3) Rating (out of 5)
4) Body of review
Inspecting each element, I can find the location of each piece of information:
1) Name of reviewer:
<a class="_e8k" style="color:black;text-decoration:none" href="https://www.google.com/maps/contrib/103603482673238284204/reviews">Steve Fox</a>
2) When the review was written
<span style="color:#999;font-size:13px">3 months ago</span>
3) Rating (visible in the code, but doesn't show when you "run code snippet")
<span class="_pxg _Jxg" aria-label="Rated 1.0 out of 5,"><span style="width:14px"></span></span>
4) Body of the review
<span jsl="$t t-uvHqeLvCkgA;$x 0;" class="r-i8GVQS_tBTbg">Don't go near this company. Must be the world's worst ISP. Threatened to set debt collection services on me when I refused to pay for a service that they had cut off through competence. They even spitefully managed to apply block on our internet connection after we moved to a new Isp. I hate this company.</span>
I am struggling with how to refer to the position of the information within the HTML. I see the last 3 pieces of information are in spans, so I attempted the following - but none of the relevant information was returned:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.google.co.nz/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=orcon&lrd=0x6d0d3833fefacf95:0x59fef608692d4541,1,').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
attempt1 = soup.find_all('span class')
for span in attempt1:
    print(span)
I assume I am not correctly/accurately referencing the 4 pieces of information within the HTML. Can someone point out what is wrong? Regards, Steve

To scrape the reviews of a place you'll need its place id, which looks like this: 0x89c259a61c75684f:0x79d31adb123348d2.
And then you need to make the request with the following URL, which contains the place_id:
https://www.google.com/async/reviewDialog?hl=en&async=feature_id:0x89c259a61c75684f:0x79d31adb123348d2,sort_by:,next_page_token:,associated_topic:,_fmt:pc
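A minimal sketch of building that request URL for an arbitrary place id (actually fetching it is left commented out, since this is an unofficial endpoint that Google may block or change, and it typically needs browser-like headers):

```python
place_id = "0x89c259a61c75684f:0x79d31adb123348d2"
url = (
    "https://www.google.com/async/reviewDialog"
    f"?hl=en&async=feature_id:{place_id},"
    "sort_by:,next_page_token:,associated_topic:,_fmt:pc"
)
print(url)

# With browser-like headers you could then try:
# import urllib.request
# req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
# html = urllib.request.urlopen(req).read()
```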
Alternatively, you could use a third-party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve captchas, and parse all the rich structured data for you.
Example Python code (also available in other languages):
from serpapi import GoogleSearch
params = {
    "engine": "google_maps_reviews",
    "place_id": "0x89c259a61c75684f:0x79d31adb123348d2",
    "hl": "en",
    "api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"reviews": [
  {
    "user": {
      "name": "HerbertTomlinson O",
      "link": "https://www.google.com/maps/contrib/100851257830988379503?hl=en-US&sa=X&ved=2ahUKEwiIlNzLtJrxAhVFWs0KHfclCwAQvvQBegQIARAy",
      "thumbnail": "https://lh3.googleusercontent.com/a/AATXAJyjD5T8NEJSdOUAveA8IuMDTLXE9edBHDpFTvZ8=s40-c-c0x00000000-cc-rp-mo-br100",
      "reviews": 2
    },
    "rating": 4,
    "date": "2 months ago",
    "snippet": "Finally, I found the best coffee shop today. Their choice of music is usually blasting from the past which was really relaxing and made me stay longer. There are tables for lovers and also for group of friends. The coffees and foods here are very affordable and well worth the money. You can't go wrong with this coffee shop. This is very worth to visit."
  },
  {
    "user": {
      "name": "Izaac Collier",
      "link": "https://www.google.com/maps/contrib/116734781291082397423?hl=en-US&sa=X&ved=2ahUKEwiIlNzLtJrxAhVFWs0KHfclCwAQvvQBegQIARA-",
      "thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GgfhltPhiWrkTwe6swLUQRCWf_asuTfHPRnJCLc=s40-c-c0x00000000-cc-rp-mo-br100",
      "reviews": 2
    },
    "rating": 5,
    "date": "a month ago",
    "snippet": "I am not into coffee but one of my friends invited me here. As I looked the menu, I was convinced, so I ordered one for me. The food was tasty and the staff were very friendly and accommodating. The ambience was very cosy and comfortable. The coffee was great and super tasty. I will recommend this and will visit again!"
  },
  ...
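The reviews list in that dict can then be consumed like any other; a sketch using a stubbed result in place of a live search.get_dict() call (the stub values are taken from the sample output above):

```python
# Stub standing in for search.get_dict(); a live call needs a valid api_key.
results = {
    "reviews": [
        {"user": {"name": "HerbertTomlinson O"}, "rating": 4,
         "date": "2 months ago"},
        {"user": {"name": "Izaac Collier"}, "rating": 5,
         "date": "a month ago"},
    ]
}

# Each review is a plain dict; pull out the fields you need.
for review in results.get("reviews", []):
    print(review["user"]["name"], review["rating"], review["date"])
```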
Check out the documentation for more details.
Disclaimer: I work at SerpApi.

Related

Python3 code to find nested uniques within a JSON file

First off I'm a scientist, NOT a coder. Haven't coded since my college days so feel free to knock me around a bit. But I have a project for a non-profit that I'd like to help with.
I have the code to download the JSON file, a sample of which I'll provide below. But for now, my goal is to search for and display the unique birds in each of the unique areas. I've spent about 10 days scouring the web, writing many hundreds of lines of code, all to no avail. I'm certain one of you will spend 3 minutes to write a one-line comprehension that'll do it perfectly. My hat's off in advance.
Here's a small sample extracted from thousands of items in a downloaded JSON file:
{
  "speciesCode": "snogoo",
  "comName": "Snow Goose",
  "sciName": "Anser caerulescens",
  "locId": "L1415313",
  "locName": "Vacation Isle",
  "obsDt": "2023-02-15 15:28",
  "howMany": 3,
  "lat": 32.7750146,
  "lng": -117.2352583,
  "obsValid": false,
  "obsReviewed": false,
  "locationPrivate": false,
  "subId": "S128423924"
},
{
  "speciesCode": "gwfgoo",
  "comName": "Greater White-fronted Goose",
  "sciName": "Anser albifrons",
  "locId": "L1415313",
  "locName": "Vacation Isle",
  "obsDt": "2023-02-15 15:28",
  "howMany": 1,
  "lat": 32.7750146,
  "lng": -117.2352583,
  "obsValid": false,
  "obsReviewed": false,
  "locationPrivate": false,
  "subId": "S128423924"
},
{
  "speciesCode": "snogoo",
  "comName": "Snow Goose",
  "sciName": "Anser caerulescens",
  "locId": "L1415313",
  "locName": "Vacation Isle",
  "obsDt": "2023-02-15 15:28",
  "howMany": 3,
  "lat": 32.7750146,
  "lng": -117.2352583,
  "obsValid": false,
  "obsReviewed": false,
  "locationPrivate": false,
  "subId": "S128423922"
},
What I need to do is extract the unique "comName" species that have been seen at each "locName". So in the extraction above there are two different records of a Snow Goose showing up at the same location; I only need one. I won't give you the giggles by offering my attempts. I can traverse the JSON fine and create a dict with different results, but selecting the uniques in the nested loops has bamboozled me. If I've violated any common rules here, please flog me gently. I really would like to be a good netizen.
Thank you much for any help you can provide.
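One way to approach this, as a minimal sketch: assuming the downloaded file parses to a list of observation dicts shaped like the sample above (obs here is a hard-coded stand-in for that list), a set per location drops the duplicates automatically.

```python
from collections import defaultdict

# Stand-in for the parsed JSON list; same shape as the sample records.
obs = [
    {"comName": "Snow Goose", "locName": "Vacation Isle"},
    {"comName": "Greater White-fronted Goose", "locName": "Vacation Isle"},
    {"comName": "Snow Goose", "locName": "Vacation Isle"},
]

# Map each location name to the set of species seen there;
# adding a species twice has no effect, so duplicates vanish.
by_location = defaultdict(set)
for o in obs:
    by_location[o["locName"]].add(o["comName"])

for loc, species in sorted(by_location.items()):
    print(loc, sorted(species))
```

With real data you would replace the obs literal with `json.load(open('birds.json'))` (or however the file is read in) and the same two loops apply unchanged.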

How can I render an object in HTML?

So in my Angular project, I want to render an array of product objects.
I was able to render it as a JSON object:
<td>{{o.products |json}}</td>
And for example this one of the outputs:
[ { "id": 4, "name": "Forever", "description": "Because you suffer a lot physically and morally, we will not let you suffer financially.\n• Lump sum payment: Up to US $500,000 paid immediately upon diagnosis of any covered 32 critical illnesses.\n• Worldwide coverage: Giving you the assistance you need even if you move to another country.\n• Telemedicine and e-counsultancy through World Care International: Access to free expert care from world-renowned medical centres in the US specialising in your condition.", "price": 300, "logo": "assets\\download(5).jpg", "category": 1, "image": "assets\\forever.jpg" } ]
Now what if I only want to show the name attribute and not all of the product's attributes? How can I do that?
You should use the ngFor directive to create a for-loop that iterates over all products and prints only each product's name:
<td *ngFor="let product of o.products">{{product.name}}</td>

How to get recent fact checks on Google FactCheck API?

I would like to get recent fact checks using Google's Fact Check Tools. There is a search API here: https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims.
I want to get the list of recent fact checks with no search query, just like the explorer here: https://toolbox.google.com/factcheck/explorer/search/list:recent;hl=en. The API seems to support only query-based search, even though the explorer lets you browse recent fact checks. Is there a way to get the recent ones?
This link could give you all the information: https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims/search
Enter a query and press EXECUTE, and you can examine the results by choosing "select all" in the box where the results are shown.
This https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims/search?apix_params=%7B%22maxAgeDays%22%3A33%2C%22query%22%3A%22preexisiting%22%2C%22reviewPublisherSiteFilter%22%3A%22Washington%20Post%22%7D
doesn't work because "Washington Post" is not a valid value, and Google provides no list of valid "reviewPublisherSiteFilter" values.
Leave the APIKEY box blank and use a number of days in "maxAgeDays", and you should get the result you want.
Something like this:
{
  "claims": [
    {
      "text": "“We're going to be doing a health care plan, very strongly, and protect people with preexisting conditions… We have other alternatives to Obamacare that are 50% less expensive and that are actually better.”\n“We have run [Obamacare] so much better than Obama ran it.”\n“At the end of my first term, we're going to have close to 300, maybe over 300 new federal judges, including Court of Appeal, two Supreme Court justices.”\nStock Market is proof that Americans are “doing better than they were doing before the pandemic came.”\n“We want people to come into our country ... but we want them to come in through a legal system.”",
      "claimant": "#dwebbKHN",
      "claimDate": "2020-09-17T10:21:00Z",
      "claimReview": [
        {
          "publisher": {
            "name": "Misbar",
            "site": "misbar.com"
          },
          "url": "https://misbar.com/factcheck/2020/09/17/trump-town-hall-special-%E2%80%93-other-topics",
          "title": "Trump Town Hall Special – Other Topics | Fact Check",
          "reviewDate": "2020-09-17T10:21:00Z",
          "textualRating": "Fake",
          "languageCode": "en"
        }
      ]
    },
    {
      "text": "Mr. Trump, who has not followed through on a pledge in July that he would have a health care plan ready and signed in two weeks, said his administration would not get rid of the preexisting conditions coverage that were implemented by the Affordable Care Act. He was responding to Ellesia Blaque, an assistant professor who lives in Philadelphia, who told him she's paying $7,000 a year for life-saving medicine because of a condition she was born with, sarcoidosis.",
      "claimant": "Donald Trump",
      "claimDate": "2020-09-16T00:00:00Z",
      "claimReview": [
        {
          "publisher": {
            "name": "CBS News",
            "site": "cbsnews.com"
          },
          "url": "https://www.cbsnews.com/news/trump-town-hall-fact-check-health-care-covid-19/#preexisting",
          "title": "Fact-checking Trump's town hall health care claims",
          "reviewDate": "2020-09-16T00:00:00Z",
          "textualRating": "Mostly False",
          "languageCode": "en"
        }
      ]
    },

DB Schema for Laravel app recording data-points on arbitrary number of variables

My question is about creating a proper schema or way of storing some data I will be collecting.
My app runs Laravel 6.
So I have a number of 'campaigns', an example of which is like this:
{
  "campaign_name": "Campaign 1",
  "keywords": ["keyword 1", "keyword 2", "keyword 3"], // there may be hundreds of keywords
  "urls": ["google.com", "bing.com", "example.com"], // there may be many urls
  "business_names": ["Google", "Bing", "Example"], // there may be many business_names
  "locations": [
    {
      "address": "location 1", // this is a postal address
      "lat": "-37.8183",
      "lng": "144.957"
    },
    {
      "address": "location 2", // this is a postal address
      "lat": "-37.7861",
      "lng": "145.312"
    }
    // there may be 50-100 locations.
  ]
}
Each url (and each business name) will get matched up with each keyword along with each location.
ie:
google.com
- keyword 1 location 1
- keyword 1 location 2
- keyword 1 location 3
- keyword 2 location 1
- keyword 2 location 2
// etc etc. there may be hundreds of keywords and hundreds of locations.
bing.com
- keyword 1 location 1
- keyword 1 location 2
// etc etc as above.
Each of these concatenations will have time series data points that I want to store and ultimately query.
I see how a number of tables could be set up to handle this, but is there a way to slightly simplify it by storing some JSON?
Most of my migrations on projects have been pretty simple with just a single relation but this is a bit harder for me.
Any help is appreciated. I would ideally like to avoid a number of tables and complex pivots or associations if possible (understanding the benefits of normalization...)
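For comparison, one normalized layout is sketched below with sqlite3 so it is runnable as-is; the table and column names are illustrative, not a prescription for your Laravel migrations. The idea is that urls and business_names behave identically (each gets matched with every keyword and location), so they can share one table with a `kind` discriminator, and the time series lives in a single data_points table keyed by the three foreign keys plus a timestamp.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE campaigns   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE keywords    (id INTEGER PRIMARY KEY, campaign_id INTEGER, keyword TEXT);
CREATE TABLE locations   (id INTEGER PRIMARY KEY, campaign_id INTEGER,
                          address TEXT, lat REAL, lng REAL);
-- urls and business_names share one table; 'kind' tells them apart
CREATE TABLE subjects    (id INTEGER PRIMARY KEY, campaign_id INTEGER,
                          kind TEXT, value TEXT);
-- one row per (subject, keyword, location) observation over time
CREATE TABLE data_points (subject_id INTEGER, keyword_id INTEGER,
                          location_id INTEGER, recorded_at TEXT, value REAL);
""")
```

This keeps queries over the time series ("value for google.com + keyword 1 + location 2 over the last month") as simple indexed joins, which is hard to get from a JSON column.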

How to grab values from key:value pairs in parsed JSON in Python

I am trying to extract data from a JSON file. Here is my code.
import json
json_file = open('test.json')
data = json.load(json_file)
json_file.close()
At this point I can print the file using print data and it appears that the file has been read in properly.
Now I would like to grab a value from a key:value pair that is in a dictionary nested within a list nested within a dictionary (Phew!). My initial strategy was to create a dictionary and plop the dictionary nested within the list in it and then extract the key:value pairs I need.
dict = {}
dict = json_file.get('dataObjects', None)
dict[0]
When I try to look at the contents of dict, I see that there is only one element. The entire dictionary appears to have been read as a list. I've tried a couple of different approaches, but I still wind up with a dictionary read as a list. I would love to grab the nested dictionary and use another .get to grab the values I need. Here is a sample of the JSON I am working with. Specifically, I am trying to pull out the identifier and description values from the dataObjects section.
{
  "identifier": 213726,
  "scientificName": "Carcharodon carcharias (Linnaeus, 1758)",
  "richness_score": 89.6095,
  "synonyms": [
  ],
  "taxonConcepts": [
    {
      "identifier": 20728481,
      "scientificName": "Carcharodon carcharias (Linnaeus, 1758)",
      "nameAccordingTo": "WORMS Species Information (Marine Species)",
      "canonicalForm": "Carcharodon carcharias",
      "sourceIdentfier": "105838"
    },
    {
      "identifier": 24922984,
      "scientificName": "Carcharodon carcharias",
      "nameAccordingTo": "IUCN Red List (Species Assessed for Global Conservation)",
      "canonicalForm": "Carcharodon carcharias",
      "sourceIdentfier": "IUCN-3855"
    },
  ],
  "dataObjects": [
    {
      "identifier": "5e1882d822ec530069d6d29e28944369",
      "dataObjectVersionID": 5671572,
      "dataType": "http://purl.org/dc/dcmitype/Text",
      "dataSubtype": "",
      "vettedStatus": "Trusted",
      "dataRating": 3.0,
      "subject": "http://rs.tdwg.org/ontology/voc/SPMInfoItems#TaxonBiology",
      "mimeType": "text/html",
      "title": "Biology",
      "language": "en",
      "license": "http://creativecommons.org/licenses/by-nc-sa/3.0/",
      "rights": "Copyright Wildscreen 2003-2008",
      "rightsHolder": "Wildscreen",
      "audience": [
        "General public"
      ],
      "source": "http://www.arkive.org/great-white-shark/carcharodon-carcharias/",
      "description": "Despite its worldwide notoriety, very little is known about the natural ecology and behaviour of this predator. These sharks are usually solitary or occur in pairs, although it is apparently a social animal that can also be found in small aggregations of 10 or more, particularly around a carcass (3) (6). Females are ovoviviparous; the pups hatch from eggs retained within their mother's body, and she then gives birth to live young (10). Great white sharks are particularly slow-growing, late maturing and long-lived, with a small litter size and low reproductive capacity (8). Females do not reproduce until they reach about 4.5 to 5 metres in length, and litter sizes range from two to ten pups (8). The length of gestation is not known but estimated at between 12 and 18 months, and it is likely that these sharks only reproduce every two or three years (8) (11). After birth, there is no maternal care, and despite their large size, survival of young is thought to be low (8). Great whites are at the top of the marine food chain, and these sharks are skilled predators. They feed predominately on fish but will also consume turtles, molluscs, and crustaceans, and are active hunters of small cetaceans such as dolphins and porpoises, and of other marine mammals such as seals and sea lions (12). Using their acute senses of smell, sound location and electroreception, weak and injured prey can be detected from a great distance (7). Efficient swimmers, sharks have a quick turn of speed and will attack rapidly before backing off whilst the prey becomes weakened; they are sometimes seen leaping clear of the water (6). Great whites, unlike most other fish, are able to maintain their body temperature higher than that of the surrounding water using a heat exchange system in their blood vessels (11).",
      "agents": [
        {
          "full_name": "ARKive",
          "homepage": "http://www.arkive.org/",
          "role": "provider"
        }
      ],
    }
  ]
}
The source you provide cannot be parsed by the json module; there are two commas in there that you have to delete (the one on the fourth line from the bottom, and the one two lines above "dataObjects").
Only after that does the json module parse it without error:
import json
json_file = open('test.json')
data = json.load(json_file)
do = data['dataObjects'][0]
print(do['identifier'])
print(do['description'])
json_file.close()