I'm currently doing a spike for a project and was hoping the community may be able to shed some light on things.
I would like to use Google Cloud Vision to scan the image below and then derive the key/value pairs from it (such as Title: Ground Rod..., Last Revision: June 27, 2012). This is a basic example; a real drawing could contain much more data, and the layout may differ.
Since there is no easy correlation between the keys and values, I'm not sure whether this is possible. Is it possible to train Google Cloud Vision with example images? Or are there other solutions that might be able to do this?
Thank you!
You can use the Cloud Vision API to scan the image and obtain the useful key/value pairs by writing a program with the Vision API Client Libraries. For example, dragging the image file here and switching to the "Text" tab, you can visualize this:
[...]DRAWING TITLE GROUND ROD STRUCTURAL STEEL CONNECTION DETAIL E-80-05 Division of Technical Resources Office of Research Facilities National Institutes of Health The formulae 5-steel- deal ** * -||-| S - for building H-KANA --- Ej as state of the art e A uto-aut - R4fco- biomedical research facilities: LAST REVISION JUNE 27, 2012
In “Document”, on block 10, you can read this:
G R O U N D R O D S T R U C T U R A L S T E E L C O N N E C T I O N D E T A I L
One last useful operation: open the "JSON" tab and search for "ground rod structural" in the navigator. If you go to the fourth match and scroll up, you will see the coordinates of the bounding boxes containing "June 27, 2012", in reverse order: 2, 1, 0, 2, etc. The 2 is defined as follows:
"boundingBox": {
"vertices": [
{
"x": 671,
"y": 1173
},
{
"x": 679,
"y": 1173
},
{
"x": 679,
"y": 1200
},
{
"x": 671,
"y": 1200
}
]
},
"text": "2",
"confidence": 0.96
}
],
"confidence": 0.98
}
],
"confidence": 0.99
}
],
"blockType": "TEXT",
"confidence": 0.99
}
]
}
],
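To make those coordinates concrete, here is a small Python sketch (the helper name and the second sample box are hypothetical) that merges per-symbol vertex lists like the one above into a single enclosing box:

```python
def merge_boxes(boxes):
    """Union of axis-aligned bounding boxes, each given as a list of
    {x, y} vertices, returned as min/max coordinates."""
    xs = [v["x"] for box in boxes for v in box]
    ys = [v["y"] for box in boxes for v in box]
    return {"x_min": min(xs), "y_min": min(ys),
            "x_max": max(xs), "y_max": max(ys)}

# The "2" symbol's vertices from the JSON above:
symbol_2 = [{"x": 671, "y": 1173}, {"x": 679, "y": 1173},
            {"x": 679, "y": 1200}, {"x": 671, "y": 1200}]
# A hypothetical neighbouring symbol box:
symbol_7 = [{"x": 655, "y": 1173}, {"x": 668, "y": 1173},
            {"x": 668, "y": 1200}, {"x": 655, "y": 1200}]
print(merge_boxes([symbol_2, symbol_7]))
# {'x_min': 655, 'y_min': 1173, 'x_max': 679, 'y_max': 1200}
```

Merging the per-symbol boxes this way gives you one region per field (for example, the whole "JUNE 27, 2012" string), which is useful when grouping text by position in the footer.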
As far as I know, footers of technical drawings contain a limited, well-structured set of information (for example the title, date, and applicable regulation in this case) that does not change much.
Taking into account all the information gathered through the Cloud Vision API and the availability of the Client Libraries, a script could be written in one of the supported languages to identify and save the useful blocks and post-process them into key/value pairs. You can find a document text detection sample here and a tutorial here.
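To sketch that post-processing step (not the Vision API call itself): if the footer keys are known in advance, the flattened OCR text can be split on them. The key list, helper name, and regex below are my own assumptions to adapt to real drawings:

```python
import re

# Footer keys expected in the drawing's title block; this list is an
# assumption and would need to be adapted to your own drawings.
KNOWN_KEYS = ["DRAWING TITLE", "LAST REVISION"]

def extract_key_values(ocr_text):
    """Extract key/value pairs from flattened OCR text: each known key
    captures everything up to the next known key (or end of string)."""
    keys = "|".join(re.escape(k) for k in KNOWN_KEYS)
    pairs = {}
    for m in re.finditer(rf"({keys})\s+(.*?)(?=(?:{keys})|$)", ocr_text, re.S):
        pairs[m.group(1)] = m.group(2).strip()
    return pairs

# Simplified version of the OCR output quoted above (noise removed):
text = ("DRAWING TITLE GROUND ROD STRUCTURAL STEEL CONNECTION DETAIL "
        "LAST REVISION JUNE 27, 2012")
print(extract_key_values(text))
# {'DRAWING TITLE': 'GROUND ROD STRUCTURAL STEEL CONNECTION DETAIL',
#  'LAST REVISION': 'JUNE 27, 2012'}
```

In practice you would feed this the block-level text from the API response and maintain one key list per drawing template, since real OCR output contains noise between fields.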
It is not possible to train the Cloud Vision API with example images. Training a machine learning model requires a training dataset with its corresponding answers, commonly denoted as the target. You could use Cloud AI products for this kind of machine learning task.
So in my Angular project, I want to render an array of product objects.
I was able to render it as a JSON object:
<td>{{o.products |json}}</td>
For example, this is one of the outputs:
[ { "id": 4, "name": "Forever", "description": "Because you suffer a lot physically and morally, we will not let you suffer financially.\n• Lump sum payment: Up to US $500,000 paid immediately upon diagnosis of any covered 32 critical illnesses.\n• Worldwide coverage: Giving you the assistance you need even if you move to another country.\n• Telemedicine and e-counsultancy through World Care International: Access to free expert care from world-renowned medical centres in the US specialising in your condition.", "price": 300, "logo": "assets\\download(5).jpg", "category": 1, "image": "assets\\forever.jpg" } ]
Now, what if I want to show only the name attribute rather than all the product attributes? How can I do that?
You should use the *ngFor directive to create a loop that iterates over all products and prints only the product name:
<td *ngFor="let product of o.products">{{product.name}}</td>
I would like to get recent fact checks using Google's Fact Check Tools. There is a search API here: https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims.
I want to get the list of recent fact checks with no search query, just like the explorer here: https://toolbox.google.com/factcheck/explorer/search/list:recent;hl=en. The API only seems to support querying, even though the explorer lets you browse recent fact checks. Is there a way to get the recent ones?
This link could give you all the information: https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims/search
Enter a query, press EXECUTE, and you can examine the results by selecting all in the box where the results are shown.
This https://developers.google.com/fact-check/tools/api/reference/rest/v1alpha1/claims/search?apix_params=%7B%22maxAgeDays%22%3A33%2C%22query%22%3A%22preexisiting%22%2C%22reviewPublisherSiteFilter%22%3A%22Washington%20Post%22%7D
doesn't work because "Washington Post" is not valid, and Google provides no list of valid "reviewPublisherSiteFilter" values.
Leave the API key box blank and set the number of days in "maxAgeDays", and you should get the result you want.
Something like this:
{
  "claims": [
    {
      "text": "“We're going to be doing a health care plan, very strongly, and protect people with preexisting conditions… We have other alternatives to Obamacare that are 50% less expensive and that are actually better.”\n“We have run [Obamacare] so much better than Obama ran it.”\n“At the end of my first term, we're going to have close to 300, maybe over 300 new federal judges, including Court of Appeal, two Supreme Court justices.”\nStock Market is proof that Americans are “doing better than they were doing before the pandemic came.”\n“We want people to come into our country ... but we want them to come in through a legal system.”",
      "claimant": "#dwebbKHN",
      "claimDate": "2020-09-17T10:21:00Z",
      "claimReview": [
        {
          "publisher": {
            "name": "Misbar",
            "site": "misbar.com"
          },
          "url": "https://misbar.com/factcheck/2020/09/17/trump-town-hall-special-%E2%80%93-other-topics",
          "title": "Trump Town Hall Special – Other Topics | Fact Check",
          "reviewDate": "2020-09-17T10:21:00Z",
          "textualRating": "Fake",
          "languageCode": "en"
        }
      ]
    },
    {
      "text": "Mr. Trump, who has not followed through on a pledge in July that he would have a health care plan ready and signed in two weeks, said his administration would not get rid of the preexisting conditions coverage that were implemented by the Affordable Care Act. He was responding to Ellesia Blaque, an assistant professor who lives in Philadelphia, who told him she's paying $7,000 a year for life-saving medicine because of a condition she was born with, sarcoidosis.",
      "claimant": "Donald Trump",
      "claimDate": "2020-09-16T00:00:00Z",
      "claimReview": [
        {
          "publisher": {
            "name": "CBS News",
            "site": "cbsnews.com"
          },
          "url": "https://www.cbsnews.com/news/trump-town-hall-fact-check-health-care-covid-19/#preexisting",
          "title": "Fact-checking Trump's town hall health care claims",
          "reviewDate": "2020-09-16T00:00:00Z",
          "textualRating": "Mostly False",
          "languageCode": "en"
        }
      ]
    },
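For completeness, the same call can be made from your own code. Here is a minimal Python sketch of building the claims:search request URL; the endpoint and parameter names come from the API reference linked above, but whether maxAgeDays alone (with no query) is accepted outside the API Explorer is an assumption based on this answer:

```python
import urllib.parse

# Endpoint from the Fact Check Tools API reference.
ENDPOINT = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_search_url(api_key, query=None, max_age_days=None):
    """Build a claims:search URL; query may be omitted when filtering
    only by age (mirroring the 'recent fact checks' use above)."""
    params = {"key": api_key}
    if query:
        params["query"] = query
    if max_age_days:
        params["maxAgeDays"] = max_age_days
    return ENDPOINT + "?" + urllib.parse.urlencode(params)

# The resulting URL can be fetched with urllib.request or requests,
# and the JSON response parsed for the "claims" list shown above.
print(build_search_url("YOUR_API_KEY", max_age_days=33))
```

YOUR_API_KEY is a placeholder; direct calls (unlike the explorer) require a key from the Google Cloud console.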
I'm trying to set up a heatmap graph for request latencies in Grafana, using a Stackdriver backend.
With the following query I get the right heatmap; however, the bucket labels are in seconds without decimals, which means there are 8, 4, 2, and 1 second buckets, and then many 0 second buckets. Is there a way to switch to ms labels?
For clarification: the bucket names coming back in the result of the query are integers, so changing the units or decimal places in the visualisation won't help.
Query as seen in grafana editor (for better readability)
Currently the graph looks like this
"queries": [
{
"refId": "A",
"intervalMs": 15000,
"datasourceId": 14,
"metricType": "serviceruntime.googleapis.com/api/request_latencies",
"crossSeriesReducer": "REDUCE_SUM",
"perSeriesAligner": "ALIGN_DELTA",
"alignmentPeriod": "stackdriver-auto",
"groupBys": [],
"view": "FULL",
"filters": [
// -> some more removed for privacy reasons
"AND",
"resource.type",
"=",
"api"
],
"aliasBy": "{{bucket}}",
"type": "timeSeriesQuery"
}
]
I am attempting to scrape user reviews from Google Places reviews (the API only returns the 5 most helpful reviews). I am attempting to use BeautifulSoup to retrieve 4 pieces of information:
1) Name of the reviewer
2) When the review was written
3) Rating (out of 5)
4) Body of review
Inspecting each element, I can find the location of the information:
1) Name of reviewer:
<a class="_e8k" style="color:black;text-decoration:none" href="https://www.google.com/maps/contrib/103603482673238284204/reviews">Steve Fox</a>
2) When the review was written
<span style="color:#999;font-size:13px">3 months ago</span>
3) Rating (visible in the code, but doesn't show when you "run code snippet")
<span class="_pxg _Jxg" aria-label="Rated 1.0 out of 5,"><span style="width:14px"></span></span>
4) Body of the review
<span jsl="$t t-uvHqeLvCkgA;$x 0;" class="r-i8GVQS_tBTbg">Don't go near this company. Must be the world's worst ISP. Threatened to set debt collection services on me when I refused to pay for a service that they had cut off through competence. They even spitefully managed to apply block on our internet connection after we moved to a new Isp. I hate this company.</span>
I am struggling with how to refer to the position of the information within the HTML. I see the last 3 pieces of information are in spans, so I attempted the following, but none of the relevant information was returned:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://www.google.co.nz/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=orcon&lrd=0x6d0d3833fefacf95:0x59fef608692d4541,1,').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
attempt1 = soup.find_all('span class')
for span in attempt1:
    print(span)
I assume I am not correctly referencing the 4 pieces of information within the HTML. Can someone point out what is wrong? Regards, Steve
To scrape the reviews of a place you'll need the place id. It looks like this 0x89c259a61c75684f:0x79d31adb123348d2.
And then you need to make the request with the following URL, which contains the place_id:
https://www.google.com/async/reviewDialog?hl=en&async=feature_id:0x89c259a61c75684f:0x79d31adb123348d2,sort_by:,next_page_token:,associated_topic:,_fmt:pc
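A minimal Python sketch of building that URL (the helper name is mine; the endpoint is undocumented and may change at any time):

```python
# Undocumented endpoint observed in the answer above; treat as fragile.
BASE = "https://www.google.com/async/reviewDialog"

def build_review_url(place_id, hl="en"):
    """Build the async reviewDialog URL for a given place id."""
    async_params = ",".join([
        "feature_id:" + place_id,
        "sort_by:",
        "next_page_token:",
        "associated_topic:",
        "_fmt:pc",
    ])
    return "{}?hl={}&async={}".format(BASE, hl, async_params)

url = build_review_url("0x89c259a61c75684f:0x79d31adb123348d2")
print(url)
# The page can then be fetched (e.g. with requests) and parsed.
```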
Alternatively, you could use a third-party solution like SerpApi. It's a paid API with a free trial. We handle proxies, solve CAPTCHAs, and parse all the rich structured data for you.
Example Python code (also available for other languages):
from serpapi import GoogleSearch

params = {
    "engine": "google_maps_reviews",
    "place_id": "0x89c259a61c75684f:0x79d31adb123348d2",
    "hl": "en",
    "api_key": "secret_api_key"
}

search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"reviews": [
{
"user": {
"name": "HerbertTomlinson O",
"link": "https://www.google.com/maps/contrib/100851257830988379503?hl=en-US&sa=X&ved=2ahUKEwiIlNzLtJrxAhVFWs0KHfclCwAQvvQBegQIARAy",
"thumbnail": "https://lh3.googleusercontent.com/a/AATXAJyjD5T8NEJSdOUAveA8IuMDTLXE9edBHDpFTvZ8=s40-c-c0x00000000-cc-rp-mo-br100",
"reviews": 2
},
"rating": 4,
"date": "2 months ago",
"snippet": "Finally, I found the best coffee shop today. Their choice of music is usually blasting from the past which was really relaxing and made me stay longer. There are tables for lovers and also for group of friends. The coffees and foods here are very affordable and well worth the money. You can't go wrong with this coffee shop. This is very worth to visit."
},
{
"user": {
"name": "Izaac Collier",
"link": "https://www.google.com/maps/contrib/116734781291082397423?hl=en-US&sa=X&ved=2ahUKEwiIlNzLtJrxAhVFWs0KHfclCwAQvvQBegQIARA-",
"thumbnail": "https://lh3.googleusercontent.com/a-/AOh14GgfhltPhiWrkTwe6swLUQRCWf_asuTfHPRnJCLc=s40-c-c0x00000000-cc-rp-mo-br100",
"reviews": 2
},
"rating": 5,
"date": "a month ago",
"snippet": "I am not into coffee but one of my friends invited me here. As I looked the menu, I was convinced, so I ordered one for me. The food was tasty and the staff were very friendly and accommodating. The ambience was very cosy and comfortable. The coffee was great and super tasty. I will recommend this and will visit again!"
},
...
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
I am trying to extract data from a JSON file. Here is my code.
import json
json_file = open('test.json')
data = json.load(json_file)
json_file.close()
At this point I can print the file using print data and it appears that the file has been read in properly.
Now I would like to grab a value from a key:value pair that is in a dictionary nested within a list nested within a dictionary (Phew!). My initial strategy was to create a dictionary and plop the dictionary nested within the list in it and then extract the key:value pairs I need.
dict = {}
dict = json_file.get('dataObjects', None)
dict[0]
When I try to look at the contents of dict, I see that there is only one element. The entire dictionary appears to have been read as a list. I've tried a couple of different approaches, but I still wind up with a dictionary read as a list. I would love to grab the nested dictionary and use another .get to grab the values I need. Here is a sample of the JSON I am working with. Specifically, I am trying to pull out the identifier and description values from the dataObjects section.
{
  "identifier": 213726,
  "scientificName": "Carcharodon carcharias (Linnaeus, 1758)",
  "richness_score": 89.6095,
  "synonyms": [
  ],
  "taxonConcepts": [
    {
      "identifier": 20728481,
      "scientificName": "Carcharodon carcharias (Linnaeus, 1758)",
      "nameAccordingTo": "WORMS Species Information (Marine Species)",
      "canonicalForm": "Carcharodon carcharias",
      "sourceIdentfier": "105838"
    },
    {
      "identifier": 24922984,
      "scientificName": "Carcharodon carcharias",
      "nameAccordingTo": "IUCN Red List (Species Assessed for Global Conservation)",
      "canonicalForm": "Carcharodon carcharias",
      "sourceIdentfier": "IUCN-3855"
    },
  ],
  "dataObjects": [
    {
      "identifier": "5e1882d822ec530069d6d29e28944369",
      "dataObjectVersionID": 5671572,
      "dataType": "http://purl.org/dc/dcmitype/Text",
      "dataSubtype": "",
      "vettedStatus": "Trusted",
      "dataRating": 3.0,
      "subject": "http://rs.tdwg.org/ontology/voc/SPMInfoItems#TaxonBiology",
      "mimeType": "text/html",
      "title": "Biology",
      "language": "en",
      "license": "http://creativecommons.org/licenses/by-nc-sa/3.0/",
      "rights": "Copyright Wildscreen 2003-2008",
      "rightsHolder": "Wildscreen",
      "audience": [
        "General public"
      ],
      "source": "http://www.arkive.org/great-white-shark/carcharodon-carcharias/",
      "description": "Despite its worldwide notoriety, very little is known about the natural ecology and behaviour of this predator. These sharks are usually solitary or occur in pairs, although it is apparently a social animal that can also be found in small aggregations of 10 or more, particularly around a carcass (3) (6). Females are ovoviviparous; the pups hatch from eggs retained within their mother's body, and she then gives birth to live young (10). Great white sharks are particularly slow-growing, late maturing and long-lived, with a small litter size and low reproductive capacity (8). Females do not reproduce until they reach about 4.5 to 5 metres in length, and litter sizes range from two to ten pups (8). The length of gestation is not known but estimated at between 12 and 18 months, and it is likely that these sharks only reproduce every two or three years (8) (11). After birth, there is no maternal care, and despite their large size, survival of young is thought to be low (8). Great whites are at the top of the marine food chain, and these sharks are skilled predators. They feed predominately on fish but will also consume turtles, molluscs, and crustaceans, and are active hunters of small cetaceans such as dolphins and porpoises, and of other marine mammals such as seals and sea lions (12). Using their acute senses of smell, sound location and electroreception, weak and injured prey can be detected from a great distance (7). Efficient swimmers, sharks have a quick turn of speed and will attack rapidly before backing off whilst the prey becomes weakened; they are sometimes seen leaping clear of the water (6). Great whites, unlike most other fish, are able to maintain their body temperature higher than that of the surrounding water using a heat exchange system in their blood vessels (11).",
      "agents": [
        {
          "full_name": "ARKive",
          "homepage": "http://www.arkive.org/",
          "role": "provider"
        }
      ],
    }
  ]
}
The source you provide cannot be read by the json module; there are two trailing commas in there that you have to delete: the one after the "agents" array (fourth line from the bottom) and the one after the second "taxonConcepts" entry (two lines above "dataObjects").
Only after that does the json module parse it without error:
import json

with open('test.json') as json_file:
    data = json.load(json_file)

do = data['dataObjects'][0]
print(do['identifier'])
print(do['description'])
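If you prefer the .get approach mentioned in the question, a defensive variant could look like this (a sketch using an inline sample in place of the real test.json; the description is shortened):

```python
import json

# Inline sample standing in for the corrected test.json file.
raw = """
{
  "identifier": 213726,
  "dataObjects": [
    {
      "identifier": "5e1882d822ec530069d6d29e28944369",
      "description": "Despite its worldwide notoriety, very little is known..."
    }
  ]
}
"""

data = json.loads(raw)

# .get with a default avoids a KeyError if the key is missing.
for obj in data.get("dataObjects", []):
    print(obj.get("identifier"))
    print(obj.get("description"))
```

The default of an empty list means the loop simply does nothing when "dataObjects" is absent, instead of raising an exception.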