I was initially getting the following error when trying to run the code below:
Error: the JSON object must be str, not 'bytes'
import urllib.request
import json

search = '230 boulder lane cottonwood az'
search = search.replace(' ', '%20')
places_api_key = 'AIzaSyDou2Q9Doq2q2RWJWncglCIt0kwZ0jcR5c'
url = 'https://maps.googleapis.com/maps/api/place/textsearch/json?query=' + search + '&key=' + places_api_key
json_obj = urllib.request.urlopen(url)
data = json.load(json_obj)
for item in data['results']:
    print(item['formatted_address'])
    print(item['types'])
After making some troubleshooting changes like:
json_obj = urllib.request.urlopen(url)
obj = json.load(json_obj)
data = json_obj.readall().decode('utf-8')
Error - 'HTTPResponse' object has no attribute 'decode'
I am still getting the error above. I have tried multiple posts on Stack Overflow but nothing seems to work. I have uploaded the entire working code; if anyone can get it to work I'll be very grateful. What I don't understand is why the same thing worked for others and not for me.
Thanks!
urllib.request.urlopen returns an HTTPResponse object, which cannot be JSON-decoded directly (because it is a byte stream).
So you'll instead want:
# Convert from bytes to text
resp_text = urllib.request.urlopen(url).read().decode('UTF-8')
# Use loads to decode from text
json_obj = json.loads(resp_text)
However, if you print resp_text from your example, you'll notice it is actually XML, so you'll want an XML reader:
resp_text = urllib.request.urlopen(url).read().decode('UTF-8')
(Pdb) print(resp_text)
<?xml version="1.0" encoding="UTF-8"?>
<PlaceSearchResponse>
<status>OK</status>
...
Update (Python 3.6+)
In Python 3.6+, json.load can take a byte stream (and json.loads can take a byte string).
This is now valid:
json_obj = json.load(urllib.request.urlopen(url))
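The difference can be demonstrated without a network call by wrapping raw bytes in a file-like object (the payload below is made up, not the real Places API response):

```python
import io
import json

# Simulated HTTP body: JSON arrives as bytes, not text
raw = b'{"status": "OK", "results": [{"formatted_address": "Cottonwood, AZ"}]}'

# Python 3.6+: json.load accepts a binary file object directly
data = json.load(io.BytesIO(raw))
print(data['status'])  # OK

# json.loads likewise accepts a byte string on 3.6+
data2 = json.loads(raw)
print(data2['results'][0]['formatted_address'])  # Cottonwood, AZ
```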
Related
I am getting an error while working with JSON response:
Error: AttributeError: 'str' object has no attribute 'get'
What could be the issue?
I am also getting the following errors for the rest of the values:
TypeError: 'builtin_function_or_method' object is not subscriptable
'Phone': value['_source']['primaryPhone'],
KeyError: 'primaryPhone'
# -*- coding: utf-8 -*-
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Phone': value['_source']['primaryPhone'],
                "Email": value['_source']['primaryEmail'],
                "City": value.get['_source']['city'],
                "Zip Code": value.get['_source']['zipcode'],
                "Website": value['_source']['websiteURL'],
                "Facebook": value['_source']['facebookURL'],
                "LinkedIn": value['_source']['LinkedIn_URL'],
                "Twitter": value['_source']['Twitter'],
                "BIO": value['_source']['Bio']
            }
It's nested deeper than what you think it is. That's why you're getting an error.
Code Example
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Primary Phone': value['_source']['primaryPhone']
            }
Explanation
The resp variable is a Python dictionary, but there is no resp['hits']['hits']['fullName'] in this JSON data. The fullName data you're looking for is actually at resp['hits']['hits'][i]['_source']['fullName'], where i is a number, because resp['hits']['hits'] is a list.
resp['hits'] is a dictionary, so the values variable is fine.
But resp['hits']['hits'] is a list, so you can't call .get() on it, and it only accepts integers (not strings) inside []. Hence the error.
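A minimal illustration of the error (the data here is a made-up stand-in shaped like the API response):

```python
hits = [{'_source': {'fullName': 'Jane Doe'}}]  # a list holding one dict

# A dict supports string keys and .get()
name = hits[0]['_source'].get('fullName')
print(name)  # Jane Doe

# A list only accepts integer indices; a string key raises TypeError
try:
    hits['fullName']
except TypeError as err:
    message = str(err)
print(message)
```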
Tips
Use response.json() instead of json.loads(response.body): since Scrapy v2.2, Scrapy has built-in support for JSON, and behind the scenes it already imports json.
Also check the JSON data; I used requests for convenience and drilled down through the nesting until I got to the data you needed.
Yielding a dictionary is fine for well-structured data like this, but for data that needs modifying or changing, or is wrong in places, use either an Item or an ItemLoader. Those two ways of yielding output are far more flexible than yielding a dictionary; I almost never yield a dictionary, and the only time I do is when the data is highly structured.
Updated Code
Looking at the JSON data, there is quite a lot of missing data. This is part of web scraping: you will find errors like this. Here we use a try/except block for when we get a KeyError, which means Python hasn't been able to find the key associated with a value. We handle that exception by yielding a placeholder string such as 'No XXX'.
Once you start getting gaps like this, it's better to consider an Item or an ItemLoader.
Now it's worth looking at the Scrapy docs about Items. Essentially Scrapy does two things: it extracts data from websites, and it provides a mechanism for storing that data. It does this through a dictionary-like container called an Item. The code isn't much different from yielding a dictionary, but an Item lets you manipulate the extracted data more easily with the extra things Scrapy can do. You first need to edit your items.py with the fields you want: we create a class called TestItem and define each field using scrapy.Field(). We can then import this class in our spider script.
items.py
import scrapy

class TestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_name = scrapy.Field()
    Phone = scrapy.Field()
    Email = scrapy.Field()
    City = scrapy.Field()
    Zip_code = scrapy.Field()
    Website = scrapy.Field()
    Facebook = scrapy.Field()
    Linkedin = scrapy.Field()
    Twitter = scrapy.Field()
    Bio = scrapy.Field()
Here we're specifying what we want the fields to be. Unfortunately you can't use a string with spaces, which is why full name becomes full_name. scrapy.Field() creates each field of the Item for us.
We import this Item class into our spider script with from ..items import TestItem. The from ..items means we're taking items.py from the folder above the spider script and importing the class TestItem. That way our spider can populate the Item with our JSON data.
Note that just before the for loop we instantiate the class TestItem with item = TestItem(). Instantiating means calling the class; in this case it creates a dictionary-like Item. We then populate that Item with keys and values. You have to do this before you add your keys and values, as you can see inside the for loop.
Spider script
import scrapy
from ..items import TestItem

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        values = response.json()['hits']['hits']
        item = TestItem()
        for value in values:
            try:
                item['full_name'] = value['_source']['fullName']
            except KeyError:
                item['full_name'] = 'No Name'
            try:
                item['Phone'] = value['_source']['primaryPhone']
            except KeyError:
                item['Phone'] = 'No Phone number'
            try:
                item['Email'] = value['_source']['primaryEmail']
            except KeyError:
                item['Email'] = 'No Email'
            try:
                item['City'] = value['_source']['activeLocations'][0]['city']
            except KeyError:
                item['City'] = 'No City'
            try:
                item['Zip_code'] = value['_source']['activeLocations'][0]['zipcode']
            except KeyError:
                item['Zip_code'] = 'No Zip code'
            try:
                item['Website'] = value['_source']['AgentMarketingCenter'][0]['Website']
            except KeyError:
                item['Website'] = 'No Website'
            try:
                item['Facebook'] = value['_source']['AgentMarketingCenter'][0]['Facebook_URL']
            except KeyError:
                item['Facebook'] = 'No Facebook'
            try:
                item['Linkedin'] = value['_source']['AgentMarketingCenter'][0]['LinkedIn_URL']
            except KeyError:
                item['Linkedin'] = 'No Linkedin'
            try:
                item['Twitter'] = value['_source']['AgentMarketingCenter'][0]['Twitter']
            except KeyError:
                item['Twitter'] = 'No Twitter'
            try:
                item['Bio'] = value['_source']['AgentMarketingCenter'][0]['Bio']
            except KeyError:
                item['Bio'] = 'No Bio'
            yield item
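As an aside, the repeated try/except blocks can be collapsed into a small helper that walks the nested keys and falls back to a default. This is just a sketch of the same fallback logic (the helper name and sample data are mine, not part of the original answer):

```python
def extract(source, keys, default):
    """Walk nested dict/list keys, returning default on any missing step."""
    current = source
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Sample record shaped like one entry of resp['hits']['hits']
value = {'_source': {'fullName': 'Jane Doe', 'activeLocations': []}}

print(extract(value, ['_source', 'fullName'], 'No Name'))                    # Jane Doe
print(extract(value, ['_source', 'primaryPhone'], 'No Phone number'))        # No Phone number
print(extract(value, ['_source', 'activeLocations', 0, 'city'], 'No City'))  # No City
```

This also catches the IndexError that the original try/except (which only catches KeyError) would miss when activeLocations is an empty list.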
I am trying to get as many profile links as I can on khanacademy.org, using their API.
I am struggling to navigate through the JSON file to get the desired data.
Here is my code :
from urllib.request import urlopen
import json

with urlopen("https://www.khanacademy.org/api/internal/discussions/video/what-are-algorithms/questions?casing=camel&limit=10&page=0&sort=1&lang=en&_=190422-1711-072ca2269550_1556031278137") as response:
    source = response.read()

data = json.loads(source)

for item in data['feedback']:
    print(item['authorKaid'])
    profile_answers = item['answers']['authorKaid']
    print(profile_answers)
My goal is to get as many authorKaid values as possible and then store them (to create a database later).
When I run this code I get this error :
TypeError: list indices must be integers or slices, not str
I don't understand why, on this tutorial video : https://www.youtube.com/watch?v=9N6a-VLBa2I at 16:10 it is working.
The issue is that item['answers'] is a list, and you are trying to access it with a string rather than an index value. So when you try item['answers']['authorKaid'] you get the error.
What you really want is
print (item['answers'][0]['authorKaid'])
print (item['answers'][1]['authorKaid'])
print (item['answers'][2]['authorKaid'])
etc...
So you actually want to iterate through those lists. Try this:
from urllib.request import urlopen
import json

with urlopen("https://www.khanacademy.org/api/internal/discussions/video/what-are-algorithms/questions?casing=camel&limit=10&page=0&sort=1&lang=en&_=190422-1711-072ca2269550_1556031278137") as response:
    source = response.read()

data = json.loads(source)

for item in data['feedback']:
    print(item['authorKaid'])
    for each in item['answers']:
        profile_answers = each['authorKaid']
        print(profile_answers)
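Since the goal is to store as many authorKaid values as possible, one option is to collect them into a set so duplicates are dropped before they ever reach a database. A sketch on made-up data shaped like the API's 'feedback' payload:

```python
# Sample data shaped like the Khan Academy 'feedback' payload
data = {'feedback': [
    {'authorKaid': 'kaid_1', 'answers': [{'authorKaid': 'kaid_2'},
                                         {'authorKaid': 'kaid_1'}]},
    {'authorKaid': 'kaid_3', 'answers': []},
]}

kaids = set()
for item in data['feedback']:
    kaids.add(item['authorKaid'])          # question author
    for answer in item['answers']:
        kaids.add(answer['authorKaid'])    # each answer author

print(sorted(kaids))  # ['kaid_1', 'kaid_2', 'kaid_3']
```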
I am trying to load a JSON file from a URL and parse it on Python 3.4, but I get a few errors and I have no idea what they are pointing to. I did verify the JSON file at the URL with jsonlint.com and the file seems fine. data.read() is returning bytes, and I've cast it to str. The code is:
import urllib.request
import json

inp = input("enter url :")
if len(inp) < 1:
    inp = 'http://python-data.dr-chuck.net/comments_42.json'
data = urllib.request.urlopen(inp)
data_str = str(data.read())
print(type(data_str))
parse_data = json.loads(data_str)
print(type(parse_data))
The error that I'm getting is:
The expression str(data.read()) doesn't "cast" your bytes into a string, it just produces a string representation of them. This can be seen if you print data_str: it's a str beginning with b'.
To actually decode the JSON, you need to do data_str = data.read().decode('utf-8')
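The difference is easy to see on a small byte string:

```python
import json

raw = b'{"name": "Chuck"}'

bad = str(raw)              # just a string representation of the bytes
print(bad)                  # b'{"name": "Chuck"}'  (note the b' prefix)

good = raw.decode('utf-8')  # actual decoding to text
print(json.loads(good)['name'])  # Chuck
```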
I am trying to read the JSON response from this link, but it's not working! I get the following error:
ValueError: No JSON object could be decoded.
Here is the code I've tried:
import urllib2, json
a = urllib2.urlopen('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false&callback=_callbacks_._DElanZU7Xh1K')
data = json.loads(a)
I made these changes:
import requests, json
r=requests.get('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false')
json_data = json.loads(r.text)
print json_data['ruleGroups']['USABILITY']['score']
A quick question about constructing the image link.
I was able to get this far:
from selenium import webdriver
txt = json_data['screenshot']['data']
txt = str(txt).replace('-','/').replace('_','/')
#then in order to construct the image link i tried : -
image_link = 'data:image/jpeg;base64,'+txt
driver = webdriver.Firefox()
driver.get(image_link)
The problem is I am not getting the image; also, len(object_original) differs from len(image_link). Could anybody please advise what elements are missing from my constructed image link? Thank you.
Here is the API link - https://www.google.co.uk/webmasters/tools/mobile-friendly/ (sorry, added it late).
Two corrections need to be made to your code:
The URL was corrected (as mentioned by Felix Kling here): you have to remove the callback parameter from the GET request you were sending.
Also, if you check the type of the response you were fetching earlier, you'll notice it wasn't a string; it was <type 'instance'>. Since json.loads() accepts a string as its parameter, you would have got another error. Therefore, use a.read() to fetch the response data as a string.
Hence, this should be your code:
import urllib2, json
a = urllib2.urlopen('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false')
data = json.loads(a.read())
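For context on why the callback parameter mattered: with callback=... the API returns JSONP, i.e. JSON wrapped in a function call, which json.loads cannot parse. A sketch with a made-up payload (the callback name is taken from the URL in the question):

```python
import json

# Hypothetical JSONP payload: JSON wrapped in a callback function call
jsonp = '_callbacks_._DElanZU7Xh1K({"ruleGroups": {"USABILITY": {"score": 99}}})'

try:
    json.loads(jsonp)
    parsed = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parsed = False
print(parsed)  # False

# Stripping the function wrapper leaves plain JSON
payload = jsonp[jsonp.index('(') + 1 : jsonp.rindex(')')]
score = json.loads(payload)['ruleGroups']['USABILITY']['score']
print(score)  # 99
```

Removing the callback parameter from the request, as done above, makes the server return plain JSON so no stripping is needed.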
Answer to your second query (regarding the image) is:
arr = json_data['screenshot']['data']
arr = arr.replace("_", "/")
arr = arr.replace("-", "+")
fh = open("imageToSave.jpeg", "wb")
fh.write(arr.decode('base64'))
fh.close()
fh.close()
Here is the image you were trying to fetch - Link
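For reference, in Python 3 the same URL-safe substitutions ('-' for '+', '_' for '/') are handled by base64.urlsafe_b64decode, so the manual replace() calls aren't needed. A sketch on made-up bytes, not the real screenshot data:

```python
import base64

original = b'\xffJPEG-like bytes\xfe'

# URL-safe base64 uses '-' and '_' in place of '+' and '/'
urlsafe = base64.urlsafe_b64encode(original)

# urlsafe_b64decode reverses the substitutions itself
decoded = base64.urlsafe_b64decode(urlsafe)
print(decoded == original)  # True
```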
Felix Kling is right about the address, but I also created a variable that holds the URL. You can try this out and it should work:
import urllib2, json
url = "https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false"
response = urllib2.urlopen(url)
data = json.loads(response.read())
print data
I want to manipulate the information at THIS url. I can successfully open it and read its contents. But what I really want to do is throw out the stuff I don't want and manipulate the stuff I want to keep.
Is there a way to convert the string into a dict so I can iterate over it? Or do I just have to parse it as is (str type)?
from urllib.request import urlopen
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = urlopen(url)
print(response.read()) # returns string with info
When I printed response.read() I noticed that b was prepended to the string (e.g. b'{"a":1,...). The "b" stands for bytes and declares the type of the object you're handling. Since I knew that a string could be converted to a dict using json.loads('string'), I just had to convert the bytes to str. I did this by decoding the response with decode('utf-8'). Once it was a str, my problem was solved and I was easily able to iterate over the dict.
I don't know if this is the fastest or most 'Pythonic' way of writing this, but it works, and there's always time later for optimization and improvement! Full code for my solution:
from urllib.request import urlopen
import json
# Get the dataset
url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = urlopen(url)
# Convert bytes to string type and string type to dict
string = response.read().decode('utf-8')
json_obj = json.loads(string)
print(json_obj['source_name']) # prints the string with 'source_name' key
You can also use python's requests library instead.
import requests

url = 'http://www.quandl.com/api/v1/datasets/FRED/GDP.json'
response = requests.get(url)
data = response.json()
Now you can manipulate data like a regular Python dictionary. (Avoid naming the variable dict, since that shadows the built-in type.)
json works with Unicode text in Python 3 (the JSON format itself is defined only in terms of Unicode text), and therefore you need to decode the bytes received in the HTTP response. r.headers.get_content_charset('utf-8') gets you the character encoding:
#!/usr/bin/env python3
import io
import json
from urllib.request import urlopen

with urlopen('https://httpbin.org/get') as r, \
     io.TextIOWrapper(r, encoding=r.headers.get_content_charset('utf-8')) as file:
    result = json.load(file)

print(result['headers']['User-Agent'])
It is not necessary to use io.TextIOWrapper here:
#!/usr/bin/env python3
import json
from urllib.request import urlopen

with urlopen('https://httpbin.org/get') as r:
    result = json.loads(r.read().decode(r.headers.get_content_charset('utf-8')))

print(result['headers']['User-Agent'])
TL;DR: When you get data from a server, it is typically sent as bytes. The rationale is that these bytes need to be 'decoded' by the recipient, who should know how to use the data. You should decode the bytes on arrival so you get a string rather than a b'...' bytes object.
Use case:
import requests

def get_data_from_url(url):
    response = requests.get(url)
    response_data_split_by_line = response.content.decode('utf-8').splitlines()
    return response_data_split_by_line
In this example, I decode the content I received as UTF-8. For my purposes I then split it by line, so I can loop through each line with a for loop.
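The decode-then-splitlines step can be tried on inline bytes, with no network involved:

```python
# Simulated response body, as bytes
payload = b'line one\nline two\nline three'

# Decode to text, then split into a list of lines
lines = payload.decode('utf-8').splitlines()
print(lines)  # ['line one', 'line two', 'line three']
```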
I guess things have changed in Python 3.4. This worked for me:
print("resp:" + json.dumps(resp.json()))