Pymongo - Storing JSON files [duplicate]

Folks,
I just spent a good amount of time trying to look this up -- I must be missing something basic.
I have a Python object, and all I want to do is insert it into MongoDB.
This is what I have:
from pymongo import Connection
import json

conn = Connection()
db = conn.cl_database
postings = db.postings_collection

class Posting(object):
    def __init__(self, link, found=None, expired=None):
        self.link = link
        self.found = found
        self.expired = expired

posting = Posting('objectlink1')
value = json.dumps(posting, default=lambda x: x.__dict__)
postings.insert(value)
This throws the following error:
Traceback (most recent call last):
File "./mongotry.py", line 21, in <module>
postings.insert(value)
File "build/bdist.macosx-10.7-intel/egg/pymongo/collection.py", line 302, in insert
File "build/bdist.macosx-10.7-intel/egg/pymongo/database.py", line 252, in _fix_incoming
File "build/bdist.macosx-10.7-intel/egg/pymongo/son_manipulator.py", line 73, in transform_incoming
TypeError: 'str' object does not support item assignment
It seems this is because json.dumps() returns a string.
Now, if I do a json.loads() of the value before inserting, it works fine:
posting = Posting('objectlink1')
value = json.dumps(posting, default=lambda x:x.__dict__)
value = json.loads(value)
postings.insert(value)
What is the most straightforward way to do this?
Thanks!

What is value in your initial code? It should be a dict, not a class instance.
This should work:
postings.insert(posting.__dict__)

You are misusing the collection's insert method. Review it here: http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert
What you need to insert is a document: a dict with keys and values. Trying to insert a plain string is not valid. json.dumps returns a string in JSON format, so if you are only dumping the object to get a dict, the json step is not necessary.
Insert the document exactly as you want it to look:
postings.insert({"key":"value"})
Or convert your class instance directly into the dict you want to store as a document and then insert that. Your json.dumps()/json.loads() round trip works because it ultimately does give you a dict.
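Putting it together, a minimal sketch of the direct approach, reusing the posting object and postings collection from the question (no json step needed):

posting = Posting('objectlink1')
postings.insert(posting.__dict__)   # the attribute dict is already a valid document

# or build the document explicitly if you want to control its shape
doc = {'link': posting.link, 'found': posting.found, 'expired': posting.expired}
postings.insert(doc)

On newer PyMongo versions the equivalent call is insert_one(), since insert() has been deprecated.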

Related

Getting AttributeError: 'str' object has no attribute 'get'

I am getting an error while working with a JSON response:
Error: AttributeError: 'str' object has no attribute 'get'
What could be the issue?
I am also getting the following errors for the rest of the values:
TypeError: 'builtin_function_or_method' object is not subscriptable
    'Phone': value['_source']['primaryPhone'],
KeyError: 'primaryPhone'
# -*- coding: utf-8 -*-
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Phone': value['_source']['primaryPhone'],
                "Email": value['_source']['primaryEmail'],
                "City": value.get['_source']['city'],
                "Zip Code": value.get['_source']['zipcode'],
                "Website": value['_source']['websiteURL'],
                "Facebook": value['_source']['facebookURL'],
                "LinkedIn": value['_source']['LinkedIn_URL'],
                "Twitter": value['_source']['Twitter'],
                "BIO": value['_source']['Bio']
            }
It's nested deeper than you think it is. That's why you're getting an error.
Code Example
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['hits']['hits']
        for value in values:
            yield {
                'Full Name': value['_source']['fullName'],
                'Primary Phone': value['_source']['primaryPhone']
            }
Explanation
The resp variable is a Python dictionary, but there is no resp['hits']['hits']['fullName'] within this JSON data. The data you're looking for, fullName, is actually at resp['hits']['hits'][i]['_source']['fullName'], where i is a number, because resp['hits']['hits'] is a list.
resp['hits'] is a dictionary, and therefore the values variable is fine.
But resp['hits']['hits'] is a list, so you can't call .get() on it, and it only accepts integers within [], not strings. Hence the error.
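As a small self-contained illustration of the difference (the literal dict shape below is assumed from the question's JSON):

resp = {'hits': {'hits': [{'_source': {'fullName': 'Jane Doe'}}]}}   # shape assumed from the question
hits = resp['hits']['hits']                          # a list, so it needs an integer index
print(hits[0]['_source']['fullName'])                # 'Jane Doe'
print(hits[0]['_source'].get('city', 'No City'))     # .get() works on the dict and can take a default
# hits['fullName'] fails because list indices must be integers,
# and hits[0].get['_source'] fails because .get is a method being subscripted with [],
# hence "'builtin_function_or_method' object is not subscriptable"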
Tips
Use response.json() instead of json.loads(response.body); since Scrapy v2.2, Scrapy has built-in JSON support. Behind the scenes it already imports json.
Also inspect the JSON data itself; I used requests for convenience and worked down the nesting until I reached the data you needed.
Yielding a dictionary is fine for this type of data because it's well structured, but for data that needs modifying or changing, or is wrong in places, use either an Item or an ItemLoader. Those two ways of yielding output are far more flexible than yielding a dictionary. I almost never yield a dictionary; the only time is when the data is highly structured.
Updated Code
Looking at the JSON data, there is quite a lot of missing data. This is part of web scraping; you will find gaps like this. Here we use a try/except block for when we get a KeyError, which Python raises when a key is not present. We handle that exception by yielding a placeholder string such as 'No Phone number'.
Once you start getting gaps like this, it's better to consider an Item or an ItemLoader.
Now it's worth looking at the Scrapy docs about Items. Essentially Scrapy does two things: it extracts data from websites, and it provides a mechanism for storing that data. It does this with a dictionary-like container called an Item. The code isn't much different from yielding a dictionary, but an Item lets you manipulate the extracted data more easily with the extra things Scrapy can do. You first need to edit your items.py with the fields you want: we create a class called TestItem and define each field using scrapy.Field(). We can then import this class in our spider script.
items.py
import scrapy

class TestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_name = scrapy.Field()
    Phone = scrapy.Field()
    Email = scrapy.Field()
    City = scrapy.Field()
    Zip_code = scrapy.Field()
    Website = scrapy.Field()
    Facebook = scrapy.Field()
    Linkedin = scrapy.Field()
    Twitter = scrapy.Field()
    Bio = scrapy.Field()
Here we're specifying what we want the fields to be. Unfortunately you can't use a name with spaces, which is why full name becomes full_name. scrapy.Field() creates each field of the item for us.
We import this item into our spider script with from ..items import TestItem. The from ..items means we're taking items.py from the parent folder of the spider script and importing the class TestItem. That way our spider can populate the item with our JSON data.
Note that just before the for loop we instantiate the class TestItem with item = TestItem(). Instantiating the class gives us a dictionary-like item, which we then populate with keys and values. You have to do this before you add your keys and values, as you can see from within the for loop.
Spider script
import scrapy
from ..items import TestItem

class MainSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://experts.expcloud.com/api4/std?searchterms=AB&size=216&from=0']

    def parse(self, response):
        values = response.json()['hits']['hits']
        item = TestItem()
        for value in values:
            try:
                item['full_name'] = value['_source']['fullName']
            except KeyError:
                item['full_name'] = 'No Name'
            try:
                item['Phone'] = value['_source']['primaryPhone']
            except KeyError:
                item['Phone'] = 'No Phone number'
            try:
                item['Email'] = value['_source']['primaryEmail']
            except KeyError:
                item['Email'] = 'No Email'
            try:
                item['City'] = value['_source']['activeLocations'][0]['city']
            except KeyError:
                item['City'] = 'No City'
            try:
                item['Zip_code'] = value['_source']['activeLocations'][0]['zipcode']
            except KeyError:
                item['Zip_code'] = 'No Zip code'
            try:
                item['Website'] = value['_source']['AgentMarketingCenter'][0]['Website']
            except KeyError:
                item['Website'] = 'No Website'
            try:
                item['Facebook'] = value['_source']['AgentMarketingCenter'][0]['Facebook_URL']
            except KeyError:
                item['Facebook'] = 'No Facebook'
            try:
                item['Linkedin'] = value['_source']['AgentMarketingCenter'][0]['LinkedIn_URL']
            except KeyError:
                item['Linkedin'] = 'No Linkedin'
            try:
                item['Twitter'] = value['_source']['AgentMarketingCenter'][0]['Twitter']
            except KeyError:
                item['Twitter'] = 'No Twitter'
            try:
                item['Bio'] = value['_source']['AgentMarketingCenter'][0]['Bio']
            except KeyError:
                item['Bio'] = 'No Bio'
            yield item
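As a usage note (standard Scrapy CLI, assuming the spider name 'test' from the code above), the scraped items can then be exported with the built-in feed exporter:

scrapy crawl test -o output.json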

How to navigate through a json file with Python 3? TypeError: list indices must be integers or slices, not str

I am trying to get as many profile links as I can on khanacademy.org. I am using their API.
I am struggling to navigate through the JSON file to get the desired data.
Here is my code:
from urllib.request import urlopen
import json

with urlopen("https://www.khanacademy.org/api/internal/discussions/video/what-are-algorithms/questions?casing=camel&limit=10&page=0&sort=1&lang=en&_=190422-1711-072ca2269550_1556031278137") as response:
    source = response.read()

data = json.loads(source)

for item in data['feedback']:
    print(item['authorKaid'])
    profile_answers = item['answers']['authorKaid']
    print(profile_answers)
My goal is to get as many authorKaid values as possible and then store them (to create a database later).
When I run this code I get this error :
TypeError: list indices must be integers or slices, not str
I don't understand why; in this tutorial video, https://www.youtube.com/watch?v=9N6a-VLBa2I at 16:10, it works.
The issue is that item['answers'] is a list, and you are trying to access it with a string rather than an integer index. So when you try to get item['answers']['authorKaid'], you get the error.
What you really want is:
print (item['answers'][0]['authorKaid'])
print (item['answers'][1]['authorKaid'])
print (item['answers'][2]['authorKaid'])
etc...
So you actually want to iterate through those lists. Try this:
from urllib.request import urlopen
import json

with urlopen("https://www.khanacademy.org/api/internal/discussions/video/what-are-algorithms/questions?casing=camel&limit=10&page=0&sort=1&lang=en&_=190422-1711-072ca2269550_1556031278137") as response:
    source = response.read()

data = json.loads(source)

for item in data['feedback']:
    print(item['authorKaid'])
    for each in item['answers']:
        profile_answers = each['authorKaid']
        print(profile_answers)
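Since the stated goal is to collect as many authorKaid values as possible for later storage, a minimal extension of the approach above could gather them into a set first (the deduplicating set is my own addition, not part of the original answer):

from urllib.request import urlopen
import json

url = "https://www.khanacademy.org/api/internal/discussions/video/what-are-algorithms/questions?casing=camel&limit=10&page=0&sort=1&lang=en&_=190422-1711-072ca2269550_1556031278137"
author_kaids = set()

with urlopen(url) as response:
    data = json.loads(response.read())

for item in data['feedback']:
    author_kaids.add(item['authorKaid'])          # the question's author
    for answer in item['answers']:
        author_kaids.add(answer['authorKaid'])    # each answer's author

print(len(author_kaids), "unique profiles collected")

The set can later be written to whatever database you choose.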

GCP Proto Datastore encode JsonProperty in base64

I store a blob of JSON in the Datastore using JsonProperty.
I don't know the structure of the JSON data.
I am using endpoints-proto-datastore in order to retrieve my data.
The problem is that the JSON property is encoded in base64, and I want a plain JSON object.
For the example, the JSON data will be:
{
    "first": 1,
    "second": 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel

class Model(EndpointsModel):
    data = ndb.JsonProperty()

@endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):

    @Model.method(path='mymodel2', http_method='POST',
                  name='mymodel.insert')
    def MyModelInsert(self, my_model):
        my_model.data = {"first": 1, "second": 2}
        my_model.put()
        return my_model

    @Model.method(path='mymodel/{entityKey}',
                  http_method='GET',
                  name='mymodel.get')
    def getMyModel(self, model):
        print(model.data)
        return model

API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
"data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoding of {"second": 2, "first": 1}.
And the print statement give me: {u'second': 2, u'first': 1}
So, in the method, I can explore the JSON blob as a Python dict.
But in the API response, the data is base64-encoded.
I expected the API call to give me:
{
    "data": {
        "second": 2,
        "first": 1
    }
}
How can I get this result?
After the discussion in the comments of your question, let me share sample code that you can use to store a JSON object in Datastore (it will be stored as a string) and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood your issue correctly and that this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel

class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()

@endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):

    # URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
    @Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
    def MyModelInsert(self, my_model):
        dict = {'first': 1, 'second': 2}
        dict_str = str(dict)
        my_model.column1 = "Year"
        my_model.column2 = 2018
        my_model.column3 = dict_str
        my_model.put()
        return my_model

    # URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
    @Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
    def MyModelGet(self, my_model):
        if not my_model.from_datastore:
            raise endpoints.NotFoundException('MyModel not found.')
        dict = eval(my_model.column3)
        print("This is the Python dict recovered from a string: {}".format(dict))
        return my_model

application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new entity that you will be able to find in Datastore, containing a property column3 with your JSON data in string format.
Then, if you use the ID of that entity to retrieve it, your browser will show the string without any strange encoding, just plain JSON.
And in the console, you will be able to see that this string can be converted back to a Python dict (or to JSON, using the json module if you prefer).
I hope I have not missed anything you want to achieve, but I think this code covers the most important points: having a property hold a JSON object, storing it in Datastore, retrieving it in a readable format, and being able to use it again as JSON/dict.
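As a side note, the same round trip can be done with the json module instead of str()/eval(), which avoids evaluating arbitrary strings. This is only a sketch of the two lines that would change in the code above:

import json

# when storing
my_model.column3 = json.dumps({'first': 1, 'second': 2})

# when retrieving
recovered = json.loads(my_model.column3)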
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
import json  # needed for the json.dumps call below

# Define the nested model (your JSON object)
class Structured(EndpointsModel):
    first = ndb.IntegerProperty()
    second = ndb.IntegerProperty()

# Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
    column1 = ndb.StringProperty()
    column2 = ndb.IntegerProperty()
    column3 = ndb.StringProperty()
    column4 = ndb.StructuredProperty(Structured)

# Modify this endpoint definition to add the new property
@Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
    if not my_model.from_datastore:
        raise endpoints.NotFoundException('MyModel not found.')
    # Add the new nested property here
    dict = eval(my_model.column3)
    my_model.column4 = Structured(**dict)  # the StructuredProperty expects a Structured instance
    print(json.dumps(my_model.column3))
    print("This is the Python dict recovered from a string: {}".format(dict))
    return my_model
With these changes, the response of the call to the endpoint now includes column4 as a JSON object itself (although it is not printed in-line, I do not think that should be a problem).
I hope this helps too. If this is not the exact behavior you want, maybe you should play around with the available Property Types, but I do not think there is one to which you can pass a Python dict (or JSON object) without previously converting it to a string.

Trying to get a simple loop of JSON passed to POST - Python

I have a web service (WS) using Flask/Python 2.7. A single JSON object is passed to the WS. I have been successful in capturing the object and returning the whole JSON.
I have looked all over for examples (many just print a test dataset in Python) and have tried json.dumps, json.loads, json.dump, json.load, for loops, etc.
What I would like to do seems simple, and I know the problem is me, but I get errors no matter what I try. I am trying to parse the JSON, put the values into variables, and do "stuff".
This works:
@app.route('/v1/test', methods=['POST'])
def api_message():
    if request.headers['Content-Type'] == 'application/json':
        return "JSON Message: " + json.dumps(request.json, separators=(',', ':'))
    else:
        return "415 Unsupported Media Type"
This does not (nor do many variations of it using different approaches):
jsonobject = json.dumps(request.json)
pstring = json.loads(jsonobject)
for key, value in pstring.iteritems():
    return value
What I want to do (pseudo code):
for each JSON:
    get the name/value pairs into a place where I can do something like this (which was done on a flat file):
        input_data = pd.read_csv(sio, delimiter=',', names=columns)
        probs = model.predict_proba(input_data)
I am sure I didn't make this as clear as I could, but it is a challenge because I get errors like the ones below (examples, not all at once of course) with all the different things I try:
AttributeError: 'dict' object has no attribute 'translate'
TypeError: 'dict' object is not callable
AttributeError: 'str' object has no attribute 'iteritems'
So after all that, what is the right way to do this?
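No accepted answer is included here, but a minimal sketch of the kind of parsing the pseudo code describes might look like the following; request.get_json(), the one-row pandas DataFrame, and the model variable are assumptions based on the question, not code from the thread:

import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/v1/test', methods=['POST'])
def api_message():
    if request.headers.get('Content-Type') != 'application/json':
        return "415 Unsupported Media Type", 415
    payload = request.get_json()               # already a Python dict, no json.loads needed
    # payload.items() gives the name/value pairs if you need to loop over them
    input_data = pd.DataFrame([payload])       # one-row DataFrame built from the name/value pairs
    # probs = model.predict_proba(input_data)  # 'model' comes from the question's pseudo code
    return jsonify(payload)

The key point is that request.json (or request.get_json()) is already a dict, so re-serializing it with json.dumps and parsing it again is unnecessary.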

Problems while serializing a ValuesQuerySet object to JSON in Django

I'm not able to serialize a ValuesQuerySet object to JSON data. I've found multiple solutions to this problem, but my case is different because I need to follow the foreign key values.
from task_manager.models import UserTasks
data=UserTasks.objects.filter(user__username="root",server_id=2).values("server_id__mnemonic")
The previous query returns something like this:
>>> print data
[{'server_id__mnemonic': u'lol'}, {'server_id__mnemonic': u'lol'}, {'server_id__mnemonic': u'lol'},.......]
But when I try to serialize it to JSON format raises the next exception:
>>> json_data = serializers.serialize('json',data)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "C:\Python27\lib\site-packages\django\core\serializers\__init__.py", line 122, in serialize
s.serialize(queryset, **options)
File "C:\Python27\lib\site-packages\django\core\serializers\base.py", line 45, in serialize
concrete_model = obj._meta.concrete_model
AttributeError: 'dict' object has no attribute '_meta'
>>> type(data)
<class 'django.db.models.query.ValuesQuerySet'>
I've found a solution in the official Django documentation that says: If you only want a subset of fields to be serialized, you can specify a fields argument to the serializer:
from django.core import serializers
data = serializers.serialize('xml', SomeModel.objects.all(), fields=('name','size'))
But with this code, I cannot get the foreign key values I want.
Thanks
values() gives you a ValuesQuerySet, which you can serialize by converting it to a list and using the json module; there is no need to involve Django serializers here:
import json
from task_manager.models import UserTasks
data = UserTasks.objects.filter(user__username="root",server_id=2).values("server_id__mnemonic")
print json.dumps(list(data))
Another option would be to use serializers.serialize(), specifying the fields argument:
data = UserTasks.objects.filter(user__username="root",server_id=2)
print serializers.serialize('json', data, fields=('server_id__mnemonic', ))
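As a small optional addition (not part of the original answer): if the selected values ever include types the standard encoder cannot handle, such as dates or Decimals, Django's own encoder can be passed to json.dumps:

import json
from django.core.serializers.json import DjangoJSONEncoder
from task_manager.models import UserTasks

data = UserTasks.objects.filter(user__username="root", server_id=2).values("server_id__mnemonic")
print json.dumps(list(data), cls=DjangoJSONEncoder)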