Skipping AttributeError while importing Twitter data into pandas (JSON)

I have a file of almost 1 GB that stores about 0.2 million tweets. A file of that size inevitably contains some bad records. The error shown is
AttributeError: 'int' object has no attribute 'items'. It occurs when I run this code:
import json
from pandas.io.json import json_normalize
raw_data_path = input("Enter the path for raw data file: ")
tweet_data_path = raw_data_path
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweet_data.append(tweet)
    except:
        continue
tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
                                      "created_at", "coordinates",
                                      "user.location", "id"]]
Is there a way to skip the lines that cause this error and continue with the rest of the file?

The issue here is not with the lines in the file but with tweet_data itself. If you check tweet_data, you will find one or more elements of type int, whereas json_normalize expects only a "dict or list of dicts". A line that contains just a number, such as 1, is valid JSON, so json.loads parses it without raising and the integer ends up in the list.
You may want to check tweet_data and remove any values that are not dictionaries.
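To see what ended up in the list, counting the parsed records by type is a quick check. This is a minimal sketch with made-up sample lines standing in for the real file; the variable names are illustrative only.

```python
import json
from collections import Counter

# Hypothetical sample lines; a real file would be read line by line instead.
sample_lines = [
    '{"text": "hello", "lang": "en"}',
    '1',                      # valid JSON, but parses to an int
    '{"text": "bonjour", "lang": "fr"}',
    'not json at all',        # raises ValueError in json.loads
]

parsed = []
for line in sample_lines:
    try:
        parsed.append(json.loads(line))
    except ValueError:
        continue

# Count how many parsed records there are of each Python type.
type_counts = Counter(type(record).__name__ for record in parsed)
print(type_counts)
```

Any type other than dict in the counts is a candidate for the AttributeError.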
I was able to reproduce the error with the example below, taken from the json_normalize documentation:
Working Example:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
             'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
             'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
        ]
json_normalize(data)
Output:
Displays a DataFrame
Reproducing Error:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
             'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
             'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
        1  # *Added an integer to the list*
        ]
result = json_normalize(data)
Error:
AttributeError: 'int' object has no attribute 'items'
How to prune tweet_data (not needed if you follow the update below):
Before normalizing, run this:
tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
Update (for the for loop):
for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict):
            tweet_data.append(tweet)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        continue

The final form of the code looks like this:
tweet_data_path = raw_data_path
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict):
            tweet_data.append(tweet)
    except:
        continue
This keeps non-dict values out of tweet_data, which removes the AttributeError that was blocking the import into a pandas DataFrame.
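A slightly more defensive variant of the same loop is sketched below. It closes the file with a context manager and counts what was skipped, so silent data loss is visible. The function name load_tweets and the in-memory file contents are assumptions made for illustration.

```python
import io
import json

def load_tweets(lines):
    """Return (tweets, skipped): dict records and a count of rejected lines."""
    tweets, skipped = [], 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:          # malformed JSON (json.JSONDecodeError)
            skipped += 1
            continue
        if isinstance(record, dict):
            tweets.append(record)
        else:                       # valid JSON, but not an object
            skipped += 1
    return tweets, skipped

# Stand-in for open(path, "r", encoding="utf-8"); the contents are made up.
fake_file = io.StringIO('{"id": 1, "text": "a"}\n42\n{broken\n{"id": 2, "text": "b"}\n')
with fake_file as tweets_file:
    tweet_data, skipped = load_tweets(tweets_file)
print(len(tweet_data), skipped)
```

The resulting list can then be passed to json_normalize as in the question; newer pandas releases expose the same function as pandas.json_normalize.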

Related

Writing JSON object to file and getting null written instead

I'm using Python to create a data.json file and write a json object to it.
with open('data.json', 'w', encoding='utf-8') as f:
    util.json.dump(jsonData, f, ensure_ascii=False, indent=4)
where jsonData = {'Book': {'author': 'John Black', 'description': 'When....
When I locate data.json file on my computer and open it to modify the content, instead of {'Book': {'author':... I see null printed in the file.
I don't understand why it is happening, jsonData is not null, I printed it out before manipulating to double-check.
Thank you for your help in advance! =)
I am not sure what purpose util serves here, but using the json library directly gives the right result.
import json
jsonData = {'Book': {'author': 'John Black', 'description': 'When....'}}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(jsonData, f, ensure_ascii=False, indent=4)
import json
jsonData = {
    "Book": {
        "author": "John Black",
        "description": "afasffsaf afafasfsa"
    }
}
with open('data.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(jsonData))
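For context on where a literal null can come from: json.dump writes null when the object it is given is None. The sketch below shows one way that can happen, a helper that forgets to return its result. Whether this is what happened in the question is only a guess, since the code that builds jsonData and the util module are not shown; dump_to_text and build_book_broken are hypothetical names.

```python
import io
import json

def dump_to_text(obj):
    """Serialize obj the same way the question does, but into memory."""
    buffer = io.StringIO()
    json.dump(obj, buffer, ensure_ascii=False, indent=4)
    return buffer.getvalue()

json_data = {'Book': {'author': 'John Black'}}

def build_book_broken():
    # Hypothetical helper that builds the data but forgets to return it,
    # so the caller receives None.
    book = {'Book': {'author': 'John Black'}}

print(dump_to_text(json_data))            # the nested object
print(dump_to_text(build_book_broken()))  # null, because the value is None
```

Printing type(jsonData) immediately before the dump call is a cheap way to rule this out.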

Unable to resolve TypeError: Object of type 'map' is not JSON serializable

I get an error when serializing a dict that contains a map object with json.dumps in Python 3.6:
x = {'id_str': '639035115457388544', 'video': False, 'photo': False, 'link': True, 'hashtags': <map object at 0x7f1762ab9320>, 'coordinates': None, 'timestamp_ms': 1441218018000, 'text': 'Police suspected hit-and-run', 'user': {'id': 628694263, 'name': 'Beth LeBlanc', 'friends_count': 235, 'verified': False, 'followers_count': 654, 'created_at': 1341631106000, 'time_zone': None, 'statuses_count': 3966, 'protected': 3966}, 'mentions': [], 'screen_name': 'THBethLeBlanc', 'reply': None, 'tweet_type': 'Tweet', 'mentionedurl': None, 'possibly_sensitive': False, 'placename': '', 'sentiments': 'Undefined'}
print(json.dumps(x))
TypeError: Object of type 'map' is not JSON serializable
I don't know how you get the value for 'hashtags', but the example below should help. Wrap your map object in list().
>>> import json
>>>
>>> some_map_value = map([],[])
>>> some_map_value
<map object at 0x7f380a75a850>
>>>
>>> x = {'hashtags': some_map_value}
>>> x
{'hashtags': <map object at 0x7f380a75a850>}
>>>
>>> json.dumps(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.7/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type map is not JSON serializable
>>>
>>> list(some_map_value)
[]
>>> x = {'hashtags': list(some_map_value)} # surround your map object with list
>>> json.dumps(x)
'{"hashtags": []}'
For more information, see the question "Getting a map() to return a list in Python 3.x". If this is not what you are looking for, please leave a comment on this answer.
Update: I just checked your comment. Wrap your map(lambda x: x['text'],doc['entities']['hashtags']) in list(), like list(map(lambda x: x['text'],doc['entities']['hashtags'])):
if doc['entities'].get('media'):
    tweet['photo'] = True
if doc.get('extended_entities'):
    tweet[doc['extended_entities']['media'][0]['type']] = True
    tweet['mediaurl'] = doc['extended_entities']['media'][0]['media_url']
if doc['entities'].get('urls'):
    tweet['link'] = True
tweet['hashtags'] = list(map(lambda x: x['text'],doc['entities']['hashtags']))
tweet['coordinates'] = doc['coordinates']
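If the map object cannot easily be converted where it is created, json.dumps also accepts a default callable that is invoked for any object it cannot serialize. The sketch below uses that hook. The function name iterators_to_list and the sample hashtags are made up for illustration.

```python
import json

def iterators_to_list(obj):
    """Fallback for json.dumps: expand map objects, reject anything else."""
    if isinstance(obj, map):
        return list(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Made-up stand-in for doc['entities']['hashtags'] in the question.
hashtags = [{'text': 'news'}, {'text': 'police'}]
x = {'id_str': '639035115457388544',
     'hashtags': map(lambda tag: tag['text'], hashtags)}

encoded = json.dumps(x, default=iterators_to_list)
print(encoded)
```

Note that a map object is an iterator, so it is exhausted after being serialized once. Converting to a list at creation time, as in the update above, avoids that surprise.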
There is an error in your x where the hashtags key has no corresponding value. Here it is fixed:
https://repl.it/repls/SubtleLovableSystemadministrator

The input is a list and the output is in the form of a nested dictionary in a list

Input:
input_list=['1.exe','2.exe','3.exe','4.exe']
Output format:
out_dict=[{'name':'1.exe',
           'children':[{'name':'2.exe',
                        'children':[{'name':'3.exe',
                                     'children':[{'name':'4.exe'}]}]}]}]
The input is a list as shown above, and the output has to be produced in the format shown.
I tried using nested for loops, but that isn't working. How can this be done, and how can the result be produced as JSON?
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(data):
    try:
        first_value = data[0]
        data = [{'name': first_value, 'children': split(data[1:])} if split(data[1:]) != [] else {'name': first_value}]
        return data
    except IndexError:  # empty list: nothing left to nest
        return data
print(split(input_list))
output:
[{'name': '1.exe', 'children':
[{'name': '2.exe', 'children':
[{'name': '3.exe', 'children':
[{'name': '4.exe'}]}]}]}]
Code that is a little easier to understand (with explanations):
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    if len(input_list) == 0:
        return input_list  # if there is no data return empty list
    else:  # if we have elements
        first_value = input_list[0]  # first value
        if split(input_list[1:]) != []:  # input_list[1:] is a list with all values except the first
            input_list = [{'name': first_value, 'children': split(input_list[1:])}]
            return input_list  # return after the last recursion is called
        else:
            input_list = [{'name': first_value}]
            return input_list
print(split(input_list))
output:
[{'name': '1.exe', 'children':
[{'name': '2.exe', 'children':
[{'name': '3.exe', 'children':
[{'name': '4.exe'}]}]}]}]
or:
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    if input_list:
        head, *tail = input_list  # a nicer way of writing head, tail = input_list[0], input_list[1:]
        if split(tail) != []:
            return [{'name': head, 'children': split(tail)}]
        else:
            return [{'name': head}]
    else:
        return []  # an empty list, so the != [] check above detects the end of the input
print(split(input_list))
Convert from Python to JSON:
import json
# a Python object (dict):
x = {
    "name": "John",
    "age": 30,
    "city": "New York"
}
# convert into JSON:
y = json.dumps(x)
# the result is a JSON string:
print(y)
JSON is a syntax for storing and exchanging data. If you have a Python object, you can convert it into a JSON string by using the json.dumps() method.
import json
input_list=['1.exe','2.exe','3.exe','4.exe']
def split(input_list):
    try:
        first_value = input_list[0]
        # Wrap each level in a list so that 'children' matches the requested output format.
        input_list = [{'name': first_value, 'children': split(input_list[1:])} if split(input_list[1:]) != [] else {'name': first_value}]
        return input_list
    except IndexError:  # empty list: nothing left to nest
        return input_list
data = split(input_list)
print(json.dumps(data))
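The recursive versions above call split twice at each level, which repeats work as the list grows. As an alternative, here is a minimal iterative sketch that builds the structure from the last element backwards. The function name nest is an assumption, not taken from the answers.

```python
import json

def nest(names):
    """Build [{'name': n0, 'children': [{'name': n1, ...}]}] from a flat list."""
    node = []  # children of the entry currently being built (empty for the last one)
    for name in reversed(names):
        entry = {'name': name}
        if node:
            entry['children'] = node
        node = [entry]
    return node

input_list = ['1.exe', '2.exe', '3.exe', '4.exe']
out = nest(input_list)
print(json.dumps(out))
```

An empty input yields an empty list, and each name is visited exactly once.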

Validating trello board API responses in Python unittest

I am writing a unittest that queries the trello board API and want to assert that a particular card exists.
The first attempt used the /1/boards/[board_id]/lists request, which gives results like:
[{'cards': [
    {'id': 'id1', 'name': 'item1'},
    {'id': 'id2', 'name': 'item2'},
    {'id': 'id3', 'name': 'item3'},
    {'id': 'id4', 'name': 'item4'},
    {'id': 'id5', 'name': 'item5'},
    {'id': 'id6', 'name': 'item6'}],
  'id': 'id7',
  'name': 'ABC'},
 {'cards': [], 'id': 'id8', 'name': 'DEF'},
 {'cards': [], 'id': 'id9', 'name': 'GHI'}]
I want to assert that 'item6' is indeed in the list above. I tried loading the JSON and using assertTrue, like this:
element = [item for item in json_data if item['name'] == "item6"]
self.assertTrue(element)
but I receive an error: TypeError: the JSON object must be str, bytes or bytearray, not 'list'.
I then discovered that the /1/boards/[board_id]/cards request gives a plain list of cards:
[
    {'id': 'id1', 'name': 'item1'},
    {'id': 'id2', 'name': 'item2'},
    ...
]
How should I write this unittest assertion?
The neatest option is to create a class that will equal the dict for the card you want to ensure is there, then use that in an assertion. For your example, with a list of cards returned over the api:
cards = board.get_cards()
self.assertIn(Card(name="item6"), cards)
Here's a reasonable implementation of the Card() helper class. It may look a little complex, but it is mostly straightforward:
class Card(object):
    """Class that matches a dict with card details from json api response."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        if isinstance(other, dict):
            return other.get("name", None) == self.name
        return NotImplemented
    def __repr__(self):
        # Shown in assertion failure messages; only uses attributes that exist.
        return "{}(name={!r})".format(
            self.__class__.__name__, self.name)
You could add more fields to validate as needed.
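One way to add more fields is to accept arbitrary keyword arguments and compare each of them against the dict. This is only a sketch: CardMatcher and the sample cards list are hypothetical and not part of any Trello client library.

```python
import unittest

class CardMatcher:
    """Equals any dict whose entries match all the given keyword fields."""

    def __init__(self, **fields):
        self.fields = fields

    def __eq__(self, other):
        if isinstance(other, dict):
            return all(other.get(key) == value for key, value in self.fields.items())
        return NotImplemented

    def __repr__(self):
        args = ", ".join("{}={!r}".format(k, v) for k, v in self.fields.items())
        return "{}({})".format(self.__class__.__name__, args)

# Sample data in the shape of the /cards response from the question.
cards = [{'id': 'id1', 'name': 'item1'}, {'id': 'id6', 'name': 'item6'}]

case = unittest.TestCase()
case.assertIn(CardMatcher(name='item6'), cards)
case.assertIn(CardMatcher(name='item6', id='id6'), cards)
case.assertNotIn(CardMatcher(name='item6', id='id1'), cards)
```

Extra keys in the response dict are ignored, so the matcher stays valid if the API returns more fields than the test cares about.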
One question worth touching on at this point is whether the unit test should be making real api queries. Generally a unit test would have test data to just focus on the function you control, but perhaps this is really an integration test for your trello deployment using the unittest module?
import unittest
from urllib.request import urlopen
import json
class Basic(unittest.TestCase):
    url = 'https://api.trello.com/1/boards/[my_id]/cards?fields=id,name,idList,url&key=[my_key]&token=[my_token]'
    response = urlopen(url)
    resp = response.read()
    json_ob = json.loads(resp)
    el_list = [item for item in json_ob if item['name'] == 'card6']
    def testBasic(self):
        self.assertTrue(self.el_list)
if __name__ == '__main__':
    unittest.main()
What I did wrong: I focused too much on the nested list that I got from the following code:
import requests
from pprint import pprint
import json
url = "https://api.trello.com/1/boards/[my_id]/lists"
params = {"cards":"open","card_fields":"name","fields":"name","key":"[my_key]","token":"[my_token]"}
response = requests.get(url=url, params=params)
pprint(response.json())
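The nested /lists response can still be asserted against if the cards are flattened first. Note that response.json() already returns Python lists and dicts, so passing that result to json.loads again would raise the TypeError quoted in the question. The sketch below uses shortened sample data from the question instead of a live request; card_names and ListsResponseTest are illustrative names.

```python
import unittest

# Sample data copied from the /lists response in the question (shortened).
lists_response = [
    {'cards': [{'id': 'id1', 'name': 'item1'},
               {'id': 'id6', 'name': 'item6'}],
     'id': 'id7', 'name': 'ABC'},
    {'cards': [], 'id': 'id8', 'name': 'DEF'},
]

def card_names(board_lists):
    """Collect the names of all cards across every list on the board."""
    return [card['name'] for board_list in board_lists for card in board_list['cards']]

class ListsResponseTest(unittest.TestCase):
    def test_item6_present(self):
        self.assertIn('item6', card_names(lists_response))

# Run the test case directly instead of via unittest.main(), which would exit.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ListsResponseTest)
result = unittest.TextTestRunner().run(suite)
```

Using fixed sample data keeps the test independent of the network, in line with the earlier remark about unit tests versus integration tests.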

Invalid JSON format from Facebook Graph API

I am using the Graph API to fetch ad audience information. When I tried https://graph.facebook.com/act_adaccountid/customaudiences?fields=
but when I tried it through a program, I am getting an invalid JSON format:
from urllib2 import urlopen
from simplejson import loads
x = loads(urlopen('https://graph.facebook.com/act_adaccountid/customaudiences?fields=<comma_separate_list_of_fields>&access_token=XXXXXXXXXX').read())
output:
{'paging': {'cursors': {'after': 'NjAxMDE5ODE5NjgxMw==', 'before': 'NjAxNTAzNDkwOTAxMw=='}}, 'data': [{'account_id': 1377346239145180L, 'id': '6015034909013'}, {'account_id': 1377346239145180L, 'id': '6015034901213'}, {'account_id': 1377346239145180L, 'id': '6015034901013'}, {'account_id': 1377346239145180L, 'id': '6015034900413'}
{'data': []}
expected output:
http://pastebin.com/5265tJ8w
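The output shown above has single-quoted keys and an L suffix on large integers, which is how Python 2 prints a dict that has already been parsed, not how JSON text looks. If that reading is right, the data itself is fine and only the display differs. The sketch below, written for Python 3 with a made-up response body, shows the difference between printing the parsed object and re-serializing it with json.dumps.

```python
import json

# Made-up response body in the shape shown in the question.
body = '{"data": [{"account_id": 1377346239145180, "id": "6015034909013"}]}'

parsed = json.loads(body)           # Python dict, as returned by loads()
as_python_repr = repr(parsed)       # what print(parsed) displays
as_json_text = json.dumps(parsed)   # valid JSON again

print(as_python_repr)
print(as_json_text)
```

Only the second form can be pasted into a JSON validator or sent to another service that expects JSON.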