Python pandas: convert CSV to nested JSON up to level "n"?

I want to convert a CSV file into nested JSON, up to n levels. I am using the code below (taken from this link) to get the desired output, but I am getting an error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-109-89d84e9a61bf> in <module>()
     40     # make a list of keys
     41     keys_list = []
---> 42     for item in d['children']:
     43         keys_list.append(item['name'])
     44
TypeError: 'datetime.date' object has no attribute '__getitem__'
Below is the code:
# CSV 2 flare.json
# convert a csv file to flare.json for use with many D3.js viz's
# This script outputs a flare.json file with 2 levels of nesting.
# For additional nested layers, add them in lines 32 - 47
# sample: http://bl.ocks.org/mbostock/1283663
# author: Andrew Heekin
# MIT License
import pandas as pd
import json
df = pd.read_csv('file.csv')
# choose columns to keep, in the desired nested json hierarchical order
df = df[['Group','year','quarter']]
# order in the groupby here matters, it determines the json nesting
# the groupby call makes a pandas Series by grouping on the parent and child columns and counting rows to produce the 'count' column
df1 = df.groupby(['Group','year','quarter'])['quarter'].count()
df1 = df1.reset_index(name = "count")
#print df1.head()
# start a new flare.json document
flare = dict()
flare = {"name":"flare", "children": []}
#df1['year'] = [str(yr) for yr in df1['year']]
for line in df1.values:
    the_parent = line[0]
    the_child = line[1]
    child_size = line[2]

    # make a list of keys
    keys_list = []
    for item in d['children']:
        keys_list.append(item['name'])

    # if 'the_parent' is NOT a key in the flare.json yet, append it
    if not the_parent in keys_list:
        d['children'].append({"name":the_parent, "children":[{"name":the_child, "size":child_size}]})
    # if 'the_parent' IS a key in the flare.json, add a new child to it
    else:
        d['children'][keys_list.index(the_parent)]['children'].append({"name":the_child, "size":child_size})

flare = d

# export the final result to a json file
with open('flare.json', 'w') as outfile:
    json.dump(flare, outfile)
Expected output in the format below:
{
    "name": "stock",
    "children": [
        {"name": "fruits",
         "children": [
             {"name": "berries",
              "children": [
                  {"count": 20, "name": "blueberry"},
                  {"count": 70, "name": "cranberry"},
                  {"count": 96, "name": "raspberry"},
                  {"count": 140, "name": "strawberry"}]
             },
             {"name": "citrus",
              "children": [
                  {"count": 20, "name": "grapefruit"},
                  {"count": 120, "name": "lemon"},
                  {"count": 50, "name": "orange"}]
             },
             {"name": "dried fruit",
              "children": [
                  {"count": 25, "name": "dates"},
                  {"count": 10, "name": "raisins"}]
             }]
        },
        {"name": "vegtables",
         "children": [
             {"name": "green leaf",
              "children": [
                  {"count": 19, "name": "cress"},
                  {"count": 18, "name": "spinach"}]
             },
             {"name": "legumes",
              "children": [
                  {"count": 27, "name": "beans"},
                  {"count": 12, "name": "chickpea"}]
             }]
        }]
}
Could anyone please help me resolve this error?
Thanks
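In the snippet as posted, d is never assigned before the loop, so the inner for item in d['children'] reads whatever d already holds in the notebook session (apparently a leftover datetime.date here, hence the TypeError). Given the flare = d at the end, the loop was presumably meant to build the tree in a dict named d; a minimal sketch of the missing initialisation, assuming the rest of the loop stays exactly as posted:

flare = {"name": "flare", "children": []}
d = flare   # the loop appends to d['children']; flare = d at the end is then redundant

Note also that df1 as built above has four columns (Group, year, quarter, count), so line[2] is the quarter value rather than the count; grouping by only ['Group', 'year'] would make line[2] the count again, matching the two-level structure described in the script's header.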

Related

Multiple objects of an array create different columns in the CSV file

Here is my JSON example. When I convert the JSON to a CSV file, it creates a separate column for each object of the reviews array; the column names come out like serial, name.0, rating.0, _id.0, name.1, rating.1, _id.1. How can I convert it to a CSV file where only serial, name, rating and _id are the column names, and every object of reviews is put on a different row?
[{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        },
    ],
}]
I am trying to load the CSV file into pandas. If that is not possible, is there any way to solve the problem using the pandas package in Python?
I suggest you use pandas for the CSV export only, and process the JSON data by flattening the data structure first so that the result can then be easily loaded into a pandas DataFrame.
Try:
data_python = [{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        },
    ],
}]

from collections import defaultdict
from pprint import pprint
import pandas as pd

dct_flat = defaultdict(list)
for dct in data_python:
    for dct_reviews in dct["reviews"]:
        dct_flat['serial'].append(dct['serial'])
        for key, value in dct_reviews.items():
            dct_flat[key].append(value)
#pprint(data_python)
#pprint(dct_flat)

df = pd.DataFrame(dct_flat)
print(df)
df.to_csv("data.csv")
which gives:
                     serial     name  rating                       _id
0  63708940a8d291c502be815f  shadman       4  6373d4eb50cff661989f3d83
1  63708940a8d291c502be815f   niloy1       3  6373d59450cff661989f3db8
and
,serial,name,rating,_id
0,63708940a8d291c502be815f,shadman,4,6373d4eb50cff661989f3d83
1,63708940a8d291c502be815f,niloy1,3,6373d59450cff661989f3db8
as CSV file content.
Notice that the JSON you provided in your question can't be loaded from a file or a string in Python, neither with the json module nor with pandas, because it is not valid JSON. See below for corrected, valid JSON data:
valid_json_data = '''\
[{
    "serial": "63708940a8d291c502be815f",
    "reviews": [
        {
            "name": "shadman",
            "rating": 4,
            "_id": "6373d4eb50cff661989f3d83"
        },
        {
            "name": "niloy1",
            "rating": 3,
            "_id": "6373d59450cff661989f3db8"
        }
    ]
}]
'''
and code for loading this data from a JSON file:
import json

json_file = "data.json"
with open(json_file) as f:
    data_json = f.read()
data_python = json.loads(data_json)
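As an alternative, assuming a reasonably recent pandas where pd.json_normalize is available, the flattening can be left to pandas itself; a minimal sketch using the data_python list from above (column order may differ slightly from the defaultdict version):

import pandas as pd

# expand every entry of the 'reviews' list into its own row and
# carry the parent 'serial' value along as an extra column
df = pd.json_normalize(data_python, record_path="reviews", meta=["serial"])
print(df)
df.to_csv("data.csv")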

Json column in Pandas dataframe - Parsing and splitting

I have a dataframe with TEDx talks as items (rows) that has a column 'ratings' in JSON format, like this (the column depicts how the talk was described by the audience):
[{"id": 7, "name": "Funny", "count": 19645}, {"id": 1, "name": "Beautiful", "count": 4573}, {"id": 9, "name": "Ingenious", "count": 6073}, ..........]
[{"id": 7, "name": "Funny", "count": 544}, {"id": 3, "name": "Courageous", "count": 139}, {"id": 2, "name": "Confusing", "count": 62}, {"id": 1, "name": "Beautiful", "count": 58}, ........]
Obviously the order of the descriptive words is not standard/the same for each item (TEDx talk). Each word has an id (the same across all talks) and a count for each talk.
I am interested in manipulating the data and extracting three new integer columns for the counts of funny, inspiring and confusing, storing the count of each of those words for the respective talk.
Among other things, I tried this:
df['ratings'] = df['ratings'].map(lambda x: dict(eval(x)))
in return i get this error
File "C:/Users/Paul/Google Drive/WEEK4/ted-talks/w4e1.py", line 30, in
df['ratings'] = df['ratings'].map(lambda x: dict(eval(x)))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
I have been trying several different ways, but haven't even been able to get the values out of the JSON-formatted column properly. Any suggestions?
You can use a list comprehension with flattening and convert the string repr to a list of dicts with ast.literal_eval, which is a better (safer) solution than eval:
import pandas as pd
import ast
df = pd.DataFrame({'ratings': ['[{"id": 7, "name": "Funny", "count": 19645}, {"id": 1, "name": "Beautiful", "count": 4573}, {"id": 9, "name": "Ingenious", "count": 6073}]', '[{"id": 7, "name": "Funny", "count": 544}, {"id": 3, "name": "Courageous", "count": 139}, {"id": 2, "name": "Confusing", "count": 62}, {"id": 1, "name": "Beautiful", "count": 58}]']})
print (df)
                                             ratings
0  [{"id": 7, "name": "Funny", "count": 19645}, {...
1  [{"id": 7, "name": "Funny", "count": 544}, {"i...
df1 = pd.DataFrame([y for x in df['ratings'] for y in ast.literal_eval(x)])
print (df1)
   id        name  count
0   7       Funny  19645
1   1   Beautiful   4573
2   9   Ingenious   6073
3   7       Funny    544
4   3  Courageous    139
5   2   Confusing     62
6   1   Beautiful     58
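If you additionally want one integer column per rating name and one row per talk, as asked in the question, a sketch that builds on the same ast.literal_eval parsing; the helper column name talk is just an illustration here, and 'Inspiring' would be selected the same way if it occurs in your data:

import ast
import pandas as pd

# flatten, remembering which talk (row of df) each rating dict came from
rows = [{'talk': i, **rating}
        for i, x in enumerate(df['ratings'])
        for rating in ast.literal_eval(x)]
ratings_long = pd.DataFrame(rows)

# one row per talk, one column per rating name, counts as values
counts = ratings_long.pivot_table(index='talk', columns='name',
                                  values='count', fill_value=0)
print(counts[['Funny', 'Confusing']])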

Parsing and cleaning text file in Python?

I have a text file which contains raw data. I want to parse that data and clean it so that it can be used further. The following is the raw data:
"{\x0A \x22identifier\x22: {\x0A \x22company_code\x22: \x22TSC\x22,\x0A \x22product_type\x22: \x22airtime-ctg\x22,\x0A \x22host_type\x22: \x22android\x22\x0A },\x0A \x22id\x22: {\x0A \x22type\x22: \x22guest\x22,\x0A \x22group\x22: \x22guest\x22,\x0A \x22uuid\x22: \x221a0d4d6e-0c00-11e7-a16f-0242ac110002\x22,\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22\x0A },\x0A \x22stats\x22: [\x0A {\x0A \x22timestamp\x22: \x222017-03-22T03:21:11+0000\x22,\x0A \x22software_id\x22: \x22A-ACTG\x22,\x0A \x22action_id\x22: \x22open_app\x22,\x0A \x22values\x22: {\x0A \x22device_id\x22: \x22423e49efa4b8b013\x22,\x0A \x22language\x22: \x22en\x22\x0A }\x0A }\x0A ]\x0A}"
I want to remove all the hexadecimal escape characters. I tried parsing the data, storing it in an array and cleaning it with re.sub(), but it gives back the same data.
for line in f:
    new_data = re.sub(r'[^\x00-\x7f],\x22', r'', line)
    data.append(new_data)
\x0A is the hex code for newline. After s = <your json string>, print(s) gives
>>> print(s)
{
    "identifier": {
        "company_code": "TSC",
        "product_type": "airtime-ctg",
        "host_type": "android"
    },
    "id": {
        "type": "guest",
        "group": "guest",
        "uuid": "1a0d4d6e-0c00-11e7-a16f-0242ac110002",
        "device_id": "423e49efa4b8b013"
    },
    "stats": [
        {
            "timestamp": "2017-03-22T03:21:11+0000",
            "software_id": "A-ACTG",
            "action_id": "open_app",
            "values": {
                "device_id": "423e49efa4b8b013",
                "language": "en"
            }
        }
    ]
}
You should parse this with the json module's load (from a file) or loads (from a string) function. You will get a dict containing two dicts and a list with one dict.
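A minimal sketch of that, assuming s already holds the decoded text shown by print(s) above:

import json

parsed = json.loads(s)   # dict with the keys 'identifier', 'id' and 'stats'

print(parsed['identifier']['company_code'])       # TSC
print(parsed['stats'][0]['values']['language'])   # en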

Save series of dictionaries to json file and retrieve it later

So I am doing some work in Python where I have to generate a series of dictionaries. I want to write each of these dictionaries to a single file.
The code to write the dictionaries looks like this:
with open('some_name.json', 'w') as fh:
    data = function_generate_dict()  # returns a dictionary
    json.dump(data, fh)
That works fine; I can view the output file and can even load its contents like this:
with open('some_name.json', 'r+') as rh:
    for line in rh.readlines():
        print(line)
But when I try to reload each dictionary from the file by doing this
with open('some_name', 'r') as rh:
    cont = rh.read()
    js = json.loads(cont)
I always get a JSONDecodeError: Extra data: line 1 column 220 (char 219)
which I suspect is coming from where one dictionary ends and another begins.
If I do this (json.load() instead of json.loads())
with open('some_name', 'r') as rh:
    cont = rh.read()
    js = json.load(cont)
I get this error AttributeError: 'str' object has no attribute 'read'
I have even tried using jsonl as the file format. But it doesn't work.
Here is a sample of the dictionaries I'm generating
{"measure_no": "0", "divisions": "256", "fifths": "5", "mode": "major", "beats": "4", "beat-type": "4", "transpose": "-9", "step": ["G"], "alter": ["1"], "octave": ["6"], "duration": ["256"], "syllabic": [], "text": []}{"measure_no": "1", "divisions": "256", "fifths": "5", "mode": "major", "beats": "4", "beat-type": "4", "transpose": "-9", "step": ["G", "G", "G", "G"], "alter": ["1", "1", "1", "1"], "octave": ["6", "6", "6", "6"], "duration": ["384", "128", "256", "256"], "syllabic": [], "text": []}{"measure_no": "2", "divisions": "256", "fifths": "5", "mode": "major", "beats": "4", "beat-type": "4", "transpose": "-9", "step": ["C", "G", "G"], "alter": ["1", "1", "1"], "octave": ["7", "6", "6"], "duration": ["384", "128", "512"], "syllabic": [], "text": []}
Your code dumps each dictionary onto the end of the existing line. This is 1) hard to read for human beings and 2) hard to read for the "readlines" function. Add a newline after you dump each dictionary.
with open('some_name.json', 'a') as fh:
    ''' note the use of 'a' instead of 'w': you want to append your
        dictionaries, not overwrite them every time '''
    data = function_generate_dict()  # returns a dictionary
    json.dump(data, fh)
    fh.write('\n')
Then, when you want to read from the file back into dictionaries, you can do it with a loop which reads every line as a separate JSON dictionary. The way I would see it is to make an empty list first to store every line:
jsonlist = []
'''make a list to store the json dictionaries'''
with open('some_name', 'r') as rh:
    for line in rh.readlines():
        jsonlist.append(json.loads(line))
You now have the variable jsonlist, in which each index is one of your JSON dictionaries; all that is left is to manipulate those indexes.
>>> jsonlist[0]
{'measure_no': '0', 'divisions': '256', 'fifths': '5', 'mode': 'major', 'beats': '4', 'beat-type': '4', 'transpose': '-9', 'step': ['G'], 'alter': ['1'], 'octave': ['6'], 'duration': ['256'], 'syllabic': [], 'text': []}
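If the file can get large, a small variation on the reading loop above does the same thing while iterating the file object directly instead of calling readlines, and skips blank lines:

import json

jsonlist = []
with open('some_name', 'r') as rh:
    for line in rh:
        line = line.strip()
        if line:                         # ignore empty lines
            jsonlist.append(json.loads(line))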

Format JSON file in R: lexical error with character encoding

I have data that I retrieved from a server in JSON format. I now want to pre-process these data in R.
My raw .json file (if opened in a text editor) looks like this:
{"id": 1,"data": "{\"unid\":\"wU6993\",\"age\":\"21\",\"origin\":\"Netherlands\",\"biling\":\"2\",\"langs\":\"Dutch\",\"selfrating\":\"80\",\"selfarrest\":\"20\",\"condition\":1,\"fly\":\"2\",\"flytime\":0,\"purpose\":\"na\",\"destin\":\"Madrid\",\"txtQ1\":\"I\'m flying to Madrid to catch up with friends.\"}"}
I want to parse it back to its intended format for further use:
{
    "id": 1,
    "data": {
        "unid": "wU6993",
        "age": "21",
        "origin": "Netherlands",
        "biling": "2",
        "langs": "Dutch",
        "selfrating": "80",
        "selfarrest": "20",
        "condition": 1,
        "fly": "2",
        "flytime": 0,
        "purpose": "na",
        "destin": "Madrid",
        "txtQ1": "I'm flying to Madrid to catch up with friends."
    }
}
Using jsonlite I can't read it in at all:
parsed = jsonlite::fromJSON(txt = 'exp1.json')
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
lexical error: inside a string, '\' occurs before a character which it may not.
in\":\"Madrid\",\"txtQ1\":\"I\'m flying to Madrid to catch u
(right here) ------^
I think the error tells me that some characters are escaped that should not have been.
How can I solve this and read my file?
You have extra quotes around the nested braces defining "data", the value of which is actually stored as one huge string instead of valid JSON. Take them out, and
my_json <- '{"id": 1,"data": "{\"unid\":\"wU6993\",\"age\":\"21\",\"origin\":\"Netherlands\",\"biling\":\"2\",\"langs\":\"Dutch\",\"selfrating\":\"80\",\"selfarrest\":\"20\",\"condition\":1,\"fly\":\"2\",\"flytime\":0,\"purpose\":\"na\",\"destin\":\"Madrid\",\"txtQ1\":\"I\'m flying to Madrid to catch up with friends.\"}"}'
my_json <- sub('"\\{', '\\{', my_json)
my_json <- sub('\\}"', '\\}', my_json)
jsonlite::prettify(my_json)
# {
#     "id": 1,
#     "data": {
#         "unid": "wU6993",
#         "age": "21",
#         "origin": "Netherlands",
#         "biling": "2",
#         "langs": "Dutch",
#         "selfrating": "80",
#         "selfarrest": "20",
#         "condition": 1,
#         "fly": "2",
#         "flytime": 0,
#         "purpose": "na",
#         "destin": "Madrid",
#         "txtQ1": "I'm flying to Madrid to catch up with friends."
#     }
# }