JSON column in Pandas dataframe - Parsing and splitting

I have a pandas DataFrame with TEDx talks as items (rows) and a column 'ratings' holding JSON-formatted strings like the ones below. (The column depicts how each talk was described by the audience.)
[{"id": 7, "name": "Funny", "count": 19645}, {"id": 1, "name": "Beautiful", "count": 4573}, {"id": 9, "name": "Ingenious", "count": 6073}, ..........]
[{"id": 7, "name": "Funny", "count": 544}, {"id": 3, "name": "Courageous", "count": 139}, {"id": 2, "name": "Confusing", "count": 62}, {"id": 1, "name": "Beautiful", "count": 58}, ........]
Obviously the order of the descriptive words is not the same for each item (TEDx talk). Each word has an id (the same across all talks) and a per-talk count.
I want to manipulate the data and extract three new integer columns for the counts of funny, inspiring, and confusing, storing the count of each of those words for the respective talk.
Among other things, I tried this:
df['ratings'] = df['ratings'].map(lambda x: dict(eval(x)))
In return I get this error:
File "C:/Users/Paul/Google Drive/WEEK4/ted-talks/w4e1.py", line 30, in
df['ratings'] = df['ratings'].map(lambda x: dict(eval(x)))
ValueError: dictionary update sequence element #0 has length 3; 2 is required
I've been trying several different approaches, but haven't been able to even get values out of the JSON-formatted column properly. Any suggestions?

You can use a list comprehension to flatten the data, converting each string repr to a list of dicts with ast.literal_eval, which is a safer choice than eval. (The error above comes from calling dict() on a list of dicts: dict() expects a sequence of key/value pairs, and each element here is a dict of length 3, hence "length 3; 2 is required".)
import pandas as pd
import ast
df = pd.DataFrame({'ratings': ['[{"id": 7, "name": "Funny", "count": 19645}, {"id": 1, "name": "Beautiful", "count": 4573}, {"id": 9, "name": "Ingenious", "count": 6073}]', '[{"id": 7, "name": "Funny", "count": 544}, {"id": 3, "name": "Courageous", "count": 139}, {"id": 2, "name": "Confusing", "count": 62}, {"id": 1, "name": "Beautiful", "count": 58}]']})
print (df)
                                             ratings
0  [{"id": 7, "name": "Funny", "count": 19645}, {...
1  [{"id": 7, "name": "Funny", "count": 544}, {"i...
df1 = pd.DataFrame([y for x in df['ratings'] for y in ast.literal_eval(x)])
print (df1)
   id        name  count
0   7       Funny  19645
1   1   Beautiful   4573
2   9   Ingenious   6073
3   7       Funny    544
4   3  Courageous    139
5   2   Confusing     62
6   1   Beautiful     58
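Note that flattening this way loses which talk each row came from. To get the three per-talk integer columns the question asks for, one option is to pivot each parsed list into a name -> count mapping; a minimal sketch, assuming df is the frame above:
import ast
import pandas as pd

# parse each string into a list of dicts, then pivot to name -> count per talk
parsed = df['ratings'].apply(ast.literal_eval)
wide = pd.DataFrame([{d['name']: d['count'] for d in row} for row in parsed])

# keep only the three requested names; a name absent from a talk becomes 0
wide = wide.reindex(columns=['Funny', 'Inspiring', 'Confusing'], fill_value=0)
wide = wide.fillna(0).astype(int)

df[['funny', 'inspiring', 'confusing']] = wide.to_numpy()
The reindex call guarantees all three columns exist even when a name (here Inspiring) never occurs in the sample data.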

Related

Concatenate folder of multiple newline-delimited JSON files into single file

We have a directory /our_jsons that has the files:
file1.json
{"team": 1, "leagueId": 1, "name": "the ballers"}
{"team": 2, "leagueId": 1, "name": "the hoopers"}
file2.json
{"team": 3, "leagueId": 1, "name": "the gamerrs"}
{"team": 4, "leagueId": 1, "name": "the drivers"}
file3.json
{"team": 5, "leagueId": 1, "name": "the jumpers"}
{"team": 6, "leagueId": 1, "name": "the riserss"}
and we need to stack these into a single file output_file.json, which simply has all of the JSONs in our directory stacked on top of one another:
output_file.json
{"team": 1, "leagueId": 1, "name": "the ballers"}
{"team": 2, "leagueId": 1, "name": "the hoopers"}
{"team": 3, "leagueId": 1, "name": "the gamerrs"}
{"team": 4, "leagueId": 1, "name": "the drivers"}
{"team": 5, "leagueId": 1, "name": "the jumpers"}
{"team": 6, "leagueId": 1, "name": "the riserss"}
Is this possible to do with a bash command on Mac / Linux? We're hoping this is easier than combining ordinary JSONs, because these are NDJSONs and the files truly just need to be stacked on top of one another. Our full data is much larger (~10GB split over 100+ newline-delimited JSON files), and we're hoping for a decently performant (under 2-5 minutes) solution if possible. I just installed jq and am reading its docs, and will update if we come up with a solution.
EDIT:
It looks like jq . our_jsons/* > output_file.json concatenates the JSONs; however, the output is not NDJSON but rather an ordinary (and invalid) JSON file...
cat tmp/* | jq -c '.' > tmp/output_file.json appears to get the job done!
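For what it's worth, because NDJSON only needs byte-level stacking, plain cat our_jsons/* > output_file.json should also work and avoids parsing entirely; piping through jq -c '.' buys validation at the cost of speed. A streaming Python equivalent (file names are assumptions), which also guards against blank lines and files missing a trailing newline:
import glob

# stream every NDJSON file into one output; no JSON parsing needed
with open('output_file.json', 'w', encoding='utf8') as out:
    for path in sorted(glob.glob('our_jsons/*.json')):
        with open(path, encoding='utf8') as src:
            for line in src:
                line = line.rstrip('\n')
                if line:                      # skip blank lines
                    out.write(line + '\n')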

Convert types from strings to ints in newline delimited JSON file

According to How to preserve integer data type when exporting to JSON?, it is not currently possible to preserve integer types when exporting from BigQuery to JSON. This minor detail about BigQuery --> GCS JSON exports has been causing us many problems. The result of one of our table exports is a newline-delimited JSON that looks like this:
{"leagueId": "1", "name": "the ballers"}
{"team": "2", "leagueId": "1", "name": "the hoopers"}
{"team": "3", "leagueId": "1", "name": "the gamerrs"}
{"team": "4", "leagueId": "1", "name": "the drivers"}
{"team": "5", "leagueId": "1", "name": "the jumpers"}
{"team": "6", "leagueId": "1", "name": "the riserss"}
team and leagueId should both be ints, and we'd like to modify this NDJSON, converting these strings back into ints. The output we're going for is:
{"leagueId": 1, "name": "the ballers"}
{"team": 2, "leagueId": 1, "name": "the hoopers"}
{"team": 3, "leagueId": 1, "name": "the gamerrs"}
{"team": 4, "leagueId": 1, "name": "the drivers"}
{"team": 5, "leagueId": 1, "name": "the jumpers"}
{"team": 6, "leagueId": 1, "name": "the riserss"}
Assuming we know / have a list/array of the columns that need to be converted from strings into ints [team, leagueId], how can we do this conversion? Is this possible with (a) a bash command using a tool like jq, or (b) is there some python solution? Our full NDJSON is ~10GB in size, and performance is important as this is a step in our daily data-ingestion pipeline.
Edit: How to convert a string to an integer in a JSON file using jq? - I'm trying to use this post to help. I've come up with jq '.team | tonumber' tmp/testNDJSON.json, but this simply returns 1 2 3 4 5 6, not an updated JSON, and it only handles one key, not multiple keys.
Edit 2: jq -c '{leagueId: .leagueId | tonumber, team: .team | tonumber, name: .name}' tmp/testNDJSON.json > tmp/new_output.json would work if not for the missing team value in the first JSON... getting closer.
You can use if:
jq -c 'if .team then {leagueId: .leagueId | tonumber, team: .team | tonumber, name: .name}
else {leagueId: .leagueId | tonumber, name: .name} end '
More on conditionals: https://stedolan.github.io/jq/manual/v1.6/#ConditionalsandComparisons
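If the list of keys grows, a streaming Python pass is another option; a minimal sketch, assuming the file names from the question and that every listed key, when present, holds a numeric string:
import json

INT_KEYS = ['team', 'leagueId']   # columns to convert when present

with open('tmp/testNDJSON.json', encoding='utf8') as src, \
     open('tmp/new_output.json', 'w', encoding='utf8') as out:
    for line in src:
        if not line.strip():
            continue                          # skip blank lines
        record = json.loads(line)
        for key in INT_KEYS:
            if key in record:
                record[key] = int(record[key])
        out.write(json.dumps(record) + '\n')
Because this processes one line at a time, memory stays flat even on a ~10GB file.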

For and if/else in Python to check and add information to JSON

Hello, I'm having trouble writing a for loop with if/else in Python. I need the if to check whether a product has a "quantity" entry; if it does, leave it as it is, else add "quantity": 0.
I want my for loop to check that "quantity" is present and, if it's not, to add it. But I have no idea how to write this for/if/else combination:
data = json.load(json_data)
for product in data:
    if product["quantity"] in data
    else 'w' product["quantity": 0]
Hopefully it will then write out the result with this:
with open('br2.json', 'w', encoding='utf8') as json_data:
    json_data.write(json.dumps(data, ensure_ascii=False))
    json_data.close()
I want it to go over a JSON file like this:
[{"id": 2162952, "name": "Kit Gamer acer - Notebook + Headset + Mouse",
"price": 25599.0, "category": "Eletrônicos"},
{"id": 3500957, "name": "Monitor 29 LG FHD Ultrawide com 1000:1 de
contraste", "quantity": 18, "price": 1559.4, "category":
"Eletrônicos"},
{"id": 1911864, "name": "Mouse Gamer Predator cestus 510 Fox Preto",
"price": 699.0, "category": "Acessórios"}]
And return it like this:
[{"id": 2162952, "name": "Kit Gamer acer - Notebook + Headset +
Mouse","quantity": 0, "price": 25599.0, "category": "Eletrônicos"},
{"id": 3500957, "name": "Monitor 29 LG FHD Ultrawide com 1000:1 de
contraste", "quantity": 18, "price": 1559.4, "category": "Eletrônicos"},
{"id": 1911864, "name": "Mouse Gamer Predator cestus 510 Fox Preto",
"price": 699.0, "category": "Acessórios"}]
If product is a dictionary, you can check whether the key "quantity" is missing from product and, in that case, add that key with value 0:
if "quantity" not in product:
product["quantity"] = 0

How to convert a regular dataframe into JSON?

I've seen different conversions done on Stack, and none of them have the results I need. I have a data frame that I imported from an Excel file, manipulated, and want to export as a JSON file. I have tried this:
exportJson <- toJSON(data)
print(exportJson)
write(exportJson, "test.json")
json_data <- fromJSON(file="test.json")
My data looks like this:
       Jill Jimmie   Alex Jane
Jill   Jill      0   Jill Jill
Jimmie    0 Jimmie Jimmie    0
Alex      0   Alex   Alex    0
Jane   Jane   Jane   Jane    0
My output looks like this:
{
"Jill": ["Jill",
"0",
"0",
"Jane",
"0",
"0",
"0",
"0",
"0",
"0",
...
when I need it to be in this format:
{
"nodes": [
{
"id": "id1",
"name": "Jill",
"val": 1
},
{
"id": "id2",
"name": "Jill",
"val": 10
},
(...)
],
"links": [
{
"source": "id1",
"target": "id2"
},
(...)
]
}
I've seen ways of converting JSON to a dataframe, and I am aware of RJSONIO, jsonlite, rjson, etc. I've googled it, and maybe I am just missing an obvious answer.
The '.' command in jq will reformat the JSON data. Using the jqr package:
library(jqr)
# Unformatted (no whitespace)
x <- '{"a":1,"b":2,"c":[1,2,3],"d":{"e":1,"f":2}}'
jq(x, '.')
Output reformatted (with whitespace)
{
"a": 1,
"b": 2,
"c": [
1,
2,
3
],
"d": {
"e": 1,
"f": 2
}
}
jq is also available as a standalone utility: https://stedolan.github.io/jq/
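Note that the jq pass above only re-prints the JSON; it does not produce the nodes/links shape the question asks for. As an illustration of that target structure (sketched in Python since the shape is language-neutral; in R one would build the same nested lists and call jsonlite::toJSON), with the matrix hard-coded from the question:
import json

names = ["Jill", "Jimmie", "Alex", "Jane"]
matrix = [["Jill", 0, "Jill", "Jill"],
          [0, "Jimmie", "Jimmie", 0],
          [0, "Alex", "Alex", 0],
          ["Jane", "Jane", "Jane", 0]]

# one node per person; "val" is a placeholder weight (assumption)
nodes = [{"id": "id%d" % (i + 1), "name": n, "val": 1}
         for i, n in enumerate(names)]

# one link per non-zero cell, row index -> column index
links = [{"source": "id%d" % (i + 1), "target": "id%d" % (j + 1)}
         for i, row in enumerate(matrix)
         for j, cell in enumerate(row) if cell != 0]

print(json.dumps({"nodes": nodes, "links": links}, indent=2))
How val should be computed isn't specified in the question, so the 1 above is a placeholder.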

Python pandas: convert CSV to nested JSON up to level "n"?

I want to convert a CSV file into nested JSON, up to n levels. I am using the code below, from this link, to get the desired output, but I am getting an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-109-89d84e9a61bf> in <module>()
40 # make a list of keys
41 keys_list = []
---> 42 for item in d['children']:
43 keys_list.append(item['name'])
44
TypeError: 'datetime.date' object has no attribute '__getitem__'
Below is the code:
# CSV 2 flare.json
# convert a csv file to flare.json for use with many D3.js viz's
# This script outputs a flare.json file with 2 levels of nesting.
# For additional nested layers, add them in lines 32 - 47
# sample: http://bl.ocks.org/mbostock/1283663
# author: Andrew Heekin
# MIT License
import pandas as pd
import json

df = pd.read_csv('file.csv')

# choose columns to keep, in the desired nested json hierarchical order
df = df[['Group','year','quarter']]

# order in the groupby here matters, it determines the json nesting
# the groupby call makes a pandas series by grouping 'the_parent' and
# 'the_child', while summing the numerical column 'child_size'
df1 = df.groupby(['Group','year','quarter'])['quarter'].count()
df1 = df1.reset_index(name="count")
#print df1.head()

# start a new flare.json document
flare = dict()
flare = {"name":"flare", "children": []}

#df1['year'] = [str(yr) for yr in df1['year']]
for line in df1.values:
    the_parent = line[0]
    the_child = line[1]
    child_size = line[2]

    # make a list of keys
    keys_list = []
    for item in d['children']:
        keys_list.append(item['name'])

    # if 'the_parent' is NOT a key in the flare.json yet, append it
    if not the_parent in keys_list:
        d['children'].append({"name": the_parent, "children": [{"name": the_child, "size": child_size}]})
    # if 'the_parent' IS a key in the flare.json, add a new child to it
    else:
        d['children'][keys_list.index(the_parent)]['children'].append({"name": the_child, "size": child_size})

flare = d

# export the final result to a json file
with open('flare.json', 'w') as outfile:
    json.dump(flare, outfile)
Expected output in the format below:
{
"name": "stock",
"children": [
{"name": "fruits",
"children": [
{"name": "berries",
"children": [
{"count": 20, "name": "blueberry"},
{"count": 70, "name": "cranberry"},
{"count": 96, "name": "raspberry"},
{"count": 140, "name": "strawberry"}]
},
{"name": "citrus",
"children": [
{"count": 20, "name": "grapefruit"},
{"count": 120, "name": "lemon"},
{"count": 50, "name": "orange"}]
},
{"name": "dried fruit",
"children": [
{"count": 25, "name": "dates"},
{"count": 10, "name": "raisins"}]
}]
},
{"name": "vegtables",
"children": [
{"name": "green leaf",
"children": [
{"count": 19, "name": "cress"},
{"count": 18, "name": "spinach"}]
},
{
"name": "legumes",
"children": [
{"count": 27, "name": "beans"},
{"count": 12, "name": "chickpea"}]
}]
}]
}
Could anyone please help me resolve this error?
Thanks
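For what it's worth, the traceback points at d, which the script never defines: in a notebook, d can be left over from an earlier cell (here apparently a datetime.date), so d['children'] fails. A minimal sketch of the fix is to bind d to the flare dict before the loop (or replace d with flare throughout), keeping everything else as in the question:
flare = {"name": "flare", "children": []}
d = flare   # the missing binding; d previously held a datetime.date

for line in df1.values:
    the_parent = line[0]
    the_child = line[1]
    child_size = line[2]

    # names already present at this level
    keys_list = [item['name'] for item in d['children']]

    if the_parent not in keys_list:
        d['children'].append({"name": the_parent,
                              "children": [{"name": the_child, "size": child_size}]})
    else:
        d['children'][keys_list.index(the_parent)]['children'].append(
            {"name": the_child, "size": child_size})

flare = d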