Convert R data table column from JSON to data table

I have a column that contains JSON data as in the following example,
library(data.table)
test <- data.table(a = list(1,2,3),
info = list("{'duration': '10', 'country': 'US'}",
"{'duration': '20', 'country': 'US'}",
"{'duration': '30', 'country': 'GB', 'width': '20'}"))
I want to convert the last column to equivalent R storage, which would look similar to,
res <- data.table(a = list(1, 2, 3),
duration = list(10, 20, 30),
country = list('US', 'US', 'GB'),
width = list(NA, NA, 20))
Since I have 500K rows with varying contents, I am looking for a fast way to do this.

A variation without the need to separate out the JSON string
library(data.table)
library(jsonlite)
test[, info := gsub("'", "\"", info)]
test[, rbindlist(lapply(info, fromJSON), use.names = TRUE, fill = TRUE)]
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20

Parse the JSON first, then build the data.frame (or data.table):
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
"{'duration': '20', 'country': 'US'}",
"{'duration': '30', 'country': 'GB'}",
"{'width': '20'}]"), collapse=", ")
# JSON standard requires double quotes
json_string <- gsub("'", "\"", json_string)
library("jsonlite")
fromJSON(json_string)
# duration country width
# 1 10 US <NA>
# 2 20 US <NA>
# 3 30 GB <NA>
# 4 <NA> <NA> 20
This isn't exactly what you asked for, because this JSON doesn't associate 'width' with the previous record. You might need to do some manipulation first so that 'width' is part of the third record:
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
"{'duration': '20', 'country': 'US'}",
"{'duration': '30', 'country': 'GB', 'width': '20'}]"),
collapse=", ")
json_string <- gsub("'", "\"", json_string)
df <- jsonlite::fromJSON(json_string)
data.table::as.data.table(df)
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20

Related

Dynamically Flatten JSON response from API gives one Huge row

I am trying to dynamically flatten a JSON response from an API request, but I get back only one row containing all the records. Kindly assist or point me in the right direction.
My json response looks like this
import requests, json
URL='https://data.calgary.ca/resource/848s-4m4z.json'
data = json.loads(requests.get(URL).text)
data
[{'sector': 'NORTH',
'community_name': 'THORNCLIFFE',
'group_category': 'Crime',
'category': 'Theft FROM Vehicle',
'count': '9',
'resident_count': '8474',
'date': '2018-03-01T12:00:00.000',
'year': '2018',
'month': 'MAR',
'id': '2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9',
'geocoded_column': {'latitude': '51.103099554741',
'longitude': '-114.068779421169',
'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
':#computed_region_4a3i_ccfj': '2',
':#computed_region_p8tp_5dkv': '4',
':#computed_region_4b54_tmc4': '2',
':#computed_region_kxmf_bzkv': '192'},
{'sector': 'SOUTH',
'community_name': 'WOODBINE',
'group_category': 'Crime',
'category': 'Theft FROM Vehicle',
'count': '3',
'resident_count': '8866',
'date': '2019-11-01T00:00:00.000',
'year': '2019',
'month': 'NOV',
'id': '2019-NOV-WOODBINE-Theft FROM Vehicle-3',
'geocoded_column': {'latitude': '50.939610852207664',
'longitude': '-114.12962865374453',
'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
':#computed_region_4a3i_ccfj': '1',
':#computed_region_p8tp_5dkv': '6',
':#computed_region_4b54_tmc4': '5',
':#computed_region_kxmf_bzkv': '43'}
]
Here is my code
import pandas as pd

# Function for flattening JSON
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        # If the nested key-value pair is of dict type
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        # If the nested key-value pair is of list type
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# Driver code
# print(flatten_json(data))
newf = flatten_json(data)
pd.json_normalize(newf)
It returns
[enter image description here](https://i.stack.imgur.com/i6mUe.png)
while I am expecting the data in the following format
[enter image description here](https://i.stack.imgur.com/mXNtU.png).
json_normalize gives me the data in the expected format, but I need a way to parse different JSON response formats dynamically (programmatically).
To get your dataframe in correct form you can use this example (data is your list from the question):
import requests
import pandas as pd
from ast import literal_eval
url = "https://data.calgary.ca/resource/848s-4m4z.json"
df = pd.DataFrame(requests.get(url).json())
df = pd.concat(
[
df,
df.pop("geocoded_column")
.apply(pd.Series)
.add_prefix("geocoded_column_"),
],
axis=1,
)
df["geocoded_column_human_address"] = df["geocoded_column_human_address"].apply(
literal_eval
)
df = pd.concat(
[
df,
df.pop("geocoded_column_human_address")
.apply(pd.Series)
.add_prefix("addr_"),
],
axis=1,
)
print(df.head().to_markdown(index=False))
Prints:
| sector | community_name | group_category | category | count | resident_count | date | year | month | id | :#computed_region_4a3i_ccfj | :#computed_region_p8tp_5dkv | :#computed_region_4b54_tmc4 | :#computed_region_kxmf_bzkv | geocoded_column_latitude | geocoded_column_longitude | addr_address | addr_city | addr_state | addr_zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NORTH | THORNCLIFFE | Crime | Theft FROM Vehicle | 9 | 8474 | 2018-03-01T12:00:00.000 | 2018 | MAR | 2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9 | 2 | 4 | 2 | 192 | 51.1031 | -114.069 | | | | |
| SOUTH | WOODBINE | Crime | Theft FROM Vehicle | 3 | 8866 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WOODBINE-Theft FROM Vehicle-3 | 1 | 6 | 5 | 43 | 50.9396 | -114.13 | | | | |
| SOUTH | WILLOW PARK | Crime | Theft FROM Vehicle | 4 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Theft FROM Vehicle-4 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 | | | | |
| SOUTH | WILLOW PARK | Crime | Commercial Robbery | 1 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Commercial Robbery-1 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 | | | | |
| WEST | LINCOLN PARK | Crime | Commercial Break & Enter | 5 | 2617 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-LINCOLN PARK-Commercial Break & Enter-5 | 1 | 2 | 8 | 42 | 51.0101 | -114.13 | | | | |
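A likely reason the code in the question returns a single row is that flatten_json is called on the whole list, so the list index becomes part of every key (0_sector, 1_sector, and so on). A minimal sketch of a per-record alternative, assuming each element of the response is one record. The two sample records are a trimmed copy of the data in the question, and the sep parameter is added here for illustration only:

```python
import pandas as pd


def flatten_json(record, sep="_"):
    """Flatten one nested record (dicts and lists) into a single-level dict."""
    out = {}

    def flatten(value, prefix=""):
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, prefix + key + sep)
        elif isinstance(value, list):
            for i, child in enumerate(value):
                flatten(child, prefix + str(i) + sep)
        else:
            # Drop the trailing separator from the accumulated key
            out[prefix[:-len(sep)]] = value

    flatten(record)
    return out


# Trimmed sample of the API response shown in the question
data = [
    {"sector": "NORTH", "count": "9",
     "geocoded_column": {"latitude": "51.103099554741",
                         "longitude": "-114.068779421169"}},
    {"sector": "SOUTH", "count": "3",
     "geocoded_column": {"latitude": "50.939610852207664",
                         "longitude": "-114.12962865374453"}},
]

# Flatten each record separately, so every record becomes one row
df = pd.DataFrame([flatten_json(rec) for rec in data])
print(df)
```

Because the function never looks at specific key names, the same call should work for other responses that are lists of records. Embedded JSON strings such as human_address would still need a separate parsing step.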

How to read a json to a pandas MultiIndex Dataframe?

I have a json format file which looks like this.
{'accounting': [{'firstName': 'John',
'lastName': 'De',
'age': 29,
'PhNumber': 253435221},
{'firstName': 'Mary',
'lastName': 'Smith',
'age': 38,
'PhNumber': 5766546221}],
'sales': [{'firstName': 'Sally',
'lastName': 'Green',
'age': 29,
'PhNumber': 63546433221},
{'firstName': 'Jim',
'lastName': 'Galley',
'age': 48,
'PhNumber': 3566648322}]}
How can I read this into a pandas MultiIndex DataFrame with columns
(accounting, firstName), (accounting, lastName), (accounting, age),
(accounting, PhNumber), (sales, firstName), (sales, lastName), (sales, age), (sales, PhNumber)?
Use dictionary comprehension with DataFrame constructor:
import json
import pandas as pd

with open('myJson.json') as data_file:
    d = json.load(data_file)
df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}).unstack(0).swaplevel(1,0, axis=1).sort_index(axis=1)
print (df)
accounting sales
PhNumber age firstName lastName PhNumber age firstName lastName
0 253435221 29 John De 63546433221 29 Sally Green
1 5766546221 38 Mary Smith 3566648322 48 Jim Galley
A simpler approach is to specify the axis directly in the concat call. This avoids the unstacking and sorting, and keeps the original column order.
import json
import pandas as pd

with open('myJson.json') as data_file:
    d = json.load(data_file)
df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
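To make the effect of axis=1 concrete, here is a small self-contained sketch that uses the dictionary from the question inline instead of reading myJson.json. It builds the frame and then selects from the resulting two-level columns:

```python
import pandas as pd

# Same structure as the JSON in the question, inlined for the example
d = {
    "accounting": [
        {"firstName": "John", "lastName": "De", "age": 29, "PhNumber": 253435221},
        {"firstName": "Mary", "lastName": "Smith", "age": 38, "PhNumber": 5766546221},
    ],
    "sales": [
        {"firstName": "Sally", "lastName": "Green", "age": 29, "PhNumber": 63546433221},
        {"firstName": "Jim", "lastName": "Galley", "age": 48, "PhNumber": 3566648322},
    ],
}

# The dict keys become the outer column level, the record keys the inner level
df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)

print(df.columns.tolist())
# Select a single column with a (department, field) tuple
print(df[("sales", "firstName")].tolist())
# Select all columns of one department
print(df["accounting"])
```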
import pandas as pd
df = pd.read_json('my_json.json')
df = pd.concat([pd.DataFrame(df.iloc[i,1]).assign(department=df.iloc[i,0]) for i in range(len(df))])

Splitting a pandas data frame's column containing json data into multiple columns

I loaded and normalized a json data as:
json_string = json.loads(data)
df_norm = json_normalize(json_string, errors='ignore')
Say it has now 2 columns:
Group Members
A [{'id':'1', 'metrics': '34', 'profile': 'abc'},{'id':'3',
'metrics': '32', 'profile': 'dc'}]
B [{'id':'2', 'metrics': '4', 'profile': 'bac'}]
I am looking for a method to split the 'Members' column and merge the result back into the original data frame under the same 'Group', like:
Group Members id metrics profile
A [{'id':'1', 'metrics': '34', 'profile': 'abc'},{'id':'3', 'metrics': '32', 'profile': 'dc'}] 1 34 abc
A [{'id':'1', 'metrics': '34', 'profile': 'abc'},{'id':'3', 'metrics': '32', 'profile': 'dc'}] 3 32 dc
B [{'id':'2', 'metrics': '4', 'profile': 'bac'}] 2 4 bac
Any help would be much appreciated.
Use:
import ast
#if necessary convert column to list of dicts
df['Members'] = df['Members'].apply(ast.literal_eval)
#create one DataFrame per row in a dict comprehension, keyed by the original index
df1 = pd.concat({k:pd.DataFrame(v) for k, v in df['Members'].items()})
#join to original
df = df.join(df1.reset_index(level=1, drop=True)).reset_index(drop=True)
print (df)
Group Members id metrics profile
0 A [{'id': '1', 'metrics': '34', 'profile': 'abc'... 1 34 abc
1 A [{'id': '1', 'metrics': '34', 'profile': 'abc'... 3 32 dc
2 B [{'id': '2', 'metrics': '4', 'profile': 'bac'}] 2 4 bac
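An alternative sketch, assuming pandas 1.0 or later (for DataFrame.explode and pd.json_normalize) and assuming 'Members' already holds lists of dicts rather than strings. The sample frame below is rebuilt from the question's data and drops the original 'Members' column from the result:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "B"],
    "Members": [
        [{"id": "1", "metrics": "34", "profile": "abc"},
         {"id": "3", "metrics": "32", "profile": "dc"}],
        [{"id": "2", "metrics": "4", "profile": "bac"}],
    ],
})

# One row per member dict; the Group value is repeated for each member
exploded = df.explode("Members").reset_index(drop=True)

# Turn the dicts into columns and attach them to the matching Group
members = pd.json_normalize(exploded["Members"].tolist())
result = exploded[["Group"]].join(members)
print(result)
```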

Convert EIA Json to DataFrame - Python 3.6

I am trying to convert the JSON file from http://api.eia.gov/bulk/INTL.zip to a DataFrame.
Below is my code
import os, sys,json
import pandas as pd
sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data']) # Delete if blank/NA
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]) # DF2.data contains list, converting to Data Frame
Error:-
Traceback (most recent call last):
File "D:\python\pyCharm\EIA\EIAINTL2018May.py", line 11, in <module>
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data])
File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2326, in __setitem__
self._setitem_array(key, value)
File "C:\Python36\lib\site-packages\pandas\core\frame.py", line 2350, in _setitem_array
raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key
I am stuck, please help with this.
I need a result like the one below, with Date and Value taken from the lists in the DF.data column:
DF2[['Date', 'Value']] = pd.DataFrame([item for item in DF2.data]).iloc[:,0:2] # This not working
New code after applying jezrael's solution:
import os, sys, ast
import pandas as pd
sourcePath = r"C:\sunil_plus\dataset\EIAINTL2018May\8_updation2018Aug2\source\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
DF_All = pd.DataFrame(); DF4 = pd.DataFrame()
for series_id in DF2['series_id']:
    DF3 = DF2.loc[DF2['series_id'] == series_id]
    DF4['DateF'] = [item for item in DF3.Date] # Here I need to convert List values to Rows
    DF4['ValuesF'] = [item for item in DF3.Values] # Here I need to convert List values to Rows
    # Above code not working as expected
    DF3 = DF3[['series_id', 'name', 'units', 'geography', 'f']] # Need only these columns
    DF5 = pd.concat([DF3, DF4], axis=1).ffill() # Concat to get DateF & ValuesF Values
    DF_All = DF_All.append(DF5)
You can use two list comprehensions to extract the first and second values of each nested list:
sourcePath = r"D:\Learn\EIA\INTL.txt"
DF = pd.read_json(sourcePath, lines=True)
DF2 = DF[['series_id', 'name', 'units', 'geography', 'f', 'data']] # Need only these columns
DF2 = DF2.dropna(subset=['data'])
DF2['Date'] = [[x[0] for x in item] for item in DF2.data]
DF2['Values'] = [[x[1] for x in item] for item in DF2.data]
print (DF2.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-SRB-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-SSD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-SUN-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-SVK-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f \
0 Million Metric Tons MKD A
1 Million Metric Tons SRB A
2 Million Metric Tons SSD A
3 Million Metric Tons SUN A
4 Million Metric Tons SVK A
data \
0 [[2015, 0.1], [2014, (s)], [2013, (s)], [2012,...
1 [[2015, 4.1], [2014, 3.5], [2013, 4.2], [2012,...
2 [[2011, --], [2010, --], [2006, --], [2003, --...
3 [[2006, --], [2003, --], [2002, --], [2001, --...
4 [[2015, 9.1], [2014, 8.8], [2013, 11], [2012, ...
Date \
0 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
1 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
2 [2011, 2010, 2006, 2003, 2002, 2001, 2000, 199...
3 [2006, 2003, 2002, 2001, 2000, 1999, 1998, 199...
4 [2015, 2014, 2013, 2012, 2011, 2010, 2009, 200...
Values
0 [0.1, (s), (s), 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, ...
1 [4.1, 3.5, 4.2, 5.2, 4.4, 4.1, 3.2, 4.2, 4.1, ...
2 [--, --, --, --, --, --, --, --, --, --, --, -...
3 [--, --, --, --, --, --, --, --, --, --, --, -...
4 [9.1, 8.8, 11, 10, 11, 12, 10, 12, 12, 13, 14,...
EDIT: You can repeat the rows, one per [Date, Value] pair, and create two new columns:
sourcePath = 'INTL.txt'
DF = pd.read_json(sourcePath, lines=True)
cols = ['series_id', 'name', 'units', 'geography', 'f', 'data']
DF2 = DF[cols].dropna(subset=['data'])
DF3 = DF2.join(pd.DataFrame(DF2.pop('data').values.tolist())
.stack()
.reset_index(level=1, drop=True)
.rename('data')
).reset_index(drop=True)
DF3[['Date', 'Value']] = pd.DataFrame(DF3['data'].values.tolist())
#if you want to remove the original data column
#DF3[['Date', 'Value']] = pd.DataFrame(DF3.pop('data').values.tolist())
print (DF3.head())
series_id name \
0 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
1 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
2 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
3 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
4 INTL.51-8-MKD-MMTCD.A CO2 Emissions from the Consumption of Natural ...
units geography f data Date Value
0 Million Metric Tons MKD A [2015, 0.1] 2015 0.1
1 Million Metric Tons MKD A [2014, (s)] 2014 (s)
2 Million Metric Tons MKD A [2013, (s)] 2013 (s)
3 Million Metric Tons MKD A [2012, 0.2] 2012 0.2
4 Million Metric Tons MKD A [2011, 0.2] 2011 0.2
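The ValueError in the question arises because pd.DataFrame([item for item in DF2.data]) creates one column per [Date, Value] pair in each row, not two columns. A shorter sketch of the row-repeating step, assuming pandas 0.25 or later for DataFrame.explode. The small frame and its series_id values are made up for illustration, and the conversion of placeholder strings such as (s) to NaN is an optional extra that the original answer does not perform:

```python
import pandas as pd

# Hypothetical miniature of the EIA data: each row holds a list of [date, value] pairs
df = pd.DataFrame({
    "series_id": ["S1", "S2"],
    "data": [[["2015", 0.1], ["2014", "(s)"]], [["2015", 4.1]]],
})

# explode unpacks one list level: one row per [date, value] pair
long_df = df.explode("data").reset_index(drop=True)

# Split each pair into its two parts
long_df["Date"] = [pair[0] for pair in long_df["data"]]
long_df["Value"] = [pair[1] for pair in long_df["data"]]

# Optional: turn non-numeric placeholders such as "(s)" into NaN
long_df["Value"] = pd.to_numeric(long_df["Value"], errors="coerce")
print(long_df.drop(columns="data"))
```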

iterate through a list of dicts and update quantities for items

I have a list of purchases fetched from a json API that looks like this:
[{'quantity': '7', 'productUuid': '12345', 'unitPrice': 1234, 'name': 'apple'}, {'quantity': '7', 'productUuid': '12346', 'unitPrice': 4321, 'name': 'orange'}, {'quantity': '5', 'productUuid': '12345', 'unitPrice': 1234, 'name': 'apple'}]
What I'd like to do is get the following output where productUuid are compared and quantities added for the same productUuid to reflect the total sales of said product:
[{'quantity': '12', 'productUuid': '12345', 'unitPrice': 1234, 'name': 'apple'}, {'quantity': '7', 'productUuid': '12346', 'unitPrice': 4321, 'name': 'orange'}]
I tried the following, copying the relevant keys and values into a list for (I thought) easier manipulation, but it doesn't work and I feel there's probably a way easier solution to my problem.
def sort_products(json_list):
    '''sorts by productUuid and merges identical products'''
    sorted_list = []
    # copy lines to a list for easier sorting
    for line in json_list:
        sorted_list.append([line['productUuid'], line['name'], line['quantity'], line['unitPrice']])
    # sort
    sorted_list.sort(key=lambda x: x[0])
    # merge
    merged_list = []
    for i, line in enumerate(sorted_list):
        merged_list.append(line)
        last_index = len(merged_list) - 1
        merged_list[i][2] = 0
        merged_list[i][3] = 0
        # copy if product not in merged_list and set
        if line[0] == merged_list[last_index][0]:
            merged_list[last_index][2] = int(merged_list[last_index][2]) + int(line[2])
            merged_list[last_index][3] = merged_list[last_index][3] + line[3]
        else:
            merged_list.append(line)
            merged_list[last_index][2] = 0
            merged_list[last_index][3] = 0
            merged_list[last_index][2] = int(merged_list[last_index][2]) + int(line[2])
            merged_list[last_index][3] = merged_list[last_index][3] + line[3]
Thanks for your suggestions!
Got it in the end by spending more time on it. I first created a list of uuids and then iterated through it to add the quantities.
def add_quantities(json_list):
    '''adds quantities for the same productUuid and returns a list of dicts with the total quantity for each product'''
    # create a list to store uuids and another to store purchase dicts
    uuids = []
    list_of_totals = []
    for purchase in json_list:
        if purchase['productUuid'] not in uuids:
            uuids.append(purchase['productUuid'])
    # iterate through list of uuids against dict and add quantities for corresponding uuids
    for uuid in uuids:
        uuid_qty = 0
        totals = {}
        for purchase in json_list:
            # compare with == so that one uuid cannot match as a substring of another
            if uuid == purchase['productUuid']:
                uuid_qty = uuid_qty + int(purchase['quantity'])
                unitPrice = purchase['unitPrice']
                name = purchase['name']
        totals.update({'productUuid': uuid, 'quantity': uuid_qty, 'unitPrice': unitPrice, 'name': name})
        list_of_totals.append(totals)
    return list_of_totals
Any ideas to make this better are welcome.
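One possible simplification is a single pass over the purchases with a dict keyed by productUuid. This is only a sketch: it assumes unitPrice and name are the same for every purchase of a given product, and it converts the total back to a string to match the desired output in the question.

```python
def add_quantities(purchases):
    """Sum quantities per productUuid, keeping first-seen order of products."""
    totals = {}
    for purchase in purchases:
        uuid = purchase["productUuid"]
        if uuid not in totals:
            # Copy the purchase so the input list is left untouched
            totals[uuid] = dict(purchase, quantity=0)
        totals[uuid]["quantity"] += int(purchase["quantity"])
    # Quantities are strings in the input, so return them as strings too
    return [dict(t, quantity=str(t["quantity"])) for t in totals.values()]


purchases = [
    {"quantity": "7", "productUuid": "12345", "unitPrice": 1234, "name": "apple"},
    {"quantity": "7", "productUuid": "12346", "unitPrice": 4321, "name": "orange"},
    {"quantity": "5", "productUuid": "12345", "unitPrice": 1234, "name": "apple"},
]
print(add_quantities(purchases))
```

The dict lookup replaces the nested loop over all uuids, so each purchase is visited once.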