How to read a JSON file into a pandas MultiIndex DataFrame?

I have a JSON file which looks like this:
{'accounting': [{'firstName': 'John',
                 'lastName': 'De',
                 'age': 29,
                 'PhNumber': 253435221},
                {'firstName': 'Mary',
                 'lastName': 'Smith',
                 'age': 38,
                 'PhNumber': 5766546221}],
 'sales': [{'firstName': 'Sally',
            'lastName': 'Green',
            'age': 29,
            'PhNumber': 63546433221},
           {'firstName': 'Jim',
            'lastName': 'Galley',
            'age': 48,
            'PhNumber': 3566648322}]}
How can I read this into a pandas MultiIndex DataFrame with columns
(accounting, firstName), (accounting, lastName), (accounting, age),
(accounting, PhNumber), (sales, firstName), (sales, lastName), (sales, age), (sales, PhNumber)?

Use a dictionary comprehension with the DataFrame constructor:
import json
import pandas as pd

with open('myJson.json') as data_file:
    d = json.load(data_file)

df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()})
        .unstack(0)
        .swaplevel(1, 0, axis=1)
        .sort_index(axis=1))
print(df)
accounting sales
PhNumber age firstName lastName PhNumber age firstName lastName
0 253435221 29 John De 63546433221 29 Sally Green
1 5766546221 38 Mary Smith 3566648322 48 Jim Galley

A simpler approach specifies the axis directly in the concat call. This avoids the unstacking and sorting and keeps the original column order.
import json
import pandas as pd

with open('myJson.json') as data_file:
    d = json.load(data_file)

df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)

import pandas as pd

df = pd.read_json('my_json.json')
# Long format instead of MultiIndex columns: one row per employee,
# with the department added as an extra column
df = pd.concat(
    [pd.DataFrame(df[col].tolist()).assign(department=col) for col in df.columns],
    ignore_index=True,
)
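
Whichever route you take, once the columns form a MultiIndex you can slice by department. A minimal sketch, assuming the same myJson.json file and the concat approach from above:
import json
import pandas as pd

with open('myJson.json') as data_file:
    d = json.load(data_file)

# MultiIndex columns: level 0 is the department, level 1 the original field
df = pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)

print(df['accounting'])        # plain DataFrame for one department
print(df[('sales', 'age')])    # single Series for one (department, field) pair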

Related

Dynamically Flatten JSON response from API gives one Huge row

I am trying to dynamically flatten a JSON response from an API request, but I am getting only one row with all the records in it. Kindly assist or point me in the right direction.
My JSON response looks like this:
import requests, json
URL='https://data.calgary.ca/resource/848s-4m4z.json'
data = json.loads(requests.get(URL).text)
data
[{'sector': 'NORTH',
  'community_name': 'THORNCLIFFE',
  'group_category': 'Crime',
  'category': 'Theft FROM Vehicle',
  'count': '9',
  'resident_count': '8474',
  'date': '2018-03-01T12:00:00.000',
  'year': '2018',
  'month': 'MAR',
  'id': '2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9',
  'geocoded_column': {'latitude': '51.103099554741',
                      'longitude': '-114.068779421169',
                      'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
  ':#computed_region_4a3i_ccfj': '2',
  ':#computed_region_p8tp_5dkv': '4',
  ':#computed_region_4b54_tmc4': '2',
  ':#computed_region_kxmf_bzkv': '192'},
 {'sector': 'SOUTH',
  'community_name': 'WOODBINE',
  'group_category': 'Crime',
  'category': 'Theft FROM Vehicle',
  'count': '3',
  'resident_count': '8866',
  'date': '2019-11-01T00:00:00.000',
  'year': '2019',
  'month': 'NOV',
  'id': '2019-NOV-WOODBINE-Theft FROM Vehicle-3',
  'geocoded_column': {'latitude': '50.939610852207664',
                      'longitude': '-114.12962865374453',
                      'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'},
  ':#computed_region_4a3i_ccfj': '1',
  ':#computed_region_p8tp_5dkv': '6',
  ':#computed_region_4b54_tmc4': '5',
  ':#computed_region_kxmf_bzkv': '43'}
]
Here is my code
import pandas as pd

# Function for flattening json
def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        # If the nested key-value pair is of dict type
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        # If the nested key-value pair is of list type
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

# Driver code
# print(flatten_json(data))
newf = flatten_json(data)
pd.json_normalize(newf)
It returns everything in one huge row (screenshot: https://i.stack.imgur.com/i6mUe.png), while I am expecting one row per record (screenshot: https://i.stack.imgur.com/mXNtU.png).
json_normalize gives me the data in the expected format, but I need a way to dynamically (programmatically) parse JSON responses with different structures.
To get your dataframe into the correct form you can use this example (data is the list from your question):
import requests
import pandas as pd
from ast import literal_eval

url = "https://data.calgary.ca/resource/848s-4m4z.json"
df = pd.DataFrame(requests.get(url).json())

# Expand the nested geocoded_column dict into its own columns
df = pd.concat(
    [
        df,
        df.pop("geocoded_column").apply(pd.Series).add_prefix("geocoded_column_"),
    ],
    axis=1,
)

# human_address is itself a JSON string; parse it and expand it as well
df["geocoded_column_human_address"] = df["geocoded_column_human_address"].apply(
    literal_eval
)
df = pd.concat(
    [
        df,
        df.pop("geocoded_column_human_address").apply(pd.Series).add_prefix("addr_"),
    ],
    axis=1,
)

print(df.head().to_markdown(index=False))
Prints:
| sector | community_name | group_category | category | count | resident_count | date | year | month | id | :#computed_region_4a3i_ccfj | :#computed_region_p8tp_5dkv | :#computed_region_4b54_tmc4 | :#computed_region_kxmf_bzkv | geocoded_column_latitude | geocoded_column_longitude | addr_address | addr_city | addr_state | addr_zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NORTH | THORNCLIFFE | Crime | Theft FROM Vehicle | 9 | 8474 | 2018-03-01T12:00:00.000 | 2018 | MAR | 2018-MAR-THORNCLIFFE-Theft FROM Vehicle-9 | 2 | 4 | 2 | 192 | 51.1031 | -114.069 |  |  |  |  |
| SOUTH | WOODBINE | Crime | Theft FROM Vehicle | 3 | 8866 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WOODBINE-Theft FROM Vehicle-3 | 1 | 6 | 5 | 43 | 50.9396 | -114.13 |  |  |  |  |
| SOUTH | WILLOW PARK | Crime | Theft FROM Vehicle | 4 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Theft FROM Vehicle-4 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 |  |  |  |  |
| SOUTH | WILLOW PARK | Crime | Commercial Robbery | 1 | 5328 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-WILLOW PARK-Commercial Robbery-1 | 3 | 5 | 6 | 89 | 50.9566 | -114.056 |  |  |  |  |
| WEST | LINCOLN PARK | Crime | Commercial Break & Enter | 5 | 2617 | 2019-11-01T00:00:00.000 | 2019 | NOV | 2019-NOV-LINCOLN PARK-Commercial Break & Enter-5 | 1 | 2 | 8 | 42 | 51.0101 | -114.13 |  |  |  |  |
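
If the requirement is mainly to avoid hand-written flattening, pd.json_normalize applied to the list of records (rather than to a pre-flattened dict) may be enough on its own; it expands the nested geocoded_column dict per record, so you get one row per record. A rough sketch (the human_address field would remain a JSON string unless parsed separately, e.g. with the literal_eval step above):
import requests
import pandas as pd

url = "https://data.calgary.ca/resource/848s-4m4z.json"
records = requests.get(url).json()

# Flatten each record's nested dicts into columns such as geocoded_column_latitude
flat = pd.json_normalize(records, sep="_")
print(flat.shape)
print(flat.columns.tolist())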

how to convert a list of dataframe to json in python

I want to convert the DataFrames below to JSON.
Salary:
        Balance before Salary   Salary
Date
Jun-18                  27.20  15300.0
Jul-18                  88.20  15300.0
Aug-18                 176.48  14783.0
Sep-18                  48.48  16249.0
Oct-18                 241.48  14448.0
Nov-18                  49.48  15663.0
Balance:
            Balance
Date
Jun-18  3580.661538
Jul-18  6817.675556
Aug-18  7753.483077
Sep-18  5413.868421
Oct-18  5996.120000
Nov-18  8276.805000
Dec-18  9269.000000
I tried:
dfs = [Salary, Balance]
dfs.to_json("path/test.json")
but it gives me an error:
AttributeError: 'list' object has no attribute 'to_json'
but when I tried it on a single DataFrame, I got the following result:
{"Balance before Salary":{"Jun-18":27.2,"Jul-18":88.2,"Aug-18":176.48,"Sep-18":48.48,"Oct-18":241.48,"Nov-18":49.48},"Salary":{"Jun-18":15300.0,"Jul-18":15300.0,"Aug-18":14783.0,"Sep-18":16249.0,"Oct-18":14448.0,"Nov-18":15663.0}}
You can use the to_json method.
From the docs:
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
... index=['row 1', 'row 2'],
... columns=['col 1', 'col 2'])
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
Use concat to build one DataFrame (the DataFrames need matching index values for alignment) and then convert it to JSON:
import numpy as np
import pandas as pd

dfs = [check_Salary_date, sum_Salary]
df = pd.concat(dfs, axis=1, keys=np.arange(len(dfs)))
df.columns = ['{}{}'.format(b, a) for a, b in df.columns]
df.to_json("path/test.json")
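
If you would rather keep each DataFrame separate in the output instead of concatenating them, a minimal sketch (assuming the Salary and Balance frames from the question) is to serialize each frame with to_json and store it under its own top-level key:
import json

dfs = {"Salary": Salary, "Balance": Balance}

# One top-level key per DataFrame in the resulting JSON file
combined = {name: json.loads(frame.to_json()) for name, frame in dfs.items()}

with open("path/test.json", "w") as f:
    json.dump(combined, f, indent=2)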

Convert pandas columns to comma separated lists to be used in sql statements

I have a dataframe and I am trying to turn one of its columns into a comma-separated list. The end goal is to pass this comma-separated list as a list of filter items in a SQL query.
How do I go about doing this?
import pandas as pd

mydata = [{'id': 'jack', 'b': 87, 'c': 1000},
          {'id': 'jill', 'b': 55, 'c': 2000},
          {'id': 'july', 'b': 5555, 'c': 22000}]
df = pd.DataFrame(mydata)
df
Expected result - note the quotes around the ids, since they are strings, but not around the items in column 'b', since that is a numerical field and that is how SQL expects them. I would then eventually send a query like
select * from mytable where ids in (my_ids) or values in (my_values):
my_ids = 'jack', 'jill', 'july'
my_values = 87, 55, 5555
I encountered a similar issue and solved it in one line using values and tolist():
df['col_name'].values.tolist()
So in your case, it will be
my_ids = df['id'].values.tolist()   # ['jack', 'jill', 'july']
my_values = df['b'].values.tolist() # [87, 55, 5555]
Let's use apply with the argument reduce=False (deprecated in later pandas versions), check the dtype of each column, and pass the proper argument to join:
df.apply(lambda x: ', '.join(x.astype(str)) if x.dtype=='int64' else ', '.join("\'"+x.astype(str)+"\'"), reduce=False)
Output:
b 87, 55, 5555
c 1000, 2000, 22000
id 'jack', 'jill', 'july'
dtype: object
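If you need the literal strings to splice into the SQL text (parameterized queries are the safer option when your driver supports them), a small sketch using the df from the question:
# Quote the string ids; leave the numeric values unquoted
my_ids = ", ".join("'{}'".format(i) for i in df['id'])
my_values = ", ".join(str(v) for v in df['b'])

query = "select * from mytable where id in ({}) or b in ({})".format(my_ids, my_values)
print(query)
# select * from mytable where id in ('jack', 'jill', 'july') or b in (87, 55, 5555)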

Convert R data table column from JSON to data table

I have a column that contains JSON data as in the following example,
library(data.table)
test <- data.table(a = list(1, 2, 3),
                   info = list("{'duration': '10', 'country': 'US'}",
                               "{'duration': '20', 'country': 'US'}",
                               "{'duration': '30', 'country': 'GB', 'width': '20'}"))
I want to convert the last column to the equivalent R storage, which would look similar to:
res <- data.table(a = list(1, 2, 3),
                  duration = list(10, 20, 30),
                  country = list('US', 'US', 'GB'),
                  width = list(NA, NA, 20))
Since I have 500K rows with varying contents, I am looking for a fast way to do this.
A variation without the need to separate out the JSON strings first:
library(data.table)
library(jsonlite)
test[, info := gsub("'", "\"", info)]
test[, rbindlist(lapply(info, fromJSON), use.names = TRUE, fill = TRUE)]
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20
Parse the JSON first, then build the data.frame (or data.table):
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
                       "{'duration': '20', 'country': 'US'}",
                       "{'duration': '30', 'country': 'GB'}",
                       "{'width': '20'}]"), collapse=", ")
# JSON standard requires double quotes
json_string <- gsub("'", "\"", json_string)
library("jsonlite")
fromJSON(json_string)
# duration country width
# 1 10 US <NA>
# 2 20 US <NA>
# 3 30 GB <NA>
# 4 <NA> <NA> 20
This isn't exactly what you asked for, as this JSON doesn't associate 'width' with the previous record; you might need to do some manipulation first:
json_string <- paste(c("[{'duration': '10', 'country': 'US'}",
                       "{'duration': '20', 'country': 'US'}",
                       "{'duration': '30', 'country': 'GB', 'width': '20'}]"),
                     collapse=", ")
json_string <- gsub("'", "\"", json_string)
df <- jsonlite::fromJSON(json_string)
data.table::as.data.table(df)
# duration country width
# 1: 10 US NA
# 2: 20 US NA
# 3: 30 GB 20

Manipulating data in CSV

Using Python 3 and numpy, I am trying to read and manipulate a CSV. My intent is to find all buildings that are over 50,000 square feet, the data for which is in column 6. The interpreter returns an error stating, "Line # (got 1 columns instead of 11)." I think that my issue is registering the data type as a string, but I have tried different data types and cannot get the script to work.
import numpy as np
dataframe = np.genfromtxt('buildingsv1.csv', dtype=str, skip_header=1, delimiter="none",usecols=(6))
headers = next(dataframe)
for row in dataframe:
    if 50000 in row(6):
        print(row)
np.savetxt('buildingsv2')
SOLUTION (using Pandas instead of Numpy)
import pandas as pd
total_df = pd.read_csv('buildingsv1.csv', keep_default_na=False, na_values=[""])
# Build a new DataFrame of 4 columns
subset_df = total_df[['PARCELID', 'KIVAPIN', 'ADDRESS', 'APN']]
# Keep buildings of at least 50,000 square feet
large_df = total_df[total_df.sqft >= 50000]
A version of the raw dataset is available. I am using a desktop version with machine-readable headings and more columns.
Here's a general idea using Pandas (which is built on Numpy).
import pandas as pd
import numpy as np
# I generated df below but you'd want to read the data with pd.read_csv() like so
#df = pd.read_csv('buildingsv1.csv')
df = pd.DataFrame(np.random.rand(10, 6)*100000,
                  columns=['Column'+str(i) for i in range(1, 7)])
new_df = df[df['Column6'] >= 50000]
It's good practice to check dtypes in Pandas using df.dtypes. Your data will need to be numeric first to filter over 50,000.
If your numeric data has commas (ex: 50,000), it can be problematic. Here's an example with a column that contains commas.
>>> df1 = pd.DataFrame({'Other Data': [2, 3, 44, 5, 65, 6], 'Commas1': [' 68,028,616 ', ' 162,470,071 ', ' 135,393,045 ', ' 89,981,894 ', ' 74,787,888 ', ' 173,610,498 ']})
>>> df1
Commas1 Other Data
0 68,028,616 2
1 162,470,071 3
2 135,393,045 44
3 89,981,894 5
4 74,787,888 65
5 173,610,498 6
>>> df1.dtypes
Commas1 object
Other Data int64
dtype: object
One way to convert the Commas1 column is to use a regex (pass regex=True explicitly, since recent pandas versions no longer treat the pattern as a regular expression by default):
df1['Commas1'] = df1['Commas1'].str.replace(r'[^\d\.]', '', regex=True).astype('int64')
>>> df1
Commas1 Other Data
0 68028616 2
1 162470071 3
2 135393045 44
3 89981894 5
4 74787888 65
5 173610498 6
>>> df1.dtypes
Commas1 int64
Other Data int64
dtype: object
The takeaway is that Commas1 has been converted to an integer dtype in this example. You can change int64 to float64 if you need floats instead of ints.
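Tying this back to the original task, a minimal sketch of the whole flow: read the file, clean the square-footage column, filter, and save. The column name sqft is an assumption carried over from the pandas solution above.
import pandas as pd

df = pd.read_csv('buildingsv1.csv')

# Strip thousands separators if present, then coerce to numeric
# (non-numeric cells become NaN instead of raising)
df['sqft'] = pd.to_numeric(
    df['sqft'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)

large = df[df['sqft'] >= 50000]
large.to_csv('buildingsv2.csv', index=False)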
Here's a sample run with a comma-delimited CSV (using numpy).
Simulate a file with a list of lines.
In [168]: txt="""name, val1, val2, val3
me, 23, 34, 34
you, 34, 22, 35
he, 22, 66, 66
she, 36,32,36
"""
In [169]: txt=txt.splitlines()
Load with genfromtxt:
In [170]: data = np.genfromtxt(txt,dtype=None, delimiter=',')
In [171]: data
Out[171]:
array([['name', ' val1', ' val2', ' val3'],
['me', ' 23', ' 34', ' 34'],
['you', ' 34', ' 22', ' 35'],
['he', ' 22', ' 66', ' 66'],
['she', ' 36', '32', '36']],
dtype='|S5')
Oops, it loaded strings, because the first line contains the column names.
Skip the first line:
In [174]: data = np.genfromtxt(txt,dtype=None, skip_header=1,delimiter=',')
In [175]: data
Out[175]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('f0', 'S3'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
It deduced the column types correctly, but gave them generic names. Use names=True to take the column names from the file's header line:
In [176]: data = np.genfromtxt(txt,dtype=None, names=True,delimiter=',')
In [177]: data
Out[177]:
array([('me', 23, 34, 34), ('you', 34, 22, 35), ('he', 22, 66, 66),
('she', 36, 32, 36)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
data is a 1d array, with 4 records; the fields of those records are defined in the dtype.
Now we can display rows from this array according to some column criteria:
In [179]: for row in data:
   .....:     if row['val2']>32:
   .....:         print(row)
   .....:
('me', 23, 34, 34)
('he', 22, 66, 66)
One record:
In [181]: data[0]
Out[181]: ('me', 23, 34, 34)
One field (column):
In [182]: data['name']
Out[182]:
array(['me', 'you', 'he', 'she'],
dtype='|S3')
Those selected values can be collected into a new array with an expression like:
In [205]: data1=data[data['val2']>32]
In [206]: data1
Out[206]:
array([('me', 23, 34, 34), ('he', 22, 66, 66)],
dtype=[('name', 'S3'), ('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4')])
Writing a matching CSV isn't quite so nice with numpy. It has savetxt, which writes the data in columns, but you have to specify the format and header yourself.
In [207]: header='name, val1, val2, val3'
In [208]: fmt='%10s, %4d, %4d, %4d'
In [209]: np.savetxt('test.csv',data1, fmt=fmt,header=header)
In [210]: cat test.csv
# name, val1, val2, val3
'me', 23, 34, 34
'he', 22, 66, 66