How to make a dataframe table from a JSON file

I have a JSON file that I want to convert to a pandas DataFrame, taking some variables from 'tags' and some variables from 'fields':
{
    "tags": {
        "ID": "99909",
        "type": "fff",
        "ID2": "565789"
    },
    "timestamp": 1500079519064,
    "tenant": "dxy",
    "tstable": "data",
    "user": "writer",
    "fields": {
        "a": "0.003",
        "b": "0.011"
    }
}
Required output:
df_out=pd.DataFrame({'ID':[99909],'type':["fff"],'ID2':[565789],"timestamp": [1500079519064],"tenant": ["dxy"],"tstable": ["data"],"user": ["writer"],"a": ["0.003"],"b": ["0.011"]})
print(df_out)
ID type ID2 timestamp tenant tstable user a b
0 99909 fff 565789 1500079519064 dxy data writer 0.003 0.011

Use json_normalize (exposed as pd.json_normalize since pandas 1.0; the old pandas.io.json import is deprecated):
j = {
    "tags": {
        "ID": "99909",
        "type": "fff",
        "ID2": "565789"
    },
    "timestamp": 1500079519064,
    "tenant": "dxy",
    "tstable": "data",
    "user": "writer",
    "fields": {
        "a": "0.003",
        "b": "0.011"
    }
}
import pandas as pd

df = pd.json_normalize(j)
print(df)
timestamp tenant tstable user tags.ID tags.type tags.ID2 fields.a \
0 1500079519064 dxy data writer 99909 fff 565789 0.003
fields.b
0 0.011
Finally, if necessary, strip the prefixes from the column names with rename:
f = lambda x: x.split('.')[-1]
df = pd.json_normalize(j).rename(columns=f)
print(df)
timestamp tenant tstable user ID type ID2 a b
0 1500079519064 dxy data writer 99909 fff 565789 0.003 0.011

If you have nested columns then you first need to normalize the data:
import pandas as pd
data = [
    {
        "tags": {
            "ID": "99909",
            "type": "fff",
            "ID2": "565789"
        },
        "timestamp": 1500079519064,
        "tenant": "dxy",
        "tstable": "data",
        "user": "writer",
        "fields": {
            "a": "0.003",
            "b": "0.011"
        }
    }
]
df = pd.json_normalize(data)  # json_normalize already returns a DataFrame, so from_dict is unnecessary
print(df)
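Since the question starts from a file on disk, here is a minimal sketch (the filename data.json is an assumption) that loads the object with json.load before normalizing:
import json
import pandas as pd

# 'data.json' is a hypothetical file holding the object shown above
with open('data.json') as fh:
    j = json.load(fh)

# Flatten the nested 'tags' and 'fields' objects into columns,
# then strip the 'tags.' / 'fields.' prefixes
df = pd.json_normalize(j).rename(columns=lambda c: c.split('.')[-1])
print(df)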


Send a large JSON file batch-wise to the HubSpot API

I have tried many approaches and tested many scenarios, but I have been unable to find the issue or a solution.
My requirement: the HubSpot API accepts only 15k records per request, and I have a large JSON file, so I need to split it into batches of 15k records. Each batch of 15k records is sent to the API, then the script sleeps 10 seconds and captures the response; the process continues until all records are sent.
I tried chunking the data with the modulus operator but didn't get a useful response.
I'm not sure whether the code below works; can anyone suggest a better way?
How do I post batches to the HubSpot API?
Thanks in advance, this would be a great help for me!
import requests
import urllib3

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    dict_run = run.readlines()
dict_ready = ''.join(dict_run)
count = 1000
subsets = (dict_ready[x:x + count] for x in range(0, len(dict_ready), count))
url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {'Authorization': "Bearer pat-na1-**************************",
           'Accept': 'application/json',
           'Content-Type': 'application/json',
           'Transfer-encoding': 'chunked'}
for subset in subsets:
    # print(subset)
    urllib3.disable_warnings()
    r = requests.post(url, data=subset, headers=headers, verify=False,
                      timeout=(15, 20), stream=True)
    print(r.status_code)
    print(r.content)
Error:
400
b'\r\n400 Bad Request\r\n\r\n400 Bad Request\r\ncloudflare\r\n\r\n\r\n'
This is the other method:
import requests
import urllib3

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    dict_run = run.readlines()
dict_ready = ''.join(dict_run)
url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {'Authorization': "Bearer pat-na1***********-",
           'Accept': 'application/json',
           'Content-Type': 'application/json',
           'Transfer-encoding': 'chunked'}
urllib3.disable_warnings()
r = requests.post(url, data=dict_ready, headers=headers, verify=False,
                  timeout=(15, 20), stream=True)
r.iter_content(chunk_size=1000000)
print(r.status_code)
print(r.content)
Error:
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.hubapi.com', port=443): Max retries exceeded with url: /contacts/v1/contact/batch
(Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2396)')))
This is how the JSON data looks in the large JSON file:
{
    "email": "aaazaj21#yahoo.com",
    "properties": [
        {
            "property": "XlinkUserID",
            "value": 422211111
        },
        {
            "property": "register_time",
            "value": "2021-09-02"
        },
        {
            "property": "linked_alexa",
            "value": 1
        },
        {
            "property": "linked_googlehome",
            "value": 0
        },
        {
            "property": "fan_speed_switch_0x51_",
            "value": 2
        }
    ]
},
{
    "email": "zzz7#gmail.com",
    "properties": [
        {
            "property": "XlinkUserID",
            "value": 13333666
        },
        {
            "property": "register_time",
            "value": "2021-04-24"
        },
        {
            "property": "linked_alexa",
            "value": 1
        },
        {
            "property": "linked_googlehome",
            "value": 0
        },
        {
            "property": "full_colora19_st_0x06_",
            "value": 2
        }
    ]
}
I also tried wrapping the objects in a list:
[
    {
        "email": "aaazaj21#yahoo.com",
        "properties": [
            {
                "property": "XlinkUserID",
                "value": 422211111
            },
            {
                "property": "register_time",
                "value": "2021-09-02"
            },
            {
                "property": "linked_alexa",
                "value": 1
            },
            {
                "property": "linked_googlehome",
                "value": 0
            },
            {
                "property": "fan_speed_switch_0x51_",
                "value": 2
            }
        ]
    },
    {
        "email": "zzz7#gmail.com",
        "properties": [
            {
                "property": "XlinkUserID",
                "value": 13333666
            },
            {
                "property": "register_time",
                "value": "2021-04-24"
            },
            {
                "property": "linked_alexa",
                "value": 1
            },
            {
                "property": "linked_googlehome",
                "value": 0
            },
            {
                "property": "full_colora19_st_0x06_",
                "value": 2
            }
        ]
    }
]
You haven't said if your JSON file is a representation of an array of objects or just one object. Arrays are converted to Python lists by json.load and objects are converted to Python dictionaries.
Here is some code that assumes it is an array of objects; if it is not an array of objects, see https://stackoverflow.com/a/22878842/839338, but the same principle can be used.
This assumes you want 15k bytes, not records; if it is the number of records, you can simplify the code and just pass 15000 as the second argument to chunk_list().
import json
import math
import pprint

# See https://stackoverflow.com/a/312464/839338
def chunk_list(list_to_chunk, number_of_list_items):
    """Yield successive number_of_list_items-sized chunks from list."""
    for i in range(0, len(list_to_chunk), number_of_list_items):
        yield list_to_chunk[i:i + number_of_list_items]

with open('./allhubusers_data.json', 'r') as run:
    json_data = json.load(run)

desired_size = 15000
json_size = len(json.dumps(json_data))
print(f'{json_size=}')
print(f'Divide into {math.ceil(json_size / desired_size)} sub-sets')
print(f'Number of list items per subset = {len(json_data) // math.ceil(json_size / desired_size)}')

if isinstance(json_data, list):
    print("Found a list")
    sub_sets = chunk_list(json_data, len(json_data) // math.ceil(json_size / desired_size))
else:
    exit("Data not list")

for sub_set in sub_sets:
    pprint.pprint(sub_set)
    print(f'Length of sub-set {len(json.dumps(sub_set))}')
    # Do stuff with the sub sets...
    text_subset = json.dumps(sub_set)  # ...
You may need to adjust the value of desired_size downwards if the sub_sets vary in text length.
UPDATED IN RESPONSE TO COMMENT
If you just need 15000 records per request, this code should work for you:
import json
import pprint
import requests

# See https://stackoverflow.com/a/312464/839338
def chunk_list(list_to_chunk, number_of_list_items):
    """Yield successive number_of_list_items-sized chunks from list."""
    for i in range(0, len(list_to_chunk), number_of_list_items):
        yield list_to_chunk[i:i + number_of_list_items]

url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {
    'Authorization': "Bearer pat-na1-**************************",
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Transfer-encoding': 'chunked'
}

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    json_data = json.load(run)

desired_size = 15000
if isinstance(json_data, list):
    print("Found a list")
    sub_sets = chunk_list(json_data, desired_size)
else:
    exit("Data not list")

for sub_set in sub_sets:
    # pprint.pprint(sub_set)
    print(f'Length of sub-set {len(sub_set)}')
    r = requests.post(
        url,
        data=json.dumps(sub_set),
        headers=headers,
        verify=False,
        timeout=(15, 20),
        stream=True
    )
    print(r.status_code)
    print(r.content)
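The question also asks for a 10-second pause between batches and for capturing every response. A minimal sketch of that loop (reusing chunk_list, json_data, desired_size, url, and headers from above) might look like this:
import json
import time

import requests

responses = []  # capture each batch's response for later inspection
for sub_set in chunk_list(json_data, desired_size):
    r = requests.post(url, data=json.dumps(sub_set), headers=headers,
                      timeout=(15, 20))
    responses.append({'status': r.status_code, 'body': r.text})
    time.sleep(10)  # wait 10 seconds before sending the next batch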

pandas column to list for a json file

From a DataFrame, I want to produce a JSON output file with one key holding a list.
Expected output:
[
    {
        "model": "xx",
        "id": 1,
        "name": "xyz",
        "categories": [1, 2]
    },
    {
        ...
    }
]
What I have:
[
    {
        "model": "xx",
        "id": 1,
        "name": "xyz",
        "categories": "1,2"
    },
    {
        ...
    }
]
The actual code is:
df = pd.read_excel('data_threated.xlsx')
result = df.reset_index(drop=True).to_json("output_json.json", orient='records')
parsed = json.dumps(result)
jsonfile = open("output_json.json", 'r')
data = json.load(jsonfile)
How can I achieve this easily?
EDIT:
print(df['categories'].unique().tolist())
['1,2,3', 1, nan, '1,2,3,6', 9, 8, 11, 4, 5, 2, '1,2,3,4,5,6,7,8,9']
You can use:
df = pd.read_excel('data_threated.xlsx').reset_index(drop=True)
df['categories'] = df['categories'].apply(lambda x: [int(i) for i in x.split(',')] if isinstance(x, str) else '')
df.to_json('output.json', orient='records', indent=4)
Content of output.json
[
    {
        "model":"xx",
        "id":1,
        "name":"xyz",
        "categories":[
            1,
            2
        ]
    }
]
Note you can also use:
df['categories'] = pd.eval(df['categories'])
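Given the EDIT showing that the column mixes comma-separated strings, plain ints, and NaN, a more defensive sketch (an assumption: ints should become one-element lists and NaN an empty list; df is the DataFrame read above) could be:
import math

def to_list(x):
    if isinstance(x, str):                      # '1,2,3' -> [1, 2, 3]
        return [int(i) for i in x.split(',')]
    if isinstance(x, float) and math.isnan(x):  # NaN -> []
        return []
    return [int(x)]                             # 9 -> [9]

df['categories'] = df['categories'].apply(to_list)
df.to_json('output.json', orient='records', indent=4)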

JSON to CSV - go 4 levels deep

I would like to extract only a small fraction of my JSON response into a .csv file. However, I need to go 4 levels deep and I am currently only able to go 3 levels deep. My goal is to have a .csv with 3 columns (campaign_id, campaign_name, cost_per_click) and a line for each of my campaigns.
Original JSON
{
    "318429215527453": {
        "conversion_events": {
            "data": [
                {
                    "id": "djdfhdf",
                    "name": "Total",
                    "cost": 328.14,
                    "metrics_breakdown": {
                        "data": [
                            {
                                "campaign_id": 2364,
                                "campaign_name": "uk",
                                "cost_per_click": 1345
                            },
                            {
                                "campaign_id": 7483,
                                "campaign_name": "fr",
                                "cost_per_click": 756
                            },
                            {
                                "campaign_id": 8374,
                                "campaign_name": "spain",
                                "cost_per_click": 545
                            },
                            {
                                "campaign_id": 2431,
                                "campaign_name": "ge",
                                "cost_per_click": 321
                            }
                        ],
                        "paging": {
                            "cursors": {
                                "after": "MjUZD"
                            },
                            "next": "https://graph.facebook.com/v9.0/xxxx"
                        }
                    }
                }
            ],
            "summary": {
                "count": 1,
                "metric_date_range": {
                    "date_range": {
                        "begin_date": "2021-01-09T00:00:00+0100",
                        "end_date": "2021-02-08T00:00:00+0100",
                        "time_zone": "Europe/Paris"
                    },
                    "prior_period_date_range": {
                        "begin_date": "2020-12-10T00:00:00+0100",
                        "end_date": "2021-01-09T00:00:00+0100"
                    }
                }
            }
        },
        "id": "xxx"
    }
}
reformated.py
import json

with open('campaigns.json') as json_file:
    data = json.load(json_file)

reformated_json = data['318429215527453']['conversion_events']['data']

with open('data.json', 'w') as outfile:
    json.dump(reformated_json, outfile)
I tried to add ['metrics_breakdown'] or another ['data'] at the end of reformated_json but I am getting TypeError: list indices must be integers or slices, not str.
The content written to data.json looks like this:
[
    {
        "id": "djdfhdf",
        "name": "Total",
        "cost": 328.14,
        "metrics_breakdown": {
            "data": [
                {
                    "campaign_id": 2364,
                    "campaign_name": "uk",
                    "cost_per_click": 1345
                },
                {
                    "campaign_id": 7483,
                    "campaign_name": "fr",
                    "cost_per_click": 756
                },
                {
                    "campaign_id": 8374,
                    "campaign_name": "spain",
                    "cost_per_click": 545
                },
                {
                    "campaign_id": 2431,
                    "campaign_name": "ge",
                    "cost_per_click": 321
                }
            ],
            "paging": {
                "cursors": {
                    "after": "MjUZD"
                },
                "next": "https://graph.facebook.com/v9.0/xxxx"
            }
        }
    }
]
import csv
import json
from typing import Dict, List, Union  # typing for easy development

# read json function
def read_json(json_path: str) -> Union[Dict, List]:
    with open(json_path, 'r') as file_io:
        return json.load(file_io)

# write csv function
def write_csv(data: List[Dict], csv_path: str) -> None:
    with open(csv_path, 'w') as file:
        fieldnames = set().union(*data)
        writer = csv.DictWriter(file, fieldnames=fieldnames,
                                lineterminator='\n')
        writer.writeheader()
        writer.writerows(data)

# parse campaigns using a comprehension
def parse_campaigns(data: Dict) -> List[Dict]:
    return [row
            for value in data.values()  # first level (conversion events)
            for root_data in value['conversion_events']['data']  # conversion events/data
            for row in root_data['metrics_breakdown']['data']]  # data/metrics_breakdown/data

json_data = read_json('./campaigns.json')
campaign_data = parse_campaigns(json_data)
write_csv(campaign_data, 'campaigns.csv')
campaigns.csv (I copied the data to multiple root dictionary objects):
cost_per_click,campaign_id,campaign_name
1345,2364,uk
756,7483,fr
545,8374,spain
321,2431,ge
1345,2364,uk
756,7483,fr
545,8374,spain
321,2431,ge
The first data subkey contains a single-element list. Dereference with [0] to get the element, then fetch the next layers of keys. Then a DictWriter can be used to write the CSV lines:
import json
import csv

with open('campaigns.json') as json_file:
    data = json.load(json_file)

items = data['318429215527453']['conversion_events']['data'][0]['metrics_breakdown']['data']

with open('data.csv', 'w', newline='') as outfile:
    w = csv.DictWriter(outfile, fieldnames=items[0].keys())
    w.writeheader()
    w.writerows(items)
Output:
campaign_id,campaign_name,cost_per_click
2364,uk,1345
7483,fr,756
8374,spain,545
2431,ge,321
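Since pandas appears elsewhere in this thread, an alternative sketch using pd.json_normalize with its record_path argument (assuming the same campaigns.json as above) reaches the fourth level in one call:
import json

import pandas as pd

with open('campaigns.json') as fh:
    data = json.load(fh)

# record_path walks metrics_breakdown -> data inside each conversion event
df = pd.json_normalize(data['318429215527453']['conversion_events']['data'],
                       record_path=['metrics_breakdown', 'data'])
df.to_csv('data.csv', index=False)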

Python - Convert Nested JSON with Nested Columns

I am new to Python and JSON data structures and was looking for some assistance.
I have been able to create some Python code that calls a Web API and successfully converts the returned JSON data (report_row) into a dataframe using json_normalize().
I am having some issues converting the JSON column names into the dataframe column names and sorting them, and was wondering if I could get some help with the following:
Get column names from JSON data - in the dataframe I would like to rename the columns c1, c2, c3, etc. to RECORD_NO, REF_RECORD_NO, SOV_LINEITEM_NO, etc. The column names are in the JSON data at [data][report_header][cXX][name], where cXX is the column number.
Sort column names - I would like to order the dataframe columns so that instead of c1, c10, c11, c12, c2, c3, etc. the order is c1, c2, c3 ... c10, c11, c12.
If someone is able to provide some help, it would be greatly appreciated.
Thanks in advance
Python Code
json_data = json.loads(res.read())
data = pd.json_normalize(json_data['data'], record_path=['report_row'])
print(data)
which outputs the following
c1 c10 c11 ... c7 c8 c9
0 CON-0000001 71 VEN-0000001 ... Build IT System Contract 123 Pending
1 CON-0000002 72 VEN-0000002 ... Build IT System Contract XYZ Approved
JSON Data
"data": [
{
"report_header": {
"c11": {
"name": "VENDOR_RECORD",
"type": "java.lang.String"
},
"c10": {
"name": "VENDOR_ID",
"type": "java.lang.Integer"
},
"c12": {
"name": "VENDOR_NAME",
"type": "java.lang.String"
},
"c1": {
"name": "RECORD_NO",
"type": "java.lang.String"
},
"c2": {
"name": "REF_RECORD_NO",
"type": "java.lang.String"
},
"c3": {
"name": "SOV_LINEITEM_NO",
"type": "java.lang.String"
},
"c4": {
"name": "REF_ITEM",
"type": "java.lang.String"
},
"c5": {
"name": "PROJECTNUMBER",
"type": "java.lang.String"
},
"c6": {
"name": "PROJECTNAME",
"type": "java.lang.String"
},
"c7": {
"name": "TITLE",
"type": "java.lang.String"
},
"c8": {
"name": "CONTRACT_NO",
"type": "java.lang.String"
},
"c9": {
"name": "STATUS",
"type": "java.lang.String"
}
},
"report_row": [
{
"c1": "CON-0000001",
"c10": "71 ",
"c11": "VEN-0000001",
"c12": "Microsoft",
"c2": "",
"c3": "1",
"c4": "",
"c5": "P-0037",
"c6": "Project ABC",
"c7": "Build IT System",
"c8": "Contract 123",
"c9": "Pending"
},
{
"c1": "CON-0000002",
"c10": "72 ",
"c11": "VEN-0000002",
"c12": "Google",
"c2": "",
"c3": "1.1",
"c4": "",
"c5": "P-0037",
"c6": "Project ABC",
"c7": "Build IT System",
"c8": "Contract XYZ",
"c9": "Approved"
}
]
}
],
"message": [
"OK"
],
"status": 200
}
I was able to resolve the issue by adding the following code:
import pandas as pd

# NOTE: 'header' was undefined in the original snippet; this definition is an assumption
header = pd.json_normalize(json_data['data'][0]['report_header'])

# Get the number of fields/columns in the JSON data
number_of_fields = len(json_data['data'][0]['report_header'])

reorder_columns = []
new_column_names = []
field_index = 0

# Loop through the columns and do the following...
# reorder_columns - this is the column order that I want: c1, c2, c3 ... c10, c11, c12
# new_column_names - this will retrieve the column names from the header: c1.name, c2.name, etc.
while field_index < number_of_fields:
    field_index += 1
    new_column = "c" + str(field_index)
    reorder_columns.append(new_column)
    new_column_name = header.iloc[0][new_column + '.name']
    new_column_names.append(new_column_name)

data = pd.json_normalize(json_data['data'], record_path=['report_row'])
data = data.reindex(columns=reorder_columns)
data.columns = new_column_names
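A shorter route (a sketch, assuming json_data is the parsed API response shown above) builds the rename mapping directly from report_header with a dict comprehension and sorts the cXX columns numerically:
import pandas as pd

header = json_data['data'][0]['report_header']

df = pd.json_normalize(json_data['data'], record_path=['report_row'])

# Sort c1, c2, ..., c10, c11 numerically rather than lexicographically
df = df[sorted(df.columns, key=lambda c: int(c[1:]))]

# Map each cXX column to its human-readable name from the header
df = df.rename(columns={col: meta['name'] for col, meta in header.items()})
print(df)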

Does TOML support nested arrays of objects/tables?

I want to generate JSON from TOML files. The JSON structure should be something like this, with arrays of objects within arrays of objects:
{
    "things": [
        {
            "a": "thing1",
            "b": "fdsa",
            "multiline": "Some sample text."
        },
        {
            "a": "Something else",
            "b": "zxcv",
            "multiline": "Multiline string",
            "objs": [ // LOOK HERE
                { "x": 1 },
                { "x": 4 },
                { "x": 3 }
            ]
        },
        {
            "a": "3",
            "b": "asdf",
            "multiline": "thing 3.\nanother line"
        }
    ]
}
I have some TOML that looks like the example below, but it doesn't seem to work with the objs section.
name = "A Test of the TOML Parser"
[[things]]
a = "thing1"
b = "fdsa"
multiLine = """
Some sample text."""
[[things]]
a = "Something else"
b = "zxcv"
multiLine = """
Multiline string"""
[[things.objs]] # MY QUESTION IS ABOUT THIS PART
x = 1
[[things.objs]]
x = 4
[[things.objs]]
x = 3
[[things]]
a = "3"
b = "asdf"
multiLine = """
thing 3.
another line"""
Is there a way to do it in TOML? JSON to TOML converters don't seem to work with my example. And does it work with deeper nesting of arrays of arrays/tables?
As per the PR that merged this feature in the main TOML repository, this is the correct syntax for arrays of objects:
[[products]]
name = "Hammer"
sku = 738594937
[[products]]
[[products]]
name = "Nail"
sku = 284758393
color = "gray"
Which would produce the following equivalent JSON:
{
    "products": [
        { "name": "Hammer", "sku": 738594937 },
        { },
        { "name": "Nail", "sku": 284758393, "color": "gray" }
    ]
}
I'm not sure why it wasn't working before, but this seems to work:
name = "A Test of the TOML Parser"
[[things]]
a = "thing1"
b = "fdsa"
multiLine = """
Some sample text."""
[[things]]
a = "Something else"
b = "zxcv"
multiLine = """
Multiline string"""
[[things.objs]]
x = 1
[[things.objs]]
x = 4
[[things.objs]]
x = 7
[[things.objs.morethings]]
y = [
2,
3,
4
]
[[things.objs.morethings]]
y = 9
[[things]]
a = "3"
b = "asdf"
multiLine = """
thing 3.
another line"""
JSON output:
{
"name": "A Test of the TOML Parser",
"things": [{
"a": "thing1",
"b": "fdsa",
"multiLine": "Some sample text."
}, {
"a": "Something else",
"b": "zxcv",
"multiLine": "Multiline string",
"objs": [{
"x": 1
}, {
"x": 4
}, {
"x": 7,
"morethings": [{
"y": [2, 3, 4]
}, {
"y": 9
}]
}]
}, {
"a": "3",
"b": "asdf",
"multiLine": "thing 3.\\nanother line"
}]
}
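For the conversion itself, a minimal sketch using Python's standard-library tomllib (available since Python 3.11; the filename things.toml is an assumption) is:
import json
import tomllib  # standard library since Python 3.11

# 'things.toml' is a hypothetical file holding the TOML shown above
with open('things.toml', 'rb') as fh:  # tomllib requires binary mode
    data = tomllib.load(fh)

print(json.dumps(data, indent=2))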