Nested JSON to CSV (multilevel)

I received a JSON file to transform into a CSV file.
[
    {
        "id": "132465",
        "ext_numero_commande": "4500291738L",
        "ext_line_items": [
            {
                "key_0": "10",
                "key_1": "10021405 / 531.415.110",
                "key_4": "4 Pce"
            },
            {
                "key_0": "20",
                "key_1": "10021258 / 531.370.140 / NPK-Nr. 224412",
                "key_4": "4 Pce"
            },
            {
                "key_0": "30",
                "key_1": "10020895 / 531.219.120 / NPK-Nr. 222111",
                "key_4": "10 Pce"
            },
            {
                "key_0": "40",
                "key_1": "10028633 / 552.470.110",
                "key_4": "3 Pce"
            }
        ],
        "ext_prix": [
            {"key_0": "11.17"},
            {"key_0": "9.01"},
            {"key_0": "18.63"},
            {"key_0": "24.15"}
        ],
        "ext_tag": "Date_livraison",
        "ext_jour_livraison": "23-07-2021",
        "ext_jour_livraison_1": null
    }
]
The expected CSV output:

id     | Ext_Numero_Commande | Ext_line items 0 | Ext_line items 1                        | Ext_line items 4 | Ext_Prix | Ext_Tag        | Ext_Jour_Livraison | Ext_Jour_Livraison 1
132465 | 4500291738L         | 10               | 10021405 / 531.415.110                  | 4 Pce            | 11.17    | Date_livraison | 23-07-2021         |
132465 | 4500291738L         | 20               | 10021258 / 531.370.140 / NPK-Nr. 224412 | 4 Pce            | 9.01     | Date_livraison | 23-07-2021         |
132465 | 4500291738L         | 30               | 10020895 / 531.219.120 / NPK-Nr. 222111 | 10 Pce           | 18.63    | Date_livraison | 23-07-2021         |
132465 | 4500291738L         | 40               | 10028633 / 552.470.110                  | 3 Pce            | 24.15    | Date_livraison | 23-07-2021         |
I found the function pd.json_normalize.
df = pd.json_normalize(
    json_object[0],
    record_path=['ext_line_items'],
    meta=['id', 'ext_numero_commande', 'ext_tag',
          'ext_jour_livraison', 'ext_jour_livraison_1'])
This gets me nearly to my end result, and I can add the last column ["ext_prix"] with the same method and a concatenation function. Is there a function that does this automatically?
I used this function, but it returns an error.
df = pd.json_normalize(
    json_object[0],
    record_path=['ext_line_items', 'ext_prix'],
    meta=['id', 'ext_numero_commande', 'ext_tag',
          'ext_jour_livraison', 'ext_jour_livraison_1'])
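The second call fails because record_path describes a single path into successively nested lists, not two sibling lists. A minimal sketch of the two-pass approach mentioned above (assuming json_object is the parsed file; the concat pairs the two parallel lists by row position):

import pandas as pd

items = pd.json_normalize(json_object[0], record_path='ext_line_items',
                          meta=['id', 'ext_numero_commande', 'ext_tag',
                                'ext_jour_livraison', 'ext_jour_livraison_1'])
# ext_prix runs parallel to ext_line_items (one price per line item),
# so align the two frames by row position.
prix = pd.json_normalize(json_object[0], record_path='ext_prix').add_prefix('ext_prix_')
df = pd.concat([items, prix], axis=1)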

You can solve this problem if your JSON dataset has a fixed length; I hope this code works properly. Follow this approach (JSON to dictionary to CSV using pandas).
# import pandas
import pandas as pd

# read the JSON file
df = pd.read_json('C://Users//jahir//Desktop//json_file.json')

# create the dictionary
create_dict = {}

# iterate over the JSON file
for index, values in df.iterrows():
    # loop over ext_line_items
    for ext_lines in df['ext_line_items'][index]:
        for item in ext_lines:
            column_name = 'ext_line_items_' + item
            if column_name not in create_dict:
                create_dict[column_name] = []
            # repeat four times to build the 4x4 product for
            # ext_line_items_key_0, ext_line_items_key_1, ext_line_items_key_4
            time_loop = 4
            while time_loop > 0:
                time_loop -= 1
                create_dict[column_name] += [ext_lines[item]]
    # loop over the ext_prix dataset
    # repeat four times to build the 4x4 product for ext_prix_key_0
    time_loop = 4
    while time_loop > 0:
        time_loop -= 1
        for ext_prix in df['ext_prix'][index]:
            for item in ext_prix:
                column_name = 'ext_prix_' + item
                if column_name not in create_dict:
                    create_dict[column_name] = []
                create_dict[column_name] += [ext_prix[item]]
    # add the meta keys to the dictionary
    for i in ['id', 'ext_numero_commande', 'ext_tag', 'ext_jour_livraison']:
        create_dict[i] = []
    # repeat 16 times (the 4*4 product) for 'id', 'ext_numero_commande', 'ext_tag', 'ext_jour_livraison'
    total_time = 16
    while total_time > 0:
        total_time -= 1
        for j in ['id', 'ext_numero_commande', 'ext_tag', 'ext_jour_livraison']:
            create_dict[j] += [df[j][index]]

# dictionary to DataFrame
pd_dict = pd.DataFrame.from_dict(create_dict)

# write the DataFrame to a CSV file
write_csv_path = 'C://Users//jahir//Desktop//csv_file.csv'
pd_dict.to_csv(write_csv_path, index=False, header=True)
Output:

Using pd.json_normalize you can solve this problem. Follow this approach (json_normalize, then merge, then CSV).
import pandas as pd

data = [
    {
        "id": "132465",
        "ext_numero_commande": "4500291738L",
        "ext_line_items": [
            {
                "key_0": "10",
                "key_1": "10021405 / 531.415.110",
                "key_2": "4 Pce"
            },
            {
                "key_0": "20",
                "key_1": "10021258 / 531.370.140 / NPK-Nr. 224412",
                "key_2": "4 Pce"
            },
            {
                "key_0": "30",
                "key_1": "10020895 / 531.219.120 / NPK-Nr. 222111",
                "key_2": "10 Pce"
            },
            {
                "key_0": "40",
                "key_1": "10028633 / 552.470.110",
                "key_2": "3 Pce"
            }
        ],
        "ext_prix": [
            {"key_4": "11.17"},
            {"key_4": "9.01"},
            {"key_4": "18.63"},
            {"key_4": "24.15"}
        ],
        "ext_tag": "Date_livraison",
        "ext_jour_livraison": "23-07-2021",
        "ext_jour_livraison_1": None
    }
]
# json_normalize with record path ext_line_items
normalize_ext_line_items = pd.json_normalize(
    data,
    record_path="ext_line_items",
    meta=["id", "ext_numero_commande", "ext_tag", "ext_jour_livraison"])
# json_normalize with record path ext_prix
normalize_ext_prix = pd.json_normalize(data, record_path="ext_prix", meta=["id"])
# merge the two frames on id
final_output = normalize_ext_line_items.merge(normalize_ext_prix, on='id', how='left')
# write the DataFrame to a CSV file (use your own file path)
write_csv_path = 'C://Users//jahir//Desktop//csv_file.csv'
final_output.to_csv(write_csv_path, index=False, header=True)
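One hedged caveat on the merge above: id is not unique in either frame (four rows each), so merging on it yields a 4 x 4 cross join of 16 rows rather than the four paired rows shown in the question's expected output. If the two lists are meant to pair row for row, a positional concat is one alternative sketch:

# Pair line item i with price i by row position instead of joining on the non-unique 'id'.
final_output = pd.concat(
    [normalize_ext_line_items, normalize_ext_prix.drop(columns='id')],
    axis=1)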
Output:

Related

Send a large JSON file batch-wise to the HubSpot API

I have tried many approaches, tested many scenarios, and did a lot of research, but I was unable to find the issue or a solution.
My requirement: the HubSpot API accepts only 15k records per request, and we have a large JSON file, so we need to split it into batches of 15k records. Each batch is sent to the API; once 15k records are sent, the script sleeps 10 seconds and captures each response, and the process continues until all records are finished.
I tried chunking with the modulus operator but didn't get any response.
I'm not sure whether the code below works; can anyone suggest a better way to send batches to the HubSpot API?
Thanks in advance, this would be a great help!
import urllib3
import requests

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    dict_run = run.readlines()
    dict_ready = ''.join(dict_run)

count = 1000
subsets = (dict_ready[x:x + count] for x in range(0, len(dict_ready), count))
url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {'Authorization': "Bearer pat-na1-**************************",
           'Accept': 'application/json',
           'Content-Type': 'application/json',
           'Transfer-encoding': 'chunked'}
for subset in subsets:
    # print(subset)
    urllib3.disable_warnings()
    r = requests.post(url, data=subset, headers=headers, verify=False,
                      timeout=(15, 20), stream=True)
    print(r.status_code)
    print(r.content)
ERROR:
400
b'\r\n400 Bad Request\r\n\r\n400 Bad Request\r\ncloudflare\r\n\r\n\r\n'
This is the other method I tried:
import urllib3
import requests

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    dict_run = run.readlines()
    dict_ready = ''.join(dict_run)

url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {'Authorization': "Bearer pat-na1***********-",
           'Accept': 'application/json',
           'Content-Type': 'application/json',
           'Transfer-encoding': 'chunked'}
urllib3.disable_warnings()
r = requests.post(url, data=dict_ready, headers=headers, verify=False,
                  timeout=(15, 20), stream=True)
r.iter_content(chunk_size=1000000)
print(r.status_code)
print(r.content)
ERROR:
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='api.hubapi.com', port=443): Max retries exceeded with url: /contacts/v1/contact/batch
(Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2396)')))
This is how the JSON data looks in the large JSON file:
{
    "email": "aaazaj21#yahoo.com",
    "properties": [
        {
            "property": "XlinkUserID",
            "value": 422211111
        },
        {
            "property": "register_time",
            "value": "2021-09-02"
        },
        {
            "property": "linked_alexa",
            "value": 1
        },
        {
            "property": "linked_googlehome",
            "value": 0
        },
        {
            "property": "fan_speed_switch_0x51_",
            "value": 2
        }
    ]
},
{
    "email": "zzz7#gmail.com",
    "properties": [
        {
            "property": "XlinkUserID",
            "value": 13333666
        },
        {
            "property": "register_time",
            "value": "2021-04-24"
        },
        {
            "property": "linked_alexa",
            "value": 1
        },
        {
            "property": "linked_googlehome",
            "value": 0
        },
        {
            "property": "full_colora19_st_0x06_",
            "value": 2
        }
    ]
}
I also tried wrapping them as a list of objects:
[
    {
        "email": "aaazaj21#yahoo.com",
        "properties": [
            {
                "property": "XlinkUserID",
                "value": 422211111
            },
            {
                "property": "register_time",
                "value": "2021-09-02"
            },
            {
                "property": "linked_alexa",
                "value": 1
            },
            {
                "property": "linked_googlehome",
                "value": 0
            },
            {
                "property": "fan_speed_switch_0x51_",
                "value": 2
            }
        ]
    },
    {
        "email": "zzz7#gmail.com",
        "properties": [
            {
                "property": "XlinkUserID",
                "value": 13333666
            },
            {
                "property": "register_time",
                "value": "2021-04-24"
            },
            {
                "property": "linked_alexa",
                "value": 1
            },
            {
                "property": "linked_googlehome",
                "value": 0
            },
            {
                "property": "full_colora19_st_0x06_",
                "value": 2
            }
        ]
    }
]
You haven't said whether your JSON file is a representation of an array of objects or just one object. Arrays are converted to Python lists by json.load, and objects are converted to Python dictionaries.
Here is some code that assumes it is an array of objects; if it is not an array of objects, see https://stackoverflow.com/a/22878842/839338, but the same principle can be used.
This assumes you want 15k bytes, not records; if it is the number of records, you can simplify the code and just pass 15000 as the second argument to chunk_list().
import json
import math
import pprint

# See https://stackoverflow.com/a/312464/839338
def chunk_list(list_to_chunk, number_of_list_items):
    """Yield successive chunk_size-sized chunks from list."""
    for i in range(0, len(list_to_chunk), number_of_list_items):
        yield list_to_chunk[i:i + number_of_list_items]

with open('./allhubusers_data.json', 'r') as run:
    json_data = json.load(run)

desired_size = 15000
json_size = len(json.dumps(json_data))
print(f'{json_size=}')
print(f'Divide into {math.ceil(json_size / desired_size)} sub-sets')
print(f'Number of list items per subset = {len(json_data) // math.ceil(json_size / desired_size)}')
if isinstance(json_data, list):
    print("Found a list")
    sub_sets = chunk_list(json_data, len(json_data) // math.ceil(json_size / desired_size))
else:
    exit("Data not list")
for sub_set in sub_sets:
    pprint.pprint(sub_set)
    print(f'Length of sub-set {len(json.dumps(sub_set))}')
    # do stuff with the sub-sets...
    text_subset = json.dumps(sub_set)  # ...
You may need to adjust the value of desired_size downwards if the sub-sets vary in text length.
UPDATED IN RESPONSE TO COMMENT
If you just need 15000 records per request, this code should work for you:
import json
import pprint
import requests

# See https://stackoverflow.com/a/312464/839338
def chunk_list(list_to_chunk, number_of_list_items):
    """Yield successive chunk_size-sized chunks from list."""
    for i in range(0, len(list_to_chunk), number_of_list_items):
        yield list_to_chunk[i:i + number_of_list_items]

url = 'https://api.hubapi.com/contacts/v1/contact/batch'
headers = {
    'Authorization': "Bearer pat-na1-**************************",
    'Accept': 'application/json',
    'Content-Type': 'application/json',
    'Transfer-encoding': 'chunked'
}

with open(r'D:\Users\lakshmi.vijaya\Desktop\Invalidemail\allhubusers_data.json', 'r') as run:
    json_data = json.load(run)

desired_size = 15000
if isinstance(json_data, list):
    print("Found a list")
    sub_sets = chunk_list(json_data, desired_size)
else:
    exit("Data not list")
for sub_set in sub_sets:
    # pprint.pprint(sub_set)
    print(f'Length of sub-set {len(sub_set)}')
    r = requests.post(
        url,
        data=json.dumps(sub_set),
        headers=headers,
        verify=False,
        timeout=(15, 20),
        stream=True
    )
    print(r.status_code)
    print(r.content)
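The question also asked for a 10-second pause after each batch; as a hedged sketch, adding time.sleep(10) at the end of the loop body provides that:

import time

for sub_set in sub_sets:
    r = requests.post(url, data=json.dumps(sub_set), headers=headers,
                      timeout=(15, 20))
    print(r.status_code, r.content)
    time.sleep(10)  # wait 10 seconds between batches, as the requirement describes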

How to bring JSON format to relational form?

This is my code:
%spark.pyspark
df_principalBody = spark.sql("""
    SELECT
        gtin
        , principalBodyConstituents
        --, principalBodyConstituents.coatings.materialType.value
    FROM
        v_df_source""")
df_principalBody.createOrReplaceTempView("v_df_principalBody")
df_principalBody.collect()
And this is the output:
[Row(gtin='7617014161936', principalBodyConstituents=[Row(coatings=[Row(materialType=Row(value='003', valueRange='405')
How can I read the value and valueRange fields in relational format?
I tried with explode and flatten, but it did not work.
Part of my json:
{
    "gtin": "7617014161936",
    "timePeriods": [
        {
            "fractionData": {
                "principalBody": {
                    "constituents": [
                        {
                            "coatings": [
                                {
                                    "materialType": {
                                        "value": "003",
                                        "valueRange": "405"
                                    },
                                    "percentage": 0.1
                                }
                            ],
                            ...
You can use data_dict.items() to list key/value pairs.
I used part of your JSON, as below:

import json

str1 = """{"gtin": "7617014161936","timePeriods": [{"fractionData": {"principalBody": {"constituents": [{"coatings": [
    {
        "materialType": {
            "value": "003",
            "valueRange": "405"
        },
        "percentage": 0.1
    }
]}]}}}]}"""

res = json.loads(str1)
res_dict = res['timePeriods'][0]['fractionData']['principalBody']['constituents'][0]['coatings'][0]['materialType']
df = spark.createDataFrame(data=res_dict.items())
df.show()
Output :
+----------+---+
| _1| _2|
+----------+---+
| value|003|
|valueRange|405|
+----------+---+
You can even specify your schema:

from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    res_dict.items(),
    schema=StructType(fields=[
        StructField("key", StringType()),
        StructField("value", StringType())]))
df.show()
Resulting in
+----------+-----+
| key|value|
+----------+-----+
| value| 003|
|valueRange| 405|
+----------+-----+
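If you want to stay closer to the original explode/flatten attempt instead of hard-coding the path, unnesting one array level at a time usually works. A hedged sketch, assuming the full document is registered as v_df_source and the field names match the JSON fragment above (timePeriods, constituents and coatings are arrays, so each needs its own explode):

from pyspark.sql.functions import col, explode

df_rel = (spark.table("v_df_source")
          .select("gtin", explode("timePeriods").alias("tp"))
          .select("gtin", explode("tp.fractionData.principalBody.constituents").alias("c"))
          .select("gtin", explode("c.coatings").alias("co"))
          .select("gtin",
                  col("co.materialType.value").alias("value"),
                  col("co.materialType.valueRange").alias("valueRange"),
                  col("co.percentage").alias("percentage")))
df_rel.show()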

How to loop different types of nested JSON objects multiple times in the same message

Python noob here, again. I'm trying to create a Python script to auto-generate a JSON message containing multiple item records, built with a for loop. The message structure and cardinalities are as follows:
messageHeader[1]
-item [1-*]
--itemAttributesA [0-1]
--itemAttributesB [0-1]
--itemAttributesC [0-1]
--itemLocaton [1]
--itemRelationships [0-1]
I've had some really good help before with looping through the same object, but only for one record, for example just the itemRelationships record. However, as soon as I try to create one message with many items (i.e. 5) and a single instance of itemAttributes, itemLocation and itemRelationships per item, it does not work: I keep getting a KeyError. I've tried to work out what a KeyError means in relation to what I am doing, but cannot link what I am doing wrong to the examples elsewhere.
Here's my code as it stands:
import json
import random

data = {'messageID': random.randint(0, 2147483647), 'messageType': 'messageType'}
data['item'] = list()
itemAttributeType = input("Please select what type of Attribute item has, either 'A', 'B' or 'C' :")
for x in range(0, 5):
    data['item'].append({
        'itemId': "I",
        'itemType': "T"})
if itemAttributeType == "A":
    data['item'][0]['itemAttributesA']
    data['item'][0]['itemAttributesA'].append({
        'attributeA': "ITA"})
elif itemAttributeType == "B":
    data['item'][0]['itemAttributesB']
    data['item'][0]['itemAttributesB'].append({
        'attributeC': "ITB"})
else:
    data['item'][0]['itemAttributesC']
    data['item'][0]['itemAttributesC'].append({
        'attributeC': "ITC"})
    pass
data['item'][0]['itemLocation'] = {
    'itemDetail': "ITC"}
itemRelation = input("Does the item have a relation: ")
if itemRelation > '':
    data['item'][0]['itemRelations'] = {
        'itemDetail': "relation"}
else:
    pass
print(json.dumps(data, indent=4))
I have also tried this code, which gives me better results:
import json
import random

data = {'messageID': random.randint(0, 2147483647), 'messageType': 'messageType'}
data['item'] = list()
itemAttributeType = input("Please select what type of Attribute item has, either 'A', 'B' or 'C' :")
for x in range(0, 5):
    data['item'].append({
        'itemId': "I",
        'itemType': "T"})
if itemAttributeType == "A":
    data['item'][0]['itemAttributesA'] = {
        'attributeA': "ITA"}
elif itemAttributeType == "B":
    data['item'][0]['itemAttributesB'] = {
        'attributeB': "ITB"}
else:
    data['item'][0]['itemAttributesC'] = {
        'attributeC': "ITC"}
    pass
data['item'][0]['itemLocation'] = {
    'itemDetail': "ITC"}
itemRelation = input("Does the item have a relation: ")
if itemRelation > '':
    data['item'][0]['itemRelations'] = {
        'itemDetail': "relation"}
else:
    pass
print(json.dumps(data, indent=4))
This actually gives me a result, but it puts the itemAttributesA, itemLocation and itemRelations records on the first item only, followed by four bare item records at the end, as follows:
{
    "messageID": 1926708779,
    "messageType": "messageType",
    "item": [
        {
            "itemId": "I",
            "itemType": "T",
            "itemAttributesA": {
                "itemLocationType": "ITA"
            },
            "itemLocation": {
                "itemDetail": "location"
            },
            "itemRelations": {
                "itemDetail": "relation"
            }
        },
        {
            "itemId": "I",
            "itemType": "T"
        },
        {
            "itemId": "I",
            "itemType": "T"
        },
        {
            "itemId": "I",
            "itemType": "T"
        },
        {
            "itemId": "I",
            "itemType": "T"
        }
    ]
}
What I am trying to achieve is this output:
{
    "messageID": 2018369867,
    "messageType": "messageType",
    "item": [{
        "itemId": "I",
        "itemType": "T",
        "itemAttributesA": {
            "attributeA": "ITA"
        },
        "itemLocation": {
            "itemDetail": "Location"
        },
        "itemRelation": [{
            "itemDetail": "D"
        }]
    }, {
        "item": [{
            "itemId": "I",
            "itemType": "T",
            "itemAttributesB": {
                "attributeA": "ITB"
            },
            "itemLocation": {
                "itemDetail": "Location"
            },
            "itemRelation": [{
                "itemDetail": "D"
            }]
        }, {
            "item": [{
                "itemId": "I",
                "itemType": "T",
                "itemAttributesC": {
                    "attributeA": "ITC"
                },
                "itemLocation": {
                    "itemDetail": "Location"
                },
                "itemRelation": [{
                    "itemDetail": "D"
                }]
            }, {
                "item": [{
                    "itemId": "I",
                    "itemType": "T",
                    "itemAttributesA": {
                        "attributeA": "ITA"
                    },
                    "itemLocation": {
                        "itemDetail": "Location"
                    },
                    "itemRelation": [{
                        "itemDetail": "D"
                    }]
                },
                {
                    "item": [{
                        "itemId": "I",
                        "itemType": "T",
                        "itemAttributesB": {
                            "attributeA": "ITB"
                        },
                        "itemLocation": {
                            "itemDetail": "Location"
                        },
                        "itemRelation": [{
                            "itemDetail": "D"
                        }]
                    }]
                }
                ]
            }]
        }]
    }]
}
I've been at this for the best part of a whole day trying to get it to work, butchering away at the code. Where am I going wrong? Any help would be greatly appreciated.
You're close. I think the parts you are missing are adding each dict to your current dict and the indentation within your for loop.
import json
import random

data = {'messageID': random.randint(0, 2147483647), 'messageType': 'messageType'}
data['item'] = list()
itemAttributeType = input("Please select what type of Attribute item has, either 'A', 'B' or 'C' :")
for x in range(0, 5):
    data['item'].append({
        'itemId': "I",
        'itemType': "T"})
    if itemAttributeType == "A":
        # First you need to add `itemAttributesA` to your dict:
        data['item'][x]['itemAttributesA'] = dict()
        # You could also do data['item'][x]['itemAttributesA'] = {'attributeA': "ITA"} in one step
        data['item'][x]['itemAttributesA']['attributeA'] = "ITA"
    elif itemAttributeType == "B":
        data['item'][x]['itemAttributesB'] = dict()
        data['item'][x]['itemAttributesB']['attributeC'] = "ITB"
    else:
        data['item'][x]['itemAttributesC'] = dict()
        data['item'][x]['itemAttributesC']['attributeC'] = "ITC"
    data['item'][x]['itemLocation'] = {'itemDetail': "ITC"}
    itemRelation = input("Does the item have a relation: ")
    if itemRelation > '':
        data['item'][x]['itemRelations'] = {'itemDetail': "relation"}
print(json.dumps(data, indent=4))
This code can also be shortened considerably if your example is close to what you truly desire:
import json
import random

data = {'messageID': random.randint(0, 2147483647), 'messageType': 'messageType'}
data['item'] = list()
itemAttributeType = input("Please select what type of Attribute item has, either 'A', 'B' or 'C' :")
for x in range(0, 5):
    new_item = {
        'itemId': "I",
        'itemType': "T",
        'itemAttributes' + str(itemAttributeType): {
            'attribute' + str(itemAttributeType): "IT" + str(itemAttributeType)
        },
        'itemLocation': {'itemDetail': "ITC"}
    }
    itemRelation = input("Does the item have a relation: ")
    if itemRelation > '':
        new_item['itemRelations'] = {'itemDetail': itemRelation}
    data['item'].append(new_item)
print(json.dumps(data, indent=4))
Another note: if you want messageID to be truly unique, then you should probably look into a UUID; otherwise you may end up with message IDs that match.
import uuid
unique_id = str(uuid.uuid4())
print(unique_id)

How to put logic and assert dynamic data using karate

Here is one response for a particular request:
{
    "data": {
        "foo": [{
            "total_value": 200,
            "applied_value": [{
                "type": "A",
                "id": 79806,
                "value": 200
            }]
        }]
    }
}
Here is another response for the SAME request:
{
    "data": {
        "foo": [{
            "total_value": 300,
            "applied_value": [{
                "type": "A",
                "id": 79806,
                "value": 200
            },
            {
                "type": "B",
                "id": 79809,
                "value": 100
            }]
        }]
    }
}
I am unsure which scenario will produce which response. So the use case is:
Whenever there are 2 values in applied_value, add the two values and assert.
Whenever there is only 1 value in applied_value, assert directly.
Here's one possible solution:
* def adder = function(array) { var total = 0; for (var i = 0; i < array.length; i++) total += array[i]; return total }
* def response =
"""
{
  "data": {
    "foo": [{
      "total_value": 300,
      "applied_value": [{
        "type": "A",
        "id": 79806,
        "value": 200
      },
      {
        "type": "B",
        "id": 79809,
        "value": 100
      }]
    }]
  }
}
"""
* def expected = get[0] response..total_value
* def values = $response..value
* def total = adder(values)
* match expected == total
Just as an example, an alternate way to implement the adder routine is like this:
* def total = 0
* def add = function(x){ karate.set('total', karate.get('total') + x ) }
* eval karate.forEach(values, add)
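The same adder also covers the single-value response, since summing a one-element array returns that element unchanged; a quick hedged check (the literal array below is illustrative, not taken from the question):

* def single = [200]
* def singleTotal = adder(single)
* match singleTotal == 200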

R - Create a nested JSON object with two names from two different dataframes

Having two dataframes like
final_data2:
id type lang div
1 hola page es 1
and paths:
source target count
1 hola adios 1
I am able to combine both dataframes in the same JSON using jsonlite and the following code:
cat(toJSON(c(apply(final_data2, 1, function(x) list(id = unname(x[1]),
    type = unname(x[2]), lang = unname(x[3]), div = unname(x[4]))),
    apply(paths, 1, function(x) list(source = unname(x[1]), target = unname(x[2]),
    playcount = unname(x[3])))), pretty = TRUE))
The result is a set of arrays, as follows:
[
    {
        "id": ["hola"],
        "type": ["page"],
        "lang": ["es"],
        "div": ["1"]
    },
    {
        "source": ["hola"],
        "target": ["adios"],
        "playcount": ["1"]
    }
]
However, I need the generated object to contain two names (nodes and links), each nesting one of the previously defined dataframes. Therefore, the structure should look like this:
{
    "nodes": [
        {
            "id": ["hola"],
            "type": ["page"],
            "lang": ["es"],
            "div": ["1"]
        }
    ],
    "links": [
        {
            "source": ["hola"],
            "target": ["adios"],
            "playcount": ["1"]
        }
    ]
}
Any tip on how to achieve it?
Just pass the data.frames as a named list into toJSON:
library(jsonlite)
df1 <- read.table(textConnection(" id type lang div
1 hola page es 1"), header = T)
df2 <- read.table(textConnection(" source target count
1 hola adios 1"), header = T)
toJSON(list(nodes = df1, links = df2), pretty = T)
# {
# "nodes": [
# {
# "id": "hola",
# "type": "page",
# "lang": "es",
# "div": 1
# }
# ],
# "links": [
# {
# "source": "hola",
# "target": "adios",
# "count": 1
# }
# ]
# }
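If you need every scalar wrapped in an array exactly as in the desired structure above, a hedged variant keeps the question's own row-wise list construction inside a named list (the scalars stay boxed because toJSON's auto_unbox defaults to FALSE):

cat(toJSON(list(
  nodes = apply(final_data2, 1, function(x) list(id = unname(x[1]),
    type = unname(x[2]), lang = unname(x[3]), div = unname(x[4]))),
  links = apply(paths, 1, function(x) list(source = unname(x[1]),
    target = unname(x[2]), playcount = unname(x[3])))
), pretty = TRUE))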