How to bring JSON format into relational form?

This is my code:
%spark.pyspark
df_principalBody = spark.sql("""
    SELECT
        gtin
        , principalBodyConstituents
        --, principalBodyConstituents.coatings.materialType.value
    FROM
        v_df_source""")
df_principalBody.createOrReplaceTempView("v_df_principalBody")
df_principalBody.collect()
And this is the output:
[Row(gtin='7617014161936', principalBodyConstituents=[Row(coatings=[Row(materialType=Row(value='003', valueRange='405')
How can I read the value and valueRange fields in relational format?
I tried with explode and flatten, but it did not work.
Part of my json:
{
    "gtin": "7617014161936",
    "timePeriods": [
        {
            "fractionData": {
                "principalBody": {
                    "constituents": [
                        {
                            "coatings": [
                                {
                                    "materialType": {
                                        "value": "003",
                                        "valueRange": "405"
                                    },
                                    "percentage": 0.1
                                }
                            ],
...

You can use data_dict.items() to list key/value pairs.
I used part of your JSON as below:
str1 = """{"gtin": "7617014161936","timePeriods": [{"fractionData": {"principalBody": {"constituents": [{"coatings": [
    {
        "materialType": {
            "value": "003",
            "valueRange": "405"
        },
        "percentage": 0.1
    }
]}]}}}]}"""
import json
res = json.loads(str1)
res_dict = res['timePeriods'][0]['fractionData']['principalBody']['constituents'][0]['coatings'][0]['materialType']
df = spark.createDataFrame(data=res_dict.items())
Output :
+----------+---+
| _1| _2|
+----------+---+
| value|003|
|valueRange|405|
+----------+---+
You can even specify your schema:
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(res_dict.items(),
                           schema=StructType(fields=[
                               StructField("key", StringType()),
                               StructField("value", StringType())]))
df.show()
Resulting in
+----------+-----+
| key|value|
+----------+-----+
| value| 003|
|valueRange| 405|
+----------+-----+
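If you would rather stay inside Spark and get one row per coating, the explode route mentioned in the question can also work, provided each array level is unwrapped in turn. A minimal sketch, assuming the column names implied by your collect() output (untested against the full schema):

from pyspark.sql.functions import explode

# one row per constituent, then one row per coating
df_flat = (df_principalBody
    .select("gtin", explode("principalBodyConstituents").alias("constituent"))
    .select("gtin", explode("constituent.coatings").alias("coating"))
    .select("gtin",
            "coating.materialType.value",
            "coating.materialType.valueRange"))
df_flat.show(truncate=False)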

Related

Nested JSON to CSV (multilevel)

I received a JSON file to transform into a CSV file.
[
    {
        "id": "132465",
        "ext_numero_commande": "4500291738L",
        "ext_line_items": [
            {
                "key_0": "10",
                "key_1": "10021405 / 531.415.110",
                "key_4": "4 Pce"
            },
            {
                "key_0": "20",
                "key_1": "10021258 / 531.370.140 / NPK-Nr. 224412",
                "key_4": "4 Pce"
            },
            {
                "key_0": "30",
                "key_1": "10020895 / 531.219.120 / NPK-Nr. 222111",
                "key_4": "10 Pce"
            },
            {
                "key_0": "40",
                "key_1": "10028633 / 552.470.110",
                "key_4": "3 Pce"
            }
        ],
        "ext_prix": [
            {
                "key_0": "11.17"
            },
            {
                "key_0": "9.01"
            },
            {
                "key_0": "18.63"
            },
            {
                "key_0": "24.15"
            }
        ],
        "ext_tag": "Date_livraison",
        "ext_jour_livraison": "23-07-2021",
        "ext_jour_livraison_1": null
    }
]
The expected CSV:
| id     | Ext_Numero_Commande | Ext_line items 0 | Ext_line items 1                         | Ext_line items 4 | Ext_Prix | Ext_Tag        | Ext_Jour_Livraison | Ext_Jour_Livraison 1 |
|--------|---------------------|------------------|------------------------------------------|------------------|----------|----------------|--------------------|----------------------|
| 132465 | 4500291738L         | 10               | 10021405 / 531.415.110                   | 4 Pce            | 11.17    | Date_livraison | 23-07-2021         |                      |
| 132465 | 4500291738L         | 20               | 10021258 / 531.370.140 / NPK-Nr. 224412  | 4 Pce            | 9.01     | Date_livraison | 23-07-2021         |                      |
| 132465 | 4500291738L         | 30               | 10020895 / 531.219.120 / NPK-Nr. 222111  | 10 Pce           | 18.63    | Date_livraison | 23-07-2021         |                      |
| 132465 | 4500291738L         | 40               | 10028633 / 552.470.110                   | 3 Pce            | 24.15    | Date_livraison | 23-07-2021         |                      |
I found the function pd.json_normalize.
df = pd.json_normalize(
    json_object[0],
    record_path=['ext_line_items'],
    meta=['id', 'ext_numero_commande', 'ext_tag',
          'ext_jour_livraison', 'ext_jour_livraison_1'])
I nearly have my end result, and I can add the last column ["ext_prix"] with the same method and a concatenation function.
Is there a function which does this automatically?
I tried the call below, but it returns an error.
df = pd.json_normalize(
    json_object[0],
    record_path=['ext_line_items', 'ext_prix'],
    meta=['id', 'ext_numero_commande', 'ext_tag',
          'ext_jour_livraison', 'ext_jour_livraison_1'])
You can solve this problem if your JSON dataset has a fixed length; the 4s hardcoded below match the four line items and four prices. Follow this way (JSON to dictionary to CSV using pandas):
# import pandas
import pandas as pd
# read the JSON file
df = pd.read_json('C://Users//jahir//Desktop//json_file.json')
# dictionary that will collect the flattened columns
create_dict = {}
# iterate over the JSON records
for index, values in df.iterrows():
    # loop over ext_line_items
    for ext_lines in df['ext_line_items'][index]:
        for item in ext_lines:
            column_name = 'ext_line_items_' + item
            if column_name not in create_dict:
                create_dict[column_name] = []
            # repeat 4 times to build the 4x4 product of line items and prices
            time_loop = 4
            while time_loop > 0:
                time_loop -= 1
                create_dict[column_name] += [ext_lines[item]]
    # loop over ext_prix, again repeated 4 times for the 4x4 product
    time_loop = 4
    while time_loop > 0:
        time_loop -= 1
        for ext_prix in df['ext_prix'][index]:
            for item in ext_prix:
                column_name = 'ext_prix_' + item
                if column_name not in create_dict:
                    create_dict[column_name] = []
                create_dict[column_name] += [ext_prix[item]]
    # add the scalar keys to the dictionary
    for i in ['id', 'ext_numero_commande', 'ext_tag', 'ext_jour_livraison']:
        create_dict[i] = []
    # repeat 16 times (the 4*4 product) for the scalar columns
    total_time = 16
    while total_time > 0:
        total_time -= 1
        for j in ['id', 'ext_numero_commande', 'ext_tag', 'ext_jour_livraison']:
            create_dict[j] += [df[j][index]]
# dictionary to DataFrame
pd_dict = pd.DataFrame.from_dict(create_dict)
# write the DataFrame to a CSV file
write_csv_path = 'C://Users//jahir//Desktop//csv_file.csv'
pd_dict.to_csv(write_csv_path, index=False, header=True)
Output:
Using pd.json_normalize you can solve this problem. Follow this way (json_normalize, then merge, then write to CSV):
import pandas as pd
data = [
    {
        "id": "132465",
        "ext_numero_commande": "4500291738L",
        "ext_line_items": [
            {
                "key_0": "10",
                "key_1": "10021405 / 531.415.110",
                "key_2": "4 Pce"
            },
            {
                "key_0": "20",
                "key_1": "10021258 / 531.370.140 / NPK-Nr. 224412",
                "key_2": "4 Pce"
            },
            {
                "key_0": "30",
                "key_1": "10020895 / 531.219.120 / NPK-Nr. 222111",
                "key_2": "10 Pce"
            },
            {
                "key_0": "40",
                "key_1": "10028633 / 552.470.110",
                "key_2": "3 Pce"
            }
        ],
        "ext_prix": [
            {
                "key_4": "11.17"
            },
            {
                "key_4": "9.01"
            },
            {
                "key_4": "18.63"
            },
            {
                "key_4": "24.15"
            }
        ],
        "ext_tag": "Date_livraison",
        "ext_jour_livraison": "23-07-2021",
        "ext_jour_livraison_1": None
    }
]
# json_normalize with record path ext_line_items
normalize_ext_line_items = pd.json_normalize(data, record_path="ext_line_items", meta=["id", "ext_numero_commande", "ext_tag", "ext_jour_livraison"])
# json_normalize with record path ext_prix
normalize_ext_prix = pd.json_normalize(data, record_path="ext_prix", meta=["id"])
# merge the two DataFrames on id
final_output = normalize_ext_line_items.merge(normalize_ext_prix, on="id", how="left")
# write the DataFrame to a CSV file (use your own path)
write_csv_path = 'C://Users//jahir//Desktop//csv_file.csv'
final_output.to_csv(write_csv_path, index=False, header=True)
Output:
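One caveat about the merge above: with four line items and four prices sharing the same id, merging on id alone produces a 4 x 4 cross join (16 rows) instead of the 4 rows in the expected table. If the two arrays are positionally aligned (the n-th price belongs to the n-th line item), a positional concat is one way around it; a sketch under that assumption:

# assumes both frames come out of json_normalize in the same row order
final_output = pd.concat(
    [normalize_ext_line_items,
     normalize_ext_prix.drop(columns="id")],
    axis=1)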

Deserialize complex JSON with pandas

I am trying to use pandas to deserialize some complex JSON that is also inconsistent, and I am struggling to get the parsing right:
{
    "STATUS": "REQUEST_OK",
    "DATA": [
        {
            "companyID": "AABBCCDD",
            "ITEMS": [
                {
                    "ind": "12345",
                    "pt": "1231",
                    "code": "E333",
                    "name": "Pop ,",
                    "RES": [
                        {
                            "i": 1,
                            "D": {
                                "e": 123674,
                                "p": "",
                                "s": "",
                                "t": 1000
                            },
                            "lot": "073",
                            "V": [
                                {
                                    "t": 6,
                                    "v": 0.1
                                }
                            ],
                            "p": 1
                        },
                        {
                        }
                    ]
                },
                {
                    "ind": "423",
                    "pt": "571",
                    "code": "E1",
                    "name": "Dam ,",
                    "RES": [
                        {
                            "i": 5,
                            "D": {
                                "e": 120751,
                                "p": "",
                                "s": "",
                                "t": 800
                            },
                            "lot": "9",
                            "V": [
                                {
                                    "t": 4543,
                                    "v": 1.33
                                }
                            ],
                            "p": 1
                        },
                        {
                        }
                    ]
                },
                {
                    "ind": "0323",
                    "pt": "123221",
                    "code": "LS",
                    "name": "Paint ,",
                    "RES": [
                        {
                            "i": 61,
                            "D": {
                                "e": 946,
                                "p": "",
                                "s": "",
                                "t": 11100
                            },
                            "lot": "8",
                            "V": [
                                {
                                    "t": 9,
                                    "v": 0.06
                                }
                            ],
                            "p": 1
                        },
                        {
                        }
                    ]
                }
            ]
        }
    ]
}
The data here is supposed to build this table
| companyID | ind   | pt   | code | name | i | e      | p | s | t    | lot | t | v   | p |
|-----------|-------|------|------|------|---|--------|---|---|------|-----|---|-----|---|
| AABBCCDD  | 12345 | 1231 | E333 | Pop  | 1 | 123674 |   |   | 1000 | 073 | 6 | 0.1 | 1 |
And so on.
The biggest pain for me is that there may be only one level of this tag
{
    "ind": "423",
    "pt": "571",
    "code": "E1",
    "name": "Dam ,",
    "RES": [
        {
but inside it, multiple
{
    "i": 61,
    "D": {
        "e": 946,
        "p": "",
        "s": "",
        "t": 11100
    },
    "lot": "8",
    "V": [
        {
            "t": 9,
            "v": 0.06
        }
    ],
    "p": 1
},
For example, i:61 means there are 61 of those JSON tags inside the first tag.
Any clues on how to parse this JSON?
Try it this way:
import pandas as pd

data = """your json above"""

key = []
value = []
new_dat = data.split(',')
for n in new_dat:
    if '{' in n:
        m = n.split('{')[1].strip()
    else:
        m = n.strip().replace('}', '').replace('\n', '')
    if ':' in m:
        key.append(m.split(':')[0])
        value.append(m.split(':')[1])
pd.DataFrame([value], columns=key)
Output:
"STATUS" "companyID" "ind" "pt" "code" "name" "i" "e" "p" "s" ... "name" "i" "e" "p" "s" "t" "lot" "t" "v" "p"
0 "REQUEST_OK" "AABBCCDD" "12345" "1231" "E333" "Pop 1 123674 "" "" ... "Paint 61 946 "" "" 11100 "8" 9 0.06 ] 1
You can then use standard pandas methods to drop unnecessary columns, etc.
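Note that splitting on commas will also split values that themselves contain commas (the name fields here are "Pop ," and so on), which is why the name column comes out mangled. A sturdier sketch is to parse the string with json and let pd.json_normalize walk the nesting; the record_path/meta spelling below is an assumption based on the structure shown, not tested against the real feed:

import json
import pandas as pd

obj = json.loads(data)  # data = the JSON above, as a string

df = pd.json_normalize(
    obj["DATA"],
    record_path=["ITEMS", "RES"],
    meta=["companyID",
          ["ITEMS", "ind"], ["ITEMS", "pt"],
          ["ITEMS", "code"], ["ITEMS", "name"]],
    errors="ignore")
# the nested D dict is flattened to D.e, D.p, D.s, D.t automatically;
# V stays a list column and can be unpacked further with df.explode("V")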

Spark: Splitting JSON strings into separate dataframe columns

I'm loading the JSON string below into a dataframe column.
{
    "title": {
        "titleid": "222",
        "titlename": "ABCD"
    },
    "customer": {
        "customerDetail": {
            "customerid": 878378743,
            "customerstatus": "ACTIVE",
            "customersystems": {
                "customersystem1": "SYS01",
                "customersystem2": null
            },
            "sysid": null
        },
        "persons": [{
            "personid": "123",
            "personname": "IIISKDJKJSD"
        },
        {
            "personid": "456",
            "personname": "IUDFIDIKJK"
        }]
    }
}
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
//Schema:
StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))
//Output:
+------------------------------+---------------------------------------+
|customerDetail |persons |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
My question: is there a way I can split the key values into separate dataframe columns like below, keeping the array columns as-is, since I need to have only one record per JSON string?
Example for customer column:
customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customerid,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }
Edited post:
val df = spark.read.json("your/path/data.json")

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

val fields = collectFields("", df.schema).map(_.tail)
df.select(fields.map(col): _*).show(false)
Output :
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE |SYS01 |null |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222 |ABCD |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
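For anyone doing the same thing from PySpark, a rough port of the recursive field collection above might look like this (a sketch; it assumes no field names contain dots):

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def collect_fields(prefix, dtype):
    # recurse into structs, returning dotted paths to all leaf fields
    if isinstance(dtype, StructType):
        paths = []
        for f in dtype.fields:
            paths += collect_fields(prefix + "." + f.name, f.dataType)
        return paths
    return [prefix]

fields = [p[1:] for p in collect_fields("", df.schema)]  # drop leading dot
df.select([col(p) for p in fields]).show(truncate=False)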
You can also try this with the help of RDDs: define the column names in an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.

How to insert an empty object into JSON using Circe?

I'm getting a JSON object over the network, as a String. I'm then using Circe to parse it. I want to add a handful of fields to it, and then pass it on downstream.
Almost all of that works.
The problem is that my "adding" is really "overwriting". That's actually ok, as long as I add an empty object first. How can I add such an empty object?
So looking at the code below, I am overwriting "sometimes_empty: {}" and it works. But because sometimes_empty is not always empty, it results in some data loss. I'd like to add a field like "custom: {}" and then overwrite the value of custom with my existing code.
Two StackOverflow posts were helpful. One worked, but wasn't quite what I was looking for. The other I couldn't get to work.
1: Modifying a JSON array in Scala with circe
2: Adding field to a json using Circe
val js: String = """
{
  "id": "19",
  "type": "Party",
  "field": {
    "id": 1482,
    "name": "Anne Party",
    "url": "https"
  },
  "sometimes_empty": {
  },
  "bool": true,
  "timestamp": "2018-12-18T11:39:18Z"
}
"""
val newJson = parse(js).toOption
  .flatMap { doc =>
    doc.hcursor
      .downField("sometimes_empty")
      .withFocus(_ =>
        Json.fromFields(
          Seq(
            ("myUrl", Json.fromString(myUrl)),
            ("valueZ", Json.fromString(valueZ)),
            ("valueQ", Json.fromString(valueQ)),
            ("balloons", Json.fromString(balloons))
          )
        )
      )
      .top
  }
newJson match {
  case Some(v) => return v.toString
  case None    => println("Failure!")
}
We need to do a couple of things. First, we zoom in on the specific property we want to update; if it doesn't exist, we create a new empty one. Then we turn the zoomed-in property from a Json into a JsonObject so that we can modify it using the +: method. Once that's done, we take the updated property and re-introduce it into the original parsed JSON to get the complete result:
import io.circe.{Json, JsonObject, parser}
import io.circe.syntax._

object JsonTest {
  def main(args: Array[String]): Unit = {
    val js: String =
      """
        |{
        |  "id": "19",
        |  "type": "Party",
        |  "field": {
        |    "id": 1482,
        |    "name": "Anne Party",
        |    "url": "https"
        |  },
        |  "bool": true,
        |  "timestamp": "2018-12-18T11:39:18Z"
        |}
      """.stripMargin

    val maybeAppendedJson =
      for {
        json <- parser.parse(js).toOption
        sometimesEmpty <- json.hcursor
          .downField("sometimes_empty")
          .focus
          .orElse(Option(Json.fromJsonObject(JsonObject.empty)))
        jsonObject <- json.asObject
        emptyFieldJson <- sometimesEmpty.asObject
        appendedField = emptyFieldJson.+:("added", Json.fromBoolean(true))
        res = jsonObject.+:("sometimes_empty", appendedField.asJson)
      } yield res

    maybeAppendedJson.foreach(obj => println(obj.asJson.spaces2))
  }
}
Yields:
{
  "id" : "19",
  "type" : "Party",
  "field" : {
    "id" : 1482,
    "name" : "Anne Party",
    "url" : "https"
  },
  "sometimes_empty" : {
    "added" : true,
    "someProperty" : true
  },
  "bool" : true,
  "timestamp" : "2018-12-18T11:39:18Z"
}

Trying to turn a data frame into hierarchical json array using jsonlite in r

I'm trying to get my super-simple data frame into something a little more useful - a json array in this case.
My data looks like
| V1 | V2 | V3 | V4 | V5 |
|-----------|-----------|-----------|-----------|-----------|
| 717374788 | 694405490 | 606978836 | 578345907 | 555450273 |
| 429700970 | 420694891 | 420694211 | 420792447 | 420670045 |
and I want it to look like
[
  {
    "V1": {
      "id": 717374788
    },
    "results": [
      {
        "id": 694405490
      },
      {
        "id": 606978836
      },
      {
        "id": 578345907
      },
      {
        "id": 555450273
      }
    ]
  },
  {
    "V1": {
      "id": 429700970
    },
    "results": [
      {
        "id": 420694891
      },
      {
        "id": 420694211
      },
      {
        "id": 420792447
      },
      {
        "id": 420670045
      }
    ]
  }
]
Any thoughts on how I can make that happen?
Thanks for your help!
Your data.frame cannot be directly written into that format.
To get the desired JSON, you first need to turn your data.frame into this structure:
list(
  list(V1=list(id=<num>),
       results=list(
         list(id=<num>),
         list(id=<num>),
         ...)),
  ...)
Here's a way to apply the transformation to your example data:
library(jsonlite)
# recreate your data.frame
DF <-
data.frame(V1=c(717374788,429700970),
V2=c(694405490, 420694891),
V3=c(606978836,420694211),
V4=c(578345907,420792447),
V5=c(555450273,420670045))
# transform the data.frame into the described structure
idsIndexes <- which(names(DF) != 'V1')
a <- lapply(1:nrow(DF),FUN=function(i){
list(V1=list(id=DF[i,'V1']),
results=lapply(idsIndexes,
FUN=function(j)list(id=DF[i,j])))
})
# serialize to json
txt <- toJSON(a)
# if you want, indent the json
txt <- prettify(txt)