Extract columns from a JSON string and write into a dataframe

I have the following string:
{
  "code": 4,
  "results": [
    {
      "requests": 100,
      "requests_country": 291,
      "listing": {
        "first": 1,
        "second": 2
      }
    },
    {
      "requests": 200,
      "requests_country": 292,
      "listing": {
        "first": 10,
        "second": 220
      }
    }
  ]
}
I would like to extract certain values in order to create a dataframe.
This is the desired output:
+--------+----------------+
|requests|requests_country|
+--------+----------------+
|     100|             291|
|     200|             292|
+--------+----------------+
I tried a lot of methods but none of them work.
I tried converting to a map and then using parse to extract the results, but I keep getting errors.

Make use of explode_outer
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.SparkSession
object Main extends App {
  val spark = SparkSession.builder
    .master("local")
    .appName("Spark app")
    .getOrCreate()

  val df = spark.read.json("src/main/resources/file.json")
  df.show()
  //+----+--------------------+
  //|code|             results|
  //+----+--------------------+
  //|   4|[{{1, 2}, 100, 29...|
  //+----+--------------------+

  val df1 = df.select(explode_outer(col("results")))
  df1.show()
  //+--------------------+
  //|                 col|
  //+--------------------+
  //|  {{1, 2}, 100, 291}|
  //|{{10, 220}, 200, ...|
  //+--------------------+

  val df2 = df1.select(col("col.requests"), col("col.requests_country"))
  df2.show()
  //+--------+----------------+
  //|requests|requests_country|
  //+--------+----------------+
  //|     100|             291|
  //|     200|             292|
  //+--------+----------------+
}
file.json
{"code": 4, "results": [{"requests": 100, "requests_country": 291, "listing": {"first": 1, "second": 2}}, {"requests": 200, "requests_country": 292, "listing": {"first": 10, "second": 220 }}]}

How to bring json format to relational form?

This is my code:
%spark.pyspark
df_principalBody = spark.sql("""
    SELECT
        gtin
        , principalBodyConstituents
        --, principalBodyConstituents.coatings.materialType.value
    FROM
        v_df_source""")
df_principalBody.createOrReplaceTempView("v_df_principalBody")
df_principalBody.collect()
And this is the output:
[Row(gtin='7617014161936', principalBodyConstituents=[Row(coatings=[Row(materialType=Row(value='003', valueRange='405')
How can I read the value and valueRange fields in relational format?
I tried with explode and flatten, but it did not work.
Part of my json:
{
  "gtin": "7617014161936",
  "timePeriods": [
    {
      "fractionData": {
        "principalBody": {
          "constituents": [
            {
              "coatings": [
                {
                  "materialType": {
                    "value": "003",
                    "valueRange": "405"
                  },
                  "percentage": 0.1
                }
              ],
              ...
You can use the parsed dict's .items() method to list its key/value pairs.
I used part of your JSON, as below:
str1 = """{"gtin": "7617014161936","timePeriods": [{"fractionData": {"principalBody": {"constituents": [{"coatings": [
{
"materialType": {
"value": "003",
"valueRange": "405"
},
"percentage": 0.1
}
]}]}}}]}"""
import json
res = json.loads(str1)
res_dict = res['timePeriods'][0]['fractionData']['principalBody']['constituents'][0]['coatings'][0]['materialType']
df = spark.createDataFrame(data=res_dict.items())
Output:
+----------+---+
|        _1| _2|
+----------+---+
|     value|003|
|valueRange|405|
+----------+---+
You can even specify your schema:
from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    res_dict.items(),
    schema=StructType(fields=[
        StructField("key", StringType()),
        StructField("value", StringType())]))
df.show()
Resulting in:
+----------+-----+
|       key|value|
+----------+-----+
|     value|  003|
|valueRange|  405|
+----------+-----+

Nested Json extract the value with unknown key in the middle

I have a JSON column (colJson) in a dataframe like this:
{
  "a": "value1",
  "b": "value1",
  "c": true,
  "details": {
    "qgiejfkfk123": {   // unknown value
      "model1": {
        "score": 0.531,
        "version": "v1"
      },
      "model2": {
        "score": 0.840,
        "version": "v2"
      },
      "other_details": {
        "decision": false,
        "version": "v1"
      }
    }
  }
}
Here 'qgiejfkfk123' is a dynamic value that changes with each row. However, I need to extract model1.score as well as model2.score.
I tried
sourceDf.withColumn("model1_score", get_json_object(col("colJson"), "$.details.*.model1.score").cast(DoubleType))
  .withColumn("model2_score", get_json_object(col("colJson"), "$.details.*.model2.score").cast(DoubleType))
but it did not work.
I managed to get your solution by using from_json, parsing the dynamic value as a Map<String, Struct> and exploding the values from it:
import org.apache.spark.sql.functions.{col, explode, from_json, lit}

val schema = "STRUCT<`details`: MAP<STRING, STRUCT<`model1`: STRUCT<`score`: DOUBLE, `version`: STRING>, `model2`: STRUCT<`score`: DOUBLE, `version`: STRING>, `other_details`: STRUCT<`decision`: BOOLEAN, `version`: STRING>>>>"
val fromJsonDf = sourceDf.withColumn("colJson", from_json(col("colJson"), lit(schema)))
val explodeDf = fromJsonDf.select(col("*"), explode(col("colJson.details")))
// +----------------------------------------------------------+------------+--------------------------------------+
// |colJson |key |value |
// +----------------------------------------------------------+------------+--------------------------------------+
// |{{qgiejfkfk123 -> {{0.531, v1}, {0.84, v2}, {false, v1}}}}|qgiejfkfk123|{{0.531, v1}, {0.84, v2}, {false, v1}}|
// +----------------------------------------------------------+------------+--------------------------------------+
val finalDf = explodeDf.select(col("value.model1.score").as("model1_score"), col("value.model2.score").as("model2_score"))
// +------------+------------+
// |model1_score|model2_score|
// +------------+------------+
// | 0.531| 0.84|
// +------------+------------+
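If you would rather avoid explode and its extra key column, here is a sketch of an alternative that simply takes the first map value, assuming "details" always holds exactly one dynamic key (map_values and element_at require Spark 2.4+):
import org.apache.spark.sql.functions.{element_at, map_values}

// grab the single value of the dynamic map, then drill into the structs
val scoresDf = fromJsonDf
  .select(element_at(map_values(col("colJson.details")), 1).as("value"))
  .select(col("value.model1.score").as("model1_score"),
          col("value.model2.score").as("model2_score"))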

Writing Nested JSON Dictionary List To CSV

Issue
I'm trying to write the following nested list of dictionaries, each of which contains further lists of dictionaries, to CSV. I tried multiple ways but I cannot get it to write properly:
Json Data
[
  {
    "Basic_Information_Source": [
      {
        "Image": "image1.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 574,
        "Image_Height": 262,
        "Image_Size": 277274
      }
    ],
    "Basic_Information_Destination": [
      {
        "Image": "image1_dst.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 574,
        "Image_Height": 262,
        "Image_Size": 277539
      }
    ],
    "Values": [
      {
        "Value1": 75.05045463635267,
        "Value2": 0.006097560975609756,
        "Value3": 0.045083481733371615,
        "Value4": 0.008639858263904898
      }
    ]
  },
  {
    "Basic_Information_Source": [
      {
        "Image": "image2.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 1600,
        "Image_Height": 1066,
        "Image_Size": 1786254
      }
    ],
    "Basic_Information_Destination": [
      {
        "Image": "image2_dst.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 1600,
        "Image_Height": 1066,
        "Image_Size": 1782197
      }
    ],
    "Values": [
      {
        "Value1": 85.52662890580055,
        "Value2": 0.0005464352720450282,
        "Value3": 0.013496113910369758,
        "Value4": 0.003800236380811839
      }
    ]
  }
]
Working Code
I tried to use the following code and it works, but it only saves the headers and then dumps each underlying list as text in the CSV file:
import json
import csv

def Convert_CSV():
    ar_enc_file = open('analysis_results_enc.json', 'r')
    json_data = json.load(ar_enc_file)
    keys = json_data[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(json_data)
    ar_enc_file.close()

Convert_CSV()
Working Output / Issue with it
The output writes the following header:
Basic_Information_Source
Basic_Information_Destination
Values
And then it dumps all other data inside each header as a list like this:
[{'Image': 'image1.png', 'Image_Format': 'PNG', 'Image_Mode': 'RGB', 'Image_Width': 574, 'Image_Height': 262, 'Image_Size': 277274}]
Expected Output / Sample
Trying to generate the above type of output for each dictionary in the array of dictionaries.
How do I write it properly?
I'm sure someone will come by with a much more elegant solution. That being said:
You have a few problems:
You have inconsistent entries with the fields you want to align.
Even if you pad your data, you have intermediate lists that need to be flattened out.
Then you still have separated data that needs to be merged together.
DictWriter, AFAIK, expects its data in the format [{'column': 'entry'}, {'column': 'entry'}, ...], so even if you do all the previous steps you're still not in the right format.
So let's get started.
For the first two parts, we can combine the fixes.
def pad_list(lst, size, padding=None):
    # we wouldn't have to make a copy but I prefer to
    # avoid the possibility of getting bitten by mutability
    _lst = lst[:]
    for _ in range(len(lst), size):
        _lst.append(padding)
    return _lst

# this expects already parsed json data
def flatten(json_data):
    lst = []
    for dct in json_data:
        # here we're just setting a max size of all dict entries
        # this is in case the shorter entry is in the first iteration
        max_size = 0
        # we initialize a dict for each of the list entries
        # this is in case you have inconsistent lengths between lists
        flattened = dict()
        for k, v in dct.items():
            entries = list(next(iter(v), dict()).values())
            flattened[k] = entries
            max_size = max(len(entries), max_size)
        # here we append the padded version of the keys for the dict
        lst.append({k: pad_list(v, max_size) for k, v in flattened.items()})
    return lst
So now we have a flattened list of dicts whose values are lists of consistent length. Essentially:
[
  {
    "Basic_Information_Source": [
      "image1.png",
      "PNG",
      "RGB",
      574,
      262,
      277274
    ],
    "Basic_Information_Destination": [
      "image1_dst.png",
      "PNG",
      "RGB",
      574,
      262,
      277539
    ],
    "Values": [
      75.05045463635267,
      0.006097560975609756,
      0.045083481733371615,
      0.008639858263904898,
      None,
      None
    ]
  }
]
But this list has multiple dicts that need to be merged, not just one.
So we need to merge.
# this should be self explanatory
def merge(flattened):
    merged = dict()
    for dct in flattened:
        for k, v in dct.items():
            if k not in merged:
                merged[k] = []
            merged[k].extend(v)
    return merged
This gives us something close to this:
{
  "Basic_Information_Source": [
    "image1.png",
    "PNG",
    "RGB",
    574,
    262,
    277274,
    "image2.png",
    "PNG",
    "RGB",
    1600,
    1066,
    1786254
  ],
  "Basic_Information_Destination": [
    "image1_dst.png",
    "PNG",
    "RGB",
    574,
    262,
    277539,
    "image2_dst.png",
    "PNG",
    "RGB",
    1600,
    1066,
    1782197
  ],
  "Values": [
    75.05045463635267,
    0.006097560975609756,
    0.045083481733371615,
    0.008639858263904898,
    None,
    None,
    85.52662890580055,
    0.0005464352720450282,
    0.013496113910369758,
    0.003800236380811839,
    None,
    None
  ]
}
But wait, we still need to format it for the writer.
Our data needs to be in the format [{'column_1': 'entry', 'column_2': 'entry'}, {'column_1': 'entry', 'column_2': 'entry'}, ...].
So we format:
def format_for_writer(merged):
    formatted = []
    for k, v in merged.items():
        for i, item in enumerate(v):
            # on the first pass this will append an empty dict
            # on subsequent passes it will be ignored
            # and add keys into the existing dict
            if i >= len(formatted):
                formatted.append(dict())
            formatted[i][k] = item
    return formatted
So finally, we have a nice clean formatted data structure we can just hand to our writer function.
def convert_csv(formatted):
    keys = formatted[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(formatted)
Full code with json string:
import json
import csv

json_raw = """\
[
  {
    "Basic_Information_Source": [
      {
        "Image": "image1.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 574,
        "Image_Height": 262,
        "Image_Size": 277274
      }
    ],
    "Basic_Information_Destination": [
      {
        "Image": "image1_dst.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 574,
        "Image_Height": 262,
        "Image_Size": 277539
      }
    ],
    "Values": [
      {
        "Value1": 75.05045463635267,
        "Value2": 0.006097560975609756,
        "Value3": 0.045083481733371615,
        "Value4": 0.008639858263904898
      }
    ]
  },
  {
    "Basic_Information_Source": [
      {
        "Image": "image2.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 1600,
        "Image_Height": 1066,
        "Image_Size": 1786254
      }
    ],
    "Basic_Information_Destination": [
      {
        "Image": "image2_dst.png",
        "Image_Format": "PNG",
        "Image_Mode": "RGB",
        "Image_Width": 1600,
        "Image_Height": 1066,
        "Image_Size": 1782197
      }
    ],
    "Values": [
      {
        "Value1": 85.52662890580055,
        "Value2": 0.0005464352720450282,
        "Value3": 0.013496113910369758,
        "Value4": 0.003800236380811839
      }
    ]
  }
]
"""

def pad_list(lst, size, padding=None):
    _lst = lst[:]
    for _ in range(len(lst), size):
        _lst.append(padding)
    return _lst

def flatten(json_data):
    lst = []
    for dct in json_data:
        max_size = 0
        flattened = dict()
        for k, v in dct.items():
            entries = list(next(iter(v), dict()).values())
            flattened[k] = entries
            max_size = max(len(entries), max_size)
        lst.append({k: pad_list(v, max_size) for k, v in flattened.items()})
    return lst

def merge(flattened):
    merged = dict()
    for dct in flattened:
        for k, v in dct.items():
            if k not in merged:
                merged[k] = []
            merged[k].extend(v)
    return merged

def format_for_writer(merged):
    formatted = []
    for k, v in merged.items():
        for i, item in enumerate(v):
            if i >= len(formatted):
                formatted.append(dict())
            formatted[i][k] = item
    return formatted

def convert_csv(formatted):
    keys = formatted[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(formatted)

def main():
    json_data = json.loads(json_raw)
    flattened = flatten(json_data)
    merged = merge(flattened)
    formatted = format_for_writer(merged)
    convert_csv(formatted)

if __name__ == '__main__':
    main()
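For reference, running main() should produce a test.csv along these lines (first rows shown; None is written as an empty field):
Basic_Information_Source,Basic_Information_Destination,Values
image1.png,image1_dst.png,75.05045463635267
PNG,PNG,0.006097560975609756
RGB,RGB,0.045083481733371615
574,574,0.008639858263904898
262,262,
...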

Spark: Splitting JSON strings into separate dataframe columns

I'm loading the below JSON string into a dataframe column.
{
  "title": {
    "titleid": "222",
    "titlename": "ABCD"
  },
  "customer": {
    "customerDetail": {
      "customerid": 878378743,
      "customerstatus": "ACTIVE",
      "customersystems": {
        "customersystem1": "SYS01",
        "customersystem2": null
      },
      "sysid": null
    },
    "persons": [
      {
        "personid": "123",
        "personname": "IIISKDJKJSD"
      },
      {
        "personid": "456",
        "personname": "IUDFIDIKJK"
      }
    ]
  }
}
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
//Schema:
StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))
//Output:
+------------------------------+---------------------------------------+
|customerDetail                |persons                                |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+
My question: Is there a way I can split the keys into separate dataframe columns like below,
keeping the array columns as-is, since I need to have only one record per JSON string?
Example for customer column:
customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }
Edited post:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

val fields = collectFields("", df.schema).map(_.tail)
df.select(fields.map(col): _*).show(false)
Output:
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE |SYS01 |null |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222 |ABCD |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
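If you need persons kept as a JSON string rather than an array (as in your expected output), a sketch using to_json on top of the flattened select:
import org.apache.spark.sql.functions.to_json

// serialize the array-of-structs column back into a JSON string
val withJsonPersons = df.select(fields.map(col): _*)
  .withColumn("persons", to_json(col("persons")))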
You can also try this with the help of RDDs: define the column names for an empty RDD, then read the JSON, convert it to a DataFrame with .toDF(), and iterate it into the empty RDD.
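That description is fairly loose; the column-renaming part of the idea might look like this sketch, reusing fields from the answer above (the target column names here are hypothetical):
// flatten with the collected field paths, then rename via toDF
val names = Seq("customerid", "customerstatus", "customersystem1",
  "customersystem2", "sysid", "persons", "titleid", "titlename")
val renamed = df.select(fields.map(col): _*).toDF(names: _*)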

How to parse a JSON string into separate records?

I wrote a method to concatenate JSON values.
def mergeSales(storeJValue: JValue): String = {
  val salesJValue: JValue = parse(rawJson)
  val store = compact(render(storeJValue))
  val sales = compact(render(salesJValue))
  val mergedSales: String = s"""{"store":$store,"sales":$sales}"""
  mergedSales
}
As a result I'm getting strings like this, a store with an array of corresponding sales:
{"store":{"store_id":"01","name":"Store_1"}, "sales":[{"saleId": 10, "name": "New name1", "saleType": "New Type1"}, {"saleId": 20, "name": "Some name1", "saleType": "SomeType5"}, {"saleId": 30, "name": "Some name3", "saleType": "SomeType3"}]}
How should I parse it to get a list of records where the same store is mapped to each sale from the array? I want it to look like this:
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 10, "name": "New name1", "saleType": "New Type1"}}
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 20, "name": "New name2", "saleType": "New Type2"}}
{"store":{"store_id":"01","name":"Store_1"}, "sale":{"saleId": 30, "name": "Some name3", "saleType": "SomeType3"}}
Sales have a huge number of fields in reality, so creating a case class would be rather complex.
I think the best way is to use the json4s API, which will parse all of your JSON and convert it into structures you can easily traverse.
You are required to create case classes:
case class Store(store_id: String, name: String)
case class Sale(saleId: String, name: String, saleType: String)
case class Result(store: Store, sale: Sale)
case class SaleStore(store: Store, sales: List[Sale])
Then it is very straightforward to get the solution using json4s:
val str =
"""{
| "store": {
| "store_id": "01",
| "name": "Store_1"
| },
| "sales": [
| {
| "saleId": 10,
| "name": "New name1",
| "saleType": "New Type1"
| },
| {
| "saleId": 20,
| "name": "Some name1",
| "saleType": "SomeType5"
| },
| {
| "saleId": 30,
| "name": "Some name3",
| "saleType": "SomeType3"
| }
| ]
|}""".stripMargin
import org.json4s._
import org.json4s.jackson.JsonMethods._
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

implicit val formats = org.json4s.DefaultFormats

val saleStore = parse(str).extract[SaleStore]
val result = saleStore.sales.map(sale => saleStore.store -> sale)

val mapper: ObjectMapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
result.map(r => mapper.writeValueAsString(Result(r._1, r._2))).foreach(println)
Output:
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"10","name":"New name1","saleType":"New Type1"}}
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"20","name":"Some name1","saleType":"SomeType5"}}
{"store":{"store_id":"01","name":"Store_1"},"sale":{"saleId":"30","name":"Some name3","saleType":"SomeType3"}}