Spark Scala nested JSON stored as structured table

I have a huge number of nested JSON documents, each with more than 200 keys, that I want to convert and store in a structured table.
|-- ip_address: string (nullable = true)
|-- xs_latitude: double (nullable = true)
|-- Applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b_als_o_isehp: string (nullable = true)
| | |-- b_als_p_isehp: string (nullable = true)
| | |-- b_als_s_isehp: string (nullable = true)
| | |-- l_als_o_eventid: string (nullable = true)
....
I read the JSON; each ip_address has one Applications array:
{"ip_address": 1512199720, "Applications": [{"s_pd": -1, "s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"}, {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"}]}
val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data",explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
/// gives
root
|-- s_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- s_path: array (nullable = true)
| |-- element: string (containsNull = true)
|-- p_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ip_address: string (nullable = true)
Each DataFrame record contains arrays with duplicated values. How do I get the records in table format?

Mistake
Your mistake is that you are selecting the nested fields from the original Applications array column instead of from the exploded column.
Solution
You have to select from the exploded column, which is data:
var appDf = data.withColumn("data",explode($"Applications"))
.select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")
and you should get
+----------+----+---------+-----+
|ip_address|s_pd|s_path   |p_pd |
+----------+----+---------+-----+
|1512199720|-1  |NA       |temp0|
|1512199720|-1  |root/hdfs|temp1|
|1512199720|-1  |root/hdfs|temp2|
+----------+----+---------+-----+
I hope the answer is helpful
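Since the real data has more than 200 keys, listing every field by hand gets unwieldy. A minimal follow-up sketch (same idea, names taken from the example above): selecting data.* pulls every field of the exploded struct at once:
val appDfAll = data.withColumn("data", explode($"Applications"))
  .select($"ip_address", $"data.*")
appDfAll.show()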

Related

Transform a list of JSON strings to a list of dicts in PySpark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a udf or an rdd.
I have this kind of dataframe:
Key    | JSON_string
123456 | ["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]
789123 | ["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]
How can I transform col(JSON_string), using only built-in PySpark functions, into [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}]?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads, and json.dump, but none of them gave the expected result.
Thank you for your help
Explode your JSON_string column, parse each string as JSON, then group by again:
from pyspark.sql import functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))
# infer the JSON schema from the exploded strings
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema
df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))
df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key   |JSON                                            |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}]   |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)

Parse and write JSON Lines format file

I have a file in JSON Lines format with the following content:
[1, "James", 21, "M", "2016-04-07 10:25:09"]
[2, "Liz", 25, "F", "2017-05-07 20:25:09"]
...
Each line is a JSON array string, and the field types are: integer, string, integer, string, string. How do I convert it to a DataFrame with the following schema?
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
On the contrary, if I have a DataFrame with the above schema, how to generate a file like the above JSON Lines format?
Assuming your file does not have a header line, this is one way to create a DataFrame from it, but I'd expect there to be a better option.
from pyspark.sql import functions as F

df = spark.read.text("file_jsonlines")
# strip the surrounding brackets, then split the remaining text on commas
c = F.split(F.regexp_extract('value', r'\[(.*)\]', 1), ',')
df = df.select(
    c[0].cast('int').alias('id'),
    c[1].alias('name'),
    c[2].cast('int').alias('age'),
    c[3].alias('gender'),
    c[4].alias('time'),
)
+---+--------+---+------+----------------------+
|id |name    |age|gender|time                  |
+---+--------+---+------+----------------------+
|1  | "James"|21 | "M"  | "2016-04-07 10:25:09"|
|2  | "Liz"  |25 | "F"  | "2017-05-07 20:25:09"|
+---+--------+---+------+----------------------+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
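The second half of the question, writing a DataFrame with this schema back out in the same JSON-array-per-line format, is not covered above. A minimal Scala sketch (the question does not specify a language), assuming the clean schema above, that the string fields contain nothing needing JSON escaping, and a hypothetical output path:
import org.apache.spark.sql.functions._

// build each output line as a JSON array string by hand
val line = concat(
  lit("["), col("id").cast("string"),
  lit(", \""), col("name"), lit("\", "),
  col("age").cast("string"),
  lit(", \""), col("gender"), lit("\", \""), col("time"), lit("\"]")
)
// write.text produces a directory of part files, one line per row
df.select(line.alias("value")).write.text("output_jsonlines")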

Extract DataFrame from nested, tagged array in Spark

I'm using Spark to read JSON documents of the following form:
{
  "items": [
    {"type": "foo", "value": 1},
    {"type": "bar", "value": 2}
  ]
}
That is, the array items are tagged by the "type" column.
Given that I know the vocabulary of "type" (i.e. {foo, bar}), how do I get a dataframe out like so:
root
|-- bar: integer (nullable = true)
|-- foo: integer (nullable = true)
You can manually curate the schema as below:
>>> df2 = df.selectExpr("array(struct(items[0].value as foo, items[1].value as bar)) as items")
>>> df2.printSchema()
root
|-- items: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- foo: long (nullable = true)
| | |-- bar: long (nullable = true)
Or a slightly more general approach using filter:
>>> df2 = df.selectExpr("array(struct(filter(items, x -> x.type = 'foo')[0].value as foo, filter(items, x -> x.type = 'bar')[0].value as bar)) as items")
>>> df2.printSchema()
root
|-- items: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- foo: long (nullable = true)
| | |-- bar: long (nullable = true)
Or using pivot:
>>> from pyspark.sql.functions import expr, first
>>> df2 = df.select(expr("inline_outer(items)")).groupBy().pivot("type").agg(
...     first("value")
... )
>>> df2.printSchema()
root
|-- bar: integer (nullable = true)
|-- foo: integer (nullable = true)
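For reference, the same pivot approach as a hedged sketch in Scala, assuming df is the DataFrame read from the JSON above:
import org.apache.spark.sql.functions.{expr, first}

// explode the tagged structs into (type, value) rows, then pivot type into columns
val df2 = df.select(expr("inline_outer(items)"))
  .groupBy()
  .pivot("type")
  .agg(first("value"))
df2.printSchema()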

Flattening a json file using Spark and Scala

I have a json file like this:
{
  "Item Version" : 1.0,
  "Item Creation Time" : "2019-04-14 14:15:09",
  "Trade Dictionary" : {
    "Country" : "India",
    "TradeNumber" : "1",
    "action" : {
      "Action1" : false,
      "Action2" : true,
      "Action3" : false
    },
    "Value" : "XXXXXXXXXXXXXXX",
    "TradeRegion" : "Global"
  },
  "Prod" : {
    "Type" : "Driver",
    "Product Dic" : { },
    "FX Legs" : [ {
      "Spot Date" : "2019-04-16",
      "Value" : true
    } ]
  },
  "Payments" : {
    "Payment Details" : [ {
      "Payment Date" : "2019-04-11",
      "Payment Type" : "Rej"
    } ]
  }
}
I need a table in below format:
Version|Item Creation Time|Country|TradeNumber|Action1|Action2|Action3|Value |TradeRegion|Type|Product Dic|Spot Date |Value|Payment Date|Payment Type |
1 |2019-04-14 14:15 | India| 1 | false| true | false |xxxxxx|Global |Driver|{} |2019-04-16 |True |2019-11-14 |Rej
So it should just iterate over each key-value pair, using the key as the column name and its values as the table values.
My current code:
val data2 = data.withColumn("vars", explode(array($"Product")))
  .withColumn("subs", explode($"vars.FX Legs"))
  .select($"vars.*", $"subs.*")
The problem here is that I have to provide the column names myself. Is there any way to make this more generic?
Since you have array and struct columns mixed together at multiple levels, it is not that simple to create a general solution. The main problem is that the explode function must be applied to every array column, one column at a time.
The simplest solution I can come up with uses recursion to check for any struct or array columns. If there are any, those are flattened and then we check again (after flattening there can be additional columns that are arrays or structs, hence the complexity). The flattenStruct part is from here.
Code:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.StructType

// Recursively turn nested struct fields into a flat list of column references
def flattenStruct(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStruct(st, colName)
      case _ => Array(col(colName))
    }
  })
}

// Alternately flatten structs and explode arrays until neither remains
def flattenSchema(df: DataFrame): DataFrame = {
  val structExists = df.schema.fields.filter(_.dataType.typeName == "struct").size > 0
  val arrayCols = df.schema.fields.filter(_.dataType.typeName == "array").map(_.name)
  if (structExists) {
    flattenSchema(df.select(flattenStruct(df.schema): _*))
  } else if (arrayCols.size > 0) {
    val newDF = arrayCols.foldLeft(df) {
      (tempDf, colName) => tempDf.withColumn(colName, explode(col(colName)))
    }
    flattenSchema(newDF)
  } else {
    df
  }
}
Running the above method on the input dataframe:
flattenSchema(data)
will give a dataframe with the following schema:
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payment Date: string (nullable = true)
|-- Payment Type: string (nullable = true)
|-- Spot Date: string (nullable = true)
|-- Value: boolean (nullable = true)
|-- Product Dic: string (nullable = true)
|-- Type: string (nullable = true)
|-- Country: string (nullable = true)
|-- TradeNumber: string (nullable = true)
|-- TradeRegion: string (nullable = true)
|-- Value: string (nullable = true)
|-- Action1: boolean (nullable = true)
|-- Action2: boolean (nullable = true)
|-- Action3: boolean (nullable = true)
To keep the prefix of the struct columns in the name of the new columns, you only need to adjust the last case in the flattenStruct function:
case _ => Array(col(colName).as(colName.replace(".", "_")))
Use the explode function to flatten DataFrames that contain arrays. Here is an example:
val df = spark.read.json(Seq(json).toDS.rdd)
df.show(10, false)
df.printSchema
df: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 3 more fields]
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|Item Creation Time |Item Version|Payments                        |Prod                                    |Trade Dictionary                                   |
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|2019-04-14 14:15:09|1.0         |[WrappedArray([2019-04-11,Rej])]|[WrappedArray([2019-04-16,true]),Driver]|[India,1,Global,XXXXXXXXXXXXXXX,[false,true,false]]|
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payments: struct (nullable = true)
| |-- Payment Details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Payment Date: string (nullable = true)
| | | |-- Payment Type: string (nullable = true)
|-- Prod: struct (nullable = true)
| |-- FX Legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Spot Date: string (nullable = true)
| | | |-- Value: boolean (nullable = true)
| |-- Type: string (nullable = true)
|-- Trade Dictionary: struct (nullable = true)
| |-- Country: string (nullable = true)
| |-- TradeNumber: string (nullable = true)
| |-- TradeRegion: string (nullable = true)
| |-- Value: string (nullable = true)
| |-- action: struct (nullable = true)
| | |-- Action1: boolean (nullable = true)
| | |-- Action2: boolean (nullable = true)
| | |-- Action3: boolean (nullable = true)
val flat = df
  .select($"Item Creation Time", $"Item Version", explode($"Payments.Payment Details") as "row")
  .select($"Item Creation Time", $"Item Version", $"row.*")
flat.show
flat: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 2 more fields]
+-------------------+------------+------------+------------+
| Item Creation Time|Item Version|Payment Date|Payment Type|
+-------------------+------------+------------+------------+
|2019-04-14 14:15:09|         1.0|  2019-04-11|         Rej|
+-------------------+------------+------------+------------+
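The same pattern applies to the other nested array. A hedged sketch for Prod.FX Legs, assuming the space-containing column paths resolve the same way as in the snippet above:
val fxFlat = df
  .select($"Item Creation Time", $"Item Version", explode($"Prod.FX Legs") as "leg")
  .select($"Item Creation Time", $"Item Version", $"leg.*")
fxFlat.show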
This can also be achieved very easily using a library named JFlat: https://github.com/opendevl/Json2Flat.
String str = new String(Files.readAllBytes(Paths.get("/path/to/source/file.json")));
JFlat flatMe = new JFlat(str);
//get the 2D representation of JSON document
List<Object[]> json2csv = flatMe.json2Sheet().getJsonAsSheet();
//write the 2D representation in csv format
flatMe.write2csv("/path/to/destination/file.csv");

How to parse nested JSON objects in spark sql?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)
Assuming you read in a json file and print the schema you are showing us like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested fields inside the array of structs like so (note that field access on an array of structs returns arrays):
DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();
DataFrame appName = df.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
Try this:
val nameAndAddress = sqlContext.sql("""
SELECT name, address.city, address.state
FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
Have you tried doing it straight from the SQL query, like:
Select apps.Ratings from yourTableName
This will probably return an array and you can more easily access the elements inside.
Also, I use this online Json viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/
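If you need one row per rating instead of arrays, a hedged Scala sketch using explode, assuming df is a DataFrame with the schema shown in the question:
import org.apache.spark.sql.functions.{col, explode}

// one row per app, then one row per rating
val ratings = df
  .select(col("id"), explode(col("apps")).as("app"))
  .select(col("id"), col("app.appName"), explode(col("app.Ratings")).as("r"))
  .select(col("id"), col("appName"), col("r.date"), col("r.rating"))
ratings.show()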
I am using pyspark, but the logic should be similar.
I found this way of parsing my nested json useful:
df.select(df.apps.appName.alias("apps_Name"), \
df.apps.appPackage.alias("apps_Package"), \
df.apps.Ratings.date.alias("apps_Ratings_date")) \
.show()
The code could obviously be shortened with an f-string.
var df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df");
spark.sql("select apps.Ratings from df where exists(apps, x -> x.appName like '%app_name%')").show()