Parse and write JSON Lines format file - json

I have a file in JSON Lines format with the following content:
[1, "James", 21, "M", "2016-04-07 10:25:09"]
[2, "Liz", 25, "F", "2017-05-07 20:25:09"]
...
Each line is a JSON array string, and the field types are: integer, string, integer, string, string. How can I convert it to a DataFrame with the following schema?
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
Conversely, if I have a DataFrame with the above schema, how can I generate a file in the above JSON Lines format?

Assuming your file does not have a header line, this is one way to create a DataFrame from it, though I'd expect there to be a better option.
from pyspark.sql import functions as F

df = spark.read.text("file_jsonlines")
c = F.split(F.regexp_extract('value', r'\[(.*)\]', 1), ',')
df = df.select(
    c[0].cast('int').alias('id'),
    c[1].alias('name'),
    c[2].cast('int').alias('age'),
    c[3].alias('gender'),
    c[4].alias('time'),
)
+---+--------+---+------+----------------------+
|id |name |age|gender|time |
+---+--------+---+------+----------------------+
|1 | "James"|21 | "M" | "2016-04-07 10:25:09"|
|2 | "Liz" |25 | "F" | "2017-05-07 20:25:09"|
+---+--------+---+------+----------------------+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
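For the reverse direction, each row just needs to be serialized as one JSON array per line. In Spark that could plausibly be expressed with to_json (not shown here); the underlying round trip is plain JSON Lines handling, sketched below with Python's standard json module. The row values are copied from the sample file; everything else is illustrative:

```python
import json

# Rows matching the schema id, name, age, gender, time
rows = [
    (1, "James", 21, "M", "2016-04-07 10:25:09"),
    (2, "Liz", 25, "F", "2017-05-07 20:25:09"),
]

# Write: one JSON array per line, preserving integer vs. string types
lines = [json.dumps(list(r)) for r in rows]

# Read back: json.loads restores the typed values of each line
parsed = [json.loads(line) for line in lines]
```

Joining `lines` with newlines and writing the result reproduces the original file format exactly, including the unquoted integers.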

Related

Extract DataFrame from nested, tagged array in Spark

I'm using Spark to read JSON documents of the following form:
{
  "items": [
    {"type": "foo", "value": 1},
    {"type": "bar", "value": 2}
  ]
}
That is, the array items are tagged by the "type" column.
Given that I know the vocabulary of "type" (i.e. {foo, bar}), how do I get a dataframe out like so:
root
|-- bar: integer (nullable = true)
|-- foo: integer (nullable = true)
You can manually curate the schema as below:
>>> df2 = df.selectExpr("array(struct(items[0].value as foo, items[1].value as bar)) as items")
>>> df2.printSchema()
root
|-- items: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- foo: long (nullable = true)
| | |-- bar: long (nullable = true)
Or a slightly more general approach using filter:
>>> df2 = df.selectExpr("array(struct(filter(items, x -> x.type = 'foo')[0].value as foo, filter(items, x -> x.type = 'bar')[0].value as bar)) as items")
>>> df2.printSchema()
root
|-- items: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- foo: long (nullable = true)
| | |-- bar: long (nullable = true)
Or using pivot:
>>> from pyspark.sql.functions import expr, first
>>> df2 = df.select(expr("inline_outer(items)")).groupBy().pivot("type").agg(
...     first("value")
... )
>>> df2.printSchema()
root
|-- bar: integer (nullable = true)
|-- foo: integer (nullable = true)
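All three variants amount to the same reshaping: turning the tagged array into a mapping from each "type" to its "value". Stripped of Spark, that is a one-line dict comprehension; a minimal pure-Python sketch, using the sample data from the question:

```python
# The tagged array from the question's "items" field
items = [{"type": "foo", "value": 1}, {"type": "bar", "value": 2}]

# Pivot: each distinct "type" becomes its own key/column
pivoted = {item["type"]: item["value"] for item in items}
```

This is essentially what the pivot/first combination computes per row, assuming each type occurs at most once per document.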

Merge Dataframes With Differents Schemas - Scala Spark

I'm working on transforming a JSON into a DataFrame. As a first step I create an array of DataFrames and then union them. But I have a problem performing the union when the JSONs have different schemas.
I can do it if the JSONs have the same schema, as you can see in this other question: Parse JSON root in a column using Spark-Scala
I'm working with the following data:
val exampleJsonDifferentSchema = spark.createDataset(
  """
  {"ITEM1512":
      {"name":"Yin",
       "address":{"city":"Columbus",
                  "state":"Ohio"},
       "age":28 },
   "ITEM1518":
      {"name":"Yang",
       "address":{"city":"Working",
                  "state":"Marc"}
      },
   "ITEM1458":
      {"name":"Yossup",
       "address":{"city":"Macoss",
                  "state":"Microsoft"},
       "age":28
      }
  }""" :: Nil)
As you can see, the difference is that one of the entries (ITEM1518) doesn't have age.
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
My current solution is the following code, where I make an array of DataFrames:
val columns: Array[String] = itemsExample.columns
var arrayOfExampleDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = itemsExample.select(lit(col_name).as("Item"), col(col_name).as("Value"))
  arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}

val jsonDF = arrayOfExampleDFs.reduce(_ union _)
But since the JSONs have different schemas, the union inside the reduce fails because the DataFrames need to have the same schema. In fact, I get the following error:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
I'm trying to do something similar to what I found in this question: How to perform union on two DataFrames with different amounts of columns in spark?
Specifically that part:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}
But I can't build the column sets because I need to collect the columns dynamically, both the full set and each individual DataFrame's. I can only do something like this:
for (i <- 0 until arrayOfExampleDFs.length - 1) {
  val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
  val cols2 = arrayOfExampleDFs(i + 1).select("Value").columns.toSet
  val total = cols1 ++ cols2
  arrayOfExampleDFs(i).select("Value").printSchema()
  print(total)
}
So, what would a function that performs this union dynamically look like?
Update: expected output
In this case, this is the expected DataFrame and schema:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
Here is one possible solution which creates a common schema for all the dataframes by adding the age column when it is not found:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

....

for (col_name <- columns) {
  val currentDf = itemsExampleDiff.select(col(col_name))

  // try to identify whether the age field is present
  val hasAge = currentDf.schema.fields(0)
                        .dataType
                        .asInstanceOf[StructType]
                        .fields
                        .contains(StructField("age", LongType, true))

  val valueCol = hasAge match {
    // if not, construct a new Value column with a null age
    case false => struct(
      col(s"${col_name}.address"),
      lit(null).cast("bigint").as("age"),
      col(s"${col_name}.name")
    )
    case true => col(col_name)
  }

  arrayOfExampleDFs = arrayOfExampleDFs :+ currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}

val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
Analysis: probably the most demanding part is finding out whether age is present or not. For the lookup we use the df.schema.fields property, which allows us to dig into the internal schema of each column.
When age is not found we regenerate the column by using a struct:
struct(
  col(s"${col_name}.address"),
  lit(null).cast("bigint").as("age"),
  col(s"${col_name}.name")
)
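The struct trick above is a special case of a general pattern: align every record to the union of all keys, filling the missing ones with null. A pure-Python sketch of that alignment logic, with the records abbreviated from the question's JSON for illustration:

```python
# Abbreviated records: ITEM1518 is missing "age", as in the question
records = {
    "ITEM1512": {"name": "Yin", "age": 28},
    "ITEM1518": {"name": "Yang"},
}

# Union of all keys seen across records
all_keys = sorted({k for fields in records.values() for k in fields})

# Align each record to the common key set, filling gaps with None (null)
aligned = {item: {k: fields.get(k) for k in all_keys}
           for item, fields in records.items()}
```

Once every record carries the same keys in the same order, the Spark union goes through, which is exactly what the null-age struct achieves.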

Flattening a json file using Spark and Scala

I have a json file like this:
{
  "Item Version" : 1.0,
  "Item Creation Time" : "2019-04-14 14:15:09",
  "Trade Dictionary" : {
    "Country" : "India",
    "TradeNumber" : "1",
    "action" : {
      "Action1" : false,
      "Action2" : true,
      "Action3" : false
    },
    "Value" : "XXXXXXXXXXXXXXX",
    "TradeRegion" : "Global"
  },
  "Prod" : {
    "Type" : "Driver",
    "Product Dic" : { },
    "FX Legs" : [ {
      "Spot Date" : "2019-04-16",
      "Value" : true
    } ]
  },
  "Payments" : {
    "Payment Details" : [ {
      "Payment Date" : "2019-04-11",
      "Payment Type" : "Rej"
    } ]
  }
}
I need a table in below format:
Version|Item Creation Time|Country|TradeNumber|Action1|Action2|Action3|Value |TradeRegion|Type|Product Dic|Spot Date |Value|Payment Date|Payment Type |
1 |2019-04-14 14:15 | India| 1 | false| true | false |xxxxxx|Global |Driver|{} |2019-04-16 |True |2019-11-14 |Rej
So it should just iterate over each key-value pair, using the key as the column name and its values as the table values.
My current code:
val data2 = data.withColumn("vars", explode(array($"Product")))
  .withColumn("subs", explode($"vars.FX Legs"))
  .select($"vars.*", $"subs.*")
The problem here is that I have to provide the column names myself. Is there any way to make this more generic?
Since you have both array and struct columns mixed together at multiple levels, it is not that simple to create a general solution. The main problem is that the explode function must be applied to every array column.
The simplest solution I can come up with uses recursion to check for any struct or array columns. If there are any then those will be flattened and then we check again (after flattening there will be additional columns which can be arrays or structs, hence the complexity). The flattenStruct part is from here.
Code:
def flattenStruct(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStruct(st, colName)
      case _ => Array(col(colName))
    }
  })
}

def flattenSchema(df: DataFrame): DataFrame = {
  val structExists = df.schema.fields.exists(_.dataType.typeName == "struct")
  val arrayCols = df.schema.fields.filter(_.dataType.typeName == "array").map(_.name)
  if (structExists) {
    flattenSchema(df.select(flattenStruct(df.schema): _*))
  } else if (arrayCols.nonEmpty) {
    val newDF = arrayCols.foldLeft(df) {
      (tempDf, colName) => tempDf.withColumn(colName, explode(col(colName)))
    }
    flattenSchema(newDF)
  } else {
    df
  }
}
Running the above method on the input dataframe:
flattenSchema(data)
will give a dataframe with the following schema:
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payment Date: string (nullable = true)
|-- Payment Type: string (nullable = true)
|-- Spot Date: string (nullable = true)
|-- Value: boolean (nullable = true)
|-- Product Dic: string (nullable = true)
|-- Type: string (nullable = true)
|-- Country: string (nullable = true)
|-- TradeNumber: string (nullable = true)
|-- TradeRegion: string (nullable = true)
|-- Value: string (nullable = true)
|-- Action1: boolean (nullable = true)
|-- Action2: boolean (nullable = true)
|-- Action3: boolean (nullable = true)
To keep the prefix of the struct columns in the name of the new columns, you only need to adjust the last case in the flattenStruct function:
case _ => Array(col(colName).as(colName.replace(".", "_")))
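The recursion above alternates between flattening structs and exploding arrays until neither remains. The same control flow can be sketched in plain Python over nested dicts and lists, assuming (as in the question's data) that list elements are themselves dicts; each list element contributes its own output row:

```python
def flatten_rows(obj, prefix=""):
    """Flatten a nested dict into flat rows.

    Nested dict keys are joined with "_"; lists are "exploded" so that
    every element produces a separate row (cross product across lists).
    Assumes list elements are dicts, as in the question's sample data.
    """
    rows = [{}]
    for key, val in obj.items():
        name = prefix + key
        if isinstance(val, dict):
            subs = flatten_rows(val, name + "_")
        elif isinstance(val, list):
            # explode: each element yields its own partial row(s)
            subs = [s for el in val for s in flatten_rows(el, name + "_")]
        else:
            subs = [{name: val}]
        # combine every accumulated row with every new partial row
        rows = [dict(r, **s) for r in rows for s in subs]
    return rows
```

For example, `flatten_rows({"a": 1, "d": [{"e": 3}, {"e": 4}]})` yields two rows, one per list element, mirroring what explode does to a Spark array column.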
Use the explode function to flatten DataFrames with arrays. Here is an example:
val df = spark.read.json(Seq(json).toDS.rdd)
df.show(10, false)
df.printSchema
df: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 3 more fields]
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|Item Creation Time |Item Version|Payments |Prod |Trade Dictionary |
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|2019-04-14 14:15:09|1.0 |[WrappedArray([2019-04-11,Rej])]|[WrappedArray([2019-04-16,true]),Driver]|[India,1,Global,XXXXXXXXXXXXXXX,[false,true,false]]|
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payments: struct (nullable = true)
| |-- Payment Details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Payment Date: string (nullable = true)
| | | |-- Payment Type: string (nullable = true)
|-- Prod: struct (nullable = true)
| |-- FX Legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Spot Date: string (nullable = true)
| | | |-- Value: boolean (nullable = true)
| |-- Type: string (nullable = true)
|-- Trade Dictionary: struct (nullable = true)
| |-- Country: string (nullable = true)
| |-- TradeNumber: string (nullable = true)
| |-- TradeRegion: string (nullable = true)
| |-- Value: string (nullable = true)
| |-- action: struct (nullable = true)
| | |-- Action1: boolean (nullable = true)
| | |-- Action2: boolean (nullable = true)
| | |-- Action3: boolean (nullable = true)
val flat = df
  .select($"Item Creation Time", $"Item Version", explode($"Payments.Payment Details") as "row")
  .select($"Item Creation Time", $"Item Version", $"row.*")

flat.show
flat: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 2 more fields]
+-------------------+------------+------------+------------+
| Item Creation Time|Item Version|Payment Date|Payment Type|
+-------------------+------------+------------+------------+
|2019-04-14 14:15:09| 1.0| 2019-04-11| Rej|
+-------------------+------------+------------+------------+
This can also be achieved very easily using a library named JFlat - https://github.com/opendevl/Json2Flat.
String str = new String(Files.readAllBytes(Paths.get("/path/to/source/file.json")));
JFlat flatMe = new JFlat(str);

//get the 2D representation of the JSON document
List<Object[]> json2csv = flatMe.json2Sheet().getJsonAsSheet();

//write the 2D representation in csv format
flatMe.write2csv("/path/to/destination/file.csv");

Spark Scala nested JSON stored as structure table

I have a huge number of nested JSONs with more than 200 keys that I want to convert and store in a structured table.
|-- ip_address: string (nullable = true)
|-- xs_latitude: double (nullable = true)
|-- Applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b_als_o_isehp: string (nullable = true)
| | |-- b_als_p_isehp: string (nullable = true)
| | |-- b_als_s_isehp: string (nullable = true)
| | |-- l_als_o_eventid: string (nullable = true)
....
Reading the JSON, each ip_address has one Applications array:
{"ip_address": 1512199720,"Applications": [{"s_pd": -1,"s_path": "NA", "p_pd": "temp0"}, {"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp1"},{"s_pd": -1,"s_path": "root/hdfs", "p_pd": "temp2"}]}
val data = spark.read.json("file:///root/users/data/s_json.json")
var appDf = data.withColumn("data",explode($"Applications")).select($"Applications.s_pd", $"Applications.s_path", $"Applications.p_pd", $"ip_address")
appDf.printSchema
/// gives
root
|-- s_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- s_path: array (nullable = true)
| |-- element: string (containsNull = true)
|-- p_pd: array (nullable = true)
| |-- element: string (containsNull = true)
|-- ip_address: string (nullable = true)
Each record in the DataFrame contains an array with repeated values. How do I get the records into table format?
Mistake
Your mistake is that you are selecting the nested fields from the original Applications array column instead of from the exploded one.
Solution
You have to select from the exploded column, which is data:
var appDf = data.withColumn("data", explode($"Applications"))
  .select($"ip_address", $"data.s_pd", $"data.s_path", $"data.p_pd")
and you should get
+----------+----+---------+-----+
|ip_address|s_pd|s_path |p_pd |
+----------+----+---------+-----+
|1512199720|-1 |NA |temp0|
|1512199720|-1 |root/hdfs|temp1|
|1512199720|-1 |root/hdfs|temp2|
+----------+----+---------+-----+
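The explode-then-select pattern boils down to pairing the top-level fields with each array element in turn. In plain Python the same reshaping looks like this (the record is copied from the question's JSON; everything else is illustrative):

```python
record = {
    "ip_address": 1512199720,
    "Applications": [
        {"s_pd": -1, "s_path": "NA", "p_pd": "temp0"},
        {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp1"},
        {"s_pd": -1, "s_path": "root/hdfs", "p_pd": "temp2"},
    ],
}

# "Explode": one output row per array element, repeating ip_address
rows = [{"ip_address": record["ip_address"], **app}
        for app in record["Applications"]]
```

Each element of `rows` corresponds to one line of the table above, which is what selecting from the exploded column produces in Spark.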
I hope the answer is helpful

Spark SQL dataframe manipulation

I am new to Scala and Spark and I am struggling to solve a rather simple problem. Consider the following CSV file:
val df = spark.read.option("header", true).option("inferSchema", true).csv("/home/USER/test_file.csv")
df.show(false)
df: org.apache.spark.sql.DataFrame = [id: int, measurement: int ... 1 more field]
+---+-----------+----------------------------------------------------+
|id |measurement|location |
+---+-----------+----------------------------------------------------+
|1 |235 |{type: 'point', lon_lat: [-122.03476,37.362517]} |
|2 |45 |{type: 'point', lon_lat: [115.8614000, -31.9522400]}|
+---+-----------+----------------------------------------------------+
df.printSchema
|-- id: integer (nullable = true)
|-- measurement: integer (nullable = true)
|-- location: string (nullable = true)
I would like to create a dataset with the following schema (data type is not so important as I want to save it to csv later):
dataset: org.apache.spark.sql.Dataset[Features] = [id: string, measurement: string, typeattr: string, lon: string, lat: string]
|-- id: string (nullable = true)
|-- measurement: string (nullable = true)
|-- typeattr: string (nullable = true)
|-- lon: string (nullable = true)
|-- lat: string (nullable = true)
My code so far looks like this:
case class Features(id: String, measurement: String)

var dataset = df.map(s => {
  Features(s(0).toString, s(1).toString)
})
How can I parse the json string (splitting the "lon_lat" key) into the dataset?
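Note that the location value is not strict JSON (unquoted keys, single quotes), so a plain JSON parser won't accept it; one option is regular-expression extraction, which Spark exposes as regexp_extract. A pure-Python sketch of a pattern that could work, applied to the first row's location string (the pattern itself is an illustrative assumption, not from the question):

```python
import re

# Location string copied from the first row of the sample data
loc = "{type: 'point', lon_lat: [-122.03476,37.362517]}"

# Capture the type plus the two coordinates, all as strings
pattern = r"type:\s*'([^']*)'.*lon_lat:\s*\[([^,]+),\s*([^\]]+)\]"
typeattr, lon, lat = re.search(pattern, loc).groups()
```

Applying the same three capture groups via regexp_extract (one call per group) would yield the typeattr, lon, and lat string columns of the desired schema.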