Merge DataFrames With Different Schemas - Scala Spark - JSON

I'm working on transforming a JSON into a DataFrame. In the first step I create an array of DataFrames and after that I perform a union. But I have a problem doing the union when the JSON has different schemas.
I can do it if the JSON has the same schema, as you can see in this other question: Parse JSON root in a column using Spark-Scala
I'm working with the following data:
val exampleJsonDifferentSchema = spark.createDataset(
  """
  {"ITEM1512":
    {"name":"Yin",
     "address":{"city":"Columbus",
                "state":"Ohio"},
     "age":28 },
   "ITEM1518":
    {"name":"Yang",
     "address":{"city":"Working",
                "state":"Marc"}
    },
   "ITEM1458":
    {"name":"Yossup",
     "address":{"city":"Macoss",
                "state":"Microsoft"},
     "age":28
    }
  }""" :: Nil)
As you can see, the difference is that one item doesn't have age.
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
My current solution is the following code, where I make an array of DataFrames:
val columns: Array[String] = itemsExampleDiff.columns
var arrayOfExampleDFs: Array[DataFrame] = Array()
for (col_name <- columns) {
  val temp = itemsExampleDiff.select(lit(col_name).as("Item"), col(col_name).as("Value"))
  arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
But since the JSON has different schemas, the union fails when I reduce, because the DataFrames need to have the same schema. In fact, I get the following error:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
I'm trying to do something similar to what I've found in this question: How to perform union on two DataFrames with different amounts of columns in spark?
Specifically that part:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}
But I can't build the column sets, because I need to capture the columns dynamically, both the total set and the per-DataFrame ones. I can only do something like this:
for (i <- 0 until arrayOfExampleDFs.length - 1) {
  val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
  val cols2 = arrayOfExampleDFs(i + 1).select("Value").columns.toSet
  val total = cols1 ++ cols2
  arrayOfExampleDFs(i).select("Value").printSchema()
  print(total)
}
So, how could I write a function that performs this union dynamically?
Update: expected output
In this case, the expected DataFrame and schema are:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)

Here is one possible solution, which creates a common schema for all the DataFrames by adding the age column when it is not found:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
....
for (col_name <- columns) {
  val currentDf = itemsExampleDiff.select(col(col_name))

  // try to identify if age field is present
  val hasAge = currentDf.schema.fields(0)
    .dataType
    .asInstanceOf[StructType]
    .fields
    .contains(StructField("age", LongType, true))

  val valueCol = hasAge match {
    // if not construct a new value column
    case false => struct(
      col(s"${col_name}.address"),
      lit(null).cast("bigint").as("age"),
      col(s"${col_name}.name")
    )
    case true => col(col_name)
  }

  arrayOfExampleDFs = arrayOfExampleDFs :+ currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
Analysis: probably the most demanding part is finding out whether age is present or not. For the lookup we use the df.schema.fields property, which allows us to dig into the internal schema of each column.
When age is not found we regenerate the column by using a struct:
struct(
  col(s"${col_name}.address"),
  lit(null).cast("bigint").as("age"),
  col(s"${col_name}.name")
)
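As a side note, the StructField comparison above only matches when the field's type, nullable flag and metadata are all identical. A slightly looser check (a sketch, not part of the original answer) could compare field names only:
val hasAge = currentDf.schema.fields(0)
  .dataType
  .asInstanceOf[StructType]
  .fieldNames
  .contains("age")   // true as soon as a field named "age" exists, regardless of nullability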

Related

Parse and write JSON Lines format file

I have a file in JSON Lines format with the following content:
[1, "James", 21, "M", "2016-04-07 10:25:09"]
[2, "Liz", 25, "F", "2017-05-07 20:25:09"]
...
Each line is a JSON array string, and the types of the fields are: integer, string, integer, string, string. How can I convert it to a DataFrame with the following schema?
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
On the contrary, if I have a DataFrame with the above schema, how to generate a file like the above JSON Lines format?
Assuming your file does not have a header line, this is one way to create a df from your file. But I'd expect there to be a better option.
from pyspark.sql import functions as F

df = spark.read.text("file_jsonlines")
c = F.split(F.regexp_extract('value', r'\[(.*)\]', 1), ',')
df = df.select(
    c[0].cast('int').alias('id'),
    c[1].alias('name'),
    c[2].cast('int').alias('age'),
    c[3].alias('gender'),
    c[4].alias('time'),
)
+---+--------+---+------+----------------------+
|id |name |age|gender|time |
+---+--------+---+------+----------------------+
|1 | "James"|21 | "M" | "2016-04-07 10:25:09"|
|2 | "Liz" |25 | "F" | "2017-05-07 20:25:09"|
+---+--------+---+------+----------------------+
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- gender: string (nullable = true)
|-- time: string (nullable = true)
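The answer above only covers the reading direction. For writing the same JSON Lines format back out, one possible sketch in Scala (assuming a Scala DataFrame df with the id/name/age/gender/time schema shown above, and that the string fields contain nothing that needs JSON escaping) is to format each row by hand and write it as plain text:
import org.apache.spark.sql.functions.{col, format_string}

// build one JSON-array string per row: %d for the integer columns, quoted %s for the strings
val jsonLines = df.select(
  format_string("""[%d, "%s", %d, "%s", "%s"]""",
    col("id"), col("name"), col("age"), col("gender"), col("time")).as("value"))
jsonLines.write.text("output_jsonlines")   // hypothetical output path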

Add new column to dataframe with modified schema or drop nested columns of array type in Scala

json:-
{"ID": "500", "Data": [{"field2": 308, "field3": 346, "field1": 40.36582609126494, "field7": 3, "field4": 1583057346.0, "field5": -80.03243596528726, "field6": 16.0517578125, "field8": 5}, {"field2": 307, "field3": 348, "field1": 40.36591421686625, "field7": 3, "field4": 1583057347.0, "field5": -80.03259684675493, "field6": 16.234375, "field8": 5}]}
schema:-
val MySchema: StructType =
  StructType(Array(
    StructField("ID", StringType, true),
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field2", LongType, true),
        StructField("field3", LongType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true),
        StructField("field7", LongType, true),
        StructField("field8", LongType, true)
      )), true), true)))
Load json into dataframe:-
val MyDF = spark.readStream
.schema(MySchema)
.json(input)
where 'input' is a file that contains the above json.
How can I add a new column "Data_New" to the above dataframe 'MyDF' with schema as
val Data_New_Schema: StructType =
  StructType(Array(
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true)
      )), true), true)))
Please note that a huge volume of such json files will be loaded from the source, so performing an explode followed by a collect_list will crash the driver.
You can try one of the following two methods:
For Spark 2.4+, use transform:
import org.apache.spark.sql.functions._
val df_new = df.withColumn("Data_New", expr("struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6)))").cast(Data_New_Schema))
scala> df_new.printSchema
root
|-- ID: string (nullable = true)
|-- Data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- field1: double (nullable = true)
| | |-- field2: long (nullable = true)
| | |-- field3: long (nullable = true)
| | |-- field4: double (nullable = true)
| | |-- field5: double (nullable = true)
| | |-- field6: double (nullable = true)
| | |-- field7: long (nullable = true)
| | |-- field8: long (nullable = true)
|-- Data_New: struct (nullable = false)
| |-- Data: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- field1: double (nullable = true)
| | | |-- field4: double (nullable = true)
| | | |-- field5: double (nullable = true)
| | | |-- field6: double (nullable = true)
Notice that nullable = false on the top-level schema of Data_New. If you want to make it true, add the nullif function to the SQL expression: nullif(struct(transform(Data, x -> (...))), null), or, more efficiently, if(true, struct(transform(Data, x -> (...))), null).
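Spelled out, the if(true, ...) variant mentioned above would look roughly like this (a sketch reusing the expression and imports already shown, not a separate method from the answer):
val df_new_nullable = df.withColumn("Data_New",
  expr("if(true, struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6))), null)")
    .cast(Data_New_Schema))   // Data_New is now nullable = true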
Prior to Spark 2.4, use from_json + to_json:
val df_new = df.withColumn("Data_New", from_json(to_json(struct('Data)), Data_New_Schema))
Edit: Per comment, if you want Data_New to be an array of structs, just remove the struct function, for example:
val Data_New_Schema: ArrayType = ArrayType(
  StructType(Array(
    StructField("field1", DoubleType, true),
    StructField("field4", DoubleType, true),
    StructField("field5", DoubleType, true),
    StructField("field6", DoubleType, true)
  )), true)

// if you need `containsNull = true`, then cast to the above type definition
val df_new = df.withColumn("Data_New", expr("transform(Data, x -> (x.field1 as field1, x.field4 as field4, x.field5 as field5, x.field6 as field6))"))
Or
val df_new = df.withColumn("Data_New", from_json(to_json('Data), Data_New_Schema))
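If containsNull = true is really needed, the cast hinted at in the comment above could look like this (a sketch, not from the original answer):
val df_new = df.withColumn("Data_New",
  expr("transform(Data, x -> (x.field1 as field1, x.field4 as field4, x.field5 as field5, x.field6 as field6))")
    .cast(Data_New_Schema))   // Data_New_Schema is the ArrayType defined above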

Parse JSON root in a column using Spark-Scala

I have problems transforming the root of a JSON into a record in a DataFrame for an undetermined number of records.
I have a DataFrame generated from a JSON similar to the following:
val exampleJson = spark.createDataset(
  """
  {"ITEM1512":
    {"name":"Yin",
     "address":{"city":"Columbus",
                "state":"Ohio"}
    },
   "ITEM1518":
    {"name":"Yang",
     "address":{"city":"Working",
                "state":"Marc"}
    }
  }""" :: Nil)
When I read it with the following instruction:
val itemsExample = spark.read.json(exampleJson)
The generated DataFrame and schema are the following:
+-----------------------+-----------------------+
|ITEM1512 |ITEM1518 |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+
root
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
But I want to generate something like this:
+-----------------------+-----------------------+
|Item |Values |
+-----------------------+-----------------------+
|ITEM1512 |[[Columbus, Ohio], Yin]|
|ITEM1518 |[[Working, Marc], Yang]|
+-----------------------+-----------------------+
So, in order to parse this JSON data, I need to read all the columns and add each one as a record in the DataFrame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a DataFrame.
I'm trying to replicate the solution found here: How to parse the JSON data using Spark-Scala
with this code:
val columns: Array[String] = itemsExample.columns
var arrayOfDFs: Array[DataFrame] = Array()
for (col_name <- columns) {
  val temp = itemsExample.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("Item"),
      col("element.E").as("Value"))
  arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
But I face a problem: while in the example from the other question the root is an array, in my case the root is a StructType. Therefore the following exception is thrown:
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(ITEM1512)' due to data type mismatch: input to function
explode should be array or map type, not
struct<address:struct<city:string,state:string>,name:string>
You can use the stack function.
Example:
itemsExample.selectExpr("""stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)""").
show(false)
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
UPDATE:
Dynamic Stack query:
val stack = itemsExample.columns.map(x => s"'${x}',${x}").mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)
itemsExample.selectExpr(stack).show()
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+

Flattening a json file using Spark and Scala

I have a json file like this:
{
  "Item Version" : 1.0,
  "Item Creation Time" : "2019-04-14 14:15:09",
  "Trade Dictionary" : {
    "Country" : "India",
    "TradeNumber" : "1",
    "action" : {
      "Action1" : false,
      "Action2" : true,
      "Action3" : false
    },
    "Value" : "XXXXXXXXXXXXXXX",
    "TradeRegion" : "Global"
  },
  "Prod" : {
    "Type" : "Driver",
    "Product Dic" : { },
    "FX Legs" : [ {
      "Spot Date" : "2019-04-16",
      "Value" : true
    } ]
  },
  "Payments" : {
    "Payment Details" : [ {
      "Payment Date" : "2019-04-11",
      "Payment Type" : "Rej"
    } ]
  }
}
I need a table in below format:
Version|Item Creation Time|Country|TradeNumber|Action1|Action2|Action3|Value |TradeRegion|Type|Product Dic|Spot Date |Value|Payment Date|Payment Type |
1 |2019-04-14 14:15 | India| 1 | false| true | false |xxxxxx|Global |Driver|{} |2019-04-16 |True |2019-11-14 |Rej
So it will just iterate over each key-value pair, put the key as the column name, and its values as the table values.
My current code:
val data2 = data.withColumn("vars",explode(array($"Product")))
.withColumn("subs", explode($"vars.FX Legs"))
.select($"vars.*",$"subs.*")
The problem here is that I have to provide the column names myself. Is there any way to make this more generic?
Since you have both array and struct columns mixed together at multiple levels, it is not that simple to create a general solution. The main problem is that the explode function must be executed on every array column.
The simplest solution I can come up with uses recursion to check for any struct or array columns. If there are any, those will be flattened and then we check again (after flattening there will be additional columns which can be arrays or structs, hence the complexity). The flattenStruct part is from here.
Code:
def flattenStruct(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStruct(st, colName)
      case _ => Array(col(colName))
    }
  })
}

def flattenSchema(df: DataFrame): DataFrame = {
  val structExists = df.schema.fields.filter(_.dataType.typeName == "struct").size > 0
  val arrayCols = df.schema.fields.filter(_.dataType.typeName == "array").map(_.name)
  if (structExists) {
    flattenSchema(df.select(flattenStruct(df.schema): _*))
  } else if (arrayCols.size > 0) {
    val newDF = arrayCols.foldLeft(df) {
      (tempDf, colName) => tempDf.withColumn(colName, explode(col(colName)))
    }
    flattenSchema(newDF)
  } else {
    df
  }
}
Running the above method on the input dataframe:
flattenSchema(data)
will give a dataframe with the following schema:
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payment Date: string (nullable = true)
|-- Payment Type: string (nullable = true)
|-- Spot Date: string (nullable = true)
|-- Value: boolean (nullable = true)
|-- Product Dic: string (nullable = true)
|-- Type: string (nullable = true)
|-- Country: string (nullable = true)
|-- TradeNumber: string (nullable = true)
|-- TradeRegion: string (nullable = true)
|-- Value: string (nullable = true)
|-- Action1: boolean (nullable = true)
|-- Action2: boolean (nullable = true)
|-- Action3: boolean (nullable = true)
To keep the prefix of the struct columns in the name of the new columns, you only need to adjust the last case in the flattenStruct function:
case _ => Array(col(colName).as(colName.replace(".", "_")))
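For completeness, the adjusted function would then look like this (just the snippet above merged into the original flattenStruct, nothing else changed):
def flattenStruct(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStruct(st, colName)
      // keep the full path in the output column name, with "." replaced by "_"
      case _ => Array(col(colName).as(colName.replace(".", "_")))
    }
  })
}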
Use the explode function to flatten dataframes with arrays. Here is an example:
val df = spark.read.json(Seq(json).toDS.rdd)
df.show(10, false)
df.printSchema
df: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 3 more fields]
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|Item Creation Time |Item Version|Payments |Prod |Trade Dictionary |
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
|2019-04-14 14:15:09|1.0 |[WrappedArray([2019-04-11,Rej])]|[WrappedArray([2019-04-16,true]),Driver]|[India,1,Global,XXXXXXXXXXXXXXX,[false,true,false]]|
+-------------------+------------+--------------------------------+----------------------------------------+---------------------------------------------------+
root
|-- Item Creation Time: string (nullable = true)
|-- Item Version: double (nullable = true)
|-- Payments: struct (nullable = true)
| |-- Payment Details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Payment Date: string (nullable = true)
| | | |-- Payment Type: string (nullable = true)
|-- Prod: struct (nullable = true)
| |-- FX Legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Spot Date: string (nullable = true)
| | | |-- Value: boolean (nullable = true)
| |-- Type: string (nullable = true)
|-- Trade Dictionary: struct (nullable = true)
| |-- Country: string (nullable = true)
| |-- TradeNumber: string (nullable = true)
| |-- TradeRegion: string (nullable = true)
| |-- Value: string (nullable = true)
| |-- action: struct (nullable = true)
| | |-- Action1: boolean (nullable = true)
| | |-- Action2: boolean (nullable = true)
| | |-- Action3: boolean (nullable = true)
val flat = df
.select($"Item Creation Time", $"Item Version", explode($"Payments.Payment Details") as "row")
.select($"Item Creation Time", $"Item Version", $"row.*")
flat.show
flat: org.apache.spark.sql.DataFrame = [Item Creation Time: string, Item Version: double ... 2 more fields]
+-------------------+------------+------------+------------+
| Item Creation Time|Item Version|Payment Date|Payment Type|
+-------------------+------------+------------+------------+
|2019-04-14 14:15:09| 1.0| 2019-04-11| Rej|
+-------------------+------------+------------+------------+
This solution can be achieved very easily using a library named JFlat - https://github.com/opendevl/Json2Flat.
String str = new String(Files.readAllBytes(Paths.get("/path/to/source/file.json")));
JFlat flatMe = new JFlat(str);
//get the 2D representation of JSON document
List<Object[]> json2csv = flatMe.json2Sheet().getJsonAsSheet();
//write the 2D representation in csv format
flatMe.write2csv("/path/to/destination/file.csv");

Spark SQL dataframe manipulation

I am new to Scala and Spark and I am struggling to solve a rather simple problem. Consider the following csv file:
val df = spark.read.option("header", true).option("inferSchema", true).csv("/home/USER/test_file.csv")
df.show(false)
df: org.apache.spark.sql.DataFrame = [id: int, measurement: int ... 1 more field]
+---+-----------+----------------------------------------------------+
|id |measurement|location |
+---+-----------+----------------------------------------------------+
|1 |235 |{type: 'point', lon_lat: [-122.03476,37.362517]} |
|2 |45 |{type: 'point', lon_lat: [115.8614000, -31.9522400]}|
+---+-----------+----------------------------------------------------+
df.printSchema
root
|-- id: integer (nullable = true)
|-- measurement: integer (nullable = true)
|-- location: string (nullable = true)
I would like to create a dataset with the following schema (data type is not so important as I want to save it to csv later):
dataset: org.apache.spark.sql.Dataset[Features] = [id: string, measurement: string, typeattr: string, lon: string, lat: string]
|-- id: string (nullable = true)
|-- measurement: string (nullable = true)
|-- typeattr: string (nullable = true)
|-- lon: string (nullable = true)
|-- lat: string (nullable = true)
My code so far looks like this:
case class Features(id: String, measurement:String)
var dataset = df.map(s => {
Features(s(0).toString, s(1).toString)
})
How can I parse the json string (splitting the "lon_lat" key) into the dataset?
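One possible approach (a sketch, not from the original thread; the locSchema and parsed names below are my own) is to parse the non-strict JSON in the location column with from_json, passing the lenient Jackson options for single quotes and unquoted field names, and then split lon_lat into two columns:
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructType}

// hypothetical schema for the location string
val locSchema = new StructType()
  .add("type", StringType)
  .add("lon_lat", ArrayType(DoubleType))

val parsed = df
  .withColumn("loc", from_json(col("location"), locSchema,
    Map("allowSingleQuotes" -> "true", "allowUnquotedFieldNames" -> "true")))
  .select(
    col("id").cast("string"),
    col("measurement").cast("string"),
    col("loc.type").as("typeattr"),
    col("loc.lon_lat").getItem(0).cast("string").as("lon"),
    col("loc.lon_lat").getItem(1).cast("string").as("lat"))
From there, parsed.as[Features] should work (with spark.implicits._ in scope) if the Features case class is extended with the typeattr, lon and lat fields.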