Spark Scala - Split Array of Structs into Dataframe Columns - json

I have a nested source json file that contains an array of structs. The number of structs varies greatly from row to row and I would like to use Spark (scala) to dynamically create new dataframe columns from the key/values of the struct where the key is the column name and the value is the column value.
Example minified JSON record
{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}
dataframe schema
scala> val df = spark.read.json("file:///tmp/nested_test.json")
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
What's been done so far
df.select(
  $"key1.key2.key3".as("key3"),
  $"key1.key2.key4".as("key4"),
  $"key1.key2.key5.key6".as("key6"),
  $"key1.key2.key5.key7".as("key7"),
  $"key1.key2.key5.values".as("values")).
  show(truncate=false)
+----+----+----+----+----------------------------------------------------------------------------+
|key3|key4|key6|key7|values |
+----+----+----+----+----------------------------------------------------------------------------+
|AK |EU |001 |N |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
+----+----+----+----+----------------------------------------------------------------------------+
There is an array of 3 structs here, but the 3 structs need to be split into 3 separate columns dynamically (that count can vary greatly), and I am not sure how to do it.
Sample Desired output
Notice that there were 3 new columns produced for each of the array elements within the values array.
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK  |EU  |001 |N   |9.876        |1.2345       |8.675309     |
+----+----+----+----+-------------+-------------+-------------+
Reference
I believe that the desired solution is something similar to what was discussed in this SO post but with 2 main differences:
The number of columns is hardcoded to 3 in the SO post but in my circumstance, the number of array elements is unknown
The column names need to be driven by the name field and the column values by the value field.

You could do it this way:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val sac = new SparkContext("local[*]", "first Program")
val sqlc = new SQLContext(sac)
import sqlc.implicits._
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df1 = sqlc.read.json(Seq(json).toDS())
val df2 = df1.select(
  $"key1.key2.key3".as("key3"),
  $"key1.key2.key4".as("key4"),
  $"key1.key2.key5.key6".as("key6"),
  $"key1.key2.key5.key7".as("key7"),
  $"key1.key2.key5.values".as("values")
)
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
val finalDFColumns = df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null))).columns
val finalDF = df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
The resulting final output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
Hope I got your question right!
----------- EDIT with Explanation----------
This block gets the number of columns to be created for the array structure.
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
finalDFColumns holds the column names of a DataFrame built with all the expected output columns, initialized to null values.
The block below returns the distinct columns that need to be created from the array structure.
df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect
The block below combines those new columns with the other columns in df2, initialized with empty/null values.
foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
Combining these two blocks and printing the output, you will get:
+----+----+----+----+------+-------------+-------------+-------------+
|key3|key4|key6|key7|values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+------+-------------+-------------+-------------+
+----+----+----+----+------+-------------+-------------+-------------+
Now we have the structure ready; we need the values for the corresponding columns. The block below gets us the values:
df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
This results in the following:
+----+----+----+----+--------------------+---------------+---------------+---------------+
|key3|key4|key6|key7| values|values[0][name]|values[1][name]|values[2][name]|
+----+----+----+----+--------------------+---------------+---------------+---------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+---------------+---------------+---------------+
Now we need to rename the columns to the names we derived in the first block above, so we zip the two column lists together and use foldLeft to rename the output columns:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
This results in the below structure:
+----+----+----+----+--------------------+-------------+-------------+-------------+
|key3|key4|key6|key7| values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+--------------------+-------------+-------------+-------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+-------------+-------------+-------------+
We are almost there. We now just need to remove the unwanted values column like this:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
This produces the expected output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
I'm not sure if I explained it clearly, but if you break the statements above apart and print the intermediate results, you will see how we arrive at the output. Explanations and examples of the individual functions used in this logic are easy to find online.

I found that this approach, using explode and pivot, performed much better and was easier to understand:
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df = spark.read.json(Seq(json).toDS())
// schema
df.printSchema
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
// create final df
val finalDf = df.
  select(
    $"key1.key2.key3".as("key3"),
    $"key1.key2.key4".as("key4"),
    $"key1.key2.key5.key6".as("key6"),
    $"key1.key2.key5.key7".as("key7"),
    explode($"key1.key2.key5.values").as("values")
  ).
  groupBy(
    $"key3", $"key4", $"key6", $"key7"
  ).
  pivot("values.name").
  agg(min("values.value"))
// result
finalDf.show
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
| AK| EU| 001| N| 9.876| 1.2345| 8.675309|
+----+----+----+----+-------------+-------------+-------------+
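One optional tweak, not part of the original answer: if the list of value names is known up front (or collected in a quick preliminary pass), it can be handed to pivot so Spark can skip the extra job it otherwise runs to discover the distinct pivot values. A minimal sketch under that assumption, reusing the df defined above; the hard-coded valueNames list is an assumption based on the sample record:
import org.apache.spark.sql.functions.{explode, min}

// Assumed pivot values for this sample; in a real job they could be collected
// from the data first and then passed to pivot.
val valueNames = Seq("valuesColumn1", "valuesColumn2", "valuesColumn3")

val finalDfExplicit = df.
  select(
    $"key1.key2.key3".as("key3"),
    $"key1.key2.key4".as("key4"),
    $"key1.key2.key5.key6".as("key6"),
    $"key1.key2.key5.key7".as("key7"),
    explode($"key1.key2.key5.values").as("values")
  ).
  groupBy($"key3", $"key4", $"key6", $"key7").
  pivot("values.name", valueNames).
  agg(min("values.value"))

finalDfExplicit.show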

Related

How to read a string value in JSON array struct?

This is my code:
df_05_body = spark.sql("""
select
gtin
, principalBody.constituents
from
v_df_04""")
df_05_body.createOrReplaceTempView("v_df_05_body")
df_05_body.printSchema()
This is the schema:
root
|-- gtin: array (nullable = true)
| |-- element: string (containsNull = true)
|-- constituents: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- constituentCategory: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | | |-- valueRange: string (nullable = true)
How do I change the principalBody.constituents part of the SQL to read the fields constituentCategory.value and constituentCategory.valueRange?
The column constituents is an array of arrays of structs. If your intent is to get a flat structure then you'll need to flatten the nested arrays, then explode:
df_05_body = spark.sql("""
WITH
v_df_04_exploded AS (
SELECT
gtin,
explode(flatten(principalBody.constituents)) AS constituent
FROM
v_df_04 )
SELECT
gtin,
constituent.constituentCategory.value,
constituent.constituentCategory.valueRange
FROM
v_df_04_exploded
""")
Or simply use inline after flatten, like this:
df_05_body = spark.sql("""
  SELECT
    gtin,
    inline(flatten(principalBody.constituents))
  FROM
    v_df_04
""")

How to convert the dataframe column type from string to (array and struct) in spark

I have a Dataframe with the following schema, where 'name' is a string type and the value is complex JSON with arrays and structs.
With the string datatype I am not able to parse the data and write it into rows, so I am trying to convert the datatype and apply explode to parse the data.
Current:
root
|-- id: string (nullable = true)
|-- partitionNo: string (nullable = true)
|-- name: string (nullable = true)
After conversion:
Expected:
root
|-- id: string (nullable = true)
|-- partitionNo: string (nullable = true)
|-- name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- extension: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- url: string (nullable = true)
| | | | |-- valueMetadata: struct (nullable = true)
| | | | | |-- modifiedDateTime: string (nullable = true)
| | | | | |-- code: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- firstName: array (nullable = true)
| | | |-- element: string (containsNull = true)
You can use from_json, but you need to provide a schema, which can be inferred automatically with a bit of spaghetti code, because when from_json takes its schema as a column, that column must be a literal (lit):
val df2 = df.withColumn(
  "name",
  from_json(
    $"name",
    // the lines below generate the schema
    lit(
      df.select(
        schema_of_json(
          lit(
            df.select($"name").head()(0)
          )
        )
      ).head()(0)
    )
    // end of schema generation
  )
)
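A possible follow-up, not part of the original answer: once name has been converted to an array of structs, the explode the question mentions becomes straightforward. A minimal sketch, assuming the expected schema shown in the question:
import org.apache.spark.sql.functions.explode

// Hedged sketch: after the from_json conversion above, `name` is an array of structs,
// so it can be exploded into one row per element and its fields selected directly.
val dfExploded = df2
  .withColumn("name", explode($"name"))
  .select($"id", $"partitionNo", $"name.lastName", $"name.firstName")

dfExploded.show(false)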

Unable to fetch Json Column using sparkDataframe:org.apache.spark.sql.AnalysisException: cannot resolve 'explode;

Can someone help me with this scenario? I am reading a JSON file using Spark/Scala and then trying to access a column, but while accessing the column I am getting the error message below.
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(`b2b_bill_products_prod_details`.`amt`)'
due to data type mismatch: input to function explode should be
array or map type, not DoubleType;;
Please see the JSON schema and my code below.
root
|-- b2b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- add1: string (nullable = true)
| | |-- bill: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- amt: double (nullable = true)
| | | | |-- products: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- prod_details: struct (nullable = true)
| | | | | | | |-- amt: double (nullable = true)
I want to access the amt field (last line in the JSON schema). I am using the Spark/Scala code below:
df.withColumn("b2b_bill",explode($"b2b.bill"))
.withColumn("b2b_bill_products",explode($"b2b_bill.products"))
.withColumn("b2b_bill_products_prod_details", explode($"b2b_bill_products.prod_details"))
.withColumn("b2b_bill_products_prod_details_amt",explode($"b2b_bill_products_prod_details.amt"))
Your 4th explode function is applied to the amt: double column, whereas explode expects an array or map input type. That's the error being reported.
Edit
You can access the innermost amt field with the expression given below:
df.withColumn("b2b_bill",explode($"b2b.bill"))
.withColumn("b2b_bill_products",explode($"b2b_bill.products"))
.withColumn("b2b_bill_products_prod_details_amt", $"b2b_bill_products.element.prod_details.amt")

Need help Parsing strange JSON with scala

I am working on parsing JSON into a Spark dataframe in Scala. I have a nested JSON file of 50 different records of different household items. In the JSON, the equipment tag I am trying to parse looks like this:
"equipment":[{"tv":[""]}]
Because of this, the item name (e.g. tv in this case) becomes a column name rather than a value.
Ideally this tag would look like:
"equipment":["tv"]
Is there a way to parse this type of JSON tag/content?
Because of this, the dataframe schema comes out as:
|-- equipment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ac: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- tv: array (nullable = true)
| | | |-- element: string (containsNull = true)
You can see above that ac and tv become column headers; instead, I need them shown as values. The dataframe should look like:
+----------+
|equipment |
+----------+
|tv |
|ac |
+----------+
A simple explode function would normally do the trick, but looking at your schema, you need two explode calls:
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
newdf.withColumn("equipment", explode(array($"equipment.*"))).show(false)
With these steps you should have the desired result as in the question.
Edited
From your comments it seems that you are trying to explode the field names, not the values, so the following code should work:
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)
Here's the complete code I am testing with
val data = Seq("""{"equipment":[{"tv":[""],"ac":[""]}]}""")
val dataframe = sqlContext.read.json(sc.parallelize(data))
dataframe.printSchema()
val newdf = dataframe.withColumn("equipment", explode($"equipment"))
sc.parallelize(newdf.select("equipment.*").schema.fieldNames.toSeq).toDF("equipment").show(false)
The printed schema matches yours:
root
|-- equipment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ac: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- tv: array (nullable = true)
| | | |-- element: string (containsNull = true)
And the result I get matches your expected result:
+---------+
|equipment|
+---------+
|ac |
|tv |
+---------+
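As an aside, not from the original answer: if all you need are the field names, they can also be read straight from the dataframe schema, without exploding any rows. A minimal sketch, assuming the same dataframe and the implicits already in scope above:
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hedged sketch: pull the equipment element's field names directly from the schema.
val equipmentNames = dataframe.schema("equipment").dataType match {
  case ArrayType(st: StructType, _) => st.fieldNames.toSeq
  case _                            => Seq.empty[String]
}

equipmentNames.toDF("equipment").show(false)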

Spark SQL DataFrame pretty print

I'm not very good with Scala (I'm more of an R addict). I wish to display the WrappedArray elements' content (see sqlDf.show() below) in two rows using Scala in spark-shell. I've tried the explode() function but couldn't get further...
scala> val sqlDf = spark.sql("select t.articles.donneesComptablesArticle.taxes from dau_temp t")
sqlDf: org.apache.spark.sql.DataFrame = [taxes: array<array<struct<baseImposition:bigint,codeCommunautaire:string,codeNatureTaxe:string,codeTaxe:string,droitCautionnable:boolean,droitPercu:boolean,imputationCreditCautionne:boolean,montantLiquidation:bigint,quotite:double,statutAi2:boolean,statutDeLiquidation:string,statutRessourcesPropres:boolean,typeTaxe:string>>>]
scala> sqlDf.show
16/12/21 15:13:21 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+--------------------+
| taxes|
+--------------------+
|[WrappedArray([12...|
+--------------------+
scala> sqlDf.printSchema
root
|-- taxes: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- baseImposition: long (nullable = true)
| | | |-- codeCommunautaire: string (nullable = true)
| | | |-- codeNatureTaxe: string (nullable = true)
| | | |-- codeTaxe: string (nullable = true)
| | | |-- droitCautionnable: boolean (nullable = true)
| | | |-- droitPercu: boolean (nullable = true)
| | | |-- imputationCreditCautionne: boolean (nullable = true)
| | | |-- montantLiquidation: long (nullable = true)
| | | |-- quotite: double (nullable = true)
| | | |-- statutAi2: boolean (nullable = true)
| | | |-- statutDeLiquidation: string (nullable = true)
| | | |-- statutRessourcesPropres: boolean (nullable = true)
| | | |-- typeTaxe: string (nullable = true)
scala> val sqlDfTaxes = sqlDf.select(explode(sqlDf("taxes")))
sqlDfTaxes: org.apache.spark.sql.DataFrame = [col: array<struct<baseImposition:bigint,codeCommunautaire:string,codeNatureTaxe:string,codeTaxe:string,droitCautionnable:boolean,droitPercu:boolean,imputationCreditCautionne:boolean,montantLiquidation:bigint,quotite:double,statutAi2:boolean,statutDeLiquidation:string,statutRessourcesPropres:boolean,typeTaxe:string>>]
scala> sqlDfTaxes.show()
16/12/21 15:22:28 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+--------------------+
| col|
+--------------------+
|[[12564,B00,TVA,A...|
+--------------------+
The "readable" content looks like this (THIS IS MY GOAL: a classic row x columns structure display with headers):
codeTaxe codeCommunautaire baseImposition quotite montantLiquidation statutDeLiquidation
A445 B00 12564 20.0 2513 C
U165 A00 12000 4.7 564 C
codeNatureTaxe typeTaxe statutRessourcesPropres statutAi2 imputationCreditCautionne
TVA ADVAL FALSE TRUE FALSE
DD ADVAL TRUE FALSE TRUE
droitCautionnable droitPercu
FALSE TRUE
FALSE TRUE
and the class of each row is (found it using R package sparklyr):
<jobj[100]>
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
[12564,B00,TVA,A445,false,true,false,2513,20.0,true,C,false,ADVAL]
[[1]][[1]][[2]]
<jobj[101]>
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
[12000,A00,DD,U165,false,true,true,564,4.7,false,C,true,ADVAL]
You can explode on each column:
val flattenedtaxes = sqlDf.withColumn("codeCommunautaire", org.apache.spark.sql.functions.explode($"taxes.codeCommunautaire"))
After this, flattenedtaxes will have 2 columns: taxes (all the columns as-is) and the new column codeCommunautaire.
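Not part of the original answer, but for the classic rows-by-columns display the question asks for, a minimal sketch that explodes both array levels and then promotes the struct fields to top-level columns, assuming the sqlDf from the question:
import org.apache.spark.sql.functions.explode

// Hedged sketch: taxes is an array of arrays of structs, so explode twice,
// then expand every struct field into its own column.
val taxesFlat = sqlDf
  .select(explode($"taxes").as("taxGroup")) // outer array -> one array<struct> per row
  .select(explode($"taxGroup").as("tax"))   // inner array -> one struct per row
  .select("tax.*")                          // struct fields become columns

taxesFlat.show(false)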