Spark Scala: creating a schema based on a target JSON structure

I'm hopelessly stuck trying to generate my Spark Schema based on the JSON structure I know I want.
I have a JSON structure that looks like this:
{
  "key1": "value1",
  "key2": "value2",
  "key3": "value3",
  "key4": "value4",
  "key5": [
    "key6": "value6",
    "key7": [
      "key8": "value8",
      "key8": "value9"
    ]
  ]
}
I tried to recreate the structure by creating the following Schema in Spark 2.4.8, running in Scala:
val targetSchemaSO = StructType(
  List(
    StructField("key1", StringType, true),
    StructField("key2", StringType, true),
    StructField("key3", StringType, true),
    StructField("key4", StringType, true),
    StructField("key5", StructType(
      List(
        StructField("key6", StringType, true),
        StructField("key7", ArrayType(StructType(
          List(
            StructField("key8", StringType, true)
          ))), true)
      )), true)
  )
)
However, when trying to format each row as a Spark Row using this code:
val outputDictSO = scala.collection.mutable.LinkedHashMap[String, Any](
  "key1" -> "value1",
  "key2" -> "value2",
  "key3" -> "value3",
  "key4" -> "value4",
  "key5" -> (
    "key6" -> "value6",
    "key7" -> (
      "key8" -> "value8",
      "key8" -> "value9"
    )
  )
)
Row.fromSeq(outputDictSO.values.toSeq)
I get the following error when mapping it to the provided Schema:
Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external
type for schema of struct<key6:string,key7:array<struct<key8:string>>>
The program I'm basing this off of uses this exact Schema in PySpark and the DataFrame is created just fine; do the StructTypes work differently between PySpark and Spark Scala? What would be the correct Schema to make so that nested arrays in the Schema are possible?

You can use the from_json function: pass the JSON string column together with a schema, then call printSchema to verify the resulting structure. Note that from_json requires a schema argument; you can derive one from a single sample JSON string taken from the data, either with schema_of_json or by creating a DataFrame from that string, as sketched below.
val df1 = df.withColumn("json", from_json(col("json_string_column"), schema_of_json(sampleJsonString)))
df1.printSchema()
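For illustration, a minimal sketch of the DataFrame-based inference. The sample string and the column name json_string_column are assumptions for the sketch, not taken from the question:
// A sketch, not from the question: infer the schema once from a sample string,
// then reuse it to parse the real string column.
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// Assumed sample: one JSON string shaped like the target structure.
val sampleJsonString =
  """{"key1":"value1","key5":{"key6":"value6","key7":[{"key8":"value8"}]}}"""

// Infer a schema by reading the single sample string as a Dataset[String].
val inferredSchema = spark.read.json(Seq(sampleJsonString).toDS).schema

// Parse the real column (name assumed) with the inferred schema.
val parsed = df.withColumn("json", from_json(col("json_string_column"), inferredSchema))
parsed.printSchema()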

There is no difference in how the DataFrame schema is created between the two languages. If you want to create a dataframe with a specified schema, the correct way to build the Rows according to that schema could be something like:
val outputDictSO =
  Row("value1", "value2", "value3", "value4",
    Row("value6", Array(
      Row("value8"))))

val df0 =
  spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(outputDictSO)), targetSchemaSO)
df0.printSchema()
The DataFrame's schema is as expected:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: string (nullable = true)
|-- key4: string (nullable = true)
|-- key5: struct (nullable = true)
| |-- key6: string (nullable = true)
| |-- key7: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key8: string (nullable = true)
If you see the content of the dataframe:
df0.show()
Gives:
+------+------+------+------+--------------------+
| key1| key2| key3| key4| key5|
+------+------+------+------+--------------------+
|value1|value2|value3|value4|[value6, [[value8]]]|
+------+------+------+------+--------------------+
You can select for a nested key:
df0.select("key5.key6").show()
Spark returns:
+------+
| key6|
+------+
|value6|
+------+
The error is because the specified schema and the data do not match. If you take your data:
val outputDictSOMap = Map(
  "key1" -> "value1",
  "key2" -> "value2",
  "key3" -> "value3",
  "key4" -> "value4",
  "key5" -> (
    "key6" -> "value6",
    "key7" -> (
      "key8" -> "value8",
      "key9" -> "value9"
    )
  ))
And convert it to json:
import org.json4s.jackson.Serialization
import spark.implicits._ // needed for Seq(json).toDS below

implicit val formats = org.json4s.DefaultFormats
val json = Serialization.write(outputDictSOMap)
val df1 = spark.read.json(Seq(json).toDS)
df1.printSchema()
The resulting schema is:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: string (nullable = true)
|-- key4: string (nullable = true)
|-- key5: struct (nullable = true)
| |-- _1: struct (nullable = true)
| | |-- key6: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- key7: struct (nullable = true)
| | | |-- _1: struct (nullable = true)
| | | | |-- key8: string (nullable = true)
| | | |-- _2: struct (nullable = true)
| | | | |-- key9: string (nullable = true)
And that is the reason for your error: the tuples are serialized as structs with _1 and _2 fields, which do not match the schema you specified.
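If you prefer to keep the LinkedHashMap approach from the question, a sketch (reusing targetSchemaSO from above) is to store a Row for the key5 struct and a Seq of Rows for the key7 array instead of tuples:
import org.apache.spark.sql.Row

// Nested struct values must be Rows, and arrays of structs must be Seqs of Rows,
// for the data to match targetSchemaSO.
val outputDictFixed = scala.collection.mutable.LinkedHashMap[String, Any](
  "key1" -> "value1",
  "key2" -> "value2",
  "key3" -> "value3",
  "key4" -> "value4",
  "key5" -> Row("value6", Seq(Row("value8"), Row("value9")))
)

val fixedRow = Row.fromSeq(outputDictFixed.values.toSeq)
val dfFixed = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(fixedRow)), targetSchemaSO)
dfFixed.show(false)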

Related

Apache Spark - Transform Map[String, String] to Map[String, Struct]

I have the dataset below:
{
  "col1": "val1",
  "col2": {
    "key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
  }
}
with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true)).
I want to convert col2 to below format:
{
  "col1": "val1",
  "col2": {
    "key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
    "key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
  }
}
The updated dataset schema will be as below:
StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))
You can use transform_values on the map column:
val df2 = df.withColumn(
"col2",
expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)
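For reference, a minimal self-contained sketch of this transform_values approach. It assumes a Spark version where the transform_values higher-order function is available, and it repeats the expression above so the snippet runs on its own; the sample data mirrors the question:
import org.apache.spark.sql.functions.expr
import spark.implicits._

// Sample data with the question's schema: map<string, string> of JSON strings.
val df = Seq(
  ("val1", Map(
    "key1" -> """{"SubCol1":"ABCD","SubCol2":"EFGH"}""",
    "key2" -> """{"SubCol1":"IJKL","SubCol2":"MNOP"}"""))
).toDF("col1", "col2")

// Parse each map value into a struct, keeping the keys unchanged.
val df2 = df.withColumn(
  "col2",
  expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))"))

df2.printSchema()
// col2 should now be map<string, struct<SubCol1:string, SubCol2:string>>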
Try the below code; it will work in Spark 2.4.7.
Create a DataFrame with sample data:
scala> val df = Seq(
("val1",Map(
"key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
"key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]
Steps:
Extract the map keys (map_keys) and values (map_values) into separate arrays.
Convert the map values into the desired output, i.e. structs.
Use the map_from_arrays function to combine the keys and values from the above steps into the Map[String, Struct].
scala>
val finalDF = df
.withColumn(
"col2_new",
map_from_arrays(
map_keys($"col2"),
expr("""transform(map_values(col2), x -> from_json(x,"struct<SubCol1:string, SubCol2:string>"))""")
)
)
Printing Schema
finalDF.printSchema
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- col2_new: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- SubCol1: string (nullable = true)
| | |-- SubCol2: string (nullable = true)
Printing Final Output
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2 |col2_new |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+

Pyspark: store dataframe as JSON in MySQL table column

I have a Spark dataframe which needs to be stored in JSON format as a column value in a MySQL table (along with other string-type values in their respective columns).
Something similar to this:
column 1 | column 2
---------|---------
val 1    | [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...]
val 1    | [{"name":"Stewie G", "age":3, "city":"Quahog"}, {"name":"Ron G", "age":41, "city":"Quahog"}, {...}, ...]
...      | ...
Here [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...] is the result of one dataframe stored as a list of dicts.
I can do:
str(dataframe_object.toJSON().collect())
and then store it in the MySQL table column, but this would mean loading the entire data into memory before storing it in the MySQL table. Is there a better/more optimal way to achieve this, i.e. without using collect()?
I suppose you can convert your StructType column into a JSON string, then use spark.write.jdbc to write to MySQL. As long as your MySQL table has that column as JSON type, you should be all set.
# My sample data
{
"c1": "val1",
"c2": [
{ "name": "N1", "age": 100 },
{ "name": "N2", "age": 101 }
]
}
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.StructType([
    T.StructField('c1', T.StringType()),
    T.StructField('c2', T.ArrayType(T.StructType([
        T.StructField('name', T.StringType()),
        T.StructField('age', T.IntegerType())
    ])))
])
df = spark.read.json('a.json', schema=schema, multiLine=True)
df.show(10, False)
df.printSchema()
+----+----------------------+
|c1 |c2 |
+----+----------------------+
|val1|[{N1, 100}, {N2, 101}]|
+----+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
df.withColumn('j', F.to_json('c2')).show(10, False)
+----+----------------------+-------------------------------------------------+
|c1 |c2 |j |
+----+----------------------+-------------------------------------------------+
|val1|[{N1, 100}, {N2, 101}]|[{"name":"N1","age":100},{"name":"N2","age":101}]|
+----+----------------------+-------------------------------------------------+
EDIT #1:
# My sample data
{
"c1": "val1",
"c2": "[{ \"name\": \"N1\", \"age\": 100 },{ \"name\": \"N2\", \"age\": 101 }]"
}
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.read.json('a.json', multiLine=True)
df.show(10, False)
df.printSchema()
+----+-----------------------------------------------------------+
|c1 |c2 |
+----+-----------------------------------------------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|
+----+-----------------------------------------------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
schema = T.ArrayType(T.StructType([
T.StructField('name', T.StringType()),
T.StructField('age', T.IntegerType())
]))
df2 = df.withColumn('j', F.from_json('c2', schema))
df2.show(10, False)
df2.printSchema()
+----+-----------------------------------------------------------+----------------------+
|c1 |c2 |j |
+----+-----------------------------------------------------------+----------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|[{N1, 100}, {N2, 101}]|
+----+-----------------------------------------------------------+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- j: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)

Add new column to dataframe with modified schema or drop nested columns of array type in Scala

JSON:
{"ID": "500", "Data": [{"field2": 308, "field3": 346, "field1": 40.36582609126494, "field7": 3, "field4": 1583057346.0, "field5": -80.03243596528726, "field6": 16.0517578125, "field8": 5}, {"field2": 307, "field3": 348, "field1": 40.36591421686625, "field7": 3, "field4": 1583057347.0, "field5": -80.03259684675493, "field6": 16.234375, "field8": 5}]}
Schema:
val MySchema: StructType =
  StructType(Array(
    StructField("ID", StringType, true),
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field2", LongType, true),
        StructField("field3", LongType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true),
        StructField("field7", LongType, true),
        StructField("field8", LongType, true)
      )), true), true)
  ))
Load the JSON into a dataframe:
val MyDF = spark.readStream
.schema(MySchema)
.json(input)
where 'input' is a file that contains the above JSON.
How can I add a new column "Data_New" to the above dataframe 'MyDF' with schema as
val Data_New_Schema: StructType =
  StructType(Array(
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true)
      )), true), true)
  ))
Please note that a huge volume of such JSON files will be loaded at the source, so performing an explode followed by a collect_list will crash the driver.
You can try one of the following two methods:
for Spark 2.4+, use transform:
import org.apache.spark.sql.functions._
val df_new = df.withColumn("Data_New", expr("struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6)))").cast(Data_New_Schema))
scala> df_new.printSchema
root
|-- ID: string (nullable = true)
|-- Data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- field1: double (nullable = true)
| | |-- field2: long (nullable = true)
| | |-- field3: long (nullable = true)
| | |-- field4: double (nullable = true)
| | |-- field5: double (nullable = true)
| | |-- field6: double (nullable = true)
| | |-- field7: long (nullable = true)
| | |-- field8: long (nullable = true)
|-- Data_New: struct (nullable = false)
| |-- Data: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- field1: double (nullable = true)
| | | |-- field4: double (nullable = true)
| | | |-- field5: double (nullable = true)
| | | |-- field6: double (nullable = true)
Notice that nullable = false in the top-level schema of Data_New. If you want to make it true, add the nullif function to the SQL expression: nullif(struct(transform(Data, x -> (...))), null), or use a more efficient way: if(true, struct(transform(Data, x -> (...))), null).
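For example, a sketch of that more efficient variant applied to the same expression (reusing df, Data_New_Schema, and the functions import from above):
// Wrap the struct in if(true, ..., null) so the resulting Data_New column is nullable.
val df_new_nullable = df.withColumn(
  "Data_New",
  expr("if(true, struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6))), null)")
    .cast(Data_New_Schema))

df_new_nullable.printSchema()  // Data_New should now be reported as nullable = true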
Prior to Spark 2.4, use from_json + to_json:
val df_new = df.withColumn("Data_New", from_json(to_json(struct('Data)), Data_New_Schema))
Edit: Per the comment, if you want Data_New to be an array of structs, just remove the struct function, for example:
val Data_New_Schema: ArrayType = ArrayType(
  StructType(Array(
    StructField("field1", DoubleType, true),
    StructField("field4", DoubleType, true),
    StructField("field5", DoubleType, true),
    StructField("field6", DoubleType, true)
  )), true)
// if you need `containsNull=true`, then cast the above Type definition
val df_new = df.withColumn("Data_New", expr("transform(Data, x -> (x.field1 as field1, x.field4 as field4, x.field5 as field5, x.field6 as field6))"))
Or
val df_new = df.withColumn("Data_New", from_json(to_json('Data), Data_New_Schema))

parse json data with pyspark

I am using PySpark to read the JSON file below:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"}}}}
I wrote the following Python code:
from pyspark.sql import Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql.types import StructType
from pyspark.sql import functions as F
schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))]))]))])
df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()
df2 = df.withColumn("json", F.col("data").cast("string"))
df3=df2.select(F.col("json"))
df3.collect()
df4 =df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I am getting the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this?
When you select the column labeled json, you’re selecting a column that is entirely of the StringType (logically, because you’re casting it to that type). Even though it looks like a valid JSON object, it’s really just a string. df2.data does not have that issue though:
In [2]: df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: double (nullable = true)
|-- json: string (nullable = true)
By the way, you can immediately pass in the schema on read:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
...: df.printSchema()
...:
...:
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: string (nullable = true)
You can dig down in the columns to reach the nested values:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+

How to create schema for Spark SQL for Array of struct?

How do I create a schema to read the JSON below? I am using hiveContext.read.schema().json("input.json"), and I want to ignore the first two fields, "ErrorMessage" and "IsError", and read only "Report".
Below is the JSON:
{
  "ErrorMessage": null,
  "IsError": false,
  "Report": {
    "tl": [
      {
        "TlID": "F6",
        "CID": "mo"
      },
      {
        "TlID": "Fk",
        "CID": "mo"
      }
    ]
  }
}
I created the below schema:
val schema = StructType(
  Array(
    StructField("Report", StructType(
      Array(
        StructField("tl", ArrayType(StructType(Array(
          StructField("TlID", StringType),
          StructField("CID", IntegerType)
        )))))))))
Below is my json.printSchema():
root
|-- Report: struct (nullable = true)
| |-- tl: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- TlID: string (nullable = true)
| | | |-- CID: integer (nullable = true)
The schema is incorrect: CID in your data is clearly a String ("mo"), not an Integer. Use
val schema = StructType(Array(
  StructField("Report", StructType(
    Array(
      StructField("tl", ArrayType(StructType(Array(
        StructField("CID", StringType),
        StructField("TlID", StringType)
      )))))))))
and:
val df = Seq("""{
"ErrorMessage": null,
"IsError": false,
"Report":{
"tl":[
{
"TlID":"F6",
"CID":"mo"
},
{
"TlID":"Fk",
"CID":"mo"
}
]
}
}""").toDS
spark.read.schema(schema).json(df).show(false)
+--------------------------------+
|Report |
+--------------------------------+
|[WrappedArray([mo,F6], [mo,Fk])]|
+--------------------------------+
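As a follow-up sketch (reusing df and schema from above), you can explode tl to work with the nested fields directly:
import org.apache.spark.sql.functions.{col, explode}

spark.read.schema(schema).json(df)
  .select(explode(col("Report.tl")).as("t"))
  .select(col("t.CID"), col("t.TlID"))
  .show(false)
// yields one row per tl element: (mo, F6) and (mo, Fk)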
For reference, a field with datatype array<struct<metrics_name:string,metrics_value:string>> can be declared as:
import org.apache.spark.sql.types.ArrayType

StructField("usage_metrics", ArrayType(StructType(
  Array(
    StructField("metrics_name", StringType, true),
    StructField("metrics_value", StringType, true)
  )
)))