I have a Spark dataframe that needs to be stored in JSON format in a MySQL table as a column value (along with other string-type values in their respective columns).
Something similar to this:
column 1 | column 2
-------- | --------
val 1    | [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...]
val 1    | [{"name":"Stewie G", "age":3, "city":"Quahog"}, {"name":"Ron G", "age":41, "city":"Quahog"}, {...}, ...]
...      | ...
Here [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...] is the result of one dataframe stored as a list of dicts.
I can do:
str(dataframe_object.toJSON().collect())
and then store it in the MySQL table column, but this would mean loading the entire data into memory before storing it in the MySQL table. Is there a better/optimal way to achieve this, i.e. without using collect()?
I suppose you can convert your StructType column into a JSON string with to_json, then use df.write.jdbc to write to MySQL. As long as your MySQL table has that column as JSON type, you should be all set.
# My sample data
{
  "c1": "val1",
  "c2": [
    { "name": "N1", "age": 100 },
    { "name": "N2", "age": 101 }
  ]
}
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.StructType([
    T.StructField('c1', T.StringType()),
    T.StructField('c2', T.ArrayType(T.StructType([
        T.StructField('name', T.StringType()),
        T.StructField('age', T.IntegerType())
    ])))
])
df = spark.read.json('a.json', schema=schema, multiLine=True)
df.show(10, False)
df.printSchema()
+----+----------------------+
|c1 |c2 |
+----+----------------------+
|val1|[{N1, 100}, {N2, 101}]|
+----+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
df.withColumn('j', F.to_json('c2')).show(10, False)
+----+----------------------+-------------------------------------------------+
|c1 |c2 |j |
+----+----------------------+-------------------------------------------------+
|val1|[{N1, 100}, {N2, 101}]|[{"name":"N1","age":100},{"name":"N2","age":101}]|
+----+----------------------+-------------------------------------------------+
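For the write itself, here is a minimal sketch (the JDBC URL, table name my_table, and credentials below are placeholders; the driver class is the usual MySQL Connector/J one): the struct/array column is replaced by its JSON string and the whole dataframe is pushed through the JDBC writer, so nothing is collected to the driver.
out = df.withColumn('c2', F.to_json('c2'))   # serialize the array-of-struct column to a JSON string
out.write.jdbc(
    url='jdbc:mysql://localhost:3306/mydb',  # placeholder connection URL
    table='my_table',                        # placeholder target table with a JSON column c2
    mode='append',
    properties={
        'user': 'user',                      # placeholder credentials
        'password': 'password',
        'driver': 'com.mysql.cj.jdbc.Driver'
    }
)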
EDIT #1:
# My sample data
{
  "c1": "val1",
  "c2": "[{ \"name\": \"N1\", \"age\": 100 },{ \"name\": \"N2\", \"age\": 101 }]"
}
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.read.json('a.json', multiLine=True)
df.show(10, False)
df.printSchema()
+----+-----------------------------------------------------------+
|c1 |c2 |
+----+-----------------------------------------------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|
+----+-----------------------------------------------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
schema = T.ArrayType(T.StructType([
    T.StructField('name', T.StringType()),
    T.StructField('age', T.IntegerType())
]))
df2 = df.withColumn('j', F.from_json('c2', schema))
df2.show(10, False)
df2.printSchema()
+----+-----------------------------------------------------------+----------------------+
|c1 |c2 |j |
+----+-----------------------------------------------------------+----------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|[{N1, 100}, {N2, 101}]|
+----+-----------------------------------------------------------+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- j: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
I'm hopelessly stuck trying to generate my Spark Schema based on the JSON structure I know I want.
I have a JSON structure that looks like this:
{
  "key1": "value1",
  "key2": "value2",
  "key3": "value3",
  "key4": "value4",
  "key5": {
    "key6": "value6",
    "key7": [
      { "key8": "value8" },
      { "key8": "value9" }
    ]
  }
}
I tried to recreate the structure by creating the following Schema in Spark 2.4.8, running in Scala:
val targetSchemaSO = StructType(
  List(
    StructField("key1", StringType, true),
    StructField("key2", StringType, true),
    StructField("key3", StringType, true),
    StructField("key4", StringType, true),
    StructField("key5", StructType(
      List(
        StructField("key6", StringType, true),
        StructField("key7", ArrayType(StructType(
          List(
            StructField("key8", StringType, true)
          ))), true)
      )), true)
  )
)
However, when trying to format each row as a Spark Row using this code:
val outputDictSO = scala.collection.mutable.LinkedHashMap[String, Any](
  "key1" -> "value1",
  "key2" -> "value2",
  "key3" -> "value3",
  "key4" -> "value4",
  "key5" -> (
    "key6" -> "value6",
    "key7" -> (
      "key8" -> "value8",
      "key8" -> "value9"
    )
  )
)
return Row.fromSeq(outputDictSO.values.toSeq)
I get the following error when mapping it to the provided Schema:
Error while encoding: java.lang.RuntimeException: scala.Tuple2 is not a valid external
type for schema of struct<key6:string,key7:array<struct<key8:string>>>
The program I'm basing this off of uses this exact Schema in PySpark and the DataFrame is created just fine; do the StructTypes work differently between PySpark and Spark Scala? What would be the correct Schema to make so that nested arrays in the Schema are possible?
You can derive a valid schema from the JSON string itself and inspect it with printSchema. Note that from_json takes both the JSON string column and a schema; in Spark 2.4+ you can infer the schema from one sample string with schema_of_json (take one JSON string from the data and create a dataframe from it if needed):
import org.apache.spark.sql.functions._
val sampleJson = df.select("json_string_column").head.getString(0)
val df1 = df.withColumn("json", from_json(col("json_string_column"), schema_of_json(lit(sampleJson))))
df1.printSchema()
There is no difference in creating the DataFrame schema between the two languages. If you want to create a dataframe with a specified schema, the correct way to build the Rows according to that schema could be something like:
val outputDictSO =
  Row("value1", "value2", "value3", "value4",
    Row("value6", Array(
      Row("value8"))))

val df0 =
  spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(outputDictSO)), targetSchemaSO)
df0.printSchema()
The dataframe's schema is as expected:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: string (nullable = true)
|-- key4: string (nullable = true)
|-- key5: struct (nullable = true)
| |-- key6: string (nullable = true)
| |-- key7: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key8: string (nullable = true)
If you see the content of the dataframe:
df0.show()
Gives:
+------+------+------+------+--------------------+
| key1| key2| key3| key4| key5|
+------+------+------+------+--------------------+
|value1|value2|value3|value4|[value6, [[value8]]]|
+------+------+------+------+--------------------+
You can select a nested key:
df0.select("key5.key6").show()
Spark returns:
+------+
| key6|
+------+
|value6|
+------+
The error is because the specified schema and the data do not match. If you take your data:
val outputDictSOMap = Map(
  "key1" -> "value1",
  "key2" -> "value2",
  "key3" -> "value3",
  "key4" -> "value4",
  "key5" -> (
    "key6" -> "value6",
    "key7" -> (
      "key8" -> "value8",
      "key9" -> "value9"
    )
  ))
And convert it to json:
import org.json4s.jackson.Serialization
implicit val formats = org.json4s.DefaultFormats
val json = Serialization.write(outputDictSOMap)
val df1 = spark.read.json(Seq(json).toDS)
df1.printSchema()
The resulting schema is:
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: string (nullable = true)
|-- key4: string (nullable = true)
|-- key5: struct (nullable = true)
| |-- _1: struct (nullable = true)
| | |-- key6: string (nullable = true)
| |-- _2: struct (nullable = true)
| | |-- key7: struct (nullable = true)
| | | |-- _1: struct (nullable = true)
| | | | |-- key8: string (nullable = true)
| | | |-- _2: struct (nullable = true)
| | | | |-- key9: string (nullable = true)
And that is the reason for your error.
I am trying to read a nested JSON file. I am not able to explode the nested column and read the JSON file properly.
My JSON file:
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun#yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline","true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```
My below output and error:
+--------------------+---------+
| Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...| JNTU|
+--------------------+---------+
root
|-- Department: struct (nullable = true)
| |-- DepartmentID: string (nullable = true)
| |-- Student: struct (nullable = true)
| | |-- Subjects: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- subjectId: string (nullable = true)
| | | | |-- subjectName: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json
You can only explode array columns. Choose the Subjects column to explode:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline","true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # assign the result: withColumn returns a new DataFrame
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
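If the goal is a fully flattened table, a rough follow-up sketch (reusing the column names from the schema above; "Univerity" is spelled as in the sample data, and the alias names are just illustrative) is to explode Subjects and then pull the remaining nested fields out with dot notation:
flat = (df
        .withColumn("Subjects", explode(col("Department.Student.Subjects")))
        .select(
            col("Univerity"),
            col("Department.DepartmentID").alias("DepartmentID"),
            col("Department.Student.lastName").alias("lastName"),
            col("Department.Student.address").alias("address"),
            col("Department.Student.email").alias("email"),
            col("Subjects.subjectId").alias("subjectId"),
            col("Subjects.subjectName").alias("subjectName")
        ))
flat.show(truncate=False)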
I have the below dataset:
{
  "col1": "val1",
  "col2": {
    "key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
  }
}
with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true)).
I want to convert col2 to below format:
{
  "col1": "val1",
  "col2": {
    "key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
    "key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
  }
}
The updated dataset schema will be as below:
StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))
You can use transform_values (available in Spark 3.0+) on the map column:
val df2 = df.withColumn(
"col2",
expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)
Try the below code. It will work in Spark 2.4.7.
Creating a DataFrame with sample data:
scala> val df = Seq(
  ("val1", Map(
    "key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]
Steps:
1. Extract the map keys (map_keys) and values (map_values) into separate arrays.
2. Convert the map values into the desired output, i.e. structs.
3. Use the map_from_arrays function to combine the keys and values from the above steps into a Map[String, Struct].
scala>
val finalDF = df
  .withColumn(
    "col2_new",
    map_from_arrays(
      map_keys($"col2"),
      expr("""transform(map_values(col2), x -> from_json(x,"struct<SubCol1:string, SubCol2:string>"))""")
    )
  )
Printing Schema
finalDF.printSchema
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- col2_new: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- SubCol1: string (nullable = true)
| | |-- SubCol2: string (nullable = true)
Printing Final Output
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2 |col2_new |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
Schema
root
|-- userId: string (nullable = true)
|-- languageknowList: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- code: string (nullable = false)
| | |-- description: string (nullable = false)
| | |-- name: string (nullable = false)
In this schema there is a user with userId 0. I have to concatenate the languageknowList of userId 0 with the languageknowList of all other users.
How can I do that?
Example:
input data to DF
[{
  "userId": 1,
  "languageknowList": [[10, "Hindi", "Hindi"], [11, "Spanish", "Spanish"]]
},
{
  "userId": 2,
  "languageknowList": [[11, "Spanish", "Spanish"]]
},
{
  "userId": 0,
  "languageknowList": [[1, "English", "English"], [2, "German", "German"]]
}]
output df should be like:
[{
  "userId": 1,
  "languageknowList": [[10, "Hindi", "Hindi"], [11, "Spanish", "Spanish"], [1, "English", "English"], [2, "German", "German"]]
},
{
  "userId": 2,
  "languageknowList": [[11, "Spanish", "Spanish"], [1, "English", "English"], [2, "German", "German"]]
}]
You can cross join the dataframe to the row with userId = 0, and concat the arrays of languages:
from pyspark.sql import functions as F

result = df.filter('userId != 0').crossJoin(
    df.filter('userId = 0').select('languageknowList').toDF('language')
).select(
    'userId',
    F.concat('languageknowList', 'language').alias('languageknowList')
)
result.show(20,0)
+------+----------------------------------------------------------------------------------------+
|userId|languageknowList |
+------+----------------------------------------------------------------------------------------+
|1 |[[10, Hindi, Hindi], [11, Spanish, Spanish], [1, English, English], [2, German, German]]|
|2 |[[11, Spanish, Spanish], [1, English, English], [2, German, German]] |
+------+----------------------------------------------------------------------------------------+
result.coalesce(1).write.json('result')
$ cat result/part-00000-b34b3748-71b5-46d4-b011-6b208978cc5a-c000.json
{"userId":1,"languageknowList":[["10","Hindi","Hindi"],["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}
{"userId":2,"languageknowList":[["11","Spanish","Spanish"],["1","English","English"],["2","German","German"]]}
JSON:
{"ID": "500", "Data": [{"field2": 308, "field3": 346, "field1": 40.36582609126494, "field7": 3, "field4": 1583057346.0, "field5": -80.03243596528726, "field6": 16.0517578125, "field8": 5}, {"field2": 307, "field3": 348, "field1": 40.36591421686625, "field7": 3, "field4": 1583057347.0, "field5": -80.03259684675493, "field6": 16.234375, "field8": 5}]}
Schema:
val MySchema: StructType =
  StructType(Array(
    StructField("ID", StringType, true),
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field2", LongType, true),
        StructField("field3", LongType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true),
        StructField("field7", LongType, true),
        StructField("field8", LongType, true)
      )), true), true)))
Load the JSON into a dataframe:
val MyDF = spark.readStream
  .schema(MySchema)
  .json(input)
where 'input' is a file that contains the above JSON.
How can I add a new column "Data_New" to the above dataframe 'MyDF' with schema as
val Data_New_Schema: StructType =
  StructType(Array(
    StructField("Data", ArrayType(
      StructType(Array(
        StructField("field1", DoubleType, true),
        StructField("field4", DoubleType, true),
        StructField("field5", DoubleType, true),
        StructField("field6", DoubleType, true)
      )), true), true)))
Please note that a huge volume of such JSON files will be loaded at the source, so performing an explode followed by a collect_list will crash the driver.
You can try one of the following two methods:
For Spark 2.4+, use transform:
import org.apache.spark.sql.functions._
val df_new = df.withColumn("Data_New", expr("struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6)))").cast(Data_New_Schema))
scala> df_new.printSchema
root
|-- ID: string (nullable = true)
|-- Data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- field1: double (nullable = true)
| | |-- field2: long (nullable = true)
| | |-- field3: long (nullable = true)
| | |-- field4: double (nullable = true)
| | |-- field5: double (nullable = true)
| | |-- field6: double (nullable = true)
| | |-- field7: long (nullable = true)
| | |-- field8: long (nullable = true)
|-- Data_New: struct (nullable = false)
| |-- Data: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- field1: double (nullable = true)
| | | |-- field4: double (nullable = true)
| | | |-- field5: double (nullable = true)
| | | |-- field6: double (nullable = true)
Notice that nullable = false in the top-level schema of Data_New. If you want to make it true, add the nullif function to the SQL expression: nullif(struct(transform(Data, x -> (...))), null), or, more efficiently, if(true, struct(transform(Data, x -> (...))), null).
Prior to Spark 2.4, use from_json + to_json:
val df_new = df.withColumn("Data_New", from_json(to_json(struct('Data)), Data_New_Schema))
Edit: Per comment, if you want Data_New to be an array of structs, just remove the struct function, for example:
val Data_New_Schema: ArrayType = ArrayType(
  StructType(Array(
    StructField("field1", DoubleType, true),
    StructField("field4", DoubleType, true),
    StructField("field5", DoubleType, true),
    StructField("field6", DoubleType, true)
  )), true)
// if you need `containsNull=true`, then cast the above Type definition
val df_new = df.withColumn("Data_New", expr("transform(Data, x -> (x.field1 as field1, x.field4 as field4, x.field5 as field5, x.field6 as field6))"))
Or
val df_new = df.withColumn("Data_New", from_json(to_json('Data), Data_New_Schema))