Reading a nested JSON file in PySpark: pyspark.sql.utils.AnalysisException

I am trying to read a nested JSON file. I am not able to explode the nested column and read the JSON file properly.
My JSON file:
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun#yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```
My output and the error are below:
+--------------------+---------+
| Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...| JNTU|
+--------------------+---------+
root
|-- Department: struct (nullable = true)
| |-- DepartmentID: string (nullable = true)
| |-- Student: struct (nullable = true)
| | |-- Subjects: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- subjectId: string (nullable = true)
| | | | |-- subjectName: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json

You can only explode array (or map) columns. Choose the Subjects column to explode:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
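If you also want the nested struct fields as top-level columns, you can select them with dot paths after the explode. A minimal sketch based on the schema printed above (the column names, including the "Univerity" spelling, come from the source JSON):
```
from pyspark.sql.functions import col, explode

# `df` is the dataframe read above, with the schema shown by printSchema()
flat_df = (
    df.withColumn("Subject", explode(col("Department.Student.Subjects")))
      .select(
          col("Univerity"),
          col("Department.DepartmentID").alias("DepartmentID"),
          col("Department.Student.lastName").alias("lastName"),
          col("Department.Student.email").alias("email"),
          col("Subject.subjectId").alias("subjectId"),
          col("Subject.subjectName").alias("subjectName"),
      )
)
flat_df.show(truncate=False)
```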

Related

Transform a list of JSON strings to a list of dicts in PySpark

I’m struggling to transform a list of JSON strings into a list of dicts in PySpark without using a udf or an rdd.
I have this kind of dataframe:

Key    | JSON_string
-------|--------------------------------------------------------------------------------
123456 | ["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]
789123 | ["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]
How can I transform col("JSON_string"), using only built-in PySpark functions, into [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}]?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads and json.dump, but found no way to get the expected result.
Thank you for your help
Explode your JSON_string column, parse it as JSON, then group by again:
from pyspark.sql import functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))

# infer the JSON schema from the exploded strings
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema

df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))

df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)
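If the shape of the JSON strings is known up front, a possible variant is to pass an explicit schema to from_json and skip the rdd-based inference pass entirely. A sketch, assuming every row follows one of the two shapes shown in the sample data:
```
from pyspark.sql import functions as f
from pyspark.sql import types as T

# explicit schema covering the fields seen in the sample rows (an assumption)
json_schema = T.StructType([
    T.StructField("Zipcode", T.LongType()),
    T.StructField("ZipCodeType", T.StringType()),
    T.StructField("City", T.StringType()),
    T.StructField("State", T.StringType()),
])

df_result = (
    df.withColumn("JSON_string", f.explode("JSON_string"))
      .withColumn("JSON", f.from_json("JSON_string", json_schema))
      .drop("JSON_string")
      .groupBy("Key")
      .agg(f.collect_list("JSON").alias("JSON"))
)
```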

Pyspark: store dataframe as JSON in MySQL table column

I have a Spark dataframe that needs to be stored in JSON format as a column value in a MySQL table (along with other string-type values in their respective columns).
Something similar to this:

column 1 | column 2
---------|------------------------------------------------------------------------------------------------------
val 1    | [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...]
val 1    | [{"name":"Stewie G", "age":3, "city":"Quahog"}, {"name":"Ron G", "age":41, "city":"Quahog"}, {...}, ...]
...      | ...
Here [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...] is the result of one dataframe stored as a list of dicts.
I can do:
str(dataframe_object.toJSON().collect())
and then store it in the MySQL table column, but this would mean loading the entire data into memory before storing it in the MySQL table. Is there a better/more optimal way to achieve this, i.e. without using collect()?
I suppose you can convert your StructType column into a JSON string, then use df.write.jdbc to write to MySQL. As long as your MySQL table has that column as a JSON type, you should be all set.
# My sample data
{
  "c1": "val1",
  "c2": [
    { "name": "N1", "age": 100 },
    { "name": "N2", "age": 101 }
  ]
}

from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('c1', T.StringType()),
    T.StructField('c2', T.ArrayType(T.StructType([
        T.StructField('name', T.StringType()),
        T.StructField('age', T.IntegerType())
    ])))
])

df = spark.read.json('a.json', schema=schema, multiLine=True)
df.show(10, False)
df.printSchema()
+----+----------------------+
|c1 |c2 |
+----+----------------------+
|val1|[{N1, 100}, {N2, 101}]|
+----+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
df.withColumn('j', F.to_json('c2')).show(10, False)
+----+----------------------+-------------------------------------------------+
|c1 |c2 |j |
+----+----------------------+-------------------------------------------------+
|val1|[{N1, 100}, {N2, 101}]|[{"name":"N1","age":100},{"name":"N2","age":101}]|
+----+----------------------+-------------------------------------------------+
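The answer refers to writing with JDBC but does not show that step. A hedged sketch of what it might look like, where the URL, table name, credentials and driver are placeholders rather than values from the question:
```
from pyspark.sql import functions as F

# serialize the struct column to a JSON string before writing
out = df.withColumn("c2", F.to_json("c2"))

(out.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/mydb")  # placeholder URL
    .option("dbtable", "my_table")                      # placeholder table name
    .option("user", "user")                             # placeholder credentials
    .option("password", "password")
    .option("driver", "com.mysql.cj.jdbc.Driver")       # MySQL JDBC driver must be on the classpath
    .mode("append")
    .save())
```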
EDIT #1:
# My sample data
{
  "c1": "val1",
  "c2": "[{ \"name\": \"N1\", \"age\": 100 },{ \"name\": \"N2\", \"age\": 101 }]"
}

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = spark.read.json('a.json', multiLine=True)
df.show(10, False)
df.printSchema()
+----+-----------------------------------------------------------+
|c1 |c2 |
+----+-----------------------------------------------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|
+----+-----------------------------------------------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
schema = T.ArrayType(T.StructType([
    T.StructField('name', T.StringType()),
    T.StructField('age', T.IntegerType())
]))

df2 = df.withColumn('j', F.from_json('c2', schema))
df2.show(10, False)
df2.printSchema()
+----+-----------------------------------------------------------+----------------------+
|c1 |c2 |j |
+----+-----------------------------------------------------------+----------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|[{N1, 100}, {N2, 101}]|
+----+-----------------------------------------------------------+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- j: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)

Add new column to dataframe with modified schema or drop nested columns of array type in Scala

JSON:
{"ID": "500", "Data": [{"field2": 308, "field3": 346, "field1": 40.36582609126494, "field7": 3, "field4": 1583057346.0, "field5": -80.03243596528726, "field6": 16.0517578125, "field8": 5}, {"field2": 307, "field3": 348, "field1": 40.36591421686625, "field7": 3, "field4": 1583057347.0, "field5": -80.03259684675493, "field6": 16.234375, "field8": 5}]}
Schema:
val MySchema: StructType = StructType(Array(
  StructField("ID", StringType, true),
  StructField("Data", ArrayType(
    StructType(Array(
      StructField("field1", DoubleType, true),
      StructField("field2", LongType, true),
      StructField("field3", LongType, true),
      StructField("field4", DoubleType, true),
      StructField("field5", DoubleType, true),
      StructField("field6", DoubleType, true),
      StructField("field7", LongType, true),
      StructField("field8", LongType, true)
    )), true), true)
))
Load the JSON into a dataframe:
val MyDF = spark.readStream
  .schema(MySchema)
  .json(input)
where 'input' points to a file that contains the above JSON.
How can I add a new column "Data_New" to the above dataframe 'MyDF' with the following schema?
val Data_New_Schema: StructType = StructType(Array(
  StructField("Data", ArrayType(
    StructType(Array(
      StructField("field1", DoubleType, true),
      StructField("field4", DoubleType, true),
      StructField("field5", DoubleType, true),
      StructField("field6", DoubleType, true)
    )), true), true)
))
Please note that a huge volume of such JSON files will be loaded from the source, so performing an explode followed by a collect_list will crash the driver.
You can try one of the following two methods:
for Spark 2.4+, use transform:
import org.apache.spark.sql.functions._
val df_new = df.withColumn("Data_New", expr("struct(transform(Data, x -> (x.field1 as f1, x.field4 as f4, x.field5 as f5, x.field6 as f6)))").cast(Data_New_Schema))
scala> df_new.printSchema
root
|-- ID: string (nullable = true)
|-- Data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- field1: double (nullable = true)
| | |-- field2: long (nullable = true)
| | |-- field3: long (nullable = true)
| | |-- field4: double (nullable = true)
| | |-- field5: double (nullable = true)
| | |-- field6: double (nullable = true)
| | |-- field7: long (nullable = true)
| | |-- field8: long (nullable = true)
|-- Data_New: struct (nullable = false)
| |-- Data: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- field1: double (nullable = true)
| | | |-- field4: double (nullable = true)
| | | |-- field5: double (nullable = true)
| | | |-- field6: double (nullable = true)
Notice that nullable = false on the top-level schema of Data_New. If you want to make it true, add the nullif function to the SQL expression: nullif(struct(transform(Data, x -> (...))), null), or, more efficiently, if(true, struct(transform(Data, x -> (...))), null).
Prior to Spark 2.4, use from_json + to_json:
val df_new = df.withColumn("Data_New", from_json(to_json(struct('Data)), Data_New_Schema))
Edit: Per the comment, if you want Data_New to be an array of structs, just remove the struct function, for example:
val Data_New_Schema: ArrayType = ArrayType(
  StructType(Array(
    StructField("field1", DoubleType, true),
    StructField("field4", DoubleType, true),
    StructField("field5", DoubleType, true),
    StructField("field6", DoubleType, true)
  )), true)
// if you need `containsNull=true`, then cast the above Type definition
val df_new = df.withColumn("Data_New", expr("transform(Data, x -> (x.field1 as field1, x.field4 as field4, x.field5 as field5, x.field6 as field6))"))
Or
val df_new = df.withColumn("Data_New", from_json(to_json('Data), Data_New_Schema))
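For readers following the earlier PySpark sections, the same Spark 2.4+ transform idea can be written with pyspark.sql.functions.transform (available as a Python function since Spark 3.1; on older versions you would use expr as above). A rough sketch, assuming a dataframe df with the Data column described in the question:
```
from pyspark.sql import functions as F

# keep only field1, field4, field5 and field6 from each element of the Data array
df_new = df.withColumn(
    "Data_New",
    F.transform(
        "Data",
        lambda x: F.struct(
            x["field1"].alias("field1"),
            x["field4"].alias("field4"),
            x["field5"].alias("field5"),
            x["field6"].alias("field6"),
        ),
    ),
)
df_new.printSchema()
```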

Merge DataFrames With Different Schemas - Scala Spark

I'm working on transforming a JSON document into a DataFrame. In the first step I create an array of DataFrames and after that I make a union. But I have a problem doing a union on JSON with different schemas.
I can do it if the JSON objects have the same schema, as you can see in this other question: Parse JSON root in a column using Spark-Scala
I'm working with the following data:
val exampleJsonDifferentSchema = spark.createDataset(
  """
  {"ITEM1512":
     {"name":"Yin",
      "address":{"city":"Columbus",
                 "state":"Ohio"},
      "age":28},
   "ITEM1518":
     {"name":"Yang",
      "address":{"city":"Working",
                 "state":"Marc"}},
   "ITEM1458":
     {"name":"Yossup",
      "address":{"city":"Macoss",
                 "state":"Microsoft"},
      "age":28}
  }""" :: Nil)
As you can see, the difference is that one of the items doesn't have the age field.
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
My current solution is the following code, where I make an array of DataFrames:
val columns: Array[String] = itemsExample.columns
var arrayOfExampleDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = itemsExample.select(lit(col_name).as("Item"), col(col_name).as("Value"))
  arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}

val jsonDF = arrayOfExampleDFs.reduce(_ union _)
But since the JSON has different schemas, the reduce with a union fails because the DataFrames need to have the same schema. In fact, I get the following error:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
I'm trying to do something similar to what I found in this question: How to perform union on two DataFrames with different amounts of columns in spark?
Specifically that part:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}
But I can't build the set of columns because I need to catch the columns dynamically, both the combined set and the individual ones. I can only do something like this:
for (i <- 0 until arrayOfExampleDFs.length - 1) {
  val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
  val cols2 = arrayOfExampleDFs(i + 1).select("Value").columns.toSet
  val total = cols1 ++ cols2
  arrayOfExampleDFs(i).select("Value").printSchema()
  print(total)
}
So, what would a function that performs this union dynamically look like?
Update: expected output
In this case, this is the expected DataFrame and schema:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
Here is one possible solution which creates a common schema for all the dataframes by adding the age column when it is not found:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

....

for (col_name <- columns) {
  val currentDf = itemsExampleDiff.select(col(col_name))

  // try to identify if age field is present
  val hasAge = currentDf.schema.fields(0)
    .dataType
    .asInstanceOf[StructType]
    .fields
    .contains(StructField("age", LongType, true))

  val valueCol = hasAge match {
    // if not, construct a new value column
    case false => struct(
      col(s"${col_name}.address"),
      lit(null).cast("bigint").as("age"),
      col(s"${col_name}.name")
    )
    case true => col(col_name)
  }

  arrayOfExampleDFs = arrayOfExampleDFs :+ currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
Analysis: probably the most demanding part is finding out whether age is present or not. For the lookup we use the df.schema.fields property, which allows us to dig into the internal schema of each column.
When age is not found we regenerate the column by using a struct:
struct(
  col(s"${col_name}.address"),
  lit(null).cast("bigint").as("age"),
  col(s"${col_name}.name")
)
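The same pattern translates to PySpark almost line for line. A sketch, assuming itemsExampleDiff is the dataframe read from the JSON above:
```
from functools import reduce
from pyspark.sql import functions as F

dfs = []
for col_name in itemsExampleDiff.columns:
    # look up the field names of this item's struct in the schema
    field_names = [fld.name for fld in itemsExampleDiff.schema[col_name].dataType.fields]
    if "age" in field_names:
        value_col = F.col(col_name)
    else:
        # rebuild the struct with a null age so all schemas line up
        value_col = F.struct(
            F.col(f"{col_name}.address"),
            F.lit(None).cast("bigint").alias("age"),
            F.col(f"{col_name}.name"),
        )
    dfs.append(itemsExampleDiff.select(F.lit(col_name).alias("Item"), value_col.alias("Value")))

jsonDF = reduce(lambda a, b: a.union(b), dfs)
jsonDF.show(truncate=False)
```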

Parse JSON data with PySpark

I am using PySpark to read the JSON file below:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}
I wrote the following Python code:
from pyspark.sql import Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql.types import StructType
from pyspark.sql import functions as F

schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])

df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()

df2 = df.withColumn("json", F.col("data").cast("string"))
df3 = df2.select(F.col("json"))
df3.collect()

df4 = df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I am getting the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this?
When you select the column labeled json, you’re selecting a column that is entirely of the StringType (logically, because you’re casting it to that type). Even though it looks like a valid JSON object, it’s really just a string. df2.data does not have that issue though:
In [2]: df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: double (nullable = true)
|-- json: string (nullable = true)
By the way, you can immediately pass in the schema on read:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
...: df.printSchema()
...:
...:
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: string (nullable = true)
You can dig down in the columns to reach the nested values:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+
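If you prefer a plain column name over the full dotted path in the output, alias the nested selection; a small usage note on the same dataframe:
```
from pyspark.sql import functions as F

df.select(F.col("data.indicatr.indicatr.id").alias("id")).show(truncate=False)
```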