Parse JSON data with PySpark

I am using PySpark to read the JSON file below:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}
I wrote the following Python code:
from pyspark.sql import Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql.types import StructType
from pyspark.sql import functions as F

schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])

df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()

df2 = df.withColumn("json", F.col("data").cast("string"))
df3 = df2.select(F.col("json"))
df3.collect()

df4 = df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I am getting the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this?

When you select the column labeled json, you're selecting a column that is entirely StringType, because you cast it to that type. Even though it looks like a valid JSON object, it's really just a string, and casting a struct to string does not produce JSON that from_json can parse (see the to_json sketch below the schema). df2.data does not have that issue, though:
In [2]: df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: double (nullable = true)
|-- json: string (nullable = true)
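If you do want to go through a JSON string column, here is a minimal sketch (reusing spark, F, and the type imports from the question): serialize the struct with to_json instead of casting, and hand from_json the schema of what was actually serialized, i.e. the struct under data rather than the whole file schema:
inner_schema = StructType([
    StructField("indicatr", StructType([
        StructField("indicatr", StructType([
            StructField("id", StringType())
        ]))
    ]))
])

df_json = df.withColumn("json", F.to_json(F.col("data")))   # a real JSON string now
df_back = df_json.withColumn("parsed", F.from_json("json", inner_schema))
df_back.select("parsed.indicatr.indicatr.id").show(truncate=False)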
By the way, you can immediately pass in the schema on read:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
...: df.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: string (nullable = true)
You can dig down in the columns to reach the nested values:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+

Related

Transform a list of JSON string to a list of dict in Pyspark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a UDF or the RDD API.
I have this kind of dataframe:
+------+------------------------------------------------------------------------------+
|Key   |JSON_string                                                                   |
+------+------------------------------------------------------------------------------+
|123456|["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]|
|789123|["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]   |
+------+------------------------------------------------------------------------------+
How can I transform the JSON_string column, using only built-in PySpark functions, into [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}]?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads and json.dump, but no combination gave the expected result.
Thank you for your help.
Explode your JSON_string column, parse each element as JSON with from_json, then group by again.
from pyspark.sql import functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema

df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))

df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)
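The answer above still touches the RDD API for schema inference. If you want to stay entirely within built-in functions, a hedged alternative sketch (assuming Spark 3.1+ for F.transform with a Python lambda, and that each array element is a plain JSON object string with the fields shown above) parses every element of the original array column in place, with no explode, RDD or UDF:
import pyspark.sql.functions as F

# DDL schema covering the fields seen in the data; missing keys become nulls
item_schema = "City string, State string, ZipCodeType string, Zipcode long"

# df is the original dataframe with the array column, before any explode
df_result = df.withColumn(
    "JSON",
    F.transform("JSON_string", lambda s: F.from_json(s, item_schema)),
)
df_result.show(truncate=False)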

Reading a nested JSON file in PySpark: pyspark.sql.utils.AnalysisException

I am trying to read a nested JSON file. I am not able to explode the nested column and read the JSON file properly.
My JSON file:
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun#yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```
My output and the error are below:
+--------------------+---------+
| Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...| JNTU|
+--------------------+---------+
root
|-- Department: struct (nullable = true)
| |-- DepartmentID: string (nullable = true)
| |-- Student: struct (nullable = true)
| | |-- Subjects: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- subjectId: string (nullable = true)
| | | | |-- subjectName: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json
You can only explode array columns. Choose the Subjects column to explode.
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # assign the result back so the exploded column is actually kept
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
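Building on the answer above, a hedged follow-up sketch: once Subjects is exploded, the remaining nested fields can be flattened by selecting the leaf columns directly (names as printed in the schema above; "Univerity" is spelled that way in the source JSON):
flat = (
    df.withColumn("Subject", explode(col("Department.Student.Subjects")))
      .select(
          col("Univerity"),
          col("Department.DepartmentID").alias("DepartmentID"),
          col("Department.Student.lastName").alias("lastName"),
          col("Department.Student.email").alias("email"),
          col("Subject.subjectId").alias("subjectId"),
          col("Subject.subjectName").alias("subjectName"),
      )
)
flat.show(truncate=False)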

Merge DataFrames With Different Schemas - Scala Spark

I'm working on transforming a JSON into a DataFrame. In the first step I create an array of DataFrames and after that I make a union. But I have a problem doing the union over JSON with different schemas.
I can do it if the JSON has the same schema, as you can see in this other question: Parse JSON root in a column using Spark-Scala
I'm working with the following data:
val exampleJsonDifferentSchema = spark.createDataset(
  """
  {"ITEM1512":
     {"name":"Yin",
      "address":{"city":"Columbus", "state":"Ohio"},
      "age":28},
   "ITEM1518":
     {"name":"Yang",
      "address":{"city":"Working", "state":"Marc"}},
   "ITEM1458":
     {"name":"Yossup",
      "address":{"city":"Macoss", "state":"Microsoft"},
      "age":28}
  }""" :: Nil)
As you can see, the difference is that one of the items doesn't have age.
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
My solution so far is the following code, where I build an array of DataFrames:
val columns: Array[String] = itemsExample.columns
var arrayOfExampleDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = itemsExample.select(lit(col_name).as("Item"), col(col_name).as("Value"))
  arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
But since the JSON has different schemas, when I reduce with a union it fails, because the DataFrames need to have the same schema. In fact, I get the following error:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
I'm trying to do something similar to what I found in this question: How to perform union on two DataFrames with different amounts of columns in spark?
Specifically that part:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}
But I can't build those column sets, because I need to capture the columns dynamically, both the full set and the individual ones. I can only do something like this:
for (i <- 0 until arrayOfExampleDFs.length - 1) {
  val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
  val cols2 = arrayOfExampleDFs(i + 1).select("Value").columns.toSet
  val total = cols1 ++ cols2
  arrayOfExampleDFs(i).select("Value").printSchema()
  print(total)
}
So, what would a function look like that performs this union dynamically?
Update: expected output
In this case, this is the expected DataFrame and schema:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
Here is one possible solution which creates a common schema for all the dataframes by adding the age column when it is not found:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

....

for (col_name <- columns) {
  val currentDf = itemsExampleDiff.select(col(col_name))

  // try to identify if the age field is present
  val hasAge = currentDf.schema.fields(0)
    .dataType
    .asInstanceOf[StructType]
    .fields
    .contains(StructField("age", LongType, true))

  val valueCol = hasAge match {
    // if not, construct a new Value column with a null age
    case false => struct(
      col(s"${col_name}.address"),
      lit(null).cast("bigint").as("age"),
      col(s"${col_name}.name")
    )
    case true => col(col_name)
  }

  arrayOfExampleDFs = arrayOfExampleDFs :+ currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}

val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
Analysis: probably the most demanding part is finding out whether age is present or not. For the lookup we use the df.schema.fields property, which allows us to dig into the internal schema of each column.
When age is not found we regenerate the column by using a struct:
struct(
  col(s"${col_name}.address"),
  lit(null).cast("bigint").as("age"),
  col(s"${col_name}.name")
)
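For readers coming from the PySpark side of this thread, here is a minimal PySpark sketch of the same idea, under the assumption that you already have one DataFrame per item with an Item column and a Value struct: rebuild the Value struct with a typed null age where the field is missing, then union. The toy DataFrames below are hypothetical stand-ins for the items above:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("union-different-schemas").getOrCreate()

with_age = spark.createDataFrame(
    [("ITEM1458", (("Macoss", "Microsoft"), 28, "Yossup"))],
    "Item string, Value struct<address:struct<city:string,state:string>, age:bigint, name:string>",
)
without_age = spark.createDataFrame(
    [("ITEM1518", (("Working", "Marc"), "Yang"))],
    "Item string, Value struct<address:struct<city:string,state:string>, name:string>",
)

# pad the missing age with a typed null so both Value structs share one schema
padded = without_age.select(
    "Item",
    F.struct(
        F.col("Value.address").alias("address"),
        F.lit(None).cast("bigint").alias("age"),
        F.col("Value.name").alias("name"),
    ).alias("Value"),
)

with_age.unionByName(padded).show(truncate=False)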

How to query using column names that include "$"?

In Spark SQL, I can use
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .master("local")
  .config("spark.sql.warehouse.dir", "warehouseLocation-value")
  .getOrCreate()

val df = spark.read.json("source/myRecords.json")
df.createOrReplaceTempView("shipment")
val sqlDF = spark.sql("SELECT * FROM shipment")
to get the data from "myRecords.json", and the structure of this json file is:
df.printSchema()
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- container: struct (nullable = true)
| |-- barcode: string (nullable = true)
| |-- code: string (nullable = true)
I can get specific columns of this JSON, such as:
val sqlDF = spark.sql("SELECT container.barcode, container.code FROM shipment")
But how can I get _id.$oid from this JSON file?
I have tried "SELECT id.$oid FROM shipment_log" and "SELECT id.\$oid FROM shipment_log", but neither works.
error message:
error: invalid escape character
Can anyone tell me how I can get _id.$oid?
Backticks are your friend:
spark.read.json(sc.parallelize(Seq(
"""{"_id": {"$oid": "foo"}}""")
)).createOrReplaceTempView("df")
spark.sql("SELECT _id.`$oid` FROM df").show
+----+
|$oid|
+----+
| foo|
+----+
Same as DataFrame API:
spark.table("df").select($"_id".getItem("$oid")).show
+--------+
|_id.$oid|
+--------+
| foo|
+--------+
or
spark.table("df").select($"_id.$$oid")
+--------+
|_id.$oid|
+--------+
| foo|
+--------+
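The same thing in PySpark (a small sketch reusing the "df" temp view created above): either escape the field with backticks inside the column path, or use getField, which sidesteps the escaping entirely.
from pyspark.sql import functions as F

spark.table("df").select(F.col("_id").getField("$oid").alias("oid")).show()
spark.table("df").select("_id.`$oid`").show()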

How to parse nested JSON objects in spark sql?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)
Assuming you read in a JSON file and print the schema as you are showing us, like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested objects inside a struct type like so...
DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();
DataFrame appName = apps.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
Try this:
val nameAndAddress = sqlContext.sql("""
SELECT name, address.city, address.state
FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
Have you tried doing it straight from the SQL query like
Select apps.element.Ratings from yourTableName
This will probably return an array and you can more easily access the elements inside.
Also, I use this online Json viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/
I am using pyspark, but the logic should be similar.
I found this way of parsing my nested json useful:
df.select(df.apps.appName.alias("apps_Name"),
          df.apps.appPackage.alias("apps_Package"),
          df.apps.Ratings.date.alias("apps_Ratings_date")) \
  .show()
The code could obviously be shortened, for example by building the column names and aliases in a loop with f-strings.
var df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df");
spark.sql("select apps.element.Ratings from df where apps.element.appName like '%app_name%' ").show()