How to parse nested JSON objects in Spark SQL?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)

Assuming you read in a JSON file and printed the schema you are showing us, like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested objects inside a struct type like so (note that the column in your schema is apps, not app):
DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();
DataFrame appName = apps.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
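Note that selecting appName through the array column yields one array of names per row; to get one name per row you still need to explode the array (see the hedged explode sketch at the end of the next answer).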

Try this:
val nameAndAddress = sqlContext.sql("""
SELECT name, address.city, address.state
FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
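Applied to the schema in the question, the same idea needs explode to unnest the arrays first. A hedged PySpark sketch (assumes a SparkSession named spark and the temp table df registered above; with the older API, sqlContext.sql takes the same query):
# Hedged sketch, not from the original answers: flatten the apps and
# Ratings arrays so each rating becomes its own row.
flat = spark.sql("""
    SELECT id, app.appName, app.appPackage, r.date, r.rating
    FROM df
    LATERAL VIEW explode(apps) a AS app
    LATERAL VIEW explode(app.Ratings) rt AS r
""")
flat.show()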

Have you tried doing it straight from the SQL query, like
SELECT apps.Ratings FROM yourTableName
(the element shown in printSchema is just notation for an array's contents, not a real field)? This will probably return an array and you can more easily access the elements inside.
Also, I use this online JSON viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/

I am using pyspark, but the logic should be similar.
I found this way of parsing my nested JSON useful:
df.select(df.apps.appName.alias("apps_Name"), \
df.apps.appPackage.alias("apps_Package"), \
df.apps.Ratings.date.alias("apps_Ratings_date")) \
.show()
The code could obviously be shortened with an f-string.

val df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df")
spark.sql("select app.Ratings from df lateral view explode(apps) a as app where app.appName like '%app_name%'").show()

Related

Transform a list of JSON string to a list of dict in Pyspark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a udf or the rdd API.
I have this kind of dataframe:
+------+------------------------------------------------------------------------------+
|Key   |JSON_string                                                                   |
+------+------------------------------------------------------------------------------+
|123456|["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]|
|789123|["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]   |
+------+------------------------------------------------------------------------------+
How can I transform col(JSON_string) using built-in PySpark functions to [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}]?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads, json.dump, but found no way to get the expected result.
Thank you for your help.
Explode your JSON_string column, read each element as JSON, then group by again:
import pyspark.sql.functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema
df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
.drop('JSON_string') \
.groupBy('Key') \
.agg(f.collect_list('JSON').alias('JSON'))
df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)
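If the goal is literally the original list of dicts without the null padding the merged schema introduces, a hedged follow-up (assuming Spark 3.1+ for functions.transform, and relying on to_json dropping null fields by default; reuses the f alias from above) is to serialize each struct back to JSON text:
# Hedged: re-serialize each struct; to_json omits null fields by default,
# so the merged-schema nulls disappear from the output strings.
df_clean = df_result.withColumn('JSON', f.transform('JSON', lambda s: f.to_json(s)))
df_clean.show(truncate=False)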

PySpark, parse json given deep nested schema

I have a pyspark dataframe where, for each row, there is a column containing a JSON string.
+----------------------------------------------------------------------------+
|data |
+----------------------------------------------------------------------------+
|{"student":{"name":"Bob","surname":"Smith","age":18},"scholarship":true} |
|{"student":{"name":"Adam","surname":"Smith","age":"23"},"scholarship":false}|
+----------------------------------------------------------------------------+
I want to parse these JSON strings so that the result complies with the following schema:
root
|-- scholarship: boolean (nullable = true)
|-- student: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
| |-- surname: string (nullable = true)
So, my solution is the following:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, BooleanType, LongType, StringType

my_schema = StructType([
StructField("scholarship",BooleanType(),True),
StructField(
"student",
StructType([
StructField("age",LongType(),True),
StructField("name",StringType(),True),
StructField("surname",StringType(),True)
]),
True
)
])
parsed_df = my_df.withColumn("data", from_json(col("data"), my_schema))
In this way, parsed_df is the following:
+------------------------+
|data |
+------------------------+
|{true, {18, Bob, Smith}}|
|{false, null} |
+------------------------+
Instead, I would like an output as:
+----------------------------+
|data |
+----------------------------+
|{true, {18, Bob, Smith}} |
|{false, {null, Adam, Smith}}|
+----------------------------+
Is there any option in the from_json method or any alternative solution to reach this result?
I should add that I cannot use Databricks, and also that (unlike the example) in the business use case I don't define the schema; it is passed to me every time. My question is more general: given a Spark schema, is there any way to parse a JSON string column in a dataframe, even when the JSON is deeply nested?
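One hedged workaround, not from the thread: since from_json nulls out the whole struct when a leaf type mismatches (age arrives as the string "23" against LongType), parse with a relaxed all-string copy of the given schema, then cast the result back to the target schema. A sketch, assuming the schema always arrives as a StructType:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

def relax(dt):
    # Hedged helper: turn every atomic leaf into StringType so a type
    # mismatch cannot null out an entire struct during parsing.
    if isinstance(dt, StructType):
        return StructType([StructField(fld.name, relax(fld.dataType), True) for fld in dt.fields])
    if isinstance(dt, ArrayType):
        return ArrayType(relax(dt.elementType), True)
    return StringType()

# Parse leniently, then cast back to the schema we were given; struct-to-struct
# casts work when field counts match and each leaf is castable.
parsed_df = my_df.withColumn("data", from_json(col("data"), relax(my_schema)).cast(my_schema))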

Parse JSON root in a column using Spark-Scala

I'm having problems transforming the root of a JSON into records in a data frame, for an undetermined number of records.
I have a dataset generated from a JSON similar to the following:
val exampleJson = spark.createDataset(
"""
{"ITEM1512":
{"name":"Yin",
"address":{"city":"Columbus",
"state":"Ohio"}
},
"ITEM1518":
{"name":"Yang",
"address":{"city":"Working",
"state":"Marc"}
}
}""" :: Nil)
When I read it with the following instruction:
val itemsExample = spark.read.json(exampleJson)
The resulting data frame and schema are the following:
+-----------------------+-----------------------+
|ITEM1512 |ITEM1518 |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+
root
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
But I want to generate something like this:
+-----------------------+-----------------------+
|Item |Values |
+-----------------------+-----------------------+
|ITEM1512 |[[Columbus, Ohio], Yin]|
|ITEM1518 |[[Working, Marc], Yang]|
+-----------------------+-----------------------+
So, in order to parse this JSON data, I need to read all the columns and add each one as a record in the data frame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a data frame.
I'm trying to replicate the solution found here: How to parse the JSON data using Spark-Scala,
with this code:
val columns:Array[String] = itemsExample.columns
var arrayOfDFs:Array[DataFrame] = Array()
for(col_name <- columns){
val temp = itemsExample.selectExpr("explode("+col_name+") as element")
.select(
lit(col_name).as("Item"),
col("element.E").as("Value"))
arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
But I face a problem: while in the example from the other question the root is an array, in my case the root is a StructType. Therefore the following exception is thrown:
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(ITEM1512)' due to data type mismatch: input to function
explode should be array or map type, not
struct<address:struct<city:string,state:string>,name:string>
You can use the stack function: stack(n, expr1, ..., exprk) separates the expressions into n rows, pairing each literal item name with its struct value.
Example:
itemsExample.selectExpr("""stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)""").
show(false)
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
UPDATE:
Dynamic Stack query:
val stack = itemsExample.columns.map(x => s"'${x}',${x}").mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)
itemsExample.selectExpr(stack).show()
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
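For what it's worth, the same dynamic query can be assembled from PySpark as well; a hedged sketch (assumes none of the column names need backtick quoting):
# Hedged PySpark equivalent of the dynamic stack query above.
cols = itemsExample.columns
pairs = ",".join(f"'{c}',{c}" for c in cols)
itemsExample.selectExpr(f"stack({len(cols)},{pairs}) as (Item,Values)").show(truncate=False)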

parse json data with pyspark

I am using pyspark to read the JSON file below:
{
  "data": {
    "indicatr": {
      "indicatr": {
        "id": "5c9e41e4884db700desdaad8"
      }
    }
  }
}
I wrote the following Python code:
from pyspark.sql import Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql.types import StructType
from pyspark.sql import functions as F
schema = StructType([
    StructField("data", StructType([
        StructField("indicatr", StructType([
            StructField("indicatr", StructType([
                StructField("id", StringType())
            ]))
        ]))
    ]))
])
df = spark.read.json("pathtofile/test.json", multiLine=True)
df.show()
df2 = df.withColumn("json", F.col("data").cast("string"))
df3=df2.select(F.col("json"))
df3.collect()
df4 =df3.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()
I am getting the following result:
+----+
|name|
+----+
|null|
+----+
Does anyone know how to solve this?
When you select the column labeled json, you’re selecting a column that is entirely of the StringType (logically, because you’re casting it to that type). Even though it looks like a valid JSON object, it’s really just a string. df2.data does not have that issue though:
In [2]: df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: double (nullable = true)
|-- json: string (nullable = true)
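If you do need the JSON text of a struct column, to_json (rather than a string cast) keeps it parseable; a minimal hedged sketch reusing the imports from the question:
# cast("string") yields Spark's struct rendering (e.g. "{5c9e...}"),
# which from_json cannot parse; to_json emits real JSON text.
df2 = df.withColumn("json", F.to_json(F.col("data")))
df4 = df2.select(F.from_json(F.col("json"), schema).alias("name"))
df4.show()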
By the way, you can immediately pass in the schema on read:
In [3]: df = spark.read.json("data.json", multiLine=True, schema=schema)
...: df.printSchema()
...:
...:
root
|-- data: struct (nullable = true)
| |-- indicatr: struct (nullable = true)
| | |-- indicatr: struct (nullable = true)
| | | |-- id: string (nullable = true)
You can dig down in the columns to reach the nested values:
In [4]: df.select(df.data.indicatr.indicatr.id).show()
+-------------------------+
|data.indicatr.indicatr.id|
+-------------------------+
| 5c9e41e4884db700desdaad8|
+-------------------------+

Reading Nested JSON via Spark SQL - [AnalysisException] cannot resolve Column

I have JSON data like this:
{
"parent":[
{
"prop1":1.0,
"prop2":"C",
"children":[
{
"child_prop1":[
"3026"
]
}
]
}
]
}
After reading the data with Spark I get the following schema:
val df = spark.read.json("test.json")
df.printSchema
root
|-- parent: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- children: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- child_prop1: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | |-- prop1: double (nullable = true)
| | |-- prop2: string (nullable = true)
Now, I want to select child_prop1 from df, but when I try to select it I get an org.apache.spark.sql.AnalysisException, something like this:
df.select("parent.children.child_prop1")
org.apache.spark.sql.AnalysisException: cannot resolve '`parent`.`children`['child_prop1']' due to data type mismatch: argument 2 requires integral type, however, ''child_prop1'' is of string type.;;
'Project [parent#60.children[child_prop1] AS child_prop1#63]
+- Relation[parent#60] json
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:307)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1139)
... 48 elided
However, when I select only children from df, it works fine.
df.select("parent.children").show(false)
+------------------------------------+
|children |
+------------------------------------+
|[WrappedArray([WrappedArray(3026)])]|
+------------------------------------+
I cannot understand why it is throwing an exception even though the column is present in the dataframe.
Any help is appreciated!
Your JSON is valid, and I think you don't need to change your input data.
Use explode to get at the data, like this:
import org.apache.spark.sql.functions.explode
val data = spark.read.json("src/test/java/data.json")
val child = data.select(explode(data("parent.children"))).toDF("children")
child.select(explode(child("children.child_prop1"))).toDF("child_prop1").show()
If you can change the input data, you can follow @ramesh's suggestion below.
If you look at the schema, child_prop1 is inside a nested array within the root array parent. So we need to be able to define the position of child_prop1, and that's what the error is suggesting you define.
Converting your JSON format should do the trick.
Changing the JSON to
{"parent":{"prop1":1.0,"prop2":"C","children":{"child_prop1":["3026"]}}}
and applying the
df.select("parent.children.child_prop1").show(false)
will give output as
+-----------+
|child_prop1|
+-----------+
|[3026] |
+-----------+
And changing the JSON to
{"parent":{"prop1":1.0,"prop2":"C","children":[{"child_prop1":["3026"]}]}}
and applying the
df.select("parent.children.child_prop1").show(false)
will result in
+--------------------+
|child_prop1 |
+--------------------+
|[WrappedArray(3026)]|
+--------------------+
I hope the answer helps.
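For completeness, the integral position the error message asks for can also be supplied directly, without restructuring the input. A hedged PySpark sketch (Scala is analogous); it reads only the first element of each array, whereas the explode approach in the first answer covers them all:
from pyspark.sql.functions import col

# Hedged: getItem supplies the integral position the AnalysisException
# was asking for; this picks out the first element at each level.
df.select(col("parent").getItem(0).getField("children")
          .getItem(0).getField("child_prop1")
          .alias("child_prop1")).show(truncate=False)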