Parse JSON root in a column using Spark-Scala

I have problems transforming the root keys of a JSON document into records in a DataFrame, for an undetermined number of records.
I have a DataFrame generated from JSON similar to the following:
import spark.implicits._ // provides the String encoder needed by createDataset

val exampleJson = spark.createDataset(
  """
  {"ITEM1512":
     {"name":"Yin",
      "address":{"city":"Columbus",
                 "state":"Ohio"}
     },
   "ITEM1518":
     {"name":"Yang",
      "address":{"city":"Working",
                 "state":"Marc"}
     }
  }""" :: Nil)
When I read it with the following instruction:
val itemsExample = spark.read.json(exampleJson)
the generated DataFrame and schema are the following:
+-----------------------+-----------------------+
|ITEM1512 |ITEM1518 |
+-----------------------+-----------------------+
|[[Columbus, Ohio], Yin]|[[Working, Marc], Yang]|
+-----------------------+-----------------------+
root
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
But I want to generate something like this:
+-----------------------+-----------------------+
|Item |Values |
+-----------------------+-----------------------+
|ITEM1512 |[[Columbus, Ohio], Yin]|
|ITEM1518 |[[Working, Marc], Yang]|
+-----------------------+-----------------------+
So, in order to parse this JSON data I need to read all the columns and add each one as a record in the DataFrame, because there are more than the two items I wrote as an example. In fact, there are millions of items that I'd like to add to a DataFrame.
I'm trying to replicate the solution found here: How to parse the JSON data using Spark-Scala,
with this code:
val columns: Array[String] = itemsExample.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = itemsExample.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("Item"),
      col("element.E").as("Value"))
  arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
But I run into a problem: while in the example read in the other question the root is an array, in my case the root is a StructType. Therefore the following exception is thrown:
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(ITEM1512)' due to data type mismatch: input to function
explode should be array or map type, not
struct&lt;address:struct&lt;city:string,state:string&gt;,name:string&gt;

You can use the stack function.
Example:
itemsExample.selectExpr("""stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)""")
  .show(false)
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
UPDATE:
Dynamic Stack query:
val stack = itemsExample.columns
  .map(x => s"'${x}',${x}")
  .mkString(s"stack(${itemsExample.columns.size},", ",", ") as (Item,Values)")
//stack(2,'ITEM1512',ITEM1512,'ITEM1518',ITEM1518) as (Item,Values)
itemsExample.selectExpr(stack).show(false)
//+--------+-----------------------+
//|Item |Values |
//+--------+-----------------------+
//|ITEM1512|[[Columbus, Ohio], Yin]|
//|ITEM1518|[[Working, Marc], Yang]|
//+--------+-----------------------+
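As a side note, the original explode approach can also be made to work by first repacking the top-level struct columns into a single map column, since explode does accept map types. A minimal sketch; it assumes all the item structs share one schema (as they do here), because map values must have a single type:
import org.apache.spark.sql.functions.{col, explode, lit, map}
// Pack every top-level struct column into one map<string,struct>,
// then explode the map into (Item, Values) pairs.
val kvCols = itemsExample.columns.flatMap(c => Seq(lit(c), col(c)))
itemsExample
  .select(explode(map(kvCols: _*)).as(Seq("Item", "Values")))
  .show(false)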

Related

Transform a list of JSON string to a list of dict in Pyspark

I'm struggling to transform a list of JSON strings into a list of dicts in PySpark without using a udf or an rdd.
I have this kind of dataframe:
Key    | JSON_string
123456 | ["""{"Zipcode":704,"ZipCodeType":"STA"}""","""{"City":"PARC","State":"PR"}"""]
789123 | ["""{"Zipcode":7,"ZipCodeType":"AZA"}""","""{"City":"PRE","State":"XY"}"""]
How can I transform col(JSON_string) using built-in PySpark functions into [{"Zipcode":704,"ZipCodeType":"STA"},{"City":"PARC","State":"PR"}]?
I tried many functions such as create_map, collect_list, from_json, to_json, explode, json.loads, json.dump, but there was no way to get the expected result.
Thank you for your help.
Explode your JSON_string column, read it as JSON, then group by again.
from pyspark.sql import functions as f

df = df.withColumn('JSON_string', f.explode('JSON_string'))
# infer the JSON schema from the exploded strings
schema = spark.read.json(df.rdd.map(lambda r: r.JSON_string)).schema

df_result = df.withColumn('JSON', f.from_json('JSON_string', schema)) \
    .drop('JSON_string') \
    .groupBy('Key') \
    .agg(f.collect_list('JSON').alias('JSON'))
df_result.show(truncate=False)
df_result.printSchema()
+------+------------------------------------------------+
|Key |JSON |
+------+------------------------------------------------+
|123456|[{null, null, STA, 704}, {PARC, PR, null, null}]|
|789123|[{null, null, AZA, 7}, {PRE, XY, null, null}] |
+------+------------------------------------------------+
root
|-- Key: long (nullable = true)
|-- JSON: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- City: string (nullable = true)
| | |-- State: string (nullable = true)
| | |-- ZipCodeType: string (nullable = true)
| | |-- Zipcode: long (nullable = true)

Extract array from list of json strings using Spark

I have a column in my data frame which contains a list of JSONs, but the type is String. I need to run explode on this column, so first I need to convert it into a list. I couldn't find many references to this use case.
Sample data:
columnName: "[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}, {...}]"
The above is how the data looks; the fields are not fixed (index 0 might have a JSON with some fields while index 1 has some other fields). In the list there can be more nested JSONs or some extra fields. I am currently using this -
"""explode(split(regexp_replace(regexp_replace(colName, '(\\\},)','}},'), '(\\\[|\\\])',''), "},")) as colName""" where I am just replacing "}," with "}}," then removing "[]" and then calling split on "},", but this approach doesn't work since there are nested JSONs.
How can I extract the array from the string?
You can try this way:
// Initial DataFrame
df.show(false)
+----------------------------------------------------------------------+
|columnName |
+----------------------------------------------------------------------+
|[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]|
+----------------------------------------------------------------------+
df.printSchema()
root
|-- columnName: string (nullable = true)
// toArray is a user-defined function that parses an array of JSON objects
// which is present as a string, returning one string per object
import org.apache.spark.sql.functions.{col, explode, from_json, get_json_object, udf}
import org.json.JSONArray

val toArray = udf { (data: String) =>
  val jsonArray = new JSONArray(data)
  (0 until jsonArray.length).map(i => jsonArray.getJSONObject(i).toString).toArray
}
// Using the udf and exploding the resultant array
val df1 = df.withColumn("columnName",explode(toArray(col("columnName"))))
df1.show(false)
+-----------------------------------------------------+
|columnName |
+-----------------------------------------------------+
|{"other":7,"name":"a","info":{"grade":"b","age":"1"}}|
|{"random":"x"} |
+-----------------------------------------------------+
df1.printSchema()
root
|-- columnName: string (nullable = true)
// Parsing the json string by obtaining the schema dynamically
val schema = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).schema
val df2 = df1.withColumn("columnName",from_json(col("columnName"),schema))
df2.show(false)
+---------------+
|columnName |
+---------------+
|[[1, b], a, 7,]|
|[,,, x] |
+---------------+
df2.printSchema()
root
|-- columnName: struct (nullable = true)
| |-- info: struct (nullable = true)
| | |-- age: string (nullable = true)
| | |-- grade: string (nullable = true)
| |-- name: string (nullable = true)
| |-- other: long (nullable = true)
| |-- random: string (nullable = true)
// Extracting all the fields from the json
df2.select(col("columnName.*")).show(false)
+------+----+-----+------+
|info |name|other|random|
+------+----+-----+------+
|[1, b]|a |7 |null |
|null |null|null |x |
+------+----+-----+------+
Edit:
You can try this way if you can use the get_json_object function:
// Get the list of columns dynamically
val columns = spark.read.json(df1.select("columnName").rdd.map(x => x(0).toString)).columns

// define an empty array of Column type and use get_json_object to extract the columns
import org.apache.spark.sql.Column
var extract_columns: Array[Column] = Array()
columns.foreach { column =>
  extract_columns :+= get_json_object(col("columnName"), "$." + column).as(column)
}
df1.select(extract_columns: _*).show(false)
+-----------------------+----+-----+------+
|info |name|other|random|
+-----------------------+----+-----+------+
|{"grade":"b","age":"1"}|a |7 |null |
|null |null|null |x |
+-----------------------+----+-----+------+
Please note that the info column is not of struct type here. You may have to follow a similar approach to extract the columns of the nested JSON:
val testString = """[{"name":"a","info":{"age":"1","grade":"b"},"other":7},{"random":"x"}]"""
val ds = Seq(testString).toDS()
spark.read.json(ds)
  .select("info.age", "info.grade", "name", "other", "random")
  .show(10, false)
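If the element schema is known up front (or inferred once, as above), the UDF can be avoided entirely: on Spark 2.4+, from_json can parse the raw string directly as an array of structs, which explode accepts. A sketch reusing the schema variable inferred earlier:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.ArrayType
// Parse the whole string column as array<struct> in one pass,
// then explode the resulting array into one row per JSON object.
val parsed = df.withColumn("columnName",
  explode(from_json(col("columnName"), ArrayType(schema))))
parsed.select(col("columnName.*")).show(false)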

Merge Dataframes With Differents Schemas - Scala Spark

I'm working on transforming a JSON into a DataFrame. In the first step I create an array of DataFrames and after that I make a union. But I have a problem doing the union over JSON with different schemas.
I can do it if the JSON has the same schema, as you can see in this other question: Parse JSON root in a column using Spark-Scala
I'm working with the following data:
val exampleJsonDifferentSchema = spark.createDataset(
  """
  {"ITEM1512":
     {"name":"Yin",
      "address":{"city":"Columbus",
                 "state":"Ohio"},
      "age":28 },
   "ITEM1518":
     {"name":"Yang",
      "address":{"city":"Working",
                 "state":"Marc"}
     },
   "ITEM1458":
     {"name":"Yossup",
      "address":{"city":"Macoss",
                 "state":"Microsoft"},
      "age":28
     }
  }""" :: Nil)
As you can see, the difference is that one of the items doesn't have age.
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
My current solution is the following code, where I make an array of DataFrames:
val columns: Array[String] = itemsExampleDiff.columns
var arrayOfExampleDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = itemsExampleDiff.select(lit(col_name).as("Item"), col(col_name).as("Value"))
  arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}

val jsonDF = arrayOfExampleDFs.reduce(_ union _)
But since the JSON has different schemas, the union fails when I reduce, because the DataFrames need to have the same schema. In fact, I get the following error:
org.apache.spark.sql.AnalysisException: Union can only be performed on
tables with the compatible column types.
I'm trying to do something similar to what I found in this question: How to perform union on two DataFrames with different amounts of columns in spark?
Specifically this part:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)
    case x => lit(null).as(x)
  }
}
But I can't build the column sets, because I need to capture the columns dynamically, both the full set and the individual ones. I can only do something like this:
for (i <- 0 until arrayOfExampleDFs.length - 1) {
  val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
  val cols2 = arrayOfExampleDFs(i + 1).select("Value").columns.toSet
  val total = cols1 ++ cols2
  arrayOfExampleDFs(i).select("Value").printSchema()
  print(total)
}
So, what would a function that performs this union dynamically look like?
Update: expected output
In this case, this DataFrame and schema:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
Here is one possible solution which creates a common schema for all the dataframes by adding the age column when it is not found:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
....
for (col_name <- columns) {
  val currentDf = itemsExampleDiff.select(col(col_name))

  // try to identify if the age field is present
  val hasAge = currentDf.schema.fields(0)
    .dataType
    .asInstanceOf[StructType]
    .fields
    .contains(StructField("age", LongType, true))

  val valueCol = hasAge match {
    // if not, construct a new value column
    case false => struct(
      col(s"${col_name}.address"),
      lit(null).cast("bigint").as("age"),
      col(s"${col_name}.name")
    )
    case true => col(col_name)
  }

  arrayOfExampleDFs = arrayOfExampleDFs :+
    currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
Analysis: probably the most demanding part is finding out whether age is present or not. For the lookup we use the df.schema.fields property, which allows us to dig into the internal schema of each column.
When age is not found, we regenerate the column using a struct:
struct(
  col(s"${col_name}.address"),
  lit(null).cast("bigint").as("age"),
  col(s"${col_name}.name")
)
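As a side note: on Spark 3.1+ much of this can be delegated to unionByName with allowMissingColumns = true, which null-fills missing columns, including nested struct fields. A sketch, assuming that Spark version is available:
// Spark 3.1+: unionByName can null-fill missing (even nested) fields,
// so the manual age check above becomes unnecessary.
val unioned = arrayOfExampleDFs.reduce(_.unionByName(_, allowMissingColumns = true))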

Convert spark Dataframe with schema to dataframe of json String

I have a Dataframe like this:
+--+--------+--------+----+-------------+------------------------------+
|id|name |lastname|age |timestamp |creditcards |
+--+--------+--------+----+-------------+------------------------------+
|1 |michel |blanc |35 |1496756626921|[[hr6,3569823], [ee3,1547869]]|
|2 |peter |barns |25 |1496756626551|[[ye8,4569872], [qe5,3485762]]|
+--+--------+--------+----+-------------+------------------------------+
where the schema of my df is like below:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- lastname: string (nullable = true)
|-- age: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- creditcards: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- number: string (nullable = true)
I would like to convert each line to a JSON string matching my schema, so this dataframe would have one string column containing the JSON.
The first line should be like this:
{
"id":"1",
"name":"michel",
"lastname":"blanc",
"age":"35",
"timestamp":"1496756626921",
"creditcards":[
{
"id":"hr6",
"number":"3569823"
},
{
"id":"ee3",
"number":"1547869"
}
]
}
and the secone line of the dataframe should be like this:
{
"id":"2",
"name":"peter",
"lastname":"barns",
"age":"25",
"timestamp":"1496756626551",
"creditcards":[
{
"id":"ye8",
"number":"4569872"
},
{
"id":"qe5",
"number":"3485762"
}
]
}
My goal is not to write the dataframe to a JSON file. My goal is to convert df1 to a second df2 in order to push each JSON line of df2 to a Kafka topic.
I have this code to create the dataframe:
val line1 = """{"id":"1","name":"michel","lastname":"blanc","age":"35","timestamp":"1496756626921","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}]}"""
val line2 = """{"id":"2","name":"peter","lastname":"barns","age":"25","timestamp":"1496756626551","creditcards":[{"id":"ye8","number":"4569872"}, {"id":"qe5","number":"3485762"}]}"""
val rdd = sc.parallelize(Seq(line1, line2))
val df = sqlContext.read.json(rdd)
df.show(false)
df.printSchema
Do you have any idea?
If all you need is a single-column DataFrame/Dataset in which each value represents a row of the original DataFrame as JSON, you can simply apply toJSON to your DataFrame, as in the following:
df.show
// +---+------------------------------+---+--------+------+-------------+
// |age|creditcards |id |lastname|name |timestamp |
// +---+------------------------------+---+--------+------+-------------+
// |35 |[[hr6,3569823], [ee3,1547869]]|1 |blanc |michel|1496756626921|
// |25 |[[ye8,4569872], [qe5,3485762]]|2 |barns |peter |1496756626551|
// +---+------------------------------+---+--------+------+-------------+
val dsJson = df.toJSON
// dsJson: org.apache.spark.sql.Dataset[String] = [value: string]
dsJson.show
// +--------------------------------------------------------------------------+
// |value |
// +--------------------------------------------------------------------------+
// |{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3",...|
// |{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5",...|
// +--------------------------------------------------------------------------+
[UPDATE]
To add name as an additional column, you can extract it from the JSON column using from_json:
val result = dsJson.withColumn("name", from_json($"value", df.schema)("name"))
result.show
// +--------------------+------+
// | value| name|
// +--------------------+------+
// |{"age":"35","cred...|michel|
// |{"age":"25","cred...| peter|
// +--------------------+------+
For that, you can directly convert your dataframe to a Dataset of JSON strings using:
val jsonDataset: Dataset[String] = df.toJSON
You can convert it into a dataframe using:
val jsonDF: DataFrame = jsonDataset.toDF
Here the JSON fields will be alphabetically ordered, so the output of
jsonDF.show(false)
will be:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"age":"35","creditcards":[{"id":"hr6","number":"3569823"},{"id":"ee3","number":"1547869"}],"id":"1","lastname":"blanc","name":"michel","timestamp":"1496756626921"}|
|{"age":"25","creditcards":[{"id":"ye8","number":"4569872"},{"id":"qe5","number":"3485762"}],"id":"2","lastname":"barns","name":"peter","timestamp":"1496756626551"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
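Since the stated goal is pushing each JSON line to a Kafka topic, the Dataset[String] produced by toJSON can be handed to the Kafka batch sink from the spark-sql-kafka-0-10 package, which expects a value column. A sketch; the broker address and topic name are placeholders:
// The Kafka sink expects a `value` column (string or binary).
df.toJSON.toDF("value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "my-topic")
  .save()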

How to parse nested JSON objects in spark sql?

I have a schema as shown below. How can I parse the nested objects?
root
|-- apps: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- appName: string (nullable = true)
| | |-- appPackage: string (nullable = true)
| | |-- Ratings: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- date: string (nullable = true)
| | | | |-- rating: long (nullable = true)
|-- id: string (nullable = true)
Assuming you read in a JSON file and print the schema you are showing us, like this:
DataFrame df = sqlContext.read().json("/path/to/file").toDF();
df.registerTempTable("df");
df.printSchema();
Then you can select nested objects inside a struct type like so...
DataFrame apps = df.select("apps");
apps.registerTempTable("apps");
apps.printSchema();
apps.show();
DataFrame appName = apps.select("apps.appName");
appName.registerTempTable("appName");
appName.printSchema();
appName.show();
Try this:
val nameAndAddress = sqlContext.sql("""
  SELECT name, address.city, address.state
  FROM people
""")
nameAndAddress.collect.foreach(println)
Source:
https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
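Adapting that SQL pattern to the schema in the question is a matter of flattening both array levels with LATERAL VIEW explode. A sketch; it assumes the DataFrame was registered as a temp table named df:
val ratings = sqlContext.sql("""
  SELECT id, app.appName, r.date, r.rating
  FROM df
  LATERAL VIEW explode(apps) exploded_apps AS app
  LATERAL VIEW explode(app.Ratings) exploded_ratings AS r
""")
ratings.show(false)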
Have you tried doing it straight from the SQL query, like:
Select apps.element.Ratings from yourTableName
This will probably return an array and you can more easily access the elements inside.
Also, I use this online Json viewer when I have to deal with large JSON structures and the schema is too complex:
http://jsonviewer.stack.hu/
I am using pyspark, but the logic should be similar.
I found this way of parsing my nested json useful:
df.select(df.apps.appName.alias("apps_Name"), \
          df.apps.appPackage.alias("apps_Package"), \
          df.apps.Ratings.date.alias("apps_Ratings_date")) \
  .show()
The code could obviously be shortened with an f-string.
val df = spark.read.format("json").load("/path/to/file")
df.createOrReplaceTempView("df")
spark.sql("select apps.element.Ratings from df where apps.element.appName like '%app_name%'").show()
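For completeness, the same flattening in the Scala DataFrame API, exploding the arrays level by level. A minimal sketch against the schema above; it assumes the DataFrame is named df:
import org.apache.spark.sql.functions.{col, explode}
// One row per app, then one row per rating inside each app.
val flat = df
  .select(col("id"), explode(col("apps")).as("app"))
  .select(col("id"), col("app.appName"), col("app.appPackage"),
    explode(col("app.Ratings")).as("r"))
  .select(col("id"), col("appName"), col("appPackage"),
    col("r.date"), col("r.rating"))
flat.show(false)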