Pyspark - Json Column - Concat Key and Value as a String

I have a dataframe with two string columns, and another column with an array structure:
|-- music: string (nullable = true)
|-- artist: string (nullable = true)
|-- details: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- Genre: string (nullable = true)
| | |-- Origin: string (nullable = true)
Just to help you, this is a sample data:
music | artist | details
Music_1 | Artist_1 | [{"Genre": "Rock", "Origin": "USA"}]
Music_2 | Artist_3 | [{"Genre": "", "Origin": "USA"}]
Music_3 | Artist_1 | [{"Genre": "Rock", "Origin": "UK"}]
I am trying what I assume is a simple operation: just concatenating each key and value with ' - '. Basically, what I am trying to do is get the following structure:
music | artist | details
Music_1 | Artist_1 | Genre - Rock, Origin - USA
Music_2 | Artist_3 | Genre - , Origin - USA
Music_3 | Artist_1 | Genre - Rock, Origin - UK
For that I already tried an approach that first separates the keys and values into different columns, so that I can then concatenate the items:
display(df.select(col("music"), col("artist"), posexplode("details").alias("key","value")))
But I got the following result:
music | artist | key | value
Music_1 | Artist_1 | 0 | [{"Genre": "Rock", "Origin": "USA"}]
Music_2 | Artist_3 | 0 | [{"Genre": "", "Origin": "USA"}]
Music_3 | Artist_1 | 0 | [{"Genre": "Rock", "Origin": "UK"}]
This is probably not the best approach; can anyone help me?
Thanks!

You can use the built-in higher-order function transform() to get the desired result (available from Spark 2.4).
from pyspark.sql.functions import expr, explode_outer

df = ...  # input data
df.withColumn('details', expr("""transform(details, c -> concat_ws(', ',
        concat_ws(' - ', 'Genre', c['Genre']),
        concat_ws(' - ', 'Origin', c['Origin'])))""")) \
    .withColumn('details', explode_outer('details')) \
    .show(truncate=False)
+--------+--------------------------+-------+
|artist |details |music |
+--------+--------------------------+-------+
|Artist_1|Genre - Rock, Origin - USA|Music_1|
|Artist_3|Genre - , Origin - USA |Music_2|
|Artist_1|Genre - Rock, Origin - UK |Music_3|
+--------+--------------------------+-------+
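If you are on Spark 3.1+, the same logic can also be written with the Python-native higher-order function API instead of an expr string. A minimal sketch, assuming the same input df:

import pyspark.sql.functions as F

result = df.withColumn(
    "details",
    F.explode_outer(
        F.transform(
            "details",
            lambda c: F.concat_ws(
                ", ",
                F.concat_ws(" - ", F.lit("Genre"), c["Genre"]),
                F.concat_ws(" - ", F.lit("Origin"), c["Origin"]),
            ),
        )
    ),
)
result.show(truncate=False)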

Related

parse json data with spark 2.3

I have the following JSON data:
{
"3200": {
"id": "3200",
"value": [
"cat",
"dog"
]
},
"2000": {
"id": "2000",
"value": [
"bird"
]
},
"2500": {
"id": "2500",
"value": [
"kitty"
]
},
"3650": {
"id": "3650",
"value": [
"horse"
]
}
}
The schema of this data, printed with the printSchema utility after we load the data with Spark, is as follows:
root
|-- 3200: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2000: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 2500: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- 3650: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: array (nullable = true)
| | |-- element: string (containsNull = true)
and I want to get the following dataframe
id value
3200 cat
2000 bird
2500 kitty
3200 dog
3650 horse
How can I do the parsing to get this expected output?
Using spark-sql
Dataframe step (same as in Mohana's answer)
val df = spark.read.json(Seq(jsonData).toDS())
Build a temp view
df.createOrReplaceTempView("df")
Query and result:
val cols_k = df.columns.map( x => s"`${x}`.id" ).mkString(",")
val cols_v = df.columns.map( x => s"`${x}`.value" ).mkString(",")
spark.sql(s"""
with t1 as ( select map_from_arrays(array(${cols_k}), array(${cols_v})) s from df ),
t2 as ( select explode(s) as (key, value) from t1 )
select key, explode(value) value from t2
""").show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
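For completeness, the same map_from_arrays idea can be expressed through the DataFrame API as well. A PySpark sketch, assuming df was loaded from the same jsonData:

import pyspark.sql.functions as F

# Backticks are needed because the top-level column names are numeric.
keys = [F.col(f"`{c}`.id") for c in df.columns]
vals = [F.col(f"`{c}`.value") for c in df.columns]

(df.select(F.map_from_arrays(F.array(*keys), F.array(*vals)).alias("s"))
   .select(F.explode("s").alias("key", "value"))
   .select("key", F.explode("value").alias("value"))
   .show())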
You can use the stack() function to transpose the dataframe, then extract the key field and explode the value field using the explode_outer function.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, explode_outer}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val jsonData = """{
| "3200": {
| "id": "3200",
| "value": [
| "cat",
| "dog"
| ]
| },
| "2000": {
| "id": "2000",
| "value": [
| "bird"
| ]
| },
| "2500": {
| "id": "2500",
| "value": [
| "kitty"
| ]
| },
| "3650": {
| "id": "3650",
| "value": [
| "horse"
| ]
| }
|}
|""".stripMargin
val df = spark.read.json(Seq(jsonData).toDS())
df.selectExpr("stack (4, *) key")
.select(expr("key.id").as("key"),
explode_outer(expr("key.value")).as("value"))
.show(false)
+----+-----+
|key |value|
+----+-----+
|2000|bird |
|2500|kitty|
|3200|cat |
|3200|dog |
|3650|horse|
+----+-----+
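As background, stack(n, expr1, ..., exprk) lays k expressions out over n rows, which is what turns the four top-level struct columns into four rows above. A tiny self-contained PySpark illustration of the function itself:

df = spark.createDataFrame([(1, 2)], ["a", "b"])
# stack(2, 'a', a, 'b', b) emits two rows: ('a', 1) and ('b', 2)
df.selectExpr("stack(2, 'a', a, 'b', b) as (label, value)").show()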

Spark Scala - Split Array of Structs into Dataframe Columns

I have a nested source JSON file that contains an array of structs. The number of structs varies greatly from row to row, and I would like to use Spark (Scala) to dynamically create new dataframe columns from the key/value pairs of the structs, where the key is the column name and the value is the column value.
Example Minified json record
{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}
dataframe schema
scala> val df = spark.read.json("file:///tmp/nested_test.json")
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
What's been done so far
df.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")).
show(truncate=false)
+----+----+----+----+----------------------------------------------------------------------------+
|key3|key4|key6|key7|values |
+----+----+----+----+----------------------------------------------------------------------------+
|AK |EU |001 |N |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
+----+----+----+----+----------------------------------------------------------------------------+
There is an array of 3 structs here, but the structs need to be split into 3 separate columns dynamically (the number can vary greatly), and I am not sure how to do it.
Sample Desired output
Notice that 3 new columns were produced, one for each of the array elements within the values array.
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK  |EU  |001 |N   |9.876        |1.2345       |8.675309     |
+----+----+----+----+-------------+-------------+-------------+
Reference
I believe that the desired solution is something similar to what was discussed in this SO post but with 2 main differences:
The number of columns is hardcoded to 3 in the SO post but in my circumstance, the number of array elements is unknown
The column names need to be driven by the name field and the column values by the value field.
...
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
You could do it this way:
val sac = new SparkContext("local[*]", "first Program")
val sqlc = new SQLContext(sac)
import sqlc.implicits._
import org.apache.spark.sql.functions._
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df1 = sqlc.read.json(Seq(json).toDS())
val df2 = df1.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")
)
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
val finalDFColumns = df2
  .select(explode($"values").as("values"))
  .select("values.*")
  .select("name")
  .distinct
  .map(_.getAs[String](0))
  .orderBy($"value".asc)
  .collect
  .foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
  .columns
val finalDF = df2.select($"*" +: (0 until numColsVal).map(i =>
  $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
The resulting final output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
Hope I got your question right!
----------- EDIT with Explanation -----------
This block gets the number of columns to be created for the array structure.
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
finalDFColumns holds the column names of a DataFrame built with all the expected output columns, initialized with null values.
The block below returns the distinct column names that need to be created from the array structure.
df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect
The block below combines those new columns with the other columns of df2, initialized with null values.
foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
Combining these two blocks, if you print the output you will get:
+----+----+----+----+------+-------------+-------------+-------------+
|key3|key4|key6|key7|values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+------+-------------+-------------+-------------+
+----+----+----+----+------+-------------+-------------+-------------+
Now we have the structure ready. We need the values for the corresponding columns. The block below gets us the values:
df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
This results in the following:
+----+----+----+----+--------------------+---------------+---------------+---------------+
|key3|key4|key6|key7| values|values[0][name]|values[1][name]|values[2][name]|
+----+----+----+----+--------------------+---------------+---------------+---------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+---------------+---------------+---------------+
Now we need to rename the columns as we did in the first block above. So we will use the zip function to pair the columns and then the foldLeft method to rename the output columns, as below:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
This results in the below structure:
+----+----+----+----+--------------------+-------------+-------------+-------------+
|key3|key4|key6|key7| values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+--------------------+-------------+-------------+-------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+-------------+-------------+-------------+
We are almost there. We now just need to remove the unwanted values column like this:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
This gives the expected output:
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
I'm not sure if I was able to explain it clearly, but if you break the statements above apart and print the intermediate results, you will see how we arrive at the output. You can find explanations with examples for the different functions used in this logic on the internet.
I found that an approach using explode and pivot performed much better and was easier to understand:
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df = spark.read.json(Seq(json).toDS())
// schema
df.printSchema
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
// create final df
val finalDf = df.
select(
$"key1.key2.key3".as("key3"),
$"key1.key2.key4".as("key4"),
$"key1.key2.key5.key6".as("key6"),
$"key1.key2.key5.key7".as("key7"),
explode($"key1.key2.key5.values").as("values")
).
groupBy(
$"key3", $"key4", $"key6", $"key7"
).
pivot("values.name").
agg(min("values.value"))
// result
finalDf.show
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
| AK| EU| 001| N| 9.876| 1.2345| 8.675309|
+----+----+----+----+-------------+-------------+-------------+
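One caveat worth adding (not part of the original answer): calling pivot without an explicit value list triggers an extra job to collect the distinct names first. If the names are known up front, passing them avoids that pass. A PySpark sketch with assumed column names:

import pyspark.sql.functions as F

# Hypothetical: the pivot values are known (or collected once and reused).
names = ["valuesColumn1", "valuesColumn2", "valuesColumn3"]

finalDf = (
    df.select(
        F.col("key1.key2.key3").alias("key3"),
        F.col("key1.key2.key4").alias("key4"),
        F.col("key1.key2.key5.key6").alias("key6"),
        F.col("key1.key2.key5.key7").alias("key7"),
        F.explode("key1.key2.key5.values").alias("v"),
    )
    .groupBy("key3", "key4", "key6", "key7")
    .pivot("v.name", names)  # explicit values: no discovery job
    .agg(F.min("v.value"))
)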

Unable to fetch Json Column using spark Dataframe: org.apache.spark.sql.AnalysisException: cannot resolve 'explode'

Can someone help me with this scenario? I am reading a JSON file using Spark/Scala and then trying to access a column, but while accessing the column I am getting the error message below.
org.apache.spark.sql.AnalysisException: cannot resolve
'explode(`b2b_bill_products_prod_details`.`amt`)'
due to data type mismatch: input to function explode should be
array or map type, not DoubleType;;
Please see the Json Schema and my code below.
root
|-- b2b: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- add1: string (nullable = true)
| | |-- bill: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- amt: double (nullable = true)
| | | | |-- products: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- prod_details: struct (nullable = true)
| | | | | | | |-- amt: double (nullable = true)
I want to access the amt field (the last line in the JSON schema). I am writing the Spark/Scala code below:
df.withColumn("b2b_bill",explode($"b2b.bill"))
.withColumn("b2b_bill_products",explode($"b2b_bill.products"))
.withColumn("b2b_bill_products_prod_details", explode($"b2b_bill_products.prod_details"))
.withColumn("b2b_bill_products_prod_details_amt",explode($"b2b_bill_products_prod_details.amt"))
Your 4th explode is applied to the amt: double column, whereas the explode function expects an array or map input type. That's the error reported.
Edit
You can access the innermost amt field with the expression given below:
df.withColumn("b2b_bill",explode($"b2b.bill"))
.withColumn("b2b_bill_products",explode($"b2b_bill.products"))
.withColumn("b2b_bill_products_prod_details_amt", $"b2b_bill_products.element.prod_details.amt")

Convert column with json string to column with dictionary in pyspark

I have a column with following structure in my dataframe.
+--------------------+
| data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows
The data inside the column is a JSON string. I want to convert the column to some other type (map, struct, ...). How do I do this with a udf function? I have created a function like this but can't figure out what the return type should be. I tried StructType and MapType, which threw errors. This is my code.
import json
import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StructType

udf_getDict = F.udf(lambda x: json.loads(x), StructType)
subset.select(udf_getDict(F.col('data'))).printSchema()
You can use an approach with spark.read.json and df.rdd.map such as this:
json_string = """
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
}
}
"""
df2 = spark.createDataFrame(
[
(1, json_string),
],
['id', 'txt']
)
df2.dtypes
[('id', 'bigint'), ('txt', 'string')]
new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
new_df.printSchema()
root
|-- glossary: struct (nullable = true)
| |-- GlossDiv: struct (nullable = true)
| | |-- GlossList: struct (nullable = true)
| | | |-- GlossEntry: struct (nullable = true)
| | | | |-- Abbrev: string (nullable = true)
| | | | |-- Acronym: string (nullable = true)
| | | | |-- GlossDef: struct (nullable = true)
| | | | | |-- GlossSeeAlso: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- para: string (nullable = true)
| | | | |-- GlossSee: string (nullable = true)
| | | | |-- GlossTerm: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- SortAs: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- title: string (nullable = true)
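If you would rather avoid the round-trip through the RDD, a common alternative (a sketch, not part of the original answer) is from_json with a schema derived once via schema_of_json (available from Spark 2.4):

import pyspark.sql.functions as F

# Derive a DDL schema string from one sample record, then parse in place.
ddl = spark.range(1).select(F.schema_of_json(F.lit(json_string))).head()[0]
parsed = df2.withColumn("parsed", F.from_json("txt", ddl))
parsed.select("parsed.glossary.title").show()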

Traversing through the Json object

I have a json file which has the following data:
{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": [
"GML",
"XML"
]
},
"GlossSee": "markup"
}
}
}
}
}
I need to read this file in PySpark and traverse through all the elements in the JSON. I need to recognize all the struct, array, and array-of-struct columns and create a separate Hive table for each struct and array column.
For Example:
Glossary will be one table with "title" as the column
GlossEntry will be another table with columns "ID", "SortAs", "GlossTerm", "acronym", "abbrev"
The data will grow in the future with more nested structures, so I will have to write generalized code that traverses all the JSON elements and recognizes all the struct and array columns.
Is there a way to loop through every element in the nested struct?
Spark is able to automatically parse and infer the JSON schema. Once it's in a Spark dataframe, you can access elements within the JSON by specifying their path.
json_df = spark.read.json(filepath)
json_df.printSchema()
Output:
root
|-- glossary: struct (nullable = true)
| |-- GlossDiv: struct (nullable = true)
| | |-- GlossList: struct (nullable = true)
| | | |-- GlossEntry: struct (nullable = true)
| | | | |-- Abbrev: string (nullable = true)
| | | | |-- Acronym: string (nullable = true)
| | | | |-- GlossDef: struct (nullable = true)
| | | | | |-- GlossSeeAlso: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- para: string (nullable = true)
| | | | |-- GlossSee: string (nullable = true)
| | | | |-- GlossTerm: string (nullable = true)
| | | | |-- ID: string (nullable = true)
| | | | |-- SortAs: string (nullable = true)
| | |-- title: string (nullable = true)
| |-- title: string (nullable = true)
Then choose the fields to extract:
json_df.select("glossary.title").show()
json_df.select("glossary.GlossDiv.GlossList.GlossEntry.*").select("Abbrev","Acronym","ID","SortAs").show()
Extracted output:
+----------------+
| title|
+----------------+
|example glossary|
+----------------+
+-------------+-------+----+------+
| Abbrev|Acronym| ID|SortAs|
+-------------+-------+----+------+
|ISO 8879:1986| SGML|SGML| SGML|
+-------------+-------+----+------+
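For the generalized-traversal part of the question, the inferred schema can be walked recursively via the StructType and ArrayType classes. A minimal sketch that only prints what it finds; mapping each struct or array to its own Hive table is left as the application-specific step:

from pyspark.sql.types import ArrayType, StructType

def walk(schema, prefix=""):
    """Recursively visit every field, flagging structs and arrays."""
    for field in schema.fields:
        path = f"{prefix}.{field.name}" if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, StructType):
            print(f"{path}: struct")
            walk(dtype, path)
        elif isinstance(dtype, ArrayType):
            print(f"{path}: array<{dtype.elementType.simpleString()}>")
            if isinstance(dtype.elementType, StructType):
                walk(dtype.elementType, path)
        else:
            print(f"{path}: {dtype.simpleString()}")

walk(json_df.schema)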