I want to separate a string of JSONs in my dataframe column into multiple rows in PySpark. Example:
Input:
id
addresses
1
[{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
Expected output:
id
addresses
1
{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"}
1
{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}
Any ideas how to do this?
Looking at the example in your question, it is not clear what is the type of the addresses column and what type you need in the output column. So, let's explore different combinations.
addresses column is of type ArrayType: in this case, you can use explode:
df.select('id', F.explode('addresses').alias('address'))
The result is:
+---+-----------------------------------------------------------------------------------------------------+
|id |address |
+---+-----------------------------------------------------------------------------------------------------+
|1 |{country -> USA, state -> null, city -> null, street -> 123, ABC St, ABC Square, postalCode -> 11111}|
|1 |{country -> USA, state -> TX, city -> Dallas, street -> 456, DEF Plaza, Test St, postalCode -> 99999}|
+---+-----------------------------------------------------------------------------------------------------+
The type of the output column will be the same of the type of the items in the input column.
addresses column is an Array of StringType, but you want your output to be a StructTpye: in this case, you can convert each string into a struct, using from_json:
from pyspark.sql import functions as F, SparkSession, types as T
json_schema = T.StructType([
T.StructField("city", T.StringType()),
T.StructField("state", T.StringType()),
T.StructField("street", T.StringType()),
T.StructField("postalCode", T.StringType()),
T.StructField("country", T.StringType()),
])
df_struct_from_array = (
df
.withColumn('address', F.explode('addresses'))
.select('id', F.from_json('address', json_schema).alias('address'))
)
The following dataframe is the result:
+---+-------------------------------------------------+
|id |address |
+---+-------------------------------------------------+
|1 |{null, null, 123, ABC St, ABC Square, 11111, USA}|
|1 |{Dallas, TX, 456, DEF Plaza, Test St, 99999, USA}|
+---+-------------------------------------------------+
The schema of df_struct_from_array is:
root
|-- id: long (nullable = true)
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- country: string (nullable = true)
addresses column is of StringType and you want a StructType Column in output: in this case, you have to convert from JSON first and then explode:
json_schema = T.ArrayType(T.StructType([
T.StructField("city", T.StringType()),
T.StructField("state", T.StringType()),
T.StructField("street", T.StringType()),
T.StructField("postalCode", T.StringType()),
T.StructField("country", T.StringType()),
]))
df_struct_from_str = (
df
.withColumn('addresses_conv', F.from_json('addresses', json_schema))
.select('id', F.explode('addresses_conv').alias('address'))
)
This is what you get:
+---+-------------------------------------------------+
|id |address |
+---+-------------------------------------------------+
|1 |{null, null, 123, ABC St, ABC Square, 11111, USA}|
|1 |{Dallas, TX, 456, DEF Plaza, Test St, 99999, USA}|
+---+-------------------------------------------------+
root
|-- id: long (nullable = true)
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- country: string (nullable = true)
Related
I have a an array of JSONs as below.
id address
1 [{street: 11 Summit Ave, city: null, postal_code: 07306, state: NJ , country: null}, {street: 11 Sum Ave , city: null , postal_code: null, state: NJ, country: US}, {street: 12 Oliver Avenue, city: Seattle , postal_code: 98121, state: WA, country: US}]
Here's what the data types are:
root
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- street: string (nullable = true)
| | |-- postalCode: string (nullable = true)
| | |-- country: string (nullable = true)
I want to create a string of the addresses ignoring nulls and separated by a delimiter (say ;). So output should look like:
id addresses
1 11 Summit Ave 07306 NJ ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US
How can I achieve this in PySpark? If it matters, my original address is of string type but using from_json, I converted it to the schema specified above.
This would work:
df.withColumn("allAdd", F.explode("addresses"))\
.withColumn("asString", F.expr("concat_ws(' ', allAdd.*)"))\
.groupBy("id")\
.agg(F.concat_ws("; ", F.collect_list("asString")).alias("asString"))\
.show(truncate=False)
Input:
Output:
I have a nested source json file that contains an array of structs. The number of structs varies greatly from row to row and I would like to use Spark (scala) to dynamically create new dataframe columns from the key/values of the struct where the key is the column name and the value is the column value.
Example Minified json record
{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}
dataframe schema
scala> val df = spark.read.json("file:///tmp/nested_test.json")
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
Whats been done so far
df.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")).
show(truncate=false)
+----+----+----+----+----------------------------------------------------------------------------+
|key3|key4|key6|key7|values |
+----+----+----+----+----------------------------------------------------------------------------+
|AK |EU |001 |N |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
+----+----+----+----+----------------------------------------------------------------------------+
There is an array of 3 structs here but the 3 structs need to be spilt into 3 separate columns dynamically (the number of 3 can vary greatly), and I am not sure how to do it.
Sample Desired output
Notice that there were 3 new columns produced for each of the array elements within the values array.
+----+----+----+----+-----------------------------------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-----------------------------------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-----------------------------------------+
Reference
I believe that the desired solution is something similar to what was discussed in this SO post but with 2 main differences:
The number of columns is hardcoded to 3 in the SO post but in my circumstance, the number of array elements is unknown
The column names need to be driven by the name column and the column value by the value.
...
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
You could do it this way:
val sac = new SparkContext("local[*]", " first Program");
val sqlc = new SQLContext(sac);
import sqlc.implicits._;
import org.apache.spark.sql.functions.split
import scala.math._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.{ min, max }
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df1 = sqlc.read.json(Seq(json).toDS())
val df2 = df1.select(
($"key1.key2.key3").as("key3"),
($"key1.key2.key4").as("key4"),
($"key1.key2.key5.key6").as("key6"),
($"key1.key2.key5.key7").as("key7"),
($"key1.key2.key5.values").as("values")
)
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
val finalDFColumns = df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null))).columns
val finalDF = df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
The resulting final output as :
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
Hope I got your question right!
----------- EDIT with Explanation----------
This block gets the number of columns to be created for the array structure.
val numColsVal = df2
.withColumn("values_size", size($"values"))
.agg(max($"values_size"))
.head()
.getInt(0)
finalDFColumns is the DF created with all the expected columns as output with null values.
Below block returns the different columns that needs to be created from the array structure.
df2.select(explode($"values").as("values")).select("values.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect
Below block combines the above new columns with the other columns in df2 initialized with empty/null values.
foldLeft(df2.limit(0))((cdf, c) => cdf.withColumn(c, lit(null)))
Combining these two blocks if you print the output you will get :
+----+----+----+----+------+-------------+-------------+-------------+
|key3|key4|key6|key7|values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+------+-------------+-------------+-------------+
+----+----+----+----+------+-------------+-------------+-------------+
Now we have the structure ready. We need the values for corresponding columns here. Below block gets us the values:
df2.select($"*" +: (0 until numColsVal).map(i => $"values".getItem(i)("value").as($"values".getItem(i)("name").toString)): _*)
This results like below:
+----+----+----+----+--------------------+---------------+---------------+---------------+
|key3|key4|key6|key7| values|values[0][name]|values[1][name]|values[2][name]|
+----+----+----+----+--------------------+---------------+---------------+---------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+---------------+---------------+---------------+
Now we need to rename the columns as we have in the first block above. So we will use the zip function to merge the columns and then use foldLeft method to rename the output columns as below :
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).show(false)
This results in the below structure:
+----+----+----+----+--------------------+-------------+-------------+-------------+
|key3|key4|key6|key7| values|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+--------------------+-------------+-------------+-------------+
| AK| EU| 001| N|[[valuesColumn1, ...| 9.876| 1.2345| 8.675309|
+----+----+----+----+--------------------+-------------+-------------+-------------+
We are almost there. We now just need to remove the unwanted values column like this:
finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf, column) => fdf.withColumnRenamed(column._1, column._2)).drop($"values").show(false)
Thus resulting into expected output as follows -
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|AK |EU |001 |N |9.876 |1.2345 |8.675309 |
+----+----+----+----+-------------+-------------+-------------+
I'm not sure if I was able to explain it clearly. But if you try breaking the above statements/code and try printing it you will get to know how we are reaching till the output. You could find the explanation with examples for different functions used in this logic on internet.
I found that this approach performed much better and was easier to understand using an explode and pivot:
val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""
val df = spark.read.json(Seq(json).toDS())
// schema
df.printSchema
root
|-- key1: struct (nullable = true)
| |-- key2: struct (nullable = true)
| | |-- key3: string (nullable = true)
| | |-- key4: string (nullable = true)
| | |-- key5: struct (nullable = true)
| | | |-- key6: string (nullable = true)
| | | |-- key7: string (nullable = true)
| | | |-- values: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- value: string (nullable = true)
// create final df
val finalDf = df.
select(
$"key1.key2.key3".as("key3"),
$"key1.key2.key4".as("key4"),
$"key1.key2.key5.key6".as("key6"),
$"key1.key2.key5.key7".as("key7"),
explode($"key1.key2.key5.values").as("values")
).
groupBy(
$"key3", $"key4", $"key6", $"key7"
).
pivot("values.name").
agg(min("values.value")).alias("values.name")
// result
finalDf.show
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
| AK| EU| 001| N| 9.876| 1.2345| 8.675309|
+----+----+----+----+-------------+-------------+-------------+
I am trying to pull out data as below from data frame. The Json data which has nested arrays is completely in one column(_c1). I want to pull it out and create it as separate data frame with valid column names. One sample record would be as below.
|_c1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}}
I am reading it to a schema as,
val schema=StructType(Array(
StructField("Id", StringType, false),
StructField("Type", StringType, false),
StructField("client", StringType, false),
StructField("eventTime", StringType, false),
StructField("eventType", StringType, false),
StructField("payload", ArrayType(StructType(Array(
StructField("sourceApp", StringType, false),
StructField("questionnaire", ArrayType(StructType(Array(
StructField("version", StringType, false),
StructField("question", StringType, false),
StructField("fb", StringType, false)))))
))))
))
val json_paral = DF.select(from_json(col("_c1"),schema))
`
Structure comes out as below,
`
|-- jsontostructs(_c1): struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- sourceApp: string (nullable = true)
| | | |-- questionnaire: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- version: string (nullable = true)
| | | | | |-- question: string (nullable = true)
| | | | | |-- fb: string (nullable = true)
The structure is good but when I check the dataframe all data is coming out as NULL. Is the read fine ? Not getting any parsing issues either.
Please check if this helps-
1. Load the data
val data = """{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} """
val df = Seq(data).toDF("jsonCol")
df.show(false)
df.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|jsonCol |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- jsonCol: string (nullable = true)
2. extract the json string to separate fileds
df.select(json_tuple(col("jsonCol"), "Id", "Type", "client", "eventTime", "eventType", "payload"))
.show(false)
Output-
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|c0 |c1 |c2 |c3 |c4 |c5 |
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|31279605299|12121212|Checklist _API|2020-03-17T15:50:30.640Z|Event|{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}|
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
3. using from_json(..)
val processed = df.select(
expr("from_json(jsonCol, 'struct<Id:string,Type:string,client:string,eventTime:string, eventType:string," +
"payload:struct<questionnaire:struct<fb:string,question:string,version:string>,sourceApp:string>>')")
.as("json_converted"))
processed.show(false)
processed.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------+
|json_converted |
+-------------------------------------------------------------------------------------------------------------+
|[31279605299, 12121212, Checklist _API, 2020-03-17T15:50:30.640Z, Event, [[Na, How to resolve ? , 1.0], ios]]|
+-------------------------------------------------------------------------------------------------------------+
root
|-- json_converted: struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: struct (nullable = true)
| | |-- questionnaire: struct (nullable = true)
| | | |-- fb: string (nullable = true)
| | | |-- question: string (nullable = true)
| | | |-- version: string (nullable = true)
| | |-- sourceApp: string (nullable = true)
Instead of reading it to schema I tried making it to a value as
val Df = json_DF.map(r => r.getString(0))
This will pull the data as a string on which the below would pull it out with the keys as column names.
val g1DF=spark.read.json(Df)
Did some lateral view explode nested to pull out nested array values.
There is hive table with single column of type string.
hive> desc logical_control.test1;
OK
test_field_1 string test field 1
val df2 = spark.sql("select * from logical_control.test1")
df2.printSchema()
root
|-- test_field_1: string (nullable = true)
df2.show(false)
+------------------------+
|test_field_1 |
+------------------------+
|[[str0], [str1], [str2]]|
+------------------------+
How to transform it to structured column like below?
root
|-- A: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- S: string (nullable = true)
I tried to recover it with initial schema that data being structured before it was written to the hdfs. But json_data is null.
val schema = StructType(
Seq(
StructField("A", ArrayType(
StructType(
Seq(
StructField("S", StringType, nullable = true))
)
), nullable = true)
)
)
val df3 = df2.withColumn("json_data", from_json(col("test_field_1"), schema))
df3.printSchema()
root
|-- test_field_1: string (nullable = true)
|-- json_data: struct (nullable = true)
| |-- A: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- S: string (nullable = true)
df3.show(false)
+------------------------+---------+
|test_field_1 |json_data|
+------------------------+---------+
|[[str0], [str1], [str2]]|null |
+------------------------+---------+
If the structure of test_field_1 is fixed and you don't mind "parsing" the field yourself, you can use an udf to perform the transformation:
case class S(S:String)
def toArray: String => Array[S] = _.replaceAll("[\\[\\]]","").split(",").map(s => S(s.trim))
val toArrayUdf = udf(toArray)
val df3 = df2.withColumn("json_data", toArrayUdf(col("test_field_1")))
df3.printSchema()
df3.show(false)
prints
root
|-- test_field_1: string (nullable = true)
|-- json_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- S: string (nullable = true)
+------------------------+------------------------+
|test_field_1 |json_data |
+------------------------+------------------------+
|[[str0], [str1], [str2]]|[[str0], [str1], [str2]]|
+------------------------+------------------------+
The tricky part is to create the second level (element: struct) of the structure. I have used the case class S to create this struct.
I have a jsonfile to be parsed.The json format is like this :
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every word in the file.How can I get the "major" from an array and do I have to get the word of "province" using the method df.select("cv_parse.basic_info.location.province")?
This is the result I want:
cv_id major degree birthyear state
001 English Bachelor 1984 New York
001 English MasterĀ 1984 New York
This might not be the best way of doing it but you can give it a shot.
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be :
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you need can have explode the educations column
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
$"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+