Array of JSON to DataFrame in Spark received from Kafka

I'm writing a Spark application in Scala using Spark Structured Streaming that receives data formatted as JSON from Kafka. The application can receive either a single JSON object or multiple JSON objects formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  ))
But it doesn't work.
My current code for parsing the JSON:
var data = (this.stream).getStreamer().load()
  .selectExpr("CAST (value AS STRING) as json")
  .select(from_json($"json", schema = schema).as("data"))
I would like to get these JSON objects in a DataFrame like
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Can anyone help me, please?
Thank you!

As your incoming string is an array of JSON objects, one way is to write a UDF to parse the array and then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same approach can be used for streaming with minimal changes.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser {
  // case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("JSON")
      .master("local")
      .getOrCreate()
    import spark.implicits._
    import org.apache.spark.sql.functions._

    // sample JSON array String coming from Kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    // UDF to parse the JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json")                        // json String column
      .withColumn("array", jsonConverter($"json"))   // parse the JSON array
      .withColumn("json", explode($"array"))         // explode the array
      .drop("array")                                 // drop unwanted columns
      .select("json.*")                              // expand the JSON struct to separate columns

    // display the DF
    df.show()
    //+------+------+
    //|  key1|  key2|
    //+------+------+
    //|value1|value2|
    //|value3|value4|
    //+------+------+
  }
}
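For the streaming case mentioned above, a minimal sketch could look like the following (the Kafka bootstrap servers and topic name are placeholders, and jsonConverter is the UDF defined above); only the source and sink change, the transformation chain stays the same:
// minimal streaming sketch (assumptions: jsonConverter UDF in scope, placeholder Kafka options)
val streamingDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "my-topic")                     // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) as json")
  .withColumn("array", jsonConverter($"json")) // parse the JSON array
  .withColumn("json", explode($"array"))       // one row per array element
  .drop("array")
  .select("json.*")

streamingDf.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()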

This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a suitable format for from_json, and applied explode and the * operator in the last step of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
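As a small variation (a sketch reusing the df and schema values from above), you can pass the inferred schema to from_json programmatically instead of hard-coding the string:
// sketch: reuse the schema string inferred by schema_of_json (df and schema as above)
import org.apache.spark.sql.functions.{col, explode, from_json, lit}
val parsedJson2 = df.select(from_json(col("json"), lit(schema)).as("parsed_json"))
val data2 = parsedJson2.select(explode(col("parsed_json")).as("json")).select("json.*")
data2.show() // same result as data.show() above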

You can wrap your schema in ArrayType, and from_json will then parse the JSON string into an array of structs.
var schema = ArrayType(StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  )))
Explode it to get each JSON array element in its own row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the json keys
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
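For completeness, a self-contained sketch tying these steps together (assumptions: a local SparkSession and the sample array inlined in place of the Kafka value column):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("ArrayOfJson").master("local").getOrCreate()
import spark.implicits._

val schema = ArrayType(StructType(Array(
  StructField("key1", DataTypes.StringType),
  StructField("key2", DataTypes.StringType)
)))

val df = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
  .toDF("value") // stands in for the Kafka value column

df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select("jsonData.*")
  .show()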

Related

Flattening json string in spark

I have the following dataframe in spark:
root
|-- user_id: string (nullable = true)
|-- payload: string (nullable = true)
in which payload is a JSON string with no fixed schema. Here is some sample data:
{'user_id': '001','payload': '{"country":"US","time":"11111"}'}
{'user_id': '002','payload': '{"message_id":"8936716"}'}
{'user_id': '003','payload': '{"brand":"adidas","when":""}'}
I want to output the above data in JSON format with the payload flattened (basically just extracting the key-value pairs from payload and putting them at the root level), for example:
{'user_id': '001','country':'US','time':'11111'}
{'user_id': '002','message_id':'8936716'}
{'user_id': '003','brand':'adidas','when':''}
Stack Overflow flagged this as a duplicate of Flatten Nested Spark Dataframe, but it isn't: the difference here is that the value of payload in my case is just of string type.
You can parse the payload JSON as a map<string,string> and add the user_id to the payload:
import pyspark.sql.functions as F
# input dataframe
df.show(truncate=False)
+-------+-------------------------------+
|user_id|payload |
+-------+-------------------------------+
|001 |{"country":"US","time":"11111"}|
|002 |{"message_id":"8936716"} |
|003 |{"brand":"adidas","when":""} |
+-------+-------------------------------+
df2 = df.select(
    F.to_json(
        F.map_concat(
            F.create_map(F.lit('user_id'), F.col('user_id')),
            F.from_json('payload', 'map<string,string>')
        )
    ).alias('out')
)
df2.show(truncate=False)
+-----------------------------------------------+
|out |
+-----------------------------------------------+
|{"user_id":"001","country":"US","time":"11111"}|
|{"user_id":"002","message_id":"8936716"} |
|{"user_id":"003","brand":"adidas","when":""} |
+-----------------------------------------------+
To write it to a JSON file, you can do:
df2.coalesce(1).write.text('filepath')
This is how I finally solved the problem:
json_schema = spark.read.json(source_parquet_df.rdd.map(lambda row: row.payload)).schema
new_df = source_parquet_df.withColumn('payload_json_obj', from_json(col('payload'), json_schema)).drop(source_parquet_df.payload)
flat_df = new_df.select([c for c in new_df.columns if c != 'payload_json_obj'] + ['payload_json_obj.*'])

Reading an element from a json object stored in a column

I have the following dataframe
+---+---------------------------+
|key|value                      |
+---+---------------------------+
|1  |{"name":"John", "age": 34} |
|2  |{"name":"Rose", "age": 50} |
+---+---------------------------+
I want to retrieve all the age values within this dataframe and store them later in an array.
val x = df_clean.withColumn("value", col("value.age"))
x.show(false)
But this throws an exception.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from value#89: need struct type but got string;
How can I resolve this requirement?
EDIT
val schema = existingSparkSession.read.json(df_clean.select("value").as[String]).schema
val my_json = df_clean.select(from_json(col("value"), schema).alias("jsonValue"))
my_json.printSchema()
val df_final = my_json.withColumn("age", col("jsonValue.age"))
df_final.show(false)
Currently no exceptions are thrown, yet I can't see any output either.
EDIT 2
println("---+++++--------")
df_clean.select("value").take(1)
println("---+++++--------")
output
---+++++--------
---+++++--------
If you have a long JSON document and want to create a schema for it, you can infer the schema with spark.read.json and then use from_json with that schema.
import org.apache.spark.sql.functions._

val df = Seq(
  (1, "{\"name\":\"John\", \"age\": 34}"),
  (2, "{\"name\":\"Rose\", \"age\": 50}")
).toDF("key", "value")

val schema = spark.read.json(df.select("value").as[String]).schema
val resultDF = df.withColumn("value", from_json($"value", schema))

resultDF.show(false)
resultDF.printSchema()
Output:
+---+----------+
|key|value |
+---+----------+
|1 |{34, John}|
|2 |{50, Rose}|
+---+----------+
Schema:
root
|-- key: integer (nullable = false)
|-- value: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
If you need to access the nested fields directly, you can use get_json_object:
df.withColumn("name", get_json_object($"value", "$.name"))
.withColumn("age", get_json_object($"value", "$.age"))
.show(false)
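If you then want to collect all the age values into a local array (the original requirement), a small sketch reusing resultDF from above could look like this:
import org.apache.spark.sql.functions.col
import spark.implicits._
val ages: Array[Long] = resultDF.select(col("value.age")).as[Long].collect()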

spark dataframes : reading json having duplicate column names but different datatypes

I have JSON data like below, where the version field is the differentiator:
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is now ArrayType(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both of these types of JSON files, which arrive together. I cannot define stats as a string in schema_def, as that would affect the corrupt-record feature (for the stats column), which checks for malformed JSON and schema compliance of all the fields.
Example df output required in a single read:
version | hour | hits
1       | null | 20
2       | 1    | 10
2       | 2    | 12
I have tried reading with the mergeSchema option, but that makes the stats field String type.
I have also tried making two dataframes by filtering on the version field and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here too the stats column becomes String type.
I was thinking that if, while reading, I could parse the df column headers as stats_v1 and stats_v2 based on their data types, that could solve the problem. Please help with any possible solutions.
You can write a UDF to check whether stats is a string or an array; if it is a string (a single JSON object), the UDF converts it to an array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}

object Parse {
  implicit val formats = DefaultFormats
  def toArray(data: String) = {
    val json_data = parse(data)
    // if the value is a single JSON object, wrap it in an array; otherwise pass it through
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}

val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
  .withColumn("json_stats", explode(from_json(toJsonArray($"stats"), schema)))
  .select(
    $"version",
    $"stats",
    $"json_stats".getItem("hour").as("hour"),
    $"json_stats".getItem("hits").as("hits")
  ).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
  .withColumn("stats", expr)
  .withColumn("stats", explode(from_json($"stats", schema)))
  .select(
    $"version",
    $"stats",
    $"stats".getItem("hour").as("hour"),
    $"stats".getItem("hits").as("hits")
  )
  .show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, use schema to read first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse the field against two or more schemas. (Add wholetext=True as an argument to spark.read.text if each file contains a single JSON document spanning multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array

df = spark.read.text("/path/to/all/jsons/")

schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"

df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
    .withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
    .selectExpr('version', 'inline_outer(status)') \
    .show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+

Spark: correct schema to load JSON as DataFrame

I have a JSON like
{ 1234 : "blah1", 9807: "blah2", 467: "blah_k", ...}
written to a gzipped file. It is a mapping of one ID space to another where the keys are ints and values are strings.
I want to load it as a DataFrame in Spark.
I loaded it as,
val df = spark.read.format("json").load("my_id_file.json.gz")
By default, Spark loaded it with a schema that looks like
|-- 1234: string (nullable = true)
|-- 9807: string (nullable = true)
|-- 467: string (nullable = true)
Instead, I want my DataFrame to look like
+----+------+
|id1 |id2 |
+----+------+
|1234|blah1 |
|9807|blah2 |
|467 |blah_k|
+----+------+
So, I tried the following.
import org.apache.spark.sql.types._
val idMapSchema = StructType(Array(StructField("id1", IntegerType, true), StructField("id2", StringType, true)))
val df = spark.read.format("json").schema(idMapSchema).load("my_id_file.json.gz")
However, the loaded data frame looks like
scala> df.show
+----+----+
|id1 |id2 |
+----+----+
|null|null|
+----+----+
How can I specify the schema to fix this? Is there a "pure" dataframe approach (without creating an RDD and then converting it to a DataFrame)?
One way to achieve this is to read the input file with textFile, apply your parsing logic within map(), and then convert the result to a dataframe:
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
import org.json.JSONObject // JSON library assumed here for JSONObject

val rdd = sparkSession.sparkContext.textFile("my_input_file_path")
  .map(row => {
    val list = new ListBuffer[String]()
    val inputJson = new JSONObject(row)
    // turn every key/value pair of the input object into its own small JSON document
    for (key <- inputJson.keySet()) {
      val resultJson = new JSONObject()
      resultJson.put("col1", key)
      resultJson.put("col2", inputJson.get(key))
      list += resultJson.toString()
    }
    list
  }).flatMap(row => row)

val df = sparkSession.read.json(rdd)
df.printSchema()
df.show(false)
output:
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
+----+------+
|col1|col2 |
+----+------+
|1234|blah1 |
|467 |blah_k|
|9807|blah2 |
+----+------+
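For reference, a pure-DataFrame variant of the same key/value flattening is also possible (a sketch, not part of the original answer; it assumes the whole JSON object sits on a single line and uses the allowUnquotedFieldNames option because the sample keys are unquoted):
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

val textDf = spark.read.text("my_id_file.json.gz")
val idDf = textDf
  .select(from_json(col("value"), MapType(StringType, StringType),
    Map("allowUnquotedFieldNames" -> "true")).as("m")) // parse the object as a map<string,string>
  .select(explode(col("m")).as(Seq("id1", "id2")))     // each map entry becomes an id1/id2 row
idDf.show(false)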

Fit a json string to a DataFrame using a schema

I have a schema that looks like this:
StructType(StructField(keys,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I have a JSON string (that matches this schema) that I need to convert to fit the above schema:
"{"keys" : [2.0, 1.0]}"
How do I proceed to get a DataFrame out of this string that matches my schema?
Following are the steps I have tried in a scala notebook:
val rddData2 = sc.parallelize("""{"keys" : [1.0 , 2.0] }""" :: Nil)
val in = session.read.schema(schema).json(rddData2)
in.show
This is the output being shown:
+-----------+
|keys |
+-----------+
|null |
+-----------+
If you have a json string as
val jsonString = """{"keys" : [2.0, 1.0]}"""
then you can create a dataframe without schema as
val jsonRdd = sc.parallelize(Seq(jsonString))
val df = sqlContext.read.json(jsonRdd)
which should give you
+----------+
|keys |
+----------+
|[2.0, 1.0]|
+----------+
with schema
root
|-- keys: array (nullable = true)
| |-- element: double (containsNull = true)
Now, if you want to convert the array column created by default to a Vector, you need a udf function:
import org.apache.spark.sql.functions._
def vectorUdf = udf((array: collection.mutable.WrappedArray[Double]) => org.apache.spark.ml.linalg.Vectors.dense(Array(array: _*)))
and call the udf function using .withColumn as
df.withColumn("keys", vectorUdf(col("keys")))
You should get a dataframe with the following schema:
root
|-- keys: vector (nullable = true)
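Putting the pieces together, a compact end-to-end sketch (assuming an active SparkSession called spark):
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val jsonString = """{"keys" : [2.0, 1.0]}"""
val df = spark.read.json(Seq(jsonString).toDS)                       // infer array<double> schema
val vectorUdf = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
val result = df.withColumn("keys", vectorUdf(col("keys")))           // array<double> -> ml Vector
result.printSchema() // keys: vector (nullable = true)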