I have retrieved a table from SQL Server which contains over 3 million records.
Top 10 Records:
+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014| MH43AJ411| 20000000|
| 10003014| MH43AJ411| 20000001|
| 10003015| MH12GZ3392| 20000002|
| 10003016| GJ15Z8173| 20000003|
| 10003018| MH05AM902| 20000004|
| 10003019| GJ15CD7657| 20001866|
| 10003019| MH02BY7774| 20000005|
| 10003019| MH02DG7774| 20000933|
| 10003019| GJ15CA7387| 20001865|
| 10003019| GJ15CB9601| 20001557|
+---------+-------------+----------+
only showing top 10 rows
Here the same ACCOUNTNO can appear against more than one VEHICLENUMBER, and each vehicle has its own CUSTOMERID with respect to that VEHICLENUMBER.
I want to export as a JSON format.
This is my code to achieve the output:
package com.issuer.pack2.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._

object sqltojson {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:/winutil/")

    val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"
    val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"
    val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr, "dbtable" -> jdbcDbTable)).load()
    // jdbcDF.show(10)

    jdbcDF.registerTempTable("tp_customer_account")
    val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO")
    // res01.show(10)

    res01.coalesce(1).write.json("D:/res01.json")
  }
}
The output I got:
{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}
{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000}
{"ACCOUNTNO":10003015,"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}
{"ACCOUNTNO":10003016,"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":20000003}
{"ACCOUNTNO":10003018,"VEHICLENUMBER":"MH05AM902","CUSTOMERID":20000004}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":20000005}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":20001865}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":20001866}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":20000933}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":20001557}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7387","CUSTOMERID":20029961}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CF7747","CUSTOMERID":20009020}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB727","CUSTOMERID":20000008}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7837","CUSTOMERID":20001223}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7477","CUSTOMERID":20001690}
{"ACCOUNTNO":10003020,"VEHICLENUMBER":"MH01AX5658","CUSTOMERID":20000006}
{"ACCOUNTNO":10003021,"VEHICLENUMBER":"GJ15AD727","CUSTOMERID":20000007}
{"ACCOUNTNO":10003023,"VEHICLENUMBER":"GU15PP7567","CUSTOMERID":20000009}
{"ACCOUNTNO":10003024,"VEHICLENUMBER":"GJ15CA7567","CUSTOMERID":20000010}
{"ACCOUNTNO":10003025,"VEHICLENUMBER":"GJ5JB9312","CUSTOMERID":20000011}
But I want to get the JSON format output like this:
I have written the JSON below by hand for the first three records of the table above (maybe I have designed it wrongly; the point is that each ACCOUNTNO should appear only once).
{
"ACCOUNTNO":10003014,
"VEHICLE": [
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
],
"ACCOUNTNO":10003015,
"VEHICLE": [
{ "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
]
}
So, how to achieve this JSON format using Spark code?
Scala spark-sql
You can do the following (instead of registerTempTable you can use createOrReplaceTempView, since registerTempTable is deprecated):
jdbcDF.createOrReplaceTempView("tp_customer_account")
val res01 = sqlContext.sql("SELECT ACCOUNTNO, collect_list(struct(`VEHICLENUMBER`, `CUSTOMERID`)) as VEHICLE FROM tp_customer_account GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO ")
res01.coalesce(1).write.json("D:/res01.json")
You should get your desired output as
{"ACCOUNTNO":"10003014","VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000000"},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000001"}]}
{"ACCOUNTNO":"10003015","VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":"20000002"}]}
{"ACCOUNTNO":"10003016","VEHICLE":[{"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":"20000003"}]}
{"ACCOUNTNO":"10003018","VEHICLE":[{"VEHICLENUMBER":"MH05AM902","CUSTOMERID":"20000004"}]}
{"ACCOUNTNO":"10003019","VEHICLE":[{"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":"20001866"},{"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":"20000005"},{"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":"20000933"},{"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":"20001865"},{"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":"20001557"}]}
Scala spark API
Using the Spark Scala API, you can do the following:
import org.apache.spark.sql.functions._
val res01 = jdbcDF.groupBy("ACCOUNTNO")
  .agg(collect_list(struct("VEHICLENUMBER", "CUSTOMERID")).as("VEHICLE"))
res01.coalesce(1).write.json("D:/res01.json")
You should get the same result as with the SQL approach.
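Note that the DataFrame API version omits the ORDER BY ACCOUNTNO that the SQL version has. If you also want the output sorted before writing, a minimal sketch (same aggregation as above, with an explicit sort added) would be:
import org.apache.spark.sql.functions.{collect_list, struct}

val res01 = jdbcDF.groupBy("ACCOUNTNO")
  .agg(collect_list(struct("VEHICLENUMBER", "CUSTOMERID")).as("VEHICLE"))
  .orderBy("ACCOUNTNO")

res01.coalesce(1).write.json("D:/res01.json")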
I hope the answer is helpful.
Related
I have the following data in a spark dataframe:
id | series
1  | {"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0, "2016-07-31T00:00:00.000Z": 6550781.0, "2016-08-31T00:00:00.000Z": 7107308.0}
2  | {"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0}
I would like to extract the time series data into a format that is easier to work with, e.g. the following:
id | timestamp                  | value
1  | "2016-01-31T00:00:00.000Z" | 2000.3
1  | "2016-02-31T00:00:00.000Z" | 100000.3
1  | "2016-02-31T00:00:00.000Z" | null
2  | "2012-01-31T00:00:00.000Z" | 6394317.0
2  | "2013-02-31T00:00:00.000Z" | 10000317.0
I have tried df.groupby('id') and can achieve this in pandas by iterating over the groupby object. e.g:
for fund_id, df_i in df.groupby('id'):
    ts = json.loads(df_i['series'].iloc[0])  # get time series
    id = df_i['id'].iloc[0]                  # get id

    # storing all timeseries in temp df
    df_temp = pd.DataFrame(columns=['id', 'date', 'value'])
    df_temp['value'] = ts.values()
    df_temp['date'] = ts.keys()
    df_temp['id'] = id
    # Finally append all df_temp
Any ideas how to do the same thing in spark?
You can, yes. You have to jump through some hoops to convert the JSON string to an array, explode it, then split the remaining string only on the colon (:) outside the quotes.
Pyspark approach:
jstring = """{"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0, "2016-07-31T00:00:00.000Z": 6550781.0, "2016-08-31T00:00:00.000Z": 7107308.0}"""
df = spark.createDataFrame(
    [
        (1, jstring)
    ],
    ["id", "series"]
)

from pyspark.sql.functions import regexp_replace, explode, split, trim, expr

df.select("id", regexp_replace(regexp_replace("series", "\\{", ""), "\\}", "").alias("s")). \
    select("id", explode(split("s", ",").cast("array<string>")).alias("exp_series")). \
    select("id", split("exp_series", ":(?=([^\"]*\"[^\"]*\")*[^\"]*$)").alias("foo")). \
    select("id", trim(expr("foo[0]")).alias("a"), trim(expr("foo[1]")).alias("b")).show()
Scala approach:
scala> val jstring = """{"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0, "2016-07-31T00:00:00.000Z": 6550781.0, "2016-08-31T00:00:00.000Z": 7107308.0}"""
jstring: String = {"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0, "2016-07-31T00:00:00.000Z": 6550781.0, "2016-08-31T00:00:00.000Z": 7107308.0}
scala> val data = Seq((1,jstring))
data: Seq[(Int, String)] = List((1,{"2016-01-31T00:00:00.000Z": null, "2016-06-30T00:00:00.000Z": 6394317.0, "2016-07-31T00:00:00.000Z": 6550781.0, "2016-08-31T00:00:00.000Z": 7107308.0}))
scala> val df = data.toDF("id","series")
df: org.apache.spark.sql.DataFrame = [id: int, series: string]
scala> df.select($"id",regexp_replace(regexp_replace($"series","\\{",""),"\\}", "").alias("s")).
| select($"id",explode(split($"s",",").cast("array<string>")).alias("exp_series")).
| select($"id",split($"exp_series",":(?=([^\"]*\"[^\"]*\")*[^\"]*$)").alias("foo")).
| select($"id",trim($"foo".getItem(0)).alias("a"),trim($"foo".getItem(1)).alias("b")).show(false)
+---+--------------------------+---------+
|id |a |b |
+---+--------------------------+---------+
|1 |"2016-01-31T00:00:00.000Z"|null |
|1 |"2016-06-30T00:00:00.000Z"|6394317.0|
|1 |"2016-07-31T00:00:00.000Z"|6550781.0|
|1 |"2016-08-31T00:00:00.000Z"|7107308.0|
+---+--------------------------+---------+
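Alternatively, instead of the regex splitting above, you can let Spark parse the series string directly as a map and explode it. This is only a sketch, assuming your Spark version supports from_json with a MapType schema:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{DoubleType, MapType, StringType}

// parse the JSON object as a timestamp -> value map, then emit one row per entry
val tidy = df
  .withColumn("series_map", from_json(col("series"), MapType(StringType, DoubleType)))
  .select(col("id"), explode(col("series_map")).as(Seq("timestamp", "value")))

tidy.show(false)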
I have json data like below where version field is the differentiator -
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is now ArrayType(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read multiple JSON files of both these types, which arrive together. I cannot define stats as a string in schema_def, as that would break the corrupt-record feature (for the stats column) which checks for malformed JSON and schema compliance of all the fields.
Example df output required in a single read:
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried reading with the mergeSchema option, but that makes the stats field String type.
Also, I have tried making two dataframes by filtering on the version field and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here also the stats column becomes String type.
I was thinking that if, while reading, I could parse the column as stats_v1 or stats_v2 based on its data type, that could solve the problem. Please help with any possible solutions.
A UDF to check whether stats is a string (single JSON object) or an array; if it is a single object, it converts it to a one-element array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats

  def toArray(data: String) = {
    val json_data = parse(data)
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
  .withColumn("json_stats", explode(from_json(toJsonArray($"stats"), schema)))
  .select(
    $"version",
    $"stats",
    $"json_stats".getItem("hour").as("hour"),
    $"json_stats".getItem("hits").as("hits")
  ).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
  .withColumn("stats", expr)
  .withColumn("stats", explode(from_json($"stats", schema)))
  .select(
    $"version",
    $"stats",
    $"stats".getItem("hour").as("hour"),
    $"stats".getItem("hits").as("hits")
  )
  .show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, and then use the resulting schema to read the first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse it against two (or more) schemas, and inline_outer to expand the resulting array of structs into hour/hits columns. (Add wholetext=True as an argument of spark.read.text if each file contains a single JSON document spread across multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
    .withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
    .selectExpr('version', 'inline_outer(status)') \
    .show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+
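For a Scala rendition of the same idea, a rough sketch (mirroring the pyspark snippet above; the input path is the same placeholder) could be:
import org.apache.spark.sql.functions.{array, coalesce, col, from_json, json_tuple}
import org.apache.spark.sql.types._

// the same two candidate schemas as above, built explicitly
val schema1 = ArrayType(StructType(Seq(StructField("hour", IntegerType), StructField("hits", IntegerType))))
val schema2 = StructType(Seq(StructField("hour", IntegerType), StructField("hits", IntegerType)))

spark.read.text("/path/to/all/jsons/")
  .select(json_tuple(col("value"), "version", "stats").as(Seq("version", "stats")))
  .withColumn("status", coalesce(from_json(col("stats"), schema1), array(from_json(col("stats"), schema2))))
  .selectExpr("version", "inline_outer(status)")
  .show()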
I want to hit an API using parameters taken from a dataframe, get the JSON response body, and from the body pull out all the distinct values of a particular key.
I then need to add these values as a new column to the first dataframe.
Suppose I have a dataframe like below:
df1:
+-----+-------+--------+
| DB | User | UserID |
+-----+-------+--------+
| db1 | user1 | 123 |
| db2 | user2 | 456 |
+-----+-------+--------+
I want to hit a REST API, providing the column values of df1 as parameters.
If my URL parameters are db=db1 and User=user1 (the first record of df1), the response will be JSON of the following format:
{
  "data": [
    {
      "db": "db1",
      "User": "User1",
      "UserID": 123,
      "Query": "Select * from A",
      "Application": "App1"
    },
    {
      "db": "db1",
      "User": "User1",
      "UserID": 123,
      "Query": "Select * from B",
      "Application": "App2"
    }
  ]
}
From this JSON, I want to get the distinct values of the Application key as an array or list and attach it as a new column to df1.
My output will look similar to below:
Final df:
+-----+-------+--------+-------------+
| DB | User | UserID | Apps |
+-----+-------+--------+-------------+
| db1 | user1 | 123 | {App1,App2} |
| db2 | user2 | 456 | {App3,App3} |
+-----+-------+--------+-------------+
I have come up with a high-level plan on how to achieve it:
1. Add a new column called response URL, built from multiple columns in the input.
2. Define a Scala function that takes a URL and returns an array of applications, and convert it to a UDF.
3. Create another column by applying the UDF, passing the response URL.
Since I am pretty new to Scala/Spark and have never worked with REST APIs, can someone please help me achieve this?
Any other idea or suggestion is always welcome.
I am using spark 1.6.
Check the code below. You may need to write the logic to invoke the REST API; once you get the result, the rest of the process is simple.
scala> val df = Seq(("db1","user1",123),("db2","user2",456)).toDF("db","user","userid")
df: org.apache.spark.sql.DataFrame = [db: string, user: string, userid: int]
scala> df.show(false)
+---+-----+------+
|db |user |userid|
+---+-----+------+
|db1|user1|123 |
|db2|user2|456 |
+---+-----+------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
def invokeRestAPI(db:String,user: String) = {
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
// Write your invoke logic & for now I am hardcoding your sample json here.
val json_data = parse("""{"data":[ {"db": "db1","User": "User1","UserID": 123,"Query": "Select * from A","Application": "App1"},{"db": "db1","User": "User1","UserID": 123,"Query": "Select * from B","Application": "App2"}]}""")
(json_data \\ "data" \ "Application").extract[Set[String]].toList
}
// Exiting paste mode, now interpreting.
invokeRestAPI: (db: String, user: String)List[String]
scala> val fetch = udf(invokeRestAPI _)
fetch: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),List(StringType, StringType))
scala> df.withColumn("apps",fetch($"db",$"user")).show(false)
+---+-----+------+------------+
|db |user |userid|apps |
+---+-----+------+------------+
|db1|user1|123 |[App1, App2]|
|db2|user2|456 |[App1, App2]|
+---+-----+------+------------+
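For the invoke logic that is hardcoded above, one option is a simple blocking GET with scala.io.Source. This is only a sketch; the endpoint URL and query parameter names are hypothetical placeholders for your real API:
import org.json4s._
import org.json4s.jackson.JsonMethods._

def invokeRestAPI(db: String, user: String): List[String] = {
  implicit val formats = DefaultFormats

  // hypothetical endpoint; replace with your real URL and parameter names
  val url = s"http://example.com/api/queries?db=$db&User=$user"
  val responseBody = scala.io.Source.fromURL(url).mkString

  // same extraction as in the hardcoded version above
  val json_data = parse(responseBody)
  (json_data \\ "data" \ "Application").extract[Set[String]].toList
}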
My JSON column names are a mix of lower and upper case (e.g. title/Title and name/Name), due to which I am getting name and Name as two different columns in the output (and similarly title and Title).
How can I make the JSON columns case insensitive?
I tried config("spark.sql.caseSensitive", "true"), but it is not working.
import spark.implicits._

val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
outputDF.show(false)
Current output: name/Name and title/Title come out as separate columns.
Expected/needed output: a single name column and a single title column (i.e. case-insensitive column names).
Solution 1
You can group the columns by their lowercase names and merge them using the coalesce function:
// set spark.sql.caseSensitive to true to avoid ambuigity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
.map(grp =>
if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
else col(grp(0))
).toSeq
val outputDF = df1.select(mergedCols:_*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
Solution 2
Another way is to parse the JSON string column into MapType instead of StructType; using transform_keys you can lowercase the keys, then explode the map and pivot to get the columns:
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
"col_json",
from_json(col("col_json"), MapType(StringType, StringType))
).select(
col("col_1"),
col("col_2"),
explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
.pivot("key")
.agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
Note that transform_keys is only available since Spark 3; for older versions you can use a UDF:
val mapKeysToLower = udf((m: Map[String, String]) => {
m.map { case (k, v) => k.toLowerCase -> v }
})
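A rough sketch of how this UDF could slot into the same pipeline in place of transform_keys (assuming the same df as above):
import org.apache.spark.sql.functions.{col, explode, first, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

val outputDF = df.withColumn(
    "col_json",
    from_json(col("col_json"), MapType(StringType, StringType))
  ).select(
    col("col_1"),
    col("col_2"),
    explode(mapKeysToLower(col("col_json")))  // lowercase the keys, then explode to key/value rows
  ).groupBy("col_1", "col_2")
  .pivot("key")
  .agg(first("value"))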
You will need to merge your columns, using something like:
import org.apache.spark.sql.functions.when
df = df.withColumn("title", when($"title".isNull, $"Title").otherwise($"title")).drop("Title")
I'm writing a Spark application in Scala using Spark Structured Streaming that receives JSON-formatted data from Kafka. The application can receive either a single JSON object or multiple ones, formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
Array(
StructField("key1",DataTypes.StringType),
StructField("key2",DataTypes.StringType)
))
But it doesn't work.
My actual code for parsing JSON:
var data = (this.stream).getStreamer().load()
.selectExpr("CAST (value AS STRING) as json")
.select(from_json($"json",schema=schema).as("data"))
I would like to get these JSON objects into a dataframe like:
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Can anyone help me, please?
Thank you!
As your incoming string is an array of JSON objects, one way is to write a UDF to parse the array, then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same can be used for streaming with minimal changes.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser {

  // case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.
      builder().
      appName("JSON").
      master("local").
      getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // sample JSON array String coming from kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    // UDF to parse the JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json")                        // json String column
      .withColumn("array", jsonConverter($"json"))   // parse the JSON Array
      .withColumn("json", explode($"array"))         // explode the Array
      .drop("array")                                 // drop unwanted columns
      .select("json.*")                              // expand the JSON struct to separate columns

    // display the DF
    df.show()
    //+------+------+
    //|  key1|  key2|
    //+------+------+
    //|value1|value2|
    //|value3|value4|
    //+------+------+
  }
}
This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a suitable format for from_json, and applied explode and the * operator in the last step of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
You can wrap your schema in an ArrayType, and from_json will then parse the data into an array of structs.
var schema = ArrayType(StructType(
Array(
StructField("key1", DataTypes.StringType),
StructField("key2", DataTypes.StringType)
)))
Explode it to get one JSON array element per row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
.select($"jsonData").show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the JSON keys:
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
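Tying this back to the structured streaming code in the question, a minimal sketch (reusing the question's stream helper and the ArrayType schema above) would be:
import org.apache.spark.sql.functions.{explode, from_json}

// sketch only: `(this.stream).getStreamer()` is the helper from the question,
// and `schema` is the ArrayType schema defined above
var data = (this.stream).getStreamer().load()
  .selectExpr("CAST (value AS STRING) as json")
  .select(explode(from_json($"json", schema)).as("data"))
  .select("data.*")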