Create JSON column in Spark Scala - json

I have some data that needs to be written as a JSON string after some transformations in a spark (+scala) job.
I'm using the to_json function along with the struct and/or array functions in order to build the final JSON that is requested.
I have one piece of the json that looks like:
"field":[
"foo",
{
"inner_field":"bar"
}
]
I'm not an expert in JSON, so I don't know if this structure is usual or not; all I know is that it is valid JSON.
I'm having trouble creating a dataframe column with this format, and I want to know the best way to build this type of column.
Thanks in advance

If you have a dataframe with a bunch of columns you want to turn into a json string column, you can make use of the to_json and the struct functions. Something like this:
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._ // for toDF and the $ column syntax

val df = Seq(
  (1, "string1", Seq("string2", "string3")),
  (2, "string4", Seq("string5", "string6"))
).toDF("colA", "colB", "colC")
df.show
+----+-------+------------------+
|colA| colB| colC|
+----+-------+------------------+
| 1|string1|[string2, string3]|
| 2|string4|[string5, string6]|
+----+-------+------------------+
val newDf = df.withColumn("jsonString", to_json(struct($"colA", $"colB", $"colC")))
newDf.show(false)
+----+-------+------------------+--------------------------------------------------------+
|colA|colB |colC |jsonString |
+----+-------+------------------+--------------------------------------------------------+
|1 |string1|[string2, string3]|{"colA":1,"colB":"string1","colC":["string2","string3"]}|
|2 |string4|[string5, string6]|{"colA":2,"colB":"string4","colC":["string5","string6"]}|
+----+-------+------------------+--------------------------------------------------------+
struct builds a single StructType column from multiple columns, and to_json turns it into a JSON string.
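For the mixed array in the question (a plain string next to an object), keep in mind that Spark array columns must have a single element type, so to_json alone won't produce it directly. A minimal sketch of one workaround is to assemble that fragment as text with format_string (the column names plain and inner are just illustrative):

import org.apache.spark.sql.functions.{format_string, struct, to_json}
import spark.implicits._

// Illustrative input: one plain string and one value destined for the inner object
val dfMixed = Seq(("foo", "bar")).toDF("plain", "inner")

val result = dfMixed.withColumn(
  "field",
  // Splice the plain value and the serialized inner object into one JSON array literal
  format_string("""["%s",%s]""", $"plain", to_json(struct($"inner".as("inner_field"))))
)
result.show(false)
// field -> ["foo",{"inner_field":"bar"}]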
Hope this helps!

Related

Convert Array[Byte] to JSON format using Spark Scala

I'm reading an .avro file where the data in a particular column is in binary format. I'm currently converting the binary format to string format with a UDF so it is readable, and then I need to convert it into JSON format for further parsing of the data. Is there a way I can convert a string object to JSON format using Spark Scala code?
Any help would be much appreciated.
// Read the Avro file (spark-avro data source)
val avroDF = spark.read.format("com.databricks.spark.avro")
  .load("file:///C:/46.avro")

import org.apache.spark.sql.functions.udf

// Convert the binary column to a String
val toStringDF = udf((x: Array[Byte]) => new String(x))

val newDF = avroDF.withColumn("BODY", toStringDF(avroDF("body"))).select("BODY")
Output of newDF is shown below:
+---------------------------------------------------------------------------------------------------------------+
|BODY                                                                                                             |
+---------------------------------------------------------------------------------------------------------------+
|{"VIN":"FU74HZ501740XXXXX","MSG_TYPE":"SIGNAL","TT":0,"RPM":[{"E":1566800008672,"V":1073.75},{"E":1566800002538,"V":1003.625},{"E":1566800004084,"V":1121.75}
My desired output should be like below:
I do not know if you want a generic solution but in your particular case, you can code something like this:
import org.apache.spark.sql.functions.{col, explode}

// Parse the JSON strings, then flatten the RPM array into one row per element
spark.read.json(newDF.as[String])
  .withColumn("RPM", explode(col("RPM")))
  .withColumn("E", col("RPM.E"))
  .withColumn("V", col("RPM.V"))
  .drop("RPM")
  .show()
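As a side note (an assumption on my part, not something stated in the original answer), the binary-to-string step itself can usually be done with a plain cast, which avoids defining a UDF:

import org.apache.spark.sql.functions.col

// Spark can cast a binary column directly to string, so the UDF above is optional
val asString = avroDF.select(col("body").cast("string").as("BODY"))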

Reading Nested Json in Spark-Structured Streaming

I am trying to read data from Kafka using Structured Streaming. The data received from Kafka is in JSON format. I use a sample JSON file to create the schema, and later in the code I use the from_json function to convert the JSON to a dataframe for further processing. The problem I am facing is with the nested schema and multi-values: the sample schema defines a tag (say a) as a struct, but the JSON data read from Kafka can have either one or multiple values for the same tag (in different records).
// Infer the schema from a sample JSON file
val df0 = spark.read.format("json").load("contactSchema0.json")
val schema0 = df0.schema

// Read from Kafka and parse the value column with the inferred schema
val df1 = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "node1:9092")
  .option("subscribe", "my_first_topic")
  .load()
val df2 = df1.selectExpr("CAST(value as STRING)").toDF()
val df3 = df2.select(from_json($"value", schema0).alias("value"))
contactSchema0.json has a sample tag as follows:
"contactList": {
"contact": [{
"id": 1001
},
{
"id": 1002
}]
}
Thus contact is inferred as a struct. But the JSON data read from Kafka can also have data as follows:
"contactList": {
"contact": {
"id": 1001
}
}
So if I define the schema as a struct, from_json is unable to parse the single values, and if I define the schema as a string, it is unable to parse the multi-values.
I can't find such a feature in the Spark JSON options, but Jackson has DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY, as described in this answer.
So we can work around it with something like this:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import spark.implicits._

case class MyModel(contactList: ContactList)
case class ContactList(contact: Array[Contact])
case class Contact(id: Int)

val txt =
  """|{"contactList": {"contact": [{"id": 1001}]}}
     |{"contactList": {"contact": {"id": 1002}}}"""
    .stripMargin.lines.toSeq.toDS()

txt
  .mapPartitions[MyModel] { it: Iterator[String] =>
    // One Jackson reader per partition; ACCEPT_SINGLE_VALUE_AS_ARRAY lets a single
    // object deserialize into the Array[Contact] field
    val reader = new ObjectMapper()
      .registerModule(DefaultScalaModule)
      .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
      .readerFor(classOf[MyModel])
    it.map(reader.readValue[MyModel])
  }
  .show()
.show()
Output:
+-----------+
|contactList|
+-----------+
| [[[1001]]]|
| [[[1002]]]|
+-----------+
Note that to get a Dataset in your code, you could use
val df2 = df1.selectExpr("CAST(value as STRING)").as[String]
instead and then call mapPartitions for df2 like before.
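A rough end-to-end sketch of that, reusing the same case classes and Jackson reader as above (the console sink is just for illustration):

val parsed = df1
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  .mapPartitions[MyModel] { it: Iterator[String] =>
    val reader = new ObjectMapper()
      .registerModule(DefaultScalaModule)
      .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
      .readerFor(classOf[MyModel])
    it.map(reader.readValue[MyModel])
  }

// Write the parsed stream somewhere; the console sink is handy for testing
parsed.writeStream
  .format("console")
  .start()
  .awaitTermination()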

Converting Spark Dataset into JSON

I am trying to convert a Spark dataset into JSON. I tried the .toJSON method but it's not much help.
I have a dataset which looks like this
+--------------------+-----+
|          ord_status|count|
+--------------------+-----+
|             Fallout| 3374|
|         Flowthrough|12083|
|         In-Progress| 3804|
+--------------------+-----+
I am trying to convert it to a JSON like this:
"overallCounts": {
  "flowthrough": 2148,
  "fallout": 4233,
  "inprogress": 1300
}
My question is: is there any way to take the column values side by side and render them as JSON like this?
Update: I converted the dataset into the given JSON format by turning it into a list, then parsing each value and putting it into a string. That's a lot of manual work, though. Are there any built-in methods that can convert a dataset into such a JSON format?
Please find a solution below. The Dataset is iterated with mapPartitions, and the final string containing only the JSON elements is assembled afterwards.
import spark.implicits._

val list = List(("Fallout", 3374), ("Flowthrough", 12083), ("In-Progress", 3804))
val ds = list.toDS()
ds.show

val rows = ds.mapPartitions(itr => {
  // Template for one "key" : value pair
  val string = """"%s" : %d"""
  // Join the pairs with commas so the result stays valid JSON
  val pairs = itr.map(ele => string.format(ele._1, ele._2)).mkString(",\n")
  Iterator(pairs)
})

// Empty partitions produce empty strings; drop them before joining
val text = rows.collect().filter(_.nonEmpty).mkString(",\n")
val finalJson = """
  |"overallCounts": {
  | %s
  | }
  |""".stripMargin
println(finalJson.format(text))
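If a built-in route is preferred, one hedged alternative (assuming Spark 2.4+ for map_from_entries; the lower-casing is only there to roughly match the desired keys) is to aggregate everything into a single map column and let to_json render it:

import org.apache.spark.sql.functions.{col, collect_list, lower, map_from_entries, struct, to_json}

val jsonDf = ds.toDF("ord_status", "count").agg(
  to_json(
    map_from_entries(collect_list(struct(lower(col("ord_status")), col("count"))))
  ).as("overallCounts")
)
jsonDf.show(false)
// e.g. {"fallout":3374,"flowthrough":12083,"in-progress":3804}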

Specifying schema on JSON via Spark

I would like to specify a schema when reading from JSON, but when trying to map a number to a Double it fails. I tried FloatType and IntType with no joy!
When inferring the schema, customerid is set to String, and I would like to cast it as Double, so df1 is corrupted while df2 shows the data.
Also, FYI, I need this to be generic, as I would like to apply it to any JSON; I specified the schema below as an example of the issue I am facing.
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val testSchema = StructType(Array(StructField("customerid", DoubleType)))
val df1 = spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)
Any help would be appreciated; I am sure I am missing something obvious, but for the life of me I can't tell what it is!
Let me clarify: I am loading a file that was saved using sparkContext.newAPIHadoopRDD, so I am converting an RDD[JsonObject] to a dataframe while applying the schema to it.
Since the JSON field is enclosed in double quotes, it is considered a String. How about casting the column to Double? This casting solution can be made generic if you provide details on which columns are expected to be cast to Double.
// Cast the inferred string column (from df2) to Double
df2.select(df2("customerid").cast(DoubleType)).show()
+----------+
|customerid|
+----------+
| 535137.0|
+----------+
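A minimal sketch of that generic version (the columnsToCast list is illustrative; supply whichever columns should be numeric):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Columns expected to be Double; adjust per dataset
val columnsToCast = Seq("customerid")

val castedDf = columnsToCast.foldLeft(df2) { (acc, name) =>
  acc.withColumn(name, col(name).cast(DoubleType))
}
castedDf.printSchema()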

Difference between SQLContext.createDataframe(RDD, StructType) vs. SQLContext.read().schema(StructType).json(RDD) if I am reading in JSON strings?

createDataFrame and read.schema().json() seem to serve the same function if we pass in strings of JSON?
EDIT:
I seem to have found a third option:
[JsonRDD.jsonStringToRow](https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/json/JsonRDD.html#jsonStringToRow(org.apache.spark.rdd.RDD, org.apache.spark.sql.types.StructType, java.lang.String))
SQLContext.createDataFrame(RDD, StructType): here the first parameter is not an RDD of JSON strings; it needs to be an RDD of Rows.
SQLContext.read().schema(StructType).json(RDD): here the RDD parameter should contain strings in JSON format.
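A minimal sketch contrasting the two (names and values are illustrative, and the newer SparkSession API stands in for SQLContext):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", LongType)
))

// createDataFrame expects an RDD of Row objects already shaped like the schema
val rowRdd = spark.sparkContext.parallelize(Seq(Row("Andy", 30L), Row("Justin", 19L)))
val dfFromRows = spark.createDataFrame(rowRdd, schema)

// read.schema(...).json expects the elements themselves to be JSON documents
import spark.implicits._
val jsonDs = Seq("""{"name":"Andy","age":30}""", """{"name":"Justin","age":19}""").toDS()
val dfFromJson = spark.read.schema(schema).json(jsonDs)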
If you have a JSON dataset, you can load it into a dataframe using spark.read.json in Scala. From the Spark documentation:
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
createDataFrame(rdd) will work when your RDD contains Row objects. Spark will infer the data types, or you can specify the schema (which I would recommend unless you're certain that your data doesn't contain anything peculiar).