Parse a JSON struct as a JSON object in Scala / Spark - json

I store a huge list of JSON strings in my MongoDB collection. For simplicity, I have extracted a sample document into the text file businessResource.json:
{
  "data" : {
    "someBusinessData" : {
      "capacity" : {
        "fuelCapacity" : NumberLong(282)
      },
      "someField" : NumberLong(16),
      "anotherField" : {
        "isImportant" : true,
        "lastDateAndTime" : "2008-01-01T11:11",
        "specialFlag" : "YMA"
      },
      ...
}
My problem: how can I convert the "someBusinessData" into a JSON object using Spark/Scala?
If I do that (for example using json4s or lift-json), I hope I can perform basic operations on them, for example checking them for equality.
Bear in mind that this is a rather large JSON object. Creating a case class is not worth it in my case, since the only operations I will perform are some filtering on two fields and comparing documents for equality, and then I will export them again.
This is how I fetch the data:
val df: DataFrame = (someSparkSession).sqlContext.read.json("src/test/resources/businessResource.json")
val myData: DataFrame = df.select("data.someBusinessData")
myData.printSchema
The schema shows:
root
|-- someBusinessData: struct (nullable = true)
| |-- capacity: struct (nullable = true)
Since "someBusinessData" is a structure, I cannot get it as String. When I try to print using
myData.first.getStruct(0), I get a string that contains the values but not the keys: [[[282],16,[true,2008-01-01T11:11,YMA]
Thanks for your help!

Instead of using .json, use .textFile to read your JSON file.
Then convert the RDD to a DataFrame (it will have only one string column).
Example:
//read json file as textfile and create a df with a single string column
import spark.implicits._
val df = spark.sparkContext.textFile("<json_file_path>").toDF("str")
//use get_json_object function to traverse json string
df.selectExpr("""get_json_object(str,"$.data.someBusinessData")""").show(false)
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|get_json_object(str,$.data.someBusinessData) |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"capacity":{"fuelCapacity":"(282)"},"someField":"(16)","anotherField":{"isImportant":true,"lastDateAndTime":"2008-01-01T11:11","specialFlag":"YMA"}}|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+

In fact, my post contained two questions:
How can I convert the "someBusinessData" into a JSON object using Spark/Scala?
How can I get the JSON object as a String?
1. Conversion into a JSON object
What I was doing was already creating a DataFrame that can be navigated as a JSON object:
//read json file as Json and select the needed data
val df: DataFrame = sparkSession.sqlContext.read.json(filePath).select("data.someBusinessData")
If you use .textFile you correctly get the String, but to parse the JSON you then need to resort to the answer from Shu.
2. How can I get the JSON object as a String?
Trivially:
df.toJSON.first
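For completeness, here is a minimal json4s sketch (assuming json4s-jackson is on the classpath, and using a hypothetical otherJsonString as the second document) of the equality check and field access asked about in the question, starting from the String obtained above:
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

//parse the extracted String back into a json4s AST
val thisDoc: JValue = parse(df.toJSON.first)
val otherDoc: JValue = parse(otherJsonString) //otherJsonString is a hypothetical second document
//JValue has structural equality, so two documents can be compared directly
val sameDocument = thisDoc == otherDoc
//individual fields can be navigated with \ for the filtering mentioned in the question
val specialFlag = thisDoc \ "someBusinessData" \ "anotherField" \ "specialFlag"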

Related

Kotlin serialization module. How to serialize/deserialize a plain json field?

I have an object with the following structure:
enum class GeometryType { ... }

data class GeometryItem(
    val type: GeometryType,
    val geometryJson: String
)
Where an example json could be:
{
    type: "Point",
    geometryJson: [12.0, 11.0]
}
So geometryJson is itself valid JSON, but it is not quoted as a string.
Could this be accomplished using the Kotlin serialization module?
According to JSON RFC:
A JSON value MUST be an object, array, number, or string, or one of
the following three literal names: false null true.
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks.
An array structure is represented as square brackets surrounding
zero or more values (or elements). Elements are separated by
commas.
So the value of geometryJson: [12.0, 11.0] has array type, not a string type.
Out of the box, kotlinx-serialization won't handle such deserialization. You could try to write your own decoder, but I would suggest using the Jackson library for this task. Not a full solution:
import com.fasterxml.jackson.core.JsonParser
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import com.fasterxml.jackson.module.kotlin.readValue

val mapper = jacksonObjectMapper()
mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true)

val json = """
    {
        type: "POINT",
        geometryJson: "[12.0, 11.0]"
    }
""".trimIndent()

val geometryItem = mapper.readValue<GeometryItem>(json)
println("geometryItem = $geometryItem")
With this config you can at least handle unquoted JSON field names.
As for reading the array value, I would suggest reading it into an array of doubles, and if you need to represent it as a string, use a conversion like this:
val stringArray = doubleArray.map { it.toString() }.toTypedArray().contentToString()

How to retrieve multiple json fields from json string using get_json_object

I am working in the Scala programming language. I have a big JSON string, and I want to extract some fields from it based on their paths. I want to extract a bunch of fields based on an input collection Seq[Field], so I need to call the function get_json_object multiple times. How can I do this using get_json_object? Is there any other way to achieve it?
When I do
var json: JSONObject = null
val fields = Seq[Field]() // this has field name and path
var fieldMap = Map[String, String]()

for (field <- fields) {
  fieldMap += (field.Field -> get_json_object(data, field.Path))
}
I get Type mismatch, expected: Column, actual: String
And when I use lit(data) in the above code, e.g.
get_json_object(lit(data), field.Path))
then I get
found : org.apache.spark.sql.Column
[ERROR] required: String
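There is no accepted answer here, but one common workaround is to put the JSON string into a one-row DataFrame and build one Column per field, so get_json_object only ever sees Columns. A minimal sketch, assuming data is the JSON String from the question, fields is the Seq[Field] with its Field/Path members, an active spark session, and a column name "json" chosen here for illustration:
import org.apache.spark.sql.functions.get_json_object
import spark.implicits._

//wrap the single JSON string in a one-row DataFrame with a column named "json"
val jsonDf = Seq(data).toDF("json")
//one get_json_object Column per requested field, aliased with the field name
val cols = fields.map(f => get_json_object($"json", f.Path).alias(f.Field))
//collect the extracted values back into a Map[String, String]
val row = jsonDf.select(cols: _*).first()
val fieldMap: Map[String, String] =
  fields.map(_.Field).zip(row.toSeq.map(v => if (v == null) null else v.toString)).toMap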

spark streaming JSON value in dataframe column scala

I have a text file with JSON values, and this gets read into a DataFrame:
{"name":"Michael"}
{"name":"Andy", "age":30}
I want to infer the schema dynamically for each line while streaming and store it in separate locations (tables) depending on its schema.
Unfortunately, when I try to read value.schema it still shows as String. Please help on how to do this in Streaming, as RDDs are not allowed in streaming.
I wanted to use the following code, which doesn't work because the value is still read in String format.
val jsonSchema = newdf1.select("value").as[String].schema
val df1 = newdf1.select(from_json($"value", jsonSchema).alias("value_new"))
val df2 = df1.select("value_new.*")
I even tried to use schema_of_json:
val jsonSchema: String = newdf.select(schema_of_json(col("value"))).as[String].first()
Still no luck. Please help.
You can load the data as a textFile, create a case class for Person, parse every JSON string into a Person instance using json4s or gson, and then create the DataFrame as follows:
case class Person(name: String, age: Int)
import spark.implicits._

val jsons = spark.read.textFile("/my/input") // Dataset[String], one JSON document per line
// instead of 'toPerson', actually parse each JSON string with json4s or gson and return a Person instance
val persons = jsons.map(json => toPerson(json))
val df = persons.toDF()
Deserialize json to case class using json4s:
https://commitlogs.com/2017/01/14/serialize-deserialize-json-with-json4s-in-scala/
Deserialize json to case class using gson:
https://alvinalexander.com/source-code/scala/scala-case-class-gson-json-object-deserialization-and-scalatra
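For reference, a minimal sketch of what the 'toPerson' helper above could look like with json4s (assuming DefaultFormats is enough for this simple case class):
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

def toPerson(json: String): Person = {
  implicit val formats: Formats = DefaultFormats
  //a field missing from the input (like "age" for Michael) would need an Option type
  //or a default value on the case class to deserialize cleanly
  parse(json).extract[Person]
}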

How to specify only particular fields using read.schema in JSON : SPARK Scala

I am trying to programmatically enforce a schema on a textFile that looks like JSON. I tried jsonFile, but the issue is that to create a DataFrame from a list of JSON files, Spark has to do one pass through the data to infer the schema, so it needs to parse all the data, which takes a long time (4 hours, since my data is zipped and TBs in size). So I want to read it as a textFile and enforce a schema to get only the fields of interest, to later query on the resulting DataFrame. But I am not sure how to map the schema to the input. Can someone give me a reference on how to map a schema onto JSON-like input?
input :
This is the full schema :
records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>]
But I am only interested in a few fields, like:
res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":falase,"rating":0,"originalRating":1}]
val records = sc.textFile("s3://testData/sample.json.gz")
val schema = StructType(Array(StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("pageType",StringType,true),
StructField("inputs", ArrayType(
StructType(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("methodId",StringType,true)
),true),true)))
val rowRDD = ??
val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")
sql("select * from input").foreach(println)
Is there any way to map this? Or do I need to use a JSON parser or something? I want to use textFile only because of the constraints.
Tried with:
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
But I keep getting the error:
<console>:37: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
^
It can be loaded with the following code using a predefined schema; Spark doesn't need to go through the zipped file to infer it. The code in the question has an ambiguity: the inner StructType must be built from an Array (or Seq) of StructFields.
import org.apache.spark.sql.types._
val input = StructType(
Array(
StructField("inputType",StringType,true),
StructField("originalRating",LongType,true),
StructField("processed",BooleanType,true),
StructField("rating",LongType,true),
StructField("score",DoubleType,true),
StructField("methodId",StringType,true)
)
)
val schema = StructType(Array(
StructField("requestId",StringType,true),
StructField("siteName",StringType,true),
StructField("model",StringType,true),
StructField("inputs",
ArrayType(input,true),
true)
)
)
val records =sqlContext.read.schema(schema).json("s3://testData/test2.gz")
Not all fields need to be provided, though it's good to provide all of them if possible.
Spark tries its best to parse everything; if some row is not valid, it adds a _corrupt_record column containing the whole row.
The same applies if it's a plain JSON file.
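As a hedged sketch of that behaviour (assuming a recent Spark version): when you supply your own schema, the corrupt-record column only shows up if you declare it yourself; its name defaults to _corrupt_record and is configurable via spark.sql.columnNameOfCorruptRecord.
import org.apache.spark.sql.types._

//declare the corrupt-record column explicitly so malformed rows are kept alongside the parsed ones
val schemaWithCorrupt = schema.add(StructField("_corrupt_record", StringType, true))
val parsed = sqlContext.read.schema(schemaWithCorrupt).json("s3://testData/test2.gz")
//rows that failed to parse have their regular fields null and the raw line in _corrupt_record
parsed.filter(parsed("_corrupt_record").isNotNull).show(false)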

How to create an RDD from another RDD by extracting specific values?

I have an RDD which contains a String and a JSON object (as a String). I extracted the required values from the JSON object. How can I use the values to create a new RDD which stores each value in its own column?
RDD
(1234,{"id"->1,"name"->"abc","age"->21,"class"->5})
From which a map was generated as shown below.
"id"->1,
"name"->"abc",
"age"->21
"id"->2,
"name"->"def",
"age"->31
How to convert this to RDD[(String, String, String)], which stores data like:
1 abc 21
2 def 31
Not in front of a compiler right now, but something like this should work:
def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  // Here goes your code to parse the JSON into a sequence of tuples;
  // it seems like you already have this well in hand.
  ???
}
val rdd1 = ??? // Initialize your RDD[(String, JValue)]
val rdd2: RDD[(String, String, String)] = rdd1.flatMap(parse)
flatMap does the trick, as your extraction function can extract multiple rows from each JSON input (or none), and they will be seamlessly integrated into the final RDD.
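For completeness, a hedged sketch of what the parse body could look like with json4s, assuming the JValue carries the flat id/name/age fields from the sample above:
import org.json4s._

def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  implicit val formats: Formats = DefaultFormats
  val json = row._2
  val id   = (json \ "id").extractOpt[Int].map(_.toString)
  val name = (json \ "name").extractOpt[String]
  val age  = (json \ "age").extractOpt[Int].map(_.toString)
  //only emit a row when all three fields are present; flatMap drops empty results
  for (i <- id.toSeq; n <- name.toSeq; a <- age.toSeq) yield (i, n, a)
}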