I am trying to read data from Kafka using Structured Streaming. The data received from Kafka is in JSON format. I use a sample JSON file to create the schema, and later in the code I use the from_json function to convert the JSON to a DataFrame for further processing. The problem I am facing is with nested schemas and multi-values: the sample schema defines a tag (say a) as a struct, but the JSON data read from Kafka can have either one value or multiple values for the same tag (in two different messages).
// Infer the schema from a sample JSON file
val df0 = spark.read.format("json").load("contactSchema0.json")
val schema0 = df0.schema
// Read from Kafka and parse the value column as JSON using the inferred schema
val df1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "node1:9092").option("subscribe", "my_first_topic").load()
val df2 = df1.selectExpr("CAST(value AS STRING)")
val df3 = df2.select(from_json($"value", schema0).alias("value"))
contactSchema0.json has a sample tag as follows:
"contactList": {
"contact": [{
"id": 1001
},
{
"id": 1002
}]
}
Thus contact is inferred as an array of structs. But the JSON data read from Kafka can also contain data like this:
"contactList": {
"contact": {
"id": 1001
}
}
So if I define contact in the schema as an array, Spark is unable to parse messages with a single value, and if I define it as a plain struct, Spark is unable to parse messages with multiple values.
I can't find such a feature in the Spark JSON options, but Jackson has DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY, as described in this answer. So we can work around it with something like this:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import spark.implicits._

case class MyModel(contactList: ContactList)
case class ContactList(contact: Array[Contact])
case class Contact(id: Int)

// One JSON document per line: the first uses an array, the second a single object
val txt =
  """|{"contactList": {"contact": [{"id": 1001}]}}
     |{"contactList": {"contact": {"id": 1002}}}"""
    .stripMargin.linesIterator.toSeq.toDS()

txt
  .mapPartitions[MyModel] { it: Iterator[String] =>
    // Build one ObjectReader per partition; ACCEPT_SINGLE_VALUE_AS_ARRAY
    // wraps a lone object into a single-element array on deserialization
    val reader = new ObjectMapper()
      .registerModule(DefaultScalaModule)
      .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
      .readerFor(classOf[MyModel])
    it.map(reader.readValue[MyModel])
  }
  .show()
Output:
+-----------+
|contactList|
+-----------+
| [[[1001]]]|
| [[[1002]]]|
+-----------+
Note that to get a Dataset in your code, you could use
val df2 = df1.selectExpr("CAST(value as STRING)").as[String]
instead, and then call mapPartitions on df2 as above.
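In full, that would look roughly like this (a sketch reusing the reader setup above; not tested against a live Kafka broker):

val parsed = df2.mapPartitions { it =>
  // Same per-partition ObjectReader as above
  val reader = new ObjectMapper()
    .registerModule(DefaultScalaModule)
    .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
    .readerFor(classOf[MyModel])
  it.map(reader.readValue[MyModel])
}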
Related
There are many JSON parsers for Kotlin, like Forge, Gson, JSON, Jackson... But they deserialize the JSON into a data class, meaning you need to define a data class with properties matching the JSON, and to do so for every JSON that has a different structure.
But what if you don't want to define a data class for every JSON you could have to parse?
I'd like to have a parser which wouldn't use data classes, for example it could be something like:
val jsonstring = """{"a": "b", "c": {"d": "e"}}"""
parse(jsonstring).get("c").get("d") // -> "e"
Just something that doesn't require me to write a data class like
data class DataClass (
val a: String,
val c: AnotherDataClass
)
data class AnotherDataClass (
val d: String
)
which is very heavy and not useful for my use case.
Does such a library exist? Thanks!
With kotlinx.serialization you can parse a JSON string into a JsonElement and walk it without any data class:
val json: Map<String, JsonElement> = Json.parseToJsonElement(jsonstring).jsonObject
val e = json["c"]!!.jsonObject["d"]!!.jsonPrimitive.content // "e"
You can use JsonPath
val json = """{"a": "b", "c": {"d": "e"}}"""
val context = JsonPath.parse(json)
val str = context.read<String>("c.d")
println(str)
Output:
e
I am using Scala to parse JSON with a structure like this:
{
"root": {
"metadata": {
"name": "Farmer John",
"hasTractor": false
},
"plants": {
"corn": 137.137,
"soy": 0.45
},
"animals": {
"cow": 4,
"sheep": 12,
"pig": 1
}
}
}
And I am currently using the org.json library to parse it, like this:
val jsonObject = new JSONObject(jsonString) // Where jsonString is the above json tree
But when I run something like jsonObject.get("root.metadata.name"), I get the error:
JSONObject["root.metadata.name"] not found.
org.json.JSONException: JSONObject["root.metadata.name"] not found.
I suspect I could get the objects one at a time by splitting up that path, but that sounds tedious, and I assume a better JSON library already exists. Is there a way to easily get the data the way I am trying to, or is there a better library that works well with Scala?
The JSONObject you are using is deprecated. The deprecation message sends you to The Scala Library Index. I'll demonstrate how this can be done with play-json. Assuming the JSON above is stored in jsonString, we can do:
import play.api.libs.json._

val json = Json.parse(jsonString)
val path = JsPath \ "root" \ "metadata" \ "name"
path(json) match {
  case Nil =>
    ??? // path does not exist
  case values =>
    println(s"Values are: $values")
}
Code run at Scastie.
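If you only need the single value, there is a shorter form with the same play-json API (\ returns a lookup result, and asOpt yields None when the path is missing):

val name: Option[String] = (Json.parse(jsonString) \ "root" \ "metadata" \ "name").asOpt[String]
println(name) // Some(Farmer John)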
Looks like I was able to solve it using JsonPath like this:
import com.jayway.jsonpath.{Configuration, JsonPath}

// Parse once, then query with a $-rooted path
val document = Configuration.defaultConfiguration().jsonProvider().parse(jsonString): Object
val value = JsonPath.read(document, "$.root.metadata.name"): Any
I'm trying to load a JSON file containing multiple JSON documents into MongoDB using Spark. All I want is to create a field _id and set its value to one of the JSON field values.
Say I have a JSON doc like this,
{
recordId: 123,
firstName: "abc",
lastName: "xyz"
}
I want to write this into MongoDB, setting the _id value to the recordId value, in the following format:
{
_id: 123,
recordId: 123,
firstName: "abc",
lastName: "xyz"
}
I was able to achieve the same with Elasticsearch by setting the following property:
option("es.mapping.id", "recordId")
For Mongo, I tried the following and it doesn't seem to work:
val df = spark.read
.format("json")
.load(dataFile)
df = df.withColumn("_id",df["recordId"])
df.write
.format("com.mongodb.spark.sql.DefaultSource")
.option("spark.mongodb.output.uri", URI)
.mode("append")
.save()
Any help to achieve this will be appreciated. Thanks
Doing it this way actually worked:
var df = spark.read
  .format("json")
  .load(dataFile)
df = df.withColumn("_id", df("recordId"))
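For completeness, the whole pipeline with that fix applied, written with a chained withColumn so no reassignment is needed (same write options as in the question):

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("json")
  .load(dataFile)
  .withColumn("_id", col("recordId"))

df.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("spark.mongodb.output.uri", URI)
  .mode("append")
  .save()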
I am trying to convert a Spark Dataset into JSON. I tried the toJSON method but it's not much help.
I have a Dataset which looks like this:
+--------------------+-----+
|          ord_status|count|
+--------------------+-----+
|             Fallout| 3374|
|         Flowthrough|12083|
|         In-Progress| 3804|
+--------------------+-----+
I am trying to convert it to JSON like this:
"overallCounts": {
"flowthrough": 2148,
"fallout": 4233,
"inprogress": 1300
}
My question is: is there any way we can parse the column values side by side and show them as JSON?
Update: I converted the Dataset into the given JSON format by converting it into a list, then parsing each value and putting it into a string. That's a lot of manual work, though. Are there any built-in methods that can convert Datasets into such a JSON format?
Please find the solution below. The Dataset has to be iterated with mapPartitions to generate a string containing only the JSON elements.
import spark.implicits._

val list = List(("Fallout", 3374), ("Flowthrough", 12083), ("In-Progress", 3804))
val ds = list.toDS()
ds.show

// Format each (status, count) pair as a `"key" : value` JSON member
val rows = ds.mapPartitions(itr => {
  val template = """"%s" : %d"""
  Iterator(itr.map(ele => template.format(ele._1, ele._2)).mkString(",\n"))
})

// Join the per-partition fragments (dropping empty partitions) and wrap them
val text = rows.collect().filter(_.nonEmpty).mkString(",\n")
val finalJson =
  """|"overallCounts": {
     |  %s
     |}""".stripMargin
println(finalJson.format(text))
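If you are on Spark 2.4 or later, a more built-in route is possible. Here is a sketch using collect_list, map_from_entries and to_json (the column names are assumptions matching the question's table):

import org.apache.spark.sql.functions._

// Collect all (status, count) pairs into a single map column and serialize it
val json = ds.toDF("ord_status", "count")
  .agg(to_json(map_from_entries(collect_list(struct($"ord_status", $"count")))))
  .as[String]
  .first()

println(""""overallCounts": """ + json)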
I've been asked to parse a JSON file to get all the buses that are over a speed specified by the user.
The JSON file can be downloaded here
It's like this:
{
"COLUMNS": [
"DATAHORA",
"ORDEM",
"LINHA",
"LATITUDE",
"LONGITUDE",
"VELOCIDADE"
],
"DATA": [
[
"04-16-2015 00:00:55",
"B63099",
"",
-22.7931,
-43.2943,
0
],
[
"04-16-2015 00:01:02",
"C44503",
781,
-22.853649,
-43.37616,
25
],
[
"04-16-2015 00:11:40",
"B63067",
"",
-22.7925,
-43.2945,
0
]
]
}
The thing is: I'm really new to Scala and I have never worked with JSON before (shame on me). What I need is to get the "ORDEM", "LINHA" and "VELOCIDADE" fields from the DATA node.
I created a case class to encapsulate the data, so that I can later look for the buses that are over the specified speed.
case class Bus(ordem: String, linha: Int, velocidade: Int)
I did this by reading the file as a textFile and splitting it. But that way, I need to know the content of the file in advance in order to reach the lines after the DATA node.
I want to know how to do this using a JSON parser. I've tried many solutions, but I couldn't adapt them to my problem, because I need to extract all the rows from the DATA node rather than nodes nested inside a single node.
Can anyone help me?
PS: Sorry for my English, I am not a native speaker.
First of all, you need to understand the different JSON data types. The basic types in JSON are numbers, strings, booleans, arrays, and objects. The data returned in your example is an object with two keys: COLUMNS and DATA. The COLUMNS key has a value that is an array of strings. The DATA key has a value that is an array of arrays whose elements are a mix of strings and numbers.
You can use a library like Play JSON to work with this type of data:
import play.api.libs.json._

val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]

// Zip each row with the column names, then pull out the fields we need;
// a row whose fields fail to convert yields None
val busses = values.map(valueList => {
  val keyValues = (keys zip valueList).toMap
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int]
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield Bus(ordem, linha, velocidade)
})
Note the use of asOpt when converting the properties to the expected types. This method converts the value to the requested type if possible (wrapped in Some) and returns None otherwise. So, if you want to provide a default value instead of dropping failed conversions, you could use keyValues("LINHA").asOpt[Int].getOrElse(0), for example.
You can read more about the Play JSON methods used here, like \, as, and asOpt, in their docs.
You can use Spark SQL to achieve this. Refer to the section on JSON datasets here.
In essence, use the Spark APIs to load the JSON and register it as a temp table.
You can then run SQL queries on the table.
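A sketch of that approach (assuming Spark 2.2+, where the multiLine option lets Spark read a file whose entire content is one JSON object; the file name is taken from the follow-up answer below):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("buses").getOrCreate()
import spark.implicits._

// The whole file is a single JSON object, hence multiLine
val df = spark.read.option("multiLine", true).json("./project/rj_onibus_gps.json")

// DATA is inferred as array<array<string>>: explode into one row per bus,
// picking the fields by their position in COLUMNS
val buses = df
  .select(explode($"DATA").as("r"))
  .select($"r"(1).as("ordem"), $"r"(2).as("linha"), $"r"(5).cast("int").as("velocidade"))

buses.createOrReplaceTempView("buses")
spark.sql("SELECT * FROM buses WHERE velocidade > 20").show()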
As seen in @Ben Reich's answer, that code works great. Thank you very much.
However, my JSON had some type problems with "LINHA". As can be seen in the JSON example in the question, it holds both "" and numbers, e.g., 781.
When trying keyValues("LINHA").asOpt[Int].getOrElse(0), it was producing an error saying that value flatMap is not a member of Int.
So I had to change some things:
import scala.io.Source.fromFile
import play.api.libs.json._

case class BusContainer(ordem: String, linha: String, velocidade: Int)

val jsonString = fromFile("./project/rj_onibus_gps.json").getLines.mkString
val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]

val buses = values.map(valueList => {
  val keyValues = (keys zip valueList).toMap
  println(keyValues("ORDEM"), keyValues("LINHA"), keyValues("VELOCIDADE"))
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    // LINHA is sometimes a number and sometimes a string, so try both
    linha <- keyValues("LINHA").asOpt[Int].orElse(keyValues("LINHA").asOpt[String])
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield BusContainer(ordem, linha.toString, velocidade)
})
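With that in place, filtering by the user's speed is straightforward (speedLimit below is a stand-in for the user's input):

val speedLimit = 20 // hypothetical user input
val fast = buses.flatten.filter(_.velocidade > speedLimit)
fast.foreach(println)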
Thanks for the help!