How to handle dynamic fields at the Scala end - JSON

The scenario is as follows:
We are pulling data from MongoDB in JSON format and processing it through Spark.
At times the desired field is missing inside a complex datatype, e.g. a nested array of strings, or a struct within an array.
1. Is there any workaround while loading the JSON file to put null values into the absent field (validator checks)?
2. If we want to handle the dynamic nature at the Scala end, how should it be done?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

def checkAvailableColumns(df: DataFrame, expectedColumnsInput: List[String]): DataFrame =
  expectedColumnsInput.foldLeft(df) { (df, column) =>
    if (!df.columns.contains(column))
      // add the missing column as a real NULL, not the string "null"
      df.withColumn(column, lit(null).cast(StringType))
    else
      df
  }
I am using the above code to verify whether the columns present on the source side match the required column names, and to put null into any column that is missing.
The question is how to get a complex data type, like an array of structs, into a plain column name so that I can compare it.
(I can use the dot operator to pull a column out of a struct, but if that column doesn't exist my script will fail.)
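One hedged workaround for the nested case (an illustrative sketch, not part of the original post): flatten the DataFrame schema into dot-separated paths before comparing, so a missing struct field can be detected without selecting it and failing.
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

// List every field, including fields nested in structs and in arrays of
// structs, as a dot path such as "meta.value", so the paths can be compared
// against the expected column names.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType               => path +: flattenSchema(st, path)
      case ArrayType(st: StructType, _) => path +: flattenSchema(st, path)
      case _                            => Seq(path)
    }
  }

// flattenSchema(df.schema).contains("meta.value") tells you whether
// df.select("meta.value") would succeed before you attempt it.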

Take a look at the Scala Option class. Let's assume you have a
case class JsonTemplate(optionalArray: Option[Seq[String]])
If you then receive the valid JSON {}, the parser will put None as the value, and you will get the instance JsonTemplate(None).
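A minimal illustration of that behaviour (assuming the play-json library, since the question does not name its JSON parser):
import play.api.libs.json._

case class JsonTemplate(optionalArray: Option[Seq[String]])

object JsonTemplate {
  // the generated Reads treats a missing key as None rather than as an error
  implicit val reads: Reads[JsonTemplate] = Json.reads[JsonTemplate]
}

Json.parse("{}").as[JsonTemplate]                                // JsonTemplate(None)
Json.parse("""{"optionalArray": ["a", "b"]}""").as[JsonTemplate] // JsonTemplate(Some(List(a, b)))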

convert nested json string column into map type column in spark

overall aim
I have data landing in blob storage from an Azure service in the form of JSON files, where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns, which can later be queried downstream using the dot notation columnName.key.
data nesting visualized
{
  "key1": "value1",
  "nestedType1": {
    "key1": "value1",
    "keyN": "valueN"
  },
  "nestedType2": {
    "key1": "value1",
    "nestedKey": {
      "key1": "value1",
      "keyN": "valueN"
    }
  },
  "keyN": "valueN"
}
current approach and problem
I am not using the default Spark JSON reader because it results in some incorrect parsing of the files. Instead I load the files as text files and parse them with UDFs that use Python's json module (e.g. below), after which I use explode and pivot to get the first level of keys into columns.
import json
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def get_key_val(x):
    try:
        return json.loads(x)
    except Exception:
        return None
After this initial transformation I now need to convert the nestedType columns to valid map types as well. Since the initial function returns map&lt;string,string&gt;, the values in the nestedType columns are no longer valid JSON, so I cannot use json.loads; instead I have regex-based string operations:
import re
from pyspark.sql.functions import udf

@udf('MAP<STRING,STRING>')
def convert_map(string):
    try:
        regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
        return dict((a.split('=')[0].strip(), a.split('=')[1]) for a in regex.findall(string))
    except Exception:
        return None
This works for the second level of nesting, but going any deeper would require yet another UDF and further complications.
question
How can I use a Spark UDF or native Spark functions to parse the nested JSON data so that it is queryable in columnName.key format?
There is no restriction on the Spark version. Hopefully I was able to explain this properly; let me know if you want me to add some sample data and code. Any help is appreciated.
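One direction that avoids stacking UDFs (a sketch of mine, written in Scala like the rest of this page; from_json and schema_of_json have PySpark equivalents and require Spark 2.4+): infer the nested schema from one sample line and let from_json build genuine struct columns.
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Assumes the raw lines were loaded with spark.read.text(...), so the JSON
// string sits in a column named "value"; rawDf and that column name are
// assumptions, not from the original post.
import spark.implicits._
val sample = rawDf.select(col("value")).as[String].head()

val parsed = rawDf
  .select(from_json(col("value"), schema_of_json(lit(sample))).alias("j"))
  .select("j.*")

// Nested keys are now real struct fields, queryable with dot notation,
// e.g. parsed.select("nestedType2.nestedKey.key1").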

How to retrieve multiple json fields from json string using get_json_object

I am working in the Scala programming language. I have a big JSON string and I want to extract some fields from it based on their paths, driven by an input collection Seq[Field]. So I need to call the function get_json_object multiple times. How can I do that with get_json_object? Is there any other way to achieve this?
When I do
var json: JSONObject = null
val fields = Seq[Field]() // this has the field name and path
var fieldMap = Map[String, String]()
for (field <- fields) {
  fieldMap += (field.Field -> get_json_object(data, field.Path))
}
I get Type mismatch, expected: Column, actual: String
And when I use lit(data) in the above code, e.g.
get_json_object(lit(data), field.Path)
then I get
found : org.apache.spark.sql.Column
[ERROR] required: String
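get_json_object from org.apache.spark.sql.functions is a Column expression, so it has to run inside a DataFrame transformation rather than on a plain Scala String. A hedged sketch (df and its string column "data" are assumptions; Field, field.Field and field.Path come from the question):
import org.apache.spark.sql.functions.{col, get_json_object}

// Build one Column per requested path and extract them all in a single select.
val extracted = df.select(
  fields.map(f => get_json_object(col("data"), f.Path).alias(f.Field)): _*
)
If the JSON really is a plain String outside of any DataFrame, a standalone JSON library such as play-json or json4s is usually a better fit than Spark's get_json_object.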

parsing json with different schema with Play json

I've got to parse a list of JSON messages. I am using Play JSON.
All messages have a similar structure, and at a high level may be represented as
case class JMessage(
  event: String,
  messageType: String,
  data: JsValue // variable data structure
)
data may hold entries of different types (Double, String, Int), so I can't go with a Map.
Currently there are at least three different structures for data. The structure of data can be identified from messageType.
So far I've created three case classes, each representing one structure of data, along with implicit Reads for them, plus a fourth case class for the result with some Option-al fields. So basically I need to map the various JSON messages to some output format.
The approach I'm currently using is:
messages.map(Json.parse(_)).map(_.as[JMessage]).map { elem =>
  if (elem.messageType == "event") {
    Some(parseMessageOfTypeEvent(elem.data))
  } else if (...) { // similar branches for the other message types
    Some(...)
  } else {
    None
  }
}.filter(_.nonEmpty)
The parseMessageOfType%type% functions are basically (v: type) => JsValue.
So in the end I have 4 case classes and 3 parsing functions. It works, but it is ugly.
Is there a more elegant Scala way to do it?
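One common cleanup (a sketch of mine, not from the thread; EventData, StatsData, ErrorData and toResult are placeholder names for the asker's three data case classes, result case class and conversion functions): match on messageType once and return an Option, which collapses the if/else chain and the trailing filter.
import play.api.libs.json._

// Placeholder names stand in for the three data case classes and the
// fourth result case class described in the question.
val results: Seq[Result] = messages
  .map(Json.parse(_))
  .map(_.as[JMessage])
  .flatMap { msg =>
    msg.messageType match {
      case "event" => msg.data.asOpt[EventData].map(toResult)
      case "stats" => msg.data.asOpt[StatsData].map(toResult)
      case "error" => msg.data.asOpt[ErrorData].map(toResult)
      case _       => None
    }
  }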

Merging dynamic field with type date/string triggered conflict

I'm uploading JSON files to my Elasticsearch server and I have an object "meta" with a field name and a field value. Sometimes value is a string and sometimes a date, so the dynamic mapping doesn't work.
I tried to put in an explicit mapping to set the field to string, but I always get the same error: "Merging dynamic updates triggered a conflict: mapper [customer.meta.value] of different type, current_type [string], merged_type [date]"
Can I use the parameter "ignore_conflict", or how else can I index a multi-type field?
Thanks
You cannot have two data types for the same field in Elasticsearch; it is not possible to index it that way. Dynamic mapping means that the type is determined from the first value inserted into the field, and inserting a value of another type into that field raises an error. If you need to store both strings and dates, your best bet is to map the field as string and explicitly convert your dates to strings before passing them to Elasticsearch.
I disabled date_detection for _default_ and that works.
Now my problem is the following: I want to disable date_detection only for meta.value and customer.meta.value. It works for the first, but not for the second, I think because it's a nested object.
I tried this:
curl -XPUT 'localhost:9200/rr_sa' -d '
{
  "mappings": {
    "meta": {
      "date_detection": false
    },
    "customer.meta": {
      "date_detection": false
    }
  }
}
'

Parsing nodes of JSON with Scala

I've been asked to parse a JSON file to get all the buses that are over a speed specified by the user.
The JSON file can be downloaded here
It's like this:
{
"COLUMNS": [
"DATAHORA",
"ORDEM",
"LINHA",
"LATITUDE",
"LONGITUDE",
"VELOCIDADE"
],
"DATA": [
[
"04-16-2015 00:00:55",
"B63099",
"",
-22.7931,
-43.2943,
0
],
[
"04-16-2015 00:01:02",
"C44503",
781,
-22.853649,
-43.37616,
25
],
[
"04-16-2015 00:11:40",
"B63067",
"",
-22.7925,
-43.2945,
0
],
]
}
The thing is: I'm really new to Scala and I have never worked with JSON before (shame on me). What I need is to get the "Ordem", "Linha" and "Velocidade" values from the DATA node.
I created a case class to hold the data, so that I can later look for the buses that are over the specified speed.
case class Bus(ordem: String, linha: Int, velocidade: Int)
I did this by reading the file as a textFile and splitting it, but that way I need to know the content of the file in advance in order to skip to the lines after the DATA node.
I want to know how to do this with a JSON parser. I've tried many solutions, but I couldn't adapt them to my problem, because I need to extract all the rows from the DATA node instead of nodes nested inside one node.
Can anyone help me?
PS: Sorry for my english, not a native speaker.
First of all, you need to understand the different JSON data types. The basic types in JSON are numbers, strings, booleans, arrays, and objects. The data in your example is an object with two keys: COLUMNS and DATA. The COLUMNS key has a value that is an array of strings. The DATA key has a value that is an array of arrays of strings and numbers.
You can use a library like PlayJSON to work with this type of data:
val js = Json.parse(x).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val busses = values.map { valueList =>
  val keyValues = (keys zip valueList).toMap
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int]
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield Bus(ordem, linha, velocidade)
}
Note the use of asOpt when converting the properties to the expected types. This method converts the value to the requested type if possible (wrapped in Some) and returns None otherwise. So, if you want to provide a default value instead of discarding the whole result, you could use keyValues("LINHA").asOpt[Int].getOrElse(0), for example.
You can read more about the Play JSON methods used here, like \, as, and asOpt, in their docs.
You can use Spark SQL to achieve this; refer to the JSON Datasets section of the Spark SQL documentation.
In essence, use the Spark APIs to load the JSON and register it as a temp table.
You can then run your SQL queries on that table.
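A rough sketch of that route (my own illustration; it assumes Spark 2.2+ because the file is a single multi-line JSON object and so needs the multiLine option, and it reuses the file path from the answer below):
// multiLine is required because the whole file is one JSON object rather
// than one object per line, which is what spark.read.json expects by default.
val df = spark.read
  .option("multiLine", value = true)
  .json("./project/rj_onibus_gps.json")

df.createOrReplaceTempView("bus_data")

// COLUMNS and DATA become top-level array columns; the rows would still
// need to be exploded to get one record per bus.
spark.sql("SELECT DATA FROM bus_data").show(truncate = false)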
As seen in @Ben Reich's answer, that code works great. Thank you very much.
However, my JSON had some type problems with "Linha". As can be seen in the JSON example I put in the question, there are "" values as well as numbers, e.g. 781.
When trying keyValues("LINHA").asOpt[Int].getOrElse(0), I got an error saying that value flatMap is not a member of Int.
So I had to change some things:
import scala.io.Source.fromFile
import play.api.libs.json._

case class BusContainer(ordem: String, linha: String, velocidade: Int)

val jsonString = fromFile("./project/rj_onibus_gps.json").getLines.mkString
val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val buses = values.map { valueList =>
  val keyValues = (keys zip valueList).toMap
  println(keyValues("ORDEM"), keyValues("LINHA"), keyValues("VELOCIDADE"))
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    // LINHA is sometimes a number and sometimes "", so try Int first, then String
    linha <- keyValues("LINHA").asOpt[Int].orElse(keyValues("LINHA").asOpt[String])
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield BusContainer(ordem, linha.toString, velocidade)
}
Thanks for the help!