My requirement is to pass a DataFrame as an input parameter to a Scala class which saves the data in JSON format to HDFS.
The input parameter looks like this:
case class ReportA(
  parm1: String,
  parm2: String,
  parm3: Double,
  parm4: Double,
  parm5: DataFrame
)
I have created a JSON object for this parameter like:
def write(xx: ReportA) = JsObject(
  "field1" -> JsString(xx.parm1),
  "field2" -> JsString(xx.parm2),
  "field3" -> JsNumber(xx.parm3),
  "field4" -> JsNumber(xx.parm4),
  "field5" -> JsArray(xx.parm5)
)
parm5 is a DataFrame that I want to convert to a JSON array.
How can I convert the DataFrame to a JSON array?
Thank you for your help!
A DataFrame can be seen as the equivalent of a plain old table in a database, with rows and columns. You can't just get a simple array from it; the closest you would come to an array would be with the following structure:
{
  "col1": [val1, val2, ...],
  "col2": [val3, val4, ...],
  "col3": [val5, val6, ...]
}
To achieve a similar structure, you could use the toJSON method of the DataFrame API to get an RDD[String] and then call collect on it (be careful of OutOfMemory exceptions on large DataFrames).
You now have an Array[String], which you can simply transform into a JSON array, depending on the JSON library you are using.
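For instance, here is a minimal sketch assuming spray-json (which the write method in the question appears to use); dataFrameToJsArray is a hypothetical helper name, not part of any API:

import org.apache.spark.sql.DataFrame
import spray.json._

// Collect the DataFrame as JSON strings on the driver (watch memory!),
// re-parse each row, and wrap the rows in a spray-json JsArray.
def dataFrameToJsArray(df: DataFrame): JsArray =
  JsArray(df.toJSON.collect().map(_.parseJson).toVector)

The write method above could then use "field5" -> dataFrameToJsArray(xx.parm5).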
Beware though: this seems like a rather unusual way to use Spark. You generally don't collect an RDD or a DataFrame back into one of your own objects; you usually write it out to a storage solution.
overall aim
I have data landing in blob storage from an Azure service, in the form of JSON files where each line in a file is a nested JSON object. I want to process this with Spark and finally store it as a Delta table with nested struct/map type columns, which can later be queried downstream using the dot notation columnName.key.
data nesting visualized
{
  key1: value1,
  nestedType1: {
    key1: value1,
    keyN: valueN
  },
  nestedType2: {
    key1: value1,
    nestedKey: {
      key1: value1,
      keyN: valueN
    }
  },
  keyN: valueN
}
current approach and problem
I am not using the default Spark JSON reader, as it resulted in some incorrect parsing of the files. Instead, I load the files as text files and then parse them using UDFs built on Python's json module (e.g. below), after which I use explode and pivot to get the first level of keys into columns.
@udf('MAP<STRING,STRING>')
def get_key_val(x):
    try:
        return json.loads(x)
    except Exception:
        return None
After this initial transformation, I now need to convert the nestedType columns to valid map types as well. Since the initial function returns map<string,string>, the values in the nestedType columns are no longer valid JSON, so I cannot use json.loads; instead, I have regex-based string operations:
@udf('MAP<STRING,STRING>')
def convert_map(string):
    try:
        regex = re.compile(r"""\w+=.*?(?:(?=,(?!"))|(?=}))""")
        obj = dict([(a.split('=')[0].strip(), a.split('=')[1]) for a in regex.findall(string)])
        return obj
    except Exception as e:
        return e
This is fine for the second level of nesting, but going any deeper would require yet another UDF and further complications.
question
How can I use a Spark UDF or native Spark functions to parse the nested JSON data such that it is queryable in the columnName.key format?
There is no restriction on the Spark version. Hopefully I was able to explain this properly; do let me know if you want me to put up some sample data and code for ease. Any help is appreciated.
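For reference, a minimal sketch of the native-function route with from_json, assuming a known schema. It is written in Scala here (PySpark exposes the same from_json function), and rawDf plus all column names are hypothetical stand-ins for the text-loaded DataFrame:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Hypothetical schema mirroring the nesting visualized above.
val schema = new StructType()
  .add("key1", StringType)
  .add("nestedType1", new StructType()
    .add("key1", StringType)
    .add("keyN", StringType))
  .add("keyN", StringType)

// rawDf is assumed to come from spark.read.text, i.e. it has a "value" column.
val parsed = rawDf
  .withColumn("parsed", from_json(col("value"), schema))
  .select("parsed.*")

// Nested fields are now queryable with dot notation:
parsed.select("nestedType1.key1").show()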
I've been asked to parse a JSON file to get all the buses that are over a specified speed entered by the user.
The JSON file can be downloaded here
It's like this:
{
  "COLUMNS": [
    "DATAHORA",
    "ORDEM",
    "LINHA",
    "LATITUDE",
    "LONGITUDE",
    "VELOCIDADE"
  ],
  "DATA": [
    [
      "04-16-2015 00:00:55",
      "B63099",
      "",
      -22.7931,
      -43.2943,
      0
    ],
    [
      "04-16-2015 00:01:02",
      "C44503",
      781,
      -22.853649,
      -43.37616,
      25
    ],
    [
      "04-16-2015 00:11:40",
      "B63067",
      "",
      -22.7925,
      -43.2945,
      0
    ]
  ]
}
The thing is: I'm really new to Scala and I have never worked with JSON before (shame on me). What I need is to get the "ORDEM", "LINHA" and "VELOCIDADE" values from the DATA node.
I created a case class to encapsulate all the data, so as to later look for the buses that are over the specified speed.
case class Bus(ordem: String, linha: Int, velocidade: Int)
I did this by reading the file as a text file and splitting. But this way, I need to know the content of the file in advance in order to skip to the lines after the DATA node.
I want to know how to do this using a JSON parser. I've tried many solutions, but I couldn't adapt them to my problem, because I need to extract all the rows from the DATA node rather than nodes inside one node.
Can anyone help me?
PS: Sorry for my english, not a native speaker.
First of all, you need to understand the different JSON data types. The basic types in JSON are numbers, strings, booleans, arrays, and objects. The data returned in your example is an object with two keys: COLUMNS and DATA. The COLUMNS key has a value that is an array of strings. The DATA key has a value that is an array of arrays of mixed strings and numbers.
You can use a library like PlayJSON to work with this type of data:
import play.api.libs.json._

val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val busses = values.map { valueList =>
  val keyValues = (keys zip valueList).toMap
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int]
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield Bus(ordem, linha, velocidade)
}
Note the use of asOpt when converting the properties to the expected types. This method converts the value to the requested type if possible (wrapped in Some) and returns None otherwise. So, if you want to provide a default value instead of dropping unparsable rows, you could use keyValues("LINHA").asOpt[Int].getOrElse(0), for example.
You can read more about the Play JSON methods used here, like \, as, and asOpt, in their docs.
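Note also that busses above ends up as a List[Option[Bus]], since each row yields an Option. Assuming you only want the successfully parsed rows, the Nones can be dropped with flatten:

val parsedBusses: List[Bus] = busses.flatten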
You can use Spark SQL to achieve this. Refer to the section on JSON Datasets here.
In essence, use the Spark APIs to load the JSON and register it as a temp table.
You can then run your SQL queries on the table.
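A minimal sketch of that approach, assuming the file is first reshaped to one JSON object per line (which spark.read.json expects by default), e.g. {"ORDEM":"C44503","LINHA":781,"VELOCIDADE":25}, and a hypothetical speed threshold of 50:

// Load the line-delimited JSON into a DataFrame and register it for SQL.
val df = spark.read.json("path/to/buses.json")
df.createOrReplaceTempView("buses")

// Filter the buses above the given speed.
spark.sql("SELECT ORDEM, LINHA, VELOCIDADE FROM buses WHERE VELOCIDADE > 50").show()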
As seen in @Ben Reich's answer, that code works great. Thank you very much.
However, my JSON had some type problems with "LINHA". As can be seen in the JSON example that I put in the question, it holds both "" and numbers, e.g., 781.
When I tried keyValues("LINHA").asOpt[Int].getOrElse(0), it produced an error saying that value flatMap is not a member of Int.
So, I had to change some things:
case class BusContainer(ordem: String, linha: String, velocidade: Int)
import scala.io.Source.fromFile
import play.api.libs.json._

val jsonString = fromFile("./project/rj_onibus_gps.json").getLines.mkString
val js = Json.parse(jsonString).as[JsObject]
val keys = (js \ "COLUMNS").as[List[String]]
val values = (js \ "DATA").as[List[List[JsValue]]]
val buses = values.map { valueList =>
  val keyValues = (keys zip valueList).toMap
  println(keyValues("ORDEM"), keyValues("LINHA"), keyValues("VELOCIDADE"))
  for {
    ordem <- keyValues("ORDEM").asOpt[String]
    linha <- keyValues("LINHA").asOpt[Int].orElse(keyValues("LINHA").asOpt[String])
    velocidade <- keyValues("VELOCIDADE").asOpt[Int]
  } yield BusContainer(ordem, linha.toString, velocidade)
}
Thanks for the help!
What is the fastest way to convert this
{"a":"ab","b":"cd","c":"cd","d":"de","e":"ef","f":"fg"}
into a mutable map in Scala? I read this input string from a ~500 MB file; that is the reason I'm concerned about speed.
If your JSON is as simple as in your example, i.e. a sequence of key/value pairs where each value is a string, you can do it in plain Scala:
myString.substring(1, myString.length - 1)
  .split(",")
  .map(_.split(":"))
  .map { case Array(k, v) => (k.substring(1, k.length - 1), v.substring(1, v.length - 1)) }
  .toMap
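Since the question asks for a mutable map, note that toMap returns an immutable one; copying it over is cheap and simple (immutableMap below is a stand-in for the result of the chain above):

import scala.collection.mutable

// Copy the immutable result into a mutable map.
val result = mutable.Map[String, String]() ++= immutableMap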
That looks like a JSON file, as Andrey says. You should consider this answer. It gives some example Scala code. Also, this answer gives some different JSON libraries and their relative merits.
The fastest way to read tree data structures in XML or JSON is by applying a streaming API: Jackson Streaming API To Read And Write JSON.
Streaming splits your input into tokens like 'beginning of an object' or 'beginning of an array', and you then need to build a parser for these tokens, which in some cases is not a trivial task.
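For the flat string-to-string object in the question, though, the token loop stays simple. A minimal sketch, assuming Jackson 2.x on the classpath ("input.json" is a placeholder path):

import com.fasterxml.jackson.core.{JsonFactory, JsonToken}
import scala.collection.mutable

val parser = new JsonFactory().createParser(new java.io.File("input.json"))
val result = mutable.Map.empty[String, String]

parser.nextToken() // consume START_OBJECT
var token = parser.nextToken()
while (token != JsonToken.END_OBJECT) {
  val key = parser.getCurrentName // current token is a FIELD_NAME
  parser.nextToken()              // advance to the value token
  result(key) = parser.getText
  token = parser.nextToken()
}
parser.close()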
Keeping it simple. If you are reading a JSON string from a file and converting it to a Scala map:
import scala.io.Source
import spray.json._
import DefaultJsonProtocol._

val jsonStr = Source.fromFile(jsonFilePath).mkString
val jsonDoc = jsonStr.parseJson
val map_doc = jsonDoc.convertTo[Map[String, JsValue]]

// Get a value from the map by key
val key_value = map_doc("key").convertTo[String]

// If the JSON is nested, re-map the nested value.
val key_map = map_doc("nested_key").convertTo[Map[String, JsValue]]
println("Nested Value " + key_map("key"))
I want to convert a Scala list of strings, List[String], to a JSON object.
For each string in my list, I want to add it to my JSON object, so that it would look something like this:
{
  "names": [
    {
      "Bob",
      "Andrea",
      "Mike",
      "Lisa"
    }
  ]
}
How do I create a JSON object looking like this from my list of strings?
To directly answer your question, a very simplistic and hacky way to do it:
val start = """{"names":[{"""
val end = """}]}"""
val json = mylist.mkString(start, ",", end)
However, what you almost certainly want to do is pick one of the many JSON libraries out there: play-json gets some good comments, as does lift-json. At worst, you could just grab a simple JSON library for Java and use that.
Since I'm familiar with lift-json, I'll show you how to do it with that library.
import net.liftweb.json.JsonDSL._
import net.liftweb.json.JsonAST._
import net.liftweb.json.Printer._
import net.liftweb.json.JObject
val json: JObject = "names" -> List("Bob", "Andrea", "Mike", "Lisa")
println(json)
println(pretty(render(json)))
The "names" -> List(...) expression is implicitly converted by the JsonDSL; since I specified that I wanted it to result in a JObject, json is now the in-memory model of the JSON data you wanted.
pretty comes from the Printer object, and render comes from the JsonAST object. Combined, they create a String representation of your data, which looks like:
{
"names":["Bob","Andrea","Mike","Lisa"]
}
Be sure to check out the lift documentation, where you'll likely find answers to any further questions about lift's json support.
I have some strange JSON that I cannot change, and I wish to parse it using the JsonParser in Lift.
A typical json is like:
{"name":"xxx", "data":{
"data_123456":{"id":"Hello"},
"data_789901":{"id":"Hello"},
"data_987654":{"id":"Hello"},
}}
The issue is that the keys for the data are unknown (data_xxxxxx, where the x's are not known in advance).
This is bad JSON, but I have to live with it.
How am I supposed to setup case-classes in scala to be able to build a proper
structure when the keys here are unknown, but the structure is known?
You can use a Map for the unknown keys, and any value can be left as a JValue, representing unparsed JSON. Example:
case class Id(id: String)
case class Data(name: JValue, data: Map[String, Id])
And then, with implicit formats in scope:

implicit val formats = net.liftweb.json.DefaultFormats
json.extract[Data]
// res0: Data = Data(JString(xxx),Map(data_123456 -> Id(Hello), data_789901 -> Id(Hello), data_987654 -> Id(Hello)))