I'm reading an .avro file where the data in a particular column is in binary format. I'm currently converting the binary to a string with the help of a UDF for readability, and then I will need to convert it into JSON format for further parsing. Is there a way I can convert a string object to JSON using Spark Scala code?
Any help would be much appreciated.
val avroDF = spark.read.format("com.databricks.spark.avro")
  .load("file:///C:/46.avro")
import org.apache.spark.sql.functions.udf

// Convert the byte array to a String
val toStringDF = udf((x: Array[Byte]) => new String(x))

val newDF = avroDF.withColumn("BODY", toStringDF(avroDF("body"))).select("BODY")
Output of newDF is shown below:

+---------------------------------------------------------------------------------------------------------------+
|BODY                                                                                                           |
+---------------------------------------------------------------------------------------------------------------+
|{"VIN":"FU74HZ501740XXXXX","MSG_TYPE":"SIGNAL","TT":0,"RPM":[{"E":1566800008672,"V":1073.75},{"E":1566800002538,"V":1003.625},{"E":1566800004084,"V":1121.75}
My desired output is the JSON flattened into columns: VIN, MSG_TYPE, and TT as top-level columns, with the RPM array exploded into one row per {E, V} pair.
I do not know if you want a generic solution, but in your particular case you can code something like this (spark.read.json accepts a Dataset[String] directly and infers the schema from the JSON itself):
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

spark.read.json(newDF.as[String])
  .withColumn("RPM", explode(col("RPM")))
  .withColumn("E", col("RPM.E"))
  .withColumn("V", col("RPM.V"))
  .drop("RPM")
  .show()
I'm having trouble with JSON conversion within PySpark when working with complex nested-struct columns. The schema argument for from_json doesn't seem to behave as expected. Example:
import pyspark.sql.functions as f
df = spark.createDataFrame([[1,'a'],[2,'b'],[3,'c']], ['rownum','rowchar'])\
.withColumn('struct', f.expr("transform(array(1,2,3), i -> named_struct('a1',rownum*i,'a2',rownum*i*2))"))
df.display()
df.withColumn('struct', f.to_json('struct')) \
  .withColumn('struct', f.from_json('struct', df.schema['struct'])) \
  .display()

df.withColumn('struct', f.to_json('struct')) \
  .withColumn('struct', f.from_json('struct', df.select('struct').schema)) \
  .display()
Both fail with
Cannot parse the schema in JSON format: Failed to convert the JSON string (big JSON string) to a data type
Not sure if this is a syntax error on my end, an edge case that's failing, the wrong way to do things, or something else.
You're not passing the correct schema to from_json: df.schema["struct"] is a StructField, while from_json expects a DataType, hence the .dataType accessor. Try this instead:
df.withColumn('struct', f.to_json('struct')) \
.withColumn('struct', f.from_json('struct', df.schema["struct"].dataType)) \
.display()
I have a text file with JSON values, and this gets read into a DataFrame:
{"name":"Michael"}
{"name":"Andy", "age":30}
I want to infer the schema dynamically for each line while streaming, and store the rows in separate locations (tables) depending on their schema.
Unfortunately, when I try to read value.schema it still shows as String. Please help on how to do this in Structured Streaming, as RDDs are not allowed in streaming.
I wanted to use the following code, which doesn't work as the value is still read in String format:
val jsonSchema = newdf1.select("value").as[String].schema
val df1 = newdf1.select(from_json($"value", jsonSchema).alias("value_new"))
val df2 = df1.select("value_new.*")
I even tried to use schema_of_json:

val jsonSchema: String = newdf.select(schema_of_json(col("value".toString))).as[String].first()
Still no luck. Please help.
You can load the data as a text file, create a Person case class, and parse every JSON string into a Person instance using json4s or gson, then create the DataFrame as follows:
import spark.implicits._
case class Person(name: String, age: Int)

val jsons = spark.read.textFile("/my/input")
// 'toPerson' is a placeholder: parse with json4s or gson (see the sketch below)
val persons = jsons.map(json => toPerson(json))
val df = persons.toDF()
Deserialize json to case class using json4s:
https://commitlogs.com/2017/01/14/serialize-deserialize-json-with-json4s-in-scala/
Deserialize json to case class using gson:
https://alvinalexander.com/source-code/scala/scala-case-class-gson-json-object-deserialization-and-scalatra
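For reference, a minimal sketch of what toPerson could look like with json4s (assuming the Person case class above; json4s-native is one backend among several). Note that for the sample data, where age can be missing, the field should be declared as Option[Int] to avoid a MappingException:

import org.json4s._
import org.json4s.native.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// Parse one JSON line into a Person; extract uses the implicit formats
def toPerson(json: String): Person =
  parse(json).extract[Person]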
I have a JSON file with some lines like:
"updatedAt" : ISODate("2018-11-20T09:32:16.732+0000"),
I tried json.loads but it raises json.decoder.JSONDecodeError: Expecting value: line 2 column 13 (char 15).
I believe the problem is the ISODate(...), but how can I handle that in Python?
Many thanks
This is not valid JSON, to begin with. I guess the ISODate("...") is generated by MongoDB, maybe by dumping the ISODate() helper directly instead of its string representation into the JSON?
In any case, you could use a regex on the whole JSON string to get rid of the ISODate("...") wrapper, retrieve the date as a string, and then use python-dateutil to parse the value into a datetime.datetime.
Something to the tune of:

import json
import re
import dateutil.parser

json_str = ....
# Strip the ISODate(...) wrapper, keeping only the quoted date string
clean_json = re.compile(r'ISODate\(("[^"]+")\)').sub(r'\1', json_str)
json_obj = json.loads(clean_json)
# then use dateutil.parser.parse(s) to parse each date string into a datetime.datetime
I am using Play Framework and I am trying to convert a Scala object to a JSON string.
Here is my code where I get my object:
val profile: Future[List[Profile]] = profiledao.getprofile(profileId);
The object is now in the profile value.
Now I want to convert that profile object, which is a Future[List[Profile]], into JSON data, convert that data into a JSON string, and then write it to a file.
Here is the code that I wrote so far:
val jsondata = Json.toJson(profile)
Jackson.toJsonString(jsondata)
This is how I am trying to convert to JSON, but it gives me the following output:
{"empty":false,"traversableAgain":true}
I am using the Jackson library to do the conversion.
Can someone help me with this ?
Why bother with Jackson? If you're using Play, you have play-json available to you, which uses Jackson under the hood, FWIW.
First, you need an implicit Writes to let play-json know how to serialize Profile. If Profile is a case class, you can get both a Reads and a Writes at once with Json.format:
import play.api.libs.json._
implicit val profileFormat = Json.format[Profile]
If not, define your own Writes like this.
Then, since getprofile (which should follow convention and be getProfile) returns Future[List[Profile]], you can do this to get a Future[JsValue]:

val profilesJson = profiledao.getprofile(profileId).map(Json.toJson(_))
(profiledao should also be profileDao.)
In the end, you can wrap this in a Result like Ok and return that from your controller.
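Putting it all together, a minimal sketch (the Profile fields, the ProfileDao trait, and the controller shape here are assumptions for illustration, not the asker's actual code):

import play.api.libs.json._
import play.api.mvc._
import scala.concurrent.{ExecutionContext, Future}

case class Profile(id: Long, name: String)
object Profile {
  implicit val profileFormat: OFormat[Profile] = Json.format[Profile]
}

// Assumed interface for the asker's profiledao
trait ProfileDao {
  def getprofile(profileId: Long): Future[List[Profile]]
}

class ProfileController(cc: ControllerComponents, profiledao: ProfileDao)(
    implicit ec: ExecutionContext
) extends AbstractController(cc) {

  // Map the Future[List[Profile]] to JSON and wrap it in an Ok result
  def show(profileId: Long): Action[AnyContent] = Action.async {
    profiledao.getprofile(profileId).map(profiles => Ok(Json.toJson(profiles)))
  }
}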
What is the fastest way to convert this
{"a":"ab","b":"cd","c":"cd","d":"de","e":"ef","f":"fg"}
into a mutable map in Scala? I read this input string from a ~500 MB file, which is why I'm concerned about speed.
If your JSON is as simple as in your example, i.e. a flat sequence of key/value pairs where each value is a string, you can do it in plain Scala (note that this naive splitting breaks if any key or value contains a comma or colon):
import scala.collection.mutable

val pairs = myString.substring(1, myString.length - 1)
  .split(",")
  .map(_.split(":"))
  .map { case Array(k, v) => (k.substring(1, k.length - 1), v.substring(1, v.length - 1)) }

// The question asks for a mutable map, so build one from the pairs
val map = mutable.Map(pairs: _*)
That looks like a JSON file, as Andrey says. You should consider this answer. It gives some example Scala code. Also, this answer gives some different JSON libraries and their relative merits.
The fastest way to read tree data structures in XML or JSON is by using a streaming API: Jackson Streaming API To Read And Write JSON.
Streaming splits your input into tokens like 'beginning of an object' or 'beginning of an array', and you then need to build a parser on top of these tokens, which in some cases is not a trivial task.
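For the flat object in the question, though, the token-level parser stays simple. A minimal sketch, assuming jackson-core is on the classpath (parseFlatObject is a name made up here for illustration):

import com.fasterxml.jackson.core.{JsonFactory, JsonToken}
import scala.collection.mutable

def parseFlatObject(json: String): mutable.Map[String, String] = {
  val map = mutable.Map.empty[String, String]
  val parser = new JsonFactory().createParser(json)
  try {
    parser.nextToken() // consume START_OBJECT
    // iterate over field-name/value token pairs until END_OBJECT
    while (parser.nextToken() == JsonToken.FIELD_NAME) {
      val key = parser.getCurrentName
      parser.nextToken() // advance to the value token
      map(key) = parser.getText
    }
  } finally parser.close()
  map
}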
Keeping it simple: if you are reading a JSON string from a file and converting it to a Scala map, you can use spray-json:
import scala.io.Source
import spray.json._
import DefaultJsonProtocol._

val jsonStr = Source.fromFile(jsonFilePath).mkString
val jsonDoc = jsonStr.parseJson
val mapDoc = jsonDoc.convertTo[Map[String, JsValue]]
// Get a value by key
val keyValue = mapDoc("key").convertTo[String]
// If the value is nested JSON, convert it to a map as well
val keyMap = mapDoc("nested_key").convertTo[Map[String, JsValue]]
println("Nested Value " + keyMap("key"))