Reading JSON RDD using Spark Scala - json

I am receiving JSON data from Kafka brokers and I am reading it using Spark Streaming and Scala. Following is the example data:
{"timestamp":"2020-12-11 22:35:00.000000 UTC","tech":"Spark","version":2,"start_time":1607725688402210,"end_time":1607726131636059}
I receive this data as an RDD[String] in my Scala code. Now I want to read a particular key from each record, for example 'version' from the data above.
I am able to do this as follows:
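for (record <- rdd) {
  // JSON.parseFull returns Option[Any]; JSON numbers come back as Double
  val jsonRecord = JSON.parseFull(record)
  val globalMap = jsonRecord.get.asInstanceOf[Map[String, Any]]
  // casting the numeric value to String would throw a ClassCastException
  val version = globalMap("version").asInstanceOf[Double].toInt
}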
But I am not sure if this is the best way to read RDD having JSON data. Please suggest.
Thanks,

Use the json4s library to parse the JSON data. It ships with Spark by default, so there is no need to add an extra dependency.
See the code below.
scala> rdd.collect.foreach(println)
{"timestamp":"2020-12-11 22:35:00.000000 UTC","tech":"Spark","version":2,"start_time":1607725688402210,"end_time":1607726131636059}
scala> :paste
// Entering paste mode (ctrl-D to finish)
rdd.map{ row =>
// Import required libraries for json parsers.
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
// parse json message using parse function from json4s lib.
val jsonData = parse(row)
// extract required fields from parsed json data.
// extracting version field value
val version = (jsonData \\ "version").extract[Int]
// extracting timestamp field value
val timestamp = (jsonData \\ "timestamp").extract[String]
(version,timestamp)
}
.collect
.foreach(println)
// Exiting paste mode, now interpreting.
(2,2020-12-11 22:35:00.000000 UTC)
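One thing the snippet above does not cover is a malformed or incomplete record: parse throws on invalid JSON and extract throws when a field is missing. Below is a minimal sketch of a more defensive variant, assuming the same json4s dependency that Spark bundles. RecordParser and parseRecord are hypothetical names introduced for illustration, not part of json4s or Spark.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._
import scala.util.Try

object RecordParser {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical helper: yields None for malformed JSON or when a field is absent,
  // instead of failing the whole task.
  def parseRecord(row: String): Option[(Int, String)] =
    for {
      json      <- Try(parse(row)).toOption
      // `\` looks only at direct children; `\\` (used above) searches recursively
      version   <- (json \ "version").extractOpt[Int]
      timestamp <- (json \ "timestamp").extractOpt[String]
    } yield (version, timestamp)
}
```

On the RDD this could be applied as rdd.flatMap(row => RecordParser.parseRecord(row)), which silently drops the records that do not parse. Whether dropping is acceptable depends on the pipeline; counting or logging the failures may be preferable.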

Related

Is there a way to modify this code to let spark streaming read from json?

I'm working on a Spark Streaming app which continuously reads data from localhost:9098. Is there a way to replace localhost with <users/folder/path> so that it reads data, such as JSON files, from a folder automatically?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.Logger
import org.apache.log4j.Level
object StreamingApplication extends App {
// logger names are case-sensitive, so "org" (not "Org") is needed to match org.apache.*
Logger.getLogger("org").setLevel(Level.ERROR)
//creating spark streaming context
val sc = new SparkContext("local[*]", "wordCount")
val ssc = new StreamingContext(sc, Seconds(5))
// lines is a Dstream
val lines = ssc.socketTextStream("localhost", 9098)
// words is a transformed Dstream
val words = lines.flatMap(x => x.split(" "))
// bunch of transformations
val pairs = words.map(x=> (x,1))
val wordsCount = pairs.reduceByKey((x,y) => x+y)
// print is an action
wordsCount.print()
// start the streaming context
ssc.start()
ssc.awaitTermination()
}
Basically, I need help to modify code below:
val lines = ssc.socketTextStream("localhost", 9098)
to this:
val lines = ssc.socketTextStream("<folder path>")
fyi, I'm using IntelliJ Idea to build this.
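I'd recommend reading the Spark documentation, especially the scaladoc for StreamingContext.
It has a fileStream method, and for plain text files a textFileStream(directory) variant that monitors a directory for new files.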
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/streaming/StreamingContext.html
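For plain text input, the directory-based counterpart of socketTextStream is textFileStream. Below is a minimal sketch of the word count from the question rewritten that way. The object name FolderWordCount is made up for illustration, and "<folder path>" is kept as the placeholder from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FolderWordCount extends App {
  val conf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Placeholder from the question: replace with the directory to monitor
  val inputDir = "<folder path>"

  // Each batch contains the lines of files that newly appeared in inputDir
  val lines = ssc.textFileStream(inputDir)

  val wordsCount = lines
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  wordsCount.print()

  ssc.start()
  ssc.awaitTermination()
}
```

Note that only files that show up in the directory after the stream has started are processed, so files already present at startup are not read. If each line is a JSON document, the flatMap step would be replaced by a JSON parsing step.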

Is it possible to create a dataframe column with json data which doesn't have a fixed schema?

I am trying to create a dataframe column with JSON data which does not have a fixed schema. I am trying to write it in its original form as a map/object but am getting various errors.
I don't want to convert it to a string, as I need to write this data to the file in its original form.
This file is later used for JSON processing, so the original structure should not be compromised.
Currently, when I write the data to a file, it contains all the escape characters and the entire JSON is treated as a string instead of a complex type. E.g.
{"field1":"d1","field2":"app","value":"{\"data\":\"{\\\"app\\\":\\\"am\\\"}\"}"}
You could try to make up a schema for the json file.
I don't know what output you expect.
As a clue I give you an example and two interesting links:
spark-read-json-with-schema
spark-schema-explained-with-examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}
object RareJson {
val spark = SparkSession
.builder()
.appName("RareJson")
.master("local[*]")
.config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id","RareJson") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
val input = "/home/cloudera/files/tests/rare.json"
def main(args: Array[String]): Unit = {
Logger.getRootLogger.setLevel(Level.ERROR)
try {
val structureSchema = new StructType()
.add("field1",StringType)
.add("field2",StringType)
.add("value",StringType,true)
val rareJson = sqlContext
.read
.option("allowBackslashEscapingAnyCharacter", true)
.option("allowUnquotedFieldNames", true)
.option("multiLine", true)
.option("mode", "DROPMALFORMED")
.schema(structureSchema)
.json(input)
rareJson.show(truncate = false)
// To have the opportunity to view the web console of Spark: http://localhost:4041/
println("Type whatever to the console to exit......")
scala.io.StdIn.readLine()
} finally {
sc.stop()
println("SparkContext stopped")
spark.stop()
println("SparkSession stopped")
}
}
}
Output:
+------+------+---------------------------+
|field1|field2|value                      |
+------+------+---------------------------+
|d1    |app   |{"data":"{\"app\":\"am\"}"}|
+------+------+---------------------------+
You can also try to parse the value column, provided it keeps the same format across all rows.
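Parsing the value column could look roughly like the sketch below. It assumes every row has the shape shown in the output above (value is JSON text with a data field, which is itself JSON text with an app field); with a genuinely variable schema this approach does not apply. ParseValueColumn and parseValue are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object ParseValueColumn {
  // Assumed shape of the two nested JSON layers
  val valueSchema: StructType = new StructType().add("data", StringType)
  val dataSchema: StructType  = new StructType().add("app", StringType)

  // Turns the JSON text in "value" into real columns, one layer at a time
  def parseValue(df: DataFrame): DataFrame =
    df.withColumn("value", from_json(col("value"), valueSchema))
      .withColumn("data", from_json(col("value.data"), dataSchema))
      .select(col("field1"), col("field2"), col("data.app").as("app"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParseValueColumn").master("local[*]").getOrCreate()
    import spark.implicits._

    // One row shaped like the output above
    val df = Seq(("d1", "app", """{"data":"{\"app\":\"am\"}"}"""))
      .toDF("field1", "field2", "value")

    parseValue(df).show(truncate = false)
    spark.stop()
  }
}
```

Rows whose value does not match the assumed schema come out as null rather than raising an error, so it is worth checking for nulls after parsing.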

spark streaming JSON value in dataframe column scala

I have a text file with JSON values, which gets read into a DataFrame:
{"name":"Michael"}
{"name":"Andy", "age":30}
I want to infer the schema dynamically for each line while streaming and store the records in separate locations (tables) depending on their schema.
Unfortunately, when I try to read value.schema it still shows as String. Please help me do this in streaming, as RDDs are not allowed there.
I wanted to use the following code, which doesn't work because the value is still read as a String.
val jsonSchema = newdf1.select("value").as[String].schema
val df1 = newdf1.select(from_json($"value", jsonSchema).alias("value_new"))
val df2 = df1.select("value_new.*")
I even tried to use,
schema_of_json("json_schema"))
val jsonSchema: String = newdf.select(schema_of_json(col("value".toString))).as[String].first()
but that did not work either. Please help.
You can load the data with textFile, create a case class for a person, parse every JSON string into a Person instance using json4s or gson, and then create the DataFrame as follows:
import spark.implicits._ // encoders needed for Dataset[Person]
// age is optional because not every record carries it
case class Person(name: String, age: Option[Int])
val jsons = spark.read.textFile("/my/input")
// toPerson is a placeholder: parse the string with json4s or gson and return a Person
val persons = jsons.map(json => toPerson(json))
val df = persons.toDF()
Deserialize json to case class using json4s:
https://commitlogs.com/2017/01/14/serialize-deserialize-json-with-json4s-in-scala/
Deserialize json to case class using gson:
https://alvinalexander.com/source-code/scala/scala-case-class-gson-json-object-deserialization-and-scalatra
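As a rough illustration of the parsing step, here is one way a toPerson function could be written with json4s. This is a sketch under the assumption that age is modelled as Option[Int], since the first sample record has no age field. PersonParser is a hypothetical name.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class Person(name: String, age: Option[Int])

object PersonParser {
  implicit val formats: Formats = DefaultFormats

  // Builds a Person field by field; a missing "age" becomes None,
  // while a missing "name" makes extract throw.
  def toPerson(json: String): Person = {
    val parsed = parse(json)
    Person(
      name = (parsed \ "name").extract[String],
      age  = (parsed \ "age").extractOpt[Int]
    )
  }
}
```

Note that this still fixes the set of fields up front; it handles optional fields but does not infer a different schema per line.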

How to export all data from Elastic Search Index to file in JSON format with _id field specified?

I'm new to both Spark and Scala. I'm trying to read all data from a particular index in Elastic Search into a RDD and use this data to write to Mongo DB.
I'm loading the Elastic search data to a esJsonRDD and when I try to print the RDD contents, it is in the following format,
(1765770532{"FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"})
Expected format,
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
How can I get the output from Elasticsearch formatted this way?
Any help would be appreciated.
object readFromES {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("readFromES")
.set("es.nodes", Config.ES_NODES)
.set("es.nodes.wan.only", Config.ES_NODES_WAN_ONLY)
.set("es.net.http.auth.user", Config.ES_NET_HTTP_AUTH_USER)
.set("es.net.http.auth.pass", Config.ES_NET_HTTP_AUTH_PASS)
.set("es.net.ssl", Config.ES_NET_SSL)
.set("es.output.json","true")
val sc = new SparkContext(conf)
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
//RDD.coalesce(1).saveAsTextFile(args(0))
RDD.take(5).foreach(println)
}
}
I would like the RDD output to be written to a file in the following JSON Format(one line per doc),
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
{_id:"1765770533","FirstName":DEF,"LastName":"DEF",Zipcode":"35525","City":"PortWinchestor","StateCode":"AI"}
"_id" is a part of metadata, to access it you should add .config("es.read.metadata", true) to config.
Then you can access it in two ways. You can use
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
which returns (document id, JSON string) pairs, and manually add the _id field to the JSON.
Or, more easily, read the index as a DataFrame:
val df = spark.read
.format("org.elasticsearch.spark.sql")
.load("userdata/user")
.withColumn("_id", $"_metadata".getItem("_id"))
.drop("_metadata")
//Write as json in file
df.write.json("output folder ")
Here spark is the SparkSession, created as
val spark = SparkSession.builder().master("local[*]").appName("Test")
.config("spark.es.nodes","host")
.config("spark.es.port","ports")
.config("spark.es.nodes.wan.only","true")
.config("es.read.metadata", true) //for enabling metadata
.getOrCreate()
Hope this helps
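The "manually add the _id field" option mentioned above could be sketched as plain string manipulation on each (id, json) pair. AddId and withId are hypothetical names, and the sketch assumes that the second element is a JSON object and that the id contains no characters that need JSON escaping.

```scala
object AddId {
  // Hypothetical helper: prepend an "_id" field to a JSON object string.
  // Assumes `json` starts with '{' and `id` needs no escaping.
  def withId(id: String, json: String): String = {
    val body = json.trim.stripPrefix("{").trim
    if (body == "}") s"""{"_id":"$id"}"""   // the document was an empty object
    else s"""{"_id":"$id",$body"""          // splice _id in front of the existing fields
  }
}
```

With the RDD from the question this might be used as RDD.map { case (id, json) => AddId.withId(id, json) }.saveAsTextFile(args(0)), which writes one document per line. For anything beyond simple ids, building the object with a JSON library is the safer choice.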

Inserting JsNumber into Mongo

When trying to insert a MongoDBObject that contains a JsNumber
val obj: DBObject = getDbObj // contains a "JsNumber()"
collection.insert(obj)
the following error occurs:
[error] play - Cannot invoke the action, eventually got an error: java.lang.IllegalArgumentException: can't serialize class scala.math.BigDecimal
I tried to replace the JsNumber with an Int, but I got the same error.
EDIT
Error can be reproduced via this test code. Full code in scalatest (https://gist.github.com/kman007us/6617735)
val collection = MongoConnection()("test")("test")
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val q = MongoDBObject("name" -> obj)
collection.insert(q)
There are no registered handlers for Play's JSON implementation. You could add handlers that automatically translate Play's Js types to BSON types. However, that won't handle MongoDB extended JSON, which has a special structure for non-native JSON types, e.g. date and ObjectId translations.
An example that goes through the driver's own JSON parser (com.mongodb.util.JSON) instead is:
import com.mongodb.util.JSON
val obj: JsValue = Json.obj("age" -> JsNumber(100))
val doc: DBObject = JSON.parse(obj.toString).asInstanceOf[DBObject]
For an example of a bson transformer see the joda time transformer.
It seems that the Casbah driver isn't compatible with Play's JSON implementation. Looking through the Casbah code, it seems that you must use a set of MongoDBObject objects to build your query. The following snippet should work.
val collection = MongoConnection()("test")("test")
val obj = MongoDBObject("age" -> 100)
val q = MongoDBObject("name" -> obj)
collection.insert(q)
If you need the compatibility with Play's JSON implementation then use ReactiveMongo and Play-ReactiveMongo.
Edit
Maybe this Gist can help to convert JsValue objects into MongoDBObject objects.
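The error message points at scala.math.BigDecimal, which is the type a JsNumber wraps. If a manual JsValue to MongoDBObject conversion is written, the numeric leaf values need to be mapped to types the driver can serialize. A minimal sketch of just that step follows; MongoNumbers and toMongoNumber are hypothetical names, and the traversal of the surrounding JSON structure is left out.

```scala
object MongoNumbers {
  // Hypothetical helper: map a BigDecimal to Int, Long, or Double,
  // all of which the driver serializes without a custom handler.
  def toMongoNumber(n: BigDecimal): Any =
    if (n.isValidInt) n.toInt
    else if (n.isValidLong) n.toLong
    else n.toDouble // may lose precision for very large or very precise values
}
```

For example, MongoNumbers.toMongoNumber(BigDecimal(100)) yields an Int that can be placed directly in a MongoDBObject. The fallback to Double is a design choice of this sketch; values that must keep exact precision would need a different representation, such as a string.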