Converting a Dataset to a JSON array in Spark using Scala

I am new to Spark and unable to figure out a solution to the following problem.
I have a JSON file to parse, after which I create a couple of metrics and write the data back out in JSON format.
The following is the code I am using:
import org.apache.spark.sql._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions._

object quick2 {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder
      .appName("quick1")
      .master("local[*]")
      .getOrCreate()

    val rawData = spark.read.json("/home/umesh/Documents/Demo2/src/main/resources/sampleQuick.json")

    val mat1 = rawData.select(rawData("mal_name"), rawData("cust_id")).distinct().orderBy("cust_id").toJSON.cache()
    val mat2 = rawData.select(rawData("file_md5"), rawData("mal_name")).distinct().orderBy(asc("file_md5")).toJSON.cache()

    val write1 = mat1.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat1/")
    val write = mat2.coalesce(1).toJavaRDD.saveAsTextFile("/home/umesh/Documents/Demo2/src/test/mat2/")
  }
}
Now, the above code writes proper JSON output.
However, the metrics can contain duplicate results as well,
example:
md5 mal_name
1 a
1 b
2 c
3 d
3 e
So with the above code every object gets written on a single line, like this:
{"file_md5":"1","mal_name":"a"}
{"file_md5":"1","mal_name":"b"}
{"file_md5":"2","mal_name":"c"}
{"file_md5":"3","mal_name":"d"}
and so on.
But I want to combine the data for common keys, so the output should be:
{"file_md5":"1","mal_name":["a","b"]}
Can somebody please suggest what I should do here, or whether there is a better way to approach this problem?
Thanks!

You can use collect_list or collect_set on the mal_name column, as per your need.
You can save a DataFrame/Dataset directly as a JSON file:
import org.apache.spark.sql.functions.{collect_list, collect_set}
import spark.implicits._

rawData.groupBy($"file_md5")
  .agg(collect_set($"mal_name").alias("mal_name"))
  .write
  .format("json")
  .save("json/file/location/to/save")
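As a side note, collect_set drops duplicate values within each group, while collect_list keeps them. If the order of the values inside the array matters, you can additionally wrap the aggregation in sort_array. A minimal sketch, assuming the same column names as above:

import org.apache.spark.sql.functions.{collect_set, sort_array}
import spark.implicits._

// Group by the key column and collect the distinct values into a sorted array.
rawData.groupBy($"file_md5")
  .agg(sort_array(collect_set($"mal_name")).alias("mal_name"))
  .write
  .format("json")
  .save("json/file/location/to/save")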

As written by @mrsrinivas, I changed my code as below:
val mat2 = rawData.select(rawData("file_md5"), rawData("mal_name")).distinct().orderBy(asc("file_md5")).cache()
val labeledDf = mat2.toDF("file_md5", "mal_name")

labeledDf.groupBy($"file_md5")
  .agg(collect_list($"mal_name"))
  .coalesce(1)
  .write
  .format("json")
  .save("/home/umesh/Documents/Demo2/src/test/run8/")
Keeping this question open for some more suggestions, if any.

Related

Spark | Could not create FileClient | read json | scala

I am trying to read a JSON file on a local Windows machine using Spark and Scala. I have tried the following:
import org.apache.spark.sql.SparkSession

object JsonTry extends App {
  System.setProperty("hadoop.home.dir", "C:\\winutils")

  val sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("some-config", "some-value")
    .appName("App Name")
    .getOrCreate()

  val res = sparkSession.read.json("./src/main/resources/test.json")
  res.printSchema()
}
The JSON file, which is under the resources folder, looks like this:
{"name":"Some name"}
But I am getting an exception when I run this main class:
Exception in thread "main" java.io.IOException: Could not create FileClient
(Screenshot attached.)
To my surprise, the following piece of code works, but I am looking to read the JSON from a file directly:
val res = sparkSession.read.option("multiline", true).json(sparkSession.sparkContext.parallelize(Seq("{\"name\":\"name\"}")))
Please let me know what is causing this issue, as I have not found any solution.
I tried to read a JSON file in a similar way but I didn't face any problem.
You may try this too:
import org.apache.spark.sql.SparkSession

object myTest extends App {
  val spark: SparkSession = SparkSession.builder()
    .appName("MyTest")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val jsonDataDF = spark.read.option("multiline", "true").json("/Users/gp/Desktop/temp/test.json")
  jsonDataDF.show()
}
(Output and input JSON data screenshots attached.)
Do let me know whether I understood your question properly.

Is there a way to modify this code to let spark streaming read from json?

I'm working on a Spark Streaming app which continuously reads data from localhost port 9098. Is there a way to change localhost to <users/folder/path>, so that it reads data from a folder path or JSON automatically?
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.log4j.{Level, Logger}

object StreamingApplication extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  // creating spark streaming context
  val sc = new SparkContext("local[*]", "wordCount")
  val ssc = new StreamingContext(sc, Seconds(5))

  // lines is a DStream
  val lines = ssc.socketTextStream("localhost", 9098)

  // words is a transformed DStream
  val words = lines.flatMap(x => x.split(" "))

  // bunch of transformations
  val pairs = words.map(x => (x, 1))
  val wordsCount = pairs.reduceByKey((x, y) => x + y)

  // print is an action
  wordsCount.print()

  // start the streaming context
  ssc.start()
  ssc.awaitTermination()
}
Basically, I need help modifying the code below:
val lines = ssc.socketTextStream("localhost", 9098)
to this:
val lines = ssc.socketTextStream("<folder path>")
FYI, I'm using IntelliJ IDEA to build this.
I'd recommend reading the Spark documentation, especially the Scaladoc.
There seems to be a fileStream method:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/streaming/StreamingContext.html
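For files that are dropped into a directory as plain text or JSON lines, textFileStream (essentially a convenience wrapper around fileStream) is usually the simplest starting point. A minimal sketch, with the directory path as a placeholder:

// Watch a directory for newly created files; each file's lines arrive as a DStream[String].
val lines = ssc.textFileStream("/users/folder/path")
// The rest of the word-count pipeline stays the same.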

Reading JSON RDD using Spark Scala

I am receiving JSON data from Kafka brokers and I am reading it using Spark Streaming and Scala. The following is example data:
{"timestamp":"2020-12-11 22:35:00.000000 UTC","tech":"Spark","version":2,"start_time":1607725688402210,"end_time":1607726131636059}
I receive this data as an RDD[String] in my Scala code. Now I want to read a particular key from each data row, for example 'version' from the data above.
I am able to do this as follows:
for (record <- rdd) {
  val jsonRecord = JSON.parseFull(record)
  val globalMap = jsonRecord.get.asInstanceOf[Map[String, Any]]
  val version = globalMap.get("version").get.asInstanceOf[String]
}
But I am not sure whether this is the best way to read an RDD containing JSON data. Please suggest.
Thanks,
Use the json4s library to parse the JSON data. It is available with Spark by default, so there is no need to import extra libraries.
Check the code below.
scala> rdd.collect.foreach(println)
{"timestamp":"2020-12-11 22:35:00.000000 UTC","tech":"Spark","version":2,"start_time":1607725688402210,"end_time":1607726131636059}
scala> :paste
// Entering paste mode (ctrl-D to finish)
rdd.map { row =>
  // Import required libraries for the json4s parser.
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  implicit val formats = DefaultFormats

  // parse the json message using the parse function from the json4s lib.
  val jsonData = parse(row)

  // extract the required fields from the parsed json data.
  // extracting the version field value
  val version = (jsonData \\ "version").extract[Int]
  // extracting the timestamp field value
  val timestamp = (jsonData \\ "timestamp").extract[String]

  (version, timestamp)
}
.collect
.foreach(println)
// Exiting paste mode, now interpreting.
(2,2020-12-11 22:35:00.000000 UTC)
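If the set of fields is fixed, json4s can also extract a whole record into a case class in one call. A small sketch, assuming the field names from the sample message above:

// Case class mirroring the fields we care about; extra JSON fields are ignored.
case class Event(version: Int, tech: String, timestamp: String)

val events = rdd.map { row =>
  import org.json4s._
  import org.json4s.jackson.JsonMethods._
  implicit val formats: Formats = DefaultFormats
  parse(row).extract[Event]
}

events.collect.foreach(println)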

Is it possible to create a dataframe column with json data which doesn't have a fixed schema?

I am trying to create a DataFrame column with JSON data which does not have a fixed schema. I am trying to write it in its original form as a map/object but getting various errors.
I don't want to convert it to a string, as I need to write this data in its original form to the file.
Later this file is used for JSON processing, so the original structure should not be compromised.
Currently, when I try writing the data to a file, it contains all the escape characters and the entire JSON is treated as a string instead of a complex type, e.g.:
{"field1":"d1","field2":"app","value":"{\"data\":\"{\\\"app\\\":\\\"am\\\"}\"}"}
You could try making up a schema for the JSON file.
I don't know what output you expect.
As a clue, I give you an example and two interesting links:
spark-read-json-with-schema
spark-schema-explained-with-examples
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object RareJson {

  val spark = SparkSession
    .builder()
    .appName("RareJson")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "RareJson")          // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val input = "/home/cloudera/files/tests/rare.json"

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)

    try {
      val structureSchema = new StructType()
        .add("field1", StringType)
        .add("field2", StringType)
        .add("value", StringType, true)

      val rareJson = sqlContext
        .read
        .option("allowBackslashEscapingAnyCharacter", true)
        .option("allowUnquotedFieldNames", true)
        .option("multiLine", true)
        .option("mode", "DROPMALFORMED")
        .schema(structureSchema)
        .json(input)

      rareJson.show(truncate = false)

      // To have the opportunity to view the web console of Spark: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
Output:
+------+------+---------------------------+
|field1|field2|value |
+------+------+---------------------------+
|d1 |app |{"data":"{\"app\":\"am\"}"}|
+------+------+---------------------------+
You can also try to parse the value column if it maintains the same format across all the rows.
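For example, if every row's value column has the outer shape {"data": "..."}, from_json with a small schema can lift it into a struct. A hedged sketch, with the schema and column name assumed from the sample row:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema of the outer object stored as a string in the value column (assumed from the sample).
val valueSchema = new StructType().add("data", StringType)

val parsed = rareJson.withColumn("value_parsed", from_json(col("value"), valueSchema))
parsed.show(truncate = false)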

How to export all data from Elastic Search Index to file in JSON format with _id field specified?

I'm new to both Spark and Scala. I'm trying to read all data from a particular index in Elasticsearch into an RDD and use this data to write to MongoDB.
I'm loading the Elasticsearch data into an esJsonRDD, and when I try to print the RDD contents, it is in the following format:
(1765770532{"FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"})
Expected format,
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
How can I get the output from Elasticsearch to be formatted this way?
Any help would be appreciated.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark.rdd.EsSpark

object readFromES {

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("readFromES")
      .set("es.nodes", Config.ES_NODES)
      .set("es.nodes.wan.only", Config.ES_NODES_WAN_ONLY)
      .set("es.net.http.auth.user", Config.ES_NET_HTTP_AUTH_USER)
      .set("es.net.http.auth.pass", Config.ES_NET_HTTP_AUTH_PASS)
      .set("es.net.ssl", Config.ES_NET_SSL)
      .set("es.output.json", "true")

    val sc = new SparkContext(conf)
    val RDD = EsSpark.esJsonRDD(sc, "userdata/user")

    //RDD.coalesce(1).saveAsTextFile(args(0))
    RDD.take(5).foreach(println)
  }
}
I would like the RDD output to be written to a file in the following JSON format (one line per doc):
{_id:"1765770532","FirstName":ABC,"LastName":"DEF",Zipcode":"36905","City":"PortAdam","StateCode":"AR"}
{_id:"1765770533","FirstName":DEF,"LastName":"DEF",Zipcode":"35525","City":"PortWinchestor","StateCode":"AI"}
"_id" is a part of metadata, to access it you should add .config("es.read.metadata", true) to config.
Then you can access it in two ways. You can use
val RDD = EsSpark.esJsonRDD(sc, "userdata/user")
and manually add the _id field to the JSON, or the easier way is to read it as a DataFrame:
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("userdata/user")
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")

// Write as JSON to a file
df.write.json("output folder ")
Here, spark is the Spark session, created as:
val spark = SparkSession.builder().master("local[*]").appName("Test")
  .config("spark.es.nodes", "host")
  .config("spark.es.port", "ports")
  .config("spark.es.nodes.wan.only", "true")
  .config("es.read.metadata", true) // for enabling metadata
  .getOrCreate()
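One small assumption worth flagging: the $"_metadata" column syntax used above needs the session's implicits in scope, so you may also need:

import spark.implicits._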
Hope this helps