How to Convert Spark RDD into JSON using Scala Language

I am using the MongoDB Spark Connector to read a collection. The aim is to return all the documents in the collection as one array of JSON documents.
I am able to load the collection, but I am not sure how to convert the customRdd object, which holds the list of documents, into JSON. As you can see in the code, I can convert the first document, but how do I convert all the documents read from the collection and combine them into one JSON message to send?
Expected Output:
This should be an array of documents:
{
  "objects": [
    {
      ...
    },
    {
      ...
    }
  ]
}
Existing Code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.config._
import com.mongodb.spark._
import org.json4s.native.JsonMethods._
import org.json4s.JsonDSL.WithDouble._

val conf = new SparkConf()
conf.setAppName("MongoSparkConnectorIntro")
  .setMaster("local")
  .set("spark.hadoop.validateOutputSpecs", "false")
  .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mystore.mycollection?readPreference=primaryPreferred")
  .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/mystore.mycollection?readPreference=primaryPreferred")

val sc = new SparkContext(conf)

val spark = SparkSession.builder()
  .master("spark://192.168.137.103:7077")
  .appName("MongoSparkConnectorIntro")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mystore.mycollection?readPreference=primaryPreferred")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mystore.mycollection?readPreference=primaryPreferred")
  .getOrCreate()

//val readConfig = ReadConfig(Map("collection" -> "metadata_collection", "readPreference.name" -> "secondaryPreferred"), Some(ReadConfig(sc)))
val readConfig = ReadConfig(Map("uri" -> "mongodb://127.0.0.1/mystore.mycollection?readPreference=primaryPreferred"))
val customRdd = MongoSpark.load(sc, readConfig)

//println("Before Printing the value" + customRdd.toString())
println("The Count: " + customRdd.count)
println("The First Document: " + customRdd.first.toString())

val resultJson = "MetaDataFinalResponse" -> customRdd.collect().toList
val stringResponse = customRdd.first().toJson()
println("Final Response: " + stringResponse)
return stringResponse
Note:
I don't want to further map the JSON documents into another model. I want them as they are; I just want to aggregate them into one JSON message.
Spark Version: 2.4.0
SBT File:
name := "Test"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.0"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.4.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

This answer generates the JSON string without escape characters and is much more efficient, but you need to collect the RDD to do this (you can remove the code from my previous answer):
// We will create a new Document with the documents that are fetched from MongoDB
import scala.collection.JavaConverters._
import org.bson.Document
// Collect customRdd and convert to java array
// (we can only create new Document with java collections)
val documents = customRdd.collect().toSeq.asJava
// Create new document with the field name you want
val stringResponse = new Document().append("objects", documents).toJson()
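For comparison, here is a minimal alternative sketch (assuming customRdd is an RDD[org.bson.Document], as returned by MongoSpark.load) that serializes each document to JSON on the executors and only collects the strings on the driver:
// Serialize each Document to a JSON string on the executors.
val jsonDocs: Array[String] = customRdd.map(_.toJson).collect()
// Assemble the final payload under the "objects" field from the expected output.
val stringResponse = jsonDocs.mkString("{\"objects\":[", ",", "]}")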

Related

Scala - How to get the Max value of a Json field?

I'm trying to get the maximum value of a MetricId field from a JSON string. However, I'm getting a java.lang.UnsupportedOperationException: empty.max for the String below:
[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
{"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]
Here is how I've implemented a method for finding the Max value:
val metricIdRegex = """"MetricId"\s*:\s*(\d+)""".r
def maxMetricId(jsonString: String): String = {
metricIdRegex.findAllIn(jsonString).map({
case metricIdRegex(id) => id.toInt
}).max.toString
}
val maxId: String = maxMetricId(metricsString)
I'm expecting to get "7855" as the max MetricId.
What could be wrong with the method? I suspect that it could be a problem with the regex.
You could also use json4s, which is quite popular and used by many other Scala libraries:
import org.json4s._
import org.json4s.jackson.JsonMethods._
val data = """[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
{"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]"""
// parse data into JValue
val parsed = parse(data)
// go through the parsed value, extract every MetricId into a string list, convert each to Int, and take the max
val maxMetricId = (parsed \ "MetricId" \\ classOf[JString]).map{_.toInt}.max
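As for why the original regex approach throws empty.max: in the input the MetricId values are quoted ("MetricId":"7855"), so (\d+) placed right after the colon never matches, findAllIn returns an empty iterator, and calling max on it fails. A minimal sketch of a corrected pattern, assuming the values are always plain digit strings:
// Allow an optional opening quote before the digits.
val quotedMetricIdRegex = """"MetricId"\s*:\s*"?(\d+)""".r

def maxMetricId(jsonString: String): String =
  quotedMetricIdRegex.findAllMatchIn(jsonString).map(_.group(1).toInt).max.toString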
Let me show an example of how it can be done efficiently with a JSON parser, without holding the whole JSON input and parsed data in memory.
Add dependencies to your build.sbt:
libraryDependencies ++= Seq(
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "2.0.2" % Compile,
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "2.0.2" % Provided // required only at compile time
)
Add the imports, define a data structure for the repeating part of your JSON array that should be parsed out, derive a codec for it, then open an input stream and scan it with a handler function that reduces all parsed metrics to the maximum value:
import com.github.plokhotnyuk.jsoniter_scala.macros._
import com.github.plokhotnyuk.jsoniter_scala.core._
import java.io.ByteArrayInputStream
import java.io.InputStream

case class Metric(@stringified MetricId: Int)

implicit val codec: JsonValueCodec[Metric] = JsonCodecMaker.make(CodecMakerConfig)

val in: InputStream = new ByteArrayInputStream( // <- replace it with a FileInputStream
  """[{"MetricName":"name1","DateParsed":"2019-11-20 05:39:00","MetricId":"7855","isValid":"true"},
    {"MetricName":"name2","DateParsed":"2019-05-22 17:45:00","MetricId":"1295","isValid":"false"}]""".getBytes("UTF-8"))

try {
  var max = -1
  scanJsonArrayFromStream[Metric](in) { m: Metric =>
    max = Math.max(max, m.MetricId)
    true
  }
  println(max)
} finally in.close()
And this code should print 7855.

Any scalatest matchers for matching json

My test currently expects to match the converted json string from a method under test. I have constructed an expected string to perform the match.
val input = Foobar("bar", "foo")
val body = Foobar("bar !!", "foo!!")
val responseHeaders = Map[String, String]("Content-Type" -> "application/json")
val statusCode = "200"
val responseEvent = ResponseEvent(input, body, responseHeaders, statusCode)
val expected ="{\"input\":{\"foo\":\"bar\",\"bar\":\"foo\"},\"body\":{\"foo\":\"bar !!\",\"bar\":\"foo!!\"},\"headers\":{\"Content-Type\":\"application/json\"},\"statusCode\":\"200\"}"
val result = Main.stringifyResponse(responseEvent)
result should be(expected)
The string matching is extremely sensitive and fails on any whitespace; also, an expected string written across multiple lines is not accepted, because the result of stringifying with the json4s library is all on one line.
Is there a better way to match JSON output without doing a full-blown string comparison in ScalaTest?
Is there a better approach to create this test?
Check out https://github.com/stephennancekivell/scalatest-json
libraryDependencies += "com.stephenn" %% "scalatest-json-jsonassert" % "0.0.3"
libraryDependencies += "com.stephenn" %% "scalatest-json4s" % "0.0.2"
libraryDependencies += "com.stephenn" %% "scalatest-play-json" % "0.0.1"
libraryDependencies += "com.stephenn" %% "scalatest-circe" % "0.0.1"
It lets you write tests without caring about whitespace, since it's JSON.
it("should pass matching json with different spacing and order") {
val input = """
|{
| "some": "valid json",
| "with": ["json", "content"]
|}
""".stripMargin
val expected = """
|{
| "with": ["json", "content"],
| "some": "valid json"
|}
""".stripMargin
input should matchJson(expected)
}
You have two options:
1. Use a library like Play JSON, with which you can box your raw JSON string into a JsObject and do the check with ScalaTest (a sketch follows below). If you already use a JSON library, see if you can leverage that!
2. Box your JSON into a case class and check for equality!
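A minimal sketch of option 1 with play-json, assuming the spec already mixes in ScalaTest's Matchers (for shouldBe) and reusing result and expected from the test above:
import play.api.libs.json.Json

// Parse both sides so the comparison is structural rather than textual:
// whitespace and key order no longer matter.
Json.parse(result) shouldBe Json.parse(expected)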

Reading a csv file as a spark dataframe

I have a CSV file with a header that has to be read through Spark (2.0.0 and Scala 2.11.8) as a dataframe.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing a problem when I try to read this CSV data in Spark as a dataframe, because the header contains a column (No. of items) with the special character ".".
The code with which I try to read the CSV data is:
val spark = SparkSession.builder().appName("SparkExample")
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I wont get any error. Even tried with escaping the character,but it escapes all the "." characters even from the data.
Is there any way to escape the special character "." only from the CSV header using spark code?
@Pooja Nayak, not sure if this was solved; answering in the interest of the community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext

// Read the raw file from local FS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")

// Drop the first line in the first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

// A function to create the schema dynamically from the header.
def schemaCreator(header: String): StructType = {
  StructType(header
    .split(",")
    .map(field => StructField(field.trim, StringType, true))
  )
}

// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)

// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy: _*))

// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
The imports that are necessary:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
I am giving you an example that works with PySpark; hopefully the same will work for you with just some language-related syntax changes.
file = r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv") #Without deprecated API
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")
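For reference, a rough Scala equivalent of the PySpark snippet above (a sketch; "cars.csv" is the same placeholder path):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FICBOutputGenerator").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

// Same read via the non-deprecated DataFrameReader API.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .csv("cars.csv")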

Reading JSON with Apache Spark - `corrupt_record`

I have a JSON file, nodes, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt and is sound JSON.
As Spark expects "JSON Line format" not a typical JSON format, we can tell spark to read typical JSON by specifying:
val df = spark.read.option("multiline", "true").json("<file>")
Spark cannot read a top-level JSON array into records, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple. Spark expects you to pass a file with a lot of JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL (JSON Lines). Basically, it's an alternative to CSV.
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended; from the wholeTextFiles docs:
Small files are preferred, large file is also allowable, but may cause bad performance.
I ran into the same problem. I used sparkContext and sparkSql on the same configuration:
val conf = new SparkConf()
  .setMaster("local[1]")
  .setAppName("Simple Application")

val sc = new SparkContext(conf)

val spark = SparkSession
  .builder()
  .config(conf)
  .getOrCreate()
Then, using the Spark context, I read the whole JSON file (JSON is the path to the file):
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters...
val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using spark sql:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2"

Parsing JSON with Scala lift

I am trying to parse a JSON string with special characters (dots) in its attribute names.
This is what I'm trying:
// Json parser objects
case class SolrDoc(`rdf.about`: String, `dc.title`: List[String],
                   `dc.creator`: List[String], `dc.dateCopyrighted`: List[Int],
                   `dc.publisher`: List[String], `dc.type`: String)

case class SolrResponse(numFound: String, start: String, docs: List[SolrDoc])

val req = url("http://localhost:8983/solr/select") <<? Map("q" -> q)
var search_result = http(req ># { json => (json \ "response") })
var response = search_result.extract[SolrResponse]
Even though my JSON string contains values for all the fields, this is the error I'm getting:
Message: net.liftweb.json.MappingException: No usable value for docs
No usable value for rdf$u002Eabout
Did not find value which can be converted into java.lang.String
I suspect it has something to do with the dots in the names, but so far I have not managed to make it work.
Thanks!
This is an extract from my LiftProject.scala file:
"net.databinder" % "dispatch-http_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-http-json_2.8.1" % "0.8.6",
"net.databinder" % "dispatch-lift-json_2.8.1" % "0.8.6"
Dots in names should not be a problem. This is with lift-json-2.4-M4:
scala> val json = """ {"first.name":"joe"} """
scala> parse(json).extract[Person]
res0: Person = Person(joe)
Where
case class Person(`first.name`: String)
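For completeness, a minimal sketch of the setup the REPL snippet above assumes (lift-json needs an implicit Formats in scope for extract):
import net.liftweb.json._

// extract requires an implicit Formats instance.
implicit val formats: Formats = DefaultFormats

case class Person(`first.name`: String)

val json = """ {"first.name":"joe"} """
val person = parse(json).extract[Person]  // Person(joe)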