How to create an RDD from another RDD by extracting specific values? - json

I have an RDD which contains a String and a JSON object (as a String). I extracted the required values from the JSON object. How can I use those values to create a new RDD that stores each value in its own column?
RDD
(1234,{"id"->1,"name"->"abc","age"->21,"class"->5})
From this, a map was generated for each record, as shown below:
"id" -> 1, "name" -> "abc", "age" -> 21
"id" -> 2, "name" -> "def", "age" -> 31
How to convert this to RDD[(String, String, String)], which stores data like:
1 abc 21
2 def 31

Not in front of a compiler right now, but something like this should work:
def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  // Here goes your code to parse a JSON object into a sequence of tuples;
  // it sounds like you already have this well in hand.
  ???
}
val rdd1 = ??? // Initialize your RDD[(String, JValue)]
val rdd2: RDD[(String, String, String)] = rdd1.flatMap(parse)
flatMap does the trick: your extraction function can produce multiple rows for each JSON input (or none), and they will be seamlessly integrated into the final RDD.
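For completeness, here is a minimal sketch of what parse could look like with json4s (whose JValue type the snippet above already uses), assuming each JValue is an object with "id", "name" and "age" fields as in the question; extractOpt lets malformed records be dropped rather than throwing:
import org.json4s._

def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  implicit val formats: Formats = DefaultFormats
  val obj = row._2
  // extractOpt returns None instead of throwing when a field is missing
  val id   = (obj \ "id").extractOpt[Int].map(_.toString)
  val name = (obj \ "name").extractOpt[String]
  val age  = (obj \ "age").extractOpt[Int].map(_.toString)
  (id, name, age) match {
    case (Some(i), Some(n), Some(a)) => Seq((i, n, a))
    case _                           => Seq.empty // skip malformed records
  }
}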

Related

spark streaming JSON value in dataframe column scala

I have a text file with JSON values, and this gets read into a DataFrame:
{"name":"Michael"}
{"name":"Andy", "age":30}
I want to infer the schema dynamically for each line while streaming and store it in separate locations (tables) depending on its schema.
Unfortunately, when I try to read value.schema it still shows as String. Please help with how to do this in Streaming, as RDDs are not allowed in streaming.
I wanted to use the following code, which doesn't work because the value is still read in String format.
val jsonSchema = newdf1.select("value").as[String].schema
val df1 = newdf1.select(from_json($"value", jsonSchema).alias("value_new"))
val df2 = df1.select("value_new.*")
I even tried to use
schema_of_json("json_schema")
val jsonSchema: String = newdf.select(schema_of_json(col("value".toString))).as[String].first()
but still no hope. Please help.
You can load the data as a textFile, create a case class Person, and parse every JSON string into a Person instance using json4s or Gson, then create the DataFrame as follows:
case class Person(name: String, age: Int)

import spark.implicits._ // for the Dataset encoders and toDF

val jsons = spark.read.textFile("/my/input")
// instead of `toPerson`, actually parse with json4s or Gson to return a Person instance
val persons = jsons.map(json => toPerson(json))
val df = persons.toDF()
Deserialize json to case class using json4s:
https://commitlogs.com/2017/01/14/serialize-deserialize-json-with-json4s-in-scala/
Deserialize json to case class using gson:
https://alvinalexander.com/source-code/scala/scala-case-class-gson-json-object-deserialization-and-scalatra
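As a rough sketch (assuming the Person case class above and that every line carries both fields; optional fields such as the missing age would need Option types), toPerson with json4s could look like:
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical helper: parse one JSON line into a Person with json4s.
def toPerson(json: String): Person = {
  implicit val formats: Formats = DefaultFormats
  parse(json).extract[Person]
}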

scala - using map function to concatenate JValue

I've recently started using Scala and am probably missing something about the map function. I understand that it returns a new value resulting from applying the given function.
For example, I have an array of JValues and want to concatenate each value in the array with another JValue, or just transform it to a String as in the example below.
val salesArray = salesJValue.asInstanceOf[JArray]
val storesWithSales = salesArray.map(sale => compact(render(sale))) // Type mismatch here
val storesWithSales = salesArray.map(sale => compact(render(sale)) + compact(render(anotherJvalue))) // Type mismatch here
As I can see, there is a type mismatch because the actual value is a String and the expected one is a JValue. Even if I do compact(render(sale)).asInstanceOf[JValue], it's not allowed to cast a String to JValue. Is it possible to return a different type from a map function? And how can I process the array values to transform each of them into another type?
Take a look at the type signature of the map method:
def map(f: JValue => JValue): JValue
So it's a bit different from other map methods in that you must pass a function whose return type is JValue. This is because a JArray specifically represents a deserialized JSON tree, and cannot hold arbitrary objects or data, only JValues.
If you want to process each of the values of a JArray, call .children on it first. That gives you a List[JValue] which then has a more general map method since Lists can hold any type. Its type signature is:
def map[B](f: A => B): List[B]
So you can do:
val storesWithSales = salesArray.children.map(sale => compact(render(sale)))
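And if the goal is the concatenation from the question, the same .children trick applies (assuming the anotherJvalue from the question is in scope):
val storesWithSales: List[String] =
  salesArray.children.map(sale => compact(render(sale)) + compact(render(anotherJvalue)))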

Extract a Json from an array inside a json in spark

I have a complicated JSON column whose structure is:
story: {
  cards: [ { "story-elements": [ {...}, {...}, {...} ] } ]
}
The length of the story-elements is variable. I need to extract a particular JSON block from the story-elements array. For this, I first need to extract the story-elements.
Here is the code I have tried, but it gives an error:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): String = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "cards" \ "story-elements").extract[String]
  value1
}
val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))
input.withColumn("cards", getJsonContentUDF(input("storyDataFrame")))
According to the JSON you provided, story-elements is an array of JSON objects, but you are trying to extract that array as a String ((parsedJson \ "cards" \ "story-elements").extract[String]).
You can create a case class representing one story (like case class Story(description: String, pageUrl: String, ...)) and then, instead of extract[String], try extract[List[Story]] or extract[Array[Story]].
If you need just one piece of data from each story (e.g. the description), you can use the xpath-like syntax to get to it and then extract[List[String]].
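A sketch of that change on top of the question's code (the Story fields here are hypothetical, so adjust them to the real story-element structure; this also assumes a single card as in the snippet above, since with several cards the path would yield one list per card):
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods._

case class Story(description: String, pageUrl: String) // hypothetical fields

def getStories(jsonstring: String): List[Story] = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  // extract a list of case-class instances instead of a String
  (parsedJson \ "cards" \ "story-elements").extract[List[Story]]
}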

How to specify only particular fields using read.schema in JSON : SPARK Scala

I am trying to programmatically enforce a schema (JSON) on a textFile that looks like JSON. I tried jsonFile, but the issue is that to create a DataFrame from a list of JSON files, Spark has to do one pass through the data to build the schema, so it needs to parse all of the data, which takes a long time (4 hours, since my data is zipped and TBs in size). So I want to try reading it as a textFile and enforce a schema to get only the fields I'm interested in, to query later on the resulting data frame. But I am not sure how to map it to the input. Can someone give me a reference on how to map a schema onto JSON-like input?
input :
This is the full schema :
records: org.apache.spark.sql.DataFrame = [country: string, countryFeatures: string, customerId: string, homeCountry: string, homeCountryFeatures: string, places: array<struct<freeTrial:boolean,placeId:string,placeRating:bigint>>, siteName: string, siteId: string, siteTypeId: string, Timestamp: bigint, Timezone: string, countryId: string, pageId: string, homeId: string, pageType: string, model: string, requestId: string, sessionId: string, inputs: array<struct<inputName:string,inputType:string,inputId:string,offerType:string,originalRating:bigint,processed:boolean,rating:bigint,score:double,methodId:string>>]
But I am only interested in a few fields, like:
res45: Array[String] = Array({"requestId":"bnjinmm","siteName":"bueller","pageType":"ad","model":"prepare","inputs":[{"methodId":"436136582","inputType":"US","processed":true,"rating":0,"originalRating":1},{"methodId":"23232322","inputType":"UK","processed":false,"rating":0,"originalRating":1}]
val records = sc.textFile("s3://testData/sample.json.gz")

val schema = StructType(Array(
  StructField("requestId", StringType, true),
  StructField("siteName", StringType, true),
  StructField("model", StringType, true),
  StructField("pageType", StringType, true),
  StructField("inputs", ArrayType(
    StructType(
      StructField("inputType", StringType, true),
      StructField("originalRating", LongType, true),
      StructField("processed", BooleanType, true),
      StructField("rating", LongType, true),
      StructField("methodId", StringType, true)
    ), true), true)))

val rowRDD = ??
val inputRDD = sqlContext.applySchema(rowRDD, schema)
inputRDD.registerTempTable("input")
sql("select * from input").foreach(println)
Is there any way to map this? Or do I need to use a JSON parser or something? I want to use textFile only because of the constraints.
Tried with:
val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")
But I keep getting the error:
<console>:37: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("inputs",ArrayType(StructType(StructField("inputType",StringType,true), StructField("originalRating",LongType,true), StructField("processed",BooleanType,true), StructField("rating",LongType,true), StructField("score",DoubleType,true), StructField("methodId",StringType,true)),true),true)))
^
It can be loaded with the following code using a predefined schema; Spark doesn't need to go through the whole file in the zip. The code in the question has an ambiguity: the inner StructFields are not wrapped in an Array (or Seq), so none of the StructType.apply overloads matches, which is exactly what the error says.
import org.apache.spark.sql.types._

val input = StructType(Array(
  StructField("inputType", StringType, true),
  StructField("originalRating", LongType, true),
  StructField("processed", BooleanType, true),
  StructField("rating", LongType, true),
  StructField("score", DoubleType, true),
  StructField("methodId", StringType, true)
))

val schema = StructType(Array(
  StructField("requestId", StringType, true),
  StructField("siteName", StringType, true),
  StructField("model", StringType, true),
  StructField("inputs", ArrayType(input, true), true)
))

val records = sqlContext.read.schema(schema).json("s3://testData/test2.gz")
Not all fields need to be provided, though it's good to provide all of them if possible. Spark tries its best to parse everything; if some row is not valid, it adds a _corrupt_record column which contains the whole row. The same applies if it's a plain JSON file.
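To actually see those bad rows when you supply your own schema, the corrupt-record column has to be part of it; a sketch, assuming the default PERMISSIVE mode and the default column name _corrupt_record:
import org.apache.spark.sql.functions.col

val schemaWithCorrupt = schema.add(StructField("_corrupt_record", StringType, true))
val checked = sqlContext.read.schema(schemaWithCorrupt).json("s3://testData/test2.gz")
checked.filter(col("_corrupt_record").isNotNull).show(5, false)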

How do I deserialize a JSON array using the Play API

I have a string that is a JSON array of two objects.
val ss = """[ {"key1" :"value1"}, {"key2":"value2"}]"""
I want to use the Play Json libraries to deserialize it and create a map from the key values to the objects.
def deserializeJsonArray(ss:String):Map[String, JsValue] = ???
// Returns Map("value1" -> {"key1" :"value1"}, "value2" -> {"key2" :"value2"})
How do I write the deserializeJsonArray function? This seems like it should be easy, but I can't figure it out from either the Play documentation or the REPL.
I'm a bit rusty, so please forgive the mess. Perhaps another overflower can come in here and clean it up for me.
This solution assumes that the JSON is an array of objects, and each of the objects contains exactly one key-value pair. I would highly recommend spicing it up with some error handling and/or pattern matching to validate the parsed JSON string.
import play.api.libs.json._

def deserializeJsonArray(ss: String): Map[String, JsValue] = {
  val jsObjectSeq: Seq[JsObject] = Json.parse(ss).as[Seq[JsObject]]
  // Use each object's single value ("value1", "value2") as the map key,
  // as in the expected output above (json.keys.head would key by "key1" instead).
  val keys: Seq[String] = jsObjectSeq.map(json => json.value.values.head.as[String])
  (keys zip jsObjectSeq).toMap
}
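For a quick sanity check with the ss from the question:
val ss = """[ {"key1" :"value1"}, {"key2":"value2"}]"""
deserializeJsonArray(ss)
// Map(value1 -> {"key1":"value1"}, value2 -> {"key2":"value2"})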