Explode JSON array in SchemaRDD

I have JSON like:
{"name":"Yin", "address":[{"city":"Columbus","state":"Ohio"},{"city":"Columbus","state":"Ohio"}]}
{"name":"Michael", "address":[{"city":null, "state":"California"},{"city":null, "state":"California"}]}
Here address is an array, and if I use sqlContext.jsonFile I get the data in a SchemaRDD as follows:
[Yin, [(Columbus, Ohio), (Columbus, Ohio)]]
[Michael, [(null, California), (null, California)]]
I want to explode the array and get the data in the following format in the SchemaRDD:
[Yin, Columbus, Ohio]
[Yin, Columbus, Ohio]
[Michael, null, California]
[Michael, null, California]
I am using Spark SQL.

The typical suggestion is to drop out of SQL for this, but if you want to stay in SQL, here is an answer I got from asking this on the mailing list (Nabble isn't showing the response for some reason):
From Michael Armbrust
You can do what you want with LATERAL VIEW explode (using HiveContext), but what seems to be missing is that jsonRDD converts JSON objects into structs (fixed keys with a fixed order), and fields in a struct are accessed using a dot (.).
val myJson = sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
// myJson: org.apache.spark.sql.DataFrame = [foo: array<struct<bar:bigint,baz:bigint>>]

myJson.registerTempTable("JsonTest")

val result = sql("SELECT f.bar FROM JsonTest LATERAL VIEW explode(foo) a AS f").collect()
// result: Array[org.apache.spark.sql.Row] = Array([1], [null])
In Spark 1.3 you can also hint to jsonRDD that you'd like the json objects converted into Maps (non-uniform keys) instead of structs, by manually specifying the schema of your JSON.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema =
  StructType(
    StructField("foo", ArrayType(MapType(StringType, IntegerType))) :: Nil)
// schema: org.apache.spark.sql.types.StructType = StructType(StructField(foo,ArrayType(MapType(StringType,IntegerType,true),true),true))

sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil), schema).registerTempTable("jsonTest")

val withSql = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo) a AS a WHERE a['bar'] IS NOT NULL").collect()
// withSql: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])

val withSpark = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo) a AS a").rdd.filter {
  case Row(a: Map[String, Int]) if a.contains("bar") => true
  case _: Row => false
}.collect()
// withSpark: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])
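Applied to the original name/address data, a minimal sketch along the same lines would look like the following (hiveCtx and people.json are assumed names, not from the answer, and the code follows the Spark 1.x API used above):
// Sketch: LATERAL VIEW explode on the name/address records shown in the question.
val people = hiveCtx.jsonFile("people.json")
people.registerTempTable("people")

val flat = hiveCtx.sql(
  """SELECT name, addr.city, addr.state
    |FROM people
    |LATERAL VIEW explode(address) addresses AS addr""".stripMargin)

flat.collect().foreach(println)
// Expected rows: [Yin,Columbus,Ohio], [Yin,Columbus,Ohio],
//                [Michael,null,California], [Michael,null,California]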

Related

Convert struct to array in a Spark DataFrame

I have a DataFrame in Spark like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":"abc","acc_no":123,"mobile":000},{"name":"abc","acc_no":123,"mobile":111},{"name":"abc","acc_no":123,"mobile":222}]}
I am looking for the output like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":["abc"],"acc_no":[123],"mobile":[000,123,222]}
This is what you want: first explode the column, then aggregate back.
// functions._ provides explode, col, array, struct and collect_set used below
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]")
  .appName("learn")
  .getOrCreate()

val inputdf = spark.read.option("multiline", "false").json("C:\\Users\\User\\OneDrive\\Desktop\\source_file.txt")
val newdf1 = inputdf.withColumn("cust_detail_exploded", explode(col("cust_detail"))).drop("cust_detail")
val newdf2 = newdf1.select("cust_id", "emp_name", "emp_id", "cust_detail_exploded.mobile", "cust_detail_exploded.acc_no", "cust_detail_exploded.name")
val newdf3 = newdf2.groupBy("cust_id").agg(array(struct(collect_set(col("mobile")).as("mobile"), collect_set(col("acc_no")).as("acc_no"), collect_set(col("name")).as("name"))).as("cust_detail"))

newdf3.printSchema()
newdf3.write.json("C:\\Users\\User\\OneDrive\\Desktop\\newww.txt")
Output:
{"cust_id":"c1","cust_detail":[{"mobile":["111","000","222"],"acc_no":["123"],"name":["abc"]}]}

Creating a JSON file with a for loop in Scala

My requirement is to convert two strings into a JSON file (using spray-json) and save it in a resource directory.
One input string contains the ID and the other contains the scores and topics:
id = "alpha1"
inputstring = "science 30 math 24"
Expected output JSON is
{
  "ContentID": "alpha1",
  "Topics": [
    { "Score": 30, "TopicID": "Science" },
    { "Score": 24, "TopicID": "math" }
  ]
}
Below is the approach I have taken; I am stuck at the last step.
Define the case classes:
case class Topic(Score: String, TopicID: String)
case class Model(contentID: String, topic: Array[Topic])
implicit val topicJsonFormat: RootJsonFormat[Topic] = jsonFormat2(Topic)
implicit val modelJsonFormat: RootJsonFormat[Model] = jsonFormat2(Model)
Parsing the input string
val a = input.split(" ").zipWithIndex.collect { case (v, i) if i % 2 == 0 => (v, i) }.map(_._1)
val b = input.split(" ").zipWithIndex.collect { case (v, i) if i % 2 != 0 => (v, i) }.map(_._1)
val paired = a.zip(b)
And finally traversing through paired:
paired foreach { case (x, y) =>
  val tClass = Topic(x, y)
  val mClassJsonString = tClass.toJson.prettyPrint
  out1.write(mClassJsonString)
}
And the file is generated as
{"Score" : 30, "TopicID" : "Science" }
{ "Score" : 24, "TopicID" : "math”}
The problem is that I am not able to add the ContentID as needed above.
Adding the ContentID inside the foreach makes it appear multiple times.
You're calling toJson inside the foreach, creating one string per topic, and then appending each of them to the buffer.
What you probably wanted to do is to create a class (ADT) hierarchy first and then serialize it:
val topics = paired.map { case (x, y) => Topic(x, y) }
// toArray might not be necessary if topics is already an array
val model = Model("alpha1", topics.toArray)
val json = model.toJson.prettyPrint
out1.write(json)
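For reference, a self-contained sketch of that idea; the field names ContentID/Topics and the Int type for Score are my adjustments to match the expected output, not part of the original code:
import spray.json._
import DefaultJsonProtocol._

// Score is an Int here so it serializes as a number, matching the expected JSON.
case class Topic(Score: Int, TopicID: String)
case class Model(ContentID: String, Topics: Array[Topic])

implicit val topicJsonFormat: RootJsonFormat[Topic] = jsonFormat2(Topic)
implicit val modelJsonFormat: RootJsonFormat[Model] = jsonFormat2(Model)

val id = "alpha1"
val input = "science 30 math 24"

// Tokens alternate topic/score, so group them in pairs.
val topics = input.split(" ").grouped(2).collect {
  case Array(topic, score) => Topic(score.toInt, topic)
}.toArray

// Build the whole model once, then serialize it in a single call.
val json = Model(id, topics).toJson.prettyPrint
println(json)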

Parse a JSON string using scala.util.parsing.json

I have a JSON string and I want to be able to parse it to get the 'key' values.
jsonString = {"id":2279,
"name":"Test",
"description":null,
"tags":[],
"keys":[{
"key":"WI1MX6XAWSY03X8Y",
"flag":true},
{"key":"BK2Q18T8RSN6VODR",
"flag":false}]}
I want to be able to parse this string and get values for both the keys.
Currently I'm doing:
val details = JSON.parseFull(jsonString)
val keys = details.get.asInstanceOf[Map[String, Any]]("keys")
println(keys)
keys here is:
List(Map(key -> 3JP11GJ5OOGOVV5N, flag -> true), Map(key -> F49M347FOHYKBT9, flag -> false))
Please let me know how I can get both the 'key' values.
At this point there is nothing JSON-related left to do; since keys is typed as Any, cast it to a list of maps and map over it:
val keyValues = keys.asInstanceOf[List[Map[String, Any]]].map(k => k("key"))
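Put together, a minimal sketch (it assumes jsonString holds the JSON shown in the question, so the printed values reflect that input):
import scala.util.parsing.json.JSON

// JSON.parseFull returns Option[Any]: objects become Map[String, Any], arrays become List[Any].
val details = JSON.parseFull(jsonString)
val keys = details.get.asInstanceOf[Map[String, Any]]("keys")
  .asInstanceOf[List[Map[String, Any]]]

val keyValues = keys.map(k => k("key").asInstanceOf[String])
println(keyValues) // List(WI1MX6XAWSY03X8Y, BK2Q18T8RSN6VODR)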

Accessing a Single Value from Parsed JObject in Scala (Jackson, json4s)

I have an object like this:
val aa = parse(""" { "vals" : [[1,2,3,4], [4,5,6,7], [8,9,6,3]] } """)
I want to access the value '1' in the first JArray.
println(aa.values ???)
How is this done?
Thanks
One way would be:
val n = (aa \ "vals")(0)(0).extract[Int]
println(n)
Another way is to parse the whole JSON using a case class:
implicit val formats = DefaultFormats
case class Numbers(vals: List[List[Int]])
val numbers = aa.extract[Numbers]
This way you can access the first value of the first list however you like:
for { list <- numbers.vals.headOption; hd <- list.headOption } println(hd)
// or
println(numbers.vals.head.head)
// or ...
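For completeness, a self-contained sketch with the imports spelled out (assuming the json4s-jackson backend, since the question mentions Jackson):
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats: Formats = DefaultFormats

val aa = parse(""" { "vals" : [[1,2,3,4], [4,5,6,7], [8,9,6,3]] } """)

// Navigate with \ and positional indexing, then extract a typed value.
val n = (aa \ "vals")(0)(0).extract[Int]
println(n) // 1

// Or map the whole document onto a case class.
case class Numbers(vals: List[List[Int]])
println(aa.extract[Numbers].vals.head.head) // 1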

Spark RDD to CSV - Add empty columns

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, so to know all the column names I would need to union all the keys. Is there a way to avoid this collect operation, and use just one rdd.saveAsTextFile(..) call to get the CSV?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can also try to solve my problem from another direction. Let's say I somehow know all the columns at the beginning, but I want to get rid of columns that have a 0 value in every map. So the problem becomes: I know that the keys are ("a", "b", "c"), and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is "every new element in my RDD may add a new column name I have not seen so far", then obviously you can't avoid a full scan. But you don't need to collect all the elements on the driver.
You could use aggregate to collect only the column names. This method takes two functions: one to fold a single element into the resulting collection, and another to merge results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
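A minimal sketch of those two passes (the sorted column order, the 0 default for missing keys and the output path are my assumptions):
// First pass: gather all column names without collecting the rows themselves.
val cols = rdd
  .aggregate(Set.empty[String])((s, m) => s union m.keySet, (s1, s2) => s1 union s2)
  .toSeq.sorted

// Second pass: turn every map into a CSV line, filling missing keys with 0.
val header = cols.mkString(",")
val lines = rdd.map(m => cols.map(c => m.getOrElse(c, 0)).mkString(","))

// saveAsTextFile writes only the data lines; the header has to be handled separately.
lines.saveAsTextFile("out.csv")
println(header)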
Scala and any other supported language
You can use spark-csv.
First, let's find all the columns that are present:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
val rows = rdd.map { row =>
  Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
cols.value.map(field => StructField(field, IntegerType, true)))
Convert the RDD[Row] to a DataFrame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use pandas through the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().to_csv('mycsv.csv', index=False)
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to address the edited question is to use accumulators, either to build a set of all column names or to count the columns where you found zeros, and then use this information to map over the rows and remove unnecessary columns or add zeros.
It is possible but inefficient and feels like cheating. The only situation where it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
import org.apache.spark.AccumulatorParam

object ColsSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = Set.empty[String]
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { row => colSetAccum += row.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))

object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
  def zero(initialValue: Map[String, Int]): Map[String, Int] = Map.empty[String, Int]
  def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
    val keys = m1.keys ++ m2.keys
    keys.map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
  }
}
val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)

rdd.foreach { row =>
  // If allColnames.value -- row.keys.toSet is empty we can avoid this part
  accum += (allColnames.value -- row.keys.toSet).map(x => x -> 1).toMap
}
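To close the loop, a sketch of how the accumulator value could then be used to drop columns. It follows the code above, which treats a key missing from a row as a zero; the row-count check and the output path are my assumptions, and counting the rows costs one extra pass:
// A column was (implicitly) zero in every row if it was missing from all of them.
val totalRows = rdd.count()
val zeroCols = accum.value.collect { case (col, misses) if misses == totalRows => col }.toSet

// Keep only the useful columns and write one CSV line per row.
val usefulCols = (allColnames.value -- zeroCols).toSeq.sorted
val csvLines = rdd.map(row => usefulCols.map(c => row.getOrElse(c, 0)).mkString(","))
csvLines.saveAsTextFile("filtered.csv")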