convert struct to array in spark data frame - json

I have a dataframe in spark like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":"abc","acc_no":123,"mobile":000},{"name":"abc","acc_no":123,"mobile":111},{"name":"abc","acc_no":123,"mobile":222}]}
I am looking for the output like below.
{"emp_id":1,"emp_name":"John","cust_id":"c1","cust_detail":[{"name":["abc"],"acc_no":[123],"mobile":[000,123,222]}

Here is one way to do what you want:
First explode the column, then aggregate it back.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = SparkSession.builder().master("local[1]")
  .appName("learn")
  .getOrCreate()
// Read the single-line JSON records
val inputdf = spark.read.option("multiline", "false").json("C:\\Users\\User\\OneDrive\\Desktop\\source_file.txt")
// Explode the cust_detail array so each element becomes its own row
val newdf1 = inputdf.withColumn("cust_detail_exploded", explode(col("cust_detail"))).drop("cust_detail")
val newdf2 = newdf1.select("cust_id", "emp_name", "emp_id", "cust_detail_exploded.mobile", "cust_detail_exploded.acc_no", "cust_detail_exploded.name")
// Aggregate back: collect the distinct values of each field into arrays
val newdf3 = newdf2.groupBy("cust_id").agg(array(struct(collect_set(col("mobile")).as("mobile"), collect_set(col("acc_no")).as("acc_no"), collect_set(col("name")).as("name"))).as("cust_detail"))
newdf3.printSchema()
newdf3.write.json("C:\\Users\\User\\OneDrive\\Desktop\\newww.txt")
Output:
{"cust_id":"c1","cust_detail":[{"mobile":["111","000","222"],"acc_no":["123"],"name":["abc"]}]}

Related

Scala Json - List of Json

I have the code below and the output from my program. However, I could not create a list of JSON objects (desired output given below). What kind of changes do I need to make to the existing code?
case class Uiresult(AccountNo: String, Name: String)
val json = parse(jsonString)
val elements = (json \\ "_source").children
for (acct <- elements) {
val m = acct.extract[Source]
val res = write(Uiresult(m.accountNo, m.firstName + m.lastName))
println(res)
}
Output from current program:
{"AccountNo":"1234","Name":"Augustin John"}
{"AccountNo":"1235","Name":"Juliet Paul"}
{"AccountNo":"1236","Name":"Sebastin Arul"}
Desired output:
[
{"AccountNo":"1234","Name":"Augustin John"},
{"AccountNo":"1235","Name":"Juliet Paul"},
{"AccountNo":"1236","Name":"Sebastin Arul"}
]
To create a list from the for comprehension, use the yield keyword. This will return the value from each iteration and build a list for you, which you can then assign to a val.
val list = for (acct <- elements) yield {
  val m = acct.extract[Source]
  val res = write(Uiresult(m.accountNo, m.firstName + m.lastName))
  res
}
This can be written even shorter,
val list = for (acct <- elements) yield {
  val m = acct.extract[Source]
  write(Uiresult(m.accountNo, m.firstName + m.lastName))
}
The type of list (Array, List, Seq, etc.) will be determined by the type of elements. Other collections, such as sets or maps, can be built the same way.
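For example, a minimal sketch (reusing the Source and Uiresult definitions and the write call from the question) that collects the results into a Set instead:
// Build a Set of the serialized results; duplicates are removed automatically
val resultSet: Set[String] = (for (acct <- elements) yield {
  val m = acct.extract[Source]
  write(Uiresult(m.accountNo, m.firstName + m.lastName))
}).toSet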
To print the output in the exact format of the "desired output" above, use mkString:
println(list.mkString("[\n", ",\n", "\n]"))

Converting csv RDD to map

I have a large CSV (> 500 MB) that I read into a Spark RDD, and I want to store it in a large Map[String, Array[Long]].
The CSV has multiple columns, but I only need the first two for now. They are of the form:
A 12312 [some_value] ....
B 123123 [some_value] ....
A 1222 [some_value] ....
C 1231 [some_value] ....
I want my map to basically group by the string and store an array of longs,
so, for the above case, my map would be:
{"A": [12312, 1222], "B": [123123], "C": [1231]}
But since this map would be huge, I can't simply do this directly.
I load the CSV into a sql.DataFrame.
My code so far (it looks incorrect, though):
def getMap(df: sql.DataFrame, sc: SparkContext): RDD[Map[String, Array[Long]]] = {
var records = sc.emptyRDD[Map[String, Array[Long]]]
val rows: RDD[Row] = df.rdd
rows.foreachPartition( iter => {
iter.foreach(x =>
if(records.contains(x.get(0).toString)){
val arr = temp_map.getOrElse()
records = records + (x.get(0).toString -> (temp_map.getOrElse(x.get(0).toString) :+ x.get(1).toString.toLong))
}
else{
val arr = new Array[Long](1)
arr(0) = x.get(1).toString.toLong
records = records + (x.get(0).toString -> arr)
}
)
})
}
Thanks in advance!
If I understood your question correctly, you could groupBy the first column and collect_list on the second column:
import org.apache.spark.sql.functions._
val newDF = df.groupBy("column1").agg(collect_list("column2"))
newDF.show(false)
val rdd = newDF.rdd.map(r => (r.getString(0), r.getAs[Seq[Long]](1)))
This will give you an RDD[(String, Seq[Long])] where each string key is unique.
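If you still need the local Map[String, Array[Long]] from the question, a minimal sketch (only safe if all keys and their value lists fit in driver memory) is:
// Collect the pair RDD to the driver as an immutable Map
val asMap: Map[String, Array[Long]] =
  rdd.mapValues(_.toArray).collectAsMap().toMap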

Spark Sql Flatten Json

I have a JSON which looks like this
{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley","year":2012}]}
I want to store output in a csv file like this:
Michael,{"sname":"stanford", "year":2010}
Michael,{"sname":"berkeley", "year":2012}
I have tried the following:
val people = sqlContext.read.json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The above code does not give schools_flat as a JSON string.
Any ideas on how to get the expected output?
Thanks
You need to specify the schema explicitly to read the JSON file in the desired way.
In this case it would look like this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// Declaring schools as Array[String] makes Spark keep each school object as a raw JSON string
case class json_schema_class(cities: String, name: String, schools: Array[String])
val json_schema = ScalaReflection.schemaFor[json_schema_class].dataType.asInstanceOf[StructType]
val people = sqlContext.read.schema(json_schema).json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The 'flattened' dataframe is like this:
+-------+--------------------+
| name| schools_flat|
+-------+--------------------+
|Michael|{"sname":"stanfor...|
|Michael|{"sname":"berkele...|
+-------+--------------------+
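To get from here to the CSV lines shown in the question, a minimal sketch (Spark 1.x style; the output path is just an example) would be:
// schools_flat is already a raw JSON string because schools was declared as Array[String],
// so each row can simply be joined with a comma and written out as text
flattened.rdd
  .map(row => row.getString(0) + "," + row.getString(1))
  .saveAsTextFile("people_schools_csv")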

Accessing a Single Value from Parsed JObject in Scala (Jackson, json4s)

I have an object like this:
val aa = parse(""" { "vals" : [[1,2,3,4], [4,5,6,7], [8,9,6,3]] } """)
I want to access the value '1' in the first JArray.
println(aa.values ???)
How is this done?
Thanks
One way would be (this assumes an implicit Formats is in scope, e.g. the DefaultFormats defined for the second approach below):
val n = (aa \ "vals")(0)(0).extract[Int]
println(n)
Another way is to parse the whole JSON using a case class:
implicit val formats = DefaultFormats
case class Numbers(vals: List[List[Int]])
val numbers = aa.extract[Numbers]
This way you can access the first value of the first list however you like:
for { list <- numbers.vals.headOption; hd <- list.headOption } println(hd)
// or
println(numbers.vals.head.head)
// or ...

explode json array in schema rdd

I have a JSON like:
{"name":"Yin", "address":[{"city":"Columbus","state":"Ohio"},{"city":"Columbus","state":"Ohio"}]}
{"name":"Michael", "address":[{"city":null, "state":"California"},{"city":null, "state":"California"}]}
Here address is an array, and if I use sqlContext.jsonFile I get the data in the schema RDD as follows:
[Yin, [(Columbus, Ohio), (Columbus, Ohio)]]
[Michael, [(null, California), (null, California)]]
I want to explode the array and get the data in the following format in the schema RDD:
[Yin, Columbus, Ohio]
[Yin, Columbus, Ohio]
[Michael, null, California]
[Michael, null, California]
I am using Spark SQL.
The typical suggestion is to drop out of SQL for this, but if you want to stay in SQL, here is an answer I got from asking this on the mailing list (Nabble isn't showing the response for some reason):
From Michael Armbrust
You can do what you want with LATERAL VIEW explode (using a HiveContext), but what seems to be missing is that jsonRDD converts JSON objects into structs (fixed keys with a fixed order), and fields in a struct are accessed using a dot (.):
val myJson = sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
myJson.registerTempTable("JsonTest")
val result = sql("SELECT f.bar FROM JsonTest LATERAL VIEW explode(foo) a AS f").collect()
myJson: org.apache.spark.sql.DataFrame = [foo: array<struct<bar:bigint,baz:bigint>>]
result: Array[org.apache.spark.sql.Row] = Array([1], [null])
In Spark 1.3 you can also hint to jsonRDD that you'd like the json objects converted into Maps (non-uniform keys) instead of structs, by manually specifying the schema of your JSON.
import org.apache.spark.sql.types._

val schema =
  StructType(
    StructField("foo", ArrayType(MapType(StringType, IntegerType))) :: Nil)

sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil), schema).registerTempTable("jsonTest")

val withSql = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo) a AS a WHERE a['bar'] IS NOT NULL").collect()

val withSpark = sql("SELECT a FROM jsonTest LATERAL VIEW explode(foo) a AS a").rdd.filter {
  case Row(a: Map[String, Int]) if a.contains("bar") => true
  case _: Row => false
}.collect()
schema: org.apache.spark.sql.types.StructType = StructType(StructField(foo,ArrayType(MapType(StringType,IntegerType,true),true),true))
withSql: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])
withSpark: Array[org.apache.spark.sql.Row] = Array([Map(bar -> 1)])
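For completeness, the "drop out of SQL" route mentioned at the top looks roughly like this for the original address example (a sketch, assuming Spark 1.4+ where functions.explode is available and that people.json holds the two records from the question):
import org.apache.spark.sql.functions.{col, explode}

val people = sqlContext.read.json("people.json")
// Explode the address array, then pull the nested struct fields up as columns
val flat = people
  .select(col("name"), explode(col("address")).as("addr"))
  .select(col("name"), col("addr.city"), col("addr.state"))
flat.collect()
// e.g. [Yin,Columbus,Ohio], [Yin,Columbus,Ohio], [Michael,null,California], [Michael,null,California]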