My JSON column names are a mix of lowercase and uppercase (e.g. title/Title and name/Name), so in the output I am getting name and Name as two different columns (and similarly title and Title).
How can I make the JSON columns case-insensitive?
I tried config("spark.sql.caseSensitive", "true"), but it is not working.
import spark.implicits._

val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
outputDF.show(false)
Current output (name/Name and title/Title come out as separate columns):
Expected/needed output (column names treated case-insensitively, i.e. a single Name, Address and Title column):
Solution 1
You can group the columns by their lowercase names and merge them using the coalesce function:
// set spark.sql.caseSensitive to true to avoid ambiguity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
.map(grp =>
if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
else col(grp(0))
).toSeq
val outputDF = df1.select(mergedCols:_*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
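Note that grouping by column name does not preserve the original column order (col_1 and col_2 end up in the middle above). If the order matters, a minimal sketch (reusing df1 from above) keeps the merged columns in their first-appearance order:

// keep merged columns in the order their names first appear in df1
val orderedCols = df1.columns
  .map(_.toLowerCase)
  .distinct
  .map { name =>
    val grp = df1.columns.filter(_.toLowerCase == name)
    if (grp.length > 1) coalesce(grp.map(col): _*).as(grp(0)) else col(grp(0))
  }
val orderedDF = df1.select(orderedCols: _*)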
Solution 2
Another way is to parse the JSON string column into a MapType instead of a StructType; with transform_keys you can lowercase the keys, then explode the map and pivot to get the columns:
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
"col_json",
from_json(col("col_json"), MapType(StringType, StringType))
).select(
col("col_1"),
col("col_2"),
explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
.pivot("key")
.agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
Note that transform_keys is only available since Spark 3.0; for older versions you can use a UDF, as sketched below:
val mapKeysToLower = udf((m: Map[String, String]) => {
m.map { case (k, v) => k.toLowerCase -> v }
})
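On Spark 2.x, the rest of the Solution 2 pipeline would then use this UDF in place of the transform_keys expression. A minimal sketch, assuming the same df, imports and MapType parsing as above:

val outputDF = df.withColumn(
  "col_json",
  from_json(col("col_json"), MapType(StringType, StringType))
).select(
  col("col_1"),
  col("col_2"),
  // explode the lower-cased map into key/value rows
  explode(mapKeysToLower(col("col_json")))
).groupBy("col_1", "col_2")
  .pivot("key")
  .agg(first("value"))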
You will then need to merge the duplicate columns, using something like:
import org.apache.spark.sql.functions.when
val merged = df.withColumn("title", when($"title".isNull, $"Title").otherwise($"title")).drop("Title")
Related
I have JSON data like below, where the version field is the differentiator:
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is now ArrayType(StructType).
Earlier only file_1 was needed, so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read multiple JSON files of both these types together. I cannot define stats as a string in schema_def, as that would break the corrupt-record feature (for the stats column), which checks for malformed JSON and schema compliance of all the fields.
Example df output required in one read:
version | hour | hits
1       | null | 20
2       | 1    | 10
2       | 2    | 12
I have tried reading with the mergeSchema option, but that makes the stats field String type.
I have also tried making two DataFrames by filtering on the version field and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here too the stats column becomes String type.
I was thinking that parsing the column as stats_v1 or stats_v2 based on its data type while reading could solve the problem. Please help with any possible solutions.
A UDF that checks whether the value is a JSON object or an array: if it is a single object, it wraps it in a one-element array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats

  // If the parsed JSON is a single object, wrap it in a one-element array;
  // otherwise return the original array string unchanged.
  def toArray(data: String) = {
    val json_data = parse(data)
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
.withColumn("json_stats",explode(from_json(toJsonArray($"stats"),schema)))
.select(
$"version",
$"stats",
$"json_stats".getItem("hour").as("hour"),
$"json_stats".getItem("hits").as("hits")
).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
.withColumn("stats",expr)
.withColumn("stats",explode(from_json($"stats",schema)))
.select(
$"version",
$"stats",
$"stats".getItem("hour").as("hour"),
$"stats".getItem("hits").as("hits")
)
.show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, and then use that schema to read the first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse it against two or more schemas. (Add wholetext=True as an argument of spark.read.text if each file contains a single JSON document spanning multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
.withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
.selectExpr('version', 'inline_outer(status)') \
.show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+
I want to convert a Spark DataFrame into a JSON file. Below are the input and output formats.
Any help is appreciated.
Input :
+----+---+-----+----------+
|Name|Age|City |Data      |
+----+---+-----+----------+
|Ram |30 |Delhi|[A -> ABC]|
|Shan|25 |Delhi|[X -> XYZ]|
|Riya|12 |U.P. |[M -> MNO]|
+----+---+-----+----------+
Output :
{"Name":"Ram","Age":"30","City":"Delhi","Delhi":{"A":"ABC"}}
{"Name":"Shan","Age":"25","City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Riya","Age":"12","City":"U.P.","U.P.":{"M":"MNO"}}
Scala: Starting from your data,
val df = Seq(("Ram",30,"Delhi",Map("A" -> "ABC")), ("Shan",25,"Delhi",Map("X" -> "XYZ")), ("Riya",12,"U.P.",Map("M" -> "MNO"))).toDF("Name", "Age", "City", "Data")
df.show
// +----+---+-----+----------+
// |Name|Age| City| Data|
// +----+---+-----+----------+
// | Ram| 30|Delhi|[A -> ABC]|
// |Shan| 25|Delhi|[X -> XYZ]|
// |Riya| 12| U.P.|[M -> MNO]|
// +----+---+-----+----------+
To make the City value the key instead of Data:
val df2 = df.groupBy("Name", "Age", "City").pivot("City").agg(first("Data"))
df2.show
// +----+---+-----+----------+----------+
// |Name|Age| City| Delhi| U.P.|
// +----+---+-----+----------+----------+
// |Riya| 12| U.P.| null|[M -> MNO]|
// |Shan| 25|Delhi|[X -> XYZ]| null|
// | Ram| 30|Delhi|[A -> ABC]| null|
// +----+---+-----+----------+----------+
Then convert it using toJSON and collect:
val jsonArray = df2.toJSON.collect
jsonArray.foreach(println)
It will print a result such as:
{"Name":"Riya","Age":12,"City":"U.P.","U.P.":{"M":"MNO"}}
{"Name":"Shan","Age":25,"City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Ram","Age":30,"City":"Delhi","Delhi":{"A":"ABC"}}
You can call write.json on the DataFrame.
val df: DataFrame = ....
df.write.json("/jsonFilePath")
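For example, reusing the sample data from the question (the output directory and overwrite mode below are just illustrative choices):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Ram", 30, "Delhi", Map("A" -> "ABC")),
  ("Shan", 25, "Delhi", Map("X" -> "XYZ")),
  ("Riya", 12, "U.P.", Map("M" -> "MNO"))
).toDF("Name", "Age", "City", "Data")

// writes one JSON document per row, as line-delimited JSON part files
df.write.mode("overwrite").json("/tmp/people_json")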
Here is an example using Datasets:
scala> case class Data(key: String, value: String)
scala> case class Person(name: String, age: Long, city: String, data: Data)
scala> val peopleDS = Seq(Person("Ram", 30, "Delhi", Data("A", "ABC")), Person("Shan", 25, "Delhi", Data("X", "XYZ")), Person("Riya", 12, "U.P", Data("M", "MNO"))).toDS()
scala> peopleDS.show()
+----+---+-----+--------+
|name|age| city| data|
+----+---+-----+--------+
| Ram| 30|Delhi|[A, ABC]|
|Shan| 25|Delhi|[X, XYZ]|
|Riya| 12| U.P|[M, MNO]|
+----+---+-----+--------+
scala> peopleDS.write.json("pathToData/people")
Then you will find the written JSON files in the given folder:
> cd pathToData/people
> ls -l
part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
> cat part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
{"name":"Ram","age":30,"city":"Delhi","data":{"key":"A","value":"ABC"}}
{"name":"Shan","age":25,"city":"Delhi","data":{"key":"X","value":"XYZ"}}
{"name":"Riya","age":12,"city":"U.P","data":{"key":"M","value":"MNO"}}
I'm writing a Spark application in Scala using Spark Structured Streaming that receives JSON-formatted data from Kafka. The application can receive either a single JSON object or multiple JSON objects formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
Array(
StructField("key1",DataTypes.StringType),
StructField("key2",DataTypes.StringType)
))
But it doesn't work.
My actual code for parsing JSON:
var data = (this.stream).getStreamer().load()
.selectExpr("CAST (value AS STRING) as json")
.select(from_json($"json",schema=schema).as("data"))
I would like to get these JSON objects into a DataFrame like:
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Can anyone help me, please?
Thank you!
As your incoming string is an array of JSON objects, one way is to write a UDF to parse the array and then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same can be used for streaming with minimal changes (see the sketch after the code).
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser{
//case class to parse the incoming JSON String
case class JSON(key1: String, key2: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.
builder().
appName("JSON").
master("local").
getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
//sample JSON array String coming from kafka
val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
//UDF to parse JSON array String
val jsonConverter = udf { jsonString: String =>
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
mapper.readValue(jsonString, classOf[Array[JSON]])
}
val df = str.toDF("json") //json String column
.withColumn("array", jsonConverter($"json")) //parse the JSON Array
.withColumn("json", explode($"array")) //explode the Array
.drop("array") //drop unwanted columns
.select("json.*") //explode the JSON to separate columns
//display the DF
df.show()
//+------+------+
//| key1| key2|
//+------+------+
//|value1|value2|
//|value3|value4|
//+------+------+
}
}
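As mentioned above, the same logic carries over to Structured Streaming with minimal changes. A hedged sketch of what that could look like with a Kafka source (broker address, topic name and checkpoint path are placeholders; jsonConverter is the UDF defined above, and the same spark session and imports are assumed):

val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "json-topic")                   // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) as json")
  .withColumn("array", jsonConverter($"json"))  // parse the JSON array
  .withColumn("json", explode($"array"))        // one row per array element
  .select("json.*")

streamDF.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint") // placeholder path
  .start()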
This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a format suitable for from_json, and applied explode and the * operator in the last steps of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
You can wrap your schema in an ArrayType, and from_json will parse the data into an array of structs.
var schema = ArrayType(StructType(
Array(
StructField("key1", DataTypes.StringType),
StructField("key2", DataTypes.StringType)
)))
Explode it to get one JSON array element per row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the json keys
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
I have retrieved a table from SQL Server which contains over 3 million records.
Top 10 Records:
+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014| MH43AJ411| 20000000|
| 10003014| MH43AJ411| 20000001|
| 10003015| MH12GZ3392| 20000002|
| 10003016| GJ15Z8173| 20000003|
| 10003018| MH05AM902| 20000004|
| 10003019| GJ15CD7657| 20001866|
| 10003019| MH02BY7774| 20000005|
| 10003019| MH02DG7774| 20000933|
| 10003019| GJ15CA7387| 20001865|
| 10003019| GJ15CB9601| 20001557|
+---------+-------------+----------+
only showing top 10 rows
Here the same ACCOUNTNO can have more than one VEHICLENUMBER, and for each vehicle there is a unique CUSTOMERID with respect to that VEHICLENUMBER.
I want to export this in JSON format.
This is my code to achieve the output:
package com.issuer.pack2.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
object sqltojson {
def main(args:Array[String])
{
System.setProperty("hadoop.home.dir", "C:/winutil/")
val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"
val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr,"dbtable" -> jdbcDbTable)).load()
// jdbcDF.show(10)
jdbcDF.registerTempTable("tp_customer_account")
val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO ")
// res01.show(10)
res01.coalesce(1).write.json("D:/res01.json")
}
}
The output I got:
{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}
{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000}
{"ACCOUNTNO":10003015,"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}
{"ACCOUNTNO":10003016,"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":20000003}
{"ACCOUNTNO":10003018,"VEHICLENUMBER":"MH05AM902","CUSTOMERID":20000004}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":20000005}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":20001865}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":20001866}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":20000933}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":20001557}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7387","CUSTOMERID":20029961}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CF7747","CUSTOMERID":20009020}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB727","CUSTOMERID":20000008}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7837","CUSTOMERID":20001223}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7477","CUSTOMERID":20001690}
{"ACCOUNTNO":10003020,"VEHICLENUMBER":"MH01AX5658","CUSTOMERID":20000006}
{"ACCOUNTNO":10003021,"VEHICLENUMBER":"GJ15AD727","CUSTOMERID":20000007}
{"ACCOUNTNO":10003023,"VEHICLENUMBER":"GU15PP7567","CUSTOMERID":20000009}
{"ACCOUNTNO":10003024,"VEHICLENUMBER":"GJ15CA7567","CUSTOMERID":20000010}
{"ACCOUNTNO":10003025,"VEHICLENUMBER":"GJ5JB9312","CUSTOMERID":20000011}
But I want to get JSON output in the format below. I have written this JSON manually for the first three records of the table above (maybe I have designed it wrongly; the point is that ACCOUNTNO should be unique).
{
"ACCOUNTNO":10003014,
"VEHICLE": [
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
],
"ACCOUNTNO":10003015,
"VEHICLE": [
{ "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
]
}
So, how can I achieve this JSON format using Spark code?
Scala spark-sql
You can do the following (instead of registerTempTable you can use createOrReplaceTempView, since registerTempTable is deprecated):
jdbcDF.createOrReplaceTempView("tp_customer_account")
val res01 = sqlContext.sql("SELECT ACCOUNTNO, collect_list(struct(`VEHICLENUMBER`, `CUSTOMERID`)) as VEHICLE FROM tp_customer_account GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO ")
res01.coalesce(1).write.json("D:/res01.json")
You should get your desired output as
{"ACCOUNTNO":"10003014","VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000000"},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000001"}]}
{"ACCOUNTNO":"10003015","VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":"20000002"}]}
{"ACCOUNTNO":"10003016","VEHICLE":[{"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":"20000003"}]}
{"ACCOUNTNO":"10003018","VEHICLE":[{"VEHICLENUMBER":"MH05AM902","CUSTOMERID":"20000004"}]}
{"ACCOUNTNO":"10003019","VEHICLE":[{"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":"20001866"},{"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":"20000005"},{"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":"20000933"},{"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":"20001865"},{"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":"20001557"}]}
Scala spark API
Using the Spark Scala API, you can do the following:
import org.apache.spark.sql.functions._
val res01 = jdbcDF.groupBy("ACCOUNTNO")
.agg(collect_list(struct("VEHICLENUMBER", "CUSTOMERID")).as("VEHICLE"))
res01.coalesce(1).write.json("D:/res01.json")
You should get the same answer as with the SQL way.
I hope the answer is helpful.
I have an RDD[Row]:
| itemId | Country | Type  |
|--------|---------|-------|
| 11     | US      | Movie |
| 11     | US      | TV    |
| 101    | France  | Movie |
How do I group by itemId so that I can save the result as a list of JSON documents, where each itemId becomes a separate JSON object:
{"itemId" : 11,
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101,
"Country": {"France" :1 },"Type": {"Movie" :1} }
I tried:
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
The input is a list of JSON documents, one per line, which I map to the POJO class CountryInfo using MappingUtils, which takes care of JSON parsing and conversion:
val MappingsList = input.map(x=> {
val countryInfo = MappingUtils.getCountryInfoString(x);
(countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
case Some(s) => s
}
val events = sqlContext.sql("select itemId from EventList")

val itemList = events.map(row => {
  val itemId = row.getAs[String](0);
  val countryInfo = showCountryInfo(MappingsList.get(itemId));
  val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
  val contentType = countryInfo.getType()
  Row(itemId, country, contentType)
})
Can someone let me know how I can achieve this?
Thank you!
I can't afford the extra time to complete this, but can give you a start.
The idea is that you aggregate the RDD[Row] down into a single Map that represents your JSON structure. Aggregation is a fold that requires two function parameters:
seqOp: how to fold a collection of elements into the target type
combOp: how to merge two of the target types
The tricky part comes in combOp while merging, as you need to accumulate the counts of values seen in the seqOp. I have left this as an exercise, as I have a plane to catch! Hopefully someone else can fill in the gaps if you have trouble (a sketch of the missing merge step follows the code).
case class Row(id: Int, country: String, tpe: String)
def foo: Unit = {
val rows: RDD[Row] = ???
def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
acc.get(r.id) match {
case None => acc.updated(r.id, (Map(r.country -> 1), Map(r.tpe -> 1)))
case Some((countries, types)) =>
val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
acc.updated(r.id, (countries_, types_))
}
}
val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]
def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])], r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
l.foldLeft(z) { case (acc, (id, (countries, types))) =>
r.get(id) match {
case None => acc.updated(id, (countries, types))
case Some((otherCountries, otherTypes)) =>
// todo - continue by merging countries with otherCountries
// and types with otherTypes, then update acc
}
}
}
val summaryMap = rows.aggregate(z)(seqOp, combOp)
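For completeness, a hedged sketch (not part of the original answer) of the missing merge step: a small helper that sums the per-key counts, which the Some branch of combOp can use. Note that combOp as written also drops ids that appear only in r, so those entries need to be carried over as well.

// merge two count maps by summing the values of matching keys
def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v) }

// inside combOp's Some branch, something like:
//   acc.updated(id, (mergeCounts(countries, otherCountries),
//                    mergeCounts(types, otherTypes)))
// and afterwards add the ids that exist only in `r`, e.g. (r -- l.keySet) ++ acc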