Scala: How to do GroupBy sum for String values?

I have RDD[Row] :
+--------+---------+-------+
| itemId | Country | Type  |
+--------+---------+-------+
| 11     | US      | Movie |
| 11     | US      | TV    |
| 101    | France  | Movie |
+--------+---------+-------+
How can I group by itemId and save the result as a list of JSON, where each grouped itemId (from the rows in the RDD) becomes a separate JSON object:
{"itemId" : 11,
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101,
"Country": {"France" :1 },"Type": {"Movie" :1} }
I tried :
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
The input is a list of JSON records, one per line. Each line is mapped to the POJO class CountryInfo using MappingUtils, which takes care of JSON parsing and conversion:
val MappingsList = input.map(x => {
  val countryInfo = MappingUtils.getCountryInfoString(x)
  (countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
  case Some(s) => s
}
val events = sqlContext.sql("select itemId from EventList")
val itemList = events.map(row => {
  val itemId = row.getAs[String](0)
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
  val itemType = countryInfo.getType()
  Row(itemId, country, itemType)
})
Can someone let me know how I can achieve this?
Thank you!

I can't afford the extra time to complete this, but can give you a start.
The idea is that you aggregate the RDD[Row] down into a single Map that represents your JSON structure. Aggregation is a fold that requires two function parameters:
seqOp: how to fold a collection of elements into the target type
combOp: how to merge two values of the target type
The tricky part comes in combOp while merging, as you need to accumulate the counts of values seen in the seqOp. I have left this as an exercise, as I have a plane to catch! Hopefully someone else can fill in the gaps if you have trouble.
case class Row(id: Int, country: String, tpe: String)

def foo: Unit = {
  val rows: RDD[Row] = ???

  def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
    acc.get(r.id) match {
      case None => acc.updated(r.id, (Map(r.country -> 1), Map(r.tpe -> 1)))
      case Some((countries, types)) =>
        val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
        val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
        acc.updated(r.id, (countries_, types_))
    }
  }

  val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]

  def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])],
             r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
    l.foldLeft(z) { case (acc, (id, (countries, types))) =>
      r.get(id) match {
        case None => acc.updated(id, (countries, types))
        case Some((otherCountries, otherTypes)) =>
          // todo - continue by merging countries with otherCountries
          // and types with otherTypes, then update acc
          acc
      }
    }
  }

  val summaryMap = rows.aggregate(z)(seqOp, combOp)
}
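Since the merge was left as an exercise, here is one way the combOp body could be completed (my sketch, not part of the original answer). It merges the per-id count maps key by key, and folds l into r so that ids appearing only in r are kept as well:

// Merge two count maps by summing the counts of overlapping keys.
def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (m, (k, n)) => m.updated(k, m.getOrElse(k, 0) + n) }

def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])],
           r: Map[Int, (Map[String, Int], Map[String, Int])]) =
  l.foldLeft(r) { case (acc, (id, (countries, types))) =>
    acc.get(id) match {
      case None => acc.updated(id, (countries, types))
      case Some((otherCountries, otherTypes)) =>
        acc.updated(id, (mergeCounts(countries, otherCountries), mergeCounts(types, otherTypes)))
    }
  }

Each entry of summaryMap can then be rendered as one JSON object (the itemId plus the Country and Type count maps) with whatever JSON library you already use.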

Related

Merge JSON column names case-insensitively

My JSON column names are a mix of lowercase and uppercase (e.g. title/Title and name/Name), so in the output I get name and Name as two different columns (and similarly title and Title).
How can I make the JSON column names case-insensitive?
I tried config("spark.sql.caseSensitive", "true"), but it is not working.
import spark.implicits._
val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
outputDF.show(false)
Current output:
Expected/Needed output (column names to be case-insensitive):
Solution 1
You can group the columns by their lowercase names and merge them using the coalesce function:
// set spark.sql.caseSensitive to true to avoid ambiguity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
  .select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
  .map(grp =>
    if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
    else col(grp(0))
  ).toSeq
val outputDF = df1.select(mergedCols: _*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
Solution 2
Another way is to parse the JSON string column into MapType instead of StructType; with transform_keys you can lowercase the keys, then explode the map and pivot to get the columns:
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
  "col_json",
  from_json(col("col_json"), MapType(StringType, StringType))
).select(
  col("col_1"),
  col("col_2"),
  explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
 .pivot("key")
 .agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
Note that transform_keys is only available since Spark 3; for older versions you can use a UDF:
val mapKeysToLower = udf((m: Map[String, String]) => {
m.map { case (k, v) => k.toLowerCase -> v }
})
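With that UDF in place, the Spark 2.x version of the same pipeline could look roughly like this (a sketch, with mapKeysToLower simply standing in for the transform_keys expression above):

val outputDF = df.withColumn(
  "col_json",
  from_json(col("col_json"), MapType(StringType, StringType))
).select(
  col("col_1"),
  col("col_2"),
  explode(mapKeysToLower(col("col_json")))  // UDF instead of transform_keys
).groupBy("col_1", "col_2")
 .pivot("key")
 .agg(first("value"))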
You will need to merge your columns, using something like:
import org.apache.spark.sql.functions.when
val dfMerged = df
  .withColumn("title", when($"title".isNull, $"Title").otherwise($"title"))
  .drop("Title")

Scala - Convert Dataframe into json file and make key from both column name and column value for different columns

I want to convert a Spark DataFrame into a JSON file. Below are the input and output formats.
Any help is appreciated.
Input :
+----+---+-----+----------+
|Name|Age|City |Data      |
+----+---+-----+----------+
|Ram |30 |Delhi|[A -> ABC]|
|Shan|25 |Delhi|[X -> XYZ]|
|Riya|12 |U.P. |[M -> MNO]|
+----+---+-----+----------+
Output :
{"Name":"Ram","Age":"30","City":"Delhi","Delhi":{"A":"ABC"}}
{"Name":"Shan","Age":"25","City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Riya","Age":"12","City":"U.P.","U.P.":{"M":"MNO"}}
Scala: Starting from your data,
val df = Seq(("Ram",30,"Delhi",Map("A" -> "ABC")), ("Shan",25,"Delhi",Map("X" -> "XYZ")), ("Riya",12,"U.P.",Map("M" -> "MNO"))).toDF("Name", "Age", "City", "Data")
df.show
// +----+---+-----+----------+
// |Name|Age| City| Data|
// +----+---+-----+----------+
// | Ram| 30|Delhi|[A -> ABC]|
// |Shan| 25|Delhi|[X -> XYZ]|
// |Riya| 12| U.P.|[M -> MNO]|
// +----+---+-----+----------+
To use the City value as the key instead of Data,
val df2 = df.groupBy("Name", "Age", "City").pivot("City").agg(first("Data"))
df2.show
// +----+---+-----+----------+----------+
// |Name|Age| City| Delhi| U.P.|
// +----+---+-----+----------+----------+
// |Riya| 12| U.P.| null|[M -> MNO]|
// |Shan| 25|Delhi|[X -> XYZ]| null|
// | Ram| 30|Delhi|[A -> ABC]| null|
// +----+---+-----+----------+----------+
Then make the JSON by using toJSON and collect.
val jsonArray = df2.toJSON.collect
jsonArray.foreach(println)
It will print a result such as:
{"Name":"Riya","Age":12,"City":"U.P.","U.P.":{"M":"MNO"}}
{"Name":"Shan","Age":25,"City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Ram","Age":30,"City":"Delhi","Delhi":{"A":"ABC"}}
You can call write.json on DataFrame.
val df: DataFrame = ....
df.write.json("/jsonFilePath")
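If you need a single JSON file rather than one part file per partition, and the data is small enough to sit in one partition, one option is to coalesce before writing:

// Write a single part-*.json file (only sensible for data that fits in one partition)
df.coalesce(1).write.mode("overwrite").json("/jsonFilePath")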
Here is an example using Datasets:
scala> case class Data(key: String, value: String)
scala> case class Person(name: String, age: Long, city: String, data: Data)
scala> val peopleDS = Seq(Person("Ram", 30, "Delhi", Data("A", "ABC")), Person("Shan", 25, "Delhi", Data("X", "XYZ")), Person("Riya", 12, "U.P", Data("M", "MNO"))).toDS()
scala> peopleDS.show()
+----+---+-----+--------+
|name|age| city| data|
+----+---+-----+--------+
| Ram| 30|Delhi|[A, ABC]|
|Shan| 25|Delhi|[X, XYZ]|
|Riya| 12| U.P|[M, MNO]|
+----+---+-----+--------+
scala> peopleDS.write.json("pathToData/people")
Then you would find the written JSON files in the given folder.
> cd pathToData/people
> ls -l
part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
> cat part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
{"name":"Ram","age":30,"city":"Delhi","data":{"key":"A","value":"ABC"}}
{"name":"Shan","age":25,"city":"Delhi","data":{"key":"X","value":"XYZ"}}
{"name":"Riya","age":12,"city":"U.P","data":{"key":"M","value":"MNO"}}

Array of JSON to Dataframe in Spark received by Kafka

I'm writing a Spark application in Scala using Spark Structured Streaming that receives JSON-formatted data from Kafka. The application can receive either a single JSON object or multiple objects formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  ))
But it doesn't work.
My actual code for parsing JSON:
var data = (this.stream).getStreamer().load()
.selectExpr("CAST (value AS STRING) as json")
.select(from_json($"json",schema=schema).as("data"))
I would like to get this JSON objects in a dataframe like
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Anyone can help me please?
Thank you!
As your incoming string is an array of JSON, one way is to write a UDF to parse the array and then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same can be used for streaming with minimal changes.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser {

  // case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("JSON")
      .master("local")
      .getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // sample JSON array String coming from Kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    // UDF to parse the JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json")                      // json String column
      .withColumn("array", jsonConverter($"json")) // parse the JSON array
      .withColumn("json", explode($"array"))       // explode the array
      .drop("array")                               // drop unwanted columns
      .select("json.*")                            // expand the JSON struct into separate columns

    // display the DF
    df.show()
    //+------+------+
    //|  key1|  key2|
    //+------+------+
    //|value1|value2|
    //|value3|value4|
    //+------+------+
  }
}
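For the streaming case mentioned above, the same jsonConverter UDF can be plugged into a Kafka source. A rough sketch (the broker address and topic name are placeholders, and spark, the implicits, functions._ and jsonConverter from the snippet above are assumed to be in scope):

// Structured Streaming variant: read from Kafka, parse the JSON array with
// the same UDF, explode it, and write the flattened rows to the console sink.
val streamDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder broker
  .option("subscribe", "my-topic")                // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) as json")
  .withColumn("array", jsonConverter($"json"))
  .withColumn("json", explode($"array"))
  .drop("array")
  .select("json.*")

streamDF.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()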
This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a suitable format for from_json, and applied explode and the * operator in the last step of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
You can wrap your schema in ArrayType, and from_json will then parse the data as an array of JSON objects.
var schema = ArrayType(StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  )))
Explode it to get each JSON array element in its own row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the json keys
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+

Better way to extract json map into struct

I'm new to Elixir and I want to parse a JSON file. One of the parts is a question/answer array of objects.
[
{
"questionId":1,
"question":"Information: Personal Information: First Name",
"answer":"Joe"
},
{
"questionId":3,
"question":"Information: Personal Information: Last Name",
"answer":"Smith"
},
...
]
I know which questionIds I want, and I'm going to make a map for 1 = First Name, 2 = Last Name.
But currently I'm doing the following to put the data into the struct.
defmodule Student do
  defstruct first_name: nil, last_name: nil, student_number: nil

  defguard is_first_name(id) when id == 1
  defguard is_last_name(id) when id == 3
  defguard is_student_number(id) when id == 7
end
defmodule AFMC do
  import Student

  @moduledoc """
  Documentation for AFMC.
  """

  @doc """
  Hello world.

  ## Examples

      iex> AFMC.hello
      :world

  """
  def main do
    get_json()
    |> get_outgoing_applications
  end

  def get_json do
    with {:ok, body} <- File.read("./lib/afmc_import.txt"),
         {:ok, body} <- Poison.Parser.parse(body), do: {:ok, body}
  end

  def get_outgoing_applications(map) do
    {:ok, body} = map
    out_application = get_in(body, ["outgoingApplications"])

    Enum.at(out_application, 0)
    |> get_in(["answers"])
    |> get_person
  end

  def get_person(answers) do
    student = Enum.reduce(answers, %Student{}, fn(answer, acc) ->
      if Student.is_first_name(answer["questionId"]) do
        acc = %{acc | first_name: answer["answer"]}
      end

      if Student.is_last_name(answer["questionId"]) do
        acc = %{acc | last_name: answer["answer"]}
      end

      if Student.is_student_number(answer["questionId"]) do
        acc = %{acc | student_number: answer["answer"]}
      end

      acc
    end)

    IO.inspect "test"
    student
  end
end
I'm wondering what a better way is to write get_person without the if statements, given that I know which questionIds map to which fields in the array of objects.
The data will then be saved into a DB.
Thanks
I'd store a mapping of id to field name. With that you don't need any if inside the reduce. Some pattern matching will also make it unnecessary to do answer["questionId"] etc.
defmodule Student do
  defstruct first_name: nil, last_name: nil, student_number: nil

  @fields %{
    1 => :first_name,
    3 => :last_name,
    7 => :student_number
  }

  def parse(answers) do
    Enum.reduce(answers, %Student{}, fn %{"questionId" => id, "answer" => answer}, acc ->
      %{acc | @fields[id] => answer}
    end)
  end
end

IO.inspect(
  Student.parse([
    %{"questionId" => 1, "question" => "", "answer" => "Joe"},
    %{"questionId" => 3, "question" => "", "answer" => "Smith"},
    %{"questionId" => 7, "question" => "", "answer" => "123"}
  ])
)
Output:
%Student{first_name: "Joe", last_name: "Smith", student_number: "123"}
Edit: to skip ids not present in the map, change:
%{acc | @fields[id] => answer}
to:
if field = @fields[id], do: %{acc | field => answer}, else: acc

Scala Map Option

I have a map in Scala like this:
val someData = Some(Map(genderKey -> gender,agekey -> age))
How to get the output as:
val key= genderkey
val value= gender
val key2 = agekey (Dynamic variable name)
val value2= age (Dynamic variable name)
Like this
someData
  .map(_.map { case (k, v) => s"$k = $v" }.mkString(" and \n"))
  .foreach(result => println(result))
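For example, with hypothetical keys and values plugged in, the snippet above would print something like this:

val someData = Some(Map("gender" -> "male", "age" -> 30)) // hypothetical keys and values

someData
  .map(_.map { case (k, v) => s"$k = $v" }.mkString(" and \n"))
  .foreach(println)
// prints:
// gender = male and
// age = 30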