I have a Map in Scala like this:
val someData = Some(Map(genderKey -> gender, agekey -> age))
How can I get the output as:
val key = genderKey
val value = gender
val key2 = agekey (dynamic variable name)
val value2 = age (dynamic variable name)
Like this:
someData.map(_.map {
    case (k, v) => s"$k = $v"
  }
  .mkString(" and \n"))
  .foreach(result => println(result))
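For a runnable end-to-end illustration, here is the same idea with placeholder keys and values (genderKey, gender, agekey and age are assumed to be plain strings here):
object MapToKeyValueDemo extends App {
  // Placeholder keys and values, purely for illustration
  val genderKey = "genderKey"
  val gender    = "male"
  val agekey    = "agekey"
  val age       = "30"

  val someData = Some(Map(genderKey -> gender, agekey -> age))

  someData
    .map(_.map { case (k, v) => s"$k = $v" }.mkString(" and \n"))
    .foreach(println)
  // prints:
  // genderKey = male and
  // agekey = 30
}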
My JSON column names are a mix of lower and upper case (e.g. title/Title and name/Name), so in the output I am getting name and Name as two different columns (and similarly title and Title).
How can I make the JSON columns case-insensitive?
I tried config("spark.sql.caseSensitive", "true"), but it is not working.
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json}

val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")

val col_schema = spark.read.json(df.select("col_json").as[String]).schema

val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
  .select("col_1", "col_2", "new_col.*")

outputDF.show(false)
Current output:
Expected/Needed output (column names to be case-insensitive):
Solution 1
You can group the columns by their lowercase names and merge them using the coalesce function:
// set spark.sql.caseSensitive to true to avoid ambiguity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
.map(grp =>
if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
else col(grp(0))
).toSeq
val outputDF = df1.select(mergedCols:_*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
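For intuition, here is what the grouping step sees, using the column names from this example (plain Scala, just an illustration):
val cols = Seq("col_1", "col_2", "Name", "Address", "title", "name", "Title")

// Case-variants of the same logical column land in the same group,
// e.g. name -> List(Name, name), title -> List(title, Title)
val groups = cols.groupBy(_.toLowerCase)

// Only groups with more than one member need coalesce; the rest pass through as-is
groups.values.filter(_.size > 1).foreach(println)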
Solution 2
Another way is to parse the JSON string column into a MapType instead of a StructType. Using transform_keys you can lowercase the keys, then explode the map and pivot to get the columns:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
"col_json",
from_json(col("col_json"), MapType(StringType, StringType))
).select(
col("col_1"),
col("col_2"),
explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
.pivot("key")
.agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
For this solution, note that transform_keys is only available since Spark 3; for older versions, you can use a UDF:
val mapKeysToLower = udf((m: Map[String, String]) => {
m.map { case (k, v) => k.toLowerCase -> v }
})
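A sketch of how the UDF can replace the transform_keys expression in the pipeline above; the rest of the chain stays the same:
// Lowercase the map keys with the UDF instead of transform_keys,
// then explode / pivot exactly as in the Spark 3 version
val outputDF = df.withColumn(
    "col_json",
    from_json(col("col_json"), MapType(StringType, StringType))
  ).select(
    col("col_1"),
    col("col_2"),
    explode(mapKeysToLower(col("col_json")))
  ).groupBy("col_1", "col_2")
  .pivot("key")
  .agg(first("value"))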
You will then need to merge your columns, using something like:
import org.apache.spark.sql.functions.when
val merged = df.withColumn("title", when($"title".isNull, $"Title").otherwise($"title")).drop("Title")
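Equivalently, coalesce performs the same null-based merge more compactly; a small sketch:
import org.apache.spark.sql.functions.coalesce

// Keep the first non-null of the two case-variants, then drop the duplicate
val merged = df.withColumn("title", coalesce($"title", $"Title")).drop("Title")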
I'm writing a Spark application in Scala using Spark Structured Streaming that receives JSON-formatted data from Kafka. The application may receive either a single JSON object or multiple objects formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  ))
But it doesn't work.
My actual code for parsing JSON:
var data = (this.stream).getStreamer().load()
  .selectExpr("CAST (value AS STRING) as json")
  .select(from_json($"json", schema = schema).as("data"))
I would like to get these JSON objects in a DataFrame like:
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Can anyone help me, please? Thank you!
As your incoming string is an array of JSON objects, one way is to write a UDF to parse the array, then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same can be used for streaming with minimal changes.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser {

  // case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.
      builder().
      appName("JSON").
      master("local").
      getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // sample JSON array String coming from Kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    // UDF to parse the JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json")                        // json String column
      .withColumn("array", jsonConverter($"json"))   // parse the JSON array
      .withColumn("json", explode($"array"))         // explode the array
      .drop("array")                                 // drop unwanted columns
      .select("json.*")                              // expand the struct to separate columns

    // display the DF
    df.show()
    //+------+------+
    //|  key1|  key2|
    //+------+------+
    //|value1|value2|
    //|value3|value4|
    //+------+------+
  }
}
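Since the question is about Structured Streaming, a rough sketch of the streaming variant (the Kafka bootstrap servers, topic name and output sink here are placeholders, not values from the question):
// Same parsing logic on a streaming DataFrame read from Kafka
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder
  .option("subscribe", "my-topic")                 // placeholder
  .load()

val parsed = kafkaDF
  .selectExpr("CAST(value AS STRING) as json")
  .withColumn("array", jsonConverter($"json"))
  .withColumn("json", explode($"array"))
  .drop("array")
  .select("json.*")

parsed.writeStream
  .format("console")  // placeholder sink
  .start()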
This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a format suitable for from_json, and applied explode and the * operator in the last step of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
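If you prefer not to hard-code the schema string, from_json also accepts the schema as a column (available since Spark 2.4), so the schema value computed above can be reused directly; a sketch:
import org.apache.spark.sql.functions.{col, explode, from_json, lit}

// `schema` is the string produced by schema_of_json above
val data = df
  .select(from_json(col("json"), lit(schema)).as("parsed_json"))
  .selectExpr("explode(parsed_json) as json")
  .select("json.*")

data.show()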
You can add ArrayType to your schema and from_json will then convert the data into an array of structs.
var schema = ArrayType(StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  )))
Explode it to get each JSON array element in its own row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the json keys
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
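Applied to the streaming source from the question (assuming stream.getStreamer().load() yields the Kafka DataFrame as in the question's code), the same schema gives the desired flat columns:
val data = (this.stream).getStreamer().load()
  .selectExpr("CAST(value AS STRING) as json")
  .withColumn("jsonData", explode(from_json($"json", schema)))
  .select("jsonData.*")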
How can I convert the following Map structure, which is a Map[String,Any], to JSON in Scala? I am using Play.
val result = s
.groupBy(_.dashboardId)
.map(
each => Map(
"dashboardId" -> each._1,
"cubeId" -> each._2.head.cubeid,
"dashboardName" -> each._2.head.dashboardName,
"reports" -> each._2.groupBy(_.reportId).map(
reportEach => Map(
"reportId" -> reportEach._1,
"reportName" -> reportEach._2.head.reportName,
"selectedColumns" -> reportEach._2.groupBy(_.selectedColumnid).map(
selectedColumnsEach => Map(
"selectedColumnId" -> selectedColumnsEach._1,
"columnName" ->
selectedColumnsEach._2.head.selectColumnName.orNull,
"seq" ->selectedColumnsEach._2.head.selectedColumnSeq,
"formatting" -> selectedColumnsEach._2
)
)
)
)
)
)
You cannot convert a Map[String, Any] to JSON, but you can convert a Map[String, String] or a Map[String, JsValue].
In your case, you can do this by converting each map value to a JsValue beforehand:
Map(
"dashboardId" -> Json.toJson(each._1),
"cubeId" -> Json.toJson(each._2.head.cubeid),
"dashboardName" -> Json.toJson(each._2.head.dashboardName),
"reports" -> Json.toJson(each._2.groupBy(_.reportId).map(
reportEach => Map(
"reportId" -> Json.toJson(reportEach._1),
"reportName" -> (reportEach._2.find(_.reportName != null) match {
case Some(reportNameData) => Json.toJson(reportNameData.reportName)
case None => JsNull
})),
...
)
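As a minimal standalone illustration (with placeholder keys and values), a Map[String, JsValue] serializes directly with Json.toJson:
import play.api.libs.json._

val m: Map[String, JsValue] = Map(
  "dashboardId"   -> JsNumber(1),        // placeholder value
  "dashboardName" -> JsString("sales"),  // placeholder value
  "reportName"    -> JsNull
)

println(Json.stringify(Json.toJson(m)))
// {"dashboardId":1,"dashboardName":"sales","reportName":null}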
I read the results into a Seq[Map[String,Any]] using .toSeq and then used toJson to convert it to JSON; this worked.
val s = new SaveTemplate getReportsWithDashboardId(dashboardId)
val result : Seq[Map[String,Any]] = s.groupBy(_.dashboardId)
.map(
each => Map(
"dashboardId" -> each._1,
"cubeId" -> each._2.head.cubeid,
"dashboardName" -> each._2.head.dashboardName,
"reports" -> each._2.groupBy(_.reportId).map(
reportEach => Map(
"reportId" -> reportEach._1,
"reportName" -> (reportEach._2.find(_.reportName != null) match {
case Some(reportNameData) => reportNameData.reportName
case None => null
}),
"selectedColumns" -> reportEach._2.groupBy(_.selectedColumnid).map(
selectedColumnsEach => Map(
"selectedColumnId" -> selectedColumnsEach._1,
"columnName" -> selectedColumnsEach._2.head.selectColumnName.orNull,
"seq" ->selectedColumnsEach._2.head.selectedColumnSeq,
"formatting" -> Map(
"formatId" -> (selectedColumnsEach._2.find(_.formatId != null) match {
case Some(reportNameData) => reportNameData.formatId
case None => null
}),
"formattingId" -> (selectedColumnsEach._2.find(_.formattingid != null)
match {
case Some(reportNameData) => reportNameData.formattingid
case None => null
}),
"type" -> (selectedColumnsEach._2.find(_.formattingType != null) match
{
case Some(reportNameData) => reportNameData.formattingType
case None => null
})
)
)
)
)
)
)
).toSeq
val k = toJson(result)
Ok(k)
My MongoDB abstract DAO is defined as follows:
abstract class DbMongoDAO1[K, T <: Keyable[K]] (implicit val manifestT: Manifest[T], val manifestK: Manifest[K])
extends DbDAO[K, T]
with DbDAOExtensions[K, T]
with MongoConnection2
with JsonDbImplicits
{
val thisClass = manifestT.runtimeClass
val simpleName = thisClass.getSimpleName
lazy val collection = db.getCollection(s"${DbMongoDAO1.tablePrefix}$simpleName")
override def insertNew(r:T): Result[String,T] = {
val json: String = r.toJson.compactPrint
collection.insertOne(Document(json))
KO("Not Implemented")
}
}
I'm getting an error on the following line when converting a case class to JSON:
val json: String = r.toJson.compactPrint
Error:(31, 26) value toJson is not a member of type parameter T
The trait JsonDbImplicits is as follows
trait JsonDbImplicits extends DefaultJsonProtocol
with SprayJsonSupport with JodaImplicits {
implicit val json_UserEmail:RootJsonFormat[UserEmail] = jsonFormat5(UserEmail)
implicit val json_UserProfile:RootJsonFormat[UserProfile] = jsonFormat13(UserProfile)
implicit val json_UserSession:RootJsonFormat[UserSession] = jsonFormat5(UserSession)
}
The case classes UserEmail and UserProfile are defined as follows
case class UserEmail
(
// it is the full email address
#Key("_id") id: String
, account_id: String
, active: Boolean = false
, ts_created: DateTime = now
, ts_updated: DateTime = now
) extends Keyable[String]
trait DbUserEmail extends DbMongoDAO1[String,UserEmail]
and
case class UserProfile
(
// id is the same as AccountId
#Key("_id") id: String = UUID.randomUUID().toString
, gender: Option[String] = None
, first_name: Option[String] = Some("")
, last_name: Option[String] = Some("")
, yob: Option[Int] = None
, kids: Option[Int] = None
, income: Option[Int] = None
, postcode: Option[String] = None
, location: Option[Boolean] = Some(true)
, opt_in: Option[Boolean] = Some(true)
, third_party: Option[Boolean] = Some(true)
, ts_created: DateTime = now
, ts_updated: DateTime = now
) extends Keyable[String]
trait DbUserProfile extends DbMongoDAO1[String,UserProfile]
What am I missing?
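For context, spray-json's toJson syntax comes from an implicit enrichment that needs a JsonWriter (for example a RootJsonFormat) for the concrete type in scope, which a bare type parameter T does not supply. A minimal standalone sketch of that mechanism, using a simplified placeholder case class rather than the original types:
import spray.json._
import DefaultJsonProtocol._

// Simplified placeholder type, standing in for UserEmail / UserProfile etc.
case class UserSessionLite(id: String, token: String)
implicit val userSessionLiteFormat: RootJsonFormat[UserSessionLite] = jsonFormat2(UserSessionLite)

// A generic method can only call toJson if it demands JsonWriter evidence for T
def toJsonString[T: JsonWriter](r: T): String = r.toJson.compactPrint

println(toJsonString(UserSessionLite("1", "abc")))
// {"id":"1","token":"abc"}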
I have an RDD[Row]:
|---itemId----|----Country-------|---Type----------|
| 11 | US | Movie |
| 11 | US | TV |
| 101 | France | Movie |
How can I group by itemId, counting occurrences of Country and Type, so that I can save the result as a list of JSON objects with one object per itemId, like this:
{"itemId" : 11,
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101,
"Country": {"France" :1 },"Type": {"Movie" :1} }
I tried:
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
The input is a list of JSONs where each line is a JSON object that I map to the POJO class CountryInfo using MappingUtils, which takes care of JSON parsing and conversion:
val MappingsList = input.map(x => {
  val countryInfo = MappingUtils.getCountryInfoString(x)
  (countryInfo.getItemId(), countryInfo)
}).collectAsMap

MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]

def showCountryInfo(x: Option[CountryInfo]) = x match {
  case Some(s) => s
}

val events = sqlContext.sql("select itemId from EventList")

val itemList = events.map(row => {
  val itemId = row.getAs[String](1)
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
  val contentType = countryInfo.getType()
  Row(itemId, country, contentType)
})
Can someone let me know how I can achieve this? Thank you!
I can't afford the extra time to complete this, but can give you a start.
The idea is that you aggregate the RDD[Row] down into a single Map that represents your JSON structure. Aggregation is a fold that requires two function parameters:
seqOp: how to fold a collection of elements into the target type
combOp: how to merge two values of the target type
The tricky part comes in combOp while merging, as you need to accumulate the counts of values seen in the seqOp. I have left this as an exercise, as I have a plane to catch! Hopefully someone else can fill in the gaps if you have trouble.
import org.apache.spark.rdd.RDD

case class Row(id: Int, country: String, tpe: String)

def foo: Unit = {
  val rows: RDD[Row] = ???

  def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
    acc.get(r.id) match {
      case None => acc.updated(r.id, (Map(r.country -> 1), Map(r.tpe -> 1)))
      case Some((countries, types)) =>
        val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
        val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
        acc.updated(r.id, (countries_, types_))
    }
  }

  val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]

  def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])], r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
    l.foldLeft(z) { case (acc, (id, (countries, types))) =>
      r.get(id) match {
        case None => acc.updated(id, (countries, types))
        case Some((otherCountries, otherTypes)) =>
          // todo - continue by merging countries with otherCountries
          // and types with otherTypes, then update acc
          ???
      }
    }
  }

  val summaryMap = rows.aggregate(z)(seqOp, combOp)
}
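To fill that gap, one possible completion of combOp (a sketch; the fold starts from r so that ids appearing only in r are also kept, and per-key counts are summed when an id appears on both sides):
def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])],
           r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
  // Fold l into r: ids only in r are already present, ids only in l are added,
  // and ids in both have their per-country / per-type counts summed
  l.foldLeft(r) { case (acc, (id, (countries, types))) =>
    acc.get(id) match {
      case None => acc.updated(id, (countries, types))
      case Some((otherCountries, otherTypes)) =>
        val mergedCountries = countries.foldLeft(otherCountries) {
          case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v)
        }
        val mergedTypes = types.foldLeft(otherTypes) {
          case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0) + v)
        }
        acc.updated(id, (mergedCountries, mergedTypes))
    }
  }
}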