My Json is a list of objects. I want to get the first one, but Any is making it difficult:
scala> import scala.util.parsing.json._
import scala.util.parsing.json._
scala> val str ="""
| [
| {
| "UserName": "user1",
| "Tags": "one, two, three"
| },
| {
| "UserName": "user2",
| "Tags": "one, two, three"
| }
| ]""".stripMargin
str: String =
"
[
{
"UserName": "user1",
"Tags": "one, two, three"
},
{
"UserName": "user2",
"Tags": "one, two, three"
}
]"
scala> val parsed = JSON.parseFull(str)
parsed: Option[Any] = Some(List(Map(UserName -> user1, Tags -> one, two, three), Map(UserName -> user2, Tags -> one, two, three)))
scala> parsed.getOrElse(0)
res0: Any = List(Map(UserName -> user1, Tags -> one, two, three), Map(UserName -> user2, Tags -> one, two, three))
scala> parsed.getOrElse(0)(0)
<console>:13: error: Any does not take parameters
parsed.getOrElse(0)(0)
How do I get the first element?
You need to pattern match the result (Option[Any]) down to a List[Map[String, String]].
1) Pattern match example:
scala> val parsed = JSON.parseFull("""[{"UserName":"user1","Tags":"one, two, three"},{"UserName":"user2","Tags":"one, two, three"}]""")
scala> parsed.map(users => users match { case usersList : List[Map[String, String]] => usersList(0) case _ => Option.empty })
res8: Option[Equals] = Some(Map(UserName -> user1, Tags -> one, two, three))
A better pattern match:
scala> parsed.map(_ match { case head :: tail => head case _ => Option.empty })
res13: Option[Any] = Some(Map(UserName -> user1, Tags -> one, two, three))
2) Alternatively, you can cast the result (Option[Any]), but this is not recommended, since the cast may throw a ClassCastException:
scala> parsed.map(_.asInstanceOf[List[Map[String, String]]](0))
res10: Option[Map[String,String]] = Some(Map(UserName -> user1, Tags -> one, two, three))
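Another option, shown here only as a hedged sketch rather than something from the answer above: Option.collect with a type pattern avoids the explicit cast, though the element type is still unchecked at runtime because of type erasure.
scala> val firstUser = parsed.collect { case users: List[Map[String, String]] @unchecked if users.nonEmpty => users.head }
// should yield: Some(Map(UserName -> user1, Tags -> one, two, three)), typed as Option[Map[String,String]]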
My JSON column names are a mix of lower and upper case (e.g. title/Title and name/Name), so in the output I get name and Name as two different columns (and similarly title and Title).
How can I make the JSON columns case-insensitive?
config("spark.sql.caseSensitive", "true") -> I tried this, but it is not working.
import spark.implicits._

val df = Seq(
  ("A", "B", "{\"Name\":\"xyz\",\"Address\":\"NYC\",\"title\":\"engg\"}"),
  ("C", "D", "{\"Name\":\"mnp\",\"Address\":\"MIC\",\"title\":\"data\"}"),
  ("E", "F", "{\"name\":\"pqr\",\"Address\":\"MNN\",\"Title\":\"bi\"}")
).toDF("col_1", "col_2", "col_json")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val outputDF = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
outputDF.show(false)
Current output:
Expected/Needed output (column names to be case-insensitive):
Solution 1
You can group the columns by their lowercase names and merge them using coalesce function:
// set spark.sql.caseSensitive to true to avoid ambiguity
spark.conf.set("spark.sql.caseSensitive", "true")
val col_schema = spark.read.json(df.select("col_json").as[String]).schema
val df1 = df.withColumn("new_col", from_json(col("col_json"), col_schema))
.select("col_1", "col_2", "new_col.*")
val mergedCols = df1.columns.groupBy(_.toLowerCase).values
.map(grp =>
if (grp.size > 1) coalesce(grp.map(col): _*).as(grp(0))
else col(grp(0))
).toSeq
val outputDF = df1.select(mergedCols:_*)
outputDF.show()
//+----+-------+-----+-----+-----+
//|Name|Address|col_1|Title|col_2|
//+----+-------+-----+-----+-----+
//|xyz |NYC |A |engg |B |
//|mnp |MIC |C |data |D |
//|pqr |MNN |E |bi |F |
//+----+-------+-----+-----+-----+
Solution 2
Another way is to parse the JSON string column into MapType instead of StructType; with transform_keys you can lowercase the keys, then explode the map and pivot to get columns:
import org.apache.spark.sql.types.{MapType, StringType}
val outputDF = df.withColumn(
"col_json",
from_json(col("col_json"), MapType(StringType, StringType))
).select(
col("col_1"),
col("col_2"),
explode(expr("transform_keys(col_json, (k, v) -> lower(k))"))
).groupBy("col_1", "col_2")
.pivot("key")
.agg(first("value"))
outputDF.show()
//+-----+-----+-------+----+-----+
//|col_1|col_2|address|name|title|
//+-----+-----+-------+----+-----+
//|E |F |MNN |pqr |bi |
//|C |D |MIC |mnp |data |
//|A |B |NYC |xyz |engg |
//+-----+-----+-------+----+-----+
Note that transform_keys is only available since Spark 3; for older versions you can use a UDF:
val mapKeysToLower = udf((m: Map[String, String]) => {
m.map { case (k, v) => k.toLowerCase -> v }
})
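A hedged sketch of how that UDF could slot into the same pipeline on Spark 2.x, replacing the transform_keys expression (mapKeysToLower and the MapType parsing are as defined above; this is an assumption, not part of the original answer):
val outputDF = df.withColumn(
    "col_json",
    from_json(col("col_json"), MapType(StringType, StringType))
  ).select(
    col("col_1"),
    col("col_2"),
    // exploding the lowercased map still yields the key/value columns used by the pivot below
    explode(mapKeysToLower(col("col_json")))
  ).groupBy("col_1", "col_2")
  .pivot("key")
  .agg(first("value"))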
You will need to merge your columns, using something like:
import org.apache.spark.sql.functions.when
val merged = df.withColumn("title", when($"title".isNull, $"Title").otherwise($"title")).drop("Title")
I want to convert a Spark DataFrame into a JSON file. Below are the input and output formats.
Any help is appreciated.
Input :
+----+---+-----+----------+
|Name|Age|City |Data      |
+----+---+-----+----------+
|Ram |30 |Delhi|[A -> ABC]|
|Shan|25 |Delhi|[X -> XYZ]|
|Riya|12 |U.P. |[M -> MNO]|
+----+---+-----+----------+
Output :
{"Name":"Ram","Age":"30","City":"Delhi","Delhi":{"A":"ABC"}}
{"Name":"Shan","Age":"25","City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Riya","Age":"12","City":"U.P.","U.P.":{"M":"MNO"}}
Scala: Starting from your data,
val df = Seq(("Ram",30,"Delhi",Map("A" -> "ABC")), ("Shan",25,"Delhi",Map("X" -> "XYZ")), ("Riya",12,"U.P.",Map("M" -> "MNO"))).toDF("Name", "Age", "City", "Data")
df.show
// +----+---+-----+----------+
// |Name|Age| City| Data|
// +----+---+-----+----------+
// | Ram| 30|Delhi|[A -> ABC]|
// |Shan| 25|Delhi|[X -> XYZ]|
// |Riya| 12| U.P.|[M -> MNO]|
// +----+---+-----+----------+
To use the City value as the key instead of Data, pivot on City:
val df2 = df.groupBy("Name", "Age", "City").pivot("City").agg(first("Data"))
df2.show
// +----+---+-----+----------+----------+
// |Name|Age| City| Delhi| U.P.|
// +----+---+-----+----------+----------+
// |Riya| 12| U.P.| null|[M -> MNO]|
// |Shan| 25|Delhi|[X -> XYZ]| null|
// | Ram| 30|Delhi|[A -> ABC]| null|
// +----+---+-----+----------+----------+
Then build the JSON by calling toJSON and collect on the pivoted dataframe.
val jsonArray = df2.toJSON.collect
jsonArray.foreach(println)
It will print results such as:
{"Name":"Riya","Age":12,"City":"U.P.","U.P.":{"M":"MNO"}}
{"Name":"Shan","Age":25,"City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Ram","Age":30,"City":"Delhi","Delhi":{"A":"ABC"}}
You can call write.json on DataFrame.
val df: DataFrame = ....
df.write.json("/jsonFilPath")
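write.json produces a directory of part files; if a single JSON file is needed and the data is small enough to fit in one partition, a common variation (an addition here, not part of the original answer) is to coalesce first:
df.coalesce(1).write.mode("overwrite").json("/jsonFilPath")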
Here is an example using Datasets:
scala> case class Data(key: String, value: String)
scala> case class Person(name: String, age: Long, city: String, data: Data)
scala> val peopleDS = Seq(Person("Ram", 30, "Delhi", Data("A", "ABC")), Person("Shan", 25, "Delhi", Data("X", "XYZ")), Person("Riya", 12, "U.P", Data("M", "MNO"))).toDS()
scala> peopleDS.show()
+----+---+-----+--------+
|name|age| city| data|
+----+---+-----+--------+
| Ram| 30|Delhi|[A, ABC]|
|Shan| 25|Delhi|[X, XYZ]|
|Riya| 12| U.P|[M, MNO]|
+----+---+-----+--------+
scala> peopleDS.write.json("pathToData/people")
Then you will find the written JSON files in the given folder.
> cd pathToData/people
> ls -l
part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
> cat part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
{"name":"Ram","age":30,"city":"Delhi","data":{"key":"A","value":"ABC"}}
{"name":"Shan","age":25,"city":"Delhi","data":{"key":"X","value":"XYZ"}}
{"name":"Riya","age":12,"city":"U.P","data":{"key":"M","value":"MNO"}}
I'm new to Elixir and I want to parse a JSON file. One part of it is a question/answer array of objects:
[
{
"questionId":1,
"question":"Information: Personal Information: First Name",
"answer":"Joe"
},
{
"questionId":3,
"question":"Information: Personal Information: Last Name",
"answer":"Smith"
},
...
]
I know which questionIds I want, and I'm going to make a mapping such as 1 = First Name, 3 = Last Name.
But currently I'm doing the following to put the data into the struct:
defmodule Student do
defstruct first_name: nil, last_name: nil, student_number: nil
defguard is_first_name(id) when id == 1
defguard is_last_name(id) when id == 3
defguard is_student_number(id) when id == 7
end
defmodule AFMC do
import Student
@moduledoc """
Documentation for AFMC.
"""

@doc """
Hello world.

## Examples

    iex> AFMC.hello
    :world
"""
def main do
get_json()
|> get_outgoing_applications
end
def get_json do
with {:ok, body} <- File.read("./lib/afmc_import.txt"),
{:ok,body} <- Poison.Parser.parse(body), do: {:ok,body}
end
def get_outgoing_applications(map) do
{:ok,body} = map
out_application = get_in(body,["outgoingApplications"])
Enum.at(out_application,0)
|> get_in(["answers"])
|> get_person
end
def get_person(answers) do
  student = Enum.reduce(answers, %Student{}, fn(answer, acc) ->
    acc =
      if Student.is_first_name(answer["questionId"]) do
        %{acc | first_name: answer["answer"]}
      else
        acc
      end
    acc =
      if Student.is_last_name(answer["questionId"]) do
        %{acc | last_name: answer["answer"]}
      else
        acc
      end
    if Student.is_student_number(answer["questionId"]) do
      %{acc | student_number: answer["answer"]}
    else
      acc
    end
  end)
  IO.inspect(student)
  student
end
end
I'm wondering what a better way to write get_person would be, without all the if statements, given that I know which questionId maps to which field in the array of objects.
The data will then be saved into a DB.
Thanks
I'd store a mapping of id to field name. With that you don't need any if inside the reduce. Some pattern matching will also make it unnecessary to do answer["questionId"] etc.
defmodule Student do
defstruct first_name: nil, last_name: nil, student_number: nil
@fields %{
1 => :first_name,
3 => :last_name,
7 => :student_number
}
def parse(answers) do
Enum.reduce(answers, %Student{}, fn %{"questionId" => id, "answer" => answer}, acc ->
%{acc | @fields[id] => answer}
end)
end
end
IO.inspect(
Student.parse([
%{"questionId" => 1, "question" => "", "answer" => "Joe"},
%{"questionId" => 3, "question" => "", "answer" => "Smith"},
%{"questionId" => 7, "question" => "", "answer" => "123"}
])
)
Output:
%Student{first_name: "Joe", last_name: "Smith", student_number: "123"}
Edit: to skip ids not present in the map, change:
%{acc | @fields[id] => answer}
to:
if field = @fields[id], do: %{acc | field => answer}, else: acc
I have an RDD[Row]:
+--------+---------+-------+
| itemId | Country | Type  |
+--------+---------+-------+
| 11     | US      | Movie |
| 11     | US      | TV    |
| 101    | France  | Movie |
+--------+---------+-------+
How do I group by itemId so that I can save the result as a list of JSON, where each grouped row becomes a separate JSON object:
{"itemId" : 11,
"Country": {"US" :2 },"Type": {"Movie" :1 , "TV" : 1} },
{"itemId" : 101,
"Country": {"France" :1 },"Type": {"Movie" :1} }
Here is what I tried to build the RDD:
import com.mapping.data.model.MappingUtils
import com.mapping.data.model.CountryInfo
val mappingPath = "s3://.../"
val input = sc.textFile(mappingPath)
The input is a list of JSON strings, one per line, which I map to the POJO class CountryInfo using MappingUtils, which takes care of JSON parsing and conversion:
val MappingsList = input.map(x=> {
val countryInfo = MappingUtils.getCountryInfoString(x);
(countryInfo.getItemId(), countryInfo)
}).collectAsMap
MappingsList: scala.collection.Map[String,com.mapping.data.model.CountryInfo]
def showCountryInfo(x: Option[CountryInfo]) = x match {
case Some(s) => s
}
val events = sqlContext.sql("select itemId from EventList")

val itemList = events.map(row => {
  val itemId = row.getAs[String](0)
  val countryInfo = showCountryInfo(MappingsList.get(itemId))
  val country = if (countryInfo.getCountry() == "unknown") "US" else countryInfo.getCountry()
  val contentType = countryInfo.getType()
  Row(itemId, country, contentType)
})
Can someone let me know how I can achieve this?
Thank You!
I can't afford the extra time to complete this, but can give you a start.
The idea is that you aggregate the RDD[Row] down into a single Map that represents your JSON structure. Aggregation is a fold that requires two function parameters:
seqOp How to fold a collection of elements into the target type
combOp How to merge two of the target types.
The tricky part comes in combOp while merging, as you need to accumulate the counts of values seen in the seqOp. I have left this as an exercise, as I have a plane to catch! Hopefully someone else can fill in the gaps if you have trouble.
case class Row(id: Int, country: String, tpe: String)
def foo: Unit = {
val rows: RDD[Row] = ???
def seqOp(acc: Map[Int, (Map[String, Int], Map[String, Int])], r: Row) = {
acc.get(r.id) match {
case None => acc.updated(r.id, (Map(r.country -> 1), Map(r.tpe -> 1)))
case Some((countries, types)) =>
val countries_ = countries.updated(r.country, countries.getOrElse(r.country, 0) + 1)
val types_ = types.updated(r.tpe, types.getOrElse(r.tpe, 0) + 1)
acc.updated(r.id, (countries_, types_))
}
}
val z = Map.empty[Int, (Map[String, Int], Map[String, Int])]
def combOp(l: Map[Int, (Map[String, Int], Map[String, Int])], r: Map[Int, (Map[String, Int], Map[String, Int])]) = {
l.foldLeft(z) { case (acc, (id, (countries, types))) =>
r.get(id) match {
case None => acc.updated(id, (countries, types))
case Some((otherCountries, otherTypes)) =>
// todo - continue by merging countries with otherCountries
// and types with otherTypes, then update acc
}
}
}
val summaryMap = rows.aggregate(z)(seqOp, combOp)
}
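To sketch the missing merge step (an assumption about what was intended, not the author's code): merging two count maps just sums the counts per key, and the combOp also needs to keep ids that appear only in r, for example by folding over l starting from r instead of from z.
def mergeCounts(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
  b.foldLeft(a) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0) + n) }

// the Some branch in combOp could then become:
// case Some((otherCountries, otherTypes)) =>
//   acc.updated(id, (mergeCounts(countries, otherCountries), mergeCounts(types, otherTypes)))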
I'm a total newbie in Spark & Scala; it would be great if someone could explain this to me.
Let's take the following JSON:
{
"id": 1,
"persons": [{
"name": "n1",
"lastname": "l1",
"hobbies": [{
"name": "h1",
"activity": "a1"
},
{
"name": "h2",
"activity": "a2"
}]
},
{
"name": "n2",
"lastname": "l2",
"hobbies": [{
"name": "h3",
"activity": "a3"
},
{
"name": "h4",
"activity": "a4"
}]
}]
}
I'm loading this JSON into an RDD via sc.parallelize(file.json) and into a DF via sqlContext.read.json(file.json). So far so good: this gives me an RDD and a DF (with schema) for the JSON above, but I want to create another RDD/DF from the existing one that contains all distinct "hobbies" records. How can I achieve something like that?
The only thing I get from my operations is multiple WrappedArrays for hobbies, but I cannot go deeper nor assign them to a DF/RDD.
Code for SqlContext I have so far
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") //I receive schema for whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") //subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+----+--------+
|name|activity|
+----+--------+
|  h1|      a1|
|  h2|      a2|
|  h3|      a3|
|  h4|      a4|
+----+--------+
I loaded your example into the hobbies dataframe exactly as you did and worked with it. You could run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
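For reference, a hedged sketch of that DataFrame-only alternative (an assumption, not part of the original answer), assuming the same jsonData DataFrame and exploding twice, once for persons and once for their hobbies:
import org.apache.spark.sql.functions.{col, explode}

val distinctHobbiesDF = jsonData
  .select(explode(col("persons")).as("person"))        // one row per person struct
  .select(explode(col("person.hobbies")).as("hobby"))  // one row per hobby struct
  .select(col("hobby.name"), col("hobby.activity"))
  .distinct()

distinctHobbiesDF.show()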
Regardless, when I run on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+