Scala - How to convert JSON Keys and Values as columns

How can I parse the input JSON below into key and value columns? Any help is appreciated.
Input:
{
  "name": "srini",
  "value": {
    "1": "val1",
    "2": "val2",
    "3": "val3"
  }
}
Expected output DataFrame:
+-----+---+-----+
| name|key|value|
+-----+---+-----+
|srini|  1| val1|
|srini|  2| val2|
|srini|  3| val3|
+-----+---+-----+
Input DataFrame:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json_file |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"file_path":"AAA/BBB.CCC.zip","file_name":"AAA_20200202122754.json","received_time":"2020-03-31","obj_cls":"Monitor","obj_cls_inst":"Monitor","relation_tree":"Source~>HD_Info~>Monitor","s_tag":"ABC1234","Monitor":{"Index":"0","Vendor_Data":"58F5Y","Monitor_Type":"Lenovo Monitor","HnfoID":"650FEC74"}}|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
How can I convert the above JSON into a DataFrame like the one below?
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|file_path |file_name |received_time |obj_cls |obj_cls_inst |relation_tree |s_tag |attribute_name |attribute_value |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Index |0 |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Vendor_Data |58F5Y |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |Monitor_Type |Lenovo Monitor |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
|AAA/BBB.CCC.zip |AAA_20200202122754.json|2020-03-31 |Monitor |Monitor |Source~>HD_Info~>Monitor |ABC1234 |HnfoID |650FEC74 |
+----------------+-----------------------+--------------+--------+-------------+-------------------------+----------+----------------+----------------+
Sample DataFrame construction:
val rawData = sparkSession.sql("select 1")
  .withColumn("obj_cls", lit("First"))
  .withColumn("s_tag", lit("S_12345"))
  .withColumn("jsonString", lit("""{"id":"1","First":{"Info":"ABCD123","Res":"5.2"}}"""))

Once you have your json loaded into a DF as follows:
+-----+------------------+
| name|             value|
+-----+------------------+
|srini|[val1, val2, val3]|
+-----+------------------+
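As a side note, if the JSON is not in a DataFrame yet, one way to get there is sketched below; it assumes either a JSON file on disk or a hypothetical DataFrame jsonStrings with a single string column json_str, and the path is only a placeholder.
// Option 1: read the JSON file directly; Spark infers the nested "value" struct
val df = spark.read.option("multiLine", "true").json("/path/to/input.json")

// Option 2: parse a string column with an explicit schema
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}

val schema = new StructType()
  .add("name", StringType)
  .add("value", new StructType()
    .add("1", StringType)
    .add("2", StringType)
    .add("3", StringType))

// jsonStrings is a hypothetical one-column DataFrame holding the raw JSON text
val df2 = jsonStrings.select(from_json($"json_str", schema).as("parsed")).select("parsed.*")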
First you select all the items inside value:
df.select($"name", $"value.*")
This will give you:
+-----+----+----+----+
| name|   1|   2|   3|
+-----+----+----+----+
|srini|val1|val2|val3|
+-----+----+----+----+
Then you need to pivot the columns into rows. For this I usually define a helper function kv:
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
Then you create an array of the desired columns:
val pivotCols = Array("1", "2", "3")
And finally apply the function to the previous DF:
df.select($"name", $"value.*")
.withColumn("kv", kv(pivotCols))
.select($"name", $"kv.k" as "key", $"kv.v" as "value")
Result:
+-----+---+-----+
| name|key|value|
+-----+---+-----+
|srini| 1| val1|
|srini| 2| val2|
|srini| 3| val3|
+-----+---+-----+
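As a side note, the same unpivot can be done without the helper by using Spark SQL's built-in stack expression; a minimal sketch, listing the three key columns explicitly:
// stack(n, label1, col1, label2, col2, ...) turns label/column pairs into rows
df.select($"name", $"value.*")
  .selectExpr("name", "stack(3, '1', `1`, '2', `2`, '3', `3`) as (key, value)")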
EDIT
If you don't want to manually specify the columns to pivot, you can use an intermediate df as follows:
val dfIntermediate = df.select($"name", $"value.*")
dfIntermediate.withColumn("kv", kv(dfIntermediate.columns.tail))
  .select($"name", $"kv.k" as "key", $"kv.v" as "value")
And you will obtain the very same result:
+-----+---+-----+
| name|key|value|
+-----+---+-----+
|srini| 1| val1|
|srini| 2| val2|
|srini| 3| val3|
+-----+---+-----+
EDIT2
With the new example it is the same; you just need to change which columns you read and pivot:
val pivotColumns = Array("HnfoId", "Index", "Monitor_Type", "Vendor_Data")
df.select("file_path", "file_name", "received_time", "obj_cls", "obj_cls_inst", "relation_tree", "s_Tag", "Monitor.*").withColumn("kv", kv(pivotColumns)).select($"file_path", $"file_name", $"received_time", $"obj_cls", $"obj_cls_inst", $"relation_tree", $"s_Tag", $"kv.k" as "attribute_name", $"kv.v" as "attribute_value").show
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
| file_path| file_name|received_time|obj_cls|obj_cls_inst| relation_tree| s_Tag|attribute_name|attribute_value|
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| HnfoId| 650FEC74|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Index| 0|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Monitor_Type| Lenovo Monitor|
|AAA/BBB.CCC.zip|AAA_2020020212275...| 2020-03-31|Monitor| Monitor|Source~>HD_Info~>...|ABC1234| Vendor_Data| 58F5Y|
+---------------+--------------------+-------------+-------+------------+--------------------+-------+--------------+---------------+
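If the json_file column of the input DataFrame is still a plain JSON string, it has to be parsed first. A minimal sketch, assuming Spark 2.4+ (for schema_of_json) and a hypothetical name rawInput for the one-column DataFrame:
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

// Infer the schema from one sample row, then parse the whole column
val jsonSchema = schema_of_json(lit(rawInput.select($"json_file").as[String].first))

val df = rawInput
  .select(from_json($"json_file", jsonSchema).as("parsed"))
  .select("parsed.*")
After that, the select/kv pipeline above applies unchanged.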

Related

Pyspark map a function to two array type

I am quite new to Pyspark. Here is what I am trying to do; below is the table, where the column types are ArrayType(DoubleType) and ArrayType(DecimalType):
+---------+---------+
|A        |B        |
+---------+---------+
|[1, 2]   |[2, 4]   |
|[1, 2, 4]|[1, 3, 3]|
+---------+---------+
What I want to do is treat A and B as np.arrays and then pass a function to do the calculation.
def func(row):
    a = row.A
    b = row.B
    res = some_function(a, b)
    return res
What I am trying now is
res = a.rdd.map(func)
resDF = res.toDF(res)
resDF.show()
But I am receiving the following error; could someone guide me a bit here? Thank you.
TypeError: schema should be StructType or list or None, but got: PythonRDD[167] at RDD at PythonRDD.scala:53
You can use pandas_udf
Sample data:
df = spark.createDataFrame([
    ([1, 2], [2, 4]),
    ([1, 2, 4], [1, 3, 3]),
], 'a array<int>, b array<int>')
df.show()
+---------+---------+
|a |b |
+---------+---------+
|[1, 2] |[2, 4] |
|[1, 2, 4]|[1, 3, 3]|
+---------+---------+
Create a new column with pandas_udf:
import pyspark.sql.functions as F

@F.pandas_udf("array<int>")
def func(a, b):
    return a * b

df.withColumn('c', func('a', 'b')).show()
+---------+---------+----------+
| a| b| c|
+---------+---------+----------+
| [1, 2]| [2, 4]| [2, 8]|
|[1, 2, 4]|[1, 3, 3]|[1, 6, 12]|
+---------+---------+----------+

Spark dataframe from Json string with nested key

I have several columns to extract from a JSON string. However, one field has nested values and I am not sure how to deal with that.
I need to explode it into multiple rows to get the values of field name, Value1, and Value2.
import spark.implicits._

val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
  ("3",
    """{"nestedKey":[ {"field name":"name1","Value1":false,"Value2":true},
      | {"field name":"name2","Value1":"100","Value2":"200"}
      |]}""".stripMargin, "some_other_field_3")
).toDF("id", "json", "other")
df.show(truncate = false)
val df1 = df.withColumn("id1", col("id"))
  .withColumn("other1", col("other"))
  .withColumn("k", get_json_object(col("json"), "$.k"))
  .withColumn("v", get_json_object(col("json"), "$.v"))
  .withColumn("p", get_json_object(col("json"), "$.p"))
  .withColumn("q", get_json_object(col("json"), "$.q"))
  .withColumn("nestedKey", get_json_object(col("json"), "$.nestedKey"))
  .select("id1", "other1", "k", "v", "p", "q", "nestedKey")
df1.show(truncate = false)
You can parse the nestedKey using from_json and explode it:
val df2 = df1.withColumn(
  "nestedKey",
  expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
).select("*", "nestedKey.*").drop("nestedKey")
df2.show
+---+------------------+----+----+----+----+----------+------+------+
|id1| other1| k| v| p| q|field name|Value1|Value2|
+---+------------------+----+----+----+----+----------+------+------+
| 1|some_other_field_1| foo| 1.0|null|null| null| null| null|
| 2|some_other_field_2|null|null| bar| 3.0| null| null| null|
| 3|some_other_field_3|null|null|null|null| name1| false| true|
| 3|some_other_field_3|null|null|null|null| name2| 100| 200|
+---+------------------+----+----+----+----+----------+------+------+
I did it in one dataframe:
val df1 = df.withColumn("id1", col("id"))
  .withColumn("other1", col("other"))
  .withColumn("k", get_json_object(col("json"), "$.k"))
  .withColumn("v", get_json_object(col("json"), "$.v"))
  .withColumn("p", get_json_object(col("json"), "$.p"))
  .withColumn("q", get_json_object(col("json"), "$.q"))
  .withColumn("nestedKey", get_json_object(col("json"), "$.nestedKey"))
  .withColumn(
    "nestedKey",
    expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
  )
  .withColumn("fieldname", col("nestedKey.field name"))
  .withColumn("valueone", col("nestedKey.Value1"))
  .withColumn("valuetwo", col("nestedKey.Value2"))
  .select("id1", "other1", "k", "v", "p", "q", "fieldname", "valueone", "valuetwo")
Still working to make it more elegant.
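For reference, a slightly more compact variant of the same single-dataframe idea; this is only a sketch under the same assumptions about df and the nested schema:
import org.apache.spark.sql.functions.{col, expr, get_json_object}

val nestedSchema = "array<struct<`field name`:string, Value1:string, Value2:string>>"

val compact = df
  .withColumn("k", get_json_object(col("json"), "$.k"))
  .withColumn("v", get_json_object(col("json"), "$.v"))
  .withColumn("p", get_json_object(col("json"), "$.p"))
  .withColumn("q", get_json_object(col("json"), "$.q"))
  .withColumn("nested", expr(s"explode_outer(from_json(get_json_object(json, '$$.nestedKey'), '$nestedSchema'))"))
  .select(
    col("id") as "id1", col("other") as "other1",
    col("k"), col("v"), col("p"), col("q"),
    col("nested.`field name`") as "fieldname",
    col("nested.Value1") as "valueone",
    col("nested.Value2") as "valuetwo")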

How to parse json string to different columns in spark scala?

While reading a parquet file, this is the file data:
|id |name |activegroup|
|1 |abc |[{"groupID":"5d","role":"admin","status":"A"},{"groupID":"58","role":"admin","status":"A"}]|
Data types of each field:
root
|--id : int
|--name : String
|--activegroup : String
The activegroup column is a string, so the explode function is not working. Following is the required output:
|id |name |groupID|role|status|
|1 |abc |5d |admin|A |
|1 |def |58 |admin|A |
Please help me parse the above in Spark Scala (latest version).
First you need to extract the json schema:
val schema = schema_of_json(lit(df.select($"activeGroup").as[String].first))
Once you have it, you can convert your activegroup column, which is a String, to JSON (from_json) and then explode it.
Once the column is JSON, you can extract its values with $"columnName.field".
val dfresult = df.withColumn("jsonColumn", explode(from_json($"activegroup", schema)))
  .select($"id", $"name",
    $"jsonColumn.groupId" as "groupId",
    $"jsonColumn.role" as "role",
    $"jsonColumn.status" as "status")
If you want to extract the whole JSON and the element names are fine for you, you can use * to do it:
val dfresult = df.withColumn("jsonColumn", explode(from_json($"activegroup", schema)))
  .select($"id", $"name", $"jsonColumn.*")
RESULT
+---+----+-------+-----+------+
| id|name|groupId| role|status|
+---+----+-------+-----+------+
| 1| abc| 5d|admin| A|
| 1| abc| 58|admin| A|
+---+----+-------+-----+------+
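If schema_of_json is not available (it appeared around Spark 2.4), a sketch of the same approach with the schema declared by hand, using the field names from the sample data, could look like this:
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Schema of the activegroup JSON array, declared explicitly
val activeGroupSchema = ArrayType(
  new StructType()
    .add("groupID", StringType)
    .add("role", StringType)
    .add("status", StringType))

val dfresult = df
  .withColumn("jsonColumn", explode(from_json($"activegroup", activeGroupSchema)))
  .select($"id", $"name", $"jsonColumn.*")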

Parsing JSON within a Spark DataFrame into new columns

Background
I have a dataframe that looks like this:
------------------------------------------------------------------------
|name |meals |
------------------------------------------------------------------------
|Tom |{"breakfast": "banana", "lunch": "sandwich"} |
|Alex |{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"} |
|Lisa |{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"} |
------------------------------------------------------------------------
Obtained from the following:
var rawDf = Seq(("Tom",s"""{"breakfast": "banana", "lunch": "sandwich"}""" ),
("Alex", s"""{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}"""),
("Lisa", s"""{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}""")).toDF("name", "meals")
I want to transform it into a dataframe that looks like this:
------------------------------------------------------------------------
|name |meal |food |
------------------------------------------------------------------------
|Tom |breakfast | banana |
|Tom |lunch | sandwich |
|Alex |breakfast | yogurt |
|Alex |lunch | pizza |
|Alex |dinner | pasta |
|Lisa |lunch | sushi |
|Lisa |dinner | lasagna |
|Lisa |snack | apple |
------------------------------------------------------------------------
I'm using Spark 2.1, so I'm parsing the json using get_json_object. Currently, I'm trying to get the final dataframe using an intermediary dataframe that looks like this:
------------------------------------------------------------------------
|name |breakfast |lunch |dinner |snack |
------------------------------------------------------------------------
|Tom |banana |sandwich |null |null |
|Alex |yogurt |pizza |pasta |null |
|Lisa |null |sushi |lasagna |apple |
------------------------------------------------------------------------
Obtained from the following:
val intermediaryDF = rawDf.select(col("name"),
  get_json_object(col("meals"), "$." + Meals.breakfast).alias(Meals.breakfast),
  get_json_object(col("meals"), "$." + Meals.lunch).alias(Meals.lunch),
  get_json_object(col("meals"), "$." + Meals.dinner).alias(Meals.dinner),
  get_json_object(col("meals"), "$." + Meals.snack).alias(Meals.snack))
Meals is defined in another file that has a lot more entries than breakfast, lunch, dinner, and snack, but it looks something like this:
object Meals {
  val breakfast = "breakfast"
  val lunch = "lunch"
  val dinner = "dinner"
  val snack = "snack"
}
I then use intermediaryDF to compute the final DataFrame, like so:
val finalDF = parsedDF.where(col("breakfast").isNotNull).select(col("name"), col("breakfast")).union(
parsedDF.where(col("lunch").isNotNull).select(col("name"), col("lunch"))).union(
parsedDF.where(col("dinner").isNotNull).select(col("name"), col("dinner"))).union(
parsedDF.where(col("snack").isNotNull).select(col("name"), col("snack")))
My problem
Using the intermediary DataFrame works if I only have a few types of Meals, but I actually have 40, and enumerating every one of them to compute intermediaryDF is impractical. I also don't like the idea of having to compute this DF in the first place. Is there a way to get directly from my raw dataframe to the final dataframe without the intermediary step, and also without explicitly having a case for every value in Meals?
Apache Spark provides support for parsing JSON data, but it needs a predefined schema in order to parse it correctly. Your JSON data is dynamic, so you cannot rely on a fixed schema.
One way is to not let Apache Spark parse the data, but to parse it yourself in a key-value way (e.g. by using something like Map[String, String], which is pretty generic).
Here is what you can do instead:
Use the Jackson JSON mapper for Scala:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper // package may vary by jackson-module-scala version

// mapper object created on each executor node
val mapper = new ObjectMapper with ScalaObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.registerModule(DefaultScalaModule)

val valueAsMap = mapper.readValue[Map[String, String]](s"""{"breakfast": "banana", "lunch": "sandwich"}""")
This transforms the JSON string into a Map[String, String], which can also be viewed as a list of (key, value) pairs:
List((breakfast,banana), (lunch,sandwich))
Now the Apache Spark part comes into play. Define a user-defined function that parses the string and outputs the list of (key, value) pairs:
val jsonToArray = udf((json: String) => {
  mapper.readValue[Map[String, String]](json).toList
})
Apply that transformation to the "meals" column to turn it into a column of type Array. After that, explode that column and select the key entry as column meal and the value entry as column food:
val df1 = rawDf.select(col("name"), explode(jsonToArray(col("meals"))).as("meals"))
df1.select(col("name"), col("meals._1").as("meal"), col("meals._2").as("food"))
Showing the last dataframe outputs:
+----+---------+--------+
|name|     meal|    food|
+----+---------+--------+
| Tom|breakfast| banana|
| Tom| lunch|sandwich|
|Alex|breakfast| yogurt|
|Alex| lunch| pizza|
|Alex| dinner| pasta|
|Lisa| lunch| sushi|
|Lisa| dinner| lasagna|
|Lisa| snack| apple|
+----+---------+--------+
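As a side note, on newer Spark versions (2.4+), where from_json accepts a MapType, the Jackson step can be skipped entirely; a minimal sketch over the same rawDf:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{MapType, StringType}

// explode on a map column yields two columns named "key" and "value"
val finalDF = rawDf
  .select(col("name"), explode(from_json(col("meals"), MapType(StringType, StringType))))
  .withColumnRenamed("key", "meal")
  .withColumnRenamed("value", "food")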

Spark CSV read/write for empty field

I want to write my DataFrame's empty fields as empty, but they are always written as NULL. I want to write NULLs as ? and empty strings as empty/blank. The same applies while reading from a CSV.
val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),
  (4, null)
))

scala> df.show
+---+----+
| _1|  _2|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
|  3|    |
|  4|null|
+---+----+
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.csv").option("nullValue","?").save("/xxxxx/test_out")
Written output:
0,a
1,b
2,c
3,?
4,?
.option("treatEmptyValuesAsNulls" , "false")
This option does not work.
I need the empty string to be written as empty:
0,a
1,b
2,c
3,
4,?
Try using SQL (I am using Spark 2.2):
val ds = sqlContext.sql(
  "select `_1`, case when `_2` is not null then `_2` else case when `_2` is null then '?' else case when `_2` = '' then '' end end end as val " +
  "from global_temp.test")
ds.write.csv("<output path>")
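Note that this assumes the DataFrame was registered as a view first, e.g. with df.createGlobalTempView("test"). An alternative sketch that avoids SQL is to replace nulls with ? before writing, so empty strings are left alone; how the empty string is ultimately quoted still depends on the CSV options and Spark version:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, when}

// Nulls become "?", empty strings stay as they are
val prepared = df.withColumn("_2", when(col("_2").isNull, "?").otherwise(col("_2")))

prepared.write
  .mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .save("/xxxxx/test_out")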