Read JSON as a dataframe using PySpark

I am trying to read a JSON document which looks like this:
{"id":100, "name":"anna", "hometown":"chicago"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":700, "name":"anna", "hometown":"dudley"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1300, "name":"sarah", "hometown":"hoboken"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1100, "name":"don", "hometown":"santa monica"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1200, "name":"jane", "hometown":"freemont"},{"id":1600, "name":"john", "hometown":"downtown"},{"id":1500, "name":"glenn", "hometown":"uptown"}]
{"id":1400, "name":"steve", "hometown":"newtown"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":600, "name":"john", "hometown":"san jose"},{"id":900, "name":"james", "hometown":"aurora"},{"id":1000, "name":"peter", "hometown":"elgin"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1500, "name":"glenn", "hometown":"uptown"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1500, "name":"glenn", "hometown":"uptown"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":300, "name":"frank", "hometown":"new york"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1100, "name":"don", "hometown":"santa monica"}]
There is a space between the leading JSON object and the value (the value is a list of JSON objects).
Code which I tried:
data = spark \
    .read \
    .format("json") \
    .load("/Users/sahilnagpal/Desktop/dataworld.json")

data.show()
Result I get
+------------+----+-----+
|    hometown|  id| name|
+------------+----+-----+
|     chicago| 100| anna|
|santa monica|1100|  don|
|     newtown|1400|steve|
|      uptown|1500|glenn|
+------------+----+-----+
Result I want
+------------+----+-----+
|    hometown|  id| name|
+------------+----+-----+
|     chicago| 100| anna| -- all the other id, name, hometown corresponding to this id and name
|santa monica|1100|  don| -- all the other id, name, hometown corresponding to this id and name
|     newtown|1400|steve| -- all the other id, name, hometown corresponding to this id and name
|      uptown|1500|glenn| -- all the other id, name, hometown corresponding to this id and name
+------------+----+-----+

I think instead of reading it as a JSON file you should try to read it as a text file, because each line is not a single valid JSON document.
Below is the code that you should try to get the output that you expect:
from pyspark.sql.functions import *
from pyspark.sql.types import *

data1 = spark.read.text("/Users/sahilnagpal/Desktop/dataworld.json")

schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("hometown", StringType(), True),
    ]
)

data2 = (
    data1
    .withColumn("JsonKey", split(col("value"), "\\[")[0])
    .withColumn("JsonValue", split(col("value"), "\\[")[1])
    .withColumn("data", from_json("JsonKey", schema))
    .select(col("data.*"), "JsonValue")
)
Based on the above code, you would get the first JSON object's fields (id, name, hometown) as top-level columns, with the rest of each line kept as a raw JsonValue string column.
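If you also want JsonValue parsed rather than left as a raw string, here is a possible follow-up sketch (not part of the original answer); note that the split above strips the leading "[", so it has to be re-added before parsing:
# Hypothetical follow-up step: rebuild the array text and parse it with an
# array-of-struct schema (names follow the code above)
friends_schema = ArrayType(
    StructType(
        [
            StructField("id", StringType(), True),
            StructField("name", StringType(), True),
            StructField("hometown", StringType(), True),
        ]
    )
)

data3 = data2.withColumn(
    "JsonValue",
    from_json(concat(lit("["), col("JsonValue")), friends_schema),
)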

You can read the input as a CSV file using two spaces as the separator/delimiter. Then parse each column separately using from_json with an appropriate schema.
import pyspark.sql.functions as F

df = spark.read.csv('/Users/sahilnagpal/Desktop/dataworld.json', sep='  ').toDF('json1', 'json2')

df2 = df.withColumn(
    'json1',
    F.from_json('json1', 'struct<id:int, name:string, hometown:string>')
).withColumn(
    'json2',
    F.from_json('json2', 'array<struct<id:int, name:string, hometown:string>>')
).select('json1.*', 'json2')

df2.show(truncate=False)
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |hometown |json2 |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |anna |chicago |[[200, beth, indiana], [400, pete, new jersey], [500, emily, san fransisco], [700, anna, dudley], [1100, don, santa monica], [1300, sarah, hoboken], [1600, john, downtown]]|
|1100|don |santa monica|[[100, anna, chicago], [400, pete, new jersey], [500, emily, san fransisco], [1200, jane, freemont], [1600, john, downtown], [1500, glenn, uptown]] |
|1400|steve|newtown |[[100, anna, chicago], [600, john, san jose], [900, james, aurora], [1000, peter, elgin], [1100, don, santa monica], [1500, glenn, uptown], [1600, john, downtown]] |
|1500|glenn|uptown |[[200, beth, indiana], [300, frank, new york], [400, pete, new jersey], [500, emily, san fransisco], [1100, don, santa monica]] |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
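If you would rather have one row per related person than a single array column, one possible extra step (a sketch building on df2 above) is to explode json2:
# One row per (person, related person); aliases avoid duplicate column names
exploded = (
    df2.withColumn("related", F.explode("json2"))
    .select(
        "id",
        "name",
        "hometown",
        F.col("related.id").alias("related_id"),
        F.col("related.name").alias("related_name"),
        F.col("related.hometown").alias("related_hometown"),
    )
)
exploded.show()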

Related

Pyspark: How to create a nested Json by adding dynamic prefix to each column based on a row value

I have a dataframe in below format.
Input:
|id |Name_type|Name|Car     |
|1  |First    |rob |Nissan  |
|2  |First    |joe |Hyundai |
|1  |Last     |dent|Infiniti|
|2  |Last     |Kent|Genesis |
I need to transform this into a JSON column, prefixing each column name with the row's Name_type value, for a given key column, as shown below.
Result expected:
|id |json_column|
|1  |{"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}|
|2  |{"First_Name":"joe","First_Car":"Hyundai","Last_Name":"kent","Last_Car":"Genesis"}|
With the below piece of code
column_set = ['Name','Car']
df = df.withColumn("json_data", to_json(struct([df[x] for x in column_set])))
I was able to generate data as:
|id |Name_type|Json_data                         |
|1  |First    |{"Name":"rob", "Car": "Nissan"}   |
|2  |First    |{"Name":"joe", "Car": "Hyundai"}  |
|1  |Last     |{"Name":"dent", "Car": "infiniti"}|
|2  |Last     |{"Name":"kent", "Car": "Genesis"} |
I was able to create a JSON column using to_json for a given row, but I am not able to figure out how to prepend the Name_type value to each column name and convert the rows into a nested JSON per key column.
To do what you want, you first need to manipulate your input dataframe a little bit. You can do this by grouping by the id column, and pivoting around the Name_type column like so:
from pyspark.sql.functions import first

df = spark.createDataFrame(
    [
        ("1", "First", "rob", "Nissan"),
        ("2", "First", "joe", "Hyundai"),
        ("1", "Last", "dent", "Infiniti"),
        ("2", "Last", "Kent", "Genesis"),
    ],
    ["id", "Name_type", "Name", "Car"],
)

output = df.groupBy("id").pivot("Name_type").agg(
    first("Name").alias("Name"),
    first("Car").alias("Car"),
)
output.show()
+---+----------+---------+---------+--------+
| id|First_Name|First_Car|Last_Name|Last_Car|
+---+----------+---------+---------+--------+
|  1|       rob|   Nissan|     dent|Infiniti|
|  2|       joe|  Hyundai|     Kent| Genesis|
+---+----------+---------+---------+--------+
Then you can use the exact same code as what you used to get your wanted result, but using 4 columns instead of 2:
from pyspark.sql.functions import to_json, struct
column_set = ['First_Name','First_Car', 'Last_Name', 'Last_Car']
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))
output.show(truncate=False)
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|id |First_Name|First_Car|Last_Name|Last_Car|json_data |
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|1 |rob |Nissan |dent |Infiniti|{"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}|
|2 |joe |Hyundai |Kent |Genesis |{"First_Name":"joe","First_Car":"Hyundai","Last_Name":"Kent","Last_Car":"Genesis"}|
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
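If the Name_type values are not known in advance, one possible tweak (a sketch, not part of the original answer) is to derive column_set from the pivoted columns instead of hard-coding it:
# Every pivoted column except the grouping key "id"
column_set = [c for c in output.columns if c != "id"]
output = output.withColumn("json_data", to_json(struct([output[c] for c in column_set])))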

Spark dataframe from Json string with nested key

I have several columns to be extracted from a JSON string. However, one field has nested values. Not sure how to deal with that?
I need to explode it into multiple rows to get the values of field name, Value1 and Value2.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
  ("3",
    """{"nestedKey":[ {"field name":"name1","Value1":false,"Value2":true},
      | {"field name":"name2","Value1":"100","Value2":"200"}
      |]}""".stripMargin, "some_other_field_3")
).toDF("id", "json", "other")

df.show(truncate = false)

val df1 = df.withColumn("id1", col("id"))
  .withColumn("other1", col("other"))
  .withColumn("k", get_json_object(col("json"), "$.k"))
  .withColumn("v", get_json_object(col("json"), "$.v"))
  .withColumn("p", get_json_object(col("json"), "$.p"))
  .withColumn("q", get_json_object(col("json"), "$.q"))
  .withColumn("nestedKey", get_json_object(col("json"), "$.nestedKey"))
  .select("id1", "other1", "k", "v", "p", "q", "nestedKey")

df1.show(truncate = false)
You can parse the nestedKey using from_json and explode it:
val df2 = df1.withColumn(
"nestedKey",
expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
).select("*", "nestedKey.*").drop("nestedKey")
df2.show
+---+------------------+----+----+----+----+----------+------+------+
|id1|            other1|   k|   v|   p|   q|field name|Value1|Value2|
+---+------------------+----+----+----+----+----------+------+------+
|  1|some_other_field_1| foo| 1.0|null|null|      null|  null|  null|
|  2|some_other_field_2|null|null| bar| 3.0|      null|  null|  null|
|  3|some_other_field_3|null|null|null|null|     name1| false|  true|
|  3|some_other_field_3|null|null|null|null|     name2|   100|   200|
+---+------------------+----+----+----+----+----------+------+------+
I did it in one dataframe:
val df1 = df.withColumn("id1", col("id"))
  .withColumn("other1", col("other"))
  .withColumn("k", get_json_object(col("json"), "$.k"))
  .withColumn("v", get_json_object(col("json"), "$.v"))
  .withColumn("p", get_json_object(col("json"), "$.p"))
  .withColumn("q", get_json_object(col("json"), "$.q"))
  .withColumn("nestedKey", get_json_object(col("json"), "$.nestedKey"))
  .withColumn(
    "nestedKey",
    expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
  )
  .withColumn("fieldname", col("nestedKey.field name"))
  .withColumn("valueone", col("nestedKey.Value1"))
  .withColumn("valuetwo", col("nestedKey.Value2"))
  .select("id1", "other1", "k", "v", "p", "q", "fieldname", "valueone", "valuetwo")
Still working to make it more elegant.

How to parse json string to different columns in spark scala?

While reading a parquet file, this is the file data:
|id |name |activegroup|
|1  |abc  |[{"groupID":"5d","role":"admin","status":"A"},{"groupID":"58","role":"admin","status":"A"}]|
Data types of each field:
root
 |-- id: int
 |-- name: String
 |-- activegroup: String
The activegroup column is a string, so the explode function does not work on it directly. The following is the required output:
|id |name|groupID|role |status|
|1  |abc |5d     |admin|A     |
|1  |abc |58     |admin|A     |
Please help me with parsing the above in the latest version of Spark with Scala.
First you need to extract the json schema:
val schema = schema_of_json(lit(df.select($"activeGroup").as[String].first))
Once you have it, you can convert your activegroup column, which is a String, to JSON (from_json), and then explode it.
Once the column is JSON, you can extract its values with $"columnName.field".
val dfresult = df.withColumn("jsonColumn", explode(from_json($"activegroup", schema)))
  .select($"id", $"name",
    $"jsonColumn.groupId" as "groupId",
    $"jsonColumn.role" as "role",
    $"jsonColumn.status" as "status")
If you want to extract the whole JSON and the element names are fine as they are, you can use * to do it:
val dfresult = df.withColumn("jsonColumn", explode(from_json($"activegroup", schema)))
  .select($"id", $"name", $"jsonColumn.*")
RESULT
+---+----+-------+-----+------+
| id|name|groupId| role|status|
+---+----+-------+-----+------+
|  1| abc|     5d|admin|     A|
|  1| abc|     58|admin|     A|
+---+----+-------+-----+------+

Parsing JSON within a Spark DataFrame into new columns

Background
I have a dataframe that looks like this:
------------------------------------------------------------------------
|name |meals |
------------------------------------------------------------------------
|Tom |{"breakfast": "banana", "lunch": "sandwich"} |
|Alex |{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"} |
|Lisa |{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"} |
------------------------------------------------------------------------
Obtained from the following:
var rawDf = Seq(
  ("Tom", s"""{"breakfast": "banana", "lunch": "sandwich"}"""),
  ("Alex", s"""{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}"""),
  ("Lisa", s"""{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}""")
).toDF("name", "meals")
I want to transform it into a dataframe that looks like this:
------------------------------------------------------------------------
|name |meal |food |
------------------------------------------------------------------------
|Tom |breakfast | banana |
|Tom |lunch | sandwich |
|Alex |breakfast | yogurt |
|Alex |lunch | pizza |
|Alex |dinner | pasta |
|Lisa |lunch | sushi |
|Lisa |dinner | lasagna |
|Lisa |snack | apple |
------------------------------------------------------------------------
I'm using Spark 2.1, so I'm parsing the json using get_json_object. Currently, I'm trying to get the final dataframe using an intermediary dataframe that looks like this:
------------------------------------------------------------------------
|name |breakfast |lunch |dinner |snack |
------------------------------------------------------------------------
|Tom |banana |sandwich |null |null |
|Alex |yogurt |pizza |pasta |null |
|Lisa |null |sushi |lasagna |apple |
------------------------------------------------------------------------
Obtained from the following:
val intermediaryDF = rawDf.select(col("name"),
get_json_object(col("meals"), "$." + Meals.breakfast).alias(Meals.breakfast),
get_json_object(col("meals"), "$." + Meals.lunch).alias(Meals.lunch),
get_json_object(col("meals"), "$." + Meals.dinner).alias(Meals.dinner),
get_json_object(col("meals"), "$." + Meals.snack).alias(Meals.snack))
Meals is defined in another file that has a lot more entries than breakfast, lunch, dinner, and snack, but it looks something like this:
object Meals {
val breakfast = "breakfast"
val lunch = "lunch"
val dinner = "dinner"
val snack = "snack"
}
I then use intermediaryDF to compute the final DataFrame, like so:
val finalDF = parsedDF.where(col("breakfast").isNotNull).select(col("name"), col("breakfast")).union(
parsedDF.where(col("lunch").isNotNull).select(col("name"), col("lunch"))).union(
parsedDF.where(col("dinner").isNotNull).select(col("name"), col("dinner"))).union(
parsedDF.where(col("snack").isNotNull).select(col("name"), col("snack")))
My problem
Using the intermediary DataFrame works if I only have a few types of Meals, but I actually have 40, and enumerating every one of them to compute intermediaryDF is impractical. I also don't like the idea of having to compute this DF in the first place. Is there a way to get directly from my raw dataframe to the final dataframe without the intermediary step, and also without explicitly having a case for every value in Meals?
Apache Spark provides support for parsing JSON data, but it needs a predefined schema in order to parse it correctly. Your JSON data is dynamic, so you cannot rely on a schema.
One way is to not let Apache Spark parse the data, but instead parse it in a key-value way (e.g. into something like Map[String, String], which is pretty generic).
Here is what you can do instead:
Use the Jackson json mapper for scala
// These imports assume the jackson-module-scala library is on the classpath;
// in older versions ScalaObjectMapper lives under com.fasterxml.jackson.module.scala.experimental
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper}

// mapper object created on each executor node
val mapper = new ObjectMapper with ScalaObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.registerModule(DefaultScalaModule)

val valueAsMap = mapper.readValue[Map[String, String]](s"""{"breakfast": "banana", "lunch": "sandwich"}""")
This transforms the JSON string into a Map[String, String], which can also be viewed as a list of (key, value) pairs:
List((breakfast,banana), (lunch,sandwich))
Now comes the Apache Spark part. Define a user-defined function that parses the string and outputs the list of (key, value) pairs:
val jsonToArray = udf((json:String) => {
mapper.readValue[Map[String, String]](json).toList
})
Apply that transformation on the meals column to turn it into a column of array type. After that, explode that column and select the key entry as column meal and the value entry as column food:
val df1 = rawDf.select(col("name"), explode(jsonToArray(col("meals"))).as("meals"))
df1.select(col("name"), col("meals._1").as("meal"), col("meals._2").as("food"))
Showing the last dataframe, it outputs:
+----+---------+--------+
|name|     meal|    food|
+----+---------+--------+
| Tom|breakfast|  banana|
| Tom|    lunch|sandwich|
|Alex|breakfast|  yogurt|
|Alex|    lunch|   pizza|
|Alex|   dinner|   pasta|
|Lisa|    lunch|   sushi|
|Lisa|   dinner| lasagna|
|Lisa|    snack|   apple|
+----+---------+--------+

How to read custom formatted dates as timestamp in pyspark

I want to use spark.read() to pull data from a .csv file, while enforcing a schema. However, I can't get spark to recognize my dates as timestamps.
First I create a dummy file to test with
%scala
Seq("1|1/15/2019 2:24:00 AM","2|test","3|").toDF().write.text("/tmp/input/csvDateReadTest")
Then I try to read it, and provide a dateFormat string, but it doesn't recognize my dates, and sends the records to the badRecordsPath
df = spark.read.format('csv') \
    .schema("id int, dt timestamp") \
    .option("delimiter", "|") \
    .option("badRecordsPath", "/tmp/badRecordsPath") \
    .option("dateFormat", "M/dd/yyyy hh:mm:ss aaa") \
    .load("/tmp/input/csvDateReadTest")
As a result, I get just 1 record in df (ID 3), when I'm expecting to see 2 (IDs 1 and 3).
df.show()
+---+----+
| id| dt|
+---+----+
| 3|null|
+---+----+
You must change dateFormat to timestampFormat, since in your case you need a timestamp type and not a date. Additionally, the timestamp format string must match the data, for example M/d/yyyy h:mm:ss a (note that lowercase mm means minutes, so it cannot be used for the month).
Sample data:
Seq(
"1|1/15/2019 2:24:00 AM",
"2|test",
"3|5/30/1981 3:11:00 PM"
).toDF().write.text("/tmp/input/csvDateReadTest")
With the changes for the timestamp:
val df = spark.read.format("csv")
  .schema("id int, dt timestamp")
  .option("delimiter", "|")
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .option("timestampFormat", "M/d/yyyy h:mm:ss a")
  .load("/tmp/input/csvDateReadTest")
And the output:
+----+-------------------+
|  id|                 dt|
+----+-------------------+
|   1|2019-01-15 02:24:00|
|   3|1981-05-30 15:11:00|
|null|               null|
+----+-------------------+
Note that the record with id 2 failed to comply with the schema definition and therefore it contains null. If you also want to keep the invalid records, you need to change the timestamp column to string; the output in this case will be:
+---+--------------------+
| id|                  dt|
+---+--------------------+
|  1|1/15/2019 2:24:00 AM|
|  3|5/30/1981 3:11:00 PM|
|  2|                test|
+---+--------------------+
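Since the question itself uses PySpark, the same timestampFormat fix might look roughly like this in Python (a sketch; the Databricks-specific badRecordsPath option is omitted):
df = (
    spark.read.format("csv")
    .schema("id int, dt timestamp")
    .option("delimiter", "|")
    .option("timestampFormat", "M/d/yyyy h:mm:ss a")
    .load("/tmp/input/csvDateReadTest")
)
df.show()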
UPDATE:
In order to change the string dt into a timestamp type you could try df.withColumn("dt", $"dt".cast("timestamp")), although this will fail and replace all the values with null.
You can instead achieve this with the following code:
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import java.sql.Timestamp
import scala.util.{Try, Success, Failure}
val formatter = new SimpleDateFormat("M/d/yyyy h:mm:ss a", Locale.US)

df.map { case Row(id: Int, dt: String) =>
  val tryParse = Try[Date](formatter.parse(dt))
  val p_timestamp = tryParse match {
    case Success(parsed) => new Timestamp(parsed.getTime())
    case Failure(_) => null
  }
  (id, p_timestamp)
}.toDF("id", "dt").show
Output:
+---+-------------------+
| id|                 dt|
+---+-------------------+
|  1|2019-01-15 02:24:00|
|  3|1981-05-30 15:11:00|
|  2|               null|
+---+-------------------+
Hi, here is sample code:
df.withColumn("times",
    from_unixtime(unix_timestamp(col("dt"), "M/dd/yyyy hh:mm:ss a"),
      "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .show(false)