In my Spark dataframe I have a column which contains a single JSON value made up of multiple comma-separated JSON objects with key-value pairs. I need to flatten the JSON data into separate columns.
The record of json column student_data looks like below
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|id|name |student_data |
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|11|stephy|{{"key":"hindi","value":{"hindi_mythology":80}},{"key":"social_science","value":{"civics":65}},{"key":"maths","value":{"geometry":70}}}|
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
Schema of record is as below.
root
|-- id : int
|-- name : string
|-- student_data : string
The requirement is to flatten the JSON; the expected output is as below.
+---+------+-----+--------------+-----+
|id |name  |hindi|social_science|maths|
+---+------+-----+--------------+-----+
|11 |stephy|80   |65            |70   |
+---+------+-----+--------------+-----+
You can transform your JSON into a struct type using the Spark function from_json() with a schema that represents the schema of the JSON string. After that, to get the expected result, you can pivot the column to go from row to column format:
The input JSON file:
{
"id": 11,
"name": "stephy",
"student_data": "[{\"key\":\"hindi\",\"value\":{\"hindi_mythology\":80}},{\"key\":\"social_science\",\"value\":{\"civics\":65}},{\"key\":\"maths\",\"value\":{\"geometry\":70}}]"
}
Code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = spark.read.json("file.json")

val schema = new StructType()
  .add("key", StringType, true)
  .add("value", MapType(StringType, IntegerType), true)

val res = df.withColumn("student_data", from_json(col("student_data"), ArrayType(schema)))
  .select(col("id"), col("name"), explode(col("student_data")).as("student_data"))
  .select("id", "name", "student_data.*")
  .select(col("id"), col("name"), col("key"), map_values(col("value")).getItem(0).as("value"))

res.groupBy("id", "name").pivot("key").agg(first(col("value"))).show(false)
+---+------+-----+-----+--------------+
|id |name  |hindi|maths|social_science|
+---+------+-----+-----+--------------+
|11 |stephy|80   |70   |65            |
+---+------+-----+-----+--------------+
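As a side note, if the subject keys are known up front you can pass them to pivot explicitly, which spares Spark the extra job it otherwise runs to collect the distinct values of key. A minimal variant of the last line above (the value list is just the three subjects from this example):

// Passing the pivot values explicitly avoids a separate scan to find the distinct keys.
res.groupBy("id", "name")
  .pivot("key", Seq("hindi", "social_science", "maths"))
  .agg(first(col("value")))
  .show(false)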
I am trying to read a JSON document which looks like this
{"id":100, "name":"anna", "hometown":"chicago"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":700, "name":"anna", "hometown":"dudley"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1300, "name":"sarah", "hometown":"hoboken"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1100, "name":"don", "hometown":"santa monica"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1200, "name":"jane", "hometown":"freemont"},{"id":1600, "name":"john", "hometown":"downtown"},{"id":1500, "name":"glenn", "hometown":"uptown"}]
{"id":1400, "name":"steve", "hometown":"newtown"} [{"id":100, "name":"anna", "hometown":"chicago"},{"id":600, "name":"john", "hometown":"san jose"},{"id":900, "name":"james", "hometown":"aurora"},{"id":1000, "name":"peter", "hometown":"elgin"},{"id":1100, "name":"don", "hometown":"santa monica"},{"id":1500, "name":"glenn", "hometown":"uptown"},{"id":1600, "name":"john", "hometown":"downtown"}]
{"id":1500, "name":"glenn", "hometown":"uptown"} [{"id":200, "name":"beth", "hometown":"indiana"},{"id":300, "name":"frank", "hometown":"new york"},{"id":400, "name":"pete", "hometown":"new jersey"},{"id":500, "name":"emily", "hometown":"san fransisco"},{"id":1100, "name":"don", "hometown":"santa monica"}]
There is a space between the key and the value (the value is a list containing JSON text).
Code which I tried
data = spark\
.read\
.format("json")\
.load("/Users/sahilnagpal/Desktop/dataworld.json")
data.show()
Result I get
+------------+----+-----+
|    hometown|  id| name|
+------------+----+-----+
|     chicago| 100| anna|
|santa monica|1100|  don|
|     newtown|1400|steve|
|      uptown|1500|glenn|
+------------+----+-----+
Result I want
+------------+----+-----+
|    hometown|  id| name|
+------------+----+-----+
|     chicago| 100| anna| -- all the other ID,name,hometown corresponding to this ID and Name
|santa monica|1100|  don| -- all the other ID,name,hometown corresponding to this ID and Name
|     newtown|1400|steve| -- all the other ID,name,hometown corresponding to this ID and Name
|      uptown|1500|glenn| -- all the other ID,name,hometown corresponding to this ID and Name
+------------+----+-----+
I think that instead of reading it as a JSON file you should try to read it as a text file, because the JSON string does not look like valid JSON.
Below is the code that you should try to get the output that you expect:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data1 = spark.read.text("/Users/sahilnagpal/Desktop/dataworld.json")
schema = StructType(
[
StructField('id', StringType(), True),
StructField('name', StringType(), True),
StructField('hometown',StringType(),True)
]
)
data2 = data1 \
    .withColumn("JsonKey", split(col("value"), "\\[")[0]) \
    .withColumn("JsonValue", split(col("value"), "\\[")[1]) \
    .withColumn("data", from_json("JsonKey", schema)) \
    .select(col('data.*'), 'JsonValue')
Based on the above code, you would get the id, name and hometown columns parsed from the first JSON object, plus a JsonValue column holding the text of the array that follows it.
You can read the input as a CSV file using two spaces as the separator/delimiter. Then parse each column separately using from_json with an appropriate schema.
import pyspark.sql.functions as F

df = spark.read.csv('/Users/sahilnagpal/Desktop/dataworld.json', sep='  ').toDF('json1', 'json2')
df2 = df.withColumn(
'json1',
F.from_json('json1', 'struct<id:int, name:string, hometown:string>')
).withColumn(
'json2',
F.from_json('json2', 'array<struct<id:int, name:string, hometown:string>>')
).select('json1.*', 'json2')
df2.show(truncate=False)
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |hometown |json2 |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |anna |chicago |[[200, beth, indiana], [400, pete, new jersey], [500, emily, san fransisco], [700, anna, dudley], [1100, don, santa monica], [1300, sarah, hoboken], [1600, john, downtown]]|
|1100|don |santa monica|[[100, anna, chicago], [400, pete, new jersey], [500, emily, san fransisco], [1200, jane, freemont], [1600, john, downtown], [1500, glenn, uptown]] |
|1400|steve|newtown |[[100, anna, chicago], [600, john, san jose], [900, james, aurora], [1000, peter, elgin], [1100, don, santa monica], [1500, glenn, uptown], [1600, john, downtown]] |
|1500|glenn|uptown |[[200, beth, indiana], [300, frank, new york], [400, pete, new jersey], [500, emily, san fransisco], [1100, don, santa monica]] |
+----+-----+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have json data like below where version field is the differentiator -
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is now ArrayType(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both these types of multiple JSON files which come together. I cannot define stats as string in schema_def, as that would affect the corrupt-record feature (for the stats column) which checks malformed JSON and schema compliance of all the fields.
Example df output required in a single read:
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried to read with mergeSchema option but that makes stats field String type.
Also, I have tried making two dataframes by filtering on the version field, and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here also stats column becomes String type.
I was thinking that if, while reading, I could parse the df column headers as stats_v1 and stats_v2 based on their data types, that could solve the problem. Please help with any possible solutions.
A UDF to check whether stats is a single JSON object or an array; if it is an object, it wraps it in an array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats
  // If the parsed value is a single JSON object, wrap it in a one-element array;
  // otherwise return the string unchanged (it is already an array).
  def toArray(data: String) = {
    val json_data = parse(data)
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats                                      |
+-------+-------------------------------------------+
|1      |{"hits":20}                                |
|2      |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
.withColumn("json_stats",explode(from_json(toJsonArray($"stats"),schema)))
.select(
$"version",
$"stats",
$"json_stats".getItem("hour").as("hour"),
$"json_stats".getItem("hits").as("hits")
).show(false)
+-------+-------------------------------------------+----+----+
|version|stats                                      |hour|hits|
+-------+-------------------------------------------+----+----+
|1      |{"hits":20}                                |null|20  |
|2      |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1   |10  |
|2      |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2   |12  |
+-------+-------------------------------------------+----+----+
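If you would rather keep typed columns instead of a string-to-int map, the same toJsonArray UDF also works with an array-of-struct schema; a small sketch under that assumption:

// Typed alternative to the map schema: hour and hits come out as integer columns,
// and the missing hour for version 1 is simply null.
val structSchema = ArrayType(new StructType().add("hour", IntegerType).add("hits", IntegerType))

df
  .withColumn("json_stats", explode(from_json(toJsonArray($"stats"), structSchema)))
  .select($"version", $"json_stats.hour".as("hour"), $"json_stats.hits".as("hits"))
  .show(false)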
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
.withColumn("stats",expr)
.withColumn("stats",explode(from_json($"stats",schema)))
.select(
$"version",
$"stats",
$"stats".getItem("hour").as("hour"),
$"stats".getItem("hits").as("hits")
)
.show(false)
+-------+-----------------------+----+----+
|version|stats                  |hour|hits|
+-------+-----------------------+----+----+
|1      |[hits -> 20]           |null|20  |
|2      |[hour -> 1, hits -> 10]|1   |10  |
|2      |[hour -> 2, hits -> 12]|2   |12  |
+-------+-----------------------+----+----+
Read the second file first, explode stats, and use the resulting schema to read the first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
Output:
root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse it against two or more schemas. (Add wholetext=True as an argument to spark.read.text if each file contains a single JSON document spanning multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
.withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
.selectExpr('version', 'inline_outer(status)') \
.show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
|      2|   1|  10|
|      2|   2|  12|
|      1|null|  20|
+-------+----+----+
While reading a parquet file, this is the file data:
|id |name |activegroup|
|1 |abc |[{"groupID":"5d","role":"admin","status":"A"},{"groupID":"58","role":"admin","status":"A"}]|
data types of each field
root
|--id : int
|--name : String
|--activegroup : String
The activegroup column is a string, so the explode function is not working. Following is the required output:
|id |name |groupID|role|status|
|1 |abc |5d |admin|A |
|1 |def |58 |admin|A |
Please help me with parsing the above in Spark Scala (latest version).
First you need to extract the json schema:
val schema = schema_of_json(lit(df.select($"activegroup").as[String].first))
Once you have it, you can convert your activegroup column, which is a String, to JSON with from_json, and then explode it.
Once the column is JSON, you can extract its values with $"columnName.field".
val dfresult = df.withColumn("jsonColumn", explode(
from_json($"activegroup", schema)))
.select($"id", $"name",
$"jsonColumn.groupId" as "groupId",
$"jsonColumn.role" as "role",
$"jsonColumn.status" as "status")
If you want to extract the whole JSON and the element names are fine for you, you can use * to do it:
val dfresult = df.withColumn("jsonColumn", explode(
from_json($"activegroup", schema)))
.select($"id", $"name", $"jsonColumn.*")
RESULT
+---+----+-------+-----+------+
| id|name|groupId| role|status|
+---+----+-------+-----+------+
|  1| abc|     5d|admin|     A|
|  1| abc|     58|admin|     A|
+---+----+-------+-----+------+
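If you prefer not to infer the schema at runtime with schema_of_json, a hand-written schema works just as well; a sketch matching the three fields in the sample data:

import org.apache.spark.sql.types._

// Explicit schema for the activegroup array, matching the sample rows above.
val activeGroupSchema = ArrayType(
  new StructType()
    .add("groupID", StringType)
    .add("role", StringType)
    .add("status", StringType)
)

val dfresult = df.withColumn("jsonColumn", explode(from_json($"activegroup", activeGroupSchema)))
  .select($"id", $"name", $"jsonColumn.*")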
I want to use spark.read() to pull data from a .csv file, while enforcing a schema. However, I can't get spark to recognize my dates as timestamps.
First I create a dummy file to test with
%scala
Seq("1|1/15/2019 2:24:00 AM","2|test","3|").toDF().write.text("/tmp/input/csvDateReadTest")
Then I try to read it, and provide a dateFormat string, but it doesn't recognize my dates, and sends the records to the badRecordsPath
df = spark.read.format('csv')
.schema("id int, dt timestamp")
.option("delimiter","|")
.option("badRecordsPath","/tmp/badRecordsPath")
.option("dateFormat","M/dd/yyyy hh:mm:ss aaa")
.load("/tmp/input/csvDateReadTest")
As a result, I get just 1 record in df (ID 3), when I'm expecting to see 2 (IDs 1 and 3).
df.show()
+---+----+
| id|  dt|
+---+----+
|  3|null|
+---+----+
You must change dateFormat to timestampFormat since in your case you need a timestamp type and not a date. Additionally, the value of the timestamp format should be MM/dd/yyyy h:mm:ss a (note that MM is the month; lowercase mm would be minutes).
Sample data:
Seq(
"1|1/15/2019 2:24:00 AM",
"2|test",
"3|5/30/1981 3:11:00 PM"
).toDF().write.text("/tmp/input/csvDateReadTest")
With the changes for the timestamp:
val df = spark.read.format("csv")
.schema("id int, dt timestamp")
.option("delimiter","|")
.option("badRecordsPath","/tmp/badRecordsPath")
.option("timestampFormat","mm/dd/yyyy h:mm:ss a")
.load("/tmp/input/csvDateReadTest")
And the output:
+----+-------------------+
|  id|                 dt|
+----+-------------------+
|   1|2019-01-15 02:24:00|
|   3|1981-05-30 15:11:00|
|null|               null|
+----+-------------------+
Note that the record with id 2 failed to comply with the schema definition and therefore contains null. If you also want to keep the invalid records, you need to change the timestamp column to string, and the output in this case will be:
+---+--------------------+
| id|                  dt|
+---+--------------------+
|  1|1/15/2019 2:24:00 AM|
|  3|5/30/1981 3:11:00 PM|
|  2|                test|
+---+--------------------+
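A minimal sketch of that string-typed read (the same options as above, with dt simply declared as string):

val dfStr = spark.read.format("csv")
  .schema("id int, dt string")    // dt stays a plain string, so the "test" row is kept
  .option("delimiter", "|")
  .load("/tmp/input/csvDateReadTest")

dfStr.show()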
UPDATE:
In order to change the string dt into a timestamp type you could try df.withColumn("dt", $"dt".cast("timestamp")), although this will fail and replace all the values with null, because the strings are not in the default yyyy-MM-dd HH:mm:ss format.
You can achieve this with the following code:
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import java.sql.Timestamp
import scala.util.{Try, Success, Failure}
val formatter = new SimpleDateFormat("MM/dd/yyyy h:mm:ss a", Locale.US)
df.map{ case Row(id:Int, dt:String) =>
val tryParse = Try[Date](formatter.parse(dt))
val p_timestamp = tryParse match {
case Success(parsed) => new Timestamp(parsed.getTime())
case Failure(_) => null
}
(id, p_timestamp)
}.toDF("id", "dt").show
Output:
+---+-------------------+
| id|                 dt|
+---+-------------------+
|  1|2019-01-15 02:24:00|
|  3|1981-05-30 15:11:00|
|  2|               null|
+---+-------------------+
Hi, here is the sample code:
// Parse the dt text column and reformat it as a standard timestamp string.
df.withColumn("times",
    from_unixtime(unix_timestamp(col("dt"), "M/dd/yyyy hh:mm:ss a"),
      "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .show(false)
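Note that from_unixtime returns a formatted string (and the fractional .SSSSSS part will always be zero, because unix_timestamp works in whole seconds). If you want an actual timestamp column instead, a small variation on the same idea (assuming the same dt column from the question) is to cast the parsed epoch seconds:

// unix_timestamp parses the text into seconds since the epoch; the cast yields a TimestampType column.
df.withColumn("ts", unix_timestamp(col("dt"), "M/dd/yyyy hh:mm:ss a").cast("timestamp"))
  .show(false)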