How to read custom formatted dates as timestamp in pyspark - csv

I want to use spark.read() to pull data from a .csv file, while enforcing a schema. However, I can't get spark to recognize my dates as timestamps.
First I create a dummy file to test with
%scala
Seq("1|1/15/2019 2:24:00 AM","2|test","3|").toDF().write.text("/tmp/input/csvDateReadTest")
Then I try to read it with a dateFormat string, but Spark doesn't recognize my dates and sends those records to the badRecordsPath
df = (spark.read.format('csv')
    .schema("id int, dt timestamp")
    .option("delimiter","|")
    .option("badRecordsPath","/tmp/badRecordsPath")
    .option("dateFormat","M/dd/yyyy hh:mm:ss aaa")
    .load("/tmp/input/csvDateReadTest"))
As a result, I get just 1 record in df (ID 3), when I'm expecting to see 2 (IDs 1 and 3).
df.show()
+---+----+
| id|  dt|
+---+----+
|  3|null|
+---+----+

You must use timestampFormat instead of dateFormat, since your column is a timestamp and not a date. Additionally, the pattern should be M/dd/yyyy h:mm:ss a. Note the uppercase M for the month: lowercase mm means minutes, so a pattern starting with mm silently produces the wrong month.
Sample data:
Seq(
  "1|1/15/2019 2:24:00 AM",
  "2|test",
  "3|5/30/1981 3:11:00 PM"
).toDF().write.text("/tmp/input/csvDateReadTest")
With the changes for the timestamp:
val df = spark.read.format("csv")
  .schema("id int, dt timestamp")
  .option("delimiter","|")
  .option("badRecordsPath","/tmp/badRecordsPath")
  .option("timestampFormat","M/dd/yyyy h:mm:ss a")
  .load("/tmp/input/csvDateReadTest")
And the output:
+----+-------------------+
|  id|                 dt|
+----+-------------------+
|   1|2019-01-15 02:24:00|
|   3|1981-05-30 15:11:00|
|null|               null|
+----+-------------------+
Note that the record with id 2 does not comply with the schema definition, so it comes back as null. If you also want to keep the invalid records, declare the dt column as string instead of timestamp; the output in that case will be:
+---+--------------------+
| id|                  dt|
+---+--------------------+
|  1|1/15/2019 2:24:00 AM|
|  3|5/30/1981 3:11:00 PM|
|  2|                test|
+---+--------------------+
UPDATE:
To convert the string dt into a timestamp you could try df.withColumn("dt", $"dt".cast("timestamp")), but the cast does not understand this custom format and replaces all the values with null.
You can parse the values explicitly with the following code:
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import java.sql.Timestamp
import scala.util.{Try, Success, Failure}
val formatter = new SimpleDateFormat("M/dd/yyyy h:mm:ss a", Locale.US)
df.map { case Row(id: Int, dt: String) =>
  // Parse failures (e.g. "test") become null instead of throwing
  val tryParse = Try[Date](formatter.parse(dt))
  val p_timestamp = tryParse match {
    case Success(parsed) => new Timestamp(parsed.getTime())
    case Failure(_) => null
  }
  (id, p_timestamp)
}.toDF("id", "dt").show
Output:
+---+-------------------+
| id|                 dt|
+---+-------------------+
|  1|2019-01-15 02:24:00|
|  3|1981-05-30 15:11:00|
|  2|               null|
+---+-------------------+
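For readers on the Python side, the same "parse or fall back to null" idea can be sketched with the standard library alone. This is only an analogue of the Scala snippet above, not Spark code: Python's strptime uses its own directives (%m is the month, %M the minutes, %I the 12-hour clock, %p AM/PM), and the helper name parse_ts is made up for this illustration.

```python
from datetime import datetime

# Python strptime directives, NOT Spark/Java pattern letters:
# %m = month, %d = day, %Y = year, %I = 12-hour clock, %M = minutes, %p = AM/PM
PATTERN = "%m/%d/%Y %I:%M:%S %p"

def parse_ts(text):
    """Hypothetical helper: return a datetime, or None when the text does not match."""
    try:
        return datetime.strptime(text, PATTERN)
    except (ValueError, TypeError):
        return None

rows = [(1, "1/15/2019 2:24:00 AM"), (2, "test"), (3, "5/30/1981 3:11:00 PM")]
parsed = [(row_id, parse_ts(raw)) for row_id, raw in rows]

for row_id, ts in parsed:
    print(row_id, ts)
```

Mixing up the month and minute fields does not raise an error in either language, which is why the wrong month is easy to miss.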

Here is sample code that converts the dt column using unix_timestamp and from_unixtime:
df.withColumn("times",
  from_unixtime(unix_timestamp(col("dt"), "M/dd/yyyy hh:mm:ss a"),
    "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .show(false)

Related

PySpark - saving JSON column to Postgres

I am trying to insert a PySpark DataFrame with a JSON / struct column into Postgres. My data:
+---------+------------+-------------+----------+
| user_id| group_id|point_balance| details|
+---------+------------+-------------+----------+
| 1| 1| 777| null|
| 2| 1| 888| null|
| 3| 2| 168| null|
| 4| 2| 425|{1, thing}|
+---------+------------+-------------+----------+
details is of type struct<abc:bigint,some:string> (this is test data, so those are example JSON keys and values).
I'm trying to save the data to local Postgres:
db_url = "jdbc:postgresql://localhost:5433/test_db"
table = "user_points"
(
user_points
.write
.format("jdbc")
.option("url", db_url)
.option("dbtable", table)
.option("user", "postgres")
.option("password", "postgres")
.mode("overwrite")
.save()
)
I get error IllegalArgumentException: Can't get JDBC type for struct<abc:bigint,some:string>.
I tried this solution:
(
user_points
.select("user_id", "group_id", "point_balance", F.to_json(F.struct("details")).alias("details"))
.write
.format("jdbc")
.option("url", db_url)
.option("dbtable", table)
.option("user", "postgres")
.option("password", "postgres")
.mode("overwrite")
.save()
)
This works, but it creates a column of type TEXT, while I want a proper JSONB column.
Explicitly using JSONB does not work:
.option("createTableColumnTypes", "user_id INTEGER, group_id INTEGER, point_balance INTEGER, details JSONB")
ParseException:
DataType jsonb is not supported.(line 1, pos 72)
== SQL ==
user_id INTEGER, group_id INTEGER, point_balance INTEGER, details JSONB
------------------------------------------------------------------------^^^
How can I achieve that? I would prefer to do this in Spark, but casting in Postgres would also be ok.

spark dataframes : reading json having duplicate column names but different datatypes

I have json data like below where version field is the differentiator -
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, stats column is now Arraytype(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both these type of multiple json files which come together. I cannot define stats as string in schema_def as that would affect the corruptrecord feature(for stats column) which checks malformed json and schema compliance of all the fields.
Example df output required in 1 read only -
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried to read with mergeSchema option but that makes stats field String type.
Also, I have tried making two dataframes by filtering on the version field, and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here also stats column becomes String type.
I was thinking if while reading, I could parse the df column headers as stats_v1 and stats_v2 on basis of data-types can solve the problem. Please help with any possible solutions.
A UDF that checks whether stats is a JSON object or an array; if it is a single object, it wraps it in an array so that both versions can be parsed with one array schema.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats
  def toArray(data: String) = {
    val json_data = parse(data)
    // Wrap a single JSON object in a list; leave arrays untouched
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
.withColumn("json_stats",explode(from_json(toJsonArray($"stats"),schema)))
.select(
$"version",
$"stats",
$"json_stats".getItem("hour").as("hour"),
$"json_stats".getItem("hits").as("hits")
).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
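The wrapping rule used above (treat a lone object as a one-element array, then explode) can be sketched in plain Python to make the logic easy to check. This is only an illustration on the two sample records from the question, with no Spark involved; the function name stats_rows is hypothetical.

```python
import json

def stats_rows(line):
    """Hypothetical helper: flatten one JSON record into (version, hour, hits) rows."""
    record = json.loads(line)
    stats = record["stats"]
    # Version 1 stores a single object; wrap it so both versions look like an array
    if isinstance(stats, dict):
        stats = [stats]
    return [(record["version"], s.get("hour"), s.get("hits")) for s in stats]

lines = [
    '{"version": 1, "stats": {"hits":20}}',
    '{"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}',
]
rows = [row for line in lines for row in stats_rows(line)]

for row in rows:
    print(row)
```

Checking the parsed type, as here and in the UDF, is a little more robust than looking for a leading "[{" in the raw string, since it does not depend on whitespace.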
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
.withColumn("stats",expr)
.withColumn("stats",explode(from_json($"stats",schema)))
.select(
$"version",
$"stats",
$"stats".getItem("hour").as("hour"),
$"stats".getItem("hits").as("hits")
)
.show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, use schema to read first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse it against two or more schemas. (Add wholetext=True as an argument of spark.read.text if each file contains a single JSON document spanning multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
.withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
.selectExpr('version', 'inline_outer(status)') \
.show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+

How to parse json string to different columns in spark scala?

After reading a parquet file, the data looks like this:
|id |name |activegroup|
|1 |abc |[{"groupID":"5d","role":"admin","status":"A"},{"groupID":"58","role":"admin","status":"A"}]|
data types of each field
root
|--id : int
|--name : String
|--activegroup : String
The activegroup column is a string, so the explode function does not work on it. The following is the required output:
|id |name |groupID|role|status|
|1 |abc |5d |admin|A |
|1 |def |58 |admin|A |
Please help me parse the above in the latest version of Spark with Scala.
First you need to extract the json schema:
val schema = schema_of_json(lit(df.select($"activeGroup").as[String].first))
Once you have it, you can convert your activegroup column from a String to a parsed JSON array (from_json), and then explode it.
Once the column is parsed, you can extract its values with $"columnName.field"
val dfresult = df.withColumn("jsonColumn", explode(
from_json($"activegroup", schema)))
.select($"id", $"name",
$"jsonColumn.groupId" as "groupId",
$"jsonColumn.role" as "role",
$"jsonColumn.status" as "status")
If you want to extract every field and the element names are fine as column names, you can use * to do it:
val dfresult = df.withColumn("jsonColumn", explode(
from_json($"activegroup", schema)))
.select($"id", $"name", $"jsonColumn.*")
RESULT
+---+----+-------+-----+------+
| id|name|groupId| role|status|
+---+----+-------+-----+------+
| 1| abc| 5d|admin| A|
| 1| abc| 58|admin| A|
+---+----+-------+-----+------+
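To make the parse-then-explode step concrete, here is a plain-Python sketch of what happens to one row. It is an analogue for illustration only (no Spark involved); the function name explode_groups is hypothetical, and the sample row is the one from the question.

```python
import json

# One input row: (id, name, activegroup), where activegroup is a JSON string
row = (1, "abc",
       '[{"groupID":"5d","role":"admin","status":"A"},'
       '{"groupID":"58","role":"admin","status":"A"}]')

def explode_groups(row):
    """Hypothetical helper: parse the JSON array and emit one output row per element."""
    row_id, name, activegroup = row
    return [
        (row_id, name, group["groupID"], group["role"], group["status"])
        for group in json.loads(activegroup)
    ]

result = explode_groups(row)
for out in result:
    print(out)
```

The id and name values are simply repeated for every element of the array, which is what explode does in the Spark version.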

Parsing JSON within a Spark DataFrame into new columns

Background
I have a dataframe that looks like this:
------------------------------------------------------------------------
|name |meals |
------------------------------------------------------------------------
|Tom |{"breakfast": "banana", "lunch": "sandwich"} |
|Alex |{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"} |
|Lisa |{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"} |
------------------------------------------------------------------------
Obtained from the following:
var rawDf = Seq(("Tom",s"""{"breakfast": "banana", "lunch": "sandwich"}""" ),
("Alex", s"""{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}"""),
("Lisa", s"""{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}""")).toDF("name", "meals")
I want to transform it into a dataframe that looks like this:
------------------------------------------------------------------------
|name |meal |food |
------------------------------------------------------------------------
|Tom |breakfast | banana |
|Tom |lunch | sandwich |
|Alex |breakfast | yogurt |
|Alex |lunch | pizza |
|Alex |dinner | pasta |
|Lisa |lunch | sushi |
|Lisa |dinner | lasagna |
|Lisa |snack | apple |
------------------------------------------------------------------------
I'm using Spark 2.1, so I'm parsing the json using get_json_object. Currently, I'm trying to get the final dataframe using an intermediary dataframe that looks like this:
------------------------------------------------------------------------
|name |breakfast |lunch |dinner |snack |
------------------------------------------------------------------------
|Tom |banana |sandwich |null |null |
|Alex |yogurt |pizza |pasta |null |
|Lisa |null |sushi |lasagna |apple |
------------------------------------------------------------------------
Obtained from the following:
val intermediaryDF = rawDf.select(col("name"),
get_json_object(col("meals"), "$." + Meals.breakfast).alias(Meals.breakfast),
get_json_object(col("meals"), "$." + Meals.lunch).alias(Meals.lunch),
get_json_object(col("meals"), "$." + Meals.dinner).alias(Meals.dinner),
get_json_object(col("meals"), "$." + Meals.snack).alias(Meals.snack))
Meals is defined in another file that has a lot more entries than breakfast, lunch, dinner, and snack, but it looks something like this:
object Meals {
val breakfast = "breakfast"
val lunch = "lunch"
val dinner = "dinner"
val snack = "snack"
}
I then use intermediaryDF to compute the final DataFrame, like so:
val finalDF = intermediaryDF.where(col("breakfast").isNotNull).select(col("name"), col("breakfast")).union(
  intermediaryDF.where(col("lunch").isNotNull).select(col("name"), col("lunch"))).union(
  intermediaryDF.where(col("dinner").isNotNull).select(col("name"), col("dinner"))).union(
  intermediaryDF.where(col("snack").isNotNull).select(col("name"), col("snack")))
My problem
Using the intermediary DataFrame works if I only have a few types of Meals, but I actually have 40, and enumerating every one of them to compute intermediaryDF is impractical. I also don't like the idea of having to compute this DF in the first place. Is there a way to get directly from my raw dataframe to the final dataframe without the intermediary step, and also without explicitly having a case for every value in Meals?
Apache Spark can parse JSON data, but it needs a predefined schema to parse it correctly. Your JSON data is dynamic, so you cannot rely on a schema.
One way around this is not to let Apache Spark parse the data, and instead parse it yourself as key/value pairs (e.g. into something like Map[String, String], which is pretty generic).
Here is what you can do instead:
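Use the Jackson JSON mapper for Scala:
// In older jackson-module-scala releases, ScalaObjectMapper lives in the "experimental" package
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
// mapper object created on each executor node
val mapper = new ObjectMapper with ScalaObjectMapper
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
mapper.registerModule(DefaultScalaModule)
val valueAsMap = mapper.readValue[Map[String, String]](s"""{"breakfast": "banana", "lunch": "sandwich"}""")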
This transforms the JSON string into a Map[String, String], which can also be viewed as a List of (key, value) pairs:
List((breakfast,banana), (lunch,sandwich))
Now the Apache Spark part comes into play. Define a user-defined function that parses the string and outputs the List of (key, value) pairs:
val jsonToArray = udf((json:String) => {
mapper.readValue[Map[String, String]](json).toList
})
Apply that transformation to the "meals" column, which turns it into a column of type Array. Then explode that column and select the key entry as column meal and the value entry as column food:
val df1 = rawDf.select(col("name"), explode(jsonToArray(col("meals"))).as("meals"))
df1.select(col("name"), col("meals._1").as("meal"), col("meals._2").as("food"))
Showing the last dataframe outputs:
+----+---------+--------+
|name|     meal|    food|
+----+---------+--------+
| Tom|breakfast|  banana|
| Tom|    lunch|sandwich|
|Alex|breakfast|  yogurt|
|Alex|    lunch|   pizza|
|Alex|   dinner|   pasta|
|Lisa|    lunch|   sushi|
|Lisa|   dinner| lasagna|
|Lisa|    snack|   apple|
+----+---------+--------+
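The key point of this approach is that no meal name is ever listed explicitly: each JSON object is turned into (key, value) pairs, and each pair becomes a row. A plain-Python sketch of that same transformation, using the sample data from the question and no Spark, might look like this:

```python
import json

raw = [
    ("Tom", '{"breakfast": "banana", "lunch": "sandwich"}'),
    ("Alex", '{"breakfast": "yogurt", "lunch": "pizza", "dinner": "pasta"}'),
    ("Lisa", '{"lunch": "sushi", "dinner": "lasagna", "snack": "apple"}'),
]

# Parse each JSON object into a dict, then emit one (name, meal, food) row per entry.
# The keys come from the data itself, so 4 or 40 meal types are handled the same way.
final = [
    (name, meal, food)
    for name, meals in raw
    for meal, food in json.loads(meals).items()
]

for row in final:
    print(row)
```

This mirrors what the UDF plus explode does per row; it is only meant to show the shape of the transformation, not to replace the Spark code.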

from_json of Spark sql return null values

I loaded a parquet file into a spark dataframe as follows :
val message= spark.read.parquet("gs://defenault-zdtt-devde/pubsub/part-00001-e9f8c58f-7de0-4537-a7be-a9a8556sede04a-c000.snappy.parquet")
when I perform a collect on my dataframe I get the following result :
message.collect()
Array[org.apache.spark.sql.Row] = Array([118738748835150,2018-08-20T17:44:38.742Z,{"id":"uplink-3130-85bc","device_id":60517119992794222,"group_id":69,"group":"box-2478-2555","profile_id":3,"profile":"eolane-movee","type":"uplink","timestamp":"2018-08-20T17:44:37.048Z","count":3130,"payload":[{"timestamp":"2018-08-20T17:44:37.048Z","data":{"battery":3.5975599999999996,"temperature":27}}],"payload_encrypted":"9da25e36","payload_cleartext":"fe1b01aa","device_properties":{"appeui":"7ca97df000001190","deveui":"7ca97d0000001bb0","external_id":"Product: 3.7 / HW: 3.1 / SW: 1.8.8","no_de_serie_eolane":"4904","no_emballage":"S02066","product_version":"1.3.1"},"protocol_data":{"AppNonce":"e820ef","DevAddr":"0e6c5fda","DevNonce":"85bc","NetID":"000007","best_gateway_id":"M40246","gateway.
The schema of this dataframe is
message.printSchema()
root
|-- Id: string (nullable = true)
|-- publishTime: string (nullable = true)
|-- data: string (nullable = true)
My aim is to work on the data column which holds json data and to flatten it.
I wrote the following code
val schemaTotal = new StructType (
Array (StructField("id",StringType,false),StructField("device_id",StringType),StructField("group_id",LongType), StructField("group",StringType),StructField("profile_id",IntegerType),StructField("profile",StringType),StructField("type",StringType),StructField("timestamp",StringType),
StructField("count",StringType),
StructField("payload",new StructType ()
.add("timestamp",StringType)
.add("data",new ArrayType (new StructType().add("battery",LongType).add("temperature",LongType),false))),
StructField("payload_encrypted",StringType),
StructField("payload_cleartext",StringType),
StructField("device_properties", new ArrayType (new StructType().add("appeui",StringType).add("deveui",StringType).add("external_id",StringType).add("no_de_serie_eolane",LongType).add("no_emballage",StringType).add("product_version",StringType),false)),
StructField("protocol_data", new ArrayType (new StructType().add("AppNonce",StringType).add("DevAddr",StringType).add("DevNonce",StringType).add("NetID",LongType).add("best_gateway_id",StringType).add("gateways",IntegerType).add("lora_version",IntegerType).add("noise",LongType).add("port",IntegerType).add("rssi",DoubleType).add("sf",IntegerType).add("signal",DoubleType).add("snr",DoubleType),false)),
StructField("lat",StringType),
StructField("lng",StringType),
StructField("geolocation_type",StringType),
StructField("geolocation_precision",StringType),
StructField("delivered_at",StringType)))
val dataframe_extract=message.select($"Id",
$"publishTime",
from_json($"data",schemaTotal).as("content"))
val table = dataframe_extract.select(
$"Id",
$"publishTime",
$"content.id" as "id",
$"content.device_id" as "device_id",
$"content.group_id" as "group_id",
$"content.group" as "group",
$"content.profile_id" as "profile_id",
$"content.profile" as "profile",
$"content.type" as "type",
$"content.timestamp" as "timestamp",
$"content.count" as "count",
$"content.payload.timestamp" as "timestamp2",
$"content.payload.data.battery" as "battery",
$"content.payload.data.temperature" as "temperature",
$"content.payload_encrypted" as "payload_encrypted",
$"content.payload_cleartext" as "payload_cleartext",
$"content.device_properties.appeui" as "appeui"
)
table.show() gives me null values for all columns:
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
| Id| publishTime| id|device_id|group_id|group|profile_id|profile|type|timestamp|count|timestamp2|battery|temperature|payload_encrypted|payload_cleartext|appeui|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
|118738748835150|2018-08-20T17:44:...|null| null| null| null| null| null|null| null| null| null| null| null| null| null| null|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
However, table.printSchema() gives me the expected result. Any idea how to solve this, please?
I am working with Zeppelin as a first prototyping step. Thanks a lot in advance for your help.
Best Regards
The from_json() SQL function has the following constraint when converting a column value to a dataframe.
The data type you define in the schema for each field must match the value present in the JSON. If any field's value does not match its declared type, all column values come back as null.
e.g.:
'{"name": "raj", "age": "12"}' for this column value (note that age is a quoted string here)
StructType(List(StructField(name,StringType,true),StructField(age,IntegerType,true)))
The above schema will return a null value on both columns, because "12" is a JSON string and not an integer
StructType(List(StructField(name,StringType,true),StructField(age,StringType,true)))
The above schema will return the expected dataframe
That is the likely cause in this thread: if any value does not match its declared type, from_json returns null for all columns.
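Looking at the sample record in the question, a few shape mismatches stand out: payload is a JSON array of objects but is declared as a struct, device_properties is a JSON object but is declared as an array, and battery is a decimal number declared as LongType. One way to spot such problems before involving Spark is to compare a sample record against the intended shape. The sketch below is plain Python with a made-up mini notation (dict = struct, one-element list = array, int/str = scalar); find_mismatches is a hypothetical helper, not a Spark API, and it assumes string fields accept any value.

```python
import json

def find_mismatches(value, expected, path="$"):
    """Hypothetical helper: list paths where the JSON shape differs from the declared shape."""
    if value is None:
        return []
    if isinstance(expected, dict):  # declared as a struct
        if not isinstance(value, dict):
            return [f"{path}: expected struct, found {type(value).__name__}"]
        problems = []
        for key, sub in expected.items():
            if key in value:
                problems += find_mismatches(value[key], sub, f"{path}.{key}")
        return problems
    if isinstance(expected, list):  # declared as an array
        if not isinstance(value, list):
            return [f"{path}: expected array, found {type(value).__name__}"]
        problems = []
        for i, item in enumerate(value):
            problems += find_mismatches(item, expected[0], f"{path}[{i}]")
        return problems
    if expected is str:  # assumption: string fields accept any JSON value
        return []
    if isinstance(value, bool) or not isinstance(value, expected):
        return [f"{path}: expected {expected.__name__}, found {type(value).__name__}"]
    return []

# A shortened copy of the record from the question
sample = json.loads(
    '{"id":"uplink-3130-85bc","group_id":69,'
    '"payload":[{"timestamp":"2018-08-20T17:44:37.048Z",'
    '"data":{"battery":3.5975599999999996,"temperature":27}}],'
    '"device_properties":{"appeui":"7ca97df000001190","no_de_serie_eolane":"4904"}}'
)

# The shape declared in the question's schema (shortened the same way)
declared = {
    "id": str,
    "group_id": int,
    "payload": {"timestamp": str, "data": [{"battery": int, "temperature": int}]},
    "device_properties": [{"appeui": str, "no_de_serie_eolane": int}],
}

problems = find_mismatches(sample, declared)
for problem in problems:
    print(problem)
```

The check stops descending at the first mismatch on each branch, so after fixing payload and device_properties it is worth running it again to surface nested issues such as the battery type.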