Spark Structured Streaming: primary key in JDBC sink - MySQL

I am reading a stream of data from a Kafka topic using Structured Streaming with Update mode, and then doing some transformations.
Then I have created a JDBC sink to push the data to MySQL with Append mode. The problem is: how do I tell my sink which column is the primary key and have it update based on that, so that my table does not end up with duplicate rows?
val df: DataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<List-here>")
  .option("subscribe", "emp-topic")
  .load()

import spark.implicits._

// value in Kafka is bytes, so cast it to String
val empList: Dataset[Employee] = df
  .selectExpr("CAST(value AS STRING)")
  .map(row => Employee(row.getString(0)))
// window aggregations on 1-minute windows
val aggregatedDf = ......

// How do I tell the sink here that id is my primary key and that it should
// update based on the id column?
aggregatedDf
  .writeStream
  .trigger(Trigger.ProcessingTime(60.seconds))
  .outputMode(OutputMode.Update)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .select("id", "name", "salary", "dept")
      .write.format("jdbc")
      .option("url", "jdbc:mysql://localhost/empDb")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "empDf")
      .option("user", "root")
      .option("password", "root")
      .mode(SaveMode.Append)
      .save()
  }
  .start()

One way is to use ON DUPLICATE KEY UPDATE together with foreachPartition; that may serve this purpose.
Below is a rough code snippet:
import java.sql.{Connection, DriverManager}

import org.apache.spark.sql.{DataFrame, Row}

/**
 * Upsert into the database using foreachPartition.
 * @param dataframe : DataFrame to write
 * @param sqlDatabaseConnectionString : JDBC connection string
 * @param sqlTableName : target table name
 * @param numPartitions : number of simultaneous DB connections you plan to allow
 */
def insertToTable(dataframe: DataFrame, sqlDatabaseConnectionString: String,
                  sqlTableName: String, numPartitions: Int): Unit = {
  val repartitioned = dataframe.repartition(numPartitions)
  val tableHeader: String = repartitioned.columns.mkString(",")
  // refresh every column when the primary key already exists
  val updateClause: String = repartitioned.columns.map(c => s"$c=VALUES($c)").mkString(",")
  repartitioned.foreachPartition { partition: Iterator[Row] =>
    // Note: one connection per partition (a connection pool would be even better)
    val sqlExecutorConnection: Connection = DriverManager.getConnection(sqlDatabaseConnectionString)
    try {
      // batch size of 1000 is used since some databases (e.g. Azure SQL) cannot take batches larger than 1000
      partition.grouped(1000).foreach { group =>
        // naive quoting: assumes the values themselves contain no quotes
        val valuesClause = group.map(record => record.mkString("('", "','", "')")).mkString(",")
        val sql =
          s"""INSERT INTO $sqlTableName ($tableHeader)
             |VALUES $valuesClause
             |ON DUPLICATE KEY UPDATE $updateClause""".stripMargin
        sqlExecutorConnection.createStatement().executeUpdate(sql)
      }
    } finally {
      sqlExecutorConnection.close() // close the connection
    }
  }
}
You can use a PreparedStatement instead of building the SQL string by hand with a plain JDBC Statement.
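As a rough sketch of that variant, replacing the statement-building inside the partition loop above (the column names and types are assumptions based on the question's select("id", "name", "salary", "dept")):

val upsertSql =
  """INSERT INTO empDf (id, name, salary, dept) VALUES (?, ?, ?, ?)
    |ON DUPLICATE KEY UPDATE name=VALUES(name), salary=VALUES(salary), dept=VALUES(dept)""".stripMargin
val stmt = sqlExecutorConnection.prepareStatement(upsertSql)
partition.foreach { row =>
  stmt.setLong(1, row.getAs[Long]("id"))        // assumed column types; adjust to your schema
  stmt.setString(2, row.getAs[String]("name"))
  stmt.setDouble(3, row.getAs[Double]("salary"))
  stmt.setString(4, row.getAs[String]("dept"))
  stmt.addBatch()
}
stmt.executeBatch()
stmt.close()

Besides avoiding the manual quoting, the prepared statement sends the values as parameters, so odd characters in the data cannot break the SQL.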
Further reading : SPARK SQL - update MySql table using DataFrames and JDBC

Related

How can we read invalid date column in spark scala from mysql server using jdbc driver url (connection)

I am getting an error while reading this column from the MySQL server:
id | date
---+-----------
1  | 0000-00-00
2  | 0000-00-01
In the above data set we can handle 0000-00-00 by using the additional MySQL server parameter
zeroDateTimeBehavior=convertToNull
but I don't know how to handle a date like 0000-00-01.
This is the error message I got:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 11) (10.100.4.111 executor 1): java.sql.SQLException: YEAR
I am using this:
val a = "jdbc:mysql://<host_name>:3306/<database_name>?zeroDateTimeBehavior=convertToNull"
val mysqlServerDF = sparkSession.read.format("jdbc")
.option("url", a)
.option("query", sql)
.option("user",jdbcUserName)
.option("password", jdbcPassword)
.load()
Here sql is a SQL query, for example "select * from table".
If fixing such dates in your database is not an option, I think your best bet would be to handle it directly in your SQL query. E.g. we can check against the valid DATE range, which is '1000-01-01' to '9999-12-31' according to the docs:
val sql = """
select
id,
case
when
not cast(date as char(10))
between '1000-01-02' and '9999-12-30'
then
null
else
date
end
from table1"""
val mysqlServerDF = sparkSession.read.format("jdbc")
  .option("url", a)
  .option("query", sql)
  .option("user", jdbcUserName)
  .option("password", jdbcPassword)
  .load()
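Once the invalid dates come through as NULL, they can be filtered out (or defaulted) on the Spark side; a minimal sketch:

import org.apache.spark.sql.functions.col

// keep only the rows whose date survived the CASE guard
val cleanDF = mysqlServerDF.filter(col("date").isNotNull)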

How to add field within nested JSON when reading from/writing to Kafka via a Spark dataframe

I've a Spark (v.3.0.1) job written in Java that reads Json from Kafka, does some transformation and then writes it back to Kafka. For now, the incoming message structure in Kafka is something like:
{"catKey": 1}. The output from the Spark job that's written back to Kafka is something like: {"catKey":1,"catVal":"category-1"}. The code for processing input data from Kafka goes something as follows:
DataFrameReader dfr = putSrcProps(spark.read().format("kafka"));
for (String key : srcProps.stringPropertyNames()) {
  dfr = dfr.option(key, srcProps.getProperty(key));
}
Dataset<Row> df = dfr.option("group.id", getConsumerGroupId())
    .load()
    .selectExpr("CAST(value AS STRING) as value")
    .withColumn("jsonData", from_json(col("value"), schemaHandler.getSchema()))
    .select("jsonData.*");
// transform df
df.toJSON().write().format("kafka").option("key", "val").save();
I want to change the message structure in Kafka. Now it should be of the format: {"metadata": <whatever>, "payload": {"catKey": 1}}. While reading, we need to read only the contents of the payload, so the dataframe remains similar. While writing back to Kafka, I first need to wrap the message in payload and add the metadata, so the output has to be of the format: {"metadata": <whatever>, "payload": {"catKey":1,"catVal":"category-1"}}. I've tried manipulating the contents of the selectExpr and the from_json method, but no luck so far. Any pointers on how to achieve this would be very much appreciated.
To extract the content of payload in your JSON you can use get_json_object. And to create the new output you can use the built-in functions struct and to_json.
Given a Dataframe:
val df = Seq(("""{"metadata": "whatever", "payload": {"catKey": 1}}""")).toDF("value").as[String]
df.show(false)
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|{"metadata": "whatever", "payload": {"catKey": 1}}|
+--------------------------------------------------+
Then create the new payload and metadata columns:
val df2 = df
  .withColumn("catVal", lit("category-1")) // whatever your logic is to fill this column
  .withColumn("payload",
    struct(
      get_json_object(col("value"), "$.payload.catKey").as("catKey"),
      col("catVal").as("catVal")
    )
  )
  .withColumn("metadata", get_json_object(col("value"), "$.metadata"))
  .select("metadata", "payload")
df2.show(false)
+--------+---------------+
|metadata|payload |
+--------+---------------+
|whatever|[1, category-1]|
+--------+---------------+
Finally, pack both columns back into a single JSON value column:
val df3 = df2.select(to_json(struct(col("metadata"), col("payload"))).as("value"))
df3.show(false)
+----------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------+
|{"metadata":"whatever","payload":{"catKey":"1","catVal":"category-1"}}|
+----------------------------------------------------------------------+
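From there, writing the reshaped value column back to Kafka is the usual Kafka sink write; a minimal sketch (the broker list and topic name are placeholders):

df3.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "<broker-list>")
  .option("topic", "output-topic")
  .save()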

Handle spark reading database mysql table throwing exception

I have a Spark job that reads from an AWS Aurora MySQL database. Unfortunately, the job keeps failing with an exception because of an invalid datetime in one of the records.
Sample code:
val jdbcUrl =
  s"jdbc:mysql://$dbHostname:$dbPort/$dbName?zeroDateTimeBehavior=convertToNull&serverTimezone=UTC"
val props = reportConf.connectionProps(db)
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    s"(SELECT *, MOD($partitionColumn,10) AS partition_key FROM $table ORDER BY $partitionColumn) as $table") // DESC LIMIT 50000
  .option("user", user)
  .option("password", password)
  .option("driver", driver)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "partition_key")
  .option("lowerBound", 0)
  .option("upperBound", 9)
  .option("mode", "DROPMALFORMED")
  .load()
  .drop('partition_key)
I already tried converting the zero date values to null with zeroDateTimeBehavior=convertToNull in my jdbcUrl, but it is not working.
Preferably, I would like to skip those records, or replace them with some default values that I can filter on later, instead of manually identifying the bad records in the database tables (see the sketch after the stack trace below).
Any idea how to fix this in Spark?
Exception:
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: java.lang.IllegalArgumentException: DAY_OF_MONTH
at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2648)
at java.util.Calendar.updateTime(Calendar.java:3393)
at java.util.Calendar.getTimeInMillis(Calendar.java:1782)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:67)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:39)
at com.mysql.cj.core.io.ZeroDateTimeToNullValueFactory.createFromDate(ZeroDateTimeToNullValueFactory.java:41)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.MysqlTextValueDecoder.decodeDate(MysqlTextValueDecoder.java:66)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.decodeAndCreateReturnValue(AbstractResultsetRow.java:70)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.getValueFromBytes(AbstractResultsetRow.java:225)
at com.mysql.cj.mysqla.result.TextBufferRow.getValue(TextBufferRow.java:122)
at com.mysql.cj.jdbc.result.ResultSetImpl.getNonStringValueFromRow(ResultSetImpl.java:630)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDateOrTimestampValueFromRow(ResultSetImpl.java:643)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDate(ResultSetImpl.java:788)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:387)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:356)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:338)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNex
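One hedged option, mirroring the CASE-expression approach from the date question above, is to NULL out the offending column inside the pushed-down dbtable subquery. A sketch, where created_at stands in for whatever the problematic DATE column actually is:

val safeQuery =
  s"""(SELECT t.*,
     |        MOD($partitionColumn, 10) AS partition_key,
     |        CASE WHEN CAST(t.created_at AS CHAR(10)) BETWEEN '1000-01-01' AND '9999-12-31'
     |             THEN t.created_at ELSE NULL END AS created_at_clean
     |   FROM $table t) AS $table""".stripMargin

// pass safeQuery as the "dbtable" option in the spark.read shown above,
// then use created_at_clean instead of the raw column downstream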

Problem with reading bit data type from MySQL and transformation into Redshift with AWS Glue

I have a table in a MySQL Database that contains a column called activity which is of data type BIT. When converted to INT it can take on the values 1,2 or 3.
When using the crawler in AWS Glue it recognizes activity as BOOLEAN. I have tried to edit the schema for the table and change the data type for activity to INT, but Glue still reads it as BOOLEAN when running the job.
I have also tried to use ApplyMapping to convert it to INT but with no success.
Any ideas on how to resolve this?
I solved it by pushing down a query to the MySQL database that CASTs the BIT to INT when reading it into Glue:
pushdown_query = "(SELECT col1, CAST(activity AS INT) AS activity FROM my_table) my_table"
df = glueContext.read.format("jdbc") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("url", db_url) \
    .option("dbtable", pushdown_query) \
    .option("user", db_user) \
    .option("password", db_pass) \
    .load()
You can instead use Spark/pySpark code to read the table from your MySQL database.
For example, using pySpark it would be as follows:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
employees_table = spark.read.jdbc(jdbcUrl, "employees", connectionProperties)
You can find more information at this link: Spark Data Sources
Hopefully, Spark will do a much better job than AWS Glue at inferring the schema.
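If you go the plain-Spark route, the two answers combine naturally: push the CAST down in the dbtable subquery and read it through the JDBC source. A sketch in Scala, reusing the pushdown query from above with placeholder connection values:

val pushdownQuery = "(SELECT col1, CAST(activity AS INT) AS activity FROM my_table) my_table"
val df = spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", dbUrl)
  .option("dbtable", pushdownQuery)
  .option("user", dbUser)
  .option("password", dbPass)
  .load()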

Mapping a sequence of results from Slick monadic join to Json

I'm using Play 2.4 with Slick 3.1.x, specifically the Slick-Play plugin v1.1.1. Firstly, some context... I have the following search/filter method in a DAO, which joins together 4 models:
def search(
  departureCity: Option[String],
  arrivalCity: Option[String],
  departureDate: Option[Date]
) = {
  val monadicJoin = for {
    sf <- slickScheduledFlights.filter(a =>
      departureDate.map(d => a.date === d).getOrElse(slick.lifted.LiteralColumn(true))
    )
    fl <- slickFlights if sf.flightId === fl.id
    al <- slickAirlines if fl.airlineId === al.id
    da <- slickAirports.filter(a =>
      fl.departureAirportId === a.id &&
      departureCity.map(c => a.cityCode === c).getOrElse(slick.lifted.LiteralColumn(true))
    )
    aa <- slickAirports.filter(a =>
      fl.arrivalAirportId === a.id &&
      arrivalCity.map(c => a.cityCode === c).getOrElse(slick.lifted.LiteralColumn(true))
    )
  } yield (fl, sf, al, da, aa)

  db.run(monadicJoin.result)
}
The output from this is a Vector containing sequences, e.g:
Vector(
(
Flight(Some(1),123,216,2013,3,1455,2540,3,905,500,1150),
ScheduledFlight(Some(1),1,2016-04-13,90,10),
Airline(Some(216),BA,BAW,British Airways,United Kingdom),
Airport(Some(2013),LHR,Heathrow,LON,...),
Airport(Some(2540),JFK,John F Kennedy Intl,NYC...)
),
(
etc ...
)
)
I'm currently rendering the JSON in the controller by calling .toJson on a Map and inserting this Vector (the results param below), like so:
flightService.search(departureCity, arrivalCity, departureDate).map(results => {
  Ok(
    Map[String, Any](
      "status" -> "OK",
      "data" -> results
    ).toJson
  ).as("application/json")
})
While this sort of works, it produces output in an unusual format: an array of results (the rows) in which, within each result object, the joined models are nested inside objects with the keys "_1", "_2" and so on.
So the question is: how should I go about restructuring this?
There doesn't appear to be anything that specifically covers this sort of scenario in the Slick docs, so I would be grateful for some input on the best way to refactor this Vector of tuples, with a view to renaming each of the joins or even flattening it out and only keeping certain fields.
Is this best done in the DAO search method before it's returned (by mapping it somehow?) or in the controller after I get back the Future results Vector from the search method?
Or I'm wondering whether it would be preferable to abstract this sort of mutation out somewhere else entirely, using a transformer perhaps?
You need JSON Reads/Writes/Format Combinators
First of all, you must have a Writes[T] for each of your classes (Flight, ScheduledFlight, Airline, Airport).
The simplest way is to use the Json macros (this assumes import play.api.libs.json._; the combinators below also need import play.api.libs.functional.syntax._):
implicit val flightWrites: Writes[Flight] = Json.writes[Flight]
implicit val scheduledFlightWrites: Writes[ScheduledFlight] = Json.writes[ScheduledFlight]
implicit val airlineWrites: Writes[Airline] = Json.writes[Airline]
implicit val airportWrites: Writes[Airport] = Json.writes[Airport]
You also need an OWrites[(Flight, ScheduledFlight, Airline, Airport, Airport)] for the items of the Vector. For example:
val itemWrites: OWrites[(Flight, ScheduledFlight, Airline, Airport, Airport)] = (
  (__ \ "flight").write[Flight] and
  (__ \ "scheduledFlight").write[ScheduledFlight] and
  (__ \ "airline").write[Airline] and
  (__ \ "airport1").write[Airport] and
  (__ \ "airport2").write[Airport]
).tupled
For writing the whole Vector as a JsArray, use Writes.seq[T]:
val resultWrites: Writes[Seq[(Flight, ScheduledFlight, Airline, Airport, Airport)]] = Writes.seq(itemWrites)
Now we have everything we need to build the response:
flightService.search(departureCity, arrivalCity, departureDate).map(results =>
  Ok(
    Json.obj(
      "status" -> "Ok",
      "data" -> resultWrites.writes(results)
    )
  )
)