Handle Spark reading a MySQL database table throwing an exception - mysql

I have a Spark job that reads from an AWS Aurora MySQL database. Unfortunately, the job keeps failing with an exception because of an invalid datetime in one of the records.
Sample code:
val jdbcUrl =
  s"jdbc:mysql://$dbHostname:$dbPort/$dbName?zeroDateTimeBehavior=convertToNull&serverTimezone=UTC"
val props = reportConf.connectionProps(db)
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    s"(SELECT *, MOD($partitionColumn,10) AS partition_key FROM $table ORDER BY $partitionColumn) as $table") // DESC LIMIT 50000
  .option("user", user)
  .option("password", password)
  .option("driver", driver)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "partition_key")
  .option("lowerBound", 0)
  .option("upperBound", 9)
  .option("mode", "DROPMALFORMED")
  .load()
  .drop('partition_key)
I already tried converting the zero date values to null with zeroDateTimeBehavior=convertToNull in my jdbcUrl, but it is not working.
Preferably, I would like to skip the bad records or replace them with default values that I can filter on later, instead of manually identifying the bad records in the database tables.
Any idea how to fix this in Spark, please?
Exception:
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: java.lang.IllegalArgumentException: DAY_OF_MONTH
at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2648)
at java.util.Calendar.updateTime(Calendar.java:3393)
at java.util.Calendar.getTimeInMillis(Calendar.java:1782)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:67)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:39)
at com.mysql.cj.core.io.ZeroDateTimeToNullValueFactory.createFromDate(ZeroDateTimeToNullValueFactory.java:41)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.MysqlTextValueDecoder.decodeDate(MysqlTextValueDecoder.java:66)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.decodeAndCreateReturnValue(AbstractResultsetRow.java:70)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.getValueFromBytes(AbstractResultsetRow.java:225)
at com.mysql.cj.mysqla.result.TextBufferRow.getValue(TextBufferRow.java:122)
at com.mysql.cj.jdbc.result.ResultSetImpl.getNonStringValueFromRow(ResultSetImpl.java:630)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDateOrTimestampValueFromRow(ResultSetImpl.java:643)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDate(ResultSetImpl.java:788)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:387)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:356)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:338)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNex
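One possible workaround, sketched below (untested, and borrowed from the accepted approach in the related question further down): instead of SELECT *, list the columns explicitly in the pushdown query and wrap the problematic DATE column in a CASE expression, so any value outside MySQL's documented valid range ('1000-01-01' to '9999-12-31') reaches Spark as NULL rather than an unparseable date. Here badDateColumn, col1 and col2 are placeholder names:
// Hypothetical sketch: the raw invalid DATE value is never sent through the JDBC driver,
// only the cleaned copy is; $partitionColumn and $table are the vals from the question.
val badDateColumn = "created_at" // placeholder name for the offending DATE column
val dbtable =
  s"""(SELECT col1, col2,
     |        CASE
     |          WHEN CAST($badDateColumn AS CHAR(10)) BETWEEN '1000-01-01' AND '9999-12-31'
     |            THEN $badDateColumn
     |          ELSE NULL
     |        END AS $badDateColumn,
     |        MOD($partitionColumn, 10) AS partition_key
     |   FROM $table) AS $table""".stripMargin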

Related

How can we read an invalid date column in Spark Scala from a MySQL server using a JDBC driver URL (connection)

I am getting an error while reading this column from a MySQL server:
id    date
1     0000-00-00
2     0000-00-01
In the above data set we can handle 0000-00-00 by using the additional MySQL server parameter
zeroDateTimeBehavior=convertToNull
but I don't know how to handle a date like 0000-00-01.
Please help me. The error message I got:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 11) (10.100.4.111 executor 1): java.sql.SQLException: YEAR
I am using this:
val a = "jdbc:mysql://<host_name>:3306/<database_name>?zeroDateTimeBehavior=convertToNull"
val mysqlServerDF = sparkSession.read.format("jdbc")
  .option("url", a)
  .option("query", sql)
  .option("user", jdbcUserName)
  .option("password", jdbcPassword)
  .load()
Here sql is a SQL query, for example "select * from table".
If fixing such dates in your database is not an option, I think your best bet would be to handle it directly in your SQL query. E.g. we can check against the valid DATE range, which is '1000-01-01' to '9999-12-31' according to the docs:
val sql = """
  select
    id,
    case
      when not cast(date as char(10)) between '1000-01-01' and '9999-12-31'
      then null
      else date
    end as date
  from table1"""
val mysqlServerDF = sparkSession.read.format("jdbc")
  .option("url", a)
  .option("query", sql)
  .option("user", jdbcUserName)
  .option("password", jdbcPassword)
  .load()

df.to_sql with AS400

I want to write a Pandas DataFrame to an IBM iSeries / AS400. I have already done a lot of research, but now I am stuck.
I have already run a lot of queries using pyodbc. For df.to_sql(), as I read in other posts, I should use SQLAlchemy with the ibm_db_sa dialect.
My current code is:
CONNECTION_STRING = (
    "driver={iSeries Access ODBC Driver};"
    "System=111.111.111.111;"
    "database=TESTDB;"
    "uid=USER;"
    "pwd=PASSW;"
)
quoted = urllib.parse.quote_plus(CONNECTION_STRING)
engine = create_engine('ibm_db_sa+pyodbc:///?odbc_connect={}'.format(quoted))
create_statement = df.to_sql("TABLETEST", engine, if_exists="append")
The following packages are installed:
python 3.9
ibm-db 3.1.3
ibm-db-sa 0.3.7
ibm-db-sa-py3 0.3.1.post1
pandas 1.3.5
pip 22.0.4
setuptools 57.0.0
SQLAlchemy 1.4.39
When I run it, I get the following error:
sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('42S02', '[42S02] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0204 - COLUMNS in SYSCAT type *FILE not found. (-204) (SQLPrepare)')
[SQL: SELECT "SYSCAT"."COLUMNS"."COLNAME", "SYSCAT"."COLUMNS"."TYPENAME", "SYSCAT"."COLUMNS"."DEFAULT", "SYSCAT"."COLUMNS"."NULLS", "SYSCAT"."COLUMNS"."LENGTH", "SYSCAT"."COLUMNS"."SCALE", "SYSCAT"."COLUMNS"."IDENTITY", "SYSCAT"."COLUMNS"."GENERATED"
FROM "SYSCAT"."COLUMNS"
WHERE "SYSCAT"."COLUMNS"."TABSCHEMA" = ? AND "SYSCAT"."COLUMNS"."TABNAME" = ? ORDER BY "SYSCAT"."COLUMNS"."COLNO"]
[parameters: ('USER', 'TABLETEST')]
(Background on this error at: https://sqlalche.me/e/14/f405)
I think the dialect could be wrong, because the parameters are the username and the table name for the ODBC connection?
AND: I am not really sure about the difference between ibm_db_sa and ibm_db.
I tried for a few days; anyone who is trying to do this via SQLAlchemy should do it via pyodbc instead.
Here is my working example, with the df_to_sql_bulk_insert function referring to this
(and I am now using my system DSN):
import re

import pandas as pd
import pyodbc

def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Build a single multi-row INSERT statement from a DataFrame."""
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    # Replace Python nan/None literals with SQL NULL in the rendered tuples.
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values}"

cnxn = pyodbc.connect("DSN=XXX")  # connect via the system DSN
cursor = cnxn.cursor()
sqlstr = df_to_sql_bulk_insert(df, "DBXXX.TBLXXX")
cursor.execute(sqlstr)
cnxn.commit()

Spark Structured streaming: primary key in JDBC sink

I am reading a stream of data from a Kafka topic using Structured Streaming with Update mode, and then doing some transformations.
Then I have created a JDBC sink to push the data to a MySQL sink in Append mode. The problem is: how do I tell my sink which column is the primary key, and have it update based on that key, so that my table does not end up with duplicate rows?
val df: DataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<List-here>")
  .option("subscribe", "emp-topic")
  .load()

import spark.implicits._

// value in kafka is bytes so cast it to String
val empList: Dataset[Employee] = df
  .selectExpr("CAST(value AS STRING)")
  .map(row => Employee(row.getString(0)))

// window aggregations on 1 min windows
val aggregatedDf = ......

// How to tell here that id is my primary key and do the update
// based on id column
aggregatedDf
  .writeStream
  .trigger(Trigger.ProcessingTime(60.seconds))
  .outputMode(OutputMode.Update)
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .select("id", "name", "salary", "dept")
      .write.format("jdbc")
      .option("url", "jdbc:mysql://localhost/empDb")
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .option("dbtable", "empDf")
      .option("user", "root")
      .option("password", "root")
      .mode(SaveMode.Append)
      .save()
  }
One way is to use ON DUPLICATE KEY UPDATE together with foreachPartition; that may serve this purpose.
Below is a pseudo-code snippet:
import java.sql.{Connection, DriverManager}
import org.apache.spark.sql.{DataFrame, Row}

/**
 * Insert into the database using foreachPartition.
 * @param dataframe : DataFrame
 * @param sqlDatabaseConnectionString
 * @param sqlTableName
 * @param numPartitions : number of simultaneous DB connections you are planning to allow
 */
def insertToTable(dataframe: DataFrame, sqlDatabaseConnectionString: String, sqlTableName: String, numPartitions: Int): Unit = {
  val repartitioned = dataframe.repartition(numPartitions)
  val tableHeader: String = repartitioned.columns.mkString(",")
  repartitioned.foreachPartition { partition: Iterator[Row] =>
    // Note: one connection per partition (an even better way is to use a connection pool)
    val sqlExecutorConnection: Connection = DriverManager.getConnection(sqlDatabaseConnectionString)
    // A batch size of 1000 is used since some databases cannot handle bigger batches, e.g. Azure SQL
    partition.grouped(1000).foreach { group =>
      val insertString = new scala.collection.mutable.StringBuilder()
      group.foreach { record =>
        insertString.append("('" + record.mkString("','") + "'),")
      }
      // yourcolumntoupdate is a placeholder: list the non-key columns to refresh on a duplicate key
      val sql =
        s"""|INSERT INTO $sqlTableName ($tableHeader) VALUES
            |${insertString.toString.stripSuffix(",")}
            |ON DUPLICATE KEY UPDATE yourcolumntoupdate = VALUES(yourcolumntoupdate)""".stripMargin
      sqlExecutorConnection.createStatement().executeUpdate(sql)
    }
    sqlExecutorConnection.close() // close the connection per partition
  }
}
You can use a PreparedStatement instead of a plain JDBC Statement.
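A minimal sketch of that PreparedStatement variant (untested; it reuses batchDF from the question's foreachBatch, assumes the id, name, salary and dept columns from the question, guesses their types, and assumes id is the table's primary key):
import java.sql.DriverManager
import org.apache.spark.sql.Row

// MySQL upsert: insert the row, or update the non-key columns if id already exists.
val upsertSql =
  """INSERT INTO empDf (id, name, salary, dept) VALUES (?, ?, ?, ?)
    |ON DUPLICATE KEY UPDATE name = VALUES(name), salary = VALUES(salary), dept = VALUES(dept)""".stripMargin

batchDF.foreachPartition { rows: Iterator[Row] =>
  val conn = DriverManager.getConnection("jdbc:mysql://localhost/empDb", "root", "root")
  val stmt = conn.prepareStatement(upsertSql)
  rows.foreach { row =>
    stmt.setInt(1, row.getAs[Int]("id"))           // id type assumed to be Int
    stmt.setString(2, row.getAs[String]("name"))
    stmt.setDouble(3, row.getAs[Double]("salary")) // salary type assumed to be Double
    stmt.setString(4, row.getAs[String]("dept"))
    stmt.addBatch()
  }
  stmt.executeBatch()
  stmt.close()
  conn.close()
}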
Further reading: SPARK SQL - update MySql table using DataFrames and JDBC

Problem reading BIT data type from MySQL and transforming it into Redshift with AWS Glue

I have a table in a MySQL database that contains a column called activity, which is of data type BIT. When converted to INT it can take on the values 1, 2 or 3.
When using the crawler in AWS Glue, it recognizes activity as BOOLEAN. I have tried to edit the schema for the table and change the data type of activity to INT, but Glue still reads it as BOOLEAN when running the job.
I have also tried to use ApplyMapping to convert it to INT, but with no success.
Any ideas on how to resolve this?
I solved it by pushing down a query to the MySQL database that CASTs the BIT as INT when reading it into Glue:
pushdown_query = "(SELECT col1, CAST(activity AS INT) AS activity FROM my_table) my_table"
df = glueContext.read.format("jdbc") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("url", db_url) \
    .option("dbtable", pushdown_query) \
    .option("user", db_user) \
    .option("password", db_pass) \
    .load()
You can instead use Spark/PySpark code to read the table from your MySQL database.
For example, using PySpark it would be as follows:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.mysql.jdbc.Driver"
}
employees_table = spark.read.jdbc(url=jdbcUrl, table="employees", properties=connectionProperties)
You can find more information at this link: Spark Data Sources
Hopefully, Spark will do a much better job than AWS Glue at inferring the schema.

Error running Hive query with JSON data?

I have data containing the following:
{"field1":{"data1": 1},"field2":100,"field3":"more data1","field4":123.001}
{"field1":{"data2": 1},"field2":200,"field3":"more data2","field4":123.002}
{"field1":{"data3": 1},"field2":300,"field3":"more data3","field4":123.003}
{"field1":{"data4": 1},"field2":400,"field3":"more data4","field4":123.004}
I uploaded it to S3 and converted it to a Hive table using the following from the Hive console:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
CREATE EXTERNAL TABLE impressions (json STRING ) ROW FORMAT DELIMITED LINES TERMINATED BY '\n' LOCATION 's3://my-bucket/';
The query:
SELECT * FROM impressions;
gives output fine, but as soon as I try to use the get_json_object UDF
and run the query:
SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
I get the following error:
> SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: cannot find dir = s3://my-bucket/snapshot.csv in pathToPartitionInfo: [s3://my-bucket/]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:291)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:258)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:108)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:423)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:479)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:261)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:218)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:567)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(cannot find dir = s3://my-bucket/snapshot.csv in pathToPartitionInfo: [s3://my-bucket/])'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
Does anyone know what is wrong?
Any reason why you are declaring:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
but not using the SerDe in your table definition? See the code snippet below for how to use it. I can't see any reason to use get_json_object here.
CREATE EXTERNAL TABLE impressions (
  field1 string, field2 string, field3 string, field4 string
)
ROW FORMAT
  SERDE 'com.amazon.elasticmapreduce.JsonSerde'
  WITH SERDEPROPERTIES ('paths'='field1, field2, field3, field4')
LOCATION 's3://mybucket/';