How to insert data into an RDB (MySQL) using Spark SQL?

I'm trying to insert data into a MySQL table via Spark SQL.
Here is my table:
CREATE TABLE images (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  data LONGBLOB NOT NULL,
  PRIMARY KEY (id)
);
and my Spark code:
case class Image(name: String, data: Array[Byte])

def saveImage(image: Image): Unit = {
  sqlContext.sql(s"""INSERT INTO images (name, data) VALUES ('${image.name}', ${image.data});""".stripMargin)
}
But I get an error:
java.lang.RuntimeException: [1.13] failure: ``table'' expected but identifier images found
INSERT INTO images (name, data)
^
What is wrong with my code?

Finally, I found a solution. The trick to saving data into MySQL from Spark SQL is to create a new DataFrame and then persist it over JDBC. Here is an example:
def saveImage(image: Image): Unit = {
  val df = sqlContext.createDataFrame {
    sc.parallelize(
      Image(
        name = image.name,
        data = image.data
      ) :: Nil
    )
  }
  JdbcUtils.saveTable(df, url, "images", props)
}
And the model would be like this:
case class Image(
  id: Option[Int] = None,
  name: String,
  data: Array[Byte]
)
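Note that JdbcUtils.saveTable is an internal Spark helper and its signature can change between versions. A minimal sketch of the same idea using only the public DataFrameWriter API, assuming the same sqlContext, url and props (a java.util.Properties with the connection settings) as above:
def saveImage(image: Image): Unit = {
  // Build a one-row DataFrame from the case class, exactly as before.
  val df = sqlContext.createDataFrame(Seq(Image(name = image.name, data = image.data)))

  // Append the row to the existing MySQL table through the public JDBC writer.
  df.write
    .mode("append")
    .jdbc(url, "images", props)
}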


Validation error in a GET request (FastAPI + SQLAlchemy)

I have this route which, given the actual data in the DB table, is supposed to answer with 9 and 4 when I pass 2 as the parameter, but instead I get the error below.
@userRoutes.get(
    "/users/userMatch/{idusuariobuscar}",
    response_model=list[PartidosUser],
    tags=["users"],
)
def get_user_matches(idusuariobuscar: str):
    return conn.execute(
        partidosusuarios.select(partidosusuarios.c.idpartido).where(
            partidosusuarios.c.idusuario == idusuariobuscar
        )
    ).fetchall()
The route queries the partidosusuarios table. This is the response schema:
class PartidosUser(BaseModel):
    id: Optional[str]
    idUsuario: str
    idPartido: str

    class Config:
        orm_mode = True
And this is the model of the table:
partidosusuarios = Table(
    "partidosusuarios",
    meta,
    Column("idrelacion", Integer, primary_key=True),
    Column("idpartido", Integer),
    Column("idusuario", Integer),
)
And the error
raise ValidationError(errors, field.type_)
pydantic.error_wrappers.ValidationError: 4 validation errors for PartidosUser
response -> 0 -> idUsuario
field required (type=value_error.missing)
response -> 0 -> idPartido
field required (type=value_error.missing)
response -> 1 -> idUsuario
field required (type=value_error.missing)
response -> 1 -> idPartido
field required (type=value_error.missing)
You are only fetching the idpartido column from the database, hence the other fields don't exist. Changing your query from
partidosusuarios.select(partidosusuarios.c.idpartido).where(
    partidosusuarios.c.idusuario == idusuariobuscar
)
to
partidosusuarios.select().where(
    partidosusuarios.c.idusuario == idusuariobuscar
)
should solve your problem.
Also be aware that your Pydantic model contains an id field that your table model doesn't have, so it will always be None, and that the table's idrelacion column is not included in the response at all. These might be intentional choices, or a naming error.

Scala schema_of_json function fails in Spark Structured Streaming

I have created a function that reads JSON as a string and infers its schema, and I am using that function in Spark Structured Streaming. I am getting an error when doing so. The same code works when I create the schema first and then use it to read, but it doesn't work in a single line. How can I fix it?
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    /*
    var schema = schema_of_json(df
      .select($"value")
      .filter($"topic".contains(topic))
      .as[String]
    )
    */
    var jsonDataDf = df.filter($"topic".contains(topic))
      .withColumn("jsonData", from_json($"value", schema_of_json(lit($"value".as[String])), scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
Spark streaming code:
import org.apache.spark.sql.streaming.Trigger

val StreamingQuery = InputDf
  .select("*")
  .writeStream.outputMode("update")
  .option("queryName", "StreamingQuery")
  .foreachBatch(processBatch _)
  .start()
Error:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)
The error points to how from_json() is being given its schema: the schema argument must be a DDL-format string literal, or the output of the schema_of_json/schema_of_csv functions applied to a literal value, not schema_of_json applied to a non-literal column such as value.
Syntax: from_json(jsonStr, schema[, options]) - returns a struct value with the given jsonStr and schema.
Refer to the examples below:
> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
{"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{"time":2015-08-26 00:00:00}
Reference: https://docs.databricks.com/sql/language-manual/functions/from_json.html
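To make the constraint concrete, here is a small illustrative sketch (not from the original post) of the two forms the error message allows, using a hypothetical DataFrame df with a string value column and made-up field names:
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Option 1: the schema as a DDL-format string literal (field names here are made up).
val withDdlSchema = df.withColumn(
  "jsonData",
  from_json(col("value"), "payload STRUCT<after: STRUCT<id: BIGINT, name: STRING>>", Map.empty[String, String])
)

// Option 2: schema_of_json applied to a literal sample document, not to the value column itself.
val sampleJson = """{"payload":{"after":{"id":1,"name":"x"}}}"""
val withInferredSchema = df.withColumn(
  "jsonData",
  from_json(col("value"), schema_of_json(lit(sampleJson)))
)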
This is how I solved it. I created a filtered DataFrame from the Kafka output DataFrame and applied all the logic to it, as before. The problem with generating the schema while reading is that schema_of_json doesn't know which row of the DataFrame to use, so the schema has to be inferred from a single sample value first.
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF.where(col("topic") === topic)
    // Infer the schema from a single sample value of this topic's batch.
    var schema = schema_of_json(
      df.filter($"topic".contains(topic))
        .select($"value")
        .as[String]
        .first()
    )
    var jsonDataDf = df.withColumn("jsonData", from_json($"value", schema, scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}

JDBI select on varbinary and uuid

A legacy MySQL table has an id column that is non-human-readable raw VARBINARY (don't ask me why :P):
CREATE TABLE IF NOT EXISTS `tbl_portfolio` (
  `id` varbinary(16) NOT NULL,
  `name` varchar(128) NOT NULL,
  ...
  PRIMARY KEY (`id`)
);
and I need to select on it based on a java.util.UUID:
jdbiReader
    .withHandle<PortfolioData, JdbiException> { handle ->
        handle
            .createQuery(
                """
                SELECT *
                FROM tbl_portfolio
                WHERE id = :id
                """
            )
            .bind("id", uuid) // mapping this UUID onto the VARBINARY id column is the problem
            .mapTo(PortfolioData::class.java) // the row mapper does work
            .firstOrNull()
    }
Just in case anyone wants to see it, here's the row mapper (but again, the mapper is not the problem; binding the UUID to the VARBINARY id column is):
class PortfolioDataMapper : RowMapper<PortfolioData> {
    override fun map(
        rs: ResultSet,
        ctx: StatementContext
    ): PortfolioData = PortfolioData(
        fromBytes(rs.getBytes("id")),
        rs.getString("name"),
        rs.getString("portfolio_idempotent_key")
    )

    private fun fromBytes(bytes: ByteArray): UUID {
        val byteBuff = ByteBuffer.wrap(bytes)
        val first = byteBuff.long
        val second = byteBuff.long
        return UUID(first, second)
    }
}
I've tried all kinds of things to get the binding to work, but no success. Any advice much appreciated!
Finally got it to work, partly thanks to https://jdbi.org/#_argumentfactory, which actually covers UUID specifically but which I somehow missed despite looking at the JDBI docs for hours. Oh well.
The query can remain as it is:
jdbiReader
    .withHandle<PortfolioData, JdbiException> { handle ->
        handle
            .createQuery(
                """
                SELECT *
                FROM tbl_portfolio
                WHERE id = :id
                """
            )
            .bind("id", uuid)
            .mapTo(PortfolioData::class.java)
            .firstOrNull()
    }
But JDBI needs a UUID ArgumentFactory registered:
jdbi.registerArgument(UUIDArgumentFactory(VARBINARY))
where
class UUIDArgumentFactory(sqlType: Int) : AbstractArgumentFactory<UUID>(sqlType) {
    override fun build(
        value: UUID,
        config: ConfigRegistry?
    ): Argument {
        return UUIDArgument(value)
    }
}
where
class UUIDArgument(private val value: UUID) : Argument {
    companion object {
        private const val UUID_SIZE = 16
    }

    @Throws(SQLException::class)
    override fun apply(
        position: Int,
        statement: PreparedStatement,
        ctx: StatementContext
    ) {
        val bb = ByteBuffer.wrap(ByteArray(UUID_SIZE))
        bb.putLong(value.mostSignificantBits)
        bb.putLong(value.leastSignificantBits)
        statement.setBytes(position, bb.array())
    }
}
NOTE that registering an ArgumentFactory on the entire jdbi instance like this will make ALL UUID arguments passed to .bind map to bytes, which may not be what you want if other UUID arguments elsewhere in your code base are stored on the MySQL side as something other than VARBINARY. For example, another table might store JVM UUIDs as VARCHAR; in that case, rather than registering the UUID ArgumentFactory on the entire jdbi instance, you would only use it ad hoc on the individual queries where it is appropriate.

Reading a row with a NULL column causes an exception in slick

I have a table with a column of type date. This column accepts null values, so I declared it as an Option (see the perDate field below). When I run the select query through the application code I get the following exception:
slick.SlickException: Read NULL value (null) for ResultSet column
This is the Slick table definition:
import java.sql.Date
import java.time.LocalDate

class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {

  def sk = column[Int]("sk", O.PrimaryKey, O.AutoInc)
  def formula = column[Option[String]]("formula")
  def notes = column[Option[String]]("notes")
  def periodicity = column[Int]("periodicity")
  def perDate = column[Option[LocalDate]]("per_date")(localDateColumnType)

  def * =
    (sk, name, descrip, formula, notes, periodicity, perDate) <>
      ((Formula.apply _).tupled, Formula.unapply)

  implicit val localDateColumnType = MappedColumnType.base[Option[LocalDate], Date](
    {
      case Some(localDate) => Date.valueOf(localDate)
      case None => null
    },
    { sqlDate =>
      if (sqlDate != null) Some(sqlDate.toLocalDate) else None
    }
  )
}
Your mapped column function just needs to provide the LocalDate to Date conversion. Slick will automatically handle Option[LocalDate] if it knows how to handle LocalDate.
That means changing your localDateColumnType to be:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
Date.valueOf(_), _.toLocalDate
)
Chapter 5 of Essential Slick covers some of this, as does the section on User Defined Features in the Manual.
I'm not 100% sure why you're seeing the run-time error: my guess is that the column is being treated as an Option[Option[LocalDate]] or similar, and there's a level of null in there that's being missed.
BTW, your def * can probably be:
def * = (sk, name, descrip, formula, notes, periodicity, perDate).mapTo[Formula]
...which is a little nicer to read. The mapTo was added in Slick 3 at some point.
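Putting the answer together, a minimal sketch of a corrected table definition might look like the following (assuming the MySQL profile; the name and descrip columns and their types are assumptions here, since they are referenced in * but not shown in the original definition):
import java.sql.Date
import java.time.LocalDate
import slick.jdbc.MySQLProfile.api._

class FormulaDB(tag: Tag) extends Table[Formula](tag, "formulas") {

  // Plain LocalDate <-> java.sql.Date mapping; Slick lifts it to Option[LocalDate] by itself.
  implicit val localDateColumnType: BaseColumnType[LocalDate] =
    MappedColumnType.base[LocalDate, Date](Date.valueOf(_), _.toLocalDate)

  def sk          = column[Int]("sk", O.PrimaryKey, O.AutoInc)
  def name        = column[String]("name")            // assumed
  def descrip     = column[Option[String]]("descrip") // assumed
  def formula     = column[Option[String]]("formula")
  def notes       = column[Option[String]]("notes")
  def periodicity = column[Int]("periodicity")
  def perDate     = column[Option[LocalDate]]("per_date") // no explicit mapper argument needed

  def * = (sk, name, descrip, formula, notes, periodicity, perDate).mapTo[Formula]
}
Recent Slick releases also ship java.time support out of the box, in which case the custom mapping may not be needed at all.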

How to properly use Scala Play Anorm and Option[String] to insert NULL SQL

What is the proper way to insert an Option[String] when it is None? The code below inserts an empty string, which is not the same as NULL in MySQL.
Is the only way to build the SQL string beforehand based on the content of partnerCode? Sigh... Anorm...
DB.withConnection { implicit connection =>
  val id: Option[Long] = SQL(
    """
      INSERT INTO users (email, partner_code, is_active, created, pass)
      VALUES ({email}, {partnerCode}, 0, NOW(), {pass})
    """
  ).on(
    'email -> user.email,
    'partnerCode -> user.partnerCode.getOrElse(""), // FIXME: how to use NULL keyword instead of empty string? Is Anorm just this dumb?
    'pass -> hashPassword(user.password.get)
  ).executeInsert()
  id.get
}
None should actually work fine if you import anorm._ or anorm.toParameterValue:
'partnerCode -> user.partnerCode
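For context, a minimal sketch of how the full call from the question might then look (assuming anorm._ is imported, which brings in the implicit conversions that bind None as a SQL NULL):
import anorm._

DB.withConnection { implicit connection =>
  val id: Option[Long] = SQL(
    """
      INSERT INTO users (email, partner_code, is_active, created, pass)
      VALUES ({email}, {partnerCode}, 0, NOW(), {pass})
    """
  ).on(
    'email -> user.email,
    'partnerCode -> user.partnerCode, // None is bound as NULL, Some(x) as x
    'pass -> hashPassword(user.password.get)
  ).executeInsert()
  id.get
}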