Configure a Debezium connector for multiple tables in a MySQL database

I'm trying to configure a Debezium connector for multiple tables in a MySQL database (I'm using Debezium 1.4 on MySQL 8.0).
My company has a naming convention to follow when creating topics in Kafka, and this convention does not allow the use of underscores (_), so I had to replace them with hyphens (-).
So, my topic names are:
Topic 1
fjf.db.top-domain.domain.sub-domain.transaction-search.order-status
WHERE
- transaction-search = schema "transaction_search"
- order-status = table "order_status"
- All changes in that table must go to that topic.
Topic 2
fjf.db.top-domain.domain.sub-domain.transaction-search.shipping-tracking
WHERE
- transaction-search = schema "transaction_search"
- shipping-tracking = table "shipping_tracking"
- All changes in that table must go to that topic.
Topic 3
fjf.db.top-domain.domain.sub-domain.transaction-search.proposal
WHERE
- transaction-search = schema "transaction_search"
- proposal = table "proposal"
- All changes in that table must go to that topic.
I'm trying to use the "ByLogicalTableRouter" transform, but I can't find a regex solution that solves my case.
{ "name": "debezium.connector",
"config":
{
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "myhostname",
"database.port": "3306",
"database.user": "debezium",
"database.password": "password",
"database.server.id": "1000",
"database.server.name": "fjf.db.top-domain.domain.sub-domain.transaction-search",
"schema.include.list": "transaction_search",
"table.include.list": "transaction_search.order_status,transaction_search.shipping_tracking,transaction_search.proposal",
"database.history.kafka.bootstrap.servers": "kafka.intranet:9097",
"database.history.kafka.topic": "fjf.db.top-domain.domain.sub-domain.transaction-search.schema-history",
"snapshot.mode": "schema_only",
"transforms":"RerouteName,RerouteUnderscore",
"transforms.RerouteName.type":"io.debezium.transforms.ByLogicalTableRouter",
"transforms.RerouteName.topic.regex":"(.*)transaction_search(.*)",
"transforms.RerouteName.topic.replacement": "$1$2"
"transforms.RerouteUnderscore.type":"io.debezium.transforms.ByLogicalTableRouter",
"transforms.RerouteUnderscore.topic.regex":"(.*)_(.*)",
"transforms.RerouteUnderscore.topic.replacement": "$1-$2"
}
}
In the first transform, I'm trying to remove the duplicated schema name from the routed topic.
In the second transform, I'm trying to replace all remaining underscores (_) with hyphens (-).
But with that, I'm getting the error below, which indicates that it is trying to send everything to the same topic:
Caused by: org.apache.kafka.connect.errors.SchemaBuilderException: Cannot create field because of field name duplication __dbz__physicalTableIdentifier
How can I make a transform that will forward the events of each table to its respective topic?

Removing the schema name
In the first transform, I'm trying to remove the duplicated schema name from the routed topic.
After the transformation with your regex you'll end up with two consecutive dots, so you need to fix it:
"transforms.RerouteName.topic.regex":"([^.]+)\\.transaction_search\\.([^.]+)",
"transforms.RerouteName.topic.replacement": "$1.$2"
Replacing underscores with hyphens
You can try the ChangeCase SMT from Kafka Connect Common Transformations.
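For reference, a rough sketch of how the two routers might then be wired together in the connector config, based on the fix above. This is a sketch, not a verified config: the first capture group is widened to (.*) so it can span the dotted server name, and key.enforce.uniqueness is assumed safe to disable on both routers because each physical table keeps its own topic here, which should also avoid the duplicated __dbz__physicalTableIdentifier key field reported in the question when two ByLogicalTableRouter transforms are chained:
"transforms": "RerouteName,RerouteUnderscore",
"transforms.RerouteName.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.RerouteName.topic.regex": "(.*)\\.transaction_search\\.([^.]+)",
"transforms.RerouteName.topic.replacement": "$1.$2",
"transforms.RerouteName.key.enforce.uniqueness": "false",
"transforms.RerouteUnderscore.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.RerouteUnderscore.topic.regex": "(.*)_(.*)",
"transforms.RerouteUnderscore.topic.replacement": "$1-$2",
"transforms.RerouteUnderscore.key.enforce.uniqueness": "false"
With that, a change in transaction_search.order_status should come out on fjf.db.top-domain.domain.sub-domain.transaction-search.order-status.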

Related

How to use postgres JDBC sink connector to stream to mysql connector

Happy Thursday! I've been experimenting with creating a Postgres connector in Debezium, but I can only capture the changes if the table already exists in my MySQL instance, which isn't ideal. Otherwise I would have to write a Python script to handle such events, and it may be easier to use something that already exists than to reinvent the wheel. I want to be able to capture the DDL in the actual connector.
I came across this blog post, https://debezium.io/blog/2017/09/25/streaming-to-another-database/, and I got it working on my local set-up, which is great, but the only issue is that I want to go in the opposite direction. (I am able to capture new records, deleted records, and updated records, and it creates the new tables and new columns as well if they don't exist.) I want to stream from Postgres and have the connector insert into a target MySQL database.
I tried switching the JDBC source and sink connectors respectively, but I wasn't getting the new records inserted from Postgres into MySQL. It seems like I can find people inserting into Postgres from MySQL all over the place, but not the other direction. Here is the GitHub directory I based my set-up on to get the mysql-kafka-postgres flow working: https://github.com/debezium/debezium-examples/tree/main/unwrap-smt
I tried to go a different way, but it seems like it's killing my Docker image as it boots up, saying "Couldn't resolve server kafka:9092 from bootstrap.servers as DNS resolution failed for kafka [org.apache.kafka.clients.ClientUtils]". Here are my sink and source JSON configs.
{
  "name": "jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "postgres.public.customers",
    "connection.url": "jdbc:mysql://mysql:3306/inventory",
    "connection.user": "debezium",
    "connection.password": "dbz",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "auto.create": "true",
    "auto.evolve": "true",
    "insert.mode": "upsert",
    "delete.enabled": "true",
    "pk.fields": "id",
    "pk.mode": "record_key"
  }
}
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "topic.prefix": "psql",
    "mode": "bulk",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgresuser",
    "database.password": "postgrespw",
    "database.dbname": "inventory",
    "table.include.list": "public.customers",
    "slot.name": "test_slot",
    "plugin.name": "wal2json",
    "database.server.name": "psql",
    "tombstones.on.delete": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "https://[APACHE_KAFKA_HOST]:[SCHEMA_REGISTRY_PORT]",
    "key.converter.basic.auth.credentials.source": "USER_INFO",
    "key.converter.schema.registry.basic.auth.user.info": "[SCHEMA_REGISTRY_USER]:[SCHEMA_REGISTRY_PASSWORD]",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "https://[APACHE_KAFKA_HOST]:[SCHEMA_REGISTRY_PORT]",
    "value.converter.basic.auth.credentials.source": "USER_INFO",
    "value.converter.schema.registry.basic.auth.user.info": "[SCHEMA_REGISTRY_USER]:[SCHEMA_REGISTRY_PASSWORD]"
  }
}
Everything else remains the same as in the blog post I followed. Any help is welcome.
I believe there are two different questions here:
1. How to handle non-existing columns in MySQL: the JDBC sink connector has a flag called auto.create that, if set to true, allows the connector to create tables if they don't exist (auto.evolve also allows table evolution).
2. PG -> Kafka -> MySQL is possible; you can find an example of it that I wrote some time ago here. The example uses Aiven for PostgreSQL and Aiven for Apache Kafka, but you should be able to adapt the connectors to work with any kind of PG and Kafka.
It would be interesting to know where your PG -> Kafka -> MySQL pipeline stops working.
Disclaimer: I work for Aiven
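To make point 1 concrete, those settings live in the sink connector's config block; the jdbc-sink config in the question already sets them, roughly like this fragment:
"auto.create": "true",
"auto.evolve": "true",
"insert.mode": "upsert",
"pk.mode": "record_key",
"pk.fields": "id"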

Exporting MySql table into JSON Object

How can I convert my MySQL table result into a JSON object at the database level?
For example:
SELECT json_array(
group_concat(json_object(name, email)))
FROM ....
it will produce the result as:
[
  {
    "name": "something",
    "email": "someone@somewhere.net"
  },
  {
    "name": "someone",
    "email": "something@someplace.com"
  }
]
But what I need is to be able to give my own query, which may contain functions, subqueries, etc.
Like in Postgres: select row_to_json(select name, email, getcode(branch) from .....), where I get the whole result as a JSON object.
Is there any possibility to do something like this in MySQL?
select jsonArray(select name,email,getcode(branch) from .....)
I only found in the official MySQL 8 and 5.7 documentation that it supports casting to the JSON type. There is a JSON_ARRAY function in MySQL 8 and 5.7, and a JSON_ARRAYAGG function in MySQL 8. Please see the full JSON functions reference here.
This means that there is no easy MySQL built-in solution to the problem.
Fortunately, our colleagues started a similar discussion here. Maybe you can find your solution there.
For anyone searching for JSON casting of well-defined attributes, the solution is here.
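For MySQL 8 specifically, the JSON_ARRAYAGG and JSON_OBJECT functions mentioned above can be combined over an arbitrary subquery. A rough sketch (the table and column names are placeholders, not from the original schema):
SELECT JSON_ARRAYAGG(JSON_OBJECT('name', t.name, 'email', t.email)) AS result
FROM (SELECT name, email FROM customers) AS t;
The inner SELECT can contain functions, joins and WHERE clauses, and the outer aggregate folds every returned row into a single JSON array.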

Create dynamic frame from options (from rds - mysql) providing a custom query with where clause

I want to create a DynamicFrame in my Glue job from an Aurora RDS MySQL table. Can I create a DynamicFrame from my RDS table using a custom query that has a where clause?
I don't want to read the entire table into my DynamicFrame every time and then filter later.
I looked at this page but didn't find any option there or elsewhere: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html
Construct JDBC connection options
connection_mysql5_options = {
    "url": "jdbc:mysql://:3306/db",
    "dbtable": "test",
    "user": "admin",
    "password": "pwd"
}
Read DynamicFrame from MySQL 5
df_mysql5 = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql5_options
)
Is there any way to give a where clause and select only the top 100 rows from the test table? Say it has a column named "id" and I want to fetch using this query:
select * from test where id<100;
Appreciate any help. Thank you!
The way I was able to provide a custom query was by creating a Spark DataFrame and specifying it with options:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options
Then transform that DataFrame into a DynamicFrame using said class:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
tmp_data_frame = spark.read.format("jbdc")
.option("url", jdbc_url)
.option("user", username)
.option("password", password)
.option("query", "select * from test where id<100")
.load()
dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glueContext)
Apologies, I would have made a comment but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it did require two changes:
The "jdbc" format was unrecognized (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html") -- I had to use the full name: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
The query option was not working for me (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"), but fortunately, the "dbtable" option supports passing in either a table or a subquery -- that is using parentheses around a query.
In my solution below I have also added a bit of context around the needed objects and imports.
My solution ended up looking like:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glue_context = GlueContext(SparkContext.getOrCreate())
tmp_data_frame = glue_context.spark_session.read \
    .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider") \
    .option("url", jdbc_url) \
    .option("user", username) \
    .option("password", password) \
    .option("dbtable", "(select * from test where id<100)") \
    .load()
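If a DynamicFrame is needed afterwards, the resulting DataFrame can be converted just as in the first answer; the frame name string here is arbitrary:
from awsglue.dynamicframe import DynamicFrame

dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glue_context, "filtered_test")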

Issue using incrementing ingest in the JDBC connector

I'm trying to use incrementing ingest to produce a message to a topic on update of a table in MySQL. It works using timestamp mode but doesn't seem to work using incrementing column mode. When I insert a new row into the table, I do not see any message published to the topic.
{
  "_comment": " --- JDBC-specific configuration below here --- ",
  "_comment": "JDBC connection URL. This will vary by RDBMS. Consult your manufacturer's handbook for more information",
  "connection.url": "jdbc:mysql://localhost:3306/lte?user=root&password=tiger",
  "_comment": "Which table(s) to include",
  "table.whitelist": "candidate_score",
  "_comment": "Pull all rows based on a timestamp column. You can also do bulk or incrementing column-based extracts. For more information, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_config_options.html#mode",
  "mode": "incrementing",
  "_comment": "Which column has the timestamp value to use?",
  "incrementing.column.name": "attempt_id",
  "_comment": "If the column is not defined as NOT NULL, tell the connector to ignore this",
  "validate.non.null": "true",
  "_comment": "The Kafka topic will be made up of this prefix, plus the table name",
  "topic.prefix": "mysql-"
}
attempt_id is an auto-incrementing, non-null column which is also the primary key.
Actually, it's my fault. I was listening to the wrong topic.
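For anyone hitting the same symptom: with the config above, the records land on the topic made of topic.prefix plus the table name, i.e. mysql-candidate_score, which can be verified with the console consumer (the bootstrap server address below is only an example):
kafka-console-consumer --bootstrap-server localhost:9092 --topic mysql-candidate_score --from-beginning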

MaxScale JSON cache rules

I would like to know how to correctly format JSON cache rules for MaxScale. I need to store multiple tables from multiple databases for multiple users; how do I format this correctly?
Here is how I can store one table from one database and use it for one user:
{
    "store": [
        {
            "attribute": "table",
            "op": "=",
            "value": "databse_alpha.xhtml_content"
        }
    ],
    "use": [
        {
            "attribute": "user",
            "op": "=",
            "value": "'user_databse_1'@'%'"
        }
    ]
}
I need to create rules to store multiple databases for multiple users, e.g. table1 and table2 being accessed by user1, table3 and table4 being accessed by user2... and so on.
Thanks.
In MaxScale 2.1 it is only possible to give a single pair of store/use values for the cache filter rules.
I took the liberty of opening a feature request for MaxScale on the MariaDB Jira as it appears this functionality was not yet requested.
I think that as workaround you should be able to create two cache filters, with a different set of rules, and then in your service use
filters = cache1 | cache2
Note that picking out the table using an exact match, as in your sample above, implies that the statement has to be parsed, which carries a significant performance cost. You'll get much better performance if no matching is needed or if it is performed using regular expressions.
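A rough sketch of that workaround, with made-up file paths and section names, and using a regex match as suggested above: one rules file per user, e.g. /etc/maxscale.d/cache_rules_user1.json:
{
    "store": [
        {
            "attribute": "table",
            "op": "like",
            "value": "databse_alpha\\.(table1|table2)"
        }
    ],
    "use": [
        {
            "attribute": "user",
            "op": "=",
            "value": "'user1'@'%'"
        }
    ]
}
plus a second file for user2, each referenced by its own cache filter in maxscale.cnf and chained in the service:
[cache1]
type=filter
module=cache
rules=/etc/maxscale.d/cache_rules_user1.json

[cache2]
type=filter
module=cache
rules=/etc/maxscale.d/cache_rules_user2.json

[My-Service]
type=service
# existing router, servers, user and password settings stay as they are
filters=cache1 | cache2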