Filtering rows of database - mysql

I would like to know whether the Kafka platform is suitable for the following job.
I'm trying to ingest a full database with multiple tables. Once the data is in Kafka, I would like to filter the rows of each table based on a condition.
I think that is an easy job with Kafka Streams, but what happens to messages that are rejected by the filter?
A condition could be met in the future, for example if it is based on a date. Is there a chance that a rejected message will be evaluated again, eventually pass the filter, and be processed further?
Is it better to filter the rows before feeding them to Kafka?
Thank you.

You might want to consider a database connector such as Debezium or the Confluent JDBC Source Connector, both of which are based on Kafka Connect.
For more on the Debezium connector for MySQL, see http://debezium.io/docs/connectors/mysql
For more on the Confluent JDBC connector, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html
With connectors based on Kafka Connect, you can filter rows before they are published to Kafka by using the Single Message Transform (SMT) feature of Kafka Connect.
For a discussion of row filtering with Kafka Connect, see "Kafka connect (Single message transform) row filtering".
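On the question of rejected messages: a filter is evaluated once per record it receives. A record that an SMT drops never reaches Kafka, and a record that a Kafka Streams filter drops is simply not forwarded downstream, so it is only checked again if it is read again. The sketch below illustrates this with plain Python and no Kafka at all; the `release_date` field and both helper functions are hypothetical names chosen for the example.

```python
from datetime import date

def passes_filter(row, today):
    """Hypothetical date-based condition: keep a row once its release_date is reached."""
    return row["release_date"] <= today

def filter_rows(rows, today):
    """Return the ids of the rows that pass the filter on the given day."""
    return [row["id"] for row in rows if passes_filter(row, today)]

rows = [
    {"id": 1, "release_date": date(2024, 1, 10)},
    {"id": 2, "release_date": date(2024, 3, 1)},
]

# First evaluation: row 2 is rejected because its date is still in the future.
first_pass = filter_rows(rows, date(2024, 2, 1))

# Row 2 only gets a second chance if the same rows are read and filtered again.
second_pass = filter_rows(rows, date(2024, 3, 15))

print(first_pass, second_pass)
```

If the condition can become true later, the rejected rows therefore have to stay available somewhere (the source table or an unfiltered topic) so that they can be re-read.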

Related

Debezium transforms 1GB database to 100GBs topic storage

I have Debezium running in a container, capturing all changes to the records of a PostgreSQL database.
The PostgreSQL database is around 1GB and has about 1,000 tables. Debezium is configured to capture changes to all tables, and its storage is around 100GB after the initial load.
I understand that conversion to JSON adds overhead, but the difference is many times bigger than that would explain.
Is there anything that can be configured to reduce Kafka topic storage?
You can consider a single message transformation (SMT) to reduce the size of topic messages. Just add the SMT configuration to your connector's configuration:
transforms=unwrap,...
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
See the documentation:
A Debezium data change event has a complex structure that provides a
wealth of information. Kafka records that convey Debezium change
events contain all of this information. However, parts of a Kafka
ecosystem might expect Kafka records that provide a flat structure of
field names and values. To provide this kind of record, Debezium
provides the event flattening single message transformation (SMT).
Configure this transformation when consumers need Kafka records that
have a format that is simpler than Kafka records that contain Debezium
change events.
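To get a feel for why flattening shrinks messages, here is a minimal Python sketch. The event below only imitates the general shape of a Debezium change event (`before`, `after`, `source`, `op`, `ts_ms`); the field values are invented, and `flatten` is a hypothetical helper that loosely mimics the effect of the transformation, not its implementation.

```python
import json

# Invented example event shaped like a Debezium change event envelope.
change_event = {
    "before": {"id": 42, "name": "old name", "email": "old@example.com"},
    "after": {"id": 42, "name": "new name", "email": "new@example.com"},
    "source": {
        "connector": "postgresql",
        "db": "inventory",
        "schema": "public",
        "table": "customers",
        "lsn": 123456,
    },
    "op": "u",
    "ts_ms": 1700000000000,
}

def flatten(event):
    """Keep only the row state after the change (hypothetical helper)."""
    return event["after"]

full_size = len(json.dumps(change_event))
flat_size = len(json.dumps(flatten(change_event)))

print(f"envelope: {full_size} bytes, flattened: {flat_size} bytes")
```

The saving comes at the price of dropping the old row state and the source metadata, so it only suits consumers that do not need them.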
Kafka also supports compression at the topic level, so you can set the default topic compression in the connector's configuration, as part of the default topic creation group.
See the documentation:
topic.creation.default.compression.type is mapped to the
compression.type property of the topic level configuration parameters
and defines how messages are compressed on hard disk.
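As a rough illustration of how well repetitive JSON compresses, the sketch below gzips a batch of similar messages using only the Python standard library. The message contents are invented, and real ratios depend on the data and on the codec chosen for `compression.type`.

```python
import gzip
import json

# Invented batch of similar JSON messages, as change events for one table often are.
messages = [
    json.dumps({"id": i, "status": "active", "country": "DE", "plan": "standard"}).encode("utf-8")
    for i in range(1000)
]

raw = b"\n".join(messages)
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Repeated field names and values are exactly what a general-purpose compressor removes, which is why JSON change events tend to benefit.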

Merge multi data sources to sink and keep up to date

My business has many tables in different MySQL instances, and we now want to build a search platform.
So we have to extract them, join them into a wide table, insert the result into Elasticsearch, and keep it up to date there.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?
"we have an application that converts the MySQL binlog into change messages and delivers them to Kafka"
That's a start.
You would then need to do the same for every table that feeds your wide table, and try to put that data into a Kafka KTable, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work here as well.
After you have the joined stream+table written to a new Kafka topic, you can use the Kafka Connect Elastic connector, or again Flink, Logstash, etc., to ingest it into the search indices.
Once everything is running, any new inserts to either table will be reflected in Elastic.
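The join logic can be sketched without any streaming framework. The Python below keeps the latest row per key for two hypothetical tables, `users` and `orders`, and rebuilds the joined document whenever either side changes. It is a toy model of the idea, not Kafka Streams or ksqlDB code, and the `docs` dict merely stands in for the search index.

```python
def join_changes(events):
    """Apply (table, key, row) change events in order and return the joined documents."""
    users = {}   # latest row per user id
    orders = {}  # latest row per order id
    docs = {}    # stand-in for the search index, keyed by order id

    for table, key, row in events:
        if table == "users":
            users[key] = row
            # A user change affects every order that references this user.
            affected = [oid for oid, order in orders.items() if order["user_id"] == key]
        else:
            orders[key] = row
            affected = [key]

        for oid in affected:
            user = users.get(orders[oid]["user_id"])
            if user is not None:  # inner join: skip orders whose user is unknown
                docs[oid] = {**orders[oid], "user_name": user["name"]}
    return docs

events = [
    ("users", 1, {"name": "Ada"}),
    ("orders", 100, {"user_id": 1, "item": "book"}),
    ("users", 1, {"name": "Ada L."}),                  # update on the user side
    ("orders", 101, {"user_id": 2, "item": "pen"}),  # no matching user yet
]

index = join_changes(events)
print(index)
```

The later change to the user is reflected in the already indexed order, which is the "keep up to date" behaviour the question asks for.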

Flume Spark Streaming into Mysql Database

I have a single-node Spark setup and I want to stream data into MySQL using Apache Flume -> Spark -> MySQL.
I used the foreachPartition method, but data insertion is slow and a queue builds up in the Spark UI.
Is there any way to improve the insertion speed? I need to process 3000 rows per second.
Please provide the configuration details for your query.
Are you using bulk import for MySQL?
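If the rows are currently inserted and committed one at a time inside foreachPartition, batching is usually the first thing to try. The sketch below shows the pattern in Python, with the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere; the `events` table, the `insert_partition` helper, and the batch size of 500 are all assumptions made for illustration.

```python
import sqlite3

INSERT_SQL = "INSERT INTO events (id, payload) VALUES (?, ?)"

def insert_partition(conn, rows, batch_size=500):
    """Insert an iterable of (id, payload) rows in batches, with one commit per batch."""
    batch = []
    inserted = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(INSERT_SQL, batch)
            conn.commit()
            inserted += len(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        conn.executemany(INSERT_SQL, batch)
        conn.commit()
        inserted += len(batch)
    return inserted

# In-memory SQLite database used only as a stand-in for the MySQL target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

count = insert_partition(conn, ((i, f"row-{i}") for i in range(3000)))
print(count)
```

In a Spark job, one connection would be opened per partition and a function like this called with the partition's iterator, rather than opening a connection per row.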

How to stream sharded MySQL Tables CDC to Kafka Cluster

I have new sharded MySQL tables being created every day, with names of the form "output_table_[date]".
I want to capture CDC from them and stream it to my Kafka cluster, so what can I use for that?
I explored the Debezium connector for MySQL, but couldn't find any configuration that takes the newly created tables into account. E.g. when creating the connector we have to provide "table.whitelist" for the tables we would like to capture CDC for, but since new tables are created every day, we would have to update it manually every day.
Can anyone help me with possible solutions for my use case?
You can capture all tables by simply omitting table.include.list and related options (btw. your usage of table.whitelist indicates you're using an older version of Debezium). Note for Debezium 1.6 we're also planning to do a PoC for supporting changes to the filter configuration.
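If capturing every table is too broad, note that table.include.list takes regular expressions matched against the fully-qualified table name, so a single pattern can cover the daily tables. The Python sketch below only illustrates that matching idea; the database name `mydb`, the eight-digit date suffix, and the `is_captured` helper are assumptions, and Debezium itself uses Java regular expressions.

```python
import re

# Hypothetical include list: one pattern covering all daily tables in database "mydb".
include_list = r"mydb\.output_table_\d{8}"

def is_captured(table_id, include_list):
    """Return True if the fully-qualified table name matches any pattern in the list."""
    patterns = [p.strip() for p in include_list.split(",")]
    return any(re.fullmatch(p, table_id) for p in patterns)

for table_id in ["mydb.output_table_20240101", "mydb.output_table_20240102", "mydb.users"]:
    print(table_id, is_captured(table_id, include_list))
```

A table created tomorrow with the same naming scheme would match the same pattern without any change to the list.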

Kafka producer vs Kafka connect to read MySQL Datasource

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also connect Kafka to the MySQL data source using Kafka Connect or Debezium. My goal is to ingest the data with Kafka and send it to Storm for consumption and analysis. It looks like both ways can achieve this, but using a Kafka producer may require me to build a Kafka service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not to re-invent the wheel and to use Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, can do initial snapshotting etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer, it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
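The difference can be made concrete with a small simulation. In the Python sketch below, `change_log` is an invented sequence of changes to one table, and `poll` imitates a poller that asks for rows whose `updated_at` is newer than the previous poll. Both the data and the helper are hypothetical; the point is only what each approach gets to see.

```python
# Invented change history: (timestamp, operation, row id, value)
change_log = [
    (1, "insert", 1, "a"),
    (2, "update", 1, "b"),
    (3, "update", 1, "c"),
    (6, "insert", 2, "x"),
    (7, "delete", 2, None),
]

def poll(change_log, poll_times):
    """Simulate timestamp-column polling: report rows changed since the previous poll."""
    seen = []
    last_poll = 0
    for now in poll_times:
        # Rebuild the table as it looks at the moment of the poll.
        table = {}
        for ts, op, row_id, value in change_log:
            if ts > now:
                break
            if op == "delete":
                table.pop(row_id, None)
            else:
                table[row_id] = (value, ts)
        # The poller only sees rows that still exist and changed since the last poll.
        for row_id, (value, updated_at) in sorted(table.items()):
            if last_poll < updated_at <= now:
                seen.append((row_id, value))
        last_poll = now
    return seen

polled = poll(change_log, [5, 10])
log_based = [(op, row_id, value) for _, op, row_id, value in change_log]

print("polling saw:", polled)
print("log reader saw:", log_based)
```

With polls at times 5 and 10, the poller reports only the final state of row 1: the intermediate values "a" and "b" and the short-lived row 2, including its DELETE, never show up, while a log reader sees all five changes.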