I have new shared MySQL tables being created every day, with names like "output_table_[date]".
I want to capture CDC from them and stream it to my Kafka cluster, so what can I use for that?
I explored the Debezium connector for MySQL, but couldn't find any configuration that takes the new tables created every day into account. E.g., while creating the connector we have to provide "table.whitelist" for the tables we would like to capture CDC for, but with new tables being created every day, we would have to update that manually every day.
So can anyone help me with possible solutions to my use case?
You can capture all tables by simply omitting table.include.list and related options (by the way, your use of table.whitelist indicates you're on an older version of Debezium). Note that for Debezium 1.6 we're also planning to do a PoC for supporting changes to the filter configuration.
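For illustration, here is a minimal sketch of such a connector registration with no table filter at all; the hostname, credentials, database name, and server name are placeholders, not anything from your actual setup:

```json
{
  "name": "mysql-output-tables",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.server.id": "184054",
    "database.server.name": "myserver",
    "database.include.list": "mydb",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.mydb"
  }
}
```

With no table.include.list (or table.whitelist on older versions), every table in mydb is captured, so each new output_table_[date] should be picked up as soon as its DDL and changes appear in the binlog.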
Related
My business has many tables in different MySQL instances, and now we want to build a search platform.
So we have to extract and join them into one wide (denormalized) table, insert it into Elasticsearch, and keep it up to date there.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?
have an application that converts the MySQL binlog into change messages and delivers them to Kafka
That's a start.
You would then need to do the same with your "wide table" data, and try to put that into a KTable, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work here as well.
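One way to read that suggestion, as a rough Kafka Streams sketch, is joining a change stream against a table built from another topic; the topic names, the String serdes, and the string-concatenation "join" are invented placeholders, and it assumes both topics are keyed by the same id:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class WideTableJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wide-table-builder");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Change events from two MySQL tables; topic names are made up.
        KStream<String, String> orders   = builder.stream("mysql.shop.orders");
        KTable<String, String> customers = builder.table("mysql.crm.customers");

        // Enrich each order change with the latest customer row (records must share the same key).
        KStream<String, String> wide = orders.join(customers,
                (order, customer) -> order + "," + customer);   // build the denormalized record

        // The joined records land in a new topic that the Elasticsearch sink can read.
        wide.to("wide_orders");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

ksqlDB can express the same stream-table join declaratively in SQL, which may be less code to maintain.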
After you have the joined stream+table written into a new Kafka topic, you can use the Kafka Connect Elasticsearch sink connector, or again try Flink, Logstash, etc., to ingest it into the search indices.
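To sketch that last hop, a Kafka Connect Elasticsearch sink configuration could look roughly like this, assuming the joined topic from the sketch above (wide_orders) and a placeholder Elasticsearch URL:

```json
{
  "name": "elastic-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "wide_orders",
    "connection.url": "http://elasticsearch:9200",
    "key.ignore": "false",
    "schema.ignore": "true"
  }
}
```

With key.ignore=false the record key becomes the document id, so repeated changes for the same key update the same document instead of creating duplicates.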
Once everything is running, any new inserts to either table will be reflected in Elasticsearch.
I have a system where there's a MySQL database to which changes are made. Then I have other machines that connect to this MySQL database every ten minutes or so and re-download the tables concerning them (for example, one machine might download tables A, B, C, while another machine might download tables A, D, E).
Without using Debezium or Kafka, is there a way to get all MySQL changes made after a certain timestamp, so that only those changes are sent to a machine requesting the updates, instead of the whole tables? ... For example, machine X might want all MySQL changes made since it last contacted the MySQL database, and then apply those changes to its own old data to update it.
Is there some way to do this?
MySQL can be set up to replicate databases, tables, etc. automatically. If the connection is lost, it will catch up when the connection is restored.
Take a look at this page, MySQL V5.5 Replication, or this one, MySQL V8.0 Replication.
You can use Debezium as a library embedded into your application if you don't want to, or cannot, deploy a Kafka cluster.
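A minimal sketch of that embedded approach using Debezium's DebeziumEngine API; the connection details, file paths, and handler body below are placeholders:

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EmbeddedCdc {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("name", "embedded-mysql");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        // Offsets and schema history go to local files instead of Kafka topics.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("offset.flush.interval.ms", "10000");
        props.setProperty("database.history", "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");
        // Placeholder connection details.
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "change-me");
        props.setProperty("database.server.id", "5400");
        props.setProperty("database.server.name", "embedded-mysql");

        // Every change event is handed to this callback instead of being written to Kafka.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> {
                    // e.g. push the change to whichever machine asked for updates
                    System.out.println(record.destination() + ": " + record.value());
                })
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);
        // ... on shutdown: engine.close(); executor.shutdown();
    }
}
```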
Alternatively, you could directly use the MySQL Binlog Connector (it's used by the Debezium connector underneath, too); it lets you read the binlog from given offset positions. But then you'd have to deal with many things yourself that are already handled by solutions such as Debezium, e.g. the correct handling of schema metadata and changes (e.g. additions of columns to existing tables). Usually this involves parsing the MySQL DDL, which by itself is quite complex.
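A sketch of that lower-level route with the mysql-binlog-connector-java library; the host, credentials, and starting binlog file/position are placeholders, and all of the schema-metadata handling mentioned above is left out:

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;

public class BinlogTail {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        BinaryLogClient client = new BinaryLogClient("localhost", 3306, "repl", "change-me");

        // Resume from an offset (binlog file + position) your application recorded earlier.
        client.setBinlogFilename("mysql-bin.000042");
        client.setBinlogPosition(4L);

        // Raw binlog events: TABLE_MAP, WRITE_ROWS, UPDATE_ROWS, DELETE_ROWS, QUERY (DDL), ...
        client.registerEventListener(event -> System.out.println(event));

        client.connect(); // blocks and streams events until disconnect()
    }
}
```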
Disclaimer: I'm the lead of Debezium
I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also just connect Kafka to the MySQL data source using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and analyze. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not to re-invent the wheel and to use Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might be doing the same in your producer; it's not clear from the question). This provides many advantages over polling (for contrast, a sketch of a typical polling producer follows the list below):
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
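For contrast, here is roughly what a query-based polling producer tends to look like (the connection details, table, columns, and topic are all invented); it illustrates the drawbacks above: it needs an updated_at column, rows changed twice between polls show up only once, and deleted rows never show up at all:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PollingClickProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/web", "user", "change-me")) {
            Timestamp lastPoll = Timestamp.from(Instant.EPOCH);
            while (true) {
                // Requires an updated_at column; intermediate updates and DELETEs are invisible.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, url, updated_at FROM clicks WHERE updated_at > ? ORDER BY updated_at")) {
                    ps.setTimestamp(1, lastPoll);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            producer.send(new ProducerRecord<>("clicks",
                                    rs.getString("id"), rs.getString("url")));
                            lastPoll = rs.getTimestamp("updated_at");
                        }
                    }
                }
                Thread.sleep(10_000); // the polling interval is the delay/CPU trade-off from the first bullet
            }
        }
    }
}
```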
I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, rather just a few columns and rows based on which have relevant data and what was last created/updated.
One option would be to use a PHP script that reads from one DB and inserts into another, but that would be very inefficient. Unlike SQL Server or Oracle, MySQL also does not have the ability to write queries across servers, else I'd have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift, which will become your data warehouse. This blog article by AWS provides an approach to do it.
For the consolidation operation, you can use AWS Database Migration Service (DMS), which allows you to migrate data column-wise with the following options (a table-mapping sketch follows the list below):
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
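As a rough sketch (the schema, table, column names, and date are all invented), a DMS table-mapping document that keeps only recently updated rows of one table and drops an unwanted column could look something like this:

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-recent-orders",
      "object-locator": { "schema-name": "appdb", "table-name": "orders" },
      "rule-action": "include",
      "filters": [
        {
          "filter-type": "source",
          "column-name": "updated_at",
          "filter-conditions": [ { "filter-operator": "gte", "value": "2020-01-01" } ]
        }
      ]
    },
    {
      "rule-type": "transformation",
      "rule-id": "2",
      "rule-name": "drop-internal-notes",
      "rule-target": "column",
      "object-locator": { "schema-name": "appdb", "table-name": "orders", "column-name": "internal_notes" },
      "rule-action": "remove-column"
    }
  ]
}
```

Selection rules with filters restrict the rows that get migrated, and transformation rules with remove-column drop the columns you don't need.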
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
Did you take a look at the RDS migration tool?
I have a task to implement a particular database structure:
Multiple MySQL servers with data using the same schema. Each server can see and edit only its particular part of the data.
And
One master server with its own data that can run queries using data from all of the previously mentioned servers, but cannot edit it.
An example would be multiple hospitals with the data of their patients, and a master server that can use the combined data from all hospitals.
The previous system was written using MySQL Cluster, so naturally I tried that first. I can create a MySQL Cluster with multiple nodes and maybe even partition the data so that a particular set of data lives on a particular node, but as far as I know I can't connect to a single node using MySQL, because it is already connected to the cluster.
Can it be done with MySQL Cluster? Is there another framework that can do this easily?
You could try http://galeracluster.com/. You can perform updates on all nodes and every server has all the data, which is not exactly the visibility split you described, but it might still meet your requirements.