Merge multiple data sources into a sink and keep it up to date - MySQL

My business has many tables in different MySQL instances, and now we want to build a search platform.
So we have to extract and join them into a wide-column table, insert it into Elasticsearch, and keep it up to date in ES.
In addition, we have an application that converts the MySQL binlog into change messages and delivers them to Kafka.
Is there a suitable solution? Can Flink or Materialize help me?

"We have an application that converts the MySQL binlog into change messages and delivers them to Kafka"
That's a start.
You would then need to do the same with the other tables that make up your "wide-column table", and try to put those into Kafka KTables, which you can then join using Kafka Streams or ksqlDB, as one example. Flink may work as well here.
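For illustration, here is a minimal Kafka Streams sketch of such a stream-table join. The topic names, keys, and plain-string values are hypothetical; with Debezium-style topics you would also unwrap the change-event envelope and re-key so both sides share the join key, and ksqlDB could express the same join as a single SQL statement.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class WideRowJoin {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wide-row-builder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Latest state of one table, materialized as a KTable (hypothetical topic name).
        KTable<String, String> customers = builder.table("mysql.shop.customers");
        // Change events from another table, assumed to be keyed by the same id.
        KStream<String, String> orders = builder.stream("mysql.shop.orders");

        // Join each order event with the current customer row and write the
        // combined "wide" record to a new topic for the Elasticsearch sink.
        orders.join(customers, (order, customer) -> order + "," + customer)
              .to("orders-enriched");

        new KafkaStreams(builder.build(), props).start();
    }
}
```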
After you have the joined stream+table written into a new Kafka topic, you can use the Kafka Connect Elasticsearch connector, or again, try using Flink, Logstash, etc., to ingest it into the search indices.
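As a sketch of that sink step, assuming the Confluent Elasticsearch sink connector and a Kafka Connect worker at a placeholder address, the connector can be registered through the Connect REST API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterElasticsearchSink {
    public static void main(String[] args) throws Exception {
        // Connector config as JSON; topic name and addresses are placeholders.
        String config = """
            {
              "name": "es-sink-orders-enriched",
              "config": {
                "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                "topics": "orders-enriched",
                "connection.url": "http://elasticsearch:9200",
                "key.ignore": "false"
              }
            }
            """;

        // POST the config to the Kafka Connect REST API (placeholder host).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://kafka-connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```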
Once everything is running, any new inserts to either table will be reflected in Elasticsearch.

Related

How can we use streaming in Spark from multiple sources? E.g., first take data from HDFS and then consume a stream from Kafka

The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS.
Now, data also arrives in the DB every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes stale very quickly (faster than a few minutes, for sure). Tip: Spark can read from MySQL over JDBC rather than needing HDFS exports.
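To illustrate that tip, a minimal sketch of reading the MySQL table from Spark over JDBC; the connection details and table name are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MySqlJdbcReport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mysql-jdbc-report")
                .getOrCreate();

        // Read the live table over JDBC instead of relying on HDFS exports.
        // URL, table name and credentials are placeholders; the MySQL JDBC
        // driver jar must be on the Spark classpath.
        Dataset<Row> transactions = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/sales")
                .option("dbtable", "transactions")
                .option("user", "report_user")
                .option("password", "report_password")
                .load();

        transactions.createOrReplaceTempView("transactions");
        spark.sql("SELECT COUNT(*) AS row_count FROM transactions").show();
    }
}
```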
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could take more than a few minutes).
However, if you use Debezium to get the data from MySQL into Kafka, then you have the data centralized in one location, and you can then ingest it from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or perhaps ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases and query patterns.

How to stream CDC from sharded MySQL tables to a Kafka cluster

I have new sharded MySQL tables being created every day, with names like "output_table_[date]".
I want to capture CDC from them and stream it to my Kafka cluster, so what can I use for that?
I explored the Debezium connector for MySQL, but couldn't find any configuration which can take into account the new tables created every day. E.g., while creating the connector we have to provide "table.whitelist" for the tables we would like to capture CDC for, but in my case of new tables being created every day, we would have to update that manually every day.
So can anyone help me with possible solutions to my use case?
You can capture all tables by simply omitting table.include.list and related options (by the way, your usage of table.whitelist indicates you're using an older version of Debezium). Note that for Debezium 1.6 we're also planning to do a PoC for supporting changes to the filter configuration.
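As a sketch of such a configuration, assuming the Debezium 1.x MySQL connector and a Kafka Connect worker at a placeholder address, the registration simply leaves out table.include.list / table.whitelist:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterDebeziumMySql {
    public static void main(String[] args) throws Exception {
        // No table.include.list / table.whitelist entry, so every table in the
        // included database is captured, including tables created later.
        // Hosts, credentials and names are placeholders (Debezium 1.x property names).
        String config = """
            {
              "name": "mysql-output-tables",
              "config": {
                "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                "database.hostname": "mysql-host",
                "database.port": "3306",
                "database.user": "debezium",
                "database.password": "dbz-secret",
                "database.server.id": "5701",
                "database.server.name": "shard01",
                "database.include.list": "reports",
                "database.history.kafka.bootstrap.servers": "kafka:9092",
                "database.history.kafka.topic": "schema-changes.reports"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://kafka-connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```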

Kafka producer vs. Kafka Connect to read a MySQL data source

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also just connect Kafka to the MySQL data source using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and analyze. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a service that keeps reading the data source.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not re-inventing the wheel and using Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer; it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)

Filtering rows of a database

I would like to know if the Kafka platform is suitable for the following job.
I'm trying to ingest a full database with multiple tables. Once the data is ingested by Kafka, I would like to filter rows of the tables based on a condition.
I think that is an easy job to do using Kafka Streams, but what happens to messages that are rejected by the filter?
A condition could be met in the future if it is based on a date, for example, so is there a chance that a rejected message will be filtered again, eventually pass the filter, and be further processed?
Is it better to filter the rows of data before feeding them into Kafka?
Thank you.
You might want to consider using a database connector such as Debezium or the Confluent JDBC Source Connector, which are both based on Kafka Connect.
For more on the Debezium connector for MySQL, see http://debezium.io/docs/connectors/mysql
For more on the Confluent JDBC connector, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html
With connectors based on Kafka Connect, you can filter the rows of data before publishing them to Kafka using the Single Message Transform (SMT) feature of Kafka Connect.
See the discussion on row filtering with Kafka Connect here: Kafka connect (Single message transform) row filtering
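As one possible sketch, Debezium also provides a scripting-based Filter SMT (io.debezium.transforms.Filter, which needs the Debezium scripting module and a JSR-223 engine such as Groovy on the Connect worker); the entries below are illustrative additions to a source connector's configuration, and the condition is a placeholder:

```java
import java.util.Map;

public class RowFilterSmtConfig {
    public static void main(String[] args) {
        // Entries to add to the source connector's configuration so that rows
        // are filtered before they are published to Kafka. Requires Debezium's
        // scripting module plus a JSR-223 engine (e.g. Groovy) on the Connect
        // worker's classpath; the condition is a placeholder.
        Map<String, String> filterSmt = Map.of(
                "transforms", "rowFilter",
                "transforms.rowFilter.type", "io.debezium.transforms.Filter",
                "transforms.rowFilter.language", "jsr223.groovy",
                // Keep only change events whose row matches this expression.
                // A record dropped here is never written to Kafka, so it will
                // not be re-evaluated later.
                "transforms.rowFilter.condition", "value.after.status == 'active'"
        );

        filterSmt.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```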

Real-time migration of data from MySQL to Elasticsearch?

I have tons of data in MySQL, in the form of different databases and their respective tables, which are all related to each other. But when I have to analyse the data, I have to create different scripts that combine and merge the data and show me the result, and this takes a lot of time and effort. I love Elasticsearch for its speed and for visualizing data via Kibana, so I have decided to move my entire MySQL data set to Elasticsearch in real time, while keeping the data in MySQL too. But I want a scalable strategy and process for migrating that data to Elasticsearch.
Please suggest the best tools or methods to do the job.
Thank you.
Prior to Elasticsearch 2.x you could write your own Elasticsearch _river plugin and install it into Elasticsearch. You can control how often the data you've munged with your scripts is pulled in by the _river (note: this is not really recommended).
You may also use your favourite queuing/message broker tool, such as ActiveMQ, to push your data into Elasticsearch.
If you want full control to meet your need for real-time migration of data, you may also write a simple app that uses the Elasticsearch REST endpoint, simply writing to it via REST. You can even do bulk POSTs (a sketch of a bulk request follows at the end of this answer).
Make use of any of the Elasticsearch tools, such as Beats or Logstash, which are great at shipping almost any type of data into Elasticsearch.
For other alternatives for munging your data into a flat file, or if you want to maintain relationships, see this post here.
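As a sketch of the plain-REST bulk option mentioned above, with the Elasticsearch address, index name, and documents as placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BulkIndex {
    public static void main(String[] args) throws Exception {
        // The _bulk API takes newline-delimited JSON: an action line followed by
        // the document to index. Index name and documents are placeholders, and
        // the payload must end with a newline.
        String ndjson =
                "{\"index\":{\"_index\":\"customers\",\"_id\":\"1\"}}\n" +
                "{\"name\":\"Alice\",\"city\":\"Berlin\"}\n" +
                "{\"index\":{\"_index\":\"customers\",\"_id\":\"2\"}}\n" +
                "{\"name\":\"Bob\",\"city\":\"Madrid\"}\n";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_bulk"))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(ndjson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```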