In MySQL, we can extract data changes from the binlog. But I need only the latest changes made during that time / day, and I need to feed that data into a time-series DB (planning to go with Druid).
While reading the binlog, is there any mechanism to avoid duplicates and keep only the latest changes?
My intention is to get the entire MySQL DB backed up every day into a time-series DB. It helps me debug my application for past dates by referring to the actual data present on that day.
Kafka, by design, is an append-only log (no updates).
A Kafka Connect source connector will continuously capture all the changes from the binlog into a Kafka topic. The connector stores its position in the binlog and will only write new changes into Kafka as they become available in MySQL.
For consuming from Kafka, one option is a sink connector that writes all the changes to your target. Or, instead of a Kafka Connect sink connector, some independent process can read (consume) from Kafka. For Druid specifically, you may look at https://www.confluent.io/hub/imply/druid-kafka-indexing-service.
The consumer (either a connector or some independent process) will store its position (offset) in the Kafka topic and will only write new changes into the target (Druid) as they become available in Kafka.
The process described above captures all the changes and allows you to view the source (MySQL) data as of any point in time in the target (Druid). It is best practice to have all the changes available in the target; use your target's functionality to limit the view of the data to a certain time of day, if needed.
If, for example, there is a huge number of daily changes to a record in MySQL and you'd like to write only the latest state as of a specific time of day to the target, you'll still need to read all the changes from MySQL. Create an additional daily process that reads all the changes since the prior run, filters them down to only the latest record per key, and writes those to the target, as in the sketch below.
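For that "latest record per key" pass, a daily job along these lines could work. This is a minimal sketch using the plain Java Kafka consumer; the topic, group id, and bootstrap servers are placeholders, and it assumes change events are keyed by the row's primary key (as Debezium keys them by default). The committed offsets are what "since the prior run" means here.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class DailyLatestSnapshot {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "daily-latest-snapshot");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            props.put("auto.offset.reset", "earliest");
            props.put("enable.auto.commit", "false");

            Map<String, String> latestByKey = new HashMap<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("mysql.mydb.mytable"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                    if (records.isEmpty()) break; // caught up with everything since the prior run
                    for (ConsumerRecord<String, String> r : records) {
                        latestByKey.put(r.key(), r.value()); // later offsets overwrite earlier ones
                    }
                }
                consumer.commitSync(); // remember the position for the next daily run
            }
            // latestByKey now holds the newest change per primary key; write it to Druid here.
        }
    }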
The problem arises when I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL, and I want to do reporting on that data. I thought of dumping the data into HDFS.
Now, data also arrives in the DB every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save those results, because processing future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL over JDBC rather than needing HDFS exports, as sketched below.
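For instance, something along these lines queries the live table directly; a minimal sketch, with the URL, table name, and credentials as placeholders (the MySQL JDBC driver must be on the classpath):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MysqlReport {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("mysql-report").getOrCreate();

            // Read the transactional table straight from MySQL, no HDFS export needed.
            Dataset<Row> txns = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/mydb")
                .option("dbtable", "transactions")
                .option("user", "report")
                .option("password", "secret")
                .load();

            txns.createOrReplaceTempView("transactions");
            spark.sql("SELECT status, COUNT(*) AS n FROM transactions GROUP BY status").show();
        }
    }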
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (could be more than a few minutes).
However, if you use Debezium to get data from MySQL into Kafka, then you have the data centralized in one location, and you can then ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.
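If you want to try the Debezium route without standing up a full Kafka Connect cluster first, its embedded engine can tail the binlog from a plain Java process. A sketch only: the connection settings are placeholders, the config is trimmed (a real MySQL setup also needs schema-history settings), and in production you would run the regular connector instead:

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class MysqlCdcTap {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("name", "mysql-cdc");
            props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
            props.setProperty("database.hostname", "mysql-host");
            props.setProperty("database.port", "3306");
            props.setProperty("database.user", "debezium");
            props.setProperty("database.password", "secret");
            props.setProperty("database.server.id", "5400");
            // Called database.server.name in older Debezium versions.
            props.setProperty("topic.prefix", "myserver");
            // Where the engine remembers its binlog position between runs.
            props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
            props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");

            DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                    .using(props)
                    .notifying(event -> System.out.println(event.value())) // one JSON event per row change
                    .build();

            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine); // runs until shut down
        }
    }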
I want to export updated data from MySQL/PostgreSQL to MongoDB every time a specified table has changed or, if that's impossible, dump the whole table to NoSQL every X seconds/minutes. What can I do to achieve this? I've googled and found only paid, enterprise-level solutions, and those are out of reach for my amateur project.
SymmetricDS provides an open source database replication option that supports replicating an RDBMS database (MySQL, Postgres) into MongoDB.
Here is the specific documentation to setup the Mongo target node in SymmetricDS.
http://www.symmetricds.org/doc/3.11/html/user-guide.html#_mongodb
There is also a blog about setting up Mongo in a bit more detail.
https://www.jumpmind.com/blog/mongodb-synchronization
To get online replication into a target database, you can use one of the following:
Feed the data stream into both databases at the same time
An enterprise solution that reads the transaction log and pushes the data to the next database
Periodically check for change dates > X
Export the table periodically
Write changed records to a dedicated table with a trigger and poll this table to select the changes (see the sketch below)
Push the changed data into the next database with triggers via a data-streaming service
Many additional approaches
Which solution fits your needs depends on how much time you want to invest and how much lag the data may have.
If the amount of data grows or the number of transactions increases, some solutions that fit an amateur project no longer fit.
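To make the trigger-plus-polling option from the list above concrete, a minimal JDBC sketch could look like this. The table, column, and connection details are made up, and the trigger syntax shown is MySQL's:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ChangePoller {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/mydb", "user", "pass")) {
                // One-time setup: a change table plus a trigger that records every
                // update to the watched table (fails if the trigger already exists).
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS changes (" +
                            "id BIGINT AUTO_INCREMENT PRIMARY KEY, row_id BIGINT, " +
                            "changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)");
                    st.execute("CREATE TRIGGER watched_upd AFTER UPDATE ON watched " +
                            "FOR EACH ROW INSERT INTO changes (row_id) VALUES (NEW.id)");
                }
                // Polling pass: fetch changes newer than the last id we processed and
                // replicate them; persist lastSeen somewhere between runs.
                long lastSeen = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, row_id FROM changes WHERE id > ? ORDER BY id")) {
                    ps.setLong(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastSeen = rs.getLong("id");
                            // push row rs.getLong("row_id") to the target database here
                        }
                    }
                }
            }
        }
    }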
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executed once. The thing is that I want to get the new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline executes only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redo it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)), which will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element appears in a PCollection.
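Put together, that could look roughly like the sketch below. The events table, its inserted_at column, and the 30-second window are assumptions; the final Datastore write is left out:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.KvCoder;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.io.GenerateSequence;
    import org.apache.beam.sdk.io.jdbc.JdbcIO;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class PeriodicMysqlRead {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();

            // One tick every 30 seconds; each tick triggers a JDBC read via readAll().
            PCollection<KV<String, String>> rows = p
                .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)))
                .apply(JdbcIO.<Long, KV<String, String>>readAll()
                    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "com.mysql.cj.jdbc.Driver", "jdbc:mysql://mysql-host:3306/mydb"))
                    // Hypothetical schema: rows carry an insertion timestamp, so each
                    // tick fetches only what appeared since the previous one.
                    .withQuery("SELECT id, payload FROM events "
                            + "WHERE inserted_at > NOW() - INTERVAL 30 SECOND")
                    .withParameterSetter((tick, stmt) -> { /* no bind parameters */ })
                    .withRowMapper(rs -> KV.of(rs.getString("id"), rs.getString("payload")))
                    .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

            // rows can now be written to Datastore (e.g. with DatastoreIO).
            p.run();
        }
    }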
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redo it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
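A skeleton of that streaming pipeline, with the topic name as a placeholder:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.values.PCollection;

    public class PubsubToDatastore {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();

            // Unbounded source: the job keeps running and pulls messages as they arrive.
            PCollection<String> messages = p.apply(
                PubsubIO.readStrings().fromTopic("projects/my-project/topics/mysql-inserts"));

            // Parse each message and write the resulting entity to Datastore here.
            p.run();
        }
    }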
I was wondering if you can help me decide which one is the most suitable to use in my case.
Use case:
I want to batch-process, per day, ~200M events stored in Apache Kafka and ~20M rows in different SQL tables. The data in the rows represents user states, and the events in Kafka change these states. The events in Kafka are well partitioned (all events for one user are stored in exactly one Kafka partition), but still, there are more users than partitions.
(EDIT)
State updates can't be handled in real time, as events come from different sources at different times. All events have timestamps with the proper timezone, but events might be observed late, which results in shifted timestamps. There are business rules for how to handle these.
I know how to compute a user's state for any given time if all the events and the starting state are available.
Output:
consistent final user states are stored in MySQL
writes to other sinks (Kafka, text files, etc.) can occur during computation, based on the current state
All of them are able to read and group data so I can process it, but as far as I know:
Spark and Flink can work without Hadoop (so far I don't have any stable cluster)
Spark has problems dealing with more data than the available RAM (?)
with Flink, I'm not sure if I can combine data from a data stream (Kafka) and a table (SQL)
with MapReduce, I need to set up a Hadoop cluster
Also, in the future there might be 100M events per hour, and there will be a functional Hadoop cluster.
I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses/works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a data pipeline, which I'm guessing you can schedule to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering if there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus scheduling something like a PowerShell script to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Also, I guess another question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set termination protection. Ref.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup - and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample on how to export data from DynamoDB.
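If you do go the scripted route instead, the JSON dump itself is not much code. A sketch with the AWS SDK for Java v1 (table and bucket names are placeholders, and it buffers the whole table in memory, which is only reasonable for small tables like yours):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.document.ItemUtils;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.ScanRequest;
    import com.amazonaws.services.dynamodbv2.model.ScanResult;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    import java.time.LocalDate;
    import java.util.Map;

    public class DynamoJsonBackup {
        public static void main(String[] args) {
            AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            StringBuilder json = new StringBuilder();
            ScanRequest scan = new ScanRequest().withTableName("my-table");
            ScanResult result;
            do {
                // Scan pages through the table; LastEvaluatedKey marks the next page.
                result = dynamo.scan(scan);
                for (Map<String, AttributeValue> item : result.getItems()) {
                    json.append(ItemUtils.toItem(item).toJSON()).append('\n');
                }
                scan.setExclusiveStartKey(result.getLastEvaluatedKey());
            } while (result.getLastEvaluatedKey() != null);

            s3.putObject("my-backup-bucket",
                    "my-table-" + LocalDate.now() + ".json", json.toString());
        }
    }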
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.