Stream data from MySQL Binary Log to Kinesis

We have a write-intensive table (on AWS RDS MySQL) from a legacy system, and we'd like to stream every write event (insert or update) from that table to Kinesis. The idea is to create a pipe to warm up caches and update search engines.
Currently we do that using a rudimentary polling architecture, basically using SQL, but the ideal would be a push architecture that reads the events directly from the transaction log.
Has anyone tried it? Any suggested architecture?

I've worked with some customers doing that already, with Oracle. It also seems that LinkedIn makes heavy use of that technique of streaming data from databases to somewhere else. They created a platform called Databus to accomplish it in an agnostic way - https://github.com/linkedin/databus/wiki/Databus-for-MySQL.
There is a public project on GitHub, following LinkedIn's principles, that already streams the binlog from MySQL to Kinesis Streams - https://github.com/cmerrick/plainview
If you want to get into the nitty-gritty details of LinkedIn's approach, there is a really nice (and extensive) blog post available - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.
Last but not least, Yelp is doing that as well, but with Kafka - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html
Without getting into the basics of Kinesis Streams, for the sake of brevity: if we bring Kinesis Streams into the game, I don't see why it shouldn't work. As a matter of fact, it was built for exactly that - your database transaction log is a stream of events. Borrowing an excerpt from the Amazon Web Services public documentation: Amazon Kinesis Streams allows for real-time data processing. With Amazon Kinesis Streams, you can continuously collect data as it is generated and promptly react to critical information about your business and operations.
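If you'd rather build the pipe yourself than adopt one of the projects above, a minimal sketch of the idea is below. It assumes the open-source python-mysql-replication library plus boto3, an RDS instance with binlog_format=ROW (and automated backups enabled, which RDS requires for binlog access), and placeholder names for the host, table, stream, and credentials.

```python
# Minimal sketch: tail the MySQL binlog and forward row events to Kinesis.
# Assumes `pip install mysql-replication boto3`; names below are placeholders.
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

kinesis = boto3.client("kinesis", region_name="us-east-1")

stream = BinLogStreamReader(
    connection_settings={
        "host": "my-rds-endpoint",   # placeholder
        "port": 3306,
        "user": "repl_user",         # needs REPLICATION SLAVE/CLIENT privileges
        "passwd": "secret",
    },
    server_id=1234,                  # any server id unused by other replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    only_tables=["my_hot_table"],    # the write-intensive table
    blocking=True,                   # wait for new events instead of exiting
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        record = {"table": event.table, "type": type(event).__name__, "row": row}
        kinesis.put_record(
            StreamName="my-cdc-stream",  # placeholder
            Data=json.dumps(record, default=str).encode(),
            PartitionKey=event.table,    # keeps a table's events in order
        )
```

In production you'd also persist the binlog position somewhere durable so the reader can resume where it left off after a restart.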
Hope this helps.

The AWS DMS service offers data migration from a SQL database to Kinesis.

Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams
https://aws.amazon.com/blogs/database/use-the-aws-database-migration-service-to-stream-change-data-to-amazon-kinesis-data-streams/
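If you go the DMS route, the Kinesis side is just a target endpoint attached to a CDC task; the blog post above walks through the full setup. A hedged boto3 sketch of the endpoint definition (the ARNs, identifiers, and region are placeholders):

```python
# Sketch: define a Kinesis target endpoint for a DMS change-data-capture task.
# ARNs and identifiers are placeholders; you still need a source endpoint,
# a replication instance, and a replication task with migration type "cdc".
import boto3

dms = boto3.client("dms", region_name="us-east-1")

response = dms.create_endpoint(
    EndpointIdentifier="mysql-to-kinesis-target",
    EndpointType="target",
    EngineName="kinesis",
    KinesisSettings={
        "StreamArn": "arn:aws:kinesis:us-east-1:123456789012:stream/my-cdc-stream",
        "MessageFormat": "json",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-kinesis-role",
    },
)
print(response["Endpoint"]["EndpointArn"])
```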

Related

How to generate notifications from MySQL table updates?

I have a full-stack app that uses React, Node.js, Express, and MySQL. I want the React app to respond to database updates, similar to Firebase: when data changes, I want a real-time notification sent to my app.
I want to use stock MySQL (no plugins), so that I can use AWS RDS or whatever.
I will use socket.io to push the real-time notifications to the web app.
To avoid off-target responses, I'll summarize various approaches that are not what I am looking for:
The server could poll, or each client could poll. (Not real-time, but included for completeness. When I search, polling is the only solution I find.)
Write a wrapper that handles all MySQL updates, handles subscriptions, and sends the notifications. This adds a complicated component to the stack. Firebase is popular because it both increases performance and reduces complexity. I like Firebase a lot but want to do the same thing with MySQL.
Use Firebase to handle the real-time notifications. The MySQL wrapper could use Firebase to handle the subscriptions and notifications, but there is still the problem of triggering the notifications in the first place. Also, I don't want to use Firebase. (For example, my application needs to run in an air-gapped environment.)
The question: Using a stock MySQL database, when a table changes, can a notification server discover the change in real-time (no polling), so that it can send notifications?
The approach that works is to listen to the binary logs. This way, any change to the database will be communicated in real time. The consumer of the binary logs can then publish this information in a number of ways. A common choice is to feed a stream of events to Apache Kafka.
Debezium, Maxwell, and NiFi work this way.
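For illustration, here is a bare-bones version of the same idea in Python, using the open-source python-mysql-replication library (connection details are placeholders, and the server must run with binlog_format=ROW); the tools above add snapshotting, offset tracking, and delivery guarantees on top of this:

```python
# Bare-bones change listener: watch the binlog and notify on every row change.
# Requires `pip install mysql-replication`; connection details are placeholders.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

def notify(change):
    # Hand off to socket.io, a message broker, etc.
    print("change detected:", change)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=42,        # any id unused by other replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,       # block and wait for new events
)

for event in stream:
    for row in event.rows:
        notify({"schema": event.schema, "table": event.table, "row": row})
```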

Mirroring homogeneous data from one MySQL RDS to another MySQL RDS

I have two MySQL RDS instances (hosted on AWS). One is my "production" RDS, and the other is my "performance" RDS. These instances have the same schema and tables.
Once a year, we take a snapshot of the production RDS and load it into the performance RDS, so that our performance environment has data similar to production. This process takes a while - there's data specific to the performance environment that must be re-added each time we do this mirror.
I'm trying to find a way to automate this process, and to achieve the following:
Do a one-time mirror in which all data is copied over from our production database to our performance database.
Continuously (preferably weekly) mirror all new data (but not old data) between our production and performance MySQL RDS instances.
During the continuous mirroring, I'd like the production data not to overwrite anything already in the performance database; I'd only want new data to be inserted into the performance database.
During the continuous mirroring, I'd like to change some of the data as it goes onto the performance RDS (for instance, I'd like to obfuscate user emails).
The following are the tools I've been researching to assist me with this process:
AWS Database Migration Service seems to be capable of handling a task like this, but the documentation recommends using different tools for homogeneous data migration.
Amazon Kinesis Data Streams also seems able to handle my use case - I could write a "fetcher" program that gets all new data from the prod MySQL binlog and sends it to Kinesis Data Streams, then write a Lambda that transforms the data (deciding what to send/add/obfuscate) and sends it to my destination - the performance RDS, or, if I can't write to it directly, a consumer HTTP endpoint I write that updates the performance RDS. A rough sketch of such a Lambda follows this question.
I'm not sure which of these tools to use - DMS seems to be built for migrating heterogeneous rather than homogeneous data, so I'm not sure I should use it. Similarly, Kinesis Data Streams could work, but having to build one custom program that fetches data from MySQL's binlog and another that consumes from Kinesis makes me feel like Kinesis isn't the best tool for this either.
Which of these tools is best capable of handling my use case? Or is there another tool that I should be using for this instead?
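To make the Kinesis option concrete, the transforming Lambda described above might look roughly like this. It's only a sketch: the one-JSON-row-per-record format, the email column name, and the upsert helper are all assumptions.

```python
# Sketch of the transforming Lambda: read row events from Kinesis, obfuscate
# user emails, and upsert into the performance RDS. The record format and
# column names are hypothetical.
import base64
import hashlib
import json

def obfuscate_email(email):
    # Deterministic but irreversible placeholder address.
    digest = hashlib.sha256(email.encode()).hexdigest()[:12]
    return f"user-{digest}@example.com"

def handler(event, context):
    for record in event["Records"]:
        row = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if "email" in row:                  # hypothetical column
            row["email"] = obfuscate_email(row["email"])
        upsert_into_performance_db(row)

def upsert_into_performance_db(row):
    # Stub: connect to the performance RDS (e.g., with pymysql) and use
    # INSERT IGNORE or similar so existing performance data is never overwritten.
    pass
```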

Kafka producer vs Kafka Connect to read a MySQL datasource

I have created a Kafka producer that reads website click data streams from a MySQL database, and it works well. I found out that I can also connect Kafka to the MySQL datasource using Kafka Connect or Debezium. My target is to ingest the data using Kafka and send it to Storm to consume and analyze. It looks like both ways can achieve my target, but using a Kafka producer may require me to build a service that keeps reading the datasource.
Which of the two approaches would be more efficient for my data pipeline?
I'd advise not reinventing the wheel and using Debezium (disclaimer: I'm its project lead).
It's feature-rich (supported data types, configuration options, can do initial snapshotting, etc.) and well tested in production. Another key aspect to keep in mind is that Debezium is based on reading the DB's log instead of polling (you might do the same in your producer; it's not clear from the question). This provides many advantages over polling:
no delay as with low-frequency polls, no CPU load as with high-frequency polls
can capture all changes without missing some between two polls
can capture DELETEs
no impact on schema (doesn't need a column to identify altered rows)
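For reference, a Debezium MySQL connector is registered with a single POST to the Kafka Connect REST API. A sketch (hosts, credentials, and table names are placeholders, and some property names differ between Debezium versions):

```python
# Sketch: register a Debezium MySQL connector via Kafka Connect's REST API.
# All hosts/credentials are placeholders; property names vary by Debezium version.
import requests

config = {
    "name": "clicks-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "clickstream",   # prefix for change topics
        "table.include.list": "web.clicks",      # placeholder table
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.clicks",
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=config)
resp.raise_for_status()
```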

Network stream analysis using Wireshark and Apache Storm

I want to make a simple application that analyzes network traffic captured by Wireshark and counts how often I access Google. Through this project I want to introduce myself to network traffic processing.
In order to get traffic, I use Wireshark.
In order to analyze the network traffic streams I want to use Apache Storm.
Well, I am new to stream processing, Apache Storm, and Wireshark (this is my very first use of them), so I want to ask whether it is possible to realize this simple project.
I think Apache Storm can easily read data from JSON files.
So the things that puzzle me are:
Can I export Wireshark data to JSON files?
Can I somehow read, in real time, the data generated by Wireshark,
process this data using Apache Storm,
or, if you can suggest a more appropriate instrument with a concrete tutorial, I would be grateful.
Sounds like an interesting project to start with.
To read data from Wireshark in real time, I suppose you can use one of Wireshark's programmatic interfaces (or its command-line tool, tshark) and create a client that pushes data to some port on your system; see the sketch after this list.
Create a Storm Spout that reads from this port
Create a Storm Bolt to parse these messages (if you want to convert them to JSON)
Create a Storm Bolt to filter the traffic (the packets to and from Google, maybe based on IP address?)
Optionally, store the results in some datastore.
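As a sketch of the first two steps: tshark can already emit newline-delimited JSON (-T ek), so a small Python shim can forward each packet to a TCP port that your Spout listens on. The port number and the choice to filter later in a Bolt are arbitrary.

```python
# Sketch: pipe tshark's newline-delimited JSON output into a TCP socket
# that a Storm Spout can read from. Port and filtering strategy are arbitrary.
import socket
import subprocess

# Capture live traffic, one JSON document per line ("ek" = Elasticsearch format).
tshark = subprocess.Popen(
    ["tshark", "-l", "-T", "ek"],    # -l flushes output after every packet
    stdout=subprocess.PIPE,
)

sink = socket.create_connection(("localhost", 7777))  # your Spout's port

for line in tshark.stdout:
    if line.strip():
        # -T ek interleaves index header lines with packet lines;
        # forward everything and let a Bolt pick out what it needs.
        sink.sendall(line)
```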
Alternatives: You can also try Apache Flink or Apache Spark, which also provide stream processing capabilities.

How does a LAMP developer get started using a Redis/Node.js solution?

I come from the cliché land of PHP and MySQL on Dreamhost. BUT! I am also a JavaScript genie, and I've been dying to get on the Node.js train. In my reading I've inadvertently discovered a NoSQL solution called Redis!
With my shared web host and limited server experience (I know how to install Linux on one of my old Dells and do some basic server admin), how can I get started using Redis and Node.js? And the next best question is - what does one even use Redis for? What situation would Redis be better suited for than MySQL? And does Node.js remove the necessity for Apache? If so, why do developers recommend using the NGINX server?
Lots of questions, but there doesn't seem to be a solid source out there with this info all in one place!
Thanks again for your guidance and feedback!
NoSQL is just an inadequate buzzword.
I'll attempt to answer the latter part of the question.
Redis is a key-value store database system. Speed is its primary objective, so most of its use comes from event-driven implementations (as its Reddit tutorial goes over).
It excels in areas like logging, message transactions, and other reactive processes.
Node.js, on the other hand, is mainly for independent HTTP transactions. It is basically used to serve content (much like a web server, though Node.js wouldn't necessarily be public-facing) very fast, which makes it useful for backend business-logic applications.
For example, having a C program calculate stock values and having Node.js serve the content for another internal application to retrieve, or using Node.js to serve a web page one is developing so one's coworkers can view it internally.
It really excels as a middleman between applications.
Redis
Redis is an in-memory datastore: all your data is stored in memory, meaning that a huge database means huge memory usage, but with really fast access and lookups.
It is also a key-value store: you don't have any relationships or queries to retrieve your data. You can only set a key-value pair and retrieve it by its key. (Redis also provides useful types such as sets and hashes.)
These particularities make Redis really well suited for storing sessions in a web application, creating indexes on a database, and handling real-time data like analytics.
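A few lines with the redis-py client illustrate that style of use (assumes a Redis server on localhost and `pip install redis`):

```python
# Tiny illustration of Redis as a key-value store, using redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)

# Sessions: a key-value pair with a 30-minute expiry.
r.setex("session:abc123", 1800, "user_id=42")

# Real-time analytics: atomic counters.
r.incr("pageviews:home")

# Hashes: a small object stored under one key.
r.hset("user:42", mapping={"name": "Ada", "plan": "pro"})
print(r.hgetall("user:42"))
```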
So if you need something that will "replace" MySQL for storing your basic application models, I suggest you try something like MongoDB, Riak, or CouchDB, which are document stores.
Document stores manage your data as something analogous to JSON objects (I know that's a huge shortcut).
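To get a feel for that, here's a quick taste using MongoDB's Python driver (assumes a local MongoDB and `pip install pymongo`):

```python
# Document stores persist and query JSON-like documents directly.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["myapp"]
db.users.insert_one({"name": "Ada", "tags": ["admin", "beta"]})
print(db.users.find_one({"name": "Ada"}))
```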
Read this article if you want to know more about popular NoSQL databases.
Node.js
Node.js provides asynchronous I/O for the V8 JavaScript engine.
When you run a Node server, it listens on a port on your machine (e.g., 3000). It does not do any sort of domain-name resolution or virtual-host handling, so you typically put an HTTP server such as Apache or nginx in front of it as a proxy.
Choosing between Apache and nginx in production is a matter of performance, and I find nginx easier to use. But I suggest you use the one you're most comfortable with.
To get started, just install them and start playing around; HowToNode is a good resource.
You can get a free plan from https://redistogo.com/ - it is a hosted Redis database instance.
A quick intro to Redis data types and basic commands is available here - http://redis.io/topics/data-types-intro.
A good comparison of when to use what is here - http://playbook.thoughtbot.com/choosing-platforms/databases/