DR - Missing binlogs in Maxwell - MySQL

We have a standard db.r6g.2xlarge RDS MySQL instance running in production with 2 replicas. Maxwell CDC is set up to publish binlog events to a Kafka topic.
From an RPO point of view: in the worst case (say the RDS instance is stuck in a "failed" state) we might have to go for PITR, which could result in up to 5 minutes of data loss. We are currently thinking about leveraging the CDC topic to reconstruct the queries and replay them to recover the last 5 minutes of data (whatever is possible).
Given the above, is there a possibility (however remote) that binlogs were written but are not available in the CDC topic? I am thinking of situations like disk crashes, corruption, etc.
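For reference, the replay we have in mind would look roughly like this. It is only a sketch, assuming the kafka-python and mysql-connector-python clients and Maxwell's default JSON envelope; the topic, host, and credential values are placeholders:

    import json
    import mysql.connector            # assumed driver
    from kafka import KafkaConsumer   # kafka-python, assumed client

    # Read the CDC topic from the last known-good point and replay row images
    # into the restored (PITR) instance.
    consumer = KafkaConsumer(
        "maxwell",                                # placeholder topic name
        bootstrap_servers="kafka:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    target = mysql.connector.connect(
        host="restored-rds-endpoint", user="admin", password="...", database="mysql")
    cur = target.cursor()

    for msg in consumer:
        event = msg.value             # Maxwell emits {"database", "table", "type", "data", ...}
        if event.get("type") != "insert":
            continue                  # sketch handles inserts only; updates/deletes need similar handling
        cols = list(event["data"].keys())
        sql = "INSERT INTO `%s`.`%s` (%s) VALUES (%s)" % (
            event["database"],
            event["table"],
            ", ".join("`%s`" % c for c in cols),
            ", ".join(["%s"] * len(cols)),
        )
        cur.execute(sql, list(event["data"].values()))
        target.commit()

In practice we would also filter by timestamp/position to cover only the missing window and de-duplicate against rows already present after PITR.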

Related

Master Slave Replication in Databases

As per How To Set Up Replication in MySQL:
"Once the replica instance has been initialized, it creates two threaded processes. The first, called the IO thread, connects to the source MySQL instance and reads the binary log events line by line, and then copies them over to a local file on the replica’s server called the relay log. The second thread, called the SQL thread, reads events from the relay log and then applies them to the replica instance as fast as possible."
Isn't it contradictory to the theory of master-slave database replication in which the master copies data to the slaves?
Reliability. (A mini-history of MySQL's efforts.)
When a write occurs on the Primary, N+1 extra actions occur:
One write to the binlog -- this is to allow for any Replicas that happen to be offline (for any reason); they can come back later and request data from this file. (Also see sync_binlog)
N network writes, one per Replica. These are to get the data to the Replicas ASAP.
Normally, if you want more than a few Replicas, you can "fan out" through several levels, thereby allowing for an unlimited number of Replicas. (10 per level would give you 1000 Replicas in 3 layers.)
The product called Orchestrator carries this a step further -- the binlog is replicated to an extra server, and the network traffic to the Replicas is served from there. This offloads the Primary. (Booking.com uses it to handle literally hundreds of replicas.)
On the Replica's side the two threads were added 20 years ago because of the following scenario:
The Replica is busy doing one query at a time.
It gets busy with some long query (say an ALTER)
Lots of activity backs up on the Primary
The Primary dies.
Now the Replica finishes the Alter, but does not have anything else to work on, so it is very "behind" and will take extra time to "catch up" once the Primary comes back online.
Hence, the 2-thread Replica "helps" keep things in sync, but it is still not fully synchronous.
Later there was "semi-synchronous" replication and multiple SQL threads in the Replica (still a single I/O thread).
Finally, InnoDB Cluster and Galera became available to provide [effectively] synchronous replication. But they come with other costs.
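To see the two threads described above on a real replica, a minimal check looks like this; it is only a sketch, assuming the mysql-connector-python driver and placeholder host/credentials:

    import mysql.connector  # assumed driver

    replica = mysql.connector.connect(host="replica-host", user="monitor", password="...")
    cur = replica.cursor(dictionary=True)

    # MySQL 8.0.22+ wording; on older servers use SHOW SLAVE STATUS and the
    # Slave_IO_Running / Slave_SQL_Running / Seconds_Behind_Master columns.
    cur.execute("SHOW REPLICA STATUS")
    status = cur.fetchone()
    if status is None:
        print("This server is not configured as a replica.")
    else:
        print("IO thread running: ", status["Replica_IO_Running"])   # pulls binlog into the relay log
        print("SQL thread running:", status["Replica_SQL_Running"])  # applies relay-log events
        print("Seconds behind:    ", status["Seconds_Behind_Source"])

Both threads can be running while the replica is still seconds or minutes behind -- that is the asynchronous gap discussed above.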
"master-slave database replication in which the master copies data to the slaves" - it's just a concept - data from a leader is copied to followers. There are many options how this could be done. Some of those are the write ahead log replication, blocks replication, rows replication.
Another interesting approach is to use a replication system completely separate from the storage. An example for this would be Bucardo - replication system for PostgreSQL. In that case nighter leader or follower actually do work.

Usability of the Binary Log for data streaming in MySQL: What are the drawbacks and advantages?

I have been trying to read data out of MySQL with Kafka Connect, using the MySQL source connector for the database and the Debezium connector for the binlogs. I am trying to understand which would be the better way to pull the change data. Binlogs have the overhead of writing to the log files, while reading from the database has the overhead of querying it. What other major advantages and disadvantages are associated with these two approaches? What would be a better way of capturing change data? Also, starting with MySQL 8 the binlogs are enabled by default. Does this mean it could be the better way of doing things?
This question can be summarized as follows:
What are the pros and cons of a log-based CDC (represented by Debezium Connector) versus a polling-based CDC (represented by JDBC Source Connector)?
Query-based CDC:
✓ Usually easier to set up
✓ Requires fewer permissions
✗ Polling puts extra query load on the source database
✗ Needs specific columns in the source schema (e.g. an updated timestamp) to track changes
✗ Can't track deletes
✗ Can't capture intermediate changes that happen between two polls
Log-based CDC:
✓ All data changes are captured
✓ Low event latency without adding polling load on the database
✓ No impact on the data model
✓ Can capture deletes
✓ Can capture the old record state and further metadata
✗ More setup steps
✗ Higher system privileges required
✗ Can be expensive for some proprietary databases
Reference:
Five Advantages of Log-Based Change Data Capture by Gunnar Morling
No More Silos: How to Integrate Your Databases with Apache Kafka and CDC by Robin Moffatt
Stack Overflow: Kafka Connect JDBC vs Debezium CDC
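To make these trade-offs concrete, a minimal query-based (polling) capture loop tends to look roughly like this; the table, column, and host names are hypothetical, and the sketch assumes the mysql-connector-python driver:

    import time
    import mysql.connector  # assumed driver

    conn = mysql.connector.connect(host="db-host", user="cdc", password="...", database="shop")
    cur = conn.cursor(dictionary=True)

    last_seen = "1970-01-01 00:00:00"   # high-water mark; persist it somewhere durable in practice
    while True:
        # Requires an indexed updated_at column on the source table. Deletes are
        # invisible to this query, and intermediate states between two polls are
        # lost -- exactly the drawbacks listed above.
        cur.execute(
            "SELECT id, customer_id, total, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row in cur.fetchall():
            print(row)                           # stand-in for publishing to a downstream system
            last_seen = str(row["updated_at"])
        time.sleep(5)                            # polling interval

A log-based connector such as Debezium or Maxwell instead tails the binlog, so it sees every committed change (including deletes) without issuing any of these queries.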
The list given by @Iskuskov Alexander is great. I'd add a few more points:
Log-based CDC also requires writing to logs (you mentioned this in your question). This has overhead not only in performance but also in storage space.
Log-based CDC requires a continuous stream of logs. If the CDC misses a log, then the replica cannot be kept in sync, and the whole replica must be replaced by a new replica initialized by a new snapshot of the database.
If your CDC consumer is offline periodically, you need to keep logs until it catches up, and how long that will take can be hard to predict. This leads to needing more storage space.
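If the source happens to be on Amazon RDS (as in the question at the top of this page), binlog retention has to be configured explicitly. A sketch, assuming the mysql-connector-python driver and a placeholder endpoint:

    import mysql.connector  # assumed driver

    conn = mysql.connector.connect(
        host="my-rds-endpoint", user="admin", password="...", database="mysql")
    cur = conn.cursor()

    # RDS-specific stored procedure: keep binlogs for 24 hours so an offline
    # CDC consumer can still catch up after an outage.
    cur.callproc("mysql.rds_set_configuration", ("binlog retention hours", 24))
    conn.commit()

    # Show the current retention setting.
    cur.callproc("mysql.rds_show_configuration")
    for result in cur.stored_results():
        for row in result.fetchall():
            print(row)

Pick a retention window at least as long as the longest consumer outage you want to be able to survive.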
That said, query-based CDC has its own drawbacks. At my company we have used a query-based CDC, but we found it inconvenient, and we're working on replacing it with a Debezium log-based solution, for many of the reasons in the other answer, and also:
Query-based CDC makes it hard to keep schema changes in sync with the replica, so if a schema change occurs in the source database, it may require the replica be trashed and replaced with a fresh snapshot.
The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot, and users don't like this downtime. Also snapshot transfers increase the network bandwidth requirements.
Neither solution is "better" than the other. Both have pros and cons. Your job as an engineer is to select the option that fits your project's requirements the best. In other words, choose the one whose disadvantages are least bad for your needs.
We can't make that choice for you, because you know your project better than we do.
Re your comments:
Enabling binary logs has no overhead for read queries, but significant overhead for write queries. The overhead became greater in MySQL 8.0, as measured by Percona CTO Vadim Tkachenko and reported here: https://www.percona.com/blog/2018/05/04/how-binary-logs-affect-mysql-8-0-performance/
He concludes the overhead of binary logs is about 13% for MySQL 5.7, and up to 30% for MySQL 8.0.
Can you also explain "The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot"? Do you mean building a replication database?
Yes, if you need to build a new replica, you acquire a snapshot of the source database and import it to the replica. Every step of this takes time:
Create the snapshot of the source
Transfer the snapshot to the host where the replica lives
Import the snapshot into the replica instance
How long this takes depends on the size of the database, but it can be hours or even days. While waiting for this, users can't use the replica database, at least not if they want their queries to analyze a complete copy of the source data. They have to wait for the import to finish.
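A rough sketch of that rebuild flow, assuming mysqldump and the mysql client are installed, SSH access between hosts, and credentials supplied via the usual client config files (host names are placeholders):

    import subprocess

    SOURCE, REPLICA = "source-host", "replica-host"

    # 1. Create a consistent snapshot of the source; --source-data records the
    #    binlog coordinates to resume from (--master-data on older mysqldump versions).
    with open("snapshot.sql", "wb") as out:
        subprocess.run(
            ["mysqldump", "-h", SOURCE, "--single-transaction",
             "--source-data=2", "--all-databases"],
            stdout=out, check=True)

    # 2. Transfer the snapshot to the replica host (time grows with database size).
    subprocess.run(["scp", "snapshot.sql", f"{REPLICA}:/tmp/snapshot.sql"], check=True)

    # 3. Import it on the replica, then point replication at the recorded coordinates.
    subprocess.run(["ssh", REPLICA, "mysql < /tmp/snapshot.sql"], check=True)

Each of these three steps is one of the time sinks listed above, and the replica is not usable for complete queries until the last one finishes.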

"Read Only Database" Vs "Read and Write database" Configuration in mysql

We are using two databases: one is read-only and the second is for read and write, and this setup generally achieves what we want.
But sometimes our read-only database takes more time to execute the same query, and it looks like queries are being queued.
Is this because we are using a higher-configuration "read and write" database compared to the "read-only" database? (Amazon RDS)
We tried to find an article or post about this but couldn't. Can you help me understand, please? My theory is that it is like pouring water from a big pipe into a small pipe: at some point it will create a problem.
Server is on Heroku and the databases are M4 Large (read & write) and T2 Medium (read-only) – Arvind
Your databases are on different "hardware", so they'll have different performance.
The most significant difference I see is memory: 4 vs 8 GB. This will affect how much caching each database can do. Your leader (read & write) has more memory and can cache more. Your follower (read-only), with less memory, might have things pushed out of cache that your leader retains.
There is also network performance. t2.medium is listed at "low to moderate" while m4.large is "moderate". What that actually means I have no idea except that the T2 has less.
Finally, a T2 instance is "burstable", meaning it normally runs at about 20% CPU capacity with bursts to maximum performance paid for by CPU credits. If you start to run out of CPU credits in standard mode (the default for T2), CPU performance will drop. It's possible your T2 follower is in standard mode and periodically running low on CPU credits.
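One way to check is to watch the CPUCreditBalance metric for the reader. A sketch, assuming boto3 and a placeholder RDS instance identifier and region:

    from datetime import datetime, timedelta
    import boto3  # assumed AWS SDK

    cw = boto3.client("cloudwatch", region_name="us-east-1")
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-read-replica"}],
        StartTime=datetime.utcnow() - timedelta(hours=6),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])

If the balance trends toward zero around the times the read-only database slows down, CPU credit exhaustion is a likely culprit.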

Reliability of MySQL master-slave replication

I have an application that requires a master catalogue of about 30 tables to be copied out to many (100+) slave copies of the application. Slaves may be in their own DB instance, or there may be multiple slaves in a single DB instance. Any changes to the master catalogue need to be copied out to the slaves within a reasonable time - around 5 minutes. Our infrastructure is all AWS EC2 and we use MySQL. Master and slaves will all reside within a single AWS region.
I had planned to use Master-Slave replication but I see reports of MySQL replication being sometimes unreliable and I am not sure if this is due to failings inherent in the particular implementations or failings in MySQL itself. We need a highly automated and reliable system and it may be that we have to develop monitoring scripts that allow a slave to continuously monitor its catalogue relative to the master.
Any observations?
When I was taking dance lessons before my wedding, the instructor said, "You don't have to do every step perfectly, you just have to learn to recover gracefully when missteps happen. If you can do that quickly, with a smile on your face, no one will notice."
If you have 100+ replicas, expect that you will be reinitializing replicas frequently, probably at least one or two every day. This is normal.
All software has bugs. Expecting anything different is, frankly, naive. Don't expect software to be flawless and continue operating 24/7 indefinitely without errors, because you will be disappointed. You should not seek a perfect solution, you should think like a dancer and recover gracefully.
MySQL replication is reasonably stable, and no less so than other solutions. But there are a variety of failures that can happen, without it being MySQL's fault.
Binlogs can develop corrupted packets in transit due to network glitches. MySQL 5.6 introduced binlog checksums to detect this.
The master instance can crash and fail to write an event to the binlog. sync_binlog can help to ensure all transactions are written to the binlog on commit (though with overhead for transactions).
Replica data can fall out of sync due to non-deterministic SQL statements, or packet corruption, or log corruption on disk, or some user can change data directly on a replica. Percona's pt-table-checksum can detect this, and pt-table-sync can correct errors. Using binlog_format=ROW reduces the chance of non-deterministic changes. Setting the replicas read-only can help, and don't let users have SUPER privilege.
Resources can run out. For example, you could fill up the disk on the master or the replica.
Replicas can fall behind if they can't keep up with the changes on the master. Make sure your replica instances are not under-powered. Use binlog_format=ROW. Write fewer changes to an individual MySQL master. MySQL 5.6 introduced multi-threaded replicas, but so far I've seen some cases where this is still a bit buggy, so test carefully.
Replicas can be offline for an extended time, and when they come back online, some of the master's binlogs have been expired so the replica can't replay a continuous stream of events from where it left off. In that case, you should trash the replica and reinitialize it.
Bugs happen in any software project, and MySQL's replication has had its share. You should keep reading the release notes of MySQL, and be prepared to upgrade to take advantage of bug fixes.
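Several of the safeguards in that list can be audited mechanically. A minimal sketch, assuming the mysql-connector-python driver and placeholder host/credentials; the desired values simply mirror the recommendations above:

    import mysql.connector  # assumed driver

    conn = mysql.connector.connect(host="db-host", user="monitor", password="...")
    cur = conn.cursor()

    checks = {
        "binlog_checksum": "CRC32",   # detect events corrupted in transit (MySQL 5.6+)
        "sync_binlog": "1",           # flush the binlog to disk at commit
        "binlog_format": "ROW",       # avoid non-deterministic statement replication
        "read_only": "ON",            # keep users from changing data directly on a replica
    }
    for name, wanted in checks.items():
        cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
        row = cur.fetchone()
        actual = row[1] if row else None
        flag = "ok" if str(actual).upper() == wanted.upper() else "CHECK"
        print(f"{flag:5s} {name} = {actual} (want {wanted})")

Run it against the master and each replica (read_only obviously applies only to the replicas), and wire it into whatever monitoring you already have.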
Managing a big collection of database servers in continuous operation takes a significant amount of full-time work, no matter what brand of database you use. But data has become the lifeblood of most businesses, so it's necessary to manage this resource. MySQL is no better and no worse than any other brand of database, and if anyone tells you something different, they're selling something.
P.S.: I'd like to hear why you think you need 100+ replicas in a single AWS region, because that is probably overkill by an order of magnitude for any goal of high availability or scaling.

Homemade cheap and cheerful clustering with MySQL+EC2?

I've got a Java web service backed by MySQL + EC2 + EBS. For data integrity I've looked into DRBD, MySQL Cluster, etc., but wonder if there isn't a simpler solution. I don't need high availability (I can handle downtime).
There are only a few operations whose data I need to preserve -- creating an account, changing password, purchase receipt. The majority of the data I can afford to recover from a stale backup.
What I am thinking is that I could pipe selected INSERT/UPDATE commands to storage (S3, SimpleDB for instance) and when required (when the db blows up) replay these commands from the point of last backup. And wouldn't it be neat if this functionality was implemented in the JDBC driver itself.
Is this too silly to work, or am I missing another obvious and robust solution?
Have you looked into moving your MySQL into Amazon Web Services as well? You can use Amazon Relational Database Service (RDS). Also see MySQL Enterprise Support.
You always have a window where total loss of a server and associated file storage will result in some amount of lost data.
When I ran a modestly busy SaaS solution in AWS, I had a MySQL Master running on a large instance and a MySQL Slave running on a small instance in a different availability zone. The replication lag was typically no more than 2 seconds, though a surge in traffic could take that up to a minute or two.
If you can't afford to lose 5 minutes of data, I would suggest running a Master/Slave setup over rolling your own recovery mechanism. If you do roll your own, ensure the "stale" backups and the logged/journaled critical data are in a different availability zone. AWS has lost entire zones before.
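If you do go the roll-your-own route, the journaling idea from the question might look roughly like this. It is only a sketch, shown in Python rather than inside a JDBC driver; the bucket, table, and key scheme are hypothetical, assuming boto3 and the mysql-connector-python driver:

    import json
    import time
    import boto3                      # assumed AWS SDK
    import mysql.connector            # assumed driver

    s3 = boto3.client("s3")
    db = mysql.connector.connect(host="db-host", user="app", password="...", database="app")

    def critical_write(sql, params):
        # Journal the statement first (to a bucket in another AZ/region), then execute it.
        # On disaster, replay the journal on top of the most recent stale backup.
        record = {"ts": time.time(), "sql": sql, "params": list(params)}
        s3.put_object(
            Bucket="my-write-journal",             # hypothetical bucket
            Key="journal/%.6f.json" % record["ts"],
            Body=json.dumps(record).encode("utf-8"),
        )
        cur = db.cursor()
        cur.execute(sql, params)
        db.commit()

    critical_write(
        "INSERT INTO receipts (account_id, amount) VALUES (%s, %s)", (42, 999))

This adds latency to every critical write, and replay after a failure still needs care to skip statements that already made it into the backup.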