AWS MySQL RDS becomes unresponsive from time to time

We have two MySQL RDS instances (a master and a read replica). As usual, we write to the master and read from the slave.
The master works fine, but we have observed that the slave becomes unresponsive from time to time.
Observations:
Monitoring Graphs
CPU utilization drops down to 0
Increase in number of connections
Write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drop to 0.
This can be resolved with a restart, but we are interested in finding the root cause. When this happens, we can still log in to the mysql prompt, but we can't execute any queries. The AWS console shows the instance as healthy, and no errors are reported.
According to the graphs, there is no abnormal activity or increase in resource utilization just before this happens. Everything looks normal.
(The small climbs in the attached graphs are normal; they are in line with the business pattern. Historically the instance has survived much larger peaks.)
Please let me know if you happen to come across such a situation.
Thanks.
Note:
Instance Information
db.m4.xlarge
IOPS 2000
Size 50G
Basically, the instance is underutilized when the issue happens.
Note:
If we wait without restarting the instance, it gets restarted automatically with the following error:
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resolution, a DB Snapshot will be performed after MySQL completes restarting.

MySQL heavy disk activity even with no queries running

Trying to troubleshoot an issue with a mysterious disk I/O bottleneck caused by MySQL.
I'm using the following commands to test disk read/write speed:
#write
dd if=/dev/zero of=/tmp/writetest bs=1M count=1024 conv=fdatasync,notrunc
#read
echo 3 > /proc/sys/vm/drop_caches; dd if=/tmp/writetest of=/dev/null bs=1M count=1024
I rebooted the machine, disabled cron so none of my usual processes are running queries, killed the web server which usually runs, and killed mysqld.
When I run the read test without mysqld running, I get 1073741824 bytes (1.1 GB) copied, 2.19439 s, 489 MB/s. Consistently around 450-500 MB/s.
When I start the mysql service back up and then run the read test again, I get 1073741824 bytes (1.1 GB) copied, 135.657 s, 7.9 MB/s. Consistently around 5 MB/s.
Running show full processlist in mysql doesn't show any queries (and I disabled everything that would be running queries anyway). In MySQLWorkbench's Server Status tab, I can see InnoDB reads fluctuate between 30-200 reads per second, and 3-15 writes per second even when no queries are running.
If I run iotop -oPa I can see that mysqld is racking up about 1 MB of disk reads per second when no queries are running. That seems like a lot considering no queries are running, but at the same time it doesn't seem like enough to make my dd command take so long... The only other thing performing disk I/O is jbd2/sda3-8.
Not sure if it's related, but if I try to kill the mysql server with service mysql stop it says "Attempt to stop MySQL timed out", and the mysqld process continues running, but I can no longer connect to the DB. I have to use kill -9 to kill the mysqld process and restart the server.
All of this appears to be out of the blue. This server was doing heavy duty log parsing, high volume inserts and selects for months, until this last weekend we started seeing this disk io bottleneck.
How can I find out why MySQL is doing so much disk reading when it's essentially idle?
Did you update/delete/insert a large number of rows? If so, consider these "delays" in writing to disk:
The block containing the data is not written back to disk immediately.
Ditto for UNIQUE keys.
Updates to secondary indexes go into the "change buffer". They get folded into the index blocks, often much later.
Updates/deletes leave behind a "history list" that needs to be cleaned up after the transaction is complete.
Those things are handled by background tasks that do not show up in the PROCESSLIST. They may be visible on mysqld process(es), mostly as I/O. (CPU is probably minimal.)
Was there a ROLLBACK? Transactions are "optimistic", so a ROLLBACK has to do a lot of work to undo what was optimistically already applied.
If you abruptly kill mysqld (or turn off the power), then the ROLLBACK occurs after restarting.
SSDs have no "seek" time. HDDs must move the read/write heads by a variable amount; this takes time. If your dd is working on one end of the disk, and mysqld is working on the other end, the "seeking" adds to the apparent I/O time.
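If you want to see that background work, here is a small sketch; the helper name is mine, and the bare mysql invocation assumes your login is already configured (e.g. via ~/.my.cnf):

```shell
# Sketch: surface InnoDB background activity that never appears in
# SHOW PROCESSLIST. Helper name and credential-free mysql call are
# assumptions; adjust for your own setup.
innodb_background_stats() {
  # "History list length" is the purge backlog left by updates/deletes;
  # a large, growing value explains disk I/O with no queries running.
  mysql -e "SHOW ENGINE INNODB STATUS\G" | grep 'History list'
  # Cumulative background read/write counters for the data files.
  mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_data_%';"
}
```

Sampling these counters a few seconds apart shows whether the idle-time I/O is purge/flush work rather than client queries.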
This turned out, like many performance problems, to be a multifaceted issue.
Essentially, the issue turned out to be nightly system and DB backups, written to a separate HDD RAID array, running into the next day; the master would then send FLUSH TABLES, causing mysql jobs and replication work to wait on the backup. In addition, an unnecessary side process was copying many gigabytes of text files around the system a few times a day. The result was tons of context switching as the system tried to copy data for backups while also performing mysql work (replication and other jobs).
I ended up reducing the number of tables we were replicating (some were unnecessary), reducing the copying of text files around the system, increasing the memory and I/O allocated to the mysql server, streamlining the mysql and system backups, and limiting cron jobs running mysql processes to give the mysql backups more time to complete. Even with all that, the backups were barely completing by 7 AM each morning, so I determined that we need to run the mysql backups only on weekends instead of nightly, which is fine since this is all fairly static data.

AWS MySQL RDS instance becomes unresponsive and getting restarted automatically

We have an AWS MySQL RDS instance which is about 1.7 TB in size. Sometimes it becomes unresponsive and no operations can be performed.
CPU utilization, Write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drops to zero.
The number of connections piles up.
"Show engine innodb status" hangs.
Lots of queries by rdsadmin (around 25 of each) are in a hung state.
SELECT count(*) from mysql.rds_replication_status WHERE action = 'reset slave' and master_host is NULL and master_port is NULL GROUP BY action_timestamp,called_by_user,action,mysql_version,master_host,master_port ORDER BY action_timestamp LIMIT 1;
SELECT NAME, VALUE FROM mysql.rds_configuration;
After some time, the instance gets rebooted automatically with the following error.
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resolution, a DB Snapshot will be performed after MySQL completes restarting.
What can be the issue? This happens quite often. Sometimes, to our surprise, it happens at off-peak times too.
I faced the same issue and raised it with AWS Support. I got the following explanation:
The RDS Monitoring service discovered an issue with backing up the binary logs of your databases, which is critical for the Point in Time Restore (PITR) feature. To mitigate this issue and avoid data corruption, RDS monitoring restarted the RDS instance, hence the automatic restart. To make sure there is no data loss, it took a snapshot of the DB instance.
Although the RDS instance was Multi-AZ, it didn't fail over for the following reason:
Multi-AZ has two criteria:
1- Single Box Experience, which means the customer always finds his data, even after a failover.
2- Higher availability than Single-AZ.
Both criteria have to be met when the AWS monitoring service takes the decision to fail over to the standby instance. In your case, the monitoring service noticed a risk of data loss after a failover, which is why it decided to reboot instead of failing over.
Hope this helps. This has happened to me 3 times in the last week, though.
Check your DB maintenance window timing, i.e. when your scheduled maintenance happens, and note at what time this issue occurs: does it happen at a regular interval or randomly?
Check both the MySQL error logs and the slow query logs.
If possible, paste the suspected issue here.
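On RDS you cannot tail the logs from a shell on the host, but the AWS CLI can fetch them. A sketch, with the instance identifier as a placeholder and the usual default error-log name:

```shell
# Sketch: fetch RDS MySQL logs via the AWS CLI. "mydb" below is a
# placeholder for your DB instance identifier.
fetch_rds_error_log() {
  local instance=$1
  # List the log files RDS currently exposes for this instance.
  aws rds describe-db-log-files --db-instance-identifier "$instance"
  # Download the most recent portion of the MySQL error log.
  aws rds download-db-log-file-portion \
    --db-instance-identifier "$instance" \
    --log-file-name error/mysql-error.log \
    --output text
}
# Usage: fetch_rds_error_log mydb
```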
We were able to resolve this issue by upgrading the instances to 5.6.34.

Slow https requests after sudden spike in mysql queries

I have a WordPress website running on a VPS server, handling about 150 mysql queries/second.
Occasionally, when we see a spike in traffic to about 200 mysql queries a second, https requests to the site become extremely slow.
The site loads decently over http, but with https it takes 20+ seconds.
Gradually, over a period of an hour after the spike, load times get better, and then everything returns to normal.
The server load and memory look fine. There is only a spike in mysql queries, firewall traffic and eth0 requests. There are no mysql slow queries.
Any help would be appreciated.
Thank You
I think your answer is in "disk latency" and "disk utilization" charts.
MySQL works well under small loads, when it can cache all the data it needs. But when your results or queries get too big, or you request too many of them, it will start doing many disk I/O operations. This lets you handle huge loads and very big data, but the moment you exceed the memory allocated to MySQL, you will need to read and write everything to disk.
This would not be so bad if you were running on a local SSD drive. But from the device name I can see that you are running on an EBS volume, which is not a real hard drive. It is a networked drive, so all that traffic loads your network connection.
You have several options:
1.) Install mysqltuner, let the server operate for some time, then run it and see what it suggests. My guess is that it will suggest increasing your MySQL memory pool, decreasing the number of parallel connections, or restructuring your queries.
2.) Use an EC2 instance type with actual local storage (e.g. m3 or r3) and write to the local SSD. You can RAID several SSD drives to make it even faster.
3.) Use an EBS-optimized instance (dedicated EBS network bandwidth) and the correct volume type (some EBS volume types have I/O credits, similar to CPU credits for t-type instances, and when you run out of those, your operations slow to a crawl).
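To make the I/O-credit point concrete: a gp2 volume's baseline is 3 IOPS per GiB with a floor of 100, and it can burst to 3,000 IOPS on credits. A small sketch of that arithmetic (the helper name is mine):

```shell
# Baseline IOPS for a gp2 EBS volume: 3 IOPS per GiB, with a floor
# of 100. Volumes under ~1 TiB can burst to 3,000 IOPS, but only
# while I/O credits last.
gp2_baseline_iops() {
  local size_gib=$1
  local iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  echo "$iops"
}
# gp2_baseline_iops 50  -> 150
```

So a 50 GiB gp2 volume that has burned through its burst credits is throttled to 150 IOPS, which feels exactly like the crawl described above.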

Google Compute Engine disk snapshot stuck at creating

I took a snapshot of a 50GB volume (non-boot) which is attached to an instance. The snapshot was successful.
I shut down the instance and tried taking another snapshot of the same volume. This time the command hung; gcloud status shows "CREATING" for this attempt. It has been hours since I started the snapshot command. I tried the same thing using the Google Developers Console; the behaviour is the same.
I restarted the instance and the status of the snapshot changes to "READY" within seconds.
It seems that snapshots can only be taken while the volume is attached to a running instance; otherwise the command is queued and executed when the volume/instance is live. Is this expected behaviour?
I replicated your issue, and indeed the snapshot process halts when the instance is shut down. You may have also noticed that the Shutdown/Start feature has now been introduced; it was not available before.
I believe this is due to how snapshots are handled on the platform. Your first snapshot creates a full copy of the disk, while the second one is differential; the differential one will fail or stay pending because it cannot query the source disk to check what has changed while the instance is down.
Detaching the disk from the instance and then creating a snapshot works, so that could be a workaround for you.
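With the gcloud CLI, the detach-then-snapshot workaround might look like this (the helper name and every resource name passed in are placeholders):

```shell
# Sketch of the detach-then-snapshot workaround. All arguments are
# placeholders for your own disk, instance, zone and snapshot name.
snapshot_detached_disk() {
  local disk=$1 instance=$2 zone=$3 snap=$4
  gcloud compute instances detach-disk "$instance" --disk "$disk" --zone "$zone"
  gcloud compute disks snapshot "$disk" --snapshot-names "$snap" --zone "$zone"
  gcloud compute instances attach-disk "$instance" --disk "$disk" --zone "$zone"
}
# Usage: snapshot_detached_disk data-disk web-1 us-central1-a nightly-snap
```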
Hope this helps

Snapshot of EBS volume used for replication

I set up an EC2 instance with MySQL on an EBS volume, and set up another instance which acts as a slave for replication. The replication setup works fine. My question is about taking snapshots of these volumes. I noticed that the tables need to be locked for the snapshot process, which may cause inconvenience for users. So my idea is to leave the master instance alone and take snapshots of the instance acting as the slave. Is this a good idea? Has anyone out there got a similar setup who could point me in the right direction?
Also, taking a snapshot of the slave instance would require locking the tables. Would that mean replication breaks?
Thanks in advance.
Though it's a good idea to lock the database and freeze the file system when you initiate the snapshot, the actual API call to initiate the snapshot takes a fraction of a second, so your database and file system aren't locked/frozen for long.
That said, there are a couple other considerations you did not mention:
When you attempt to take the lock on the database, it might need to wait for other statements to finish before the lock is granted. During this time, your pending lock might cause further statements to wait until you get and release the lock. This can cause interruptions in the flow of statements on your production database.
After you initiate the creation of the snapshot, your application/database is free to use the file system on the volume, but if you have a lot of writes, you could experience high iowait, sometimes enough to create a noticeable slowdown of your application. The reason for this is that the background snapshot process needs to copy a block to S3 before it will allow a write to that block on the active volume.
I solve the first issue by requesting a lock and timing out if it is not granted quickly. I then wait a bit and keep retrying until I get the lock. Appropriate timeouts and retry delay may vary for different database loads.
I solve the second problem by performing the frequent, consistent snapshots on the slave instead of the master, just as you proposed. I still recommend performing occasional snapshots against the master simply to improve its intrinsic durability (a deep EBS property) but those snapshots do not need to be performed with locking or freezing as you aren't going to use them for backups.
I also recommend using a file system that supports flushing and freezing (XFS). Otherwise, you are snapshotting locked tables in MySQL that might not even have all their blocks on the EBS volume yet, or other parts of the file system might be modified and inconsistent in the snapshot.
If you're interested, I've published open source software that performs the best practices I've collected related to creating consistent EBS snapshots with MySQL and XFS (both optional).
http://alestic.com/2009/09/ec2-consistent-snapshot
To answer your last question, locking tables in the master will not break replication. In my snapshot software I also flush the tables with read lock to make sure that everything is on the disk being snapshotted and I add the keyword "LOCAL" so that the flush is not replicated to any potential slaves.
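Putting those pieces together, here is a minimal sketch of the sequence. One subtlety: the read lock is session-scoped, so everything has to happen inside a single mysql session (the volume ID and mount point below are placeholders, error handling is omitted, and ec2-consistent-snapshot does all of this properly):

```shell
# Sketch: hold FLUSH TABLES WITH READ LOCK in one mysql session while
# freezing the filesystem and initiating the snapshot. The volume ID
# and mount point are placeholders; the snapshot API call returns
# quickly, so the lock is held only briefly.
consistent_snapshot() {
  mysql <<'SQL'
FLUSH LOCAL TABLES WITH READ LOCK;
system xfs_freeze -f /data
system aws ec2 create-snapshot --volume-id vol-0123456789abcdef0
system xfs_freeze -u /data
UNLOCK TABLES;
SQL
}
```

Running FLUSH ... READ LOCK and UNLOCK TABLES as two separate mysql invocations would not work, since the lock is released as soon as the first session exits.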
You can definitely take a snapshot of the slave.
From your description, it does not seem like the slave is being used operationally.
If this is the case, then the safest method of obtaining a reliable volume snapshot would be to:
Stop the mysql server on the slave.
Start the snapshot (either through the AWS Console or from the command line).
When the snapshot is complete, restart mysqld on the slave server.
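A sketch of those three steps (the helper name and volume ID are placeholders, and the service name varies by distribution). Since the create-snapshot call captures a point-in-time view and EBS finishes copying in the background, mysqld can in fact be restarted as soon as the call returns:

```shell
# Sketch of the stop/snapshot/restart cycle on the slave. The volume
# ID is a placeholder; "service mysql" may be "systemctl ... mysqld"
# on your distribution.
snapshot_stopped_slave() {
  local volume=$1
  service mysql stop
  aws ec2 create-snapshot --volume-id "$volume" \
    --description "slave backup"
  service mysql start
}
# Usage: snapshot_stopped_slave vol-0123456789abcdef0
```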