AWS MySQL RDS instance becomes unresponsive and getting restarted automatically - mysql

We have a AWS MySQL RDS instance which is about 1.7T in size. Sometimes it becomes unresponsive and no operations can be performed.
CPU utilization, Write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drops to zero.
Number of connections get piled up.
"Show engine innodb status" hangs
Lot of queries (around 25 for each) by rdsadmin which are in hang state.
SELECT count(*) from mysql.rds_replication_status WHERE action = 'reset slave' and master_host is NULL and master_port is NULL GROUP BY action_timestamp,called_by_user,action,mysql_version,master_host,master_port ORDER BY action_timestamp LIMIT 1;
SELECT NAME, VALUE FROM mysql.rds_configuration;
After sometime, instance gets rebooted automatically with following error.
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resulution, a DB Snapshot will be performed after MySQL completes restarting.
What can be the issue? This happens quite often. Sometimes, for our surprise, this happens in off-peak times too.

I faced the same issue and raised an issue with AWS Support. Got the following explanation:
The RDS Monitoring service discovered issue regarding backing up Binary Logs of your databases which is critical for Point in Time Restore (PITR) feature. To mitigate this issue and in order to avoid data corruption, RDS monitoring restarted the RDS instance and hence a restart was automatically triggered. In order to make sure that there is no data loss it took a snapshot of DB instance.
Although the RDS instance was multi-AZ it didn't fail over because of following reason:
Multi AZ has 2 criteria:
1- Single Box Experience, which means that Customer always finds his data even after failover.
2- Higher Availability than Single AZ.
So both criteria have to be present when AWS monitoring service takes the Decision to failover to the standby instance, but in your case AWS monitoring service noticed some risk that can cause data loss after the failover and that is why it took the decision to reboot instead of failing over.
Hope this helps. This has happened to me 3 times in last one week though.

check your db maintenance window timing i mean when your schedule maintenance is happening , and note at what time this issue happening is it happening in regular interval or randomly .
check both mysql error logs and slow query logs.
if possible paste the suspected issue here

We were able to resolve this issue by upgrading the instances to 5.6.34.

Related

First query or connection to AWS RDS is very very slow

I have a product built with laravel, with multi-tenancy.
Deployed on EC2 instance and using AWS RDS as the database server.
I am currently having around 100 databases on the production.
Laravel's hyn tenancy module is handling the connections.
Now, the problem is for each tenant after some idle time, the first request takes too long. around 15-20 seconds. and after that, it works smoothly.
In the test environment, we are not using RDS but a local MySQL instance. and the problem does not occur in the test environment. the only difference between test and production is the AWS RDS.
I have looked into max connections, query cache, and so on... but no luck so far.
Any suggestions?
The solution will depend on what kind of RDS you have.
I assume it's serverless (more common). In that case, there's a setting for min and max for ACU. It will (I believe) go down to zero by default if the DB is not accessed in a while. Check that and see if it is properly set.
If you have a Provisioned DB, then it's more complex. It will start caching things once queries are executed but until a particular query is run, you will be waiting for the DB to "wake up" and run a full query.
Check this page for relevant info.

AWS MySQL RDS becomes unresponsive time to time

We have two MySQL RDS instances (Master and read replica). As usual we write to the master and read from the slave.
Master server works fine, but we observed that slave server becomes unresponsive time to time.
Observations:
Monitoring Graphs
CPU utilization drops down to 0
Increase in number of connections
Write IOPS, read IOPS, queue depth, write throughput, write latency and read latency drop to 0.
This can be resolved with a restart, but we are interested in finding the root cause. Basically when this happens, we can still log in to mysql prompt, but we can't execute any queries. AWS console shows instance as healthy, no errors are shown.
According to the graphs, there is no any abnormal activity or increase in resource utilization just before this happens. Everything looks normal.
(Small climbs in the attached graphs are normal. Those are in line with the business pattern. Historically instance survived against larger mountains)
Please let me know if you happen to come across such a situation.
Thanks.
Note:
Instance Information
db.m4.xlarge
IOPS 2000
Size 50G
Basically, instance is being under utilized when the issue happens
Note:
If we wait without restarting the instance, it gets restarted automatically with following error.
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resulution, a DB Snapshot will be performed after MySQL completes restarting.

Google compute costs?

I have gone over the pricing and documentation so many times but still do not understand pricing...
I picked a bare minimum server setup (CPU, RAM, etc). I am using this server as a development server (eventually) so it will be actively used about 6-8 hours a day, 5 days per week...when I entered these values in their "cost calculator" the result was a few bucks a month...perfect!
However I have been up and running for less than a week and already the price is $0.65 with a usage of 2,880.00 Minutes?!?!?!
So I am not being billed only for activity but for server uptime, entirely??? So even if the server sits idle, I am getting charged? Is there a way I can disable the instance during non-work hours? Re-enabling it when I arrive in the morning?
EDIT | how to stop compute engine instance without terminating the instance?
This may have answered my questions...
As the other question answered, you are billed by the minute while your server is running, whether or not it is actively using the CPU.
At the moment, there's no way to leave a server shut down and restart it later; you can use a persistent boot disk to store your development state and create/delete an instance (with the same name) each day that you want to use your server.
To use a persistent boot disk like this, you'll want to make sure that the "Delete boot disk when instance is deleted" checkbox is UNCHECKED -- you want your boot disk to stick around after the instance is deleted. The next time you create your instance, select "Use existing disk", and select the same disk.
You'll still pay $0.04/GB/month for the disk storage all the time, but you'll only pay for the instance running when you need it.
You can add a cron job that checks every 10 minutes to see if the load on the machine is less than 0.05 and no one is logged in and then runs "shutdown -p now" to shut down the machine if you're concerned about forgetting to shut down the machine.

AWS RDS MySQL Read Replica Lag Issues

I run a service that needs to be able to support about 4000+ IOPS and keep replica lag <=1 second to function properly.
I am using AWS RDS MySQL instances and have 2 read replica's. My service was experiencing giant replica lag spikes on the read replica's so I was in contact with AWS support for a week trying to understand why I was experiencing the lag--I had 6000 IOPS provisioned and my instances were very powerful. They gave me all kinds of reasons.
After changing instance types, upgrading to MySQL 5.6 from 5.5 to take advantage of multi-threading, and them replacing underlying hardware I was still seeing significant replica lag randomly.
Eventually I decided to start tinkering with the parameter groups changing my configs for just the read replica's on anything I could find that was involved in the replication process and am now finally experiencing <= 1 second of replica lag.
Here are the settings I changed and their values that appear to be successful (I copied the default mysql 5.6 param group and changed these values applying the updated paramater group to just the read replicas):
innodb_flush_log_at_trx_commit=0
sync_binlog=0
sync_master_info=0
sync_relay_log=0
sync_relay_log_info=0
Please read about each of these to understand the impact of the modifications: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html
Other things to make sure you take care of:
Convert any MyISAM tables to InnoDB
Upgrade from MySQL < 5.6 to MySQL >= 5.6
Ensure that your provisioned IOPS are > the combined read/write IOPS you require
Ensure that your read replica instances are >= master instance
If anyone else has any additional parameters that could be modified on the read replica's or master DB to get the best replication performance I'd love to hear more.
UPDATE 7-8-2014
To take advantage of Mysql 5.6 multi-thread replication I've set:
slave_parallel_workers=5 (Set it to the number of read replica DBs you have running)
I found this in this here:
https://blogs.oracle.com/MySQL/entry/benchmarking_mysql_replication_with_multi
Mysql replication executes all the transactions on a single database in order , and master - can execute those transactions in parallel.
You probably have most of the updates executed on a single DA, and that is what not allowing you to get advantage of multithreaded replication.
Check the iostat on your replica server. Most of the time those problem occurs because of high IO on the machine.
In order to decrease the IO on a machine - there are several additional changes that you can do:
Increase innodb_buffer_pool_size - this is the first thing you should change from default. If this instance runs only mysql - you can allocate about 80% of your available the memory here.
Verify also the following parameters:
log_slave_updates = false
binlog_format = STATEMENT
(if you have MIXED or ROW binlog_format configured - verify that you understand what does that means from here http://dev.mysql.com/doc/refman/5.6/en/binary-log-setting.html
If you have a lot of data that is being changed for several times - increasing
innodb_max_dirty_pages_pct to 90 or 95% can be worth checking.

How to find CPU utilazetion of mysql RDS instance

How to find which process/query consume CPU in amazon mysql RDS instance? I have medium instance on amazon RDS of mysql, and It is working smoothly previously, but since yesterday Its throwing error 'connection timeout' while accessing RDS instance. When I checked cloud watch, It shows me high CPU utilization during that period. Now I want to check what is the problem? So, can some one tell me how to check it?
thanks
use 'show processlist' in mysql. with this you can see which queries are in what state, doing what, since when
also check slow query log:
http://dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
Using show process list, you can see only current running thread info, but as per context of your query, you want to see the historical status. You can achieve it via enabling the slow query log and setting long query time to 1 second. You can pass slow query logs to cloud watch and can set alerts according to your db system loads and types of queries.