AWS RDS MySQL Read Replica Lag Issues

I run a service that needs to be able to support about 4000+ IOPS and keep replica lag <=1 second to function properly.
I am using AWS RDS MySQL instances and have 2 read replicas. My service was experiencing giant replica lag spikes on the read replicas, so I was in contact with AWS support for a week trying to understand why; I had 6000 IOPS provisioned and my instances were very powerful. They gave me all kinds of reasons.
After changing instance types, upgrading from MySQL 5.5 to 5.6 to take advantage of multi-threading, and having them replace the underlying hardware, I was still seeing significant replica lag at random.
Eventually I decided to start tinkering with the parameter groups, changing the configuration of just the read replicas for anything I could find that was involved in the replication process, and am now finally experiencing <= 1 second of replica lag.
Here are the settings I changed and the values that appear to be successful (I copied the default MySQL 5.6 parameter group and changed these values, applying the updated parameter group to just the read replicas):
innodb_flush_log_at_trx_commit=0
sync_binlog=0
sync_master_info=0
sync_relay_log=0
sync_relay_log_info=0
Please read about each of these to understand the impact of the modifications: http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html
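If you want to confirm that the copied parameter group actually took effect, a quick sanity check you can run on each read replica (just a sketch using standard MySQL syntax):
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('innodb_flush_log_at_trx_commit', 'sync_binlog',
      'sync_master_info', 'sync_relay_log', 'sync_relay_log_info');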
Other things to make sure you take care of:
Convert any MyISAM tables to InnoDB
Upgrade from MySQL < 5.6 to MySQL >= 5.6
Ensure that your provisioned IOPS are > the combined read/write IOPS you require
Ensure that your read replica instances are >= master instance
If anyone else has any additional parameters that could be modified on the read replicas or the master DB to get the best replication performance, I'd love to hear more.
UPDATE 7-8-2014
To take advantage of MySQL 5.6 multi-threaded replication I've set:
slave_parallel_workers=5 (Set it to the number of read replica DBs you have running)
I found this here:
https://blogs.oracle.com/MySQL/entry/benchmarking_mysql_replication_with_multi
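If you want to check that the setting is actually in effect on a replica, a quick sanity check from a SQL session (just a sketch):
-- confirm the parameter group change is in effect on the replica:
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers';
-- with workers > 0, SHOW PROCESSLIST should list additional "system user"
-- threads (the coordinator plus the parallel worker threads):
SHOW PROCESSLIST;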

MySQL replication executes all the transactions on a single database in order, while the master can execute those transactions in parallel.
You probably have most of your updates executed on a single DB, and that is what is preventing you from taking advantage of multi-threaded replication.
Check iostat on your replica server. Most of the time these problems occur because of high IO on the machine.
In order to decrease the IO on the machine, there are several additional changes you can make:
Increase innodb_buffer_pool_size - this is the first thing you should change from the default. If this instance runs only MySQL, you can allocate about 80% of your available memory here.
Verify also the following parameters:
log_slave_updates = false
binlog_format = STATEMENT
(if you have MIXED or ROW binlog_format configured, verify that you understand what that means from http://dev.mysql.com/doc/refman/5.6/en/binary-log-setting.html)
If you have a lot of data that is being changed several times, increasing innodb_max_dirty_pages_pct to 90 or 95% can be worth checking.
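A quick way to verify the values discussed above on the replica (a sketch; these are all standard MySQL 5.6 system variables):
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb,
       @@log_slave_updates          AS log_slave_updates,
       @@binlog_format              AS binlog_format,
       @@innodb_max_dirty_pages_pct AS max_dirty_pages_pct;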

Related

Most efficient way to clone an AWS RDS database?

I have 2 MySQL databases running on a server called X and Y, which both have identical content. A series of updates run throughout the day, which changes the content of X. At the end of the day, a process runs that compares the content of X with the content of Y (for various tables) in order to discover new rows, updated row data etc. Once the updates have been processed, mysqldump is used to dump X and then Y is overwritten with the dump. Both X and Y are now the same again, and the whole process repeats.
I'm investigating migration of these databases to Amazon RDS. What's the most efficient way to accomplish the process outlined above?
I understand that I can take a snapshot of a DB and restore it, but I think this is at the instance level only? That would mean I have to run 2 instances, which seems unnecessary. I don't have a problem running both databases on the same instance (I don't want to pay for more than one instance unnecessarily).
Do I just do what I'm doing now i.e. mysqldump X and restore it to Y, or is there some other method/shortcut that RDS provides?
Since the title concerns AWS instance migration, here is the best way in my case (it may vary for other cases):
Go to https://console.aws.amazon.com/rds
Select your DB instance
Actions -> Take Snapshot
Go to https://console.aws.amazon.com/rds
Select Snapshots from the left pane
Select the snapshot you just created
Actions -> Restore Snapshot
After the above steps you will be redirected to the RDS instance creation page; fill out the required fields as per your requirements and you are done with the migration :D
Consider migrating to RDS Aurora for MySQL.
It supports native copy-on-write clones of the entire database (meaning the server instance, not a single schema) without the need to make an actual "copy."
Copy-on-write means the "original" server and the "clone" share the same physical disk (called an Aurora Cluster Volume, which replicates itself twice across 3 Availability Zones, using a 4/6 quorum), with both servers sharing the same disk blocks until one of them makes a change... which is when the copy action actually occurs ("on write"). So you only use as much storage as is required to store your original working data set plus the changes that occurred after cloning.
No server is the master in such a setup; they all operate independently after cloning. I suspect I'm not doing this innovation justice with my description, as it involves quite a bit of dark magic. See the write-up (with illustrations of copy-on-write):
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Managing.Clone.html
Aurora is compatible with MySQL 5.6. To be more precise, Aurora is MySQL 5.6, with MyISAM removed and InnoDB heavily rewritten to optimize performance and work with the replicated Aurora Cluster Volume storage technology.
A bit late in the day, but I have just managed to do this by (1) creating a database backup to S3 and then (2) restoring the backup from S3 (note that the msdb.dbo.rds_* procedures below are the native backup/restore procedures of RDS for SQL Server), i.e.
a. Create the database backup in S3
EXEC msdb.dbo.rds_backup_database @source_db_name = '<database-name-goes-here>'
,@s3_arn_to_backup_to = 'arn:aws:s3:::<bucket-name-goes-here>/<backup-filename-goes-here>.bak'
,@overwrite_S3_backup_file = 1;
b. Wait for the task to complete. You can execute the following SQL to check this:
exec msdb.dbo.rds_task_status @db_name='<database-name-goes-here>';
c. When the lifecycle is "SUCCESS" you can then restore from the S3 bucket using the following command:
exec msdb.dbo.rds_restore_database @restore_db_name='<new-database-name-goes-here>'
,@s3_arn_to_restore_from='arn:aws:s3:::<bucket-name-goes-here>/<backup-filename-goes-here>.bak';
d. Again, you can monitor the status of the restore with the following SQL command:
exec msdb.dbo.rds_task_status @db_name='<database-name-goes-here>';
You could set up an AWS MySQL RDS instance as a slave of an external master.
After loading a full dump into RDS, call the stored procedure mysql.rds_set_external_master like this:
mysql> call mysql.rds_set_external_master ('10.10.3.2', 3306, 'replica', 'password', 'mysql-bin-changelog.122', 108433, 0);
Then start the replication by doing:
mysql> call mysql.rds_start_replication;
Once you have data in sync you can promote RDS to master by doing:
mysql> call mysql.rds_stop_replication;
mysql> call mysql.rds_reset_external_master;
By doing this with either your external X or Y server, the AWS RDS instance behaves like a replica, one you could use as your future master if required.
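Before stopping replication and promoting, it is worth confirming that the RDS replica has fully caught up; a minimal check from the mysql client:
-- on the RDS instance acting as replica: both threads should be running
-- and Seconds_Behind_Master should be 0 before you stop replication
SHOW SLAVE STATUS\G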

AWS MySQL RDS instance becomes unresponsive and getting restarted automatically

We have an AWS MySQL RDS instance which is about 1.7T in size. Sometimes it becomes unresponsive and no operations can be performed.
CPU utilization, write IOPS, read IOPS, queue depth, write throughput, write latency and read latency all drop to zero.
The number of connections piles up.
"SHOW ENGINE INNODB STATUS" hangs.
There are a lot of queries (around 25 of each) by rdsadmin which are in a hung state:
SELECT count(*) from mysql.rds_replication_status WHERE action = 'reset slave' and master_host is NULL and master_port is NULL GROUP BY action_timestamp,called_by_user,action,mysql_version,master_host,master_port ORDER BY action_timestamp LIMIT 1;
SELECT NAME, VALUE FROM mysql.rds_configuration;
After some time, the instance gets rebooted automatically with the following error:
MySQL restart initiated to address MySQL induced log backup issues. Note that as part of this resolution, a DB Snapshot will be performed after MySQL completes restarting.
What can be the issue? This happens quite often. Sometimes, to our surprise, it happens at off-peak times too.
I faced the same issue and raised it with AWS Support. I got the following explanation:
The RDS Monitoring service discovered an issue with backing up the binary logs of your databases, which is critical for the Point in Time Restore (PITR) feature. To mitigate this issue and in order to avoid data corruption, RDS monitoring restarted the RDS instance, hence a restart was automatically triggered. In order to make sure that there is no data loss, it took a snapshot of the DB instance.
Although the RDS instance was Multi-AZ, it didn't fail over for the following reason:
Multi-AZ has 2 criteria:
1- Single Box Experience, which means that the customer always finds their data, even after a failover.
2- Higher availability than Single-AZ.
So both criteria have to be met when the AWS monitoring service takes the decision to fail over to the standby instance, but in your case the AWS monitoring service noticed some risk that could cause data loss after the failover, and that is why it took the decision to reboot instead of failing over.
Hope this helps. This has happened to me 3 times in the last week though.
Check your DB maintenance window timing, i.e. when your scheduled maintenance happens, and note at what time this issue occurs: does it happen at a regular interval or randomly?
Check both the MySQL error logs and the slow query logs.
If possible, paste the suspected issue here.
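If the slow query log is written to a table (log_output = TABLE, which is common on RDS since there is no filesystem access), a sketch for pulling the most recent entries from a SQL session, assuming slow_query_log is enabled:
SELECT start_time, user_host, query_time, rows_examined,
       LEFT(sql_text, 200) AS sql_snippet
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 20;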
We were able to resolve this issue by upgrading the instances to 5.6.34.

MySQL Master-Master replication performance

I have the following situation:
I have to set up a high-performance server cluster with maximum availability with nginx and MySQL. The cluster consists of four web servers which are load balanced with nginx+gluster, which works just fine.
In addition there's another server with 2 SSDs in RAID1. On that server I intend to install 2 VMs, each with 12GB of RAM, where I will set up the MySQL cluster with master-master replication.
But that only prevents the system from breaking down if the MySQL service breaks down on one of the VMs, not if the host system is offline.
To counter that I thought of adding 2 more nodes on other machines to the MySQL cluster as failover. Unfortunately I don't have more machines with SSDs.
Now my question: would I have to expect performance issues because of the much slower hard drives on the failover machines? And if so, would these issues occur only when inserting data or also with pure SELECT queries?
Of course I'd set the load balancer to prioritize the faster nodes.

MySQL/Magento Performance issues on Amazon EC2

We are running 2 web servers that host a Magento eCommerce site and 1 MySQL database server on Amazon EC2.
We are experiencing major performance issues, deadlocks, 'lock wait timeout exceeded' errors, etc. on the MySQL server and are really struggling to get these resolved.
We recently upgraded the DB server to an m1.xlarge instance (from m1.large) and we are still experiencing these problems.
We've been attributing these issues to the bad disk IO we often see on EC2 servers, but recently I've seen issues with deadlocks etc. even when the disk IO is fine.
The "sar" command shows that we have pretty poor disk IO performance at peak times or when we perform database-intensive operations like creating invoices via the Magento API. We often see iowait go up to over 20%.
Below is a link to a screenshot showing the results of "mtop" during a recent problem we had where a query was causing a slowdown of the entire database:
http://i.imgur.com/AARlc.png
This screenshot shows one query or another holding up the rest of the queries from executing. It also shows quite a low load average; often we see the load average going up to 3.0 when an intensive command is being executed.
Here are the my.cnf settings:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
symbolic-links=0
innodb_file_per_table=1
key_buffer=512M
max_allowed_packet=64M
table_cache=512
innodb_thread_concurrency=5
innodb_buffer_pool_size=4976M
innodb_additional_mem_pool_size=8M
innodb_log_file_size=128M
innodb_log_buffer_size=8M
thread_cache_size=150
sort_buffer_size=4M
read_buffer_size=4M
read_rnd_buffer_size=2M
myisam_sort_buffer_size=64M
tmp_table_size=256M
query_cache_type=1
query_cache_size=128M
max_connections=400
wait_timeout=28800
innodb_lock_wait_timeout=120
max_heap_table_size=256M
long_query_time=3
log-slow-queries=...mysql-slow.log
[mysqld_safe]
log-error=...mysqld.log
pid-file=...mysqld.pid
We have used the pt-query-digest function extensively to analyze our MySQL slow query log.
Basically we are seeing that the sales_flat_quote table is extremely slow with updates and inserts, but so are a number of other tables.
sales_flat_quote is not particularly large though; there are only around 100k rows in the table.
Several root causes are possible:
Some of your slow queries may be locking tables, thus queueing other queries
Your queries may not be optimized
Your queries may need more indexes on some tables
Check your slowest queries using this official tool:
mysqldumpslow
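On the locking point above, here is a sketch of how to see which transaction is blocking which during a pile-up, using the InnoDB tables in information_schema (available in MySQL 5.5/5.6):
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;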
We observed similar hogs on our MySQL EC2 server; however, we quickly migrated our database to an RDS instance. Since then, there have been very few problems. One might argue that RDS is costly and EC2 is not; however, you would also save on the time spent managing databases, daily backups, etc.
I recommend migrating your database to an RDS instance.

MySql Replication - slave lagging behind master

I have master/slave replication on my MySQL DB.
My slave DB was down for a few hours and is back up again (the master was up all the time). When issuing SHOW SLAVE STATUS, I can see that the slave is X seconds behind the master.
The problem is that the slave doesn't seem to catch up with the master; the X seconds behind master doesn't seem to drop...
Any ideas on how I can help the slave catch up?
Here is an idea
In order to know whether MySQL is fully processing the SQL from the relay logs, try the following:
STOP SLAVE IO_THREAD;
This will stop replication from downloading new entries from the master into its relay logs.
The other thread, known as the SQL thread, will continue processing the SQL statements it downloaded from the master.
When you run SHOW SLAVE STATUS\G, keep your eye on Exec_Master_Log_Pos. Run SHOW SLAVE STATUS\G again. If Exec_Master_Log_Pos does not move after a minute, you can go ahead and run START SLAVE IO_THREAD;. This may reduce the number of Seconds_Behind_Master.
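Putting those steps together (a sketch; run in the mysql client on the slave):
STOP SLAVE IO_THREAD;   -- stop downloading new entries from the master
SHOW SLAVE STATUS\G     -- note Exec_Master_Log_Pos
-- wait a minute, then run SHOW SLAVE STATUS\G again; if Exec_Master_Log_Pos
-- has stopped moving, the SQL thread has caught up with the relay logs
START SLAVE IO_THREAD;  -- resume downloading from the master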
Other than that, there is really nothing you can do except to:
Trust Replication
Monitor Seconds_Behind_Master
Monitor Exec_Master_Log_Pos
Run SHOW PROCESSLIST;, take note of the SQL thread to see if it is processing long running queries.
BTW Keep in mind that when you run SHOW PROCESSLIST; with replication running, there should be two DB Connections whose user name is system user. One of those DB Connections will have the current SQL statement being processed by replication. As long as a different SQL statement is visible each time you run SHOW PROCESSLIST;, you can trust mysql is still replicating properly.
What binary log format are you using? Are you using ROW or STATEMENT?
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
If you are using ROW as the binlog format, make sure that all your tables have a Primary or Unique Key:
SELECT t.table_schema, t.table_name, engine
FROM information_schema.tables t
INNER JOIN information_schema.columns c
    ON t.table_schema = c.table_schema
   AND t.table_name = c.table_name
   AND t.table_schema NOT IN ('performance_schema','information_schema','mysql')
GROUP BY t.table_schema, t.table_name
HAVING SUM(IF(column_key IN ('PRI','UNI'), 1, 0)) = 0;
If you execute, for example, one DELETE statement on the master to delete 1 million records from a table without a PK or unique key, then only one full table scan will take place on the master's side, which is not the case on the slave.
When the ROW binlog_format is being used, MySQL writes the row changes to the binary logs (not the statement, as with the STATEMENT binlog_format) and that change will be applied on the slave's side row by row, which means 1 million full table scans will take place on the slave to reflect just one DELETE statement on the master, and that is what causes the slave lagging problem.
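If the query above turns up such tables, the usual fix is to give them a primary key so row-based events can locate rows without a full table scan on the slave; a sketch with hypothetical schema/table/column names:
-- hypothetical example: add a surrogate primary key to a PK-less table
ALTER TABLE mydb.mytable
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;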
"seconds behind" isn't a very good tool to find out how much behind the master you really is. What it says is "the query I just executed was executed X seconds ago on the master". That doesn't mean that you will catch up and be right behind the master the next second.
If your slave is normally not lagging behind and the work load on the master is roughly constant you will catch up, but it might take some time, it might even take "forever" if the slave is normally just barely keeping up with the master. Slaves operate on one single thread so it is by design much slower than the master, also if there are some queries that take a while on the master they will block replication while running on the slave.
Just check that you have the same time and time zone on both servers, i.e., the master as well as the slave.
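A quick way to compare (run this on both the master and the slave and compare the output):
SELECT @@global.time_zone, @@system_time_zone, NOW();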
If you are using InnoDB tables, check that you have innodb_flush_log_at_trx_commit set to a value different than 0 on the SLAVE.
http://dev.mysql.com/doc/refman/4.1/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit
We had exactly the same issue after setting up our slave from a recent backup.
We had changed the configuration of our slave to be more crash-safe:
sync_binlog = 1
sync_master_info = 1
relay_log_info_repository = TABLE
relay_log_recovery = 1
I think that especially sync_binlog = 1 causes the problem, as the specs of this slave are not as fast as the master's. This config option forces the slave to write every transaction to the binary log before it is executed (instead of the default of every 10k transactions).
After setting these config options back to their default values, I can see that the slave is catching up again.
Just to add the findings from my similar case:
There were a few bulk temporary-table insert/update/delete operations happening on the master which occupied most of the space in the relay log on the slave. And since MySQL 5.5 replication is single threaded, the CPU was always at 100% and it took a lot of time to process these records.
All I did was add these lines to the MySQL cnf file:
replicate-ignore-table=<dbname>.<temptablename1>
replicate-ignore-table=<dbname>.<temptablename2>
and everything became smooth again.
In order to figure out which tables are taking up the most space in the relay log, try the following commands and then open the output in a text editor. You may get some hints:
cd /var/lib/mysql
mysqlbinlog relay-bin.000010 > /root/RelayQueries.txt
less /root/RelayQueries.txt
If you have multiple schemas, consider using multi-threaded slave replication. This is a relatively new feature.
This can be done dynamically without stopping the server. Just stop the slave SQL thread:
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 4;
START SLAVE SQL_THREAD;
I had an issue similar to this, with both of my MySQL servers (master and replica) hosted on AWS EC2. Increasing the EBS disk size (which automatically increased IOPS) for the MySQL slave server turned out to be the solution for me. R/W throughput and bandwidth increased and R/W latency decreased.
Now my MySQL database replication is catching up to the master, and Seconds_Behind_Master has decreased (it used to increase from day to day).
So if you have MySQL hosted on EC2, I suggest you try increasing the EBS disk size or its IOPS on the slave.
I know it's been a while since the OP asked, but it would have helped me to read the following answer.
In /etc/mysql/mysql.cnf :
[mysqld]
disable_log_bin
innodb_flush_log_at_trx_commit=2
innodb_doublewrite = 0
sync_binlog=0
disable_log_bin REALLY did the trick for me.