RDS Multi-AZ bottlenecking write performance - mysql

We are using an RDS MySQL 5.6 instance (db.m3.2xlarge) in the sa-east-1 region, and during write-intensive operations we are seeing (on CloudWatch) that both our Write Throughput and the Network Transmit Throughput are capped at 60MB/s.
We suspected that Multi-AZ could be responsible for this behaviour and turned it off for testing purposes. We ran the same operation and noticed that the Write Throughput wasn't capped anymore and the Network Transmit Throughput was actually zero. This reinforced the idea that this network traffic is between the primary instance and the failover instance in the Multi-AZ setup.
Here is the Cloudwatch chart showing the operation without Multi-AZ and right after the same one with Multi-AZ enabled:
We tried upgrading the instance to one with the highest network performance and also tried Provisioned IOPS, but there was no change: when Multi-AZ is on, we are always capped at 60MB/s for writes.
It's our understanding that Multi-AZ uses synchronous data replication, but we can't find any information on the bandwidth limits for the link through which this replication occurs. Does anyone know anything about it and how to avoid these limits? Or should we live with it?

I don't think you're seeing a limitation of the replication service per se, but it appears that your replication bandwidth shares the same transport as the EBS volume on your instance, thus it's a limitation of the Ethernet bandwidth available to your instance itself (remembering that EBS is network-attached storage).
The network connection on an m3.2xlarge is 1000 Mbit/s, which is equivalent to 125 MB/s.
Divide that number by two and you get ~60 MB/s for writing to the local instance's EBS volume and another ~60 MB/s for writing to the synchronous replica.
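To make that arithmetic explicit, here is a minimal sketch. It assumes a 1000 Mbit/s link in decimal units and an even split between EBS writes and replication traffic; both are assumptions of the hypothesis, not documented AWS behaviour, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "shared network link" hypothesis.
# Assumptions: a 1000 Mbit/s link (decimal units), shared evenly between
# EBS writes and Multi-AZ replication traffic.
LINK_MBIT_PER_S = 1000


def per_stream_mbyte_per_s(link_mbit_per_s, streams=2):
    """Return the MB/s available to each of `streams` equal consumers."""
    link_mbyte_per_s = link_mbit_per_s / 8  # 8 bits per byte
    return link_mbyte_per_s / streams


per_stream = per_stream_mbyte_per_s(LINK_MBIT_PER_S)
print(f"{per_stream:.1f} MB/s per stream")
```

The result lands close to the 60MB/s cap observed on CloudWatch, which is why the numbers look suspicious.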
Unfortunately, the implementation details of Multi-AZ replication are not something AWS has publicly explained in enough detail to say conclusively that this is indeed the explanation, but the numbers are suspiciously close to what would be predicted if it is correct.
The m3 family and m4 family of instances have similar specs but also (apparently) some fundamental design differences, so it might be informative to see if the same behavior is true of the m4.2xlarge.

I have experienced the same issue: after activating Multi-AZ, the Write Latency increased dramatically:
(The instance type is m4.4xlarge)
The reason looks to be the synchronous replication process: each write has to wait until both DBs have acknowledged the modification.
It looks like there is no solution and this is expected behaviour:
DB instances using Multi-AZ deployments may have increased write and
commit latency compared to a Single-AZ deployment, due to the
synchronous data replication that occurs
– from AWS documentation
Here is an interesting Reddit thread about this:
https://www.reddit.com/r/aws/comments/61ewvp/rds_multiaz_slow_insert/
The only recommendation I see is moving to Aurora :/

Well, I never got an ACTUAL explanation from anywhere, but after tons of tests it seems that the m3.2xlarge is actually "bugged". I wrote a detailed explanation in my blog.

Related

AWS RDS MYSQL Optimization For Writes (INSERT, DELETE, UPDATE)

We have an RDS t3.small instance which we perform write actions on. We have 2 read replicas for this instance, which we route to using a weighted routing policy via Route53. The reads are fine for now, but we are getting massive numbers of operations on the write (master) database instance. CPU Utilization is nearing the baseline of 20% and connections keep increasing (we are using connection pools, but the traffic is too much).
Any possibilities on how we can manage load for writes? Is it possible to launch another instance for the same rds mysql database?
You can't distribute writes across multiple instances, they all have to go to the master instance. It sounds like you will soon need to increase the size of the writer instance. If downtime is a concern, you could add a new, larger instance as a read replica, and then promote it as the new writer instance.
All RDS instance types suck for write performance. They all use remotely-attached EBS storage, and remote storage incurs a heavy performance penalty for I/O.
At my last job, I benchmarked RDS versus Aurora for MySQL, and also benchmarked our physical servers (non-cloud) and also tested MySQL installed manually on EC2 i3 instances. RDS had consistent poor performance by a wide margin.
The t3 instances will be even worse, because they use burstable performance. Their baseline performance assumes a very light load, and they can add a burst of extra performance but only for short periods.
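As a rough illustration of how burstable credits work, here is a sketch of the accounting. It assumes 2 vCPUs and the 20% baseline mentioned in the question, and that one CPU credit equals one vCPU at 100% for one minute; the function names and the 50% load figure are hypothetical. Whether the instance is actually throttled once credits run out depends on the credit mode it runs in.

```python
# Sketch of burstable-instance CPU credit accounting.
# Assumptions: 2 vCPUs, 20% baseline per vCPU, and
# 1 credit = 1 vCPU at 100% utilization for 1 minute.
VCPUS = 2
BASELINE = 0.20


def credits_per_hour(vcpus, utilization):
    """Credits corresponding to running `vcpus` at `utilization` for an hour."""
    return vcpus * utilization * 60


def net_credits_per_hour(vcpus, baseline, actual):
    """Positive: the balance grows. Negative: the balance drains."""
    return credits_per_hour(vcpus, baseline) - credits_per_hour(vcpus, actual)


earned = credits_per_hour(VCPUS, BASELINE)
net_at_half_load = net_credits_per_hour(VCPUS, BASELINE, 0.50)
print(f"earned per hour: {earned:.0f}")
print(f"net change per hour at 50% CPU: {net_at_half_load:.0f}")
```

A negative net value means the instance is living off its saved credits, so sustained write load above the baseline cannot last.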
If you have write performance issues, especially for an application that requires consistent high performance, then you should upgrade to a more powerful instance type such as M5 or R5.
I would move away from RDS. I think it's useful only for very light load, or for temporary use during testing or development. I would recommend using Aurora instead of RDS, but my first preference would be to operate MySQL myself on EC2 i3 instances.

Resize Amazon RDS storage

We are currently working with a 200 GB database and we are running out of space, so we would like to increase the allocated storage.
We are using General Purpose (SSD) and a MySQL 5.5.53 database (without Multi-AZ deployment).
If I go to the Amazon RDS menu and change the Allocated storage to a bit more (from 200 to 500) I get the following "warnings":
Deplete the initial General Purpose (SSD) I/O credits, leading to longer conversion times: What does this mean?
Impact instance performance until operation completes: And this is the most important question for me. Can I resize the instance with 0 downtime? I mean, I don't care if the queries are a bit slower as long as they work while it's resizing, but what I don't want to do is stop all my production websites, resize the instance, and open them again (aka have downtime).
Thanks in advance.
You can expect degraded performance but you should really test the impact in a dev environment before running this on production so you're not caught off guard. If you perform this operation during off-peak hours you should be fine though.
To answer your questions:
RDS instances can burst with something called I/O credits. Burst means performance can go above the baseline to meet spikes in demand. It shouldn't be a big deal if you burn through them unless your instance relies on them (you can determine this from the RDS instance metrics). Have a read through I/O Credits and Burst Performance.
Changing the disk size will not result in a complete RDS instance outage, just performance degradation, so it's better to do it during off-peak hours to minimise the impact as much as possible.
First, according to the RDS FAQs, there should be no downtime at all as long as you are only increasing storage size and not upgrading the instance tier.
Q: Will my DB instance remain available during scaling?
The storage capacity allocated to your DB Instance can be increased
while maintaining DB Instance availability.
Second, according to RDS documentation:
Baseline I/O performance for General Purpose SSD storage is 3 IOPS for
each GiB, which means that larger volumes have better performance....
Volumes below 1 TiB in size also have ability to burst to 3,000 IOPS
for extended periods of time (burst is not relevant for volumes above
1 TiB). Instance I/O credit balance determines burst performance.
I cannot say for certain why, but I guess that when RDS increases the disk size, it may defragment the data or rearrange data blocks, which causes heavy I/O. If your server is under heavy usage during the resize, it may fully consume the I/O credits, resulting in lower I/O throughput and longer conversion times. However, given that you started with 200GB, I suppose it should be fine.
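To put numbers on the quoted documentation, here is a small sketch of the baseline before and after the resize in the question. It uses only the figures quoted above (3 IOPS per GiB, burst to 3,000 IOPS below 1 TiB); the function names are made up, and real volumes may have additional minimums or caps that are not modelled here.

```python
# Baseline IOPS for General Purpose SSD, using the figures quoted above.
# Not modelled: any minimum or maximum IOPS limits AWS may apply.
IOPS_PER_GIB = 3
BURST_IOPS = 3000
ONE_TIB_IN_GIB = 1024


def baseline_iops(size_gib):
    """Baseline IOPS for a volume of `size_gib` GiB."""
    return IOPS_PER_GIB * size_gib


def can_burst(size_gib):
    """Burst only applies to volumes below 1 TiB."""
    return size_gib < ONE_TIB_IN_GIB


for size in (200, 500):
    print(f"{size} GiB: baseline {baseline_iops(size)} IOPS, "
          f"burst to {BURST_IOPS}: {can_burst(size)}")
```

So besides the extra space, going from 200 to 500 should also raise the baseline, which makes the instance less dependent on burst credits afterwards.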
Finally, I would suggest using a Multi-AZ deployment if you are that worried about downtime or performance impact. During maintenance windows or snapshots, there will be a brief I/O suspension for a few seconds, which can be avoided with a standby or read replicas.
The technical answer is that AWS supports no downtime when scaling storage.
However, in the real world you need to factor how busy your current database is and how the "slowdown" will affect users. Consider the possibility that connections might timeout or the site may appear slower than usual for the duration of the scaling event.
In my experience, RDS storage resizing has been smooth and without problems. However, we pick the best time of day (least busy) to do it. We also go through a backup procedure: we take a snapshot and bring up a standby server that we can switch over to manually, just in case.

AWS RDS MySQL performance drop after random timespan

QUESTION OUTLINE
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage. (see below the question update for detailed problem description)
Question Update
So after more than one month of investigating and some developer support by AWS, I am not exactly closer to a solution.
Here are a couple of steps which I checked off the list, more or less without any further hint of the problem:
Index / Fragmentation (all tables have correct indexes/keys and have no fragmentation)
MySQL Stats Update (manually updating stats source)
Thread Concurrency (changing innodb_thread_concurrency to various different parameters)
Query Cache Hit Ratio doesn't show problems
EXPLAIN to see if any SELECTs are actually slow or not using indexes/keys
SLOW QUERY LOG (returns no results, because see paragraph below, it's a number of prepared SELECTs)
RDS and EC2 are within one VPC
For explanation, the used PlayFramework (2.3.8) has BoneCP and we are using eBeans to select our data. So basically I am running through a nested object and all those child objects, this produces a couple of hundred prepared SELECTs for the API call in question. This should basically also be fine for the used hardware, neither CPU nor RAM are extensively used by these operations.
I also included NewRelic for more insights on this issue and did some JVM profiling. Obviously, most of the time is consumed by NETTY/eBeans?
Is anyone able to make sense of this?
ORIGINAL QUESTION: Problem Outline
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage.
Infrastructure
We run a PlayFramework backend for a mobile app on AWS EC2 instances, connected to AWS RDS MySQL instances, one PROD environment, one DEV environment. Usually the PROD EC2 instance is pointing to the PROD RDS instance, and the DEV EC2 points to the DEV RDS (hi from captain obvious!); however sometimes we also let the DEV EC2 point to the PROD DB for some testing purposes. The PlayFramework in use is working with BoneCP.
Detailed Problem Description
In a quite essential sync process, our app is making a certain API call many times a day per user. I discussed the backgrounds of the functionality in this SO question, where, thanks to comments, I could nail the problem down to be a MySQL issue of some kind.
In short, the API call is loading a set of data, the maximum is about 1MB of json data, which currently takes about 18s to load. When things are running perfectly fine, this takes about 4s to load.
Curiously enough, what "solved" the problem last time was upgrading the RDS instance to another instance type (from db.m3.large to db.m4.large, which is just a very marginal step). Now, after about 2-3 weeks, the RDS instance is once again performing as slowly as before. Rebooting the RDS instance had no effect. Re-launching the EC2 instance had no effect either.
I also checked whether the indices of the affected MySQL tables are set properly, which is the case. The API call itself is not eager-loading any BLOB fields or similar, I double-checked this. The CPU usage of the RDS instances is below 1% most of the time; when I stress-tested it with 100 simultaneous API calls, it went to ~5%, so this is not the bottleneck. Memory is fine too, so I guess the RDS instance doesn't start swapping, which could slow down the whole process.
As hard evidence, a (smaller) public API call on the DEV environment currently takes 2.30s to load, while on the PROD environment it takes 4.86s. This is interesting, because the DEV environment has a much smaller instance type in both EC2 and RDS. So basically the turtle wins the race here. (If you are interested in this API call I am happy to share it with you via PN, but I don't really want to post links to API calls, even if they are basically public.)
Conclusion
In conclusion, it feels (I wittingly say 'feels') like the DB is clogged after x days of usage / after a certain number of API calls. I am not sure if this is an RDS-specific issue; once I 'largely' reset the DB instance by changing the instance type, things run fast and smooth. But re-creating my DB instance from a snapshot every 2 weeks is not an option, especially if I don't understand why this is happening.
Do you have any ideas what further steps I could take to investigate this matter?
(Too long for just a comment) I know you have checked a lot of things, but I would like to look at them with a different set of eyes...
Please provide
SHOW VARIABLES; (probably need post.it or something, due to size)
SHOW GLOBAL STATUS;
how much RAM? Sounds like 7.5G
The query. -- Unclear what query/queries you are using
SHOW CREATE TABLE for the table(s) in the query -- indexes, datatypes, etc
(Some of the above may help with "clogging over time" question.)
Meanwhile, here are some guesses/questions/etc...
Some other customer sharing the hardware is busy.
It could be a network problem?
Shrink long_query_time to 1 so you can catch slow queries.
When are backups done on your instance?
4s-18s to load a megabyte -- what percentage of that is SQL statements?
Do you "batch" the inserts? Is it a single transaction? Are lengthy queries going on at the same time?
What, if any, MySQL tunables did you change from the AWS defaults?
6GB buffer_pool on a 7.5GB partition? That sounds dangerously tight. Can you see if there was any swapping?
Any PARTITIONing involved? (Of course the CREATE will answer that.)
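Once the SHOW GLOBAL STATUS output is available, one thing worth checking with respect to the buffer pool question is how often reads miss the buffer pool and go to disk. Here is a minimal sketch, assuming the output was saved as tab-separated "name value" lines. The sample values are made up, and the helper names are hypothetical; Innodb_buffer_pool_reads and Innodb_buffer_pool_read_requests are standard MySQL status counters.

```python
# Parse saved SHOW GLOBAL STATUS output and compute the InnoDB buffer pool
# miss ratio. The sample text below is invented for illustration only.
SAMPLE_STATUS = """\
Innodb_buffer_pool_read_requests\t2000000
Innodb_buffer_pool_reads\t50000
Threads_connected\t12
"""


def parse_status(text):
    """Turn tab-separated 'name<TAB>value' lines into a dict of strings."""
    status = {}
    for line in text.splitlines():
        if "\t" in line:
            name, value = line.split("\t", 1)
            status[name.strip()] = value.strip()
    return status


def buffer_pool_miss_ratio(status):
    """Fraction of logical reads that had to be served from disk."""
    requests = int(status["Innodb_buffer_pool_read_requests"])
    disk_reads = int(status["Innodb_buffer_pool_reads"])
    return disk_reads / requests if requests else 0.0


status = parse_status(SAMPLE_STATUS)
print(f"miss ratio: {buffer_pool_miss_ratio(status):.2%}")
```

Comparing this ratio between a "fast" day and a "slow" day would show whether the slowdown coincides with more reads going to disk.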
There is one very important bit of information missing from your description: the total allocated space for the database. Baseline I/O for RDS General Purpose storage is around 3 IOPS per GB of allocated space, so for a 100GB allocation, you should get around 300 IOPS. That allocated space also includes logs.
Since you don't really know what's going on, the first step should be to turn on detailed monitoring, which will give you more idea of what is happening on the instance.
Until you have additional stats gathered during a slowdown, you can try increasing the allocated space, which will increase the IOPS available.
Also, check the events for the db - are logs getting purged on a regular basis? That might indicate that there's not enough space.
Finally, you can try going with PIOPS (provisioned IOPS) if you have an idea of what the application needs, though at this point it sounds like that would be a guess.
Maybe your burst credit balance is (slowly) being depleted? Eventually you end up with baseline performance, which may appear "too slow".
This would also explain why the upgrade to another instance type helped, as you then start with a full burst balance again.
I would suggest increasing the size of the volume, even if you don't need the extra space, as the baseline performance grows linearly with volume size.
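To see whether slow depletion could plausibly take one to two weeks, here is a rough sketch. The initial bucket of 5.4 million I/O credits is the figure I recall from the EBS General Purpose SSD documentation, so treat it as an assumption; the 100 GB volume and the 306 IOPS average load are hypothetical numbers, and the function name is made up.

```python
# Estimate how long a burst credit bucket lasts under a sustained load
# slightly above baseline.
# Assumptions: 3 IOPS per GB baseline, and an initial bucket of 5.4 million
# I/O credits (recalled from EBS documentation, verify before relying on it).
SECONDS_PER_DAY = 86_400


def days_until_credits_exhausted(size_gb, sustained_iops,
                                 initial_credits=5_400_000, iops_per_gb=3):
    """Return days until the bucket is empty, or None if it never drains."""
    baseline = size_gb * iops_per_gb
    drain_per_second = sustained_iops - baseline
    if drain_per_second <= 0:
        return None  # at or below baseline the balance does not shrink
    return initial_credits / drain_per_second / SECONDS_PER_DAY


# Hypothetical: 100 GB volume (300 IOPS baseline), 306 IOPS average load.
days = days_until_credits_exhausted(100, 306)
print(f"bucket empty after about {days:.1f} days")

# The same load on a 200 GB volume stays below baseline.
print(days_until_credits_exhausted(200, 306))
```

With these made-up numbers the bucket lasts roughly ten days, which is in the same range as the 7-14 days described in the question, and doubling the volume size makes the problem disappear.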

Performance effects of moving mysql db to another Amazon EC2 instance

We have an EC2 running both apache and mysql at the moment. I am wondering if moving the mysql to another EC2 instance will increase or decrease the performance of the site. I am more worried about the network speed issues between the two instances.
EC2 instances in the same availability zone are connected via a 10,000 Mbps network - that's faster than a good solid state drive on a SATA-3 interface (6Gb/s)
You won't see any performance drop by moving a database to another server, in fact you'll probably see a performance increase because of having separate memory and cpu cores for the two servers.
If your worry is network latency then forget about it - not a problem on AWS in the same availability zone.
Another consideration is that you're probably storing your website & db files on an EBS mounted volume. That EBS block is stored off-instance, so you're actually accessing a storage array over the same super-fast 10Gbps network.
So what I'm saying is... with EBS your website and database are already talking across the network to get their data, so putting them on separate instances won't really change anything in that respect, besides giving more resources to both servers. More resources means more data stored locally in memory and more performance.
The answer depends largely on what resources Apache and MySQL are using. They can happily cohabit if demands on your website are low and each is configured with enough memory that it doesn't spill over into virtual memory. In that case, they are best kept together.
As traffic grows, or your application grows, you will benefit from splitting them out because they can then both run inside dedicated memory. Provided that the instances are in the same region then you should see fast performance between them. I have even run a web application in Europe with the DB in USA and performance wasn't noticeably bad! I wouldn't recommend that though!
Because AWS is easy and cheap, your best bet is to set it up and benchmark it!
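A minimal sketch of such a benchmark is below. The `fake_page_load` function is only a stand-in; in a real test you would replace it with something that performs an actual page request or SQL query against each setup, then compare the numbers.

```python
# Tiny timing harness: run a callable several times and summarize latency.
import math
import statistics
import time


def benchmark(fn, runs=20):
    """Call `fn` `runs` times and return median, p95 and max in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


def fake_page_load():
    # Placeholder workload; swap in a real HTTP request or database query.
    time.sleep(0.001)


result = benchmark(fake_page_load, runs=10)
print({name: f"{seconds * 1000:.2f} ms" for name, seconds in result.items()})
```

Run it once with Apache and MySQL on the same instance and once with them split, ideally under realistic traffic, and look at the p95 rather than just the median.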

What are your experiences regarding performance with amazon-rds

Did you try amazon-rds? How is it, performance-wise?
I think this is a hard question to answer as it is highly specific to the problem you are trying to solve, but I will try to give you a picture of what we have seen.
We have been benchmarking RDS using CloudWatch metric gathering tools (provided here: http://aws.amazon.com/articles/2934) and have found it does perform nearly as well as our production servers for our data set. We tested both with a single RDS instance and with a Multi-AZ setup (what we plan to use in production) with no back-up retention.
With the load we have been able to throw at it so far, we get up into the 1000-1100 Write IOPS range (their metric) even on a small database instance (db.m1.small). At least for our load, increasing the instance class did not affect our throughput in IOPS or bytes. We saw about a 10% reduction in performance when
Amazon freely admitted up front that the solution to really scale out is to subdivide your problem such that you can scale/store it across multiple database servers. We in fact have this in our application (very similar to sharding) and therefore will be able to take advantage and very easily move past this IOPS measurement.
We've found RDS to be pretty comparable performance-wise to having our own production servers (either dedicated or virtual or EC2). Note that you will always suffer some IO/performance degradation using a virtualization solution, which is what RDS seems to be using, and this will show up under heavy load (but with heavy load, you should be having a dedicated MySQL/DB box anyway.)
Take note: the biggest performance hit you will likely see is network latency. If you are reading/writing from an EC2 box to an RDS box and vice versa, the network latency will probably be the bottleneck, particularly for a large number of queries. This is likely to be worse if you are connecting from a non-Amazon/non-EC2 box to RDS.
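To show why latency dominates when there are many queries, here is a simple sketch. It assumes the queries run one after another on a single connection, and all the numbers (300 queries, 1 ms execution time, 0.5 ms and 10 ms round trips) are hypothetical.

```python
# Total time for N sequential queries, each paying one network round trip.
# All figures below are hypothetical and only illustrate the scaling.
def total_request_time_ms(num_queries, round_trip_ms, execution_ms):
    """Each query costs one round trip plus its execution time."""
    return num_queries * (round_trip_ms + execution_ms)


same_az = total_request_time_ms(300, round_trip_ms=0.5, execution_ms=1.0)
remote = total_request_time_ms(300, round_trip_ms=10.0, execution_ms=1.0)
print(f"low-latency link: {same_az:.0f} ms")
print(f"high-latency link: {remote:.0f} ms")
```

The database does the same amount of work in both cases; only the round-trip time changes, which is why reducing the number of queries per request usually helps more than a bigger instance.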
You will probably get more performance from an equivalent spec physical box than a virtual box, but this is true of dedicated vs EC2/RDS, and is not a RDS-specific problem.
Regarding RDS vs EC2, the defaults that Amazon has set up RDS with seem to be pretty good, so if you are simply looking to have database server(s) up and running and connect to them, RDS is more than suitable. Do make sure you have the cost correctly analyzed though: it's not the same pricing model as, say, an EC2 instance.