I plan to mount persistent disks at the Apache (/var/www) and MySQL (/var/lib/mysql) directories to avoid having to replicate information between servers.
Has anyone tested whether the I/O performance of a persistent disk is similar when the same disk is attached to 100 instances as when it is attached to only 2? Also, is there a limit on how many instances one persistent disk can be attached to?
I'm not sure exactly what setup you're planning to use, so it's a little hard to comment specifically.
If you plan to attach the same persistent disk to all servers, note that a disk can only be attached to multiple instances in read-only mode, so you may not be able to use temporary tables, etc. in MySQL without extra configuration.
It's a bit hard to give performance numbers for a hypothetical configuration; I'd expect performance to depend on the amount of data stored (e.g. 1TB of data will behave differently than 100MB), instance size (larger instances have more memory for page cache and more CPU for processing I/O), and access pattern (random reads vs. sequential reads).
The best option is to set up a small test system and run an actual load test using something like apachebench, jmeter, or httperf. Failing that, you can try to construct an artificial load that's similar to your target benchmark.
Note that just running bonnie++ or fio against the disk may not tell you if you're going to run into problems; for example, it could be that a combination of sequential reads from one machine and random reads from another causes problems, or that 500 simultaneous sequential reads from the same block causes a problem, but that your application never does that. (If you're using Apache+MySQL, it would seem unlikely that your application would do that, but it's hard to know for sure until you test it.)
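As a starting point for such an artificial load, here is a minimal Python sketch that times sequential versus random block reads from one file. The function name `time_reads` and its parameters are made up for illustration, `os.pread` is only available on Unix-like systems, and on a small file the page cache will dominate the result, so treat it as a template to adapt rather than a benchmark.

```python
import os
import random
import time


def time_reads(path, block_size=4096, count=1000, sequential=True, seed=0):
    """Read `count` blocks from `path`; return (bytes_read, elapsed_seconds)."""
    size = os.path.getsize(path)
    blocks = max(size // block_size, 1)
    if sequential:
        # Walk the file front to back, wrapping around if count > blocks.
        offsets = [(i % blocks) * block_size for i in range(count)]
    else:
        # Pick block-aligned offsets at random (seeded for repeatability).
        rng = random.Random(seed)
        offsets = [rng.randrange(blocks) * block_size for _ in range(count)]

    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for offset in offsets:
            total += len(os.pread(fd, block_size, offset))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, elapsed
```

Running it from several instances at once, with a mix of `sequential=True` and `sequential=False`, gets closer to the mixed-pattern case described above than a single-machine run does.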
I came across this article https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/pastebin/README.md which says 4 writes per second should be doable for a single SQL write master-slave. In another article, it is mentioned that 2000 writes per second is too much for a single SQL write master-slave. Not having worked on setting up SQL databases directly, my question is: how can I tell how much a single write master-slave setup can handle? I would like to understand:
(1) What is the typical write QPS that this setup can handle on modern machines? This is for general intuition.
(2) Suppose my application is using this setup for its database. How should I load test the database first to identify its write QPS capacity, and then how should I monitor it as usage grows?
There is no way to determine the exact number of queries you can run on a master/slave system as it depends on a lot of variables.
How powerful is the CPU, is an SSD or HDD used, what exactly are the writes/reads, the database version, network connectivity, etc.
4 writes/second is laughably low; depending on your setup, you should be able to consistently do thousands of writes per second.
I would recommend first testing a master/slave system with a test load and determining from there whether it's feasible for your case. If you don't actually have a working system in place and are just wondering whether you should start with master/slave, you can safely start with it; you will most likely not hit bottlenecks related to it anytime soon.
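A test load can be as simple as a loop that issues committed writes and counts them. The sketch below is one possible harness in Python. The names `measure_write_qps` and `make_sqlite_writer` are hypothetical, and the in-memory SQLite table is only a stand-in so the example runs anywhere; for a real measurement you would pass a function that writes to your actual master, and the numbers from SQLite say nothing about MySQL.

```python
import sqlite3
import time


def measure_write_qps(write_once, duration_s=1.0):
    """Call write_once() in a loop for duration_s seconds; return writes/second."""
    writes = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        write_once()
        writes += 1
    elapsed = time.perf_counter() - start
    return writes / elapsed


def make_sqlite_writer():
    """Stand-in target: one committed INSERT per call into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pastes (id INTEGER PRIMARY KEY, body TEXT)")

    def write_once():
        conn.execute("INSERT INTO pastes (body) VALUES (?)", ("hello",))
        conn.commit()  # commit each write, as a single-row web request would

    return conn, write_once
```

Calling `conn, write_once = make_sqlite_writer()` followed by `measure_write_qps(write_once, 5.0)` gives a single-client figure. Running several such loops in parallel and watching where throughput stops growing is closer to a capacity test.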
QUESTION OUTLINE
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage. (see below the question update for detailed problem description)
Question Update
So after more than one month of investigating and some developer support by AWS, I am not exactly closer to a solution.
Here are a couple of steps which I checked off the list, more or less without any further hint of the problem:
Index / Fragmentation (all tables have correct indexes/keys and have no fragmentation)
MySQL Stats Update (manually updating stats source)
Thread Concurrency (changing innodb_thread_concurrency to various different parameters)
Query Cache Hit Ratio doesn't show problems
EXPLAIN to see if any SELECTs are actually slow or not using indexes/keys
SLOW QUERY LOG (returns no results because, as explained in the paragraph below, the load consists of many prepared SELECTs)
RDS and EC2 are within one VPC
For explanation, the PlayFramework version in use (2.3.8) comes with BoneCP, and we are using eBeans to select our data. So basically I am running through a nested object and all of its child objects, which produces a couple of hundred prepared SELECTs for the API call in question. This should also be fine for the hardware in use; neither CPU nor RAM is used extensively by these operations.
I also included NewRelic for more insights on this issue and did some JVM profiling. Obviously, most of the time is consumed by NETTY/eBeans?
Is anyone able to make sense of this?
ORIGINAL QUESTION: Problem Outline
Our AWS RDS instance starts slowing down after about 7-14 days, by a quite large factor (~400% load times for a specific set of queries). RDS monitoring shows no signs of resource shortage.
Infrastructure
We run a PlayFramework backend for a mobile app on AWS EC2 instances, connected to AWS RDS MySQL instances, one PROD environment, one DEV environment. Usually the PROD EC2 instance is pointing to the PROD RDS instance, and the DEV EC2 points to the DEV RDS (hi from captain obvious!); however sometimes we also let the DEV EC2 point to the PROD DB for some testing purposes. The PlayFramework in use is working with BoneCP.
Detailed Problem Description
In a quite essential sync process, our app is making a certain API call many times a day per user. I discussed the backgrounds of the functionality in this SO question, where, thanks to comments, I could nail the problem down to be a MySQL issue of some kind.
In short, the API call is loading a set of data, the maximum is about 1MB of json data, which currently takes about 18s to load. When things are running perfectly fine, this takes about 4s to load.
Curiously enough, what "solved" the problem last time was upgrading the RDS instance to another instance type (from db.m3.large to db.m4.large, which is just a very marginal step). Now, after about 2-3 weeks, the RDS instance is once again performing as slowly as before. Rebooting the RDS instance showed no effect. Re-launching the EC2 instance showed no effect either.
I also checked whether the indices of the affected MySQL tables are set properly, which is the case. The API call itself is not eager-loading any BLOB fields or similar; I double-checked this. The CPU usage of the RDS instances is below 1% most of the time. When I stress tested it with 100 simultaneous API calls, it went to ~5%, so this is not the bottleneck. Memory is fine too, so I guess the RDS instance doesn't start swapping, which could slow down the whole process.
Giving hard evidence, a (smaller) public API call on the DEV environment currently takes 2.30s to load, while on the PROD environment it takes 4.86s. This is interesting, because the DEV environment has a much smaller instance type in both EC2 and RDS. So basically the turtle wins the race here. (If you are interested in this API call I am happy to share it with you via private message, but I don't really want to post links to API calls, even if they are basically public.)
Conclusion
Concluding, it feels (I wittingly say 'feels') like the DB is clogged after x days of usage / after a certain number of API calls. I'm not sure if this is an RDS-specific issue; once I 'largely' reset the DB instance by changing the instance type, things run fast and smooth. But re-creating my DB instance from a snapshot every 2 weeks is not an option, especially if I don't understand why this is happening.
Do you have any ideas what further steps I could take to investigate this matter?
(Too long for just a comment) I know you have checked a lot of things, but I would like to look at them with a different set of eyes...
Please provide
SHOW VARIABLES; (probably need post.it or something, due to size)
SHOW GLOBAL STATUS;
how much RAM? Sounds like 7.5G
The query. -- Unclear what query/queries you are using
SHOW CREATE TABLE for the table(s) in the query -- indexes, datatypes, etc
(Some of the above may help with "clogging over time" question.)
Meanwhile, here are some guesses/questions/etc...
Some other customer sharing the hardware is busy.
It could be a network problem?
Shrink long_query_time to 1 so you can catch slow queries.
When are backups done on your instance?
4s-18s to load a megabyte -- what percentage of that is SQL statements?
Do you "batch" the inserts? Is it a single transaction? Are lengthy queries going on at the same time?
What, if any, MySQL tunables did you change from the AWS defaults?
6GB buffer_pool on a 7.5GB partition? That sounds dangerously tight. Can you see if there was any swapping?
Any PARTITIONing involved? (Of course the CREATE will answer that.)
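To answer the "what percentage of that is SQL statements?" question, one option is to wrap every database call in a timer and compare the accumulated SQL time with the total request time. The sketch below shows the idea in Python. The class `SqlTimer` is hypothetical, and in the Play/eBeans application you would put the equivalent measurement around the data-access calls in the JVM.

```python
import time
from contextlib import contextmanager


class SqlTimer:
    """Accumulates wall-clock time spent inside `with timer.sql():` blocks."""

    def __init__(self):
        self.sql_seconds = 0.0
        self.statements = 0

    @contextmanager
    def sql(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Count the statement even if it raised, so failures are visible too.
            self.sql_seconds += time.perf_counter() - start
            self.statements += 1

    def fraction_of(self, total_seconds):
        """Share of total_seconds spent in SQL, as a value between 0 and 1."""
        if total_seconds <= 0:
            return 0.0
        return min(self.sql_seconds / total_seconds, 1.0)
```

If a few hundred statements add up to only a small fraction of the 18s, the time is going somewhere other than MySQL. If they account for most of it, the per-statement average (`sql_seconds / statements`) tells you whether each round trip has become slower.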
There is one very important bit of information missing from your description: the total allocated space for the database. I/O capacity for RDS is around 3 IOPS per GB of allocated space, so for a 100GB allocation, you should get around 300 IOPS. That allocated space also includes logs.
Since you don't really know what's going on, the first step should be to turn on detailed monitoring, which will give you more idea of what is happening on the instance.
Until you have additional stats gathered during a slowdown, you can try increasing the allocated space, which will increase the IOPS available.
Also, check the events for the db - are logs getting purged on a regular basis? That might indicate that there's not enough space.
Finally, you can try going with PIOPS (provisioned IOPS) if you have an idea of what the application needs, though at this point it sounds like that would be a guess.
Maybe your burst credit balance is (slowly) being depleted? Eventually you end up with baseline performance, which may appear "too slow".
This would also explain why the upgrade to another instance type helped, as you then start with a full burst balance again.
I would suggest increasing the size of the volume, even if you don't need the extra space, as the baseline performance grows linearly with volume size.
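The arithmetic behind this can be sketched in a few lines of Python. The constants below are the figures AWS has published for general-purpose SSD (gp2) storage: 3 IOPS per GB with a floor of 100, bursts up to 3,000 IOPS, and a bucket of 5.4 million I/O credits. Whether the instance in question actually uses gp2 is an assumption, and the function names are hypothetical.

```python
GP2_IOPS_PER_GB = 3              # baseline IOPS per allocated GB
GP2_MIN_BASELINE = 100           # baseline floor for small volumes
GP2_BURST_IOPS = 3000            # burst ceiling
GP2_BUCKET_CREDITS = 5_400_000   # I/O credits in a full burst bucket


def baseline_iops(size_gb):
    """Sustained IOPS a volume of size_gb gets without burst credits."""
    return max(GP2_MIN_BASELINE, GP2_IOPS_PER_GB * size_gb)


def seconds_until_depleted(size_gb, demand_iops):
    """How long a full bucket lasts at steady demand; None if it never drains."""
    demand = min(demand_iops, GP2_BURST_IOPS)
    drain_per_second = demand - baseline_iops(size_gb)
    if drain_per_second <= 0:
        return None
    return GP2_BUCKET_CREDITS / drain_per_second
```

For a 100GB volume, `baseline_iops(100)` is 300. A workload averaging just 305 IOPS drains the bucket in `seconds_until_depleted(100, 305) / 86400` = 12.5 days, which is the kind of slow decline that would fit a slowdown after one to two weeks. The sketch ignores credits being refilled during quiet periods, so real depletion would take longer.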
I've seen pictures like this where multiple Rails engines write to a single MySQL server.
1) Is this possible? Or does Rails want each application server to write to one database server?
2) If this is possible, how is it accomplished? Are there queues and a scheduler between the application servers and the write database server?
Scaling a MySQL db is a pretty difficult thing to do, but it's certainly been done plenty of times and there are a lot of best practices out there for you to take advantage of. The first thing you should know is that you probably won't need to worry about scaling writes for a while yet; you will most likely need to scale your reads first.
Scaling reads can be done fairly easily using replication. There are several tools out there that make managing replication a lot easier, such as Amazon RDS. Generally speaking, many web servers can connect to many databases (as suggested by others); however, you quickly run into scale issues once you have a lot of traffic, connections or whatever other action you are performing that generates load on the server.
As replicated servers are read-only, you need to manage which server you connect to depending on the action you're performing. I.e. if you had a users table, when creating, updating or deleting users you need to use the "write" database (the primary "source" server), but when reading the users table, you can use one of the read replicas. This reduces the load on the primary write server (allowing it to deal with even more writes), and as you can have multiple read databases behind a load balancer, you can get away with this structure for a very long time and scale reads across tens of database servers before you hit any significant issues (however, most apps get away with 1-3).
There are situations where you will need to use your write database for read actions (although you should avoid it as much as possible), as the read replicas can be slightly behind the write dbs due to latency in replicating the write db queries. However, most of the time you should be able to code knowing that there is the possibility that the read db is delayed (i.e. queue actions for a reasonable period of time so that the updates can propagate across all the read servers) and simply use one of your read dbs rather than the write db.
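The routing decision described above can be sketched as a small Python class. This is an illustration only: the class name `Router`, the `needs_fresh_data` flag, and the string labels standing in for connections are hypothetical, and the keyword check is a deliberate simplification (in a Rails app this choice is usually made by the framework or a gem rather than by parsing SQL).

```python
import itertools

# Statements that must go to the primary ("write") server.
WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "REPLACE")


class Router:
    """Sends writes to the primary and spreads reads over the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def pick(self, sql, needs_fresh_data=False):
        statement = sql.lstrip().upper()
        is_write = statement.startswith(WRITE_PREFIXES)
        # Treating locking reads as writes is an assumption of this sketch.
        locks_rows = (statement.startswith("SELECT")
                      and statement.rstrip(" ;").endswith("FOR UPDATE"))
        if is_write or locks_rows or needs_fresh_data or self._replicas is None:
            return self.primary
        return next(self._replicas)
```

With `router = Router("primary", ["replica-1", "replica-2"])`, a call like `router.pick("SELECT * FROM users")` alternates between the two replicas, while `router.pick("UPDATE users SET name = 'x'")` and any call with `needs_fresh_data=True` go to the primary. The flag covers the case where replication lag would make a replica read return stale data.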
Beyond this, the key items to work on are ensuring you have efficient indexes and applying other best practices around maintaining a sensible data structure. You might also want to consider having 3 distinct "groups" of database servers. I generally like to have write, read and "stats" db groups: the write group for create, update and delete operations (as well as select for update), the read group for general read items that must return their results quickly, and the stats group for anything that is going to be under high load and that you do not rely on for a prompt response (this keeps heavy queries that are not time sensitive away from the read db that you need quick responses from for general reads).
Once you get into a situation where you can no longer buy larger hardware and you're near maxing out your write capacity, you'll need to look into sharding; however, that will take a lot of traffic / data (so don't worry about it unless you've done all of the above already).
I have MySQL databases set up with replication. The log files and the bin-log files are all in one path, which is the default MySQL data directory.
I have read that for better performance one should store them separately.
Can anyone explain how this improves performance? Is there any documentation available on this, and what is the reason one should do so?
Mainly because then, reads and writes can be made almost in parallel. Stored separately meaning on different disks.
Linux and H/W optimizations for MySQL is a nice presentation of ways to improve MySQL performance - it presents benchmarks and conclusions of when to use SSD disks and when to use SCSI disks, what kind of processors are better for what tasks.
Very good presentation, a must read for any DBA!!
It also can be really embarrassing to have your log files fill the file system and bring the database to a halt.
One consideration is that using a separate disk for binlogging introduces another SPOF since if MySQL cannot write the binlog it will croak the same as if it couldn't write to the data files. Otherwise, adding another disk just better separates the two tasks so that binlog writes and data file writes don't have to contend for resources. With SSDs this is much less of an issue unless you have some crazy heavy write load and are already bound by SSD performance.
It's mostly for cases where your database write traffic is so high that a single disk volume can't keep up while writing for both data files and log files. Disks have a finite amount of throughput, and you could have a very busy database server.
But it's not likely that separating data files from binlogs will give better performance for queries, because MySQL writes to the binlog at commit time, not at query time. If your disks were too slow to keep up with the traffic, you'd see COMMIT become a bottleneck.
The system I currently support stores binlogs in the same directory as the datadir. The datadir is on a RAID10 volume over 12 physical drives. This has plenty of throughput to support our workload. But if we had about double our write traffic, this RAID array wouldn't be able to keep up.
You don't need to do every tip that someone says gives better performance, because any given tip might make no difference to your application's workload. You need to measure many metrics of performance and resource use, and come up with the right tuning or configuration to help the bottlenecks under your workload.
There is no magic configuration that makes everything have high performance.
I currently have an application that uses 130 MySQL tables, all with the MyISAM storage engine. Every table gets multiple queries every second, including select/insert/update/delete queries, so the data and the indexes are constantly changing.
The problem I am facing is that the hard drive is unable to cope, with waiting times up to 6+ seconds for I/O access with so many read/writes being done by MySQL.
I was thinking of changing to just 1 table and making it memory based. I've never used a memory table for something with so many queries though, so I am wondering if anyone can give me any feedback on whether it would be the right thing to do?
One possibility is that there may be other issues causing performance problems - 6 seconds seems excessive for CRUD operations, even on a complex database. Bear in mind that (back in the day) ArsDigita could handle 30 hits per second on a two-way Sun Ultra 2 (IIRC) with fairly modest disk configuration. A modern low-mid range server with a sensible disk layout and appropriate tuning should be able to cope with quite a substantial workload.
Are you missing an index? - check the query plans of the slow queries for table scans where they shouldn't be.
What is the disk layout on the server? - do you need to upgrade your hardware or fix some disk configuration issues (e.g. not enough disks, logs on the same volume as data).
As the other poster suggests, you might want to use InnoDB on the heavily written tables.
Check the setup for memory usage on the database server. You may want to configure more cache.
Edit: Database logs should live on quiet disks of their own. They use a sequential access pattern with many small sequential writes. Where they share disks with a random-access workload like data files, the random disk access creates a big performance bottleneck on the logs. Note that this is write traffic that needs to be completed (i.e. written to physical disk), so caching does not help with this.
I've now changed to a MEMORY table and everything is much better. In fact I now have extra spare resources on the server allowing for further expansion of operations.
Is there a specific reason you aren't using InnoDB? It may yield better performance due to caching and a different concurrency model. It will likely require more tuning, but may yield much better results.
should-you-move-from-myisam-to-innodb
I think that your database structure is very wrong and needs to be optimised; this has nothing to do with the storage.