By what factor does the performance (read queries/sec) increase when a machine is added to a cluster of machines running either:
a Bigtable-like database
MySQL?
Google's research paper on Bigtable suggests that "near-linear" scaling is achieved can be achieved with Bigtable. This page here featuring MySQL's marketing jargon suggests that MySQL is capable of scaling linearly.
Where is the truth?
Having built and benchmarked several applications using VoltDB I consistently measure between 90% and 95% of additional transactional throughput as each new server is added to the cluster. So if an application is performing 100,000 transaction per second (TPS) on a single server, I measure 190,000 TPS on 2 servers, 280,000 TPS on 3 servers, and so on. At some point we expect the server to server networking to become a bottleneck but our largest cluster (30 servers) is still above 90%.
If you don't do that many writes to the database, MySQL may be a good and easy solution, especially if coupled with memcached in order to increase the read speed.
OTOH if you data is constantly changing, you should probably look somewhere else:
Cassandra
VoltDB
Riak
MongoDB
CouchDB
HBase
These systems have been designed to scale linearly with the number of computers added to the system.
A full list is available here.
Related
I've got an AWS Aurora DB cluster running that is 99.9% focused on writes. At it's peak, it will be running 2-3k writes/sec.
I know Aurora is somewhat optimized by default for writes, but I wanted to ask as a relative newcomer to AWS - what are some best practices/tips for write performance with Aurora?
From my experience, Amazon Aurora is unsuited to running a database with heavy write traffic. At least in its implementation circa 2017. Maybe it'll improve over time.
I worked on some benchmarks for a write-heavy application earlier in 2017, and we found that RDS (non-Aurora) was far superior to Aurora on write performance, given our application and database. Basically, Aurora was two orders of magnitude slower than RDS. Amazon's claims of high performance for Aurora are apparently completely marketing-driven bullshit.
In November 2016, I attended the Amazon re:Invent conference in Las Vegas. I tried to find a knowledgeable Aurora engineer to answer my questions about performance. All I could find were junior engineers who had been ordered to repeat the claim that Aurora is magically 5-10x faster than MySQL.
In April 2017, I attended the Percona Live conference and saw a presentation about how to develop an Aurora-like distributed storage architecture using standard MySQL with CEPH for an open-source distributed storage layer. There's a webinar on the same topic here: https://www.percona.com/resources/webinars/mysql-and-ceph, co-presented by Yves Trudeau, the engineer I saw speak at the conference.
What became clear about using MySQL with CEPH is that the engineers had to disable the MySQL change buffer because there's no way to cache changes to secondary indexes, while also have the storage distributed. This caused huge performance problems for writes to tables that have secondary (non-unique) indexes.
This was consistent with the performance problems we saw in benchmarking our application with Aurora. Our database had a lot of secondary indexes.
So if you absolutely have to use Aurora for a database that has high write traffic, I recommend the first thing you must do is drop all your secondary indexes.
Obviously, this is a problem if the indexes are needed to optimize some of your queries. Both SELECT queries of course, but also some UPDATE and DELETE queries may use secondary indexes.
One strategy might be to make a non-Aurora read replica of your Aurora cluster, and create the secondary indexes only in the read replica to support your SELECT queries. I've never done this, but apparently it's possible, according to https://aws.amazon.com/premiumsupport/knowledge-center/enable-binary-logging-aurora/
But this still doesn't help cases where your UPDATE/DELETE statements need secondary indexes. I don't have any suggestion for that scenario. You might be out of luck.
My conclusion is that I wouldn't choose to use Aurora for a write-heavy application. Maybe that will change in the future.
Update April 2021:
Since writing the above, I have run sysbench benchmarks against Aurora version 2. I can't share the specific numbers, but I conclude that current Aurora improvements are better for write-heavy workload. I did run tests with lots of secondary indexes to make sure. But I encourage anyone serious about adopting Aurora to run their own benchmarks.
At least, Aurora is much better than conventional Amazon RDS for MySQL using EBS storage. That's probably where they claim Aurora is 5x faster than MySQL. But Aurora is no faster than some other alternatives I tested, and in fact cannot match:
MySQL Server installed myself on EC2 instances using local storage, especially i3 instances with locally-attached NVMe. I understand instance storage is not dependable, so one would need to run redundant nodes.
MySQL Server installed myself on physical hosts in our data center, using direct-attached SSD storage.
The value of using Aurora as a managed cloud database is not just about performance. It also has automated monitoring, backups, failover, upgrades, etc.
I had a relatively positive experience w/ Aurora, for my use case. I believe ( time has passed ) we were pushing somewhere close to 20k DML per second, largest instance type ( I think db.r3.8xlarge? ). Apologies for vagueness, I no longer have the ability to get the metrics for that particular system.
What we did:
This system did not require "immediate" response to a given insert, so writes were enqueued to a separate process. This process would collect N queries, and split them into M batches, where each batch correlated w/ a target table. Those batches would be put inside a single txn.
We did this to achieve the write efficiency from bulk writes, and to avoid cross table locking. There were 4 separate ( I believe? ) processes doing this dequeue and write behavior.
Due to this high write load, we absolutely had to push all reads to a read replica, as the primary generally sat at 50-60% CPU. We vetted this arch in advance by simply creating random data writer processes, and modeled the general system behavior before we committed the actual application to it.
The writes were almost all INSERT ON DUPLICATE KEY UPDATE writes, and the tables had a number of secondary indexes.
I suspect this approach worked for us simply because we were able to tolerate delay between when information appeared in the system, and when readers would actually need it, thus allowing us to batch at much higher amounts. YMMV.
For Googlers:
Aurora needs to write to multiple replicas in real time, thus there must be a queue w/ locking, waiting, checking mechanisms
This behavior inevitably causes ultra high CPU utilization and lag when there are continuous writing requests which only succeed when multiple replicas are sync'd
This has been around since Aurora's inception, up til 2020, which is logically difficult if not impossible to solve if we were to keep the low storage cost and fair compute cost of the service
High-volume writing performance of Aurora MySQL could be more than 10x worse than RDS MySQL (from personal experience and confirmed by above answers)
To solve the problem (more like a work-around):
BE CAREFUL with Aurora if more than 5% of your workload is writing
BE CAREFUL with Aurora if you need near real-time result of large volume writing
Drop secondary indices as #Bill Karwin points out to improve writing
Batch apply inserts and updates may improve writing
I said "BE CAREFUL" but not "DO NOT USE" as many scenarios could be solved by clever architecture design. Database writing performance can be hardly depended on.
I am curious to know the overall effects on performance if i have my Database server separately from Tomcat server(Spin a MySQL server on Amazon). Actually, I am having some performance issues and not sure if it might be the cause of.
Yes, absolutely i have found that separating the DB and application can actually uncover performance issues not evident in a co-location situation, for the network latency reasons mentioned by ck1. In fact, if you capture stack traces during the slow operations, by sampling it will indicate the database/application code sensitive to network latency. The use cases with performance issues ( in non co-located apps) generally make a lot of round trips to the database. Instead try offloading the processing into the DB with a more complex query and reduce the rows returned.
Pros of having database and app servers co-located:
Network latency will be minimized
You only need to maintain a single server
Cons of co-location:
The app and database servers will contend for a common set of CPU, memory, and disk I/O resources. For example, queries causing a spike in CPU usage will affect the app server's performance
You have more than a single server to maintain
It's difficult to scale horizontally
My application's typical DB usage is to read/update on one large table. I wonder if MySQL scales read operations on a single multi-processor machine? How about write operations - are they able to utilize multi-processors?
By the way - unfortunately I am not able to optimize the table schema.
Thank you.
Setup details:
X64, quard core
Single hard disk (no RAID)
Plenty of memory (4GB+)
Linux 2.6
MySQL 5.5
If you're using conventional hard disks, you'll often find you run out of IO bandwidth before you run out of CPU cores. The only way to pin a four core machine is to have a very high performance SSD striped RAID array.
If you're not able to optimize the schema you have very limited options. This is like asking to tune a car without lifting the hood. Maybe you can change the tires or use better gasoline, but fundamental performance gains come from several factors, including, most notably, additional indexes and strategically de-normalizing data.
In database land, 4GB of memory is almost nothing, 8GB is the absolute minimum for a system with any loading, and a single disk is a very bad idea. At the very least you should have some form of mirroring for data integrity reasons.
I have a Java program and PHP website I plan to run on my Amazon EC2 instance with an EBS volume. The program writes to and reads from a database. The website only reads from the same database.
On AWS you pay for the amount of IOPS (I/O requests Per Second) to the volume. Which database has the least IOPS? Also, can SQLite handle queries from both the program and website simultaneously?
The amount of IO is going to depend a lot on how you have MySQL configured and how your application uses the database. Caching, log file sizes, database engine, transactions, etc. will all affect how much IO you do. In other words, it's probably not possible to predict in advance although I'd guess that SQLite would have more disk IO simply because the database file has to be opened and closed all the time while MySQL writes and reads (in particular) can be cached in memory by MySQL itself.
This site, Estimating I/O requests, has a neat method for calculating your actual IO and using that to estimate your EBS costs. You could run your application on a test system under simulated loads and use this technique to measure the difference in IO between a MySQL solution and a SQLite solution.
In practice, it may not really matter. The cost is $0.10 per million IO requests. On a medium-traffic e-commerce site with heavy database access we were doing about 315 million IO requests per month, or $31. This was negligible compared to the EC2, storage, and bandwidth costs which ran into the thousands. You can use the AWS cost calculator to plug in estimates and calculate all of your AWS costs.
You should also keep in mind that the SQLite folks only recommend that you use it for low to medium traffic websites. MySQL is a better solution for high traffic sites.
Yes SQLite can handle queries from both the program and website simultaneously. SQLite uses file level locking to ensure consistency.
In memory SQLite is intended for standalone or embedded programs.
Do not use in memory only SQLite:
when you share the db between multiple processes
when you have a php based website in which case you won't be able to leverage php fastcgi
Did you try amazon-rds? How is it, performance-wise?
I think this is a hard question to answer as it is highly specific to the problem you are trying to solve, but I will try to give you a picture of what we have seen.
We have been benchmarking RDS using CloudWatch metric gathering tools (provided here: http://aws.amazon.com/articles/2934) and have found it does perform nearly as well as our production servers for our data set. We tested both with a single RDS instance and with a Multi-AZ setup (what we plan to use in production) with no back-up retention.
The load we have been able to throw at it so far we are able to get up into the 1000-1100 Write IOPS range (their metric) even on a small database instance (db.m1.small). At least for our load, increasing the instance class did not affect our throughput IOPS or Bytes. We saw about a 10% reduction in performance when
Amazon freely admitted up front that the solution to really scale out is to subdivide your problem such that you can scale/store it across multiple database servers. We in fact have this in our application (very similar to sharding) and therefore will be able to take advantage and very easily move past this IOPS measurement.
We've found RDS to be pretty comparable performance-wise to having our own production servers (either dedicated or virtual or EC2). Note that you will always suffer some IO/performance degradation using a virtualization solution, which is what RDS seems to be using, and this will show up under heavy load (but with heavy load, you should be having a dedicated MySQL/DB box anyway.)
Take note: the biggest performance you will likely see is the network latency - if you are reading/writing from an EC2 box to an RDS box and vice versa, the network latency will probably be the bottlebeck, particularly for a large number of queries. This is likely to be worse if you are connecting from a non-Amazon/non-EC2 box to RDS.
You will probably get more performance from an equivalent spec physical box than a virtual box, but this is true of dedicated vs EC2/RDS, and is not a RDS-specific problem.
Regarding RDS vs EC2, the defaults that Amazon has set up RDS with seem to be pretty good, so if you are simply looking to have database server(s) up and running and connect to it, RDS is more than suitable. Do make sure you have the cost correctly analyzed though - its not the same pricing model as, say, an EC2 instance.