Amazon Aurora DB Cluster Not Auto Balancing Correctly - mysql

I have created an Amazon Aurora database cluster running MySQL with three instances: the main instance that backs the cluster and two read replicas for balancing. However, the cluster does not seem to be balancing the reads at all. I have one replica handling 700+ SELECTs/sec, maxing out the CPU at 99.75% or higher, while the other replica is doing virtually nothing at 4% CPU and roughly 1 SELECT per second, if that. The main cluster instance itself is at 33% CPU usage, as it is being written to while the replicas are being read from. The replica lag is under 20 milliseconds. My application queries the read-only endpoint of the cluster, but no balancing appears to be happening. Does anyone have any insight into why this may be happening, or why one replica is at such high CPU usage? The queries being run against it are not complex by any means.

Aurora cluster endpoints are DNS records, and they only do DNS round robin during resolution. This means that when your client application opens connections to a cluster endpoint, it resolves the endpoint to different instances (different IPs, basically), thereby striping your connections across multiple replicas. Past that point, there is no load balancing: connections are striped across instances, and queries run on each connection go to the instance backing it.
Now consider the scenario where your connection pool was created against the cluster endpoint while only one instance was behind it. If you then add more instances, there will be no impact on your application unless you terminate your connections and reestablish them. Doing so triggers DNS round robin again, and this time some of your connections would land on the newly provisioned instance.
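To see the connection striping for yourself, you can open several short-lived connections against the reader endpoint and ask each one which instance it landed on. Here is a minimal probe in Java/JDBC; it assumes MySQL Connector/J on the classpath, a hypothetical reader endpoint and credentials, and uses Aurora's `@@aurora_server_id` variable to report the instance name:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReaderEndpointProbe {
    // Hypothetical reader (RO) endpoint; substitute your cluster's own.
    private static final String URL =
        "jdbc:mysql://mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com:3306/mydb";

    public static void main(String[] args) throws Exception {
        // Each new connection triggers a fresh DNS resolution (unless DNS is
        // cached), so successive connections may land on different replicas.
        for (int i = 0; i < 5; i++) {
            try (Connection conn = DriverManager.getConnection(URL, "user", "password");
                 Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@aurora_server_id")) {
                rs.next();
                System.out.println("Connection " + i + " -> " + rs.getString(1));
            }
        }
    }
}
```

If all five connections print the same instance name, something between your client and Aurora is pinning the DNS resolution (see the notes on DNS caching below).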
A few callouts:
In Aurora, you have two cluster endpoints. One (RW) always points to the current writer; the other (RO) does DNS round robin between your read replicas.
Also, DNS propagation can take a few seconds during a failover, so occasional connection errors around that time are normal.
Hope this helps.

We've implemented a driver to try to mitigate this problem, with some visible gains: https://github.com/DiceTechnology/dice-fairlink
It regularly discovers the read replicas to keep up with cluster changes and round-robins connections among them.
Although we haven't measured CPU utilisation, we've observed a better load distribution than with the native DNS-based round robin of the cluster reader endpoint.

Aurora's DNS-based load balancing works at the connection level, not the individual query level. To reach a different instance, you must resolve the endpoint anew, without DNS caching, each time you open a connection. If you resolve the endpoint once and then hold the connection in your pool, every query on that connection goes to the same instance; and if you cache DNS, you receive the same instance IP on every resolution.
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora replicas. Currently, Aurora DNS zones use a short time-to-live (TTL) of 5 seconds. Ensure that your network and client configurations don't further increase the DNS cache TTL. Remember that DNS caching can occur anywhere from your network layer, through the operating system, to the application container. For example, Java virtual machines (JVMs) are notorious for caching DNS indefinitely unless configured otherwise. See the AWS documentation and the Aurora whitepaper on configuring the DNS cache TTL.
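For the JVM case in particular, the address cache TTL can be capped before any lookups occur. A small sketch of the standard `java.security.Security` knobs; the 5-second value mirrors Aurora's DNS TTL:

```java
import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        // Cap the JVM's positive DNS cache at 5 seconds to match Aurora's TTL.
        // This must run before the first hostname resolution; with a security
        // manager installed, the JVM otherwise caches successful lookups forever.
        Security.setProperty("networkaddress.cache.ttl", "5");

        // Optionally bound negative (failed) lookups as well.
        Security.setProperty("networkaddress.cache.negative.ttl", "3");
    }
}
```

The same settings can also be applied without code changes by editing the JVM's java.security file.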

My guess is that you are not connecting to the cluster's reader endpoint.
Load Balancing – Connecting to the reader endpoint allows Aurora to load-balance connections across the replicas in the DB cluster. This helps to spread the read workload and can lead to better performance and more equitable use of the resources available to each replica. In the event of a failover, if the replica that you are connected to is promoted to the primary instance, the connection will be dropped. You can then reconnect to the reader endpoint in order to send your read queries to the other replicas in the cluster.
New Reader Endpoint for Amazon Aurora – Load Balancing & Higher Availability
[EDIT]
To load balance within a single application, you will need to reconnect to the endpoint. If you use the same connection for all queries, only one replica will be responding. However, opening connections is expensive, so this might not provide much benefit unless your queries take some time to run.
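A middle ground between reconnecting per query and never reconnecting is to cap connection lifetime in your pool, so connections are retired and re-created (with a fresh DNS resolution) in the background. A sketch using HikariCP, assuming it and Connector/J are on the classpath; the endpoint and timings are illustrative:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ReaderPool {
    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        // Hypothetical reader endpoint; each replacement connection re-resolves DNS.
        cfg.setJdbcUrl(
            "jdbc:mysql://mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com:3306/mydb");
        cfg.setUsername("user");
        cfg.setPassword("password");
        cfg.setMaximumPoolSize(20);
        // Retire connections after 5 minutes so the pool gradually re-stripes
        // across replicas as instances are added or removed.
        cfg.setMaxLifetime(300_000); // milliseconds
        return new HikariDataSource(cfg);
    }
}
```

This keeps per-query latency low while still letting the pool drift toward an even spread over time.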

Related

AWS: Too many connections

I have an RDS instance hosting a MySQL database. The instance size is db.t2.micro.
I also have an ExpressJS backend connecting to the MySQL RDS instance via a connection pool.
Additionally, I have a mobile app (the client) feeding off the ExpressJS API.
The issue I'm facing is that, either via the mobile app or via Postman, there are times when I get a 'Too many connections' error and several requests fail.
On the RDS instance's current activity I sometimes see 65 connections, showing it's reaching the limit. What I need clarity on is:
When 200 mobile app instances connect to the API, and hence to the RDS instance, does that register as 200 connections or as 1 connection from ExpressJS?
Is it normal to be reaching the RDS instance's 65-connection limit?
Is this just a matter of the db.t2.micro instance size not being recommended for prod? Will upgrading the instance size resolve this issue?
Is there something I'm doing wrong with my requests?
Thank you and your feedback is appreciated.
If your app creates a connection pool of 100, that's the number of database connections it will try to open, and it must be lower than your MySQL connection limit.
Typically connection pools open all their connections up front, so they are ready when a client calls the HTTP API. The connections may be running no SQL queries at a given moment if few clients are using the API, but they are nevertheless connected.
It's sort of like when you ssh to a remote Linux server and just sit at a shell prompt for a while before running any command: you're still connected.
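To make the sizing arithmetic concrete: total connections are roughly processes × pool size, and that product must stay below the instance's `max_connections` (which RDS derives from instance memory, hence the low ceiling on a db.t2.micro). A hedged sketch of checking the limit and dividing it up, in Java for consistency with the other examples in this thread; the endpoint and process count are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PoolSizing {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://mydb.abc123.us-east-1.rds.amazonaws.com:3306/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'max_connections'")) {
            rs.next();
            int maxConnections = Integer.parseInt(rs.getString("Value"));

            // Leave headroom for admin sessions, monitoring, and anything else
            // sharing the instance, then split the rest across app processes.
            int processes = 4; // e.g. API servers or workers, each with its own pool
            int poolSize = (maxConnections - 10) / processes;
            System.out.println("max_connections=" + maxConnections
                + ", per-process pool size=" + poolSize);
        }
    }
}
```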
You asked if a db.t2.micro instance was not recommended for production. Yes, I would agree with that. It's tempting to use the smallest instance possible to save money, but a db.t2.micro is too small for anything but light testing, in my opinion.
In fact, I would not use any t2 instance for production, regardless of size. The t2 type uses "burstable" performance, meaning it can provide only brief periods of good performance. Once the instance depletes its performance credits, they recharge slowly, and while they recharge the instance's performance is very low. This is fine for testing, but not for production if you expect consistent performance at all times.

Majority of Database Connections going to one Aurora DB

I have an AWS MySQL Aurora DB set up to autoscale when it reaches a certain CPU threshold. That said, looking at usage, the majority of DB connections are going to a single instance. Aurora is meant to load balance this, but is there anything additional I can do to ensure a more even distribution of DB connections? I'm having some latency issues with my website, and I think they could originate here, as all other metrics are normal.

Which MySQL configuration do I want for simple load balancing for a web application?

We are building a small advertising platform that will be used on several client sites. We want to set up multiple servers and load balancing (using Amazon Elastic Load Balancer) to help prevent downtime.
Our basic functions include rendering HTML for ads, recording click information (IP, user-agent, location, etc.), redirecting traffic with their click ID as a tracking variable (?click_id=XX), and basic data analysis for clients. It is very important that two different clicks don't end up with the same click ID.
I understand the basics of load balancing, but I'm not sure how to setup the MySQL server(s).
It seems there are a lot of options out there: master-master, master-slave, clusters, shards.
I'm trying to figure out what is best for us. The most important aspects we are looking for are:
Uptime - if one server goes down, automatically get picked up by another server.
Load sharing - keep CPU and RAM usage spread out.
From what I've read, it sounds like my best option might be a Master with 2 or more slaves. The Master would not be responsible for any HTTP traffic, that would go to the slaves only. The Master server would therefore only be responsible for database writes.
Would this slow down our click script? Since we have to insert first to get a click ID before redirecting, the slave would have to contact the master and then return with the data. Right now our click script is lightning fast, and I'd like to keep it that way.
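(For reference, the insert-then-redirect step is typically a single round trip to the writer: insert the click, read back the auto-generated key, and redirect. A sketch in plain JDBC, assuming a hypothetical `clicks` table with an AUTO_INCREMENT primary key, which guarantees no two clicks share an ID even under concurrency:)

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClickRecorder {
    // Returns the unique click ID to append as ?click_id=XX before redirecting.
    public static long recordClick(Connection master, String ip, String userAgent)
            throws Exception {
        String sql = "INSERT INTO clicks (ip, user_agent) VALUES (?, ?)";
        try (PreparedStatement ps =
                 master.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, ip);
            ps.setString(2, userAgent);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1); // AUTO_INCREMENT key from the master
            }
        }
    }
}
```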
Also, what happens if the Master goes down? Would a slave be able to serve as the Master until the Master was back online?
If you use Amazon's managed database service, RDS, this will take a lot of the pain out of managing your database.
You can select the Multi-AZ option on your master database instance to provide a redundant, synchronously replicated slave in another availability zone. In the event of a failure of the instance or the entire availability zone, Amazon will automatically flip the A record pointing to your master instance over to the slave in the backup AZ. This process, on vanilla MySQL or MariaDB, can take a couple of minutes, during which time your application will be unable to write to the database.
You can also provision up to 5 read replicas for a MySQL or MariaDB instance that will replicate from the master asynchronously. You could then use an Elastic Load Balancer (or another TCP load balancer such as HAProxy, or MariaDB's MaxScale for a more MySQL-aware option) to distribute traffic across the read replicas. By default each read replica holds a full copy of the master's data set, but you could attempt to manually shard the data across them; you would then need more complicated logic in your application or load balancer to work out where to find the relevant shard of the data set.
You can choose to promote a read replica into a standalone master, which breaks replication from the original master and gives you a standalone instance that can then be reconfigured to match your previous setup (or something different, starting from the data set you had at the point of promotion). It doesn't sound like something you need here, though.
Another option would be to use Amazon's own flavour of MySQL, Aurora, on RDS. Aurora is fully MySQL wire-compatible, so you can use whatever MySQL driver your application already uses to talk to it. Aurora allows up to 15 read replicas: writes go through the cluster (writer) endpoint, and reads sent to the reader endpoint are spread across however many read replicas you have in the cluster. In my limited testing, Aurora's failover between instances is pretty much instant too, which minimises downtime in the event of a failure.
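In practice that means keeping two data sources: one on the cluster (writer) endpoint for writes and one on the reader endpoint for reads. A minimal sketch with hypothetical endpoints:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class AuroraEndpoints {
    // The cluster (writer) endpoint always resolves to the current primary,
    // even after a failover.
    static final String WRITER_URL =
        "jdbc:mysql://mycluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306/ads";
    // The reader endpoint round-robins across replicas at DNS resolution time.
    static final String READER_URL =
        "jdbc:mysql://mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com:3306/ads";

    static Connection writer() throws Exception {
        return DriverManager.getConnection(WRITER_URL, "user", "password");
    }

    static Connection reader() throws Exception {
        return DriverManager.getConnection(READER_URL, "user", "password");
    }
}
```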

persistently replicating RDS MySQL database to external slave

AWS now allows you to replicate data from an RDS instance to an external MySQL database.
However, according to the docs:
Replication to an instance of MySQL running external to Amazon RDS is only supported during the time it takes to export a database from a MySQL DB instance. The replication should be terminated when the data has been exported and applications can start accessing the external instance.
Is there a reason for this? Can I choose to ignore this if I want the replication to be persistent and permanent? Or does AWS enforce this somehow? If so, are there any work-arounds?
It doesn't look like Amazon explicitly states why they don't support ongoing replication other than the statement you quoted. In my experience, if AWS doesn't explicitly document a reason for why they do something then you're not likely to find out unless they decide to document it at a later time.
My guess would be that it has to do with the dynamic nature of Amazon instances and how they operate within RDS. RDS instances can have their IP address change suddenly and without warning; we've encountered that on more than one occasion with the RDS instances we run. According to the RDS Best Practices guide:
If your client application is caching the DNS data of your DB instances, set a TTL of less than 30 seconds. Because the underlying IP address of a DB instance can change after a failover, caching the DNS data for an extended time can lead to connection failures if your application tries to connect to an IP address that no longer is in service.
Given that RDS instances can and do change their IP address from time to time, my guess is that they simply want to avoid having to support people who set up external replication only to have it suddenly break if/when an RDS instance gets assigned a new IP address. Unless you configure the replication user and any firewalls protecting your external MySQL server to be pretty wide open, replication could suddenly stop if the RDS master reboots for any reason (maintenance, hardware failure, etc.). And from a security point of view, opening up your replication user and firewall port like that is not a good idea.

Amazon RDS: can databases be setup in replicaton mode?

I am studying the new Amazon RDS product, and it seems it can be scaled only vertically (i.e., by moving to a stronger server).
Has anyone seen a way to configure multiple instances so that one is the master and the others are replication slaves?
Same question asked (and answered) here http://developer.amazonwebservices.com/connect/thread.jspa?threadID=37823
Looks like there are plans for Master-Master HA or similar, but that's not the same as a replicated scale-out offering.
According to the FAQ it is possible now, see http://aws.amazon.com/rds/faqs/#86 :
Q: What types of replication does Amazon RDS support and when should I use each?

Amazon RDS provides two distinct replication options to serve different purposes.

If you are looking to use replication to increase database availability while protecting your latest database updates against unplanned outages, consider running your DB Instance as a Multi-AZ deployment. When you create or modify your DB Instance to run as a Multi-AZ deployment, Amazon RDS will automatically provision and manage a “standby” replica in a different Availability Zone (independent infrastructure in a physically separate location). In the event of planned database maintenance, DB Instance failure, or an Availability Zone failure, Amazon RDS will automatically failover to the standby so that database operations can resume quickly without administrative intervention. Multi-AZ deployments utilize synchronous replication, making database writes concurrently on both the primary and standby so that the standby will be up-to-date in the event a failover occurs. While our technological implementation for Multi-AZ DB Instances maximizes data durability in failure scenarios, it precludes the standby from being accessed directly or used for read operations. The fault tolerance offered by Multi-AZ deployments makes them a natural fit for production environments; to learn more about Multi-AZ deployments, please visit this FAQ section.

If you are looking to take advantage of MySQL 5.1’s built-in replication to scale beyond the capacity constraints of a single DB Instance for read-heavy database workloads, Amazon RDS makes it easier with Read Replicas. You can create a Read Replica of a given “source” DB Instance using the AWS Management Console or the CreateDBInstanceReadReplica API. Once the Read Replica is created, database updates on the source DB Instance will be propagated to the Read Replica. You can create multiple Read Replicas for a given source DB Instance and distribute your application’s read traffic amongst them. Unlike Multi-AZ deployments, Read Replicas use MySQL 5.1’s built-in replication and are subject to its strengths and limitations. In particular, updates are applied to your Read Replica(s) after they occur on the source DB Instance (“asynchronous” replication), and replication lag can vary significantly. This means recent database updates made to a standard (non-Multi-AZ) source DB Instance may not be present on associated Read Replicas in the event of an unplanned outage on the source DB Instance. As such, Read Replicas do not offer the same data durability benefits as Multi-AZ deployments, and while they can provide some read availability benefits, they are not designed to improve write availability.

With Amazon RDS, you can use Multi-AZ deployments and Read Replicas in conjunction to enjoy the complementary benefits of each: simply specify that a given Multi-AZ deployment is the source DB Instance for your Read Replica(s). That way you gain both the data durability and availability benefits of Multi-AZ deployments and the read scaling benefits of Read Replicas.
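For completeness, the CreateDBInstanceReadReplica call mentioned above can also be driven from code. A sketch using the AWS SDK for Java (v1), assuming it is on the classpath and credentials are configured; the instance identifiers are hypothetical:

```java
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.CreateDBInstanceReadReplicaRequest;
import com.amazonaws.services.rds.model.DBInstance;

public class CreateReadReplica {
    public static void main(String[] args) {
        AmazonRDS rds = AmazonRDSClientBuilder.defaultClient();

        // Create a Read Replica of the hypothetical source instance "mydb".
        CreateDBInstanceReadReplicaRequest request =
            new CreateDBInstanceReadReplicaRequest()
                .withDBInstanceIdentifier("mydb-replica-1")
                .withSourceDBInstanceIdentifier("mydb");

        DBInstance replica = rds.createDBInstanceReadReplica(request);
        System.out.println("Provisioning replica: "
            + replica.getDBInstanceIdentifier());
    }
}
```

Once the replica is available, you point read traffic at its endpoint (or at several replicas behind your own balancing layer, as discussed earlier in this thread).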