Intermittently can't connect to MySQL on AWS RDS (Error 2003)

We are having an intermittent issue with connections to our MySQL server timing out.
The error we are receiving is as follows:
(2003, 'Can\'t connect to MySQL server on \'<connection>\' ((2013, "Lost connection to MySQL server during query (error(104, \'Connection reset by peer\'))"))')
Callstack:
File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 818, in _connect
2003, "Can't connect to MySQL server on %r (%s)" % (self.host, e))
File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 626, in __init__
self._connect()
Some more info:
We have a fleet of EC2 servers that are constantly running queries against a backend RDS instance.
We average about 500 connections per second to the RDS instance
We see around 0 to 4 hiccups per RDS instance per day
The hiccups don't correspond with our maintenance window
When we hit a hiccup it can affect quite a few connections (roughly 50)
When a hiccup happens it disrupts connections across all servers and ports
The error itself appears to be generated by the TCP connection being closed on the EC2 side. Our TCP keepalive time is set to 7200 seconds, and that is exactly when the error fires (see the keepalive sketch below).
My question is: what can be done to track down why these hiccups happen? It's good that they don't happen often, but ideally they wouldn't happen at all.
Any advice would be appreciated, thanks!
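One thing you can tune independently of the server is the client's TCP keepalive behaviour, so a half-open connection is detected long before 7200 seconds. A minimal sketch, assuming Linux and a socket you can get hold of; PyMySQL keeps its socket internally, so treat this as illustrative rather than a drop-in fix:

import socket

def enable_fast_keepalive(sock, idle=60, interval=10, probes=5):
    # Turn on TCP keepalive: probe a silent connection after `idle`
    # seconds, every `interval` seconds, giving up after `probes` failures.
    # TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux-specific.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)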
Update 10/29:
I've been running a service that checks for long-running processes on the SQL server, and it looks like these errors aren't getting that far: a new process is never even created for the failing connection. I am still seeing the hiccups, just no sign of the connections on the server side.
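While tracking this down, a client-side mitigation is to retry on exactly these transient error codes instead of failing the request. A minimal sketch, assuming PyMySQL; host, user, and database names are placeholders:

import time
import pymysql

def run_query(sql, retries=3, delay=1.0):
    # Retry only on the two transient codes we see during a hiccup:
    # 2003 (can't connect) and 2013 (lost connection). Anything else
    # is re-raised immediately.
    for attempt in range(retries):
        try:
            conn = pymysql.connect(host="rds-endpoint", user="app",
                                   password="secret", db="appdb",
                                   connect_timeout=10)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError as e:
            if e.args[0] not in (2003, 2013) or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff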

So after a back-and-forth with Amazon support, here is the current solution we have come to.
Amazon has raised our socket listen backlog by adjusting the somaxconn value on the RDS instance.
The value was at the default of 128 and has been bumped up to 1024.
Once the value was adjusted, we no longer received the Lost Connection error.
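You can't shell into RDS to inspect this yourself (hence the support ticket), but on a self-managed Linux MySQL host the equivalent check compares the OS backlog cap with MySQL's own back_log setting; the effective listen backlog is the smaller of the two. A rough sketch, with placeholder connection details:

import pymysql

# OS-level cap on any socket's listen backlog (Linux).
with open("/proc/sys/net/core/somaxconn") as f:
    print("net.core.somaxconn =", f.read().strip())

# MySQL's own requested backlog; the kernel uses min(back_log, somaxconn).
conn = pymysql.connect(host="db-host", user="app", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'back_log'")
    print(cur.fetchone())
conn.close()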

Related

AWS Time Out Problems with Elastic Beanstalk App with DB Access

Hi, when my Elastic Beanstalk (m5a.large Windows Server with a deployed .NET Core WebApi) comes under heavy load, the status on the Health page for my EC2 instances turns red, and my requests and the health check time out. That happens around 1-3 minutes after sustaining a minimum of 10-20 req/sec per server.
I have to launch a lot of servers so that each server only gets 1-5 requests/second; then they do not turn red.
In my logs I saw the following errors:
Exception=MySql.Data.MySqlClient.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
---> MySql.Data.MySqlClient.MySqlException (0x80004005): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
These errors brought me to the topic of connection pooling, so I switched
using MySql.Data.MySqlClient;
to
using MySqlConnector;
Now these errors do not come up anymore, but the problem remains.
The monitoring features of EB and RDS do not show any obvious problems. Running queries against the database in MySQL Workbench is as fast as usual.
At the moment, my database calls from the server are synchronous and do not use the async feature of MySqlConnector (see the pooling/async sketch below).
Can the m5a.large really not process more than 5 requests/second?
Kind Regards
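To illustrate the pooling-plus-async idea the question is circling, here is a sketch in Python with aiomysql rather than MySqlConnector (connection details and pool sizes are placeholders): a bounded pool of reused connections, with queries awaited instead of each request blocking a thread while it waits on the database.

import asyncio
import aiomysql

async def main():
    # A bounded, reused pool: at most 10 server connections, however
    # many requests are in flight.
    pool = await aiomysql.create_pool(host="db-host", user="app",
                                      password="secret", db="appdb",
                                      minsize=1, maxsize=10)

    async def handle_request():
        async with pool.acquire() as conn:   # waits if all 10 are busy
            async with conn.cursor() as cur:
                await cur.execute("SELECT 1")
                return await cur.fetchone()

    # 100 concurrent "requests" share the 10 pooled connections without
    # tying up a thread per request.
    await asyncio.gather(*[handle_request() for _ in range(100)])
    pool.close()
    await pool.wait_closed()

asyncio.run(main())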

Unexpected MySql PoolExhaustedException

I am running two EC2 instances on AWS to serve my application, one instance for each. Each application can open up to 100 connections to MySQL by default. For the database I use RDS with a t2.medium instance, which can handle 312 connections at a time.
In general, my connection count does not grow larger than 20. When I start sending notifications to users to bring them to the application, the connection count increases a lot (which is expected). In some cases, MySQL connections increase unexpectedly and my application starts to throw PoolExhaustedException:
PoolExhaustedException: [pool-7-thread-92] Timeout: Pool empty. Unable to fetch a connection in 30 seconds, none available[size:100; busy:100; idle:0; lastwait:30000].
When I check the database connections from Navicat, I see that there are about 200 connections, and all of them are sleeping. I do not understand why the open connections are not reused (see the pool-tuning sketch below). I use standard Spring Data JPA to save and read my entities, which means I do not open or close connections manually.
Unless I shut down one of the instances so that the MySQL connections are released, neither instance responds at all.
You can see the graph of the MySQL connection count here, and a piece of the log here.
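A common culprit for piles of sleeping connections is a pool that never retires or validates idle connections. The question uses Spring Data JPA, but the relevant knobs exist in most pools; a sketch of the same idea in Python with SQLAlchemy (URL and sizes are placeholders):

from sqlalchemy import create_engine, text

# Bound the pool and retire idle connections so sleeping server-side
# sessions don't pile up.
engine = create_engine(
    "mysql+pymysql://app:secret@db-host/appdb",
    pool_size=20,        # steady-state connections kept open
    max_overflow=10,     # temporary extras allowed under burst load
    pool_recycle=1800,   # replace any connection older than 30 minutes
    pool_pre_ping=True,  # validate a connection before handing it out
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())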

Increase the number of connections in my MySQL server

I have applications that connect to a remote server (MySQL 5.5 on Windows Server 2012). At first I started receiving a "too many connections" message, which I solved by increasing the max_connections value in my.ini to 500. Then I started getting a "can't create new thread" message, so I decreased the timeouts to avoid idle connections holding sockets, which didn't completely work. Now I get odd messages like 'file not found'; as soon as I restart the service the messages stop and everything works correctly.
The problem occurs when the server reaches around 170 simultaneous connections.
Is there some configuration I'm missing? I really don't know what info you need to give me a hint to fix this. I mean, there are servers that accept a lot more connections at the same time, right? What am I missing?
RAM and CPU usage on the system don't reach 35-40% even at the maximum connection count (170).
Edit: The error occurs in two 'places': when running a query, or on the attempt to connect; it's as if the MySQL service rejects the attempt. VB6 is the language used in the client app (ODBC connector). The app opens, executes, and closes the connection.
Note: I have full control over the client app and the server config (a diagnostic sketch follows below).
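A quick way to tell whether you're hitting a connection/thread ceiling rather than CPU or RAM is to watch MySQL's own counters around the 170-connection mark. A small sketch in Python (the client app is VB6, so treat this as illustrative; connection details are placeholders):

import pymysql

# Run this while the problem is occurring.
conn = pymysql.connect(host="db-host", user="root", password="secret")
with conn.cursor() as cur:
    # The configured ceiling versus what is actually open right now.
    cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
    print(cur.fetchone())
    cur.execute("SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('Threads_connected', 'Threads_created', 'Connections')")
    # Threads_created climbing fast relative to Connections suggests
    # thread_cache_size is too small for the connection churn.
    for row in cur.fetchall():
        print(row)
conn.close()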

Intermittent 2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0" to CloudSQL

I set up an old Django app on a new GCE instance on Sunday and pointed it at a new CloudSQL instance with imported data. This code and data have run successfully over the past few years on a variety of dedicated hosting setups, on EC2, and on EC2+RDS.
Since Sunday I have had intermittent reports of 2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 0" from the app. In particular, today it happened in two bursts of 3, separated by about 7 hours.
I panicked during the earlier outages and restarted both the app and the CloudSQL instance, which did the trick. However, the later ones today righted themselves after a few minutes.
I've never encountered this error before in my work with MySQL, and searching on it mostly turns up results about people with general access problems to the DB.
On the GCE side, the only difference I can think of from previous setups is that it uses Google's out-of-the-box Debian image instead of Ubuntu 12.04. On the MySQL side I have no idea, as I've successfully run this on both MySQL 5.x and MariaDB.
Is there any way of figuring out why this is happening and fixing it?
Thanks.
Have you tried changing the keepalive settings for TCP connections? GCE has a firewall rule that drops idle TCP connections after 10 minutes:
https://cloud.google.com/compute/docs/troubleshooting#communicatewithinternet
You can check current value of 'tcp_keepalive_time':
cat /proc/sys/net/ipv4/tcp_keepalive_time
And change it to 60 seconds:
vi /etc/sysctl.conf
# Add this line
net.ipv4.tcp_keepalive_time = 60
# Reload Sysctl interface
sudo /sbin/sysctl --load=/etc/sysctl.conf
You might need to restart the Django server to pick up the new keepalive settings.
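If the app holds persistent database connections, it also helps to make sure Django never reuses a connection that may already have sat idle past the firewall's 10-minute cutoff. A minimal settings sketch, assuming Django 1.6+ where CONN_MAX_AGE exists (names and values are placeholders):

# settings.py sketch: keep persistent connections shorter than GCE's
# 10-minute idle cutoff, or set 0 to open a fresh connection per request.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "appdb",
        "USER": "app",
        "PASSWORD": "secret",
        "HOST": "cloudsql-ip",
        "CONN_MAX_AGE": 300,  # seconds; keep well under 600
    }
}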
Note: If this problem was limited to yesterday (18/11/2014) and your Cloud SQL instances are located in the EU, you might have been affected by this:
https://groups.google.com/forum/#!topic/google-cloud-sql-announce/k5raPT48hc0
It seems there is an issue with the firewall preventing incoming connections; just change the server's location to another mirror and things are supposed to go fine. It worked for me.

SQL Server "network-related or instance-specific error" once a day or so (perplexed!)

We are experiencing the same error as this StackOverflow Q ...
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)
at System.Data.ProviderBase.DbConnectionPool.GetConnection(DbConnection owningObject)
at System.Data.ProviderBase.DbConnectionFactory.GetConnection(DbConnection owningConnection)
at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
at System.Data.SqlClient.SqlConnection.Open()
at System.Data.Linq.SqlClient.SqlConnectionManager.UseConnection(IConnectionUser user)
at System.Data.Linq.SqlClient.SqlProvider.get_IsSqlCe()
at System.Data.Linq.SqlClient.SqlProvider.InitializeProviderMode()
at System.Data.Linq.SqlClient.SqlProvider.System.Data.Linq.Provider.IProvider.Execute(Expression query)
... except that in the referenced StackOverflow Q, they needed to restart SQL Server once the error occurred, and we do not. We'll get this error once a day, or once every few days, and all is fine after the error occurs until the next time it occurs.
This makes us think it's not a "forgot to close connections" issue. We have a moderately busy ASP.NET 4.0 WebForms / SQL Server 2008 R2 app, but we're quite positive we're not exceeding the max number of database connections.
Any thoughts on this problem, or an approach to diagnose?
Thought I would comment on our progress with this.
While none of the SQL Server documentation/articles/blogs mention that this error can be caused by server busyness, I found a forum posting where a seasoned IT pro named Matt Neerincx states that it can be, as follows:
Possible reasons for this error include:
1. Poor network link from client to server.
2. Server is very busy (meaning high CPU) and cannot respond to new connection attempts.
3. Server is running out of memory (so high memory usage for SQL).
4. tcp-ip layer on client is over-saturated with connection attempts so tcp-ip layer rejects the connection.
5. tcp-ip layer on server side is over-saturated with connection attempts and so tcp-ip layer is rejecting new connections.
6. With SQL 2005 SP2 and later there could be a custom login trigger that rejects your connection.
You can increase the connect timeout to potentially alleviate issues #2, #3, #4, #5. Setting a longer connect timeout means the driver will try longer to connect and may eventually succeed.
Determining the root cause of these intermittent failures is unfortunately not super easy to do. What I normally do is start by examining the server environment: is the server constantly running at high CPU, for example? That points to #2. Is the server using a huge amount of memory? That points to #3. You can run SQL Profiler to monitor logins and look for patterns; perhaps every morning at 9 AM there is a flurry of connections, etc.
So we are presently walking down this path: reducing the number of queries that execute at the same time in some of our batch jobs, optimizing some of our queries, etc.
Also, in our app connection string, we increased the connection timeout and set Min Pool Size to 20 (thinking it's good to ensure some existing, unused connections for the app to grab, rather than needing to establish a new one).
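For reference, the relevant SqlClient connection-string keywords look something like this (server and database names are placeholders):

Data Source=myServer;Initial Catalog=myDb;Integrated Security=True;Connect Timeout=60;Min Pool Size=20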
At this moment, it's been almost 48 hours without the error, which makes us very hopeful.