AWS Timeout Problems with Elastic Beanstalk App with DB Access - MySQL

Hi, when my Elastic Beanstalk environment (m5a.large Windows Server instances with a deployed .NET Core WebAPI) comes under heavy load, the status on the Health page for my EC2 instances turns red, and my requests and the health check time out. That happens around 1-3 minutes after a server sustains a minimum of 10-20 requests/second.
I have to launch a lot of servers, so that each server only gets 1-5 requests/second and does not turn red.
In my logs I saw the following errors:
Exception=MySql.Data.MySqlClient.MySqlException (0x80004005): Unable to connect to any of the specified MySQL hosts.
---> MySql.Data.MySqlClient.MySqlException (0x80004005): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
These errors brought me to the topic of connection pooling, so I switched from
using MySql.Data.MySqlClient;
to
using MySqlConnector;
Now these errors no longer come up, but the problem remains.
The monitoring features of EB and RDS do not show any obvious problems, and running queries against the database in MySQL Workbench is as fast as usual.
At the moment, my database calls from the server are synchronous and do not use the async feature of MySqlConnector.
Can the m5a.large really not process more than 5 requests/second?
Kind Regards

Related

SQL Communications link failure when accessing SQL database on the same machine

I am running a Minecraft server which queries a SQL database on the same VPS as the game server for various bits of data (ranks, warps and such). Today I encountered something which has stumped me, and I am turning to you hoping for an answer.
My VPS is running Ubuntu Server 18.04.4 LTS, and the Minecraft server runs inside a screen instance. No changes that I am aware of have been made to the server, the operating system or MySQL that could have caused an issue, and given that I am the only one with SSH access to the server, there is no way an update could have been made unless it happened silently in the background without my knowledge. The problem just appeared out of nowhere.
The specific error I am getting (output from the MC server console):
2021-05-19 12:44:51 [INFO] [STDOUT] [SERVERMANAGER EXCEPTION | com.mysql.cj.jdbc.exceptions.CommunicationsException] Communications link failure
2021-05-19 12:44:51 [INFO] [STDOUT] The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
Granted, I have little experience with SQL (I inherited this SQL setup from someone else whom I am unable to contact for help fixing this), but the last line of this error led me to believe that everything with the Minecraft server itself was fine and the SQL database simply was not acknowledging the packets the Minecraft server was sending.
However, despite...
Restarting the MySQL service with sudo service mysql restart
Verifying that the servermanager database I am using is still present with SHOW DATABASES;
Verifying that the three tables servermanager has are still present with SHOW TABLES;
Restarting the VPS
I cannot for the life of me get this issue to fix itself!
Is anyone able to offer any insight?

Increase the number of connections in my MySQL server

I have applications that connect to a remote server (MySQL 5.5 on Windows Server 2012). At first I started receiving a "too many connections" message, which I solved by increasing the max_connections value in my.ini to 500. Then I started getting a "can't create new thread" message, so I decreased timeouts to avoid idle connections holding a socket, which didn't completely work. Now I get odd messages like "file not found"; as soon as I restart the service, the messages stop and everything works correctly.
The problem occurs when the server reaches around 170 connections at the same time.
Is there some configuration I'm missing? I really don't know what info you need in order to give me a hint to fix this. I mean, there are servers that accept a lot more connections at the same time, right? What am I missing?
RAM and CPU of the system don't exceed 35-40% at max connections (170).
Edit: The error occurs in two 'places': when running a query, or at the attempt to connect; it's as if the MySQL service rejects the attempt. VB6 is the language used in the client app (ODBC connector). The app opens, executes and closes the connection.
Note: I have full control over client app and server config.

Intermittently can't connect to mysql on AWS RDS (Error 2003)

We are having an intermittent issue with connections to our MySQL server timing out.
The error we are receiving is as follows:
(2003, 'Can\'t connect to MySQL server on \'<connection>\' ((2013, "Lost connection to MySQL server during query (error(104, \'Connection reset by peer\'))"))')
Callstack:
File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 818, in _connect
2003, "Can't connect to MySQL server on %r (%s)" % (self.host, e))
File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 626, in __init__
self._connect()
Some more info:
We have a fleet of EC2 servers that are constantly running queries against a backend RDS instance.
We average about 500 connections per second to the RDS
We have around 0 - 4 hiccups per RDS per day
The hiccups don't correspond with our maintenance window
When we hit a hiccup it can affect quite a few connections ~50
When a hiccup happens it will disrupt connections across all servers and ports
The error itself looks to be generated by the TCP connection being closed on the EC2 side. Our TCP keepalive time is set to 7200 seconds, and that is when the error fires.
My question is: what can be done to track down why these hiccups happen? It's great that they don't happen often, but it's not ideal that they happen at all.
Any advice would be appreciated, thanks!
Update 10/29:
I've been running a service that checks whether I have any long-running processes on the SQL server, and it looks like these errors aren't getting that far: a new process is never created for the failing connection! I have still been receiving the hiccups, just no sign of the connections.
After a back and forth with Amazon support, here is the solution we arrived at:
Amazon raised our socket listen backlog by adjusting the somaxconn value on the RDS instance.
The value was at the default of 128 and has been bumped up to 1024.
Once the value was adjusted, we no longer received the Lost Connection error.

Configure GlassFish JDBC connection pool to handle Amazon RDS Multi-AZ failover

I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool in order to minimize downtime in case of a database failover.
My current configuration isn't working correctly during a Multi-AZ failover: the standby database instance appears to be available within a couple of minutes (according to the AWS console), while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish invalidates the broken connections, throws some exceptions and then reconnects as soon as the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exceptions at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception, so there is no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15 and 16 minutes, so it looks like a timeout of some sort, but I was unable to change it.
Things I have tried without success:
connection leak timeout/reclaim
statement leak timeout/reclaim
statement timeout
using a different validation method
using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?
As I commented before, this happens because the sockets that are open and connected to the database don't realize the connection has been lost, so they stay connected until the OS socket timeout is triggered, which I have read is usually around 30 minutes.
To solve the issue you need to override the socket timeout in your JDBC connection string, or in the JNDI connection configuration/properties, by setting the socketTimeout parameter to a smaller value.
Keep in mind that any connection open longer than the defined value will be killed, even if it is being used (I haven't been able to confirm this; it is what I read).
The other two parameters I mentioned in my comment are connectTimeout and autoReconnect.
Here's my JDBC Connection String:
jdbc:(...)&connectTimeout=15000&socketTimeout=60000&autoReconnect=true
I also disabled Java's DNS cache by doing:
java.security.Security.setProperty("networkaddress.cache.ttl" , "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl" , "0");
I do this because Java doesn't honor the DNS TTLs, and when the failover takes place the DNS name stays the same but the IP changes.
Since you are using an application server, the properties that disable the DNS cache must be passed to the JVM when starting GlassFish with -Dnet and not set in the application itself.

Quartz failure in notifyJobStoreJobComplete method

Scenario:
We have a scheduler which uses a JDBC job store. The Quartz version is 2.1.2.
The job being scheduled also updates a database.
The database is the same for both Quartz and the job itself and is hosted in MySQL Server; both the application tables and the Quartz tables live in the same database.
The connection pools are separate: the application uses Spring for connection pooling, while Quartz is forced to use its own pooling via quartz.properties.
Here is a snippet of quartz.properties:
org.quartz.dataSource.qzDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.qzDS.URL = jdbc:mysql://localhost:3306/dbname?autoReconnect=true
org.quartz.dataSource.qzDS.user = dbuser
org.quartz.dataSource.qzDS.password =dbpassword
org.quartz.dataSource.qzDS.maxConnections = 30
org.quartz.dataSource.qzDS.validationQuery = select 1
#org.quartz.dataSource.qzDS.minEvictableIdleTimeMillis=21600000
#org.quartz.dataSource.qzDS.timeBetweenEvictionRunsMillis=1800000
#org.quartz.dataSource.qzDS.numTestsPerEviction=-1
#org.quartz.dataSource.qzDS.testWhileIdle=true
org.quartz.dataSource.qzDS.debugUnreturnedConnectionStackTraces=true
org.quartz.dataSource.qzDS.unreturnedConnectionTimeout=120
org.quartz.dataSource.qzDS.initialPoolSize=5
org.quartz.dataSource.qzDS.minPoolSize=5
org.quartz.dataSource.qzDS.maxPoolSize=30
org.quartz.dataSource.qzDS.acquireIncrement=5
org.quartz.dataSource.qzDS.maxIdleTime=120
org.quartz.dataSource.qzDS.validateOnCheckout=true
The database is clustered with master-master replication across two servers, and it is accessed via a virtual IP everywhere in the application and in Quartz.
The scheduler, i.e. Quartz, is also clustered on the same two machines where MySQL is clustered.
The problem:
One of the servers (so far we have only seen the problem on the backup server machine) occasionally throws a database connection error while calling the notifyJobStoreJobComplete method. This causes the job to stay in the BLOCKED state even though the job itself completed successfully, because Quartz was unable to update its status.
Questions:
What can be the cause of the problem?
How can we move the BLOCKED jobs into the WAITING state so that they can at least run at their next scheduled time? Directly editing the QRTZ_SIMPLE_TRIGGERS table would not be a good solution, even if it worked.
EDIT: To bump up the question.
The error during notifyJobStoreJobComplete is: org.quartz.impl.jdbcjobstore.JobStoreTX - Failed to override connection auto commit/transaction isolation.
[java] com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 619,082,686 milliseconds ago. The last packet sent successfully to the server was 619,082,686 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
I think the main problem was the communication link failure reported by MySQL, which we solved by increasing wait_timeout to 14 days; since our maintenance is scheduled every 15 days, we restart each MySQL server in our DB cluster within that window (we have master-master replication in place). With this approach we haven't had a single communication link failure since. In fact, sometimes we don't restart the servers every 15 days and still see no errors (touch wood). :)
As for Quartz triggers being stuck in the BLOCKED state: we updated Quartz to 2.1.4, which possibly contains a fix for almost the same problem. Since the Quartz update we have seen triggers stuck in the BLOCKED state far less frequently.
We are still unable to find a way to get a trigger out of the BLOCKED state without directly modifying the Quartz tables. Whenever we face this problem, we manually remove the entry for the BLOCKED trigger from the qrtz_fired_triggers table, and that solves it. I think the enterprise version of Quartz may expose this feature through a web UI.