Configure GlassFish JDBC connection pool to handle Amazon RDS Multi-AZ failover (MySQL)

I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool in order to minimize downtime in case of a database failover.
My current configuration isn't working correctly during a Multi-AZ failover: the standby database instance becomes available within a couple of minutes (according to the AWS console), while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon as the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15-16 minutes, so it looks like a timeout of some sort but I was unable to change it.
Things I have tried without success:
connection leak timeout/reclaim
statement leak timeout/reclaim
statement timeout
using a different validation method
using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?

As I commented before, this happens because the sockets that are open and connected to the database don't realize the connection has been lost, so they stay connected until the OS socket timeout is triggered, which I've read is usually around 30 minutes.
To solve the issue you need to override the socket timeout in your JDBC connection string or in the JNDI connection configuration/properties, setting the socketTimeout parameter to a smaller value.
Keep in mind that any connection held longer than the defined value will be killed, even if it is being used (I haven't been able to confirm this; it's what I read).
The other two parameters I mention in my comment are connectTimeout and autoReconnect.
Here's my JDBC Connection String:
jdbc:(...)&connectTimeout=15000&socketTimeout=60000&autoReconnect=true
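If, as in the question above, the pool is defined with a datasource class instead of a URL, the same Connector/J settings can presumably be passed as pool properties (Connector/J exposes them as bean properties on MysqlConnectionPoolDataSource, as far as I know). A rough sketch reusing the asadmin command from the question:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT:connectTimeout=15000:socketTimeout=60000:autoReconnect=true \
MyPool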
I also disabled Java's DNS cache by doing
java.security.Security.setProperty("networkaddress.cache.ttl" , "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl" , "0");
I do this because Java doesn't honor the TTLs, and when the failover takes place the DNS name stays the same but the IP changes.
Since you are using an application server, the parameters to disable the DNS cache must be passed to the JVM when starting GlassFish with -Dnet, not set in the application itself.
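For example (a sketch only; sun.net.inetaddr.ttl and sun.net.inetaddr.negative.ttl are the system-property equivalents I'm assuming here, so check that they apply to your Java version), the options can be added to the GlassFish domain and picked up on restart:
asadmin create-jvm-options '-Dsun.net.inetaddr.ttl=0'
asadmin create-jvm-options '-Dsun.net.inetaddr.negative.ttl=0'
asadmin restart-domain
Depending on your shell and asadmin version, the leading dash may need extra quoting or escaping.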

Related

Unable to configure HikariCP in Spring Boot/JDBI/MySQL application

I am building a RESTful interface to a MariaDB-hosted database, and I cannot figure out how to properly configure HikariCP so that my database connections don't time out after the server has been idle for a while.
I am on Linux, Java 1.8, and my database server is stock MariaDB 5.5.60. My application uses the following tech stack:
spring-boot-starter-jdbc:2.0.1
spring-boot-data-rest:2.0.1
jdbi3-core:3.1.0
jdbi3-sqlobject:3.1.0
mysql-connector-java:5.1.46
HikariCP:2.7.8 (implicitly provided via Spring)
My application.properties file currently looks like this:
spring.datasource.url=jdbc:mysql://localhost/my_database
spring.datasource.username=myusername
spring.datasource.password=myp#ssw0rd
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
# 15 min * 60 sec * 1000 ms = 900000
spring.datasource.hikari.maxLifetime=900000
The "maxLifetime" value is being ignored. I have tried all sorts of Hikari-related things in this file (many found here on SO) but none of them seem to work. When I try hitting the server after it has been idle overnight, I get the following warning:
com.zaxxer.hikari.pool.ProxyConnection: HikariPool-1 - Connection com.mysql.jdbc.JDBC4Connection@140ae1bb marked as broken because of SQLSTATE(08S01), ErrorCode(0)
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 422,968,077 milliseconds ago. The last packet sent successfully to the server was 422,968,086 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
...and then a pile of errors and stack traces from which I'll spare you.
My intuition tells me that there is some magical combo of parameters missing from my application.properties file, but I'm at a loss. I also don't know how to verify it's actually working without having to wait overnight.
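To avoid waiting overnight, one rough idea is to take Spring out of the picture and drive HikariCP directly with a short maxLifetime, just to watch connections being retired (a sketch only; it reuses the URL and credentials from my application.properties, and the class name is made up):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;

public class HikariLifetimeCheck {
    public static void main(String[] args) throws Exception {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://localhost/my_database"); // same URL as application.properties
        cfg.setUsername("myusername");
        cfg.setPassword("myp#ssw0rd");
        cfg.setMaxLifetime(30_000); // 30 s, which I believe is the smallest value HikariCP accepts

        try (HikariDataSource ds = new HikariDataSource(cfg)) {
            try (Connection c = ds.getConnection()) {
                System.out.println("first connection: " + c);
            }
            // Wait past maxLifetime; with DEBUG logging for com.zaxxer.hikari you should
            // see the idle connection being retired and a fresh one created.
            Thread.sleep(60_000);
            try (Connection c = ds.getConnection()) {
                System.out.println("connection after maxLifetime: " + c);
            }
        }
    }
}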
Any help is appreciated!

Tomcat Jdbc Connection Pool active connection

We have a Spring Boot application which uses embedded Tomcat for deployment and the default tomcat-jdbc connection pooling, with a MySQL back-end and no customization on the MySQL or Tomcat side. The app has a few schedulers that run mostly at specific times of day, i.e. between the last cron run yesterday and the first cron run today there is a gap of more than 9 hrs. However, the cron runs never came across an idle connection issue before. Nowadays we see an error message
The last packet successfully received from the server was XXXXXXXX milliseconds ago. The last packet sent successfully to the server was XXXXXXXY milliseconds ago.
I can always try using testOnBorrow with validationQuery and/or testWhileIdle etc. as required to get this working, but...
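For reference, I believe those settings map to application.properties entries via the spring.datasource.tomcat.* binding for the default tomcat-jdbc pool, roughly like this (values are just an example):
spring.datasource.tomcat.test-on-borrow=true
spring.datasource.tomcat.validation-query=SELECT 1
spring.datasource.tomcat.test-while-idle=true
spring.datasource.tomcat.validation-interval=30000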
I'm trying to understand the lifecycle of an active connection in tomcat-jdbc connection pooling. According to the documentation, the default value of wait_timeout for MySQL is 8 hrs, whereas the default for idle_connection_timeout in tomcat-jdbc is nearly 6 secs.
If the default values are in use everywhere, then why has the issue never surfaced before?
Or is it that the connections in the tomcat-jdbc connection pool become active every time the cron starts running and go idle thereafter?
Is it the state of the spring-boot app or the scheduler that makes any difference?
The problem is not in the configuration or setup. The Spring Boot app uses the spring-data library, which makes use of the underlying connection pool. The pool handles the connection(s) as per the connection pool implementation. The use of @Transactional, however, decides when the underlying connection is opened. If none is specified in the Spring Boot app, the default spring-data implementation opens it during CRUD operations; otherwise it is opened during the call to the application method annotated with @Transactional.
In my case it was the latter. After the connection was opened, a time-consuming non-DB process was running, which left the connection idle right after it was opened, and an exception was thrown when the connection was actually used later.
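To illustrate that pattern with a made-up sketch (the service, ReportRepository and the method names are hypothetical, not from the actual app): the pooled connection is tied to the whole @Transactional method, sits idle during the long non-DB step, and is only touched at the end.
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical repository; in the real app this would be a Spring Data interface.
interface ReportRepository {
    void save(String report);
}

@Service
public class NightlyJobService {

    private final ReportRepository reportRepository;

    public NightlyJobService(ReportRepository reportRepository) {
        this.reportRepository = reportRepository;
    }

    @Transactional // per the answer above, the underlying connection is opened for this call...
    public void runNightlyJob() {
        String report = buildReport();   // ...sits idle during this time-consuming non-DB step...
        reportRepository.save(report);   // ...and is only used here, possibly after MySQL has dropped it
    }

    private String buildReport() {
        // Stand-in for the long-running, non-database part of the job.
        return "report";
    }
}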

Broken Pipe exception on idle server

I am using a Dropwizard server to serve HTTP requests. This Dropwizard application is backed by a MySQL server for data storage. But when left idle (overnight) it gives a 'broken pipe' exception.
I did a few things that I thought might help. I set the JDBC URL in the yaml file to 'autoConnect=true'. I also added a 'checkOnBorrow' property. I have increased the JVM to use 4 GB.
None of these fixes worked.
Also, the wait_timeout and interactive_timeout for the MySQL server are set to 8 hours.
Do these need to be more/less?
Also, is there a configuration property that can be set in the Dropwizard yaml file? In other words, how is connection pooling managed in Dropwizard?
The problem:
The MySQL server has a timeout configured after which it terminates all idle connections in the connection pool. In my case this was the default (8 hrs). However, the database connection pool is unaware of the terminated connections in the pool. So when a new request comes in, a dead connection is taken from the connection pool, which results in a 'Broken Pipe' exception.
Solution:
So to fix this, we need to get rid of the dead connections and make the pool aware when the connection it is trying to borrow is dead. This can be achieved by setting the following in the .yml configuration.
checkOnReturn: true
checkWhileIdle: true
checkOnBorrow: true
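In context, these sit under the database: section of the Dropwizard config, roughly like this (a sketch; the surrounding keys are the usual DataSourceFactory ones with placeholder values, and the exact names of the check* settings may differ between Dropwizard versions):
database:
  driverClass: com.mysql.jdbc.Driver
  url: jdbc:mysql://localhost/mydb
  user: myuser
  password: mypassword
  validationQuery: SELECT 1
  checkOnBorrow: true
  checkWhileIdle: true
  checkOnReturn: true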

SORM vs MySQL idle connection

I'm using Play Framework 2.2.1, MySQL 5.5 and sorm 0.3.10
Since MySQL drops inactive connections after a specified idle timeout, I'm getting this exception in my app:
[CommunicationsException: Communications link failure The last packet successfully received from the server was 162 701 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago.]
As far as I understand, SORM is using the c3p0 connection pool. Is it possible to somehow configure c3p0 or SORM to ping MySQL at a specified interval, or to reconnect automatically after the connection was dropped?
0.3.13-SNAPSHOT of SORM introduces a timeout parameter for Instance with a default setting of 30. This setting determines the number of seconds the underlying connections are allowed to be idle. When the timeout is reached, a sort of "keepalive" request is sent to the db and the timer is reset. The timer also gets reset when any normal query is made. The implementation simply relies on the idleConnectionTestPeriod of C3P0.
For further discussion, suggestions and reports, please visit the associated ticket on the issue tracker or open another one. If there are no complaints in the associated ticket, this change will make it into the 0.3.13 release.
It's very easy to resolve this issue with c3p0, but I'd double-check whether you are using it. BoneCP is the default Play 2 connection pool. It would be easy to solve this problem with BoneCP too!
In c3p0, the config params maxIdleTime, maxConnectionAge, or (much better yet) a connection testing regime would help. See http://www.mchange.com/projects/c3p0/#configuring_connection_testing
If you want to use c3p0 in Play 2, see https://github.com/swaldman/c3p0-play
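A minimal sketch of what such a connection-testing setup could look like with c3p0's ComboPooledDataSource (parameter names per the c3p0 docs linked above; the driver class, JDBC URL and credentials are placeholders):
import com.mchange.v2.c3p0.ComboPooledDataSource;
import javax.sql.DataSource;

public class C3p0Setup {
    public static DataSource build() throws Exception {
        ComboPooledDataSource ds = new ComboPooledDataSource();
        ds.setDriverClass("com.mysql.jdbc.Driver");        // placeholder driver class
        ds.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
        ds.setUser("myuser");
        ds.setPassword("mypassword");
        ds.setIdleConnectionTestPeriod(120); // test idle connections every 2 minutes
        ds.setPreferredTestQuery("SELECT 1");
        ds.setTestConnectionOnCheckin(true); // also test connections when they are returned
        ds.setMaxIdleTime(300);              // retire connections idle for more than 5 minutes
        return ds;
    }
}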

Quartz failure in notifyJobStoreJobComplete method

Scenario:
We have a scheduler which is using JDBC Job Store. Quartz version is 2.1.2.
The job which is being scheduling is also updating a database.
The database is same for both quartz and the job itself and is hosted in MySQL Server. Both application tables and quartz tables are stored in the same database.
The connection pool is different for the application and for Quartz. In the application we are using Spring for connection pooling, and Quartz is forced to use connection pooling via quartz.properties.
Here is the snippet of quartz.properties
org.quartz.dataSource.qzDS.driver = com.mysql.jdbc.Driver
org.quartz.dataSource.qzDS.URL = jdbc:mysql://localhost:3306/dbname?autoReconnect=true
org.quartz.dataSource.qzDS.user = dbuser
org.quartz.dataSource.qzDS.password =dbpassword
org.quartz.dataSource.qzDS.maxConnections = 30
org.quartz.datasource.qzDS.validationQuery = select 1
#org.quartz.datasource.qzDS.minEvictableIdleTimeMillis=21600000
#org.quartz.datasource.qzDS.timeBetweenEvictionRunsMillis=1800000
#org.quartz.datasource.qzDS.numTestsPerEviction=-1
#org.quartz.datasource.qzDS.testWhileIdle=true
org.quartz.datasource.qzDS.debugUnreturnedConnectionStackTraces=true
org.quartz.datasource.qzDS.unreturnedConnectionTimeout=120
org.quartz.datasource.qzDS.initialPoolSize=5
org.quartz.datasource.qzDS.minPoolSize=5
org.quartz.datasource.qzDS.maxPoolSize=30
org.quartz.datasource.qzDS.acquireIncrement=5
org.quartz.datasource.qzDS.maxIdleTime=120
org.quartz.datasource.qzDS.validateOnCheckout=true
Database is clustered with MASTER-MASTER replication on two servers and they are being used via virtual IP everywhere in the application and quartz.
The scheduler, i.e. Quartz, is also clustered on the same two machines where MySQL is clustered.
The problem:
One of the servers (till now we have only seen the problem on the backup server machine) occasionally throws a database connection error while calling the notifyJobStoreJobComplete method. This causes the job to stay in the BLOCKED state even though the job itself has completed successfully, because Quartz was unable to update its status.
Questions:
What can be the cause of the problem?
How to move the BLOCKED jobs into the WAITING state so that the jobs can at least run at their next scheduled time? Directly editing the QRTZ_SIMPLE_TRIGGERS table would not be a good solution, even if it works.
EDIT: To bump up the question.
The error during notifyJobStoreJobComplete is: org.quartz.impl.jdbcjobstore.JobStoreTX - Failed to override connection auto commit/transaction isolation.
[java] com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: The last packet successfully received from the server was 619,082,686 milliseconds ago. The last packet sent successfully to the server was 619,082,686 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
I think the main problem was the communication link failure from MySQL, which we solved by increasing 'wait_timeout' to 14 days; since our maintenance is scheduled every 15 days, we restart each of the MySQL servers in our DB cluster (we have Master-Master replication in place). With this approach we haven't had any communication link failure since. In fact, sometimes we don't restart the servers every 15 days and we still see no errors (touch wood). :)
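For reference, a sketch of that server-side change (the value is 14 days expressed in seconds; putting it in my.cnf as well keeps it across restarts):
SET GLOBAL wait_timeout = 1209600;   -- 14 days * 86400 seconds

# my.cnf
[mysqld]
wait_timeout = 1209600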
And as far as Quartz triggers being locked in the BLOCKED state, we updated Quartz to 2.1.4, which possibly has a fix for almost the same problem. After the Quartz update, we have seen triggers stuck in the BLOCKED state much less frequently.
We are still unable to find out how to get a trigger out of the BLOCKED state without directly modifying the Quartz tables. Whenever we face this problem, we manually remove the entry for the BLOCKED trigger from the qrtz_fired_triggers table and it solves the problem. I think the enterprise version of Quartz may offer this through some web UI.
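That manual cleanup is roughly the following (a sketch; standard Quartz 2.x table and column names, with placeholders for the name and group of the BLOCKED trigger):
DELETE FROM QRTZ_FIRED_TRIGGERS
 WHERE TRIGGER_NAME = 'myBlockedTrigger'
   AND TRIGGER_GROUP = 'myTriggerGroup';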