AWS Aurora Serverless - Communication Link Failure - mysql

I'm using MySQL Aurora Serverless cluster (with the Data API enabled) in my python code and I am getting a communications link failure exception. This usually occurs when the cluster has been dormant for some time.
But once the cluster is active, I get no errors; I just have to send 3-4 requests each time before it starts working.
Exception detail:
An error occurred (BadRequestException) when calling the ExecuteStatement operation: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
How can I solve this issue? I am using the standard boto3 library.

Here is the reply from AWS Premium Business Support.
Summary: It is an expected behavior
Detailed Answer:
I can see that you receive this error when your Aurora Serverless instance is inactive, and that you stop receiving it once your instance is active and accepting connections. Please note that this is expected behavior. In general, Aurora Serverless works differently than provisioned Aurora: in Aurora Serverless, while the cluster is "dormant" it has no compute resources assigned to it, and compute resources are assigned only when a DB connection is received. Because of this behavior, you have to "wake up" the cluster, and it may take a few minutes for the first connection to succeed, as you have seen.
To avoid that, you may consider increasing the timeout on the client side. Also, if you have enabled pause, you may consider disabling it [2]. After disabling pause, you can also raise the minimum Aurora capacity units (ACUs) to make sure that your cluster always has enough compute resources to serve new connections [3]. Please note that raising the minimum ACU might increase the cost of the service [4].
Also note that Aurora Serverless is only recommended for certain workloads [5]. If your workload is highly predictable and your application needs to access the DB on a regular basis, I would recommend using a provisioned Aurora cluster/instance to ensure high availability for your business.
[2] How Aurora Serverless Works - Automatic Pause and Resume for Aurora Serverless - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.how-it-works.html#aurora-serverless.how-it-works.pause-resume
[3] Setting the Capacity of an Aurora Serverless DB Cluster - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.setting-capacity.html
[4] Aurora Serverless Pricing - https://aws.amazon.com/rds/aurora/serverless/
[5] Using Amazon Aurora Serverless - Use Cases for Aurora Serverless - https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless.html#aurora-serverless.use-cases
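
The support team's suggestions can also be applied programmatically. The following is a minimal sketch (not part of the original reply) that disables auto-pause and raises the minimum capacity via boto3's modify_db_cluster; the cluster identifier and capacity values are placeholders, and raising MinCapacity increases cost:

import boto3

rds = boto3.client('rds')

# Sketch only: disable auto-pause and raise the minimum capacity so the cluster
# keeps compute resources assigned. 'my-serverless-cluster' and the capacity
# values are placeholders for your own cluster.
rds.modify_db_cluster(
    DBClusterIdentifier='my-serverless-cluster',
    ScalingConfiguration={
        'MinCapacity': 2,
        'MaxCapacity': 8,
        'AutoPause': False,
    },
    ApplyImmediately=True,
)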

If it is useful to someone, this is how I manage retries while Aurora Serverless wakes up.
The client returns a BadRequestException, so boto3 will not retry even if you change the retry config for the client, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html.
My first option was to use Waiters, but RDSData does not have any. I then tried to create a custom Waiter with an error matcher, but that only matches on the error code and ignores the message, and since a BadRequestException can also be raised by an error in a SQL statement I needed to check the message too. So I ended up with a waiter-like function:
import time
import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
rds_data = boto3.client('rds-data')
# DB_NAME, CLUSTER_ARN and SECRET_ARN are defined elsewhere (cluster-specific values)


def _wait_for_serverless():
    delay = 5
    max_attempts = 10
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            rds_data.execute_statement(
                database=DB_NAME,
                resourceArn=CLUSTER_ARN,
                secretArn=SECRET_ARN,
                sql='SELECT * FROM dummy'  # the Data API parameter is `sql`
            )
            return
        except ClientError as ce:
            error_code = ce.response.get('Error').get('Code')
            error_msg = ce.response.get('Error').get('Message')
            # Aurora Serverless is waking up
            if error_code == 'BadRequestException' and 'Communications link failure' in error_msg:
                logger.info('Sleeping ' + str(delay) + ' secs, waiting for RDS connection')
                time.sleep(delay)
            else:
                raise ce
    raise Exception('Waited for RDS Data but still getting error')
and I use it in this way:
def begin_rds_transaction():
    _wait_for_serverless()
    return rds_data.begin_transaction(
        database=DB_NAME,
        resourceArn=CLUSTER_ARN,
        secretArn=SECRET_ARN
    )

I also got this issue, and taking inspiration from the solution used by Arless and the conversation with Jimbo, came up with the following workaround.
I defined a decorator which retries the serverless RDS request, with configurable backoff, until the maximum number of attempts is reached.
import functools
import logging
import time

from sqlalchemy import exc

logger = logging.getLogger()


def retry_if_db_inactive(max_attempts, initial_interval, backoff_rate):
    """
    Retry the function if the serverless DB is still in the process of 'waking up'.
    The retry configuration follows the same concepts as AWS Step Functions retries.

    :param max_attempts: The maximum number of retry attempts
    :param initial_interval: The initial duration to wait (in seconds) when the first
                             'Communications link failure' error is encountered
    :param backoff_rate: The factor by which to multiply the previous interval duration
                         to get the next interval
    :return:
    """
    def decorate_retry_if_db_inactive(func):
        @functools.wraps(func)
        def wrapper_retry_if_inactive(*args, **kwargs):
            interval_secs = initial_interval
            attempt = 0
            while attempt < max_attempts:
                attempt += 1
                try:
                    return func(*args, **kwargs)
                except exc.StatementError as err:
                    if hasattr(err.orig, 'response'):
                        error_code = err.orig.response["Error"]['Code']
                        error_msg = err.orig.response["Error"]['Message']
                        # Aurora Serverless is waking up
                        if error_code == 'BadRequestException' and 'Communications link failure' in error_msg:
                            logger.info('Sleeping for ' + str(interval_secs) + ' secs, awaiting RDS connection')
                            time.sleep(interval_secs)
                            interval_secs = interval_secs * backoff_rate
                        else:
                            raise err
                    else:
                        raise err
            raise Exception('Waited for RDS Data but still getting error')
        return wrapper_retry_if_inactive
    return decorate_retry_if_db_inactive
which can then be used something like this:
@retry_if_db_inactive(max_attempts=4, initial_interval=10, backoff_rate=2)
def insert_alert_to_db(sqs_alert):
    with db_session_scope() as session:
        # your db code
        session.add(sqs_alert)
    return None
Please note I'm using SQLAlchemy, so the code would need tweaking to suit specific purposes, but hopefully it will be useful as a starter.

This may be a little late, but there is a way to deactivate the dormant (auto-pause) behavior of the database.
When creating the Cluster from the CDK, you can configure an attribute as follows:
new rds.ServerlessCluster(
  this,
  'id',
  {
    engine: rds.DatabaseClusterEngine.AURORA_MYSQL,
    defaultDatabaseName: 'name',
    vpc,
    scaling: {
      autoPause: Duration.millis(0) // Set to 0 to disable
    }
  }
)
The attribute is autoPause. The default value is 5 minutes (the Communications link failure message may appear after 5 minutes of not using the DB). The maximum value is 24 hours. However, you can set the value to 0, which disables the automatic shutdown. After this, the database will not go to sleep even if there are no connections.
When looking at the configuration from AWS (RDS -> Databases -> 'instance' -> Configuration -> Capacity Settings), you'll notice this attribute has no value if it is set to 0.
Finally, if you don't want the database to be ON all the time, set your own autoPause value so that it behaves as expected.
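
For anyone doing the same from the Python CDK, the equivalent configuration looks roughly like the sketch below (assuming CDK v2 imports; the construct id, database name and vpc are placeholders, and the snippet is meant to live inside a Stack):

from aws_cdk import Duration, aws_rds as rds

# Sketch: Aurora Serverless v1 cluster with auto-pause disabled (Python CDK v2).
# 'Cluster', 'name' and `vpc` are placeholders for your own values.
cluster = rds.ServerlessCluster(
    self, 'Cluster',
    engine=rds.DatabaseClusterEngine.AURORA_MYSQL,
    default_database_name='name',
    vpc=vpc,
    scaling=rds.ServerlessScalingOptions(
        auto_pause=Duration.seconds(0),  # 0 disables automatic pause
    ),
)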

Related

How can we run queries concurrently, using goroutines?

I am using gorm v1 (ORM) with Go version 1.14.
The DB connection is created at the start of my app, and that DB handle is passed throughout the app.
I have a complex and long-running piece of functionality.
Let's say I have 10 sets of queries to run and the order doesn't matter.
So, what I did was:
go queryset1(DB)
go queryset2(DB)
...
go queryset10(DB)
// here I have a wait, maybe via channel or WaitGroup.
Inside queryset1:
func queryset1(db *gorm.DB /*, wg or errChannel */) {
    db.Count() // basic count query
    wg.Done()  // or: errChannel <- nil
}
Now, the problem is that I encounter MySQL error 1040: "too many connections".
Why is this happening? Does every goroutine create a new connection?
If so, is there a way to check this and see the "live connections" in MySQL
(not the SHOW STATUS variables like Connections)?
How can I concurrently query the DB?
Edit:
This guy has the same problem
The error is not directly related to go-gorm, but to the underlying MySQL configuration and your initial connection configuration. In your code, you can manage the following parameters during your initial connection to the database.
maximum open connections (SetMaxOpenConns function)
maximum idle connections (SetMaxIdleConns function)
maximum timeout for idle connections (SetConnMaxLifetime function)
For more details, check the official docs or this article on how to get the maximum performance from your connection configuration.
If you want to prevent a situation where each goroutine uses a separate connection, you can do something like this:
// restrict goroutines to be executed 5 at a time
connCh := make(chan bool, 5)
go queryset1(DB, &wg, connCh)
go queryset2(DB, &wg, connCh)
...
go queryset10(DB, &wg, connCh)
wg.Wait()
close(connCh)
Inside your queryset functions:
func queryset1(db *gorm.DB, wg *sync.WaitGroup, connCh chan bool) {
    connCh <- true
    db.Count() // basic count query
    <-connCh
    wg.Done()
}
The connCh channel will allow the first 5 goroutines to write to it and will block the rest of the goroutines until one of the first 5 takes a value back out of connCh. This prevents the situation where each goroutine starts its own connection. Some of the connections should be reused, but that also depends on the initial connection configuration.

Should pika's channel.basic_publish timeout sooner if there is no network connection?

I'm in the process of upgrading to pika 1.1.0, and performed some sanity testing.
I have:
1. placed a breakpoint at the call below
2. disconnected the network
3. stepped over the call
...and no exception was thrown. Is this expected?
channel.basic_publish(
    exchange=EXCHANGE,
    routing_key=ROUTING_KEY,
    body=message,
    properties=pika.BasicProperties(
        delivery_mode=MQ_TRANSIENT_DELIVERY_MODE,
        headers=headers,
    )
)
The connection is created with:
connection = pika.BlockingConnection(pika.ConnectionParameters(
    host=rabbit_config.host,
    credentials=credentials,
    port=rabbit_config.port,
    connection_attempts=1,
    blocked_connection_timeout=10,
    retry_delay=5,
    socket_timeout=20,
    heartbeat=30,
))
Update:
If I call channel.confirm_delivery() before this, I successfully get an AMQPError.
However, this doesn't happen for 60 seconds (which doesn't match my ConnectionParameters). How can I have it notice the connection loss more quickly?
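
Not an authoritative answer, but one avenue worth testing is shortening the heartbeat so pika has a chance to notice the dead socket sooner, combined with confirm_delivery(); whether this interrupts a blocked basic_publish any earlier depends on pika's heartbeat handling, so treat the sketch below as illustrative only (host, exchange and routing key are placeholders):

import pika

# Illustrative sketch only: a shorter heartbeat so a broken connection may be
# noticed sooner; confirm_delivery() makes basic_publish wait for a broker ack
# and raise if the message cannot be confirmed.
params = pika.ConnectionParameters(
    host='rabbit.example.com',
    heartbeat=10,
    blocked_connection_timeout=5,
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.confirm_delivery()

channel.basic_publish(
    exchange='my-exchange',
    routing_key='my-key',
    body=b'hello',
)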

NodeJS - Process out of memory for 100+ concurrent connections

I am working on an IoT application where the clients send bio-potential information every 2 seconds to the server. The client sends a CSV file containing 400 rows of data every 2 seconds. I have a Socket.IO websocket server running on my server which captures this information from each client. Once this information is captured, the server must push these 400 records into a mysql database every 2 seconds for each client. While this worked perfectly well as long as the number of clients were small, as the number of clients grew the server started throwing the "Process out of memory exception."
Following is the exception received :
<--- Last few GCs --->
98522 ms: Mark-sweep 1397.1 (1457.9) -> 1397.1 (1457.9) MB, 1522.7 / 0 ms [allocation failure] [GC in old space requested].
100059 ms: Mark-sweep 1397.1 (1457.9) -> 1397.0 (1457.9) MB, 1536.9 / 0 ms [allocation failure] [GC in old space requested].
101579 ms: Mark-sweep 1397.0 (1457.9) -> 1397.0 (1457.9) MB, 1519.9 / 0 ms [last resort gc].
103097 ms: Mark-sweep 1397.0 (1457.9) -> 1397.0 (1457.9) MB, 1517.9 / 0 ms [last resort gc].
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x35cc9bbb4629 <JS Object>
2: format [/xxxx/node_modules/mysql/node_modules/sqlstring/lib/SqlString.js:~73] [pc=0x6991adfdf6f] (this=0x349863632099 <an Object with map 0x209c9c99fbd1>,sql=0x2dca2e10a4c9 <String[84]: Insert into rent_66 (sample_id,sample_time, data_1,data_2,data_3) values ? >,values=0x356da3596b9 <JS Array[1]>,stringifyObjects=0x35cc9bb04251 <false>,timeZone=0x303eff...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
Aborted
Following is the code for my server:
var app = require('express')();
var http = require('http').Server(app);
var io = require('socket.io')(http);
var mysql = require('mysql');

var conn = mysql.createConnection({
    host: '<host>',
    user: '<user>',
    password: '<password>',
    database: '<db>',
    debug: false,
});
conn.connect();

io.on('connection', function (socket){
    console.log('connection');
    var finalArray = [];
    socket.on('data_to_save', function (from, msg) {
        var str_arr = msg.split("\n");
        var id = str_arr[1];
        var timestamp = str_arr[0];
        var data = str_arr.splice(2);
        finalArray = [];
        var dataPoint = [];
        data.forEach(function(value){
            dataPoint = value.split(",");
            if(dataPoint[0] != ''){
                finalArray.push([dataPoint[0], 1, dataPoint[1], dataPoint[2], dataPoint[3]]);
                finalArray.push([dataPoint[0], 1, dataPoint[4], dataPoint[5], dataPoint[5]]);
            }
        });
        var sql = "Insert into rent_" + id + " (sample_id, sample_time, channel_1, channel_2, channel_3) values ? ";
        var query = conn.query(sql, [finalArray], function(err, result){
            if(err)
                console.log(err);
            else
                console.log(result);
        });
        conn.commit();
        console.log('MSG from ' + str_arr[1] + ' ' + str_arr[0]);
    });
});

http.listen(9000, function () {
    console.log('listening on *:9000');
});
I was able to get the server to handle 100 concurrent connections, after which I started receiving process out of memory exceptions. Before the database inserts were introduced, the server would simply store the CSV as a file on disk, and with that setup it was able to handle 1200+ concurrent connections.
Based on the information available on the internet, it looks like the database insert query (which is asynchronous) holds the 400-row array in memory until the insert goes through. As a result, as the number of clients grows, the memory footprint of the server increases, eventually running out of memory.
I did go through many suggestions made on the internet regarding --max_old_space_size, but I am not sure that this is a long-term solution. Also, I am not sure on what basis I should decide the value to set here.
I have also gone through suggestions which talk about the async utility module. However, inserting data serially may introduce a huge delay between the time a client sends data and the time the server saves it to the database.
I have gone in circles around this problem many times. Is there a way the server can handle information coming from 1000+ concurrent clients and save that data into a MySQL database with minimal latency? I have hit a roadblock here, and any help in this direction is highly appreciated.
I'll summarize my comments since they sent you on the correct path to address your issue.
First, you have to establish whether the issue is caused by your database or not. The simplest way to do that is to comment out the database portion and see how high you can scale. If you get into the thousands without a memory or CPU issue, then your focus can shift to figuring out why adding the database code into the mix causes the problem.
Assuming the issue is caused by your database, you need to start understanding how it handles things when there are lots of active database requests. Often, the first thing to reach for with a busy database is connection pooling. This gives you three main things that can help with scale:
1. It gives you fast reuse of previously opened connections so you don't have every single operation creating its own connection and then closing it.
2. It lets you specify the maximum number of simultaneous database connections you want in the pool (controlling the maximum load you throw at the database and probably also limiting the maximum amount of memory it will use). Connections beyond that limit will be queued, which is usually what you want in high-load situations so you don't overwhelm the resources you have.
3. It makes it easier to see if you have a connection leak problem: rather than just leaking connections until you run out of some resource, the pool will quickly be empty in testing and your server will not be able to process any more transactions (so you are much more likely to see the problem in testing).
Then, you probably also want to look at the transaction times for your database connections to see how fast they can handle any given transaction. You know how many transactions/sec you are trying to process so you need to see if your database and the way it's configured and resourced (memory, CPU, speed of disk, etc...) is capable of keeping up with the load you want to throw at it.
You should increase the default memory (512 MB) by using the command below:
node --max-old-space-size=1024 index.js
This increases the limit to 1 GB. You can pass a larger value to increase it further.

Scala / Slick, "Timeout after 20000ms of waiting for a connection" error

The block of code below has been throwing an error.
Timeout after 20000ms of waiting for a connection.","stackTrace":[{"file":"BaseHikariPool.java","line":228,"className":"com.zaxxer.hikari.pool.BaseHikariPool","method":"getConnection"
Also, my database accesses seem too slow, with each element of xs.map() taking about 1 second. Below, getFutureItem() calls db.run().
xs.map { x =>
  val item: Future[List[Sometype], List(Tables.myRow)] = getFutureItem(x)
  Await.valueAfter(item, 100.seconds) match {
    case Some(i) => i
    case None => println("Timeout getting items after 100 seconds")
  }
}
Slick logs this with each iteration of an "x" value:
[akka.actor.default-dispatcher-3] [akka://user/IO-HTTP/listener-0/24] Connection was PeerClosed, awaiting TcpConnection termination...
[akka.actor.default-dispatcher-3] [akka://user/IO-HTTP/listener-0/24] TcpConnection terminated, stopping
[akka.actor.default-dispatcher-3] [akka://system/IO-TCP/selectors/$a/0] New connection accepted
[akka.actor.default-dispatcher-7] [akka://user/IO-HTTP/listener-0/25] Dispatching POST request to http://localhost:8080/progress to handler Actor[akka://system/IO-TCP/selectors/$a/26#-934408297]
My configuration:
"com.zaxxer" % "HikariCP" % "2.3.2"
default_db {
  url = ...
  user = ...
  password = ...
  queueSize = -1
  numThreads = 16
  connectionPool = HikariCP
  connectionTimeout = 20000
  maxConnections = 40
}
Is there anything obvious that I'm doing wrong that is causing these database accesses to be so slow and throw this error? I can provide more information if needed.
EDIT: I have received one recommendation that the issue could be a classloader error, and that I could resolve it by deploying the project as a single .jar, rather than running it with sbt.
EDIT2: After further inspection, it appears that many connections were being left open, which eventually led to no connections being available. This can likely be resolved by calling db.close() to close the connection at the appropriate time.
EDIT3: Solved. The connections made by slick exceeded the max connections allowed by my mysql config.
OP wrote:
EDIT2: After further inspection, it appears that many connections were being left open, which eventually led to no connections being available. This can likely be resolved by calling db.close() to close the connection at the appropriate time.
EDIT3: Solved. The connections made by slick exceeded the max connections allowed by my mysql config.

python: sqlalchemy - how do I ensure connection not stale using new event system

I am using the sqlalchemy package in python. I have an operation that takes some time to execute after I perform an autoload on an existing table. This causes the following error when I attempt to use the connection:
sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away')
I have a simple utility function that performs an insert many:
from sqlalchemy import create_engine, MetaData, Table

def insert_data(data_2_insert, table_name):
    engine = create_engine('mysql://blah:blah123@localhost/dbname')
    # MetaData is a Table catalog.
    metadata = MetaData()
    mytable = Table(table_name, metadata, autoload=True, autoload_with=engine)
    for c in mytable.c:
        print(c)
    column_names = tuple(c.name for c in mytable.c)
    final_data = [dict(zip(column_names, x)) for x in data_2_insert]
    ins = mytable.insert()
    conn = engine.connect()
    conn.execute(ins, final_data)
    conn.close()
It is the following line that takes a long time to execute, since 'data_2_insert' has 677,161 rows.
final_data = [dict(zip(column_names, x)) for x in data_2_insert]
I came across this question which refers to a similar problem. However I am not sure how to implement the connection management suggested by the accepted answer because robots.jpg pointed this out in a comment:
Note for SQLAlchemy 0.7 - PoolListener is deprecated, but the same solution can be implemented using the new event system.
If someone can please show me a couple of pointers on how I could go about integrating the suggestions into the way I use sqlalchemy I would be very appreciative. Thank you.
I think you are looking for something like this:
from sqlalchemy import exc, event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def check_connection(dbapi_con, con_record, con_proxy):
    '''Listener for Pool checkout events that pings every connection before using it.
    Implements the pessimistic disconnect handling strategy. See also:
    http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic'''
    cursor = dbapi_con.cursor()
    try:
        cursor.execute("SELECT 1")  # could also be dbapi_con.ping(),
                                    # not sure which is better
    except exc.OperationalError as ex:
        if ex.args[0] in (2006,   # MySQL server has gone away
                          2013,   # Lost connection to MySQL server during query
                          2055):  # Lost connection to MySQL server at '%s', system error: %d
            # caught by pool, which will retry with a new connection
            raise exc.DisconnectionError()
        else:
            raise
If you wish to trigger this strategy conditionally, you should avoid using the decorator here and instead register the listener with the listen() function:
# somewhere during app initialization
if config.check_connection_on_checkout:
    event.listen(Pool, "checkout", check_connection)
More info:
Connection Pool Events
Events API
There is a better way to handle it right now - pool_recycle
engine = create_engine('mysql://...', pool_recycle=3600)
MySQL has a default timeout of 8 hours.
This leads to the connection being closed by MySQL while the engine above it (such as SQLAlchemy) does not know about it.
There are 2 ways to solve it:
Optimistic - using pool_recycle
Pessimistic - using pool_pre_ping=True
I prefer to go with pool_recycle as it doesn't emit a SELECT 1 before each query, causing less stress on the DB.
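
For completeness, the pessimistic variant is just a flag on create_engine (the connection URL below is a placeholder):

from sqlalchemy import create_engine

# Pessimistic approach: ping each connection when it is checked out of the
# pool and transparently replace it if it has gone stale.
engine = create_engine('mysql://user:password@localhost/dbname', pool_pre_ping=True)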