I have implemented InnoDB cluster using MySql router(version - 2.1.4) for HA
This is my mysqlrouter.conf file
[DEFAULT]
user=mysqlrouter
logging_folder=
runtime_folder=/tmp/mysqlrouter/run
data_folder=/tmp/mysqlrouter/data
keyring_path=/tmp/mysqlrouter/data/keyring
master_key_path=/tmp/mysqlrouter/mysqlrouter.key
[logger]
level = DEBUG
[metadata_cache:magentoCluster]
router_id=49
bootstrap_server_addresses=mysql://ic-1:3306,mysql://ic-2:3306,mysql://ic-3:3306
user=mysql_router49_sqxivre03wzz
metadata_cluster=magentoCluster
ttl=1
[routing:magentoCluster_default_rw]
bind_address=0.0.0.0
bind_port=6446
destinations=metadata-cache://magentoCluster/default?role=PRIMARY
mode=read-write
protocol=classic
[routing:magentoCluster_default_ro]
bind_address=0.0.0.0
bind_port=6447
destinations=metadata-cache://magentoCluster/default?role=ALL
mode=read-only
protocol=classic
[routing:magentoCluster_default_x_rw]
bind_address=0.0.0.0
bind_port=64460
destinations=metadata-cache://magentoCluster/default?role=PRIMARY
mode=read-write
protocol=x
[routing:magentoCluster_default_x_ro]
bind_address=0.0.0.0
bind_port=64470
destinations=metadata-cache://magentoCluster/default?role=ALL
mode=read-only
protocol=x
MySql Router split the read requests to slave nodes, if I down slave 1 then router takes some seconds to know the slave 1 is down. So the requests are sent to the down slave node and the request fails. Any Suggestion how to handle this failure?
The client should always check for errors. This is a necessity for any system, because network errors, power outages, etc, can occur in any configuration.
When the client discovers a connection failure (failure to connect / dropped connection), it should start over by reconnecting and replaying the transaction it is in the middle of.
For transaction integrity, the client must be involved in the process; recovery cannot be provide by any proxy.
Related
If we don't close the MySQL connection at the end of the handler function in lambda-- will the MySQL connection close automatically when lambda dies and re-connect at the cold-start?
The connections won't be closed immediately but eventually they will. By default, the connection timeouts are 8 hour on MySQL and maximum connections are also capped at 66.
show variables like "wait_timeout"; -- 28800
show variables like "max_connections"; -- 66
When you create a connection to MySQL server, it would create a Thread on the MySQL server to serve this connection.
show status where variable_name = 'threads_connected';
select * from information_schema.processlist;
After a Lambda executes a request and sends a response, the Lambda execution environment is not removed immediately and the same one may be used to serve other requests. This is your Warm/Hot Lambda and in this case an active MySQL connection would be really good for your function execution and this is possible only when you did not close the connection in the previous invocation. Eventually, when there are no more requests, this Lambda execution environment can be shutdown and the resources are returned to the pool of AWS compute resources. When the Lambda execution environment shuts down, the TCP connection to the MySQL server from the Lambda will also terminate. Now the MySQL server can remove the thread associated with the Lambda and in essence would reduce the pool of active connections on the server. This also takes a bit of time. So if you are getting a lot of requests concurrently and if the maximum connections are already active, then the request would start failing.
I did some test to see how long does it really take to reclaim the connections and here is the snapshot. The X axis is in minutes and the Y axis is on the scale of 0-70 where each line parallel to X-Axis is 10 units away from each other.
It roughly takes 10-15 minutes to reclaim the connections. But again it depends on the Lambda usage pattern as well.
So should you close the connection on every invocation? Well, it depends!
Take a look at Lambda Runtime extensions and see if you can use the shutdown hook to close connection. If you can, then it would mean while the Lambda execution environment was serving multiple requests, you used a cached connection and just before your Lambda execution environment is taken away from you, you closed the connection.
Lambda RDS Proxy is also an alternative as mentioned above, but it is not free. Before you take the RDS Proxy route, do consider using another Serverless solution like AWS Fargate. In this case probably you would use a connection pool just like any long running server side application.
No, they will not be closed automatically unless you are doing something with your mysql client that implicit closes the connection when it goes out of scope.
The connection will stay open until it times out. There has been many people who reported problems in the past with poorly written Lambdas creating tons of open sessions/connections to relational databases because the connections were not properly closed and they had to wait to be timed out.
One feature that came out a year or so ago was RDS Proxies which are sort of an intermediary between clients and the MySQL server that implements connection pooling. This solves the problem with Lambdas not being able to effectively use connection pooling since RDS Proxies service can do that for serverless clients.
we have mysql-server(5.5.47)that hosted on physical server. It listen external internet interface(with restrict user access), mysql server intensively used from different places(we use different libraries to communicate with mysql). But sometimes whole mysql server(or network) stuck and stop accept connection, and a clients failed with etimedout(connect)/timeout(recv), even direct connection from server to mysql with mysql cli not working(stuck without any response — seems to be try to establish connections).
First thought was that it is related to tcp backlog, so mysql backlog was increased — but this not help at all.
Issue not repeatable, so last time when this issue happened we sniff traffic, and what we get:
http://grab.by/STwq — screenshot
*.*.27.65 — it is client
*.*.20.80 — it is mysql server
From session we can assume that tcp connection established, but server retransmit SYN/ACK to client(from dump we see that server receive ACK, why retransmit ?), but in normal case mysql must generate init packet and send to client, after connection was established.
It is only screen from 1 session, but all other sessions mostly same, SYN -> SYN/ACK -> ACK -> and server retransmit SYN/ACK up to retries_count.
After restart mysql all get normal immediately after restart. So not sure it is related to network or mysql.
Any thoughts would be appropriate.
Thank you!
i am developing a distributed database middleware which intends to act as a proxy of MySQLs. When it comes to transactions across multiple MySQLs, i find it difficult to let multiple MySQLs commit or rollback as a whole. Here comes the case:
Say there are 2 mysql instances which are proxied by my database middleware, on the application side, when i want to perform a "prepare-commit" action on both mysql instances, Firstly, i send the "prepare" request to the middleware, and the middleware forwards the request to the 2 mysql instances, then i execute some sql through the middleware, finally, when i send the "commit" request to the middleware, the middleware will forwards the request to the 2 mysql instances, here is what confuses me:
if the "commit" request sent to the first mysql instance is successfully executed, while the "commit" request sent to the second instance somehow failed, as i know, if a transaction has been commited, it cannot be rollback, but this has caused the 2 mysql instances to be in an inconsistent state.
i am wondering how to deal with this problem, any help will be appreciated.
One is committed and one failed, coordinator should log the transaction changes, and return success to user. Once failed instance recover, retrieve log from the coordinator and make sure the consistency. Like the binlog inside of mysql as coordinator, your middleware should take the same responsibility.
MySQL external XA is also a example of distributed transaction which coordinator is client.
Spanner choose 2PC+Paxos to reduce the possibility of one node failure.
I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS.
I am trying to configure the JDBC connection pool to in order to minimize downtime in case of database failover.
My current configuration isn't working correctly during a Multi-AZ failover, as the standby database instance appears to be available in a couple of minutes (according to the AWS console) while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.
The connection pool is configured like this:
asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool
If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon the database is available. In this setup I get less than 1 minute of downtime.
If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:
Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 940,715 milliseconds ago. The last packet sent successfully to the server was 935,598 milliseconds ago.
It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.
Downtime in the Multi-AZ case is always between 15-16 minutes, so it looks like a timeout of some sort but I was unable to change it.
Things I have tried without success:
connection leak timeout/reclaim
statement leak timeout/reclaim
statement timeout
using a different validation method
using MysqlDataSource instead of MysqlConnectionPoolDataSource
How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced?
Or how can I let GlassFish detect a database failover?
As I commented before, it is because the sockets that are open and connected to the database don't realize the connection has been lost, so they stayed connected until the OS socket timeout is triggered, which I read might be usually in about 30 minutes.
To solve the issue you need to override the socket Timeout in your JDBC Connection String or in the JDNI COnnection Configuration/Properties to define the socketTimeout param to a smaller time.
Keep in mind that any connection longer than the value defined will be killed, even if it is being used (I haven't been able to confirm this, is what I read).
The other two parameters I mention in my comment are connectTimeout and autoReconnect.
Here's my JDBC Connection String:
jdbc:(...)&connectTimeout=15000&socketTimeout=60000&autoReconnect=true
I also disabled Java's DNS cache by doing
java.security.Security.setProperty("networkaddress.cache.ttl" , "0");
java.security.Security.setProperty("networkaddress.cache.negative.ttl" , "0");
I do this because Java doesn't honor the TTL's, and when the failover takes place, the DNS is the same but the IP changes.
Since you are using an Application Server, the parameters to disable DNS cache must be passed to the JVM when starting the glassfish with -Dnet and not the application itself.
I have a live Couchbase cluster on two Amazon EC2 instances (version 1.8.0) and about 5 application servers each running PHP with moxi clients on them. Once in a while, Moxi will return a SERVER_ERROR when attempting to access data. This happens about once every few minutes on average. The cluster processes about 500 operations per second.
After inspecting the moxi logs (with -vvv enabled), I notice the following at around the time I get a SERVER_ERROR:
2013-07-16 03:07:22: (cproxy.c.2680) downstream_timeout
2013-07-16 03:07:22: (cproxy.c.1911) 56: could not forward upstream to downstream
2013-07-16 03:07:22: (cproxy.c.2004) 56: upstream_error: SERVER_ERROR proxy downstream timeout^M
I tried increasing the downstream timeout in the moxi configs from 5000 to 25000, but that doesn't help at all. The errors still happen just as frequently.
Can someone suggest any ideas for me to discover the cause of the problem? Or if there's some likely culprit?
SERVER_ERROR proxy downstream timeout
In this error response, moxi reached a timeout while waiting for a
downstream server to respond to a request. That is, moxi did not see
any explicit errors such as a connection going down, but the response
is just taking too long. The downstream connection will be also closed
by moxi rather than putting the downstream connection back into a
connection pool. The default downstream_timeout configuration is 5000
(milliseconds).
Pretty straight forward error, but it can be caused by a few possible things.
try getting the output of "stats proxy" from moxi:
echo stats proxy | nc HOST 11211
obviously you have already figured out that you are concerned with these settings:
STAT 11211:default:pstd_stats:tot_downstream_timeout xxxx
STAT 11211:default:pstd_stats:tot_wait_queue_timeout nnnnn
your downstream_timeout as you've said should appear as 5000
but also check out:
STAT 11211:default:pstd_stats:tot_downstream_conn_queue_timeout 0
from URL:
http://www.couchbase.com/docs/moxi-manual-1.8/moxi-dataflow.html
pretty much a perfect walk through of the way moxi operates.
To understand some of the configurable command-line flags in moxi
(concurrency, downstream_max, downstream_conn_max, downstream_timeout,
wait_queue_timeout, etc), it can be helpful to follow a request
through moxi...
The normal flow of data for moxi is as follows:
A client connects
A client creates a connection (an upstream conn) to moxi.
moxi's -c command-line parameter ultimately controls the limits on
the maximum number of connections.
In this -c parameter, moxi inherits the same behavior as memcached,
and will stop accept()'ing client connections until
existing connections are closed. When the count of existing
connections drops below the -c defined level, moxi will accept() more
client connections.
The client makes a request, which goes on the wait queue
Next, the client makes a request — such as simple single-key
command (like set, add, append, or a single-key get).
At this point, moxi places the upstream conn onto the tail of a wait
queue. moxi's wait_queue_timeout parameter controls how long an
upstream conn should stay on the wait queue before moxi times it out
and responds to the client with a SERVER_ERROR response.
The concurrency parameter
Next, there's a configurable max limit to how many upstream conn
requests moxi will process concurrently off the head of the wait
queue. This configurable limit is called concurrency. (This formerly
used to be known, perhaps confusingly, as downstream_max. For
backwards compatibility, concurrency and downstream_max configuration
flags are treated as synonyms.)
The concurrency configuration is per-thread and per-bucket. That
is, the moxi process-level concurrency is actually concurrency X
num-worker-threads X num-buckets.
The default concurrency configuration value is 1024. This means moxi
will concurrently process 1024 upstream connection requests from
the head of the wait queue. (There are more queues in moxi, however,
before moxi actually forwards a request. This is discussed in later
sections.)
Taking the concurrency value of 1024 as an example, if you have 4
worker threads (the default, controlled by moxi's -t parameter) and 1
bucket (what most folks start out with, such as the "default" bucket),
you'll have a limit of 1024 x 4 x 1 or 4096 concurrently processed
client requests in that single moxi process.
The rationale behind the concurrency increase to 1024 for moxi's
configuration (it used to be much lower) is due to the evolving design
of moxi. Originally, moxi only had the wait queue as its only internal
queue. As more, later-stage queues were added during moxi's history,
we found that getting requests off the wait queue sooner and onto the
later stage queues was a better approach. We'll discuss these
later-stage queues below.
Next, let's discuss how client requests are matched to downstream connections.
Key hashing
The concurrently processed client requests (taken from the head
of the wait queue) now need to be matched up with downstream connections
to the Couchbase server. If the client's request comes with a key
(like a SET, DELETE, ADD, INCR, single-key GET), the request's key is
hashed to find the right downstream server "host:port:bucket" info.
For example, something like — "memcache1:11211:default". If the
client's request was a broadcast-style command (like FLUSH_ALL, or a
multi-key GET), moxi knows the downstream connections that it needs to
acquire.
The downstream conn pool
Next, there's a lookup using those host:port:bucket identifiers into
a downstream conn pool in order to acquire or reserve the
appropriate downstream conns. There's a downstream conn pool per
thread. Each downstream conn pool is just a hashmap keyed by
host:port:bucket with hash values of a linked-list of available
downstream conns. The max length of any downstream conn linked list is
controlled by moxi's downstream_conn_max configuration parameter.
The downstream_conn_max parameter
By default the downstream_conn_max value is 4. A value of 0 means no limit.
So, if you've set downstream_conn_max of 4, have 4 worker threads,
and have 1 bucket, you should see moxi create a maximum of 4
X 4 X 1 or 16 connections to any Couchbase server.
Connecting to a downstream server
If there isn't a downstream conn available, and the
downstream_conn_max wasn't reached, moxi creates a downstream conn as
needed by doing a connect() and SASL auth as needed.
The connect_timeout and auth_timeout parameters
The connect() and SASL auth have their own configurable timeout
parameters, called connect_timeout and auth_timeout, and these
are in milliseconds. The default value for connect_timeout is 400
milliseconds, and the auth_timeout default is 100 milliseconds.
The downstream conn queue
If downstream_conn_max is reached, then the request must wait until a
downstream conn becomes available; the request therefore is
placed on a per-thread, per-host:port:bucket queue, which is called a
downstream conn queue. As downstream conns are released back into the
downstream conn pool, they will be assigned to any requests that are
waiting on the downstream conn queue.
The downstream_conn_queue_timeout parameter
There is another configurable timeout, downstream_conn_queue_timeout,
that defines how long a request should
stay on the downstream conn queue in milliseconds before timing out.
By default, the downstream_conn_queue_timeout is 200 milliseconds. A
value of 0 indicates no timeout.
A downstream connection is reserved
Finally, at this point, downstream conn's are matched up for the
client's request. If you've configured moxi to track timing histogram
statistics, moxi will now get the official start time of the request.
moxi now starts asynchronously sending request message bytes to the
downstream conn and asynchronously awaits responses.
To turn on timing histogram statistics, use the "time_stats=1"
configuration flag. By default, time_stats is 0 or off.
The downstream_timeout parameter
Next, if you've configured a downstream_timeout, moxi starts a timer
for the request where moxi can limit the time it will spend
processing a request at this point. If the timer fires, moxi will
return a "SERVER_ERROR proxy downstream timeout" back to the client.
The downstream_timeout default value is 5000 milliseconds. If moxi sees
this time elapse, it will close any downstream connections that
were assigned to the request. Due to this simple behavior of closing
downstream connections on timeout, having a very short
downstream_timeout is not recommended. This will help avoid repetitive
connection creation, timeout, closing and reconnecting. On an
overloaded cluster, you may want to increase downstream_timeout so
that moxi does not constantly attempt to time out downstream
connections on an already overloaded cluster, or by creating even more
new connections when servers are already trying to process requests on
old, closed connections. If you see your servers greatly spiking, you
should consider making this adjustment.
Responses are received
When all responses are received from the downstream servers for a request (or the
downstream conn had an error), moxi asynchronously
sends those responses to the client's upstream conn. If you've
configured moxi to track timing histogram statistics, moxi now tracks
the official end time of the request. The downstream conn is now
released back to the per-thread downstream conn pool, and another
waiting client request (if any) is taken off the downstream conn queue
and assigned to use that downstream conn.
Backoff/Blacklisting
At step 6, there's a case where a connect() attempt might fail. Moxi
can be configured to count up the number of connect() failures for a
downstream server, and will also track the time of the last failing
connect() attempt.
With the connect() failure counting, moxi can be configured to
blacklist a server if too many connect() failures are seen, which is
defined by the connect_max_errors configuration parameter. When more
than connect_max_errors number of connect() failures are seen, moxi
can be configured to temporarily stop making connect() attempts to
that server (or backoff) for a configured amount of time. The backoff
time is defined via the connect_retry_interval configuration, in
milliseconds.
The default for connect_max_errors is 5 and the connect_retry_interval
is 30000 millisecods, that is, 30 seconds.
If you use connect_max_errors parameter, it should be set greater than
the downstream_conn_max configuration parameter.