Couchbase/Membase: Moxi proxy downstream timeout SERVER_ERROR - couchbase

I have a live Couchbase cluster on two Amazon EC2 instances (version 1.8.0) and about 5 application servers each running PHP with moxi clients on them. Once in a while, Moxi will return a SERVER_ERROR when attempting to access data. This happens about once every few minutes on average. The cluster processes about 500 operations per second.
After inspecting the moxi logs (with -vvv enabled), I notice the following at around the time I get a SERVER_ERROR:
2013-07-16 03:07:22: (cproxy.c.2680) downstream_timeout
2013-07-16 03:07:22: (cproxy.c.1911) 56: could not forward upstream to downstream
2013-07-16 03:07:22: (cproxy.c.2004) 56: upstream_error: SERVER_ERROR proxy downstream timeout
I tried increasing the downstream timeout in the moxi configs from 5000 to 25000, but that doesn't help at all. The errors still happen just as frequently.
Can someone suggest any ideas for me to discover the cause of the problem? Or if there's some likely culprit?

SERVER_ERROR proxy downstream timeout
In this error response, moxi reached a timeout while waiting for a
downstream server to respond to a request. That is, moxi did not see
any explicit errors such as a connection going down, but the response
is just taking too long. The downstream connection will also be closed
by moxi rather than putting the downstream connection back into a
connection pool. The default downstream_timeout configuration is 5000
(milliseconds).
It's a pretty straightforward error, but it can have a few possible causes.
Try getting the output of "stats proxy" from moxi:
echo stats proxy | nc HOST 11211
You have obviously already figured out that these are the counters you are concerned with:
STAT 11211:default:pstd_stats:tot_downstream_timeout xxxx
STAT 11211:default:pstd_stats:tot_wait_queue_timeout nnnnn
Your downstream_timeout, as you've said, should appear as 5000,
but also check out:
STAT 11211:default:pstd_stats:tot_downstream_conn_queue_timeout 0
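If you want to watch these counters over time rather than eyeball the nc output, here is a minimal Python sketch. It assumes moxi answers "stats proxy" in the usual memcached text format (STAT name value lines, terminated by END); the helper names are mine, not part of moxi.

```python
import socket


def parse_stats(text):
    """Turn 'STAT <name> <value>' lines into a dict; other lines are ignored."""
    stats = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_proxy_stats(host, port=11211, timeout=5.0):
    """Send 'stats proxy' and read until the END terminator (assumed format)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats proxy\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:          # peer closed the connection early
                break
            data += chunk
    return parse_stats(data.decode("ascii", "replace"))


def timeout_counters(stats):
    """Keep only the tot_*_timeout counters, converted to int."""
    result = {}
    for name, value in stats.items():
        short = name.rsplit(":", 1)[-1]
        if short.startswith("tot_") and short.endswith("_timeout") and value.isdigit():
            result[name] = int(value)
    return result
```

Calling timeout_counters(fetch_proxy_stats("HOST")) every minute or so and diffing the results should show which of the three timeouts is actually climbing when the SERVER_ERROR appears.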
From the URL:
http://www.couchbase.com/docs/moxi-manual-1.8/moxi-dataflow.html
which is pretty much a perfect walk-through of the way moxi operates:
To understand some of the configurable command-line flags in moxi
(concurrency, downstream_max, downstream_conn_max, downstream_timeout,
wait_queue_timeout, etc), it can be helpful to follow a request
through moxi...
The normal flow of data for moxi is as follows:
A client connects
A client creates a connection (an upstream conn) to moxi.
moxi's -c command-line parameter ultimately controls the limits on
the maximum number of connections.
In this -c parameter, moxi inherits the same behavior as memcached,
and will stop accept()'ing client connections until
existing connections are closed. When the count of existing
connections drops below the -c defined level, moxi will accept() more
client connections.
The client makes a request, which goes on the wait queue
Next, the client makes a request, such as a simple single-key
command (like set, add, append, or a single-key get).
At this point, moxi places the upstream conn onto the tail of a wait
queue. moxi's wait_queue_timeout parameter controls how long an
upstream conn should stay on the wait queue before moxi times it out
and responds to the client with a SERVER_ERROR response.
The concurrency parameter
Next, there's a configurable max limit to how many upstream conn
requests moxi will process concurrently off the head of the wait
queue. This configurable limit is called concurrency. (This was
formerly known, perhaps confusingly, as downstream_max. For
backwards compatibility, concurrency and downstream_max configuration
flags are treated as synonyms.)
The concurrency configuration is per-thread and per-bucket. That
is, the moxi process-level concurrency is actually concurrency X
num-worker-threads X num-buckets.
The default concurrency configuration value is 1024. This means moxi
will concurrently process 1024 upstream connection requests from
the head of the wait queue. (There are more queues in moxi, however,
before moxi actually forwards a request. This is discussed in later
sections.)
Taking the concurrency value of 1024 as an example, if you have 4
worker threads (the default, controlled by moxi's -t parameter) and 1
bucket (what most folks start out with, such as the "default" bucket),
you'll have a limit of 1024 x 4 x 1 or 4096 concurrently processed
client requests in that single moxi process.
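The arithmetic above, written out as a tiny Python helper (the function name is just for illustration; the defaults are the ones quoted in the text):

```python
def process_level_concurrency(concurrency=1024, worker_threads=4, buckets=1):
    """Process-wide limit: concurrency is per-thread and per-bucket."""
    return concurrency * worker_threads * buckets


print(process_level_concurrency())            # defaults from the text
print(process_level_concurrency(buckets=2))   # same moxi, two buckets
```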
The concurrency default was raised to 1024 (it used to be much lower)
because of the evolving design of moxi. Originally, the wait queue was
moxi's only internal queue. As more later-stage queues were added during
moxi's history,
we found that getting requests off the wait queue sooner and onto the
later stage queues was a better approach. We'll discuss these
later-stage queues below.
Next, let's discuss how client requests are matched to downstream connections.
Key hashing
The concurrently processed client requests (taken from the head
of the wait queue) now need to be matched up with downstream connections
to the Couchbase server. If the client's request comes with a key
(like a SET, DELETE, ADD, INCR, single-key GET), the request's key is
hashed to find the right downstream server "host:port:bucket" info.
For example, something like "memcache1:11211:default". If the
client's request was a broadcast-style command (like FLUSH_ALL, or a
multi-key GET), moxi knows the downstream connections that it needs to
acquire.
The downstream conn pool
Next, there's a lookup using those host:port:bucket identifiers into
a downstream conn pool in order to acquire or reserve the
appropriate downstream conns. There's a downstream conn pool per
thread. Each downstream conn pool is just a hashmap keyed by
host:port:bucket with hash values of a linked-list of available
downstream conns. The max length of any downstream conn linked list is
controlled by moxi's downstream_conn_max configuration parameter.
The downstream_conn_max parameter
By default the downstream_conn_max value is 4. A value of 0 means no limit.
So, if you've set downstream_conn_max of 4, have 4 worker threads,
and have 1 bucket, you should see moxi create a maximum of 4
X 4 X 1 or 16 connections to any Couchbase server.
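To see what that means for a setup like the one in the question, here is a small sketch of the same multiplication. The moxi_processes factor is my own extension: it assumes every application server runs its own moxi with its own independent pools.

```python
def max_downstream_conns(downstream_conn_max, worker_threads, buckets, moxi_processes=1):
    """Upper bound on connections one Couchbase server sees; None means unbounded."""
    if downstream_conn_max == 0:      # 0 means no limit
        return None
    return downstream_conn_max * worker_threads * buckets * moxi_processes


print(max_downstream_conns(4, 4, 1))                    # one moxi, defaults
print(max_downstream_conns(4, 4, 1, moxi_processes=5))  # e.g. five app servers
```

With five client-side moxi processes on default settings, each Couchbase node would see at most 80 connections under that assumption.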
Connecting to a downstream server
If there isn't a downstream conn available, and the
downstream_conn_max wasn't reached, moxi creates a downstream conn as
needed by doing a connect() and SASL auth as needed.
The connect_timeout and auth_timeout parameters
The connect() and SASL auth have their own configurable timeout
parameters, called connect_timeout and auth_timeout, and these
are in milliseconds. The default value for connect_timeout is 400
milliseconds, and the auth_timeout default is 100 milliseconds.
The downstream conn queue
If downstream_conn_max is reached, then the request must wait until a
downstream conn becomes available; the request therefore is
placed on a per-thread, per-host:port:bucket queue, which is called a
downstream conn queue. As downstream conns are released back into the
downstream conn pool, they will be assigned to any requests that are
waiting on the downstream conn queue.
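A toy Python model of the pool and queue just described may make the interplay easier to follow. This is a sketch of the behaviour in the text, not moxi's actual code: connections are plain strings standing in for connect(), and the queue timeout is not modelled.

```python
from collections import defaultdict, deque


class DownstreamPool:
    """One worker thread's downstream conn pool, keyed by host:port:bucket."""

    def __init__(self, downstream_conn_max=4):
        self.conn_max = downstream_conn_max   # 0 means no limit
        self.available = defaultdict(list)    # key -> idle conns
        self.created = defaultdict(int)       # key -> conns opened so far
        self.waiting = defaultdict(deque)     # key -> downstream conn queue

    def acquire(self, key, request):
        """Return a conn for the request, or None if the request was queued."""
        if self.available[key]:
            return self.available[key].pop()
        if self.conn_max == 0 or self.created[key] < self.conn_max:
            self.created[key] += 1
            return f"{key}#conn{self.created[key]}"   # stands in for connect()
        self.waiting[key].append(request)
        return None

    def release(self, key, conn):
        """Give the conn to the oldest waiter, else put it back in the pool."""
        if self.waiting[key]:
            return self.waiting[key].popleft(), conn
        self.available[key].append(conn)
        return None
```

With downstream_conn_max=2, a third concurrent request for the same host:port:bucket gets queued and only proceeds once one of the first two releases its connection.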
The downstream_conn_queue_timeout parameter
There is another configurable timeout, downstream_conn_queue_timeout,
that defines how long a request should
stay on the downstream conn queue in milliseconds before timing out.
By default, the downstream_conn_queue_timeout is 200 milliseconds. A
value of 0 indicates no timeout.
A downstream connection is reserved
Finally, at this point, downstream conns are matched up for the
client's request. If you've configured moxi to track timing histogram
statistics, moxi will now get the official start time of the request.
moxi now starts asynchronously sending request message bytes to the
downstream conn and asynchronously awaits responses.
To turn on timing histogram statistics, use the "time_stats=1"
configuration flag. By default, time_stats is 0 or off.
The downstream_timeout parameter
Next, if you've configured a downstream_timeout, moxi starts a timer
for the request, which limits the time moxi will spend
processing the request from this point on. If the timer fires, moxi will
return a "SERVER_ERROR proxy downstream timeout" back to the client.
The downstream_timeout default value is 5000 milliseconds. If moxi sees
this time elapse, it will close any downstream connections that
were assigned to the request. Due to this simple behavior of closing
downstream connections on timeout, having a very short
downstream_timeout is not recommended. Avoiding a short value helps prevent
repetitive connection creation, timeout, closing and reconnecting. On an
overloaded cluster, you may want to increase downstream_timeout so
that moxi does not constantly time out downstream connections and then
add to the load by creating even more new connections while the servers
are still trying to process requests on the old, now closed connections.
If you see load on your servers spiking, you should consider making
this adjustment.
Responses are received
When all responses are received from the downstream servers for a request (or the
downstream conn had an error), moxi asynchronously
sends those responses to the client's upstream conn. If you've
configured moxi to track timing histogram statistics, moxi now tracks
the official end time of the request. The downstream conn is now
released back to the per-thread downstream conn pool, and another
waiting client request (if any) is taken off the downstream conn queue
and assigned to use that downstream conn.
Backoff/Blacklisting
At the "Connecting to a downstream server" step, there's a case where a connect() attempt might fail. Moxi
can be configured to count up the number of connect() failures for a
downstream server, and will also track the time of the last failing
connect() attempt.
With the connect() failure counting, moxi can be configured to
blacklist a server if too many connect() failures are seen, which is
defined by the connect_max_errors configuration parameter. When more
than connect_max_errors number of connect() failures are seen, moxi
can be configured to temporarily stop making connect() attempts to
that server (or backoff) for a configured amount of time. The backoff
time is defined via the connect_retry_interval configuration, in
milliseconds.
The default for connect_max_errors is 5 and the connect_retry_interval
is 30000 milliseconds, that is, 30 seconds.
If you use connect_max_errors parameter, it should be set greater than
the downstream_conn_max configuration parameter.
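The backoff rule reads roughly like the following Python sketch. It only models the description above; resetting the counter after a successful connect is my assumption, and the class name is made up.

```python
import time


class ConnectBackoff:
    """Tracks connect() failures for one downstream server."""

    def __init__(self, connect_max_errors=5, connect_retry_interval_ms=30000,
                 clock=time.monotonic):
        self.max_errors = connect_max_errors
        self.retry_interval = connect_retry_interval_ms / 1000.0  # seconds
        self.clock = clock
        self.errors = 0
        self.last_failure = None

    def record_failure(self):
        self.errors += 1
        self.last_failure = self.clock()

    def record_success(self):
        # Assumption: a successful connect clears the failure history.
        self.errors = 0
        self.last_failure = None

    def may_connect(self):
        """False while the server is blacklisted (inside the backoff window)."""
        if self.errors <= self.max_errors:
            return True
        return self.clock() - self.last_failure >= self.retry_interval
```

The clock is injectable so the backoff window can be exercised without actually waiting 30 seconds.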

Related

Regarding MySQL Aborted connection

I'm looking into aborted connection -
2022-11-21T20:10:43.215738Z 640870 [Note] Aborted connection 640870 to db: '' user: '' host: '10.0.0.**' (Got timeout reading communication packets)
My understanding is that I need to figure out whether the connection is interactive or not, and increase wait_timeout (or interactive_timeout) accordingly. If that has no effect, then I'll need to adjust net_read_timeout or net_write_timeout and see.
I'd like to ask:
Is there a meta table that I can query for the connection type
(interactive or not)?
There are how-to's on the internet on adjusting wait_timeout (or
interactive_timeout) and all of them have rebooting the database as
the last step. Is that really required? Given that immediate effect
is not required, the sessions are supposed to come and go, and new
sessions will pick up the new value (after the system value is set),
I suppose if there is a way to track how many connections are left
with the old values, then it will be ok?
Finally, can someone suggest any blog (strategy) on handling aborted
connection or adjusting the timeout values?
Thank you!
RDS MySQL version 5.7
There is only one client that sets the interactive flag by default: the mysql command-line client. All other client tools and connectors do not set this flag by default. You can choose to set the interactive flag, because it's a flag in the MySQL client API mysql_real_connect(). So you would know if you did it. In some connectors, you aren't calling the MySQL client API directly, and it isn't even an option to set this flag.
So for practical purposes, you can ignore the difference between wait_timeout and interactive_timeout, unless you're trying to tune the timeout of the mysql client in a shell window.
You should never need to restart the MySQL Server. The timeout means the server closed the session after there had been no activity for wait_timeout seconds. The default value is 28800, which is 8 hours.
The proper way of handling this in application code is to catch exceptions, reconnect if necessary, and then retry whatever query was interrupted.
Some connectors have an auto-reconnect option. Auto-reconnect does not automatically retry the query.
In many applications, you are borrowing a connection from a connection pool, and the connection pool manager is supposed to test the connection before returning it to the caller. For example running SELECT 1; is a common test. The action of testing the connection causes a reconnect if the connection was not used for 8 hours.
If you don't use a connection pool (for example if your client program is PHP, which doesn't support connection pools as far as I know), then your client opens a new connection on request, so naturally it can't be idle for 8 hours if it's a new connection. Then the connection is closed as the request finishes, and presumably this request lasts less than 8 hours.
So this comes up only if your client opens a long-lived MySQL connection that is inactive for periods of 8 hours or more. In such cases, it's your responsibility to test the connection and reopen it if necessary before running a query.
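For that long-lived case, a test-before-use wrapper can be as small as the Python sketch below. It is driver-agnostic on purpose: connect and is_alive are hypothetical callables you would supply (for example, is_alive could run SELECT 1 and return False on an exception).

```python
class CheckedConnection:
    """Holds one long-lived connection and tests it before each use."""

    def __init__(self, connect, is_alive):
        self._connect = connect      # hypothetical factory returning a new connection
        self._is_alive = is_alive    # hypothetical probe, e.g. runs SELECT 1
        self._conn = None

    def get(self):
        """Return a usable connection, reopening it if the probe fails."""
        if self._conn is None or not self._is_alive(self._conn):
            self._conn = self._connect()
        return self._conn
```

Note that this only reopens the connection; as with auto-reconnect, retrying a query that was interrupted halfway is still up to the caller.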

Will AWS Lambda automatically close MySQL connections?

If we don't close the MySQL connection at the end of the handler function in lambda-- will the MySQL connection close automatically when lambda dies and re-connect at the cold-start?
The connections won't be closed immediately, but eventually they will. By default, the connection timeout on MySQL is 8 hours, and the maximum number of connections is also capped (at 66 in the output below).
show variables like "wait_timeout"; -- 28800
show variables like "max_connections"; -- 66
When you create a connection to MySQL server, it would create a Thread on the MySQL server to serve this connection.
show status where variable_name = 'threads_connected';
select * from information_schema.processlist;
After a Lambda executes a request and sends a response, the Lambda execution environment is not removed immediately and the same one may be used to serve other requests. This is your Warm/Hot Lambda and in this case an active MySQL connection would be really good for your function execution and this is possible only when you did not close the connection in the previous invocation. Eventually, when there are no more requests, this Lambda execution environment can be shutdown and the resources are returned to the pool of AWS compute resources. When the Lambda execution environment shuts down, the TCP connection to the MySQL server from the Lambda will also terminate. Now the MySQL server can remove the thread associated with the Lambda and in essence would reduce the pool of active connections on the server. This also takes a bit of time. So if you are getting a lot of requests concurrently and if the maximum connections are already active, then the request would start failing.
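The warm-reuse pattern described here can be sketched in Python as follows. The connect argument is a hypothetical factory for whatever MySQL client you use; nothing here is specific to a particular driver or to the AWS SDK.

```python
def make_handler(connect):
    """Build a Lambda-style handler that opens its connection lazily, once."""
    cache = {}   # lives as long as the execution environment does

    def handler(event, context):
        if "conn" not in cache:        # first invocation in this environment
            cache["conn"] = connect()
        conn = cache["conn"]           # warm invocation: reuse, do not close
        return {"connection_id": id(conn)}

    return handler
```

In a real function you would build the handler once at module level, so that the cache survives between invocations of the same execution environment.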
I ran some tests to see how long it really takes to reclaim the connections. In the resulting chart (image not included here), the X axis is in minutes and the Y axis runs from 0 to 70, with gridlines 10 units apart.
It roughly takes 10-15 minutes to reclaim the connections. But again it depends on the Lambda usage pattern as well.
So should you close the connection on every invocation? Well, it depends!
Take a look at Lambda Runtime extensions and see if you can use the shutdown hook to close connection. If you can, then it would mean while the Lambda execution environment was serving multiple requests, you used a cached connection and just before your Lambda execution environment is taken away from you, you closed the connection.
Lambda RDS Proxy is also an alternative as mentioned above, but it is not free. Before you take the RDS Proxy route, do consider using another Serverless solution like AWS Fargate. In this case probably you would use a connection pool just like any long running server side application.
No, they will not be closed automatically unless you are doing something with your MySQL client that implicitly closes the connection when it goes out of scope.
The connection will stay open until it times out. Many people have reported problems in the past with poorly written Lambdas creating tons of open sessions/connections to relational databases because the connections were not properly closed and had to wait to be timed out.
One feature that came out a year or so ago is RDS Proxy, which is an intermediary between clients and the MySQL server that implements connection pooling. This solves the problem of Lambdas not being able to use connection pooling effectively, since the RDS Proxy service can do that for serverless clients.

Mysql Router send request to down slave node for a Second

I have implemented an InnoDB cluster using MySQL Router (version 2.1.4) for HA.
This is my mysqlrouter.conf file:
[DEFAULT]
user=mysqlrouter
logging_folder=
runtime_folder=/tmp/mysqlrouter/run
data_folder=/tmp/mysqlrouter/data
keyring_path=/tmp/mysqlrouter/data/keyring
master_key_path=/tmp/mysqlrouter/mysqlrouter.key
[logger]
level = DEBUG
[metadata_cache:magentoCluster]
router_id=49
bootstrap_server_addresses=mysql://ic-1:3306,mysql://ic-2:3306,mysql://ic-3:3306
user=mysql_router49_sqxivre03wzz
metadata_cluster=magentoCluster
ttl=1
[routing:magentoCluster_default_rw]
bind_address=0.0.0.0
bind_port=6446
destinations=metadata-cache://magentoCluster/default?role=PRIMARY
mode=read-write
protocol=classic
[routing:magentoCluster_default_ro]
bind_address=0.0.0.0
bind_port=6447
destinations=metadata-cache://magentoCluster/default?role=ALL
mode=read-only
protocol=classic
[routing:magentoCluster_default_x_rw]
bind_address=0.0.0.0
bind_port=64460
destinations=metadata-cache://magentoCluster/default?role=PRIMARY
mode=read-write
protocol=x
[routing:magentoCluster_default_x_ro]
bind_address=0.0.0.0
bind_port=64470
destinations=metadata-cache://magentoCluster/default?role=ALL
mode=read-only
protocol=x
MySQL Router splits the read requests across the slave nodes. If I take slave 1 down, the router takes a few seconds to notice that slave 1 is down, so in the meantime requests are still sent to the down node and fail. Any suggestions on how to handle this failure?
The client should always check for errors. This is a necessity for any system, because network errors, power outages, etc, can occur in any configuration.
When the client discovers a connection failure (failure to connect / dropped connection), it should start over by reconnecting and replaying the transaction it is in the middle of.
For transaction integrity, the client must be involved in the process; recovery cannot be provided by any proxy.
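As a rough illustration of "reconnect and replay", here is a Python sketch. All three callables are hypothetical and supplied by you: connect opens a connection through the router, work(conn) runs the whole transaction including its commit, and is_connection_error decides which exceptions count as a lost connection.

```python
def run_transaction(connect, work, is_connection_error, attempts=3):
    """Run work(conn) as one transaction; on a lost connection, start over."""
    last_error = None
    for _ in range(attempts):
        try:
            conn = connect()       # may land on a different node this time
            return work(conn)      # work() begins and commits the transaction
        except Exception as exc:
            if not is_connection_error(exc):
                raise              # real query errors are not retried
            last_error = exc       # connection lost: replay from the beginning
    raise last_error
```

Replaying is only safe when work covers the entire transaction, so that anything uncommitted on the failed connection is rolled back. A failure during the commit itself leaves the outcome unknown and needs application-specific handling.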

How can I configure HAProxy to work with server sent events?

I'm trying to add an endpoint to an existing application that sends Server Sent Events. There often may be no event for ~5 minutes. I'm hoping to configure that endpoint to not cut off my server even when the response has not been completed in ~1min, but all other endpoints to timeout if the server fails to respond.
Is there an easy way to support server sent events in HAProxy?
Here is my suggestion for HAProxy and SSE: you have plenty of custom timeout options in HAProxy, and there are two interesting options for you.
The timeout tunnel setting specifies the timeout for tunnel connections, used for WebSockets, SSE or CONNECT. It bypasses both the server and client timeouts.
The timeout client-fin setting handles the situation where a client loses its connection (network loss, disappearing before the ACK of the ending session, etc.).
In your haproxy.cfg, this is what you should do, first in your defaults section :
# Set the max time to wait for a connection attempt to a server to succeed
timeout connect 30s
# Set the maximum inactivity time on the client side
timeout client 50s
# Set the maximum inactivity time on the server side
timeout server 50s
Nothing special so far.
Now, still in the defaults section :
# handle the situation where a client suddenly disappears from the net
timeout client-fin 30s
Next, jump to your backend definition and add this:
timeout tunnel 10h
I suggest a high value; 10 hours seems OK.
You should also avoid using the default http-keep-alive option, since SSE does not use it. Instead, use http-server-close.

MySQL - Persistent connection vs connection pooling

In order to avoid the overhead of establishing a new connection each time a query needs to be fired against MySQL, there are two options available:
Persistent connections, whereby, when a new connection is requested, a check is made to see if an 'identical' connection is already open, and if so, that connection is used.
Connection pooling, whereby the client maintains a pool of connections, so that each thread that needs to use a connection will check out one from the pool and return it back to the pool when done.
So, if I have a multi-threaded server application expected to handle thousands of requests per second, and each thread needs to fire a query against the database, then what is a better option?
From my understanding, with persistent connections, all the threads in my application will try to use the same persistent connection to the database because they are all using identical connections. So it is one connection shared across multiple application threads, and as a result the requests will soon block on the database side.
If I use a connection pooling mechanism, I will have all application threads share a pool of connections. So there is less possibility of a blocking request. However, with connection pooling, should an application thread wait to acquire a connection from the pool or should it send a request on the connections in the pool anyway in a round-robin manner, and let the queuing if any, happen on the database?
Having persistent connections does not imply that all threads use the same connection. It just "says" that you keep the connection open (as opposed to opening a connection each time you need one). Opening a connection is an expensive operation, so, in general, you try to avoid opening connections more often than necessary.
This is the reason why multithreaded applications often use connection pools. The pool takes care of opening and closing connections and every thread that needs a connection requests one from the pool. It is important to take care that the thread returns the connection as soon as possible to the pool, so that another thread can use it.
If your application has only a few long running threads that need connections you can also open a connection for each thread and keep this open.
Using just one connection (as you described it) is equivalent to a connection pool with a maximum size of one. Sooner or later this will become your bottleneck, as all threads will have to wait for the connection. This could be an option to serialize the database operations (perform them in a certain order), although there are better options to ensure serialization.
Update: The newer X Protocol supports asynchronous connections, and newer drivers like Node's can utilize this.
Regarding your question about should the application server wait for a connection, the answer is yes.
MySQL connections are blocking. When you issue a request to the MySQL server over a connection, the connection will wait, idle, until a response is received from the server.
There is no way to send two requests on the same connection and see which returns first. You can only send one request at a time.
So, generally, a single thread in a connection pool consists of one client side connection (in your case, the application server is the client) and one server side connection (database).
Your application should wait for an available connection thread from the pool, allowing the pool to grow when it's needed, and to shrink back to your default number of threads, when it's less busy.
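A minimal Python sketch of the "wait for a free connection" behaviour, using only the standard library. It keeps the pool at a fixed size (growing and shrinking is left out), and connect is a hypothetical factory for your driver's connections.

```python
import queue
from contextlib import contextmanager


class BlockingPool:
    """Fixed-size pool: callers block until a connection is free."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)   # return it as soon as the work is done
```

Each application thread would wrap its query in "with pool.connection() as conn:", so the queuing happens in the application rather than on the database.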