We are using DBCP inside a Grails application. The database is on another server, so TCP/IP is in play here. We have monitored the database by doing a show processlist frequently, and we never see above 50 connections. However, the sockets on the client grow enormously (at one point I saw over 2700). Most of them are in TIME_WAIT status.
So eventually we get a NoRouteToHostException, because it cannot open a socket.
Note that we hit the database over 40,000 times in less than a minute in this use case.
Does anyone have suggestions as to why this might be? I would think that, since our connection pool is limited to 100 (and we only see about 50 connections open), I'd only see slightly more than 50, since occasionally one might get stale. But we're seeing thousands. Is this expected? Or any other tips about something we might be missing when looking at this situation?
Here are the dbcp settings we are using:
properties {
maxActive = 100
maxIdle = 4
minIdle = 1
initialSize = 1
minEvictableIdleTimeMillis = 60000
timeBetweenEvictionRunsMillis = 60000
maxWait = 10000
removeAbandoned = true
removeAbandonedTimeout = 60
validationQuery = "/* PING */ SELECT 1"
testOnBorrow = true
testWhileIdle = true
numTestsPerEvictionRun = -1
logAbandoned = true
}
Also note that we use autoReconnect=true on the connection string, although we are considering dropping it (we get stale connections overnight otherwise).
Thanks!
Ok, so I was able to sort it out. Turns out I was misunderstanding the maxIdle and how it works.
Any connection returned to the pool while maxIdle connections are already idle is released immediately. With maxIdle = 4, most of the connections were being closed and reopened, which is why the sockets were exhausted.
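To make the effect concrete, here is a toy simulation of that behaviour. It is only a sketch and not DBCP's actual code: the function name, the burst model (50 connections borrowed at once, then all returned) and the numbers are my own assumptions.

```python
def simulate_pool(bursts, burst_size, max_idle):
    """Count physical connections opened and closed by a toy pool.

    Each burst borrows `burst_size` connections at once and then returns
    them all. On return, a connection is kept only if fewer than
    `max_idle` connections are already idle; otherwise it is closed.
    """
    idle = 0
    opened = 0
    closed = 0
    for _ in range(bursts):
        reused = min(idle, burst_size)   # idle connections are reused first
        idle -= reused
        opened += burst_size - reused    # the rest need a new socket
        for _ in range(burst_size):      # return every borrowed connection
            if idle < max_idle:
                idle += 1
            else:
                closed += 1              # above maxIdle: closed immediately
    return opened, closed


# Hypothetical workload: 1000 bursts of 50 concurrent borrows.
low = simulate_pool(bursts=1000, burst_size=50, max_idle=4)
high = simulate_pool(bursts=1000, burst_size=50, max_idle=100)
print(low)   # (46004, 46000)
print(high)  # (50, 0)
```

In this model the low maxIdle opens and closes tens of thousands of physical connections while never holding more than 50 at once, which matches the picture of few connections in show processlist but thousands of sockets in TIME_WAIT on the client.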
I want my consumers to process large batches, so I aim to have the consumer listener "awake", say, on 1800mb of data or every 5min, whichever comes first.
Mine is a kafka-springboot application, the topic has 28 partitions, and this is the configuration I explicitly change:
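+---------------------------+-------------+---------------+--------------------------+
| Parameter                 | Value I set | Default Value | Why I set it this way    |
+---------------------------+-------------+---------------+--------------------------+
| fetch.max.bytes           | 1801mb      | 50mb          | fetch.min.bytes+1mb      |
| fetch.min.bytes           | 1800mb      | 1b            | desired batch size       |
| fetch.max.wait.ms         | 5min        | 500ms         | desired cadence          |
| max.partition.fetch.bytes | 1801mb      | 1mb           | unbalanced partitions    |
| request.timeout.ms        | 5min+1sec   | 30sec         | fetch.max.wait.ms + 1sec |
| max.poll.records          | 10000       | 500           | 1500 found too low       |
| max.poll.interval.ms      | 5min+1sec   | 5min          | fetch.max.wait.ms + 1sec |
+---------------------------+-------------+---------------+--------------------------+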
Nevertheless, when I produce ~2gb of data to the topic, I see that the consumer listener (a Batch Listener) is called many times per second, far more often than the desired rate.
I logged the serialized-size of the ConsumerRecords<?,?> argument, and found that it is never more than 55mb.
This hints that I was not able to set fetch.max.bytes above the default 50mb.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker fetch.max.bytes setting, and it defaults to 55mb. I only changed the consumer preferences, unaware of the broker-side limit.
see also
The kafka KIP and the actual commit.
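As a rough illustration of why the consumer never saw more than about 55mb: the size of a fetch response is bounded by both settings, so the smaller one wins. The snippet below is a simplified model with a made-up helper name, not Kafka client code, and it ignores details such as a single oversized record batch still being returned.

```python
MB = 1024 * 1024


def effective_fetch_cap(consumer_fetch_max_bytes, broker_fetch_max_bytes):
    """Upper bound on one fetch response in this simplified model."""
    return min(consumer_fetch_max_bytes, broker_fetch_max_bytes)


consumer_setting = 1801 * MB   # fetch.max.bytes set on the consumer
broker_default = 55 * MB       # broker-side fetch.max.bytes default

cap = effective_fetch_cap(consumer_setting, broker_default)
print(cap // MB)  # 55
```

So raising only the consumer value has no visible effect until the broker-side limit is raised as well.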
I have installed the prometheus node_exporter running on port 9100 and mysqld_exporter running on port 9104, and configured grafana to use prometheus as the default source.
From the grafana explorer, I can query the node_memory_MemTotal_bytes using something like:
node_memory_MemTotal_bytes{instance="10.0.0.4:9100"}
notice port 9100 (node_exporter)
And I can query also the innodb_buffer_pool_size using:
mysql_global_variables_innodb_buffer_pool_size{instance="10.0.0.4:9104"}
notice port 9104 (mysqld_exporter)
I would like to calculate the buffer pool size as a percentage of total RAM using:
(mysql_global_variables_innodb_buffer_pool_size{instance=~"$host"} * 100) / on (instance) node_memory_MemTotal_bytes{instance=~"$host"}
The problem I have is that $host is the IP and the port, 10.0.0.4:9104, so it only matches mysql_global_variables_innodb_buffer_pool_size from the mysqld_exporter and not node_memory_MemTotal_bytes, which is exposed on port 9100. Because of this I am getting No Data.
Any ideas about how I could combine the metrics from the node_exporter and the mysqld_exporter?
This is the prometheus configuration:
- job_name: test_mysql
  scheme: http
  static_configs:
    - targets:
        - 10.0.0.4:9104
- job_name: test_node
  scheme: http
  static_configs:
    - targets:
        - 10.0.0.4:9100
I just spent a whole afternoon to find a fix for it so I thought I would share it in case this can help somebody else.
I was finally able to make it work using the label_replace function.
I replaced the original query with the following one:
(label_replace(mysql_global_variables_innodb_buffer_pool_size{instance="$host"}, "nodename", "$1", "instance", "(.*):.*") * 100) / on(nodename) (label_replace(node_memory_MemTotal_bytes, "nodename", "$1", "instance", "(.*):.*"))
label_replace makes it possible (among other things) to add a new label whose value is based on an existing one. In this case, we use it to add a new nodename label, which gets the value of the instance label (hostname:port) with the :port part removed.
This way, metrics from different exporter instances share a label with the same value and can be used together when needed (here, we wanted to use the mysql_global_variables_innodb_buffer_pool_size metric from the mysqld_exporter and node_memory_MemTotal_bytes from the node_exporter in the same query, for a given host).
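If it helps to see what the regex does to the labels, here is a small Python emulation. It is only a sketch: the label_replace function below is my own stand-in that handles just the "$1" case, and the metric values are hypothetical.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Toy stand-in for PromQL label_replace on one label set.

    The regex must match the whole source label value (Prometheus anchors
    it); if it does not match, the labels are returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    out = dict(labels)
    if match is not None:
        out[dst] = match.expand(replacement.replace("$1", r"\1"))
    return out


mysql = label_replace({"instance": "10.0.0.4:9104"},
                      "nodename", "$1", "instance", "(.*):.*")
node = label_replace({"instance": "10.0.0.4:9100"},
                     "nodename", "$1", "instance", "(.*):.*")
print(mysql["nodename"], node["nodename"])  # 10.0.0.4 10.0.0.4

# Hypothetical sample values: 128 MiB buffer pool, 8 GiB of RAM.
buffer_pool_bytes = 128 * 1024 ** 2
mem_total_bytes = 8 * 1024 ** 3
if mysql["nodename"] == node["nodename"]:   # the on(nodename) match
    percent = buffer_pool_bytes * 100 / mem_total_bytes
    print(percent)  # 1.5625
```

Both series end up with nodename="10.0.0.4", which is what lets on(nodename) pair them even though their instance labels differ.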
HTH.
Baptiste
Follow these steps to update the buffer pool size panel:
Open MySQL Overview / Edit Panel and edit the innodb_buffer_pool_size metric.
Replace the old metric
(mysql_global_variables_innodb_buffer_pool_size{instance="$host"} * 100) / on (instance) node_memory_MemTotal_bytes{instance="$host"}
with
avg by (node_name) ((mysql_global_variables_innodb_buffer_pool_size{service_name=~""} * 100)) /on (node_name) (avg by (node_name) (node_memory_MemTotal_bytes{node_name=~""}))
Save the setting.
My server used to see APPARENT DEADLOCK in the logs. I have several servers running behind a load balancer, and the interesting thing is that I see the DEADLOCK occur on all servers at the same time (does anyone know why it affects all servers?). During this time period, MySQL queries that normally take 200ms take >60 seconds. Here's what the logs looked like then:
com.mchange.v2.async.ThreadPoolAsynchronousRunner: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#58780f76
-- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#25ff87d4 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#10ccf7ef (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3305ec37 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2)
Pending Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#39cc9e5a
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#60d46f90
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#17509fea
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#b28bd63
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#56cbdc12
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#15a091b4
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#61ce325
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#48119520
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4032fb7c
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#518eefff
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#30ea3b20
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#74960088
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#23a8fc7d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#5ff0ee0
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#642d0644
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#207bc809
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#44d4936f
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#39a10d1b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3532334d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4bf79e62
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#2bd83398
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#1a202a2d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3eacda7f
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#495f5746
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#23f1f906
So I came to Stack Overflow and found this answer, which suggested I set statementCacheNumDeferredCloseThreads to 1. I did this, and I now see DEADLOCK less frequently, and only on a few servers behind the load balancer instead of all of them.
The logs look a little different now, but during a DEADLOCK period queries are still very slow:
10 Oct 2018 06:33:32,037 [WARN] (Timer-0) com.mchange.v2.async.ThreadPoolAsynchronousRunner: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#4f39ad63 -- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#34dee200 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3727ee6b (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4afb8b9 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0)
Pending Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#384a3b5b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#7bc700b0
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#731bfd15
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#a88e9bf
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#63f18b56
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#20f0c518
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#caf7746
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#41a7a27d
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#2ee32a24
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#81df2e5
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#7f7fa1e7
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#337503f
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#34b2f877
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#53dfbede
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#512d5ddb
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#68a25969
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#4bf0754a
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#65770ba4
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#5e0f4154
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#249c22ed
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#6c8e5911
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#3179550f
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#15d8a795
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#50966489
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#4ecee95b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#35640ca0
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#6550f196
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#6816399
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#3fbcd623
Pool thread stack traces:
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Any idea how to fix this? I could try disabling statement caching altogether, but I'm concerned about the general performance hit. Some other relevant parameters:
minPoolSize = 30
maxPoolSize = 30
maxStatements = 100
unreturnedConnectionTimeout = 500
idleConnectionTestPeriod = 60
acquireIncrements = 3
C3p0 version = 0.9.1.2
Edit: I forgot to mention that, when I made the change after which I saw fewer deadlocks, I also increased maxStatements, which could explain the improvement. However, I have now just seen https://github.com/swaldman/c3p0/issues/53, which says that version 0.9.2 introduced the new parameter statementCacheNumDeferredCloseThreads. My version is too old. I get no warnings/errors about this parameter not existing.
Maybe it's too late, but have you tried increasing numHelperThreads?
We have MySQL 5.7 master-slave replication, and on the slave servers it happens from time to time that our application monitoring tools (Tideways and PHP7.0) report
MySQL has gone away.
Checking the MYSQL side:
show global status like '%Connection%';
+-----------------------------------+----------+
| Variable_name | Value |
+-----------------------------------+----------+
| Connection_errors_accept | 0 |
| Connection_errors_internal | 0 |
| Connection_errors_max_connections | 0 |
| Connection_errors_peer_address | 323 |
| Connection_errors_select | 0 |
| Connection_errors_tcpwrap | 0 |
| Connections | 55210496 |
| Max_used_connections | 387 |
| Slave_connections | 0 |
+-----------------------------------+----------+
Connection_errors_peer_address shows 323. How can I further investigate what is causing this issue on both sides:
MySQL has gone away
and
Connection_errors_peer_address
EDIT:
Master Server
net_retry_count = 10
net_read_timeout = 120
net_write_timeout = 120
skip_networking = OFF
Aborted_clients = 151650
Slave Server 1
net_retry_count = 10
net_read_timeout = 30
net_write_timeout = 60
skip_networking = OFF
Aborted_clients = 3
Slave Server 2
net_retry_count = 10
net_read_timeout = 30
net_write_timeout = 60
skip_networking = OFF
Aborted_clients = 3
In MySQL 5.7, when a new TCP/IP connection reaches the server, the server performs several checks, implemented in sql/sql_connect.cc in the function check_connection().
One of these checks is to get the IP address of the client side connection, as in:
static int check_connection(THD *thd)
{
...
if (!thd->m_main_security_ctx.host().length) // If TCP/IP connection
{
...
peer_rc= vio_peer_addr(net->vio, ip, &thd->peer_port, NI_MAXHOST);
if (peer_rc)
{
/*
Since we can not even get the peer IP address,
there is nothing to show in the host_cache,
so increment the global status variable for peer address errors.
*/
connection_errors_peer_addr++;
my_error(ER_BAD_HOST_ERROR, MYF(0));
return 1;
}
...
}
Upon failure, the status variable connection_errors_peer_addr is incremented, and the connection is rejected.
vio_peer_addr() is implemented in vio/viosocket.c (code simplified to show only the important calls)
my_bool vio_peer_addr(Vio *vio, char *ip_buffer, uint16 *port,
size_t ip_buffer_size)
{
if (vio->localhost)
{
...
}
else
{
/* Get sockaddr by socked fd. */
err_code= mysql_socket_getpeername(vio->mysql_socket, addr, &addr_length);
if (err_code)
{
DBUG_PRINT("exit", ("getpeername() gave error: %d", socket_errno));
DBUG_RETURN(TRUE);
}
/* Normalize IP address. */
vio_get_normalized_ip(addr, addr_length,
(struct sockaddr *) &vio->remote, &vio->addrLen);
/* Get IP address & port number. */
err_code= vio_getnameinfo((struct sockaddr *) &vio->remote,
ip_buffer, ip_buffer_size,
port_buffer, NI_MAXSERV,
NI_NUMERICHOST | NI_NUMERICSERV);
if (err_code)
{
DBUG_PRINT("exit", ("getnameinfo() gave error: %s",
gai_strerror(err_code)));
DBUG_RETURN(TRUE);
}
...
}
...
}
In short, the only failure path in vio_peer_addr() happens when a call to mysql_socket_getpeername() or vio_getnameinfo() fails.
mysql_socket_getpeername() is just a wrapper on top of getpeername().
The man 2 getpeername manual lists the following possible errors:
NAME
getpeername - get name of connected peer socket
ERRORS
EBADF The argument sockfd is not a valid descriptor.
EFAULT The addr argument points to memory not in a valid part of the process address space.
EINVAL addrlen is invalid (e.g., is negative).
ENOBUFS
Insufficient resources were available in the system to perform the operation.
ENOTCONN
The socket is not connected.
ENOTSOCK
The argument sockfd is a file, not a socket.
Of these errors, only ENOBUFS is plausible.
As for vio_getnameinfo(), it is just a wrapper on getnameinfo(), which, according to the man page man 3 getnameinfo, can fail for the following reasons:
NAME
getnameinfo - address-to-name translation in protocol-independent manner
RETURN VALUE
EAI_AGAIN
The name could not be resolved at this time. Try again later.
EAI_BADFLAGS
The flags argument has an invalid value.
EAI_FAIL
A nonrecoverable error occurred.
EAI_FAMILY
The address family was not recognized, or the address length was invalid for the specified family.
EAI_MEMORY
Out of memory.
EAI_NONAME
The name does not resolve for the supplied arguments. NI_NAMEREQD is set and the host's name cannot be located, or neither hostname nor service name were requested.
EAI_OVERFLOW
The buffer pointed to by host or serv was too small.
EAI_SYSTEM
A system error occurred. The error code can be found in errno.
The gai_strerror(3) function translates these error codes to a human readable string, suitable for error reporting.
Here many failures can happen, basically due to heavy load or the network.
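For readers who want to poke at these two calls outside the server, the same sequence can be reproduced from Python's standard socket module. This is only a sketch on a loopback connection, with a helper name of my own choosing; it is not how mysqld is implemented.

```python
import socket


def peer_address(conn):
    """Return (ip, port) of the remote end, using the same two calls."""
    # getpeername() raises OSError on failure (e.g. ENOTCONN, ENOBUFS).
    addr = conn.getpeername()
    # getnameinfo() raises socket.gaierror on failure (the EAI_* codes).
    host, port = socket.getnameinfo(
        addr[:2], socket.NI_NUMERICHOST | socket.NI_NUMERICSERV
    )
    return host, int(port)


# Loopback demonstration: a listener plus one client connection.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
listener.listen(1)
client = socket.create_connection(listener.getsockname())
conn, _ = listener.accept()

peer = peer_address(conn)              # what the "server" sees
expected = client.getsockname()        # the client's own local address

for s in (conn, client, listener):
    s.close()
```

On the accepted socket, peer_address() should return the client's local IP and port; any exception raised by either call corresponds to the failure path that increments the counter.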
To understand the process behind this code, what the MySQL server is essentially doing is a Reverse DNS lookup, to:
find the hostname of the client
find the IP address corresponding to this hostname
to later convert this IP address to a hostname again (see the call to ip_to_hostname() that follows).
Overall, failures counted by Connection_errors_peer_address can be due to system load (causing transient failures like out of memory, etc.) or due to network issues affecting DNS.
Disclosure: I happen to be the person who implemented this Connection_errors_peer_address status variable in MySQL, as part of an effort to have better visibility / observability in this area of the code.
[Edit] To follow up with more details and/or guidelines:
When Connection_errors_peer_address is incremented, the root cause is not printed in the logs. That is unfortunate for troubleshooting, but it also avoids flooding the logs and causing even more damage; there is a tradeoff here. Keep in mind that anything that happens before logging in is very sensitive ...
If the server really goes out of memory, it is very likely that many other things will break, and that the server will go down very quickly. By monitoring the total memory usage of mysqld, and monitoring the uptime, it should be fairly easy to determine if the failure "only" caused connections to be closed with the server staying up, or if the server itself failed catastrophically.
Assuming the server stays up on failure, the more likely culprit is the second call then, to getnameinfo.
Using skip-name-resolve will have no effect, as this check happens later (see specialflag & SPECIAL_NO_RESOLVE in the code in check_connection())
When the check counted by Connection_errors_peer_address fails, note that the server cleanly returns the error ER_BAD_HOST_ERROR to the client, and then closes the socket. This is different from abruptly closing a socket (as in a crash): the former should be reported by the client as "Can't get hostname for your address", while the latter is reported as "MySQL has gone away".
Whether the client connector actually treats ER_BAD_HOST_ERROR and a closed socket differently is another story.
Given that this failure overall seems related to DNS lookups, I would check the following items:
See how many rows are in the performance_schema.host_cache table.
Compare this with the size of the host cache, see the host_cache_size system variable.
If the host cache appears full, consider increasing its size: this will reduce the number of DNS calls overall, relieving pressure on DNS, in the hope (admittedly, this is just a shot in the dark) that transient DNS failures will disappear.
323 out of 55 million connections does indeed seem transient. Assuming the monitoring client sometimes does get connected properly, inspect the row in the host_cache table for this client: it may contain other reported failures.
Table performance_schema.host_cache documentation:
https://dev.mysql.com/doc/refman/5.7/en/host-cache-table.html
Further readings:
http://marcalff.blogspot.com/2012/04/performance-schema-nailing-host-cache.html
[Edit 2] Based on the new data available:
The Aborted_clients status variable shows some connections forcefully closed by the server. This typically happens when a session is idle for a very long time.
A typical scenario for this to happen is:
1. A client opens a connection, and sends some queries
2. Then the client does nothing for an extended amount of time (greater than the net_read_timeout)
3. Due to lack of traffic, the server closes the session, and increments Aborted_clients
4. The client then sends another query, sees a closed connection, and reports "MySQL has gone away"
Note that a client application that forgets to cleanly close its sessions will execute steps 1-3; this could be the case for Aborted_clients on the master. Some cleanup here, fixing the client applications that use the master, would help to decrease resource consumption, as leaving 151650 sessions open to die on timeout has a cost.
A client application executing steps 1-4 can cause Aborted_clients on the server and "MySQL has gone away" on the client. The client application reporting "MySQL has gone away" is most likely the culprit here.
If a monitoring application, say, checks the server every N seconds, then make sure the timeouts (here 30 and 60 sec) are significantly greater than N, or the server will kill the monitoring session.
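The scenario above can be written down as a tiny timeline check. This is a sketch under my own simplifications: a single idle timeout in seconds, a hypothetical function name, and made-up query times.

```python
def first_gone_away(query_times, idle_timeout):
    """Return the time of the first query that finds the session closed.

    `query_times` are seconds since the connection was opened, in
    increasing order. The server is assumed to close the session once the
    gap between two queries exceeds `idle_timeout`. Returns None if every
    query arrives in time.
    """
    for previous, current in zip(query_times, query_times[1:]):
        if current - previous > idle_timeout:
            return current
    return None


# A client that goes quiet for 75 seconds against a 30 second timeout.
print(first_gone_away([0, 10, 20, 95], idle_timeout=30))   # 95
# A monitor polling every 20 seconds stays within the timeout.
print(first_gone_away([0, 20, 40, 60], idle_timeout=30))   # None
```

The second call is the situation to aim for: the polling interval N stays comfortably below the server-side timeout.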
I've tried prepending my query with:
set mapred.running.reduce.limit = 25;
And
set hive.exec.reducers.max = 35;
The last one jailed a job with 530 reducers down to 35... which makes me think it was going to try to shoehorn 530 reducers' worth of work into 35.
Now giving
set mapred.tasktracker.reduce.tasks.maximum = 3;
a try to see if that number is some sort of max per node (previously it was 7 on a cluster with 70 potential reducers).
Update:
set mapred.tasktracker.reduce.tasks.maximum = 3;
Had no effect, was worth a try though.
Not exactly a solution to the question, but potentially a good compromise.
set hive.exec.reducers.max = 45;
For a super query of doom that has 400+ reducers, this jails the most expensive hive task down to 35 reducers total. My cluster currently only has 10 nodes, each node supporting 7 reducers... so in reality only 70 reducers can run at one time. By jailing the job down to fewer than 70, I've noticed a slight improvement in speed without any visible changes to the final product. I'm testing this in production to figure out what exactly is going on here. In the interim it's a good compromise solution.
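For what it's worth, the arithmetic behind the cap can be sketched as follows. It assumes Hive estimates the reducer count roughly as input size divided by hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max; the input size and bytes-per-reducer values are hypothetical, chosen only to reproduce the 530 reducers mentioned above.

```python
import math


def estimated_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Approximate reducer count: size-based estimate, capped by the max."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


GB = 10 ** 9
input_bytes = 530 * GB        # hypothetical input size
bytes_per_reducer = 1 * GB    # hypothetical bytes-per-reducer setting
cluster_slots = 10 * 7        # 10 nodes, 7 reducers each

uncapped = estimated_reducers(input_bytes, bytes_per_reducer, 999)
capped = estimated_reducers(input_bytes, bytes_per_reducer, 35)

print(uncapped, math.ceil(uncapped / cluster_slots))  # 530 8
print(capped, math.ceil(capped / cluster_slots))      # 35 1
```

Under these assumptions the uncapped job needs 8 waves of reducers on 70 slots, while the capped job runs in a single wave with each reducer handling about 15 times more data.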