My server used to see APPARENT DEADLOCK in the logs. I have several servers running behind a load balancer, and the interesting thing is that the DEADLOCK occurs on all servers at the same time (does anyone know why it affects all of them?). During this period, MySQL queries that normally take 200ms take more than 60 seconds. Here's what the logs looked like then:
com.mchange.v2.async.ThreadPoolAsynchronousRunner: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#58780f76
-- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#25ff87d4 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#10ccf7ef (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3305ec37 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2)
Pending Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#39cc9e5a
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#60d46f90
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#17509fea
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#b28bd63
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#56cbdc12
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#15a091b4
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#61ce325
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#48119520
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4032fb7c
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#518eefff
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#30ea3b20
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#74960088
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#23a8fc7d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#5ff0ee0
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#642d0644
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#207bc809
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#44d4936f
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#39a10d1b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3532334d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4bf79e62
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#2bd83398
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#1a202a2d
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3eacda7f
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#495f5746
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#23f1f906
So I came to Stack Overflow and found this answer, which suggested I set statementCacheNumDeferredCloseThreads to 1. I did this, and now I see the DEADLOCK less frequently, and only on a few servers behind the load balancer instead of all of them.
The logs look a little different now, but during the DEADLOCK period queries still take very long:
10 Oct 2018 06:33:32,037 [WARN] (Timer-0) com.mchange.v2.async.ThreadPoolAsynchronousRunner: com.mchange.v2.async.ThreadPoolAsynchronousRunner$DeadlockDetector#4f39ad63 -- APPARENT DEADLOCK!!! Complete Status:
Managed Threads: 3
Active Threads: 3
Active Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#34dee200 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#3727ee6b (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask#4afb8b9 (com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0)
Pending Tasks:
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#384a3b5b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#7bc700b0
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#731bfd15
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#a88e9bf
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#63f18b56
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#20f0c518
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#caf7746
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#41a7a27d
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#2ee32a24
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#81df2e5
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#7f7fa1e7
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#337503f
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#34b2f877
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#53dfbede
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#512d5ddb
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#68a25969
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#4bf0754a
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#65770ba4
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#5e0f4154
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#249c22ed
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#6c8e5911
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#3179550f
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#15d8a795
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#50966489
com.mchange.v2.resourcepool.BasicResourcePool$1RefurbishCheckinResourceTask#4ecee95b
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StmtAcquireTask#35640ca0
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#6550f196
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#6816399
com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask#3fbcd623
Pool thread stack traces:
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#1,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Thread[com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#0,5,main]
com.mysql.jdbc.PreparedStatement.realClose(PreparedStatement.java:2765)
com.mysql.jdbc.StatementImpl.close(StatementImpl.java:541)
com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:41)
com.mchange.v2.c3p0.stmt.GooGooStatementCache$1StatementCloseTask.run(GooGooStatementCache.java:404)
com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:547)
Any idea how to fix this? I could try disabling statement caching altogether, but I'm concerned about the general performance hit. Some other relevant parameters:
minPoolSize = 30
maxPoolSize = 30
maxStatements = 100
unreturnedConnectionTimeout = 500
idleConnectionTestPeriod = 60
acquireIncrement = 3
C3p0 version = 0.9.1.2
Edit: I forgot to mention that along with this improvement, where I saw fewer deadlocks, I also increased maxStatements, which could explain it. However, I have now found https://github.com/swaldman/c3p0/issues/53, which says that version 0.9.2 introduces the statementCacheNumDeferredCloseThreads parameter. My version is too old for it, yet I get no warnings or errors about the parameter not existing.
Maybe it's too late, but have you tried increasing numHelperThreads?
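In case it helps, here is a minimal sketch of what that could look like programmatically, assuming an upgrade to c3p0 0.9.2 or later (statementCacheNumDeferredCloseThreads did not exist in 0.9.1.2). The driver class, JDBC URL, and the exact values chosen for numHelperThreads and maxStatements are illustrative assumptions, not recommendations:

import com.mchange.v2.c3p0.ComboPooledDataSource;

public class PoolConfig {
    public static ComboPooledDataSource buildPool() throws Exception {
        ComboPooledDataSource cpds = new ComboPooledDataSource();
        cpds.setDriverClass("com.mysql.jdbc.Driver");      // assumed driver
        cpds.setJdbcUrl("jdbc:mysql://db-host:3306/mydb"); // hypothetical URL

        // Pool sizing, matching the values in the question
        cpds.setMinPoolSize(30);
        cpds.setMaxPoolSize(30);
        cpds.setAcquireIncrement(3);
        cpds.setUnreturnedConnectionTimeout(500);
        cpds.setIdleConnectionTestPeriod(60);

        // 100 cached statements across 30 connections is only ~3 per connection,
        // so the cache churns constantly (the StatementCloseTask/StmtAcquireTask
        // backlog in the logs). A larger cache reduces that churn.
        cpds.setMaxStatements(500);                        // illustrative value

        // The async helper threads run statement closes and connection tests;
        // the default of 3 is easy to saturate.
        cpds.setNumHelperThreads(10);                      // illustrative value

        // Only exists in c3p0 0.9.2+: gives statement closes their own thread
        // so they cannot starve the helper-thread pool.
        cpds.setStatementCacheNumDeferredCloseThreads(1);
        return cpds;
    }
}

The thinking here is that all three default helper threads in your stack traces are stuck in PreparedStatement.realClose, so adding helpers and giving the statement cache a dedicated close thread keeps routine close/test work from piling up behind slow closes.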
I want my consumers to process large batches, so I aim to have the consumer listener wake up on, say, 1800mb of data or every 5 minutes, whichever comes first.
Mine is a Kafka Spring Boot application; the topic has 28 partitions, and this is the configuration I explicitly change (a plain-Java sketch of these settings follows the table):
| Parameter | Value I set | Default value | Why I set it this way |
| --- | --- | --- | --- |
| fetch.max.bytes | 1801mb | 50mb | fetch.min.bytes + 1mb |
| fetch.min.bytes | 1800mb | 1b | desired batch size |
| fetch.max.wait.ms | 5min | 500ms | desired cadence |
| max.partition.fetch.bytes | 1801mb | 1mb | unbalanced partitions |
| request.timeout.ms | 5min + 1sec | 30sec | fetch.max.wait.ms + 1sec |
| max.poll.records | 10000 | 500 | 1500 found too low |
| max.poll.interval.ms | 5min + 1sec | 5min | fetch.max.wait.ms + 1sec |
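For reference, here is roughly how those settings translate into plain consumer properties. This is a minimal sketch outside Spring Boot; the bootstrap server, group id, and deserializers are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class BigBatchConsumerFactory {
    public static KafkaConsumer<byte[], byte[]> build() {
        final int MB = 1024 * 1024;
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");         // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "big-batch-consumer");           // placeholder

        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1800 * MB);               // desired batch size
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1801 * MB);               // fetch.min.bytes + 1mb
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1801 * MB);     // unbalanced partitions
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 5 * 60 * 1000);         // 5 min cadence
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, 5 * 60 * 1000 + 1000); // fetch.max.wait.ms + 1s
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 5 * 60 * 1000 + 1000);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10_000);

        return new KafkaConsumer<>(props, new ByteArrayDeserializer(), new ByteArrayDeserializer());
    }
}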
Nevertheless, when I produce ~2gb of data to the topic, I see the consumer listener (a batch listener) being called many times per second -- far more often than the desired rate.
I logged the serialized-size of the ConsumerRecords<?,?> argument, and found that it is never more than 55mb.
This hints that I was not able to set fetch.max.bytes above the default 50mb.
Any idea how I can troubleshoot this?
Edit:
I found this question: Kafka MSK - a configuration of high fetch.max.wait.ms and fetch.min.bytes is behaving unexpectedly
Is it really impossible as stated?
Finally found the cause.
There is a broker-level fetch.max.bytes setting, and it defaults to 55mb. I had only changed the consumer configuration, unaware of the broker-side limit.
See also: the Kafka KIP and the actual commit.
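To confirm what the broker will actually allow, something like the following AdminClient sketch can read the effective value back (the bootstrap address and broker id are placeholders, and the broker-level fetch.max.bytes config only exists on Kafka 2.4+ brokers):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckBrokerFetchMax {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Broker id "0" is a placeholder; use a real id from your cluster.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            // Prints null on brokers older than 2.4, where the broker-level cap does not exist.
            System.out.println("broker fetch.max.bytes = " + config.get("fetch.max.bytes"));
        }
    }
}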
I'm using MySQL 5.7.14 x64 on Windows Server 2008 R2
Sometimes (at random times during the day) MySQL crashes with this stack trace:
11:44:40 UTC - mysqld got exception 0x80000003 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
key_buffer_size=8388608
read_buffer_size=65536
max_used_connections=369
max_threads=2800
thread_count=263
connection_count=263
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3195125 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x2ee2b72b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
13fe1bad2 mysqld.exe!my_sigabrt_handler()[my_thr_init.c:449]
1401c7979 mysqld.exe!raise()[winsig.c:587]
1401c6870 mysqld.exe!abort()[abort.c:82]
13ff1dd38 mysqld.exe!ut_dbg_assertion_failed()[ut0dbg.cc:67]
13ff1df51 mysqld.exe!ib::fatal::~fatal()[ut0ut.cc:916]
13ff0e008 mysqld.exe!buf_LRU_check_size_of_non_data_objects()[buf0lru.cc:1219]
13ff0f4ab mysqld.exe!buf_LRU_get_free_block()[buf0lru.cc:1303]
1400305cb mysqld.exe!buf_block_alloc()[buf0buf.cc:557]
13ff3767e mysqld.exe!mem_heap_create_block_func()[mem0mem.cc:319]
13ff37499 mysqld.exe!mem_heap_add_block()[mem0mem.cc:408]
13ffd87f4 mysqld.exe!RecLock::lock_alloc()[lock0lock.cc:1441]
13ffd795c mysqld.exe!RecLock::create()[lock0lock.cc:1534]
13ffd73a6 mysqld.exe!RecLock::add_to_waitq()[lock0lock.cc:1735]
13ffdcaaa mysqld.exe!lock_rec_lock_slow()[lock0lock.cc:2007]
13ffdc6ce mysqld.exe!lock_rec_lock()[lock0lock.cc:2081]
13ffd8cc7 mysqld.exe!lock_clust_rec_read_check_and_lock()[lock0lock.cc:6307]
140076fe3 mysqld.exe!row_ins_set_shared_rec_lock()[row0ins.cc:1502]
140072927 mysqld.exe!row_ins_check_foreign_constraint()[row0ins.cc:1739]
140072de8 mysqld.exe!row_ins_check_foreign_constraints()[row0ins.cc:1932]
140075d69 mysqld.exe!row_ins_sec_index_entry()[row0ins.cc:3356]
1400758a6 mysqld.exe!row_ins_index_entry_step()[row0ins.cc:3583]
140071b30 mysqld.exe!row_ins()[row0ins.cc:3721]
14007755a mysqld.exe!row_ins_step()[row0ins.cc:3907]
13ffaad50 mysqld.exe!row_insert_for_mysql_using_ins_graph()[row0mysql.cc:1735]
13fe7a7d3 mysqld.exe!ha_innobase::write_row()[ha_innodb.cc:7489]
13f6e5531 mysqld.exe!handler::ha_write_row()[handler.cc:7891]
13f8e54de mysqld.exe!write_record()[sql_insert.cc:1860]
13f8e916a mysqld.exe!read_sep_field()[sql_load.cc:1222]
13f8e7af4 mysqld.exe!mysql_load()[sql_load.cc:563]
13f716e86 mysqld.exe!mysql_execute_command()[sql_parse.cc:3649]
13f7194b3 mysqld.exe!mysql_parse()[sql_parse.cc:5565]
13f71267d mysqld.exe!dispatch_command()[sql_parse.cc:1430]
13f71368a mysqld.exe!do_command()[sql_parse.cc:997]
13f6d82bc mysqld.exe!handle_connection()[connection_handler_per_thread.cc:300]
140105122 mysqld.exe!pfs_spawn_thread()[pfs.cc:2191]
13fe1b93b mysqld.exe!win_thread_start()[my_thread.c:38]
1401c73ef mysqld.exe!_callthreadstartex()[threadex.c:376]
1401c763a mysqld.exe!_threadstartex()[threadex.c:354]
772859bd kernel32.dll!BaseThreadInitThunk()
773ba2e1 ntdll.dll!RtlUserThreadStart()
At that time only 2 transactions were active:
---TRANSACTION 1111758443, ACTIVE 565 sec
mysql tables in use 7, locked 7
7527 lock struct(s), heap size 876752, 721803 row lock(s), undo log entries 379321
MySQL thread id 166068, OS thread handle 1508, query id 112695582 localhost converter Waiting for table level lock
delete from pl
using
import_k2b_product_links ipl inner join k2b_products pSource on ipl.src_product = pSource.article and pSource.account_id = 22
inner join k2b_products pDest on ipl.dst_product = pDest.article and pDest.account_id = 22
inner join k2b_product_links pl on pl.src_product_id = pSource.id and pl.dst_product = pDest.id
where ipl.action = 1
---TRANSACTION 1111759716, ACTIVE 496 sec inserting, thread declared inside InnoDB 1
mysql tables in use 4, locked 4
7 lock struct(s), heap size 1304535248, 102060778 row lock(s), undo log entries 1
MySQL thread id 19436, OS thread handle 11664, query id 112301161 localhost exchange_central
LOAD DATA INFILE 'd:/kdm/temp/webCentral/ufrd1uwx.v2r'
INTO TABLE k2b_orders
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id_status, dt, account_id, sms_sended, params, update_ts, exported, id_editor, dt_offset, device_id, gen, changer_device_id, total, creator_device_id, id, dt_server, device_category_id, original_params, order_num, sended, editor_comment, admin_comment)
I don't understand why transaction 1111758443 is waiting for a table-level lock.
And why does transaction 1111759716 lock 102060778 rows when it loads only a single row from the external file, as shown by undo log entries 1?
What should I investigate to find the reason for these enormous locks and the crash?
Thanks!
Two things make me think that the crash is not the 'real' problem.
Both queries in the log show 'huge' times, such as ACTIVE 565 sec.
And these are all quite large:
max_used_connections=369
max_threads=2800
thread_count=263
connection_count=263
When there are hundreds of threads simultaneously active, InnoDB stumbles over itself. Throughput stalls, and latency goes through the roof.
One cure is to avoid so many connections. This is sometimes best done at the client. What is the client? For example, Apache has MaxClients. A dozen Apaches, each with MaxClients = 50 would be trying to open 600 connections. Probably one Apache cannot effectively handle 50 threads at once. Lower that number.
Are there any VIEWs deceiving us?
Another thing to do is to pursue the table-level lock. Let's see SHOW CREATE TABLE for the tables involved, and check for appropriate indexes:
import_k2b_product_links: INDEX(action, ...)
k2b_products: INDEX(account_id, src_product) -- in either order
k2b_products: INDEX(account_id, dest_product) -- in either order
k2b_product_links: INDEX(src_product_id, dest_product_id) -- or PK, see below
Is k2b_product_links a many:many mapping table? If so, get rid of the id auto_increment as discussed here.
The index suggestions, if useful, could speed up the DELETE, thereby cutting down on possible contention.
I need to figure out how to manage my retries in NServiceBus.
If there is any exception in my flow, it should retry 10 times, every 10 seconds. But when I search the NServiceBus documentation (http://docs.particular.net/nservicebus/errors/automatic-retries), there are 2 different retry mechanisms: First Level Retries (FLR) and Second Level Retries (SLR).
FLR is for transient errors. When you get an exception, it retries immediately, up to your MaxRetries parameter. This parameter should be 1 for me.
SLR is for errors that persist after FLR, where a small delay is needed between retries. There is a config parameter called "TimeIncrease" that defines the delay between tries. However, NServiceBus increases that delay with each retry: when you set this parameter to 10 seconds, it retries after 10 seconds, 30 seconds, 60 seconds, and so on.
What do you suggest so that my messages are retried every 10 seconds, with or without these mechanisms?
I found my answer in a reply from Particular Software's community (John Simon): you need to apply a custom retry policy; have a look at http://docs.particular.net/nservicebus/errors/automatic-retries#second-level-retries-custom-retry-policy-simple-policy for an example.
Here is some log output from chrome://net-internals/#events:
HTTP_TRANSACTION_SEND_REQUEST
t=18581 [st= 2] +HTTP_TRANSACTION_READ_HEADERS [dt=2744]
t=18581 [st= 2] HTTP_STREAM_PARSER_READ_HEADERS [dt=2744]
t=21325 [st=2746] HTTP_TRANSACTION_READ_RESPONSE_HEADERS
As you can see, the request is pending for almost 2700 ms. How can I solve this problem?
I have tested the client: the internet speed is fast, and pinging my domain also looks normal. The server responds quickly to other users, and the API (which just prints the current time) is also fine.
There may be several reasons for it:
The internet speed may be slow.
The server configuration may be lower than required (RAM and CPU).
The API you are calling may have too many conditions, or a complicated query (with joins and subqueries).
Check the above possibilities in your code.
We are using DBCP inside a Grails application. The database is on another server, so TCP/IP is in play here. We have monitored the database by doing a show processlist frequently, and we never see above 50 connections. However, the sockets on the client grow enormously (at one point I saw over 2700). Most of them are in TIME_WAIT status.
So eventually we get a NoRouteToHostException, because it cannot open a socket.
Note that we hit the database over 40,000 times in less than a minute in this use case.
Does anyone have suggestions as to why this might be? I would think that, since our connection pool is limited to 100 (and we only see about 50 connections open), I'd only see slightly more than 50, since occasionally one might get stale. But we're seeing thousands. Is this expected? Or any other tips about something we might be missing when looking at this situation?
Here are the dbcp settings we are using:
properties {
maxActive = 100
maxIdle = 4
minIdle = 1
initialSize = 1
minEvictableIdleTimeMillis = 60000
timeBetweenEvictionRunsMillis = 60000
maxWait = 10000
removeAbandoned = true
removeAbandonedTimeout = 60
validationQuery = "/* PING */ SELECT 1"
testOnBorrow = true
testWhileIdle = true
numTestsPerEvictionRun = -1
logAbandoned = true
}
Also note that we use autoReconnect=true on the connection string, although we are considering dropping it (we get stale connections overnight otherwise).
Thanks!
OK, so I was able to sort it out. It turns out I was misunderstanding maxIdle and how it works.
Any connection returned to the pool beyond maxIdle is immediately closed, so most of the connections were being closed and reopened, which is why the sockets were exhausted.
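For anyone hitting the same thing, here is a minimal sketch of the idea in plain Commons DBCP 1.x (BasicDataSource) rather than the Grails DSL; the URL and credentials are placeholders. The point is simply that maxIdle should be close to the number of connections actually in use, so that returning a connection does not mean physically closing its socket:

import org.apache.commons.dbcp.BasicDataSource;

public class DbcpSetup {
    public static BasicDataSource build() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:mysql://db-host:3306/mydb");  // hypothetical URL
        ds.setUsername("app");                        // placeholder
        ds.setPassword("secret");                     // placeholder

        ds.setMaxActive(100);
        // With maxIdle = 4, any connection returned while 4 are already idle is
        // physically closed, so a burst of 40,000 queries keeps opening and closing
        // sockets and leaves thousands in TIME_WAIT. Keep idle capacity near the peak.
        ds.setMaxIdle(100);
        ds.setMinIdle(1);

        ds.setValidationQuery("/* ping */ SELECT 1");
        ds.setTestOnBorrow(true);
        ds.setTestWhileIdle(true);
        ds.setTimeBetweenEvictionRunsMillis(60000);
        ds.setMinEvictableIdleTimeMillis(60000);
        return ds;
    }
}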