We set up database replication about a week ago, and we are having an issue with keeping it in sync.
The setup is master-master replication with MariaDB 10.1.35 / MySQL 5.5.5. Only one of the databases receives application queries; the other will only be used as a backup, and I will refer to it as the slave. It's the slave we're having issues with. The replication is statement-based.
The first 24 hours went fine. The next day, the slave fell further and further behind, up to almost 24 hours of lag. When we checked another 24 hours later, the slave was back on track again, lagging behind the master by just a few seconds.
Now it's starting to fall behind again (over 5 hours of lag at the moment).
It's still syncing, so the replication itself is working. However, some queries just take way too long on the slave, which is delaying everything.
All queries are executed quite fast, except for one UPDATE query. It's this one that stays in the processlist for 5, 10 and sometimes even 20 or 30 seconds. The query is handled in less than a second on the master, and when we execute it manually on the slave it doesn't take longer than a second either. So we don't think it's related to the query itself. The structure of both databases/tables is exactly the same. The storage engine of the table is InnoDB.
At this point, we have no clue what could be causing this delay. Inserts are being processed instantly.
There's one difference in the processlist when the query is being executed on the slave: the Command column stays on 'Connect', while it says 'Execute' on the master. Is this normal behaviour?
If I should provide more information, please let me know. It's clear that a slave only handles one query at a time and can therefore fall behind if there are a lot of queries on the master, but that doesn't explain why this query takes up to 30 seconds when it takes less than one second when executed manually.
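For reference, this is roughly how we've been watching the lag and the replication thread (standard status commands, nothing custom):

    -- Seconds_Behind_Master, Slave_SQL_Running and Last_SQL_Error are the fields we watch
    SHOW SLAVE STATUS\G
    -- this is where we see the replication thread sitting as 'system user' with Command 'Connect'
    SHOW FULL PROCESSLIST;

Seconds_Behind_Master is the value that keeps climbing while that one UPDATE sits in the processlist.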
Thank you.
P.S. We already optimized the table (OPTIMIZE TABLE), but unfortunately that didn't make a difference.
Related
I was trying to run ANALYZE on one table out of 900 tables in MySQL 5.7.30. It locked up my whole DB: the processlist and connection count spiked immediately, lots of commands got stuck in the state "Waiting for table flush", and we even hit our max_connections limit of 2500. We have been running ANALYZE TABLE for the last 3 years, but in the last month we have seen this issue for the fourth time. If we don't analyze our tables we see severe performance issues and a lot of queries enter the state "statistics". What are your thoughts on this?
You most definitely shouldn't be running ANALYZE regularly or automatically. It sounds like you were dodging the bullet of queries stuck in the "Waiting for table flush" state purely because the load on your servers was low enough that you didn't notice it before. You should only ever run ANALYZE on a table sparingly, when you have clear, definitive evidence that the index statistics on that table are sufficiently detached from reality to make the query optimiser regularly come up with egregiously poor execution plans.
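As a rough sanity check before ever reaching for ANALYZE, you can compare the optimiser's estimate against reality yourself (a sketch; orders and customer_id are placeholder names):

    SHOW INDEX FROM orders;                          -- the Cardinality column is the optimiser's estimate
    SELECT COUNT(DISTINCT customer_id) FROM orders;  -- the actual number of distinct values for that index

Only when those two numbers are wildly apart, and you can tie a concretely bad plan to that gap, is a one-off ANALYZE TABLE on that table justified.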
I am loading data into a MySQL database (8.0.16) through Pentaho (8.3) jobs. For the last 2 months the process was going fine (jobs usually complete in 1 hour), but since last night it has been running very slowly (executing for more than 12 hours). I have checked the processlist and the queries are taking forever to execute (even a COUNT on small tables). I am unable to figure out the issue.
Any suggestions for finding the bottleneck?
We have a MySQL master with a replica (5.7 with row-based replication).
At peak, the master performs about 3000 inserts per second, and the replica seems to read that just fine. However, sometimes we execute long-running SELECT queries (which take 10 to 20 seconds), and during those queries the replication lag becomes huge.
What I do not understand is how ordinary MySQL threads that execute SELECTs (without locking any tables) can cause the replication thread to slow down (i.e. it applies about 2.5K inserts per second instead of the master's 3K). What exactly would I need to tune?
I checked the slave status and it's not the IO thread - that one manages to read events from the master just fine. It's the slave SQL thread that somehow does not manage to catch up. The isolation level is READ COMMITTED, so the SELECT queries could potentially lock some records and make the slave thread wait, but I'm not sure about that.
UPDATED: I have checked again - it turns out that even a single heavy query on the slave (one that scans the entire table, for example) produces the lag. It seems like the slave SQL thread is blocked, but I do not understand why.
UPDATED 2: I finally found the solution. First I increased slave_parallel_workers to 4 and set slave_parallel_type to LOGICAL_CLOCK. However, and this is important, that gave me no improvement at all, since the transactions were dependent on each other. But after I increased binlog_group_commit_sync_delay on the master to 10000 (that is, 10 milliseconds), the lag disappeared.
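For anyone else hitting this, the change boils down to something like the following (a sketch for 5.7; the same settings also need to go into my.cnf to survive a restart):

    -- on the replica: the SQL thread has to be restarted for the parallel settings to apply
    STOP SLAVE SQL_THREAD;
    SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
    SET GLOBAL slave_parallel_workers = 4;
    START SLAVE SQL_THREAD;

    -- on the master: delay group commit by 10 ms so more transactions share a commit group
    -- (the value is in microseconds)
    SET GLOBAL binlog_group_commit_sync_delay = 10000;

The trade-off is up to 10 ms of extra commit latency on the master in exchange for bigger commit groups that the replica can apply in parallel.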
There might be many reasons for replication lag on a MySQL slave database.
But as you mentioned
It's the slave SQL thread that somehow does not manage to catch up.
Assuming that the IO thread works fine, Percona says (emphasis mine):
[...] when the slave SQL_THREAD is the source of replication delays it is probably because of queries coming from the replication stream are taking too long to execute on the slave. This is sometimes because of different hardware between master/slave, different schema indexes, workload. Moreover, the slave OLTP workload sometimes causes replication delays because of locking. For instance, if a long-running read against a MyISAM table blocks the SQL thread, or any transaction against an InnoDB table creates an IX lock and blocks DDL in the SQL thread. Also, take into account that slave is single threaded prior to MySQL 5.6, which would be another reason for delays on the slave SQL_THREAD.
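If you want to check whether the slave SQL thread really is blocked on a row lock while the lag builds up, something like this against the 5.7 information_schema tables will show who is waiting on whom (a sketch, not specific to your schema):

    SELECT r.trx_id    AS waiting_trx,
           r.trx_query AS waiting_query,
           b.trx_id    AS blocking_trx,
           b.trx_query AS blocking_query
    FROM information_schema.innodb_lock_waits w
    JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;

If that comes back empty while the lag grows, the SQL thread is slow rather than blocked, which points back at single-threaded apply rather than locking.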
I'm really confused. We have a process - admittedly inefficient, and I'm fixing it - that runs about a quarter million tiny update queries. These finished on the master server, which is on MySQL 5.0, in half an hour; the newly-upgraded MySQL 5.5 slave has been working on them for six hours. The key seems to be "query end" - each one spends over a tenth of a second in this state on the slave, which is really dragging things down, as 10 queries a second means, well... six hours. The master spends less than 0.06 seconds on the entire query, while the slave spends 0.13 seconds (89% of the query time) in "query end".
Did 5.5 change something that my 5.0 configuration is interfering with? I'm at my wits' end, as this is really starting to slow down some reports we have that are inefficient like this. I will change the reports, but I also want to find out what went wrong.
Before you ask: The 5.5 slave is still doing everything in MyISAM, so that hasn't changed. In fact, the configuration is generally identical to what the other, still 5.0, slave has, and that slave also finished in half an hour.
Disabling the binary log on the slave fixed it. This makes me worry about when we upgrade the master to 5.5, but for now things are working much, much better.
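For anyone comparing notes, these are the sort of variables to check on the slave before going as far as disabling the binary log (the comments are a suspicion, not a confirmed diagnosis):

    SHOW VARIABLES LIKE 'log_bin';           -- is binary logging even needed on this slave?
    SHOW VARIABLES LIKE 'sync_binlog';       -- an fsync per commit makes the 'query end' step expensive
    SHOW VARIABLES LIKE 'log_slave_updates'; -- controls whether replicated statements get re-logged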
I have a service that sits on top of a MySQL 5.5 database (InnoDB). The service has a background job that is supposed to run every week or so. At a high level, the background job does the following:
1. Do some initial DB read and write in one transaction.
2. Execute UMQ (described below) with a set of parameters in one transaction. If no records are returned we are done!
3. Process the result from UMQ (this is a bit heavy so it is done outside of any DB transaction).
4. Write the outcome of the previous step to the DB in one transaction (this writes to tables queried by UMQ and ensures that the same records are not found again by UMQ).
5. Go to step 2.
UMQ - Ugly Monster Query: This is a nasty database query that joins a bunch of tables, has conditions on columns in several of these tables and includes a NOT EXISTS subquery with some more joins and conditions. UMQ includes an ORDER BY and also has LIMIT 1000. Even though the query is bad, I have done what I can here - there are indexes on all columns filtered on, and the joins all go over foreign key relations.
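To give a feel for the shape without posting the real thing, it looks roughly like this (all table and column names below are made up):

    SELECT t.id, t.payload
    FROM t
    JOIN a ON a.t_id = t.id
    JOIN b ON b.a_id = a.id
    WHERE a.status = 'ready'
      AND b.created_at < NOW()
      AND NOT EXISTS (SELECT 1
                      FROM processed p
                      JOIN processed_detail d ON d.processed_id = p.id
                      WHERE p.t_id = t.id
                        AND d.kind = 'x')
    ORDER BY t.created_at
    LIMIT 1000;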
I do expect UMQ to be heavy and take some time, which is why it's executed in a background job. However, what I'm seeing is rapidly degrading performance until it eventually causes a timeout in my service (maybe 50 times slower after 10 iterations).
First I thought it was because the data queried by UMQ changes (see step 4 above), but that wasn't it: if I took the last query (the one that caused the timeout) from the slow query log and executed it myself directly, I got the same slow behavior - but only until I restarted the MySQL service. After the restart, the exact same query on the exact same data that took >30 seconds before now took <0.5 seconds. I can reproduce this behavior every time by restoring the database to its initial state and restarting the process.
Also, using the trick described in this question I could see that the query scans around 60K rows after restart, as opposed to 18M rows before. EXPLAIN tells me that around 10K rows should be scanned, and the result of EXPLAIN is always the same. No other processes are accessing the database at the same time, and the lock_time in the slow query log is always 0. SHOW ENGINE INNODB STATUS before and after restart gives me no hints.
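The trick boils down to resetting the session handler counters, running the query, and reading them back (a sketch; the linked question has the details):

    FLUSH STATUS;
    -- run UMQ here
    SHOW SESSION STATUS LIKE 'Handler%';  -- Handler_read_rnd_next and friends add up to the rows actually examined

That is the kind of rows-scanned comparison quoted above.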
So, finally, the question: does anybody have any clue why I'm seeing this behavior? And how can I analyze this further?
I have the feeling that I need to configure MySQL differently in some way but I have searched and tested like crazy without coming up with anything that makes a difference.
It turns out that the behavior I saw was the result of how the MySQL optimizer uses InnoDB statistics to decide on an execution plan. This article put me on the right track (even though it does not discuss exactly my problem). The most important thing I learned from it is that MySQL calculates the statistics on startup and then only refreshes them once in a while. These statistics are then used to optimize queries.
The way I had set up the test data, the table T where most writes are done in step 4 started out empty. After each iteration T would contain more and more records, but the InnoDB statistics had not yet been updated to reflect this. Because of this, the MySQL optimizer kept choosing an execution plan for UMQ (which includes a JOIN with T) that worked well when T was empty but performed worse and worse the more records T contained.
To verify this I added an ANALYZE TABLE T; before every execution of UMQ, and the rapid degradation disappeared. Not lightning-fast performance, but acceptable. I also saw that leaving the database alone for half an hour or so (maybe a bit shorter, but at least more than a couple of minutes) would allow the InnoDB statistics to refresh automatically.
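In the test setup the workaround is simply this, run at the top of every iteration (T as in the description above):

    ANALYZE TABLE T;   -- recompute InnoDB index statistics so the optimizer sees T's current size
    -- ... then execute UMQ as before (step 2 of the job)

ANALYZE TABLE on InnoDB only samples a handful of index pages, so running it once per iteration in this isolated job is not a problem.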
In a real scenario the relative difference in index cardinality for the tables involved in UMQ will look quite different and will not change as rapidly so I have decided that I don't really need to do anything about it.
Thank you very much for the analysis and answer. I had been chasing this issue for several days during CI on MariaDB 10.1 and Bacula server 9.4 (Debian Buster).
The situation was that after a fresh server installation during a CI cycle, the first two tests (backup and restore) ran smoothly on the not-yet-restarted MariaDB server, and only the third test showed that one particular UMQ took about 20 minutes (building the directory tree during the restore process from a table with about 30k rows).
Unless the MariaDB server was restarted or the table was analyzed, the problem would not go away. ANALYZE TABLE or the restart changed the cardinality of the fields and the internal query processing exactly as described in the linked article.