We have a query which most of the time takes about 150ms to complete. Thousands go through the system each day without any issue. But every few days, something happens, and one of these queries suddenly takes roughly 30 minutes to complete. Any subsequent queries also run slow. The only way we have been able to recover is to kill every single one of these queries. As soon as we do that, any subsequent queries again run at the usual 150ms speed. For security reasons, I am not allowed to post the query itself. But it is nothing special.
The DB is MySQL 5.5.40 using the InnoDB engine. During this period, all the usual system resources look fine - memory, CPU, disk space, disk I/O, network I/O.
Can someone give me some ideas about how I can troubleshoot the nature of this issue? I do not believe it is the query, since it seems to work just great 99% of the time. So I am thinking there is some kind of MySQL bug or a weird race condition going on.
This appears to have been some kind of issue with the query planner in MySQL 5.5. Since our upgrade to 5.6, the problem has gone away. Also, the EXPLAIN output for the query is completely different.
This morning, one of our Aurora clusters suddenly began experiencing high latency and slower running queries, and was being reported as exceeding capacity - with up to 20 sessions for the db.r5.large instance, which has only 2 CPUs.
There were no code changes, no deploy, no background process or any other cause we can identify. The higher latency is intermittent, occurring every 10 minutes and lasting for about as long. The Aurora monitoring isn't helping much, the only change of note being higher latency on all queries (selects, updates and deletes).
Under Performance Metrics, when usage spikes, we see that the 20 sessions are attributed almost solely to the io/file/myisam/kfile Wait. Researching online has yielded very little, so I'm somewhat stumped as to what this means and how to go about getting to the cause of the issue. Looking at the SQL queries run during the spikes, their slow run times appear to be a symptom of the intermittent issue rather than the cause of it.
So my question is: can anyone explain what the 'myisam/kfile' Wait is, and how I can use this knowledge to help diagnose the cause of the problem here?
My feeling is that it's one of those rare occurrences where an AWS instance inexplicably goes rogue at a level below which we can directly control, and is only solved by spinning up a new instance (even where all else is equal from a configuration and code perspective). All the same, I'd love to better understand the issue here, especially when none of our DB tables are MyISAM, all being InnoDB.
Is there a table called kfile? How big is it? What operations are being performed?
While the problem is occurring, do SHOW FULL PROCESSLIST; to see what is running. That may give a good clue.
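If you capture that output from a script, a small filter makes the interesting rows stand out. The following is only a sketch: it assumes the rows have already been fetched as dicts keyed by the usual SHOW FULL PROCESSLIST column names (Id, Command, Time, State, Info); the function name and the sample rows are made up for illustration.

```python
def longest_running(rows, min_seconds=5):
    """Return non-idle processlist rows running at least min_seconds, slowest first."""
    active = [
        r for r in rows
        if r["Command"] != "Sleep" and (r["Time"] or 0) >= min_seconds
    ]
    return sorted(active, key=lambda r: r["Time"], reverse=True)


# Hypothetical sample rows, shaped like the output of SHOW FULL PROCESSLIST.
sample = [
    {"Id": 11, "Command": "Sleep", "Time": 900, "State": "", "Info": None},
    {"Id": 12, "Command": "Query", "Time": 1800, "State": "Sending data", "Info": "SELECT ..."},
    {"Id": 13, "Command": "Query", "Time": 2, "State": "", "Info": "SELECT 1"},
    {"Id": 14, "Command": "Query", "Time": 40, "State": "Locked", "Info": "UPDATE ..."},
]

for row in longest_running(sample):
    print(row["Id"], row["Time"], row["State"], row["Info"])
```

Idle connections (Command = Sleep) are skipped because their Time only measures how long they have been idle.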
If the slowlog is turned on, look at it shortly after the problem has subsided; probably the naughty query will be at the end of the list. Publish pt-query-digest path_to_slowlog. The first one or two queries are very likely to be the villains.
Check SHOW ENGINE INNODB STATUS;. Near the front will be the "latest deadlock". That may be a clue.
In most situations most of the graphs don't provide any useful information. When something does go wrong, it is not obvious which graph to look at. I would look for graphs that look "different" in sync with the problem. The one you show us perhaps indicates that there is a 20-second timeout, and everyone is stuck until they hit that timeout.
Do you run any ALTERs? When do backups occur?
Are you using MyISAM? Don't. That ENGINE does not allow concurrency, hence could lead to a bunch of queries piling up. More on converting to InnoDB: http://mysql.rjweb.org/doc.php/myisam2innodb
I'm really new to server-side and I've been administrating a dedicated server for my website.
At peak hours the website is very slow. To find out why, I've been monitoring it using htop.
Some really long-lasting (more than a few hours) mysql processes use up to 95% of the CPU!
The thing is I don't know what queries might be the cause of it nor how to monitor it.
I do have a cron job every quarter of an hour that sometimes takes a long time to run, but the slowdowns do not always coincide with the cron runs.
I've heard of a solution using a cron job that kills MySQL processes that run too long, but wouldn't it cause discrepancies in the DB?
Configure MySQL server to log slow queries, then look at them. You need to understand why these queries are so slow.
Most likely, these queries can be sped up by adding proper indexes, but this is not a hard and fast rule - you need to understand what is really happening.
You can kill long-running queries, but if you use the MyISAM engine, this can corrupt your tables. In that case, seriously consider switching to the InnoDB engine. With a transactional engine like InnoDB, currently running transactions will likely be rolled back, but data should not be corrupted.
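If you do go the cron route, here is a rough sketch of the selection step only. It assumes the processlist rows are already available as dicts with Id, Command and Time keys; the function name and the 600-second cutoff are placeholders, not a recommendation. It emits KILL QUERY, which ends the statement but leaves the connection open, and it only prints the statements so you can review them before running anything.

```python
def kill_statements(rows, max_seconds=600):
    """Build KILL QUERY statements for queries running longer than max_seconds."""
    return [
        f"KILL QUERY {int(r['Id'])};"
        for r in rows
        if r["Command"] == "Query" and r["Time"] > max_seconds
    ]


# Hypothetical rows, shaped like the output of SHOW FULL PROCESSLIST.
rows = [
    {"Id": 21, "Command": "Query", "Time": 7200},
    {"Id": 22, "Command": "Sleep", "Time": 9000},
    {"Id": 23, "Command": "Query", "Time": 3},
]

for statement in kill_statements(rows):
    print(statement)
```

The caveat above still applies: on MyISAM tables, killing a write can leave a table needing repair.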
All my websites are running super slow because I tried to optimize the Magento DB in phpMyAdmin in an effort to get the sites to speed UP. They're so slow they might as well be down. I want the pain to stop. Can I kill the optimize, and how would I do that? Or is it better just to wait until it's finished?
For the record, I have a backup of the database.
Usually you can KILL any operation you don't want to finish, but be aware that the time to undo the operation may in fact be longer than it takes to simply complete.
Unless you're running a massive installation it sounds like your database needs to be tuned better. MySQL comes with a default my.cnf configuration that is terrible, barely any memory allocated to it, and runs slowly.
Secondly, you might be running a lot of queries that are slow because they're missing indexes. How much data is in this table you're trying to optimize? SHOW TABLE STATUS will give you an idea.
I use MySQL (Percona XtraDB 5.1 to be exact) as my database of choice. Overall, very impressed with performance. The applications that use it are quite large.
We believe that a query is sometimes causing a backup of threads on the database for whatever reason (i.e., memory/buffers). The server has been tweaked countless times to prevent this so it's literally a 1% problem now, but still very annoying. Unless you are monitoring the database server 24/7 you are unlikely to ever see the cause of the backup.
Is there any recommendation (apart from going through the slow query log) which anyone can suggest to track the problematic queries (i.e., reporting via the application)?
Percona Server with XtraDB actually logs both the timestamp and the execution time in microsecond resolution, so you can find the start and the end of the queries precisely. However, log analysis is probably the wrong approach. You probably need to use Aspersa's stalk+collect tools.
As you point out in your question, your best bet will be the slow query log:
http://dev.mysql.com/doc/refman/5.5/en/slow-query-log.html
You might also want to log this at the app level:
At the beginning of your scripts, keep a note of what you're about to do and when it started. At the end of it, log this information if the time spent processing the request is higher than a certain threshold.
That way, you'll be able to identify problematic sequences of queries rather than individual queries. (Which, incidentally, might reveal that no individual query is slow but some requests might fire gazillions of small queries.)
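A minimal sketch of that idea in Python, assuming your request handler can be wrapped in a context manager. The names timed_request, notes and report, and the threshold values, are all invented for the example; adapt them to whatever your application uses.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("slow_requests")


@contextmanager
def timed_request(label, threshold_s=1.0, report=None):
    """Collect notes during a request and report them only if it ran too long."""
    notes = []                      # the caller appends a line per query or step
    start = time.monotonic()
    try:
        yield notes
    finally:
        elapsed = time.monotonic() - start
        if elapsed >= threshold_s:
            if report is None:
                log.warning("slow request %s: %.3fs, %d queries: %s",
                            label, elapsed, len(notes), notes)
            else:
                report(label, elapsed, notes)


# Usage: a threshold of 0 reports every request, which is handy for a dry run.
with timed_request("checkout", threshold_s=0.0) as notes:
    notes.append("SELECT ... FROM cart")
    notes.append("UPDATE ... stock")
```

Because the whole request is timed, a page that fires hundreds of fast queries shows up just as clearly as one with a single slow query.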
Have a look at this script, which allows you to extract a more abstracted representation of the queries causing the problems.
I usually sort the list by the product of frequency and runtime to get the queries causing the most problems.
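To illustrate the ranking (this is not the script mentioned above, just a sketch of the same idea): literals are replaced with placeholders so that similar queries group together, then groups are sorted by total time, which is the same thing as frequency times average runtime. The fingerprint rules and the sample entries are assumptions and far cruder than a real log analyzer.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Crude normalization: replace literals so similar queries group together."""
    sql = re.sub(r"'[^']*'", "?", sql)      # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)      # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def rank_queries(entries):
    """entries: iterable of (sql, seconds). Returns (fingerprint, count, total_seconds),
    sorted by total time, i.e. frequency multiplied by average runtime."""
    stats = defaultdict(lambda: [0, 0.0])
    for sql, seconds in entries:
        bucket = stats[fingerprint(sql)]
        bucket[0] += 1
        bucket[1] += seconds
    ranked = [(fp, count, total) for fp, (count, total) in stats.items()]
    return sorted(ranked, key=lambda item: item[2], reverse=True)


# Hypothetical (query, runtime in seconds) pairs, e.g. parsed from a slow log.
entries = [
    ("SELECT * FROM users WHERE id = 7", 0.25),
    ("SELECT * FROM orders WHERE note = 'late'", 0.5),
    ("SELECT * FROM users WHERE id = 42", 0.25),
    ("SELECT * FROM users  WHERE id = 9", 0.25),
]

for fp, count, total in rank_queries(entries):
    print(f"{total:.2f}s total, {count} calls: {fp}")
```

In this sample the users query ranks first even though each call is faster than the orders query, because it runs three times.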
NB: recording the actual start and end of the queries is irrelevant to measuring the queries actually causing locks. From the manual: "The time to acquire the initial table locks is not counted as execution time".
You just need to fix the slow stuff.
This is the most puzzling MySQL problem that I've encountered in my career as an administrator. Can anyone with MySQL mastery help me with this?
Right now, I run an application that queries my MySQL/InnoDB tables many times a second. These queries are simple and optimized -- either single row inserts or selects with an index.
Usually, the queries are super fast, running under 10 ms. However, once every hour or so, all the queries slow down. For example, at 5:04:39 today, a bunch of simple queries all took more than 1-3 seconds to run, as shown in my slow query log.
Why is this the case, and what do you think the solution is?
I have some ideas of my own: maybe the hard drive is busy during that time? I do run on a cloud server (Rackspace), but I have innodb_flush_log_at_trx_commit set to 0 and tons of buffer memory (10x the table size on disk). So the inserts and selects should be done from memory, right?
Has anyone else experienced something like this before? I've searched all over this forum and others, and it really seems like no other MySQL problem I've seen before.
There are many reasons for sudden stalls. For example - even if you are using innodb_flush_log_at_trx_commit=0, InnoDB will need to pause briefly as it extends the size of data files.
My experience with the smaller instance types on Rackspace is that IO is completely awful. I've seen random writes (which should take 10ms) take 500ms.
There is nothing built into stock MySQL that will help you identify the problem more easily. What you might want to do is take a look at Percona Server's slow query log enhancements. There's a specific feature called "profiling_server" which can break down time:
http://www.percona.com/docs/wiki/percona-server:features:slow_extended#changes_to_the_log_format