Aurora database exceeding capacity due to "/io/file/myisam/kfile" Wait - mysql

This morning, one of our Aurora clusters suddenly began experiencing high latency and slower-running queries, and was being reported as exceeding capacity, with up to 20 sessions on a db.r5.large instance that has only 2 CPUs.
There were no code changes, no deploys, no background processes, or any other cause we can identify. The higher latency is intermittent, occurring roughly every 10 minutes and lasting for about as long. The Aurora monitoring isn't helping much; the only change of note is higher latency on all queries (selects, updates and deletes).
Under Performance Metrics, during the usage spikes we're seeing that the 20 sessions are attributed almost solely to the io/file/myisam/kfile Wait. Researching online has yielded very little, so I'm somewhat stumped as to what this means and how to go about getting to the cause of the issue. Looking at the SQL queries run during the spikes, their slow run time appears to be caused by the intermittent issue, as opposed to being the cause of it.
So my question is: can anyone explain what the 'myisam/kfile' Wait is, and how I can use this knowledge to help diagnose the cause of the problem here?
My feeling is that it's one of those rare occurrences where an AWS instance inexplicably goes rogue at a level below what we can directly control, and is only solved by spinning up a new instance (even where all else is equal from a configuration and code perspective). All the same, I'd love to better understand the issue here, especially since none of our DB tables are MyISAM; they are all InnoDB.

Is there a table called kfile? How big is it? What operations are being performed?
While the problem is occurring, do SHOW FULL PROCESSLIST; to see what is running. That may give a good clue.
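A minimal sketch of what to capture during a spike, assuming performance_schema is enabled on the cluster (the idle filter and column trimming are arbitrary choices):

    -- Non-idle sessions, longest-running first
    SELECT id, user, host, db, command, time, state, LEFT(info, 120) AS query
    FROM information_schema.processlist
    WHERE command <> 'Sleep'
    ORDER BY time DESC;

    -- Cumulative cost of the kfile wait since startup (timers are in picoseconds)
    SELECT event_name, count_star, sum_timer_wait / 1e12 AS total_wait_seconds
    FROM performance_schema.events_waits_summary_global_by_event_name
    WHERE event_name = 'wait/io/file/myisam/kfile';

Running the second query before and after a spike and diffing the counters shows whether that wait is actually accumulating during the slow windows.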
If the slow log is turned on, look at it shortly after the problem has subsided; the naughty query will probably be near the end of the list. Run pt-query-digest path_to_slowlog and publish the output. The first one or two queries are very likely to be the villains.
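If the slow log isn't on yet, a rough sketch of enabling it follows; on Aurora these settings normally live in the DB parameter group rather than being set at runtime, so treat this as illustrative:

    SET GLOBAL slow_query_log  = 'ON';
    SET GLOBAL long_query_time = 1;        -- seconds; lower it while chasing the spikes
    SET GLOBAL log_output      = 'TABLE';  -- or 'FILE' if you plan to run pt-query-digest

    -- With log_output = 'TABLE', recent entries are queryable directly:
    SELECT start_time, query_time, rows_examined, LEFT(sql_text, 120) AS sql_text
    FROM mysql.slow_log
    ORDER BY start_time DESC
    LIMIT 20;

pt-query-digest works against the file-format log, which on RDS/Aurora you download via the console or API.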
Check SHOW ENGINE INNODB STATUS;. Near the front will be the "latest deadlock". That may be a clue.
In most situations most of the graphs don't provide any useful information. When something does go wrong, it is not obvious which graph to look at. I would look for graphs that look "different" in sync with the problem. The one you show us perhaps indicates that there is a 20-second timeout, and everyone is stuck until they hit that timeout.
Do you run any ALTERs? When do backups occur?
Are you using MyISAM? Don't. That ENGINE uses table-level locking, which hurts concurrency and can lead to a bunch of queries piling up. More on converting to InnoDB: http://mysql.rjweb.org/doc.php/myisam2innodb
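To confirm whether anything is actually MyISAM, and to convert stragglers, something like the following should do (the table name in the ALTER is hypothetical). Note that even when every user table is InnoDB, older MySQL versions write internal on-disk temporary tables as MyISAM, which is one way a myisam/kfile wait can show up on an "all InnoDB" system.

    -- List any remaining MyISAM tables outside the system schemas
    SELECT table_schema, table_name, engine, table_rows
    FROM information_schema.tables
    WHERE engine = 'MyISAM'
      AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

    -- Convert one (hypothetical) straggler
    ALTER TABLE mydb.legacy_table ENGINE = InnoDB;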

Related

AWS MySQL RDS huge CPU spike and rapid storage loss - possible attack?

As you can see from the attached screenshot, we have experienced a sudden CPU spike and storage loss. We nearly lost all free storage and had to increase it manually.
When we check the database size, it is still the same as before this occurred, so it seems it's not database related. We have checked a lot of things (slow logs, etc.) but couldn't find the problem.
Is it possible there has been an attack? Any other ideas on why this happened and how we can recover our free storage?
Thank you.
It's hard to say what the exact issue is, but looking at the graph, it looks like you have some huge query running against your database that is filling up the temporary space. When it runs out of room, the query gets killed, and that then flushes a bunch of writes to disk, which could either be related to the query/statement or simply be unrelated queued inserts/updates.
You need to look at the slow query log, if it's enabled, to see whether there's anything unusual there, and also check your application(s) to see whether they were trying to execute a ridiculous query/statement that was hammering the database.
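One hedged way to see whether large statements are spilling temporary tables to disk (the counters are cumulative since the last restart, so sample them twice and diff):

    SHOW GLOBAL STATUS LIKE 'Created_tmp%';        -- compare Created_tmp_disk_tables to Created_tmp_tables
    SHOW GLOBAL VARIABLES LIKE 'tmp_table_size';
    SHOW GLOBAL VARIABLES LIKE 'max_heap_table_size';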

Slow but steady rise of MYSQL cpu load

I'm running a website on a cloud server. The website is built entirely around a (rather large) database. Over the last two weeks I've noticed a steady rise in the CPU load of MySQL and I'm not sure why. It had been at 15-16% for a while, and then it started climbing by 1-2% a day. Currently we are at 27%, and though there has been a rise in traffic, it wasn't that big. What could be causing this?
Thanks!
Check your MySQL slow log. Don't forget to also log the queries not using indexes.
Fix any queries you find in there.
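For reference, a sketch of the settings involved (exact values are a judgment call, and on managed platforms these may need to go in the config or parameter group instead):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 2;                -- seconds
    SET GLOBAL log_queries_not_using_indexes = 'ON';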
The problem actually was the build-up of several caches. If someone else encounters this problem, I suggest looking at the red values under 'Status' -> 'All status variables' in phpMyAdmin. Enlarging tmp_table_size and flushing the query cache did wonders for me.
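As a sketch of the same fix in SQL terms (the sizes are examples only; both limits apply, so they are usually raised together, and the query cache statements only exist up to MySQL 5.7):

    SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
    SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;

    FLUSH QUERY CACHE;   -- defragments the cache without emptying it
    RESET QUERY CACHE;   -- empties it completely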
Good luck!

MySQL sudden performance drop

One of the projects I'm working on is suffering from a recent slowdown in the DB (since last week).
Code hasn't changed, and the data may have changed a little but not significantly, so at this stage I'm just exploring DB configuration (as we are on a managed hosting platform and have had some similar issues in the past).
Unfortunately I'm out of my depth a bit... could anyone please take a look at the output from SHOW STATUS below and see if any of it sets alarm bells off? The only thing I've spotted so far is that key_reads vs key_read_requests don't seem quite right.
Our setup is 2 servers replicated, with all reads done from the slave. Queries which run in 0.01 secs on the master are taking up to 7 secs on the slave... and this has only started recently.
All tables are MyISAM and inserts/updates are negligible (updates happen out of hours). The front end is an ASP.NET website (.NET 4) running on IIS 8 with a devart component for data access.
Thanks!
SHOW STATUS output is here: http://pastebin.com/w6xDeD48
Other factors can impact MySQL performance:
virus scanning software -> I had an issue with McAfee bogging down performance because it was scanning temporary table files
Other services running on server?
Have you tried an EXPLAIN SELECT on the query? That would give you an indication of the index size. As #Liath indicated, the indexes may be out of date on the slave but fine on the master.
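A quick sketch of both checks (the query in the EXPLAIN is hypothetical; substitute one that is fast on the master and slow on the slave):

    -- Compare the plan on the master and the slave
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

    -- Key cache effectiveness for MyISAM: a high Key_reads / Key_read_requests ratio
    -- suggests key_buffer_size is too small on this server
    SHOW GLOBAL STATUS LIKE 'Key_read%';
    SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';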
Just an update in case it ever helps anyone else in future - it looks like the culprit might be the query cache for now, as we are seeing better performance with it turned off (still not quite as good as we had before the issue).
So we will try to tune it a little and get back to great performance!
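For anyone trying the same thing, a sketch of inspecting and disabling the query cache at runtime (MySQL 5.x only; the runtime behaviour of query_cache_type varies by version, so a config change plus restart is the safer route):

    SHOW GLOBAL VARIABLES LIKE 'query_cache%';

    SET GLOBAL query_cache_type = 0;   -- stop using the cache
    SET GLOBAL query_cache_size = 0;   -- release the memory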

How to track down problematic MySQL queries?

I use MySQL (Percona Server with XtraDB 5.1, to be exact) as my database of choice. Overall, I'm very impressed with its performance. The applications that use it are quite large.
We believe that a query is sometimes causing a backlog of threads on the database for whatever reason (e.g., memory/buffers). The server has been tweaked countless times to prevent this, so it's literally a 1% problem now, but still very annoying. Unless you are monitoring the database server 24/7, you are unlikely to ever see the cause of the backlog.
Is there any recommendation (apart from going through the slow query log) which anyone can suggest to track the problematic queries (i.e., reporting via the application)?
Percona Server with XtraDB actually logs both the timestamp and the execution time in microsecond resolution, so you can find the start and the end of the queries precisely. However, log analysis is probably the wrong approach. You probably need to use Aspersa's stalk+collect tools.
As you point out in your question, your best bet will be the slow query log:
http://dev.mysql.com/doc/refman/5.5/en/slow-query-log.html
You might also want to log this at the app level:
At the beginning of your scripts, keep a note of what you're about to do and when it started. At the end of it, log this information if the time spent processing the request is higher than a certain threshold.
That way, you'll be able to identify problematic sequences of queries rather than individual queries. (Which, incidentally, might reveal that no individual query is slow but some requests might fire gazillions of small queries.)
Have a look at this script, which allows you to extract a more abstracted representation of the queries causing the problems.
I usually sort the list by the product of frequency and runtime to get the queries causing the most problems.
NB: recording the actual start and end of the queries is irrelevant to measuring the queries actually causing locks; from the manual: "The time to acquire the initial table locks is not counted as execution time."
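On servers new enough to have performance_schema statement digests (MySQL 5.6+ / Percona Server 5.6+, so not the 5.1 in this thread), the same ranking can be pulled without log parsing; a sketch:

    SELECT digest_text,
           count_star                       AS executions,
           ROUND(sum_timer_wait / 1e12, 2)  AS total_seconds,
           ROUND(avg_timer_wait / 1e12, 4)  AS avg_seconds
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY sum_timer_wait DESC
    LIMIT 10;

Ordering by sum_timer_wait is effectively frequency multiplied by average runtime, which matches the sorting suggested above.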
You just need to fix the slow stuff.

A Most Puzzling MySQL Problem: Queries Sporadically Slow

This is the most puzzling MySQL problem that I've encountered in my career as an administrator. Can anyone with MySQL mastery help me a bit with this?
Right now, I run an application that queries my MySQL/InnoDB tables many times a second. These queries are simple and optimized -- either single row inserts or selects with an index.
Usually, the queries are super fast, running under 10 ms. However, once every hour or so, all the queries slow down. For example, at 5:04:39 today, a bunch of simple queries all took more than 1-3 seconds to run, as shown in my slow query log.
Why is this the case, and what do you think the solution is?
I have some ideas of my own: maybe the hard drive is busy during that time? I do run a cloud server (Rackspace), but I have innodb_flush_log_at_trx_commit set to 0 and tons of buffer memory (10x the table size on disk), so the inserts and selects should be served from memory, right?
Has anyone else experienced something like this before? I've searched all over this forum and others, and it seems unlike any other MySQL problem I've come across.
There are many reasons for sudden stalls. For example, even if you are using innodb_flush_log_at_trx_commit=0, InnoDB will need to pause briefly as it extends the size of its data files.
My experience with the smaller instance types on Rackspace is that IO is completely awful. I've seen random writes (which should take 10ms) take 500ms.
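A hedged starting point is to check the settings that commonly feed into periodic InnoDB stalls (reasonable values depend entirely on the workload and the underlying storage):

    SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    SHOW GLOBAL VARIABLES LIKE 'innodb_autoextend_increment';   -- how much the system tablespace grows at a time
    SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity';
    SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method';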
There is nothing built into MySQL that will help you identify the problem more easily. What you might want to do is take a look at Percona Server's slow query log enhancements. There's a specific feature called "profiling_server" which can break down where the time goes:
http://www.percona.com/docs/wiki/percona-server:features:slow_extended#changes_to_the_log_format