How to track down problematic MySQL queries? - mysql

I use MySQL (Percona Server with XtraDB, 5.1 to be exact) as my database of choice. Overall, very impressed with performance. The applications that use it are quite large.
We believe that a query is sometimes causing a backup of threads on the database for whatever reason (e.g., memory/buffers). The server has been tweaked countless times to prevent this, so it's a 1% problem now, but still very annoying. Unless you are monitoring the database server 24/7, you are unlikely to ever see the cause of the backup.
Is there any recommendation (apart from going through the slow query log) for tracking down the problematic queries, e.g. by reporting via the application?

Percona Server with XtraDB actually logs both the timestamp and the execution time in microsecond resolution, so you can find the start and the end of each query precisely. However, log analysis is probably the wrong approach here; you likely want Aspersa's stalk and collect tools instead.

As you point out in your question, your best bet will be the slow query log:
http://dev.mysql.com/doc/refman/5.5/en/slow-query-log.html
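For reference, on MySQL/Percona Server 5.1 and later the slow log can usually be switched on at runtime rather than via a restart; a minimal sketch (the file path and threshold here are illustrative, not recommendations):
    -- Enable the slow query log dynamically and lower the threshold.
    SET GLOBAL slow_query_log = 1;
    SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
    SET GLOBAL long_query_time = 0.5;   -- seconds; fractional values are allowed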
You might also want to log this at the app level:
At the beginning of your scripts, keep a note of what you're about to do and when it started. At the end of it, log this information if the time spent processing the request is higher than a certain threshold.
That way, you'll be able to identify problematic sequences of queries rather than individual queries. (Which, incidentally, might reveal that no individual query is slow, but that some requests fire gazillions of small queries.)

Have a look at this script, which lets you extract a more abstracted representation of the queries causing the problems.
I usually sort the list by the product of frequency and runtime to get the queries causing the most problems.
NB: recording the actual start and end of the queries is irrelevant to measuring the queries actually causing locks. From the manual: "The time to acquire the initial table locks is not counted as execution time."
You just need to fix the slow stuff.
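As an aside, on MySQL or Percona Server 5.6+ the performance_schema statement digests give you the same "frequency × runtime" ranking without an external script; a rough sketch:
    -- Top statement digests by total time consumed (count * average),
    -- i.e. the queries responsible for the most aggregate load.
    -- Timer columns are in picoseconds, hence the division.
    SELECT DIGEST_TEXT,
           COUNT_STAR                     AS exec_count,
           SUM_TIMER_WAIT / 1000000000000 AS total_time_s,
           AVG_TIMER_WAIT / 1000000000000 AS avg_time_s
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 10;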

Related

Aurora database exceeding capacity due to "/io/file/myisam/kfile" Wait

This morning, one of our Aurora clusters suddenly began experiencing high latency and slower-running queries, and was being reported as exceeding capacity - with up to 20 sessions on the db.r5.large instance, which has only 2 CPUs.
There were no code changes, no deploy, no background process, or any other cause we can identify. The higher latency is intermittent, occurring every 10 minutes and lasting for about as long. The Aurora monitoring isn't helping much, the only change of note being higher latency on all queries (selects, updates and deletes).
Under Performance Metrics, during the usage spikes we're seeing that the roughly 20 sessions are attributed almost solely to the io/file/myisam/kfile wait. Researching online has yielded very little, so I'm somewhat stumped as to what this means and how to go about getting to the cause of the issue. Looking at the SQL queries run during the spikes, their slow run times appear to be caused by the intermittent issue, as opposed to being the cause of it.
So my question is: can anyone explain what the 'myisam/kfile' Wait is, and how I can use this knowledge to help diagnose the cause of the problem here?
My feeling is that it's one of those rare occurrences where an AWS instance inexplicably goes rogue at a level below what we can directly control, and is only solved by spinning up a new instance (even where all else is equal from a configuration and code perspective). All the same, I'd love to better understand the issue here, especially since none of our DB tables are MyISAM; they are all InnoDB.
Is there a table called kfile? How big is it? What operations are being performed?
While the problem is occurring, do SHOW FULL PROCESSLIST; to see what is running. That may give a good clue.
If the slow log is turned on, look at it shortly after the problem has subsided; the naughty query will probably be near the end of the list. Run pt-query-digest path_to_slowlog and publish the output; the first one or two queries are very likely to be the villains.
Check SHOW ENGINE INNODB STATUS;. Near the front will be the "latest deadlock". That may be a clue.
In most situations most of the graphs don't provide any useful information. When something does go wrong, it is not obvious which graph to look at. I would look for graphs that look "different" in sync with the problem. The one you show us perhaps indicates that there is a 20-second timeout, and everyone is stuck until they hit that timeout.
Do you run any ALTERs? When do backups occur?
Are you using MyISAM? Don't. That engine only does table-level locking, which severely limits concurrency and can lead to a bunch of queries piling up. More on converting to InnoDB: http://mysql.rjweb.org/doc.php/myisam2innodb
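If performance_schema is available, a couple of quick checks along these lines may help confirm or rule out the MyISAM angle (the schema filter is illustrative):
    -- Which tables, if any, still use MyISAM?
    SELECT table_schema, table_name, engine
    FROM information_schema.tables
    WHERE engine = 'MyISAM'
      AND table_schema NOT IN ('information_schema', 'performance_schema');

    -- How much time is being spent on MyISAM file I/O waits?
    SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT
    FROM performance_schema.events_waits_summary_global_by_event_name
    WHERE EVENT_NAME LIKE 'wait/io/file/myisam/%'
    ORDER BY SUM_TIMER_WAIT DESC;
Note that on stock MySQL 5.6/5.7 several tables in the mysql system schema are themselves MyISAM, so they can show up here even when all application tables are InnoDB.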

Determine if mysql index is being used on production site, so it can be safely removed

We have some large indexes that we suspect are not being used in our Rails site, and would like to drop them to save space and computation. However, doing so could be catastrophic if it turns out they were being used. How can we confirm they are not being used?
One option is to log all queries for a time and run EXPLAIN on any of them that use the table in question. But I've heard EXPLAIN can occasionally be inaccurate. We would also have to collect queries for a few hours to be sure, which is quite a lot of log to store and process.
If there were a way to temporarily disable an index, we'd be willing to do that, as long as we could quickly re-enable it if problems arose. But I don't see a way to do that universally; you can only specify an IGNORE INDEX hint on individual SQL statements.
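(For reference, the per-statement hint looks like this; the table and index names are made up.)
    -- Compare the plan and timing with and without the index.
    SELECT *
    FROM orders IGNORE INDEX (idx_orders_customer_id)
    WHERE customer_id = 42;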
Short Answer:
With MySQL 5.6 it is possible to do this by using the PERFORMANCE_SCHEMA and ps_helper.
ps_helper is a series of views and routines that present the data in PERFORMANCE_SCHEMA in more useful ways. The VIEW you are after is this one: http://www.markleith.co.uk/ps_helper/#schema_unused_indexes
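If you'd rather not install ps_helper, the underlying idea is roughly this query against PERFORMANCE_SCHEMA (a sketch of what the unused-indexes view does; it only reflects index usage since the server was last restarted or the statistics were truncated):
    -- Indexes that have recorded no I/O since statistics were last reset.
    SELECT object_schema, object_name, index_name
    FROM performance_schema.table_io_waits_summary_by_index_usage
    WHERE index_name IS NOT NULL
      AND index_name != 'PRIMARY'
      AND count_star = 0
      AND object_schema NOT IN ('mysql', 'performance_schema')
    ORDER BY object_schema, object_name;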
More detailed:
The idea of disabling an index is called 'invisible indexes' in Oracle. MySQL doesn't support them, but I'd love to see this feature as well - I filed http://bugs.mysql.com/bug.php?id=70299 a couple of months ago on it.
Removing unused indexes is very important, as it can help optimizer performance. I have a war-story of using the ps_helper + unused_indexes view here: http://www.tocker.ca/2013/09/05/migrating-from-postgresql-to-mysql.html
There's only one procedure for this: Testing, testing, testing and benchmarking.
The primary function of indexes, apart from ensuring uniqueness, is to expedite data access. If all operations were O(1) there would be no need for indexes in the first place.
You need to have another instance of your application where you can experiment with adding, removing, and adjusting your indexes. It's impossible to replicate both real-world hardware and real-world loads exactly, but you can come pretty close if you pay careful attention to how your hardware is configured and how your application is exercised, and can reproduce a load that's reasonably similar.
If you have application logs that are sufficiently detailed, sometimes you can replay these operations. Read operations are easier to replay than writes, but both can be simulated if you've got enough time to invest in this.
For any application running at scale, you want to know where performance falls off a cliff. So long as your production load is well below this level you'll be okay. If you don't know where the cliff is, you might hit it without any warning.
Remember that indexes not only take up space (a minor issue); the size of an index also affects how expensive it is to update, making writes more costly. It's ideal to have only the ones you need, but it's almost impossible to identify which are actually used. There are many that might be used in theory but never are, and some that shouldn't be used but are, because the query optimizer is a little dumb sometimes.

mysql performance benchmark

I'm thinking about moving our production environment from a self-hosted solution to Amazon AWS. I took a look at the different services and thought about using RDS as a replacement for our MySQL instances. The hardware we're using for our master seems to be better than the best hardware we can get when using RDS (Quadruple Extra Large DB Instance). Since I can't simply move our production environment to AWS and see if the performance is still good enough, I'd love to run some tests in advance.
I thought about creating a full query log from our current master, configuring the RDS instance, and replaying the full query log against it. Actually, I don't even know if this kind of testing is a good idea, but I guess you'll tell me if there are better ways to make sure the performance of MySQL won't drop dramatically when making the move to RDS.
Is there a preferred tool to replay the full query log?
At what metrics should I take a look while running the test?
CPU usage?
memory usage?
disk usage?
query time?
anything else?
Thanks in advance
I'd recommend against replaying the query log - it's almost certainly not going to give you the information you want, and will take a significant amount of effort.
Firstly, you'd need to prepare your database so that replaying the query log won't break constraints when inserting, updating or deleting data, and that subsequent "select" queries will find the records they should find. This is distinctly non-trivial on anything other than a toy database - just taking a back-up and replaying the log doesn't necessarily guarantee the ordering of DML statements will match what happened on production. This may well give you a false sense of comfort - all your select statements return in a few milliseconds, because the data they're looking for doesn't exist!
Secondly, load and performance testing rarely works by replaying what happened on production - that doesn't (usually) reflect the peak conditions that will bring your system to its knees. For instance, most production systems run happily most of the time at <50% capacity, but go through spikes during the day, when they might reach 80% or more of capacity - that's what you care about, can your new environment handle the peaks.
My recommendation would be to use a tool like JMeter to write performance scripts (either directly against the database using the JDBC driver, or through the front end if you've got a web application). Your performance scripts should reflect the behaviour you see from users, and be parameterized so they're not dependent on the order in which records are created.
Set yourself some performance targets (ideally based on current production levels, with a multiplier to cover you against spikes), e.g. "100 concurrent users, with no query taking more than 1 second", and use JMeter to simulate that load. If you reach it first time, congratulations - go home! If not, look at the performance counters to see where the bottleneck is; see if you can alleviate that bottleneck (or tune your queries - your awesome on-premise hardware may be hiding some performance issues). Typical bottlenecks are CPU, RAM, and disk I/O.
Experiment with different test scenarios - "lots of writes", "lots of reads", "lots of reporting queries", and mix them up.
The idea is to understand the bottlenecks on the system, and see how far you are from those bottleneck, and understand what you can do to alleviate them. Once you know that, your decision to migrate will be far more robust.
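On the database side, a few MySQL counters are worth sampling while the JMeter load runs; a minimal sketch (which ones matter most depends on your workload):
    -- Concurrency and load indicators.
    SHOW GLOBAL STATUS LIKE 'Threads_running';
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Questions';

    -- Buffer pool effectiveness: reads from disk vs. read requests from memory.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';

    -- Queries that needed full joins or on-disk temporary tables.
    SHOW GLOBAL STATUS LIKE 'Select_full_join';
    SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';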

MySQL scale up or scale out?

I have been tasked with investigating reasons why our internal web application is hitting performance problems.
The web application itself is part written in PHP and part written in Perl, and we have a MySQL database which is where I believe the source of performance hit is occurring.
We have about 400 users of the system, of which, most are spread across different timezones, so generally there are only ever a max of 30 users online at any one time. The performance problems have crept up on us, particularly over the past year as the database keeps growing.
The system is running on a single 32-bit Debian server - 6GB of RAM, with 8 x 2.4GHz Intel CPUs. This is probably not hefty enough for the job in hand. However, even at times when I am the only user online, page loading time can still be slow.
I'm trying to determine whether we need to scale up or scale out. Firstly, I'd like to know how well our hardware is coping with the demands placed upon it. And secondly, whether it might be worth scaling out and creating some replication slaves to balance the load.
There are a lot of tools available on the internet - probably a bit too many to investigate. Can anyone recommend any tools that can provide some profiling/performance monitoring to help me on my quest?
Many thanks,
ns
Your slow-down seems to be related to the data and not to the number of concurrent users.
Properly indexed queries tend to scale logarithmically with the amount of data - i.e. doubling the data increases the query time by some constant C, doubling the data again adds the same C, doubling again adds the same C, and so on. Before you know it, you have humongous amounts of data, yet your queries are only a little slower.
If the slow-down wasn't that gradual in your case (i.e. it was linear in the amount of data, or worse), this might be an indication of badly optimized queries. Throwing more iron at the problem will postpone it, but unless you have an unlimited budget, you'll have to actually solve the root cause at some point:
1. Measure the query performance on the actual data to identify slow queries.
2. Examine the execution plans for possible improvements.
3. If necessary, learn about indexing, clustering, covering and other performance techniques.
4. Finally, apply that knowledge to the queries you identified in steps (1) and (2).
If nothing else helps, think about your data model. Sometimes a "perfectly" normalized model is not the best-performing one, so a little judicious denormalization might be warranted.
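As a minimal illustration of steps 2-4 (the table, column and index names here are invented):
    -- Step 2: examine the execution plan of a suspect query.
    EXPLAIN
    SELECT o.id, o.total
    FROM orders o
    WHERE o.customer_id = 42
      AND o.created_at >= '2012-01-01';

    -- Steps 3-4: if the plan shows a full table scan, a composite index
    -- matching the WHERE clause may help (re-run EXPLAIN to verify).
    ALTER TABLE orders ADD INDEX idx_customer_created (customer_id, created_at);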
The easy (lazy) way, if you have the budget, is just to throw some more iron at it.
A better way, before deciding where or how to scale, would be to identify the bottlenecks. Is every page load slow? Or just particular pages? If it is just a few pages, then invest in a profiler (for PHP both Xdebug and the Zend Debugger can do profiling). I would also (if you haven't) invest in a test system that is as similar as possible to the live system, to run diagnostics on.
You could also look at gathering some stats, both at the server level with a program such as sar (from the sysstat package) and at the DB level (have you got the slow query log running?).
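For the slow query log question, a quick check (MySQL 5.1 and later):
    -- Is the slow query log enabled, where does it write, and at what threshold?
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
    SHOW GLOBAL VARIABLES LIKE 'long_query_time';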

SQL query optimization and debugging

The question is about best practice.
How to perform a reliable SQL query test?
That is, the question is about optimization of the DB structure and the SQL query itself, not about system and DB performance, buffers, or caches.
When you have a complicated query with a lot of joins etc., one day you need to understand how to optimize it, and you come to the EXPLAIN command (mysql::explain, postgresql::explain) to study the execution plan.
After tuning the DB structure you execute the query to see any performance changes, but here you run into multiple levels of optimization/buffering/caching. How can you avoid this? I need the pure query execution time and to be sure it is not affected.
If you know of different practices for different servers, please specify explicitly: MySQL, PostgreSQL, MSSQL, etc.
Thank you.
For Microsoft SQL Server you can use DBCC FREEPROCCACHE (to drop compiled query plans) and DBCC DROPCLEANBUFFERS (to purge the data cache) to ensure that you are starting from a completely uncached state. Then you can profile both uncached and cached performance, and determine your performance accurately in both cases.
Even so, a lot of the time you'll get different results at different times depending on how complex your query is and what else is happening on the server. It's usually wise to test performance multiple times in different operating scenarios to be sure you understand what the full performance profile of the query is.
I'm sure many of these general principles apply to other database platforms as well.
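For MySQL 5.x (which still has the query cache), the rough equivalents are a per-statement hint plus a cache reset; note there is no comparable command to flush the InnoDB buffer pool, so warm-cache timings are often the more realistic measurement. The table and column names below are made up:
    -- Bypass the query cache for a single measurement.
    SELECT SQL_NO_CACHE *
    FROM some_table              -- hypothetical table
    WHERE some_column = 'x';

    -- Discard all cached result sets (query cache only, not the buffer pool).
    RESET QUERY CACHE;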
In the PostgreSQL world you need to flush the database cache as well as the OS cache as PostgreSQL leverages the OS caching system.
See this link for some discussions.
http://archives.postgresql.org/pgsql-performance/2010-08/msg00295.php
Why do you need pure execution time? It depends on so many factors and is almost meaningless on a live server. I would recommend collecting some statistics from the live server and analyzing query execution times using the pgFouine tool (it's for PostgreSQL), then making decisions based on that. The report will show you exactly what you need to tune and how effective your changes were.