How to make execution plans as most accurate as possible - mysql

every night I make a dump of my PROD environment (mysql 8.0.23) to import it in my TEST environment. The hardware and the mysql config is the same on both systems
I noticed that some query (executed just after the dump, when the 2 databases are still aligned) behave in a different way on the 2 systems, with different execution plans and, of course, different performances
My guess is that the TEST environment performs better because every table is created from scratch, so the stats for tables and indexes are more accurate, while PROD is loaded with tons of inserts, updates, deletes during the day.
Is there any way to make PROD behave the same way?
I tryed to run OPTIMIZE and ANALYZE on every table (most of them are InnoDB), but the execution plans are still different
Thanks

The fresh load on TEST needs to do more I/O because the "cache" is initially empty. (Cf InnoDB's buffer_pool.) I would expect that TEST might be up to 10 times as slow. Run the same query again to see it speed up.
EXPLAIN makes estimates; it is usually inconsistent. Don't worry about that.
If PROD is quite busy, but TEST is just running a single query, PROD could be slower. This is due to the other stuff that is going on.
OPTIMIZE is essentially never needed for InnoDB; don't waste time on it.
ANALYZE will sometimes convince EXPLAIN to provide a different query plan. This does not prove much.
Drop and reimport is unlikely to resolve anything.
Show us one example of a notable difference. We are more likely to spot something when given a specific example. Provide
The query
EXPLAIN SELECT -- on both servers
Timing for the queries
SHOW CREATE TABLE for the table(s) involved in the query.

Related

Understanding the MySQL Query Optimizer

We use two MySQL servers with the same setup and master-master replication. They are located behind a loadbalancer and get almost the same traffic, statements etc.
On server1, there are some cron jobs running additionally, which is the only difference.
However, we've seen that the query optimizer in some cases behaves different with the same query on both servers.
In those cases we have to use FORCE_INDEX to get the best results on both servers.
The main questions are:
Is there any metadata that is stored somewhere on the server that is used by the Query Optimizer?
If we have to backup and restore a database (with XtraBackup), will the query optimizer behave in the same way or is it built from scratch?
Thanks for any reply
Joachim
The Optimizer bases its actions partially on "statistics". Those statistics come from explicit ANALYZE TABLE or from certain changes to the table, such as "adding more than 10% to the table".
There is nothing to sync the stats between the two Masters, so they can drift apart. Even running ANALYZE TABLE on both will not necessarily bring them in sync. This is because of the "random probes" that are made into the table to concoct the stats.
FORCE INDEX is risky because "what helps today may hurt tomorrow".
Probably this is what is happening: You have a query that is on the borderline between picking one query plan and another. The Optimizer's analysis says they are about equivalent, but (for various reasons) they are not. This could lead to one server doing one thing (fast query plan) and the other doing something else (poor query plan).
There is no 'reliable' and 'consistent' way to solve your problem.
Let's try to tackle it a different way. Provide the query, the EXPLAIN, and SHOW CREATE TABLE. There may be a better index and/or a reformulation of the query that avoids the problem completely, and possibly runs faster than either of your current query plans.

MySQL server very high load

I run a website with ~500 real time visitors, ~50k daily visitors and ~1,3million total users. I host my server on AWS, where I use several instances of different kind. When I started the website the different instances cost rougly the same. When the website started to gain users the RDS instance (MySQL DB) CPU constantly keept hitting the roof, I had to upgrade it several times, now it have started to take up the main part of the performance and monthly cost (around 95% of (2,8k$/month)). I currently use a database server with 16vCPU and 64GiB of RAM, I also use Multi-AZ Deployment to protect against failures. I wonder if it is normal for the database to be that expensive, or if I have done something terribly wrong?
Database Info
At the moment my database have 40 tables with the most of them have 100k rows, some have ~2millions and 1 have 30 millions.
I have a system the archives rows that are older then 21 days when they are not needed anymore.
Website Info
The website mainly use PHP, but also some NodeJS and python.
Most of the functions of the website works like this:
Start transaction
Insert row
Get last inserted id (lastrowid)
Do some calculations
Updated the inserted row
Update the user
Commit transaction
I also run around 100bots wich polls from the database with 10-30sec interval, they also inserts/updates the database sometimes.
Extra
I have done several things to try to lower the load on the database. Such as enable database cache, use a redis cache for some queries, tried to remove very slow queries, tried to upgrade the storage type to "Provisioned IOPS SSD". But nothing seems to help.
This is the changes I have done to the setting paramters:
I have though about creating a MySQL cluster of several smaller instances, but I don't know if this would help, and I also don't know if this works good with transactions.
If you need any more information, please ask, any help on this issue is greatly appriciated!
In my experience, as soon as you ask the question "how can I scale up performance?" you know you have outgrown RDS (edit: I admit my experience that leads me to this opinion may be outdated).
It sounds like your query load is pretty write-heavy. Lots of inserts and updates. You should increase the innodb_log_file_size if you can on your version of RDS. Otherwise you may have to abandon RDS and move to an EC2 instance where you can tune MySQL more easily.
I would also disable the MySQL query cache. On every insert/update, MySQL has to scan the query cache to see if there any results cached that need to be purged. This is a waste of time if you have a write-heavy workload. Increasing your query cache to 2.56GB makes it even worse! Set the cache size to 0 and the cache type to 0.
I have no idea what queries you run, or how well you have optimized them. MySQL's optimizer is limited, so it's frequently the case that you can get huge benefits from redesigning SQL queries. That is, changing the query syntax, as well as adding the right indexes.
You should do a query audit to find out which queries are accounting for your high load. A great free tool to do this is https://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html, which can give you a report based on your slow query log. Download the RDS slow query log with the http://docs.aws.amazon.com/cli/latest/reference/rds/download-db-log-file-portion.html CLI command.
Set your long_query_time=0, let it run for a while to collect information, then change long_query_time back to the value you normally use. It's important to collect all queries in this log, because you might find that 75% of your load is from queries under 2 seconds, but they are run so frequently that it's a burden on the server.
After you know which queries are accounting for the load, you can make some informed strategy about how to address them:
Query optimization or redesign
More caching in the application
Scale out to more instances
I think the answer is "you're doing something wrong". It is very unlikely you have reached an RDS limitation, although you may be hitting limits on some parts of it.
Start by enabling detailed monitoring. This will give you some OS-level information which should help determine what your limiting factor really is. Look at your slow query logs and database stats - you may have some queries that are causing problems.
Once you understand the problem - which could be bad queries, I/O limits, or something else - then you can address them. RDS allows you to create multiple read replicas, so you can move some of your read load to slaves.
You could also move to Aurora, which should give you better I/O performance. Or use PIOPS (or allocate more disk, which should increase performance). You are using SSD storage, right?
One other suggestion - if your calculations (step 4 above) takes a significant amount of time, you might want look at breaking it into two or more transactions.
A query_cache_size of more than 50M is bad news. You are writing often -- many times per second per table? That means the QC needs to be scanned many times/second to purge the entries for the table that changed. This is a big load on the system when the QC is 2.5GB!
query_cache_type should be DEMAND if you can justify it being on at all. And in that case, pepper the SELECTs with SQL_CACHE and SQL_NO_CACHE.
Since you have the slowlog turned on, look at the output with pt-query-digest. What are the first couple of queries?
Since your typical operation involves writing, I don't see an advantage of using readonly Slaves.
Are the bots running at random times? Or do they all start at the same time? (The latter could cause terrible spikes in CPU, etc.)
How are you "archiving" "old" records? It might be best to use PARTITIONing and "transportable tablespaces". Use PARTITION BY RANGE and 21 partitions (plus a couple of extras).
Your typical transaction seems to work with one row. Can it be modified to work with 10 or 100 all at once? (More than 100 is probably not cost-effective.) SQL is much more efficient in doing lots of rows at once versus lots of queries of one row each. Show us the SQL; we can dig into the details.
It seems strange to insert a new row, then update it, all in one transaction. Can't you completely compute it before doing the insert? Hanging onto the inserted_id for so long probably interferes with others doing the same thing. What is the value of innodb_autoinc_lock_mode?
Do the "users" interactive with each other? If so, in what way?

MySQL query slowing down until restart

I have a service that sits on top of a MySQL 5.5 database (INNODB). The service has a background job that is supposed to run every week or so. On a high level the background job does the following:
Do some initial DB read and write in one transaction
Execute UMQ (described below) with a set of parameters in one transaction.
If no records are returned we are done!
Process the result from UMQ (this is a bit heavy so it is done outside of any DB
transaction)
Write the outcome of the previous step to DB in one transaction (this
writes to tables queried by UMQ and ensures that the same records are not found again by UMQ).
Goto step 2.
UMQ - Ugly Monster Query: This is a nasty database query that joins a bunch of tables, has conditions on columns in several of these tables and includes a NOT EXISTS subquery with some more joins and conditions. UMQ includes ORDER BY also has LIMIT 1000. Even though the query is bad I have done what I can here - there are indexes on all columns filtered on and the joins are all over foreign key relations.
I do expect UMQ to be heavy and take some time, which is why it's executed in a background job. However, what I'm seeing is rapidly degrading performance until it eventually causes a timeout in my service (maybe 50 times slower after 10 iterations).
First I thought that it was because the data queried by UMQ changes (see step 4 above) but that wasn't it because if I took the last query (the one that caused the timeout) from the slow query log and executed it myself directly I got the same behavior only until I restated the MySQL service. After restart the exact query on the exact same data that took >30 seconds before restart now took <0.5 seconds. I can reproduce this behavior every time by restoring the database to it's initial state and restarting the process.
Also, using the trick described in this question I could see that the query scans around 60K rows after restart as opposed to 18M rows before. EXPLAIN tells me that around 10K rows should be scanned and the result of EXPLAIN is always the same. No other processes are accessing the database at the same time and the lock_time in the slow query log is always 0. SHOW ENGINE INNODB STATUS before and after restart gives me no hints.
So finally the question: Does anybody have any clue of why I'm seeing this behavior? And how can I analyze this further?
I have the feeling that I need to configure MySQL differently in some way but I have searched and tested like crazy without coming up with anything that makes a difference.
Turns out that the behavior I saw was the result of how the MySQL optimizer uses InnoDB statistics to decide on an execution plan. This article put me on the right track (even though it does not exactly discuss my problem). The most important thing I learned from this is that MySQL calculates statistics on startup and then once in a while. This statistics is then used to optimize queries.
The way I had set up the test data the table T where most writes are done in step 4 started out as empty. After each iteration T would contain more and more records but the InnoDB statistics had not yet been updated to reflect this. Because of this the MySQL optimizer always chose an execution plan for UMQ (which includes a JOIN with T) that worked well when T was empty but worse and worse the more records T contained.
To verify this I added an ANALYZE TABLE T; before every execution of UMQ and the rapid degradation disappeared. No lightning performance but acceptable. I also saw that leaving the database for half an hour or so (maybe a bit shorter but at least more than a couple of minutes) would allow the InnoDB statistics to refresh automatically.
In a real scenario the relative difference in index cardinality for the tables involved in UMQ will look quite different and will not change as rapidly so I have decided that I don't really need to do anything about it.
thank you very much for the analysis and answer. I've been searching this issue for several days during ci on mariadb 10.1 and bacula server 9.4 (debian buster).
The situation was that after fresh server installation during a CI cycle, the first two tests (backup and restore) runs smoothly on unrestarted mariadb server and only the third test showed that one particular UMQ took about 20 minutes (building directory tree during restore process from the table with about 30k rows).
Unless the mardiadb server was restarted or table has been analyzed the problem would not go away. ANALYZE TABLE or the restart changed the cardinality of the fields and internal query processing exactly as stated in the linked article.

How to predict MySQL tipping points?

I work on a big web application that uses a MySQL 5.0 database with InnoDB tables. Twice over the last couple of months, we have experienced the following scenario:
The database server runs fine for weeks, with low load and few slow queries.
A frequently-executed query that previously ran quickly will suddenly start running very slowly.
Database load spikes and the site hangs.
The solution in both cases was to find the slow query in the slow query log and create a new index on the table to speed it up. After applying the index, database performance returned to normal.
What's most frustrating is that, in both cases, we had no warning about the impending doom; all of our monitoring systems (e.g., graphs of system load, CPU usage, query execution rates, slow queries) told us that the database server was in good health.
Question #1: How can we predict these kinds of tipping points or avoid them altogether?
One thing we are not doing with any regularity is running OPTIMIZE TABLE or ANALYZE TABLE. We've had a hard time finding a good rule of thumb about how often (if ever) to manually do these things. (Since these commands LOCK tables, we don't want to run them indiscriminately.) Do these scenarios sound like the result of unoptimized tables?
Question #2: Should we be manually running OPTIMIZE or ANALYZE? If so, how often?
More details about the app: database usage pattern is approximately 95% reads, 5% writes; database executes around 300 queries/second; the table used in the slow queries was the same in both cases, and has hundreds of thousands of records.
The MySQL Performance Blog is a fantastic resource. Namely, this post covers the basics of properly tuning InnoDB-specific parameters.
I've also found that the PDF version of the MySQL Reference Manual to be essential. Chapter 7 covers general optimization, and section 7.5 covers server-specific optimizations you can toy with.
From the sound of your server, the query cache may be of IMMENSE value to you.
The reference manual also gives you some great detail concerning slow queries, caches, query optimization, and even disk seek analysis with indexes.
It may be worth your time to look into multi-master replication, allowing you to lock one server entirely and run OPTIMIZE/ANALYZE, without taking a performance hit (as 95% of your queries are reads, the other server could manage the writes just fine).
Section 12.5.2.5 covers OPTIMIZE TABLE in detail, and 12.5.2.1 covers ANALYZE TABLE in detail.
Update for your edits/emphasis:
Question #2 is easy to answer. From the reference manual:
OPTIMIZE:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows. [...] You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data table.
And ANALYZE:
ANALYZE TABLE analyzes and stores the key distribution for a table. [...] MySQL uses the stored key distribution to decide the order in which tables should be joined when you perform a join on something other than a constant. In addition, key distributions can be used when deciding which indexes to use for a specific table within a query.
OPTIMIZE is good to run when you have the free time. MySQL optimizes well around deleted rows, but if you go and delete 20GB of data from a table, it may be a good idea to run this. It is definitely not required for good performance in most cases.
ANALYZE is much more critical. As noted, having the needed table data available to MySQL (provided with ANALYZE) is very important when it comes to pretty much any query. It is something that should be run on a common basis.
Question #1 is a bit more of a trick. I would watch the server very carefully when this happens, namely disk I/O. My bet would be that your server is thrashing either your swap or the (InnoDB) caches. In either case, it may be query, tuning, or load related. Unoptimized tables could cause this. As mentioned, running ANALYZE can immensely help performance, and will likely help out too.
I haven't found any good way of predicting MySQL "tipping points" -- and I've run into a few.
Having said that, I've found tipping points are related to table size. But not merely raw table size, rather how big the "area of interest" is to a query. For example, in a table of over 3 million rows and about 40 columns, about three-quarters integers, most queries that would easily select a portion of them based on indices are fast. However, when one value in a query on one indexed column means two-thirds of the rows are now "interesting", the query is now about 5-times slower than normal. Lesson: try to arrange your data so such a scan isn't necessary.
However, such behaviour now gives you a size to look for. This size will be heavily dependant on your server setup, the MySQL server variables and the table's schema and data.
Similarly, I've seen reporting queries run in reasonable time (~45 seconds) if the period is two weeks, but take half-an-hour if the period is extended to four weeks.
Use slow query log that will help you to narrow down the queries you want to optimize.
For time critical queries it sometimes better to keep stable plan by using hints.
It sounds like you have a frustrating situation and maybe not the best code review process and development environment.
Whenever you add a new query to your code you need to check that it has the appropriate indexes ready and add those with the code release.
If you don't do that your second option is to constantly monitor the slow query log and then go beat the developers; I mean go add the index.
There's an option to enable logging of queries that didn't use an index which would be useful to you.
If there are some queries that "works and stops working" (but are "using and index") then it's likely that the query wasn't very good in the first place (low cardinality in the index; inefficient join; ...) and the first rule of evaluating the query carefully when it's added would apply.
For question #2 - On InnoDB "analyze table" is basically free to run, so if you have bad join performance it doesn't hurt to run it. Unless the balance of the keys in the table are changing a lot it's unlikely to help though. It almost always comes down to bad queries. "optimize table" rebuilds the InnoDB table; in my experience it's relatively rare that it helps enough to be worth the hassle of having the table unavailable for the duration (or doing the master-master failover stuff while it's running).

MySQL slow query log - how slow is slow?

What do you find is the optimal setting for mysql slow query log parameter, and why?
I recommend these three lines
log_slow_queries
set-variable = long_query_time=1
log-queries-not-using-indexes
The first and second will log any query over a second. As others have pointed out a one second query is pretty far gone if you are a shooting for a high transaction rate on your website, but I find that it turns up some real WTFs; queries that should be fast, but for whatever combination of data it was run against was not.
The last will log any query that does not use an index. Unless your doing data warehousing any common query should have the best index you can find so pay attention to its output.
Although its certainly not for production, this last option
log = /var/log/mysql/mysql.log
will log all queries, which can be useful if you are trying to tune a specific page or action.
Whatever time /you/ feel is unacceptably slow for a query on your systems.
It depends on the kind of queries you run and the kind of system; a query taking several seconds might not matter if it's some back-end reporting system doing complex data-mining etc where a delay doesn't matter, but might be completely unacceptable on a user-facing system which is expected to return results promptly.
Set it to whatever you like. The only problem is that in a stock MySQL, it can only be set in increments of 1 second, which is too slow for some people.
Most heavily used production servers execute far too many queries to log them all. The slow log is a way of filtering the log so that we can see the ones which take a long time (most queries are likely to be executed almost instantly). It's a bit of a blunt instrument.
Set it to 1 sec if you like, you're probably not going to run out of disc space or create a performance problem by doing that.
It's really about the risk of enabling the slow log- don't do it if you feel it's likely to cause further disc or performance problems.
Of course you could enable the slow log on a non-production server and put simulated load through, but that is never quite the same.
Peter Zaitsev posted a nice article about using the slow query log. One thing he notes is important is to also consider how often a certain query is used. Reports run once a day are not important to be fast. But something that is run very often might be a problem even if it takes half a second. And you cant detect that without the microslow patch.
Not only is it a blunt instrument as far as resolution is concerned, but also it is MySQL-instance wide, so that if you have different databases with differing performancy requirements you're kind of out of luck. Obviously there are ways around that, but it's important to keep that in mind when setting your slow log setting.
Aside from performance requirements of your application, another factor to consider is what you're trying to log. Are you using the log to catch queries that would threaten the stability of your db instance (ones that cause deadlocks or Cartesian joins, for instance) or queries that affect the performance for specific users and that might require a little tuning? That will influence where you set your threshold.