MySQL server very high load - mysql

I run a website with ~500 real time visitors, ~50k daily visitors and ~1,3million total users. I host my server on AWS, where I use several instances of different kind. When I started the website the different instances cost rougly the same. When the website started to gain users the RDS instance (MySQL DB) CPU constantly keept hitting the roof, I had to upgrade it several times, now it have started to take up the main part of the performance and monthly cost (around 95% of (2,8k$/month)). I currently use a database server with 16vCPU and 64GiB of RAM, I also use Multi-AZ Deployment to protect against failures. I wonder if it is normal for the database to be that expensive, or if I have done something terribly wrong?
Database Info
At the moment my database have 40 tables with the most of them have 100k rows, some have ~2millions and 1 have 30 millions.
I have a system the archives rows that are older then 21 days when they are not needed anymore.
Website Info
The website mainly use PHP, but also some NodeJS and python.
Most of the functions of the website works like this:
Start transaction
Insert row
Get last inserted id (lastrowid)
Do some calculations
Updated the inserted row
Update the user
Commit transaction
I also run around 100bots wich polls from the database with 10-30sec interval, they also inserts/updates the database sometimes.
Extra
I have done several things to try to lower the load on the database. Such as enable database cache, use a redis cache for some queries, tried to remove very slow queries, tried to upgrade the storage type to "Provisioned IOPS SSD". But nothing seems to help.
This is the changes I have done to the setting paramters:
I have though about creating a MySQL cluster of several smaller instances, but I don't know if this would help, and I also don't know if this works good with transactions.
If you need any more information, please ask, any help on this issue is greatly appriciated!

In my experience, as soon as you ask the question "how can I scale up performance?" you know you have outgrown RDS (edit: I admit my experience that leads me to this opinion may be outdated).
It sounds like your query load is pretty write-heavy. Lots of inserts and updates. You should increase the innodb_log_file_size if you can on your version of RDS. Otherwise you may have to abandon RDS and move to an EC2 instance where you can tune MySQL more easily.
I would also disable the MySQL query cache. On every insert/update, MySQL has to scan the query cache to see if there any results cached that need to be purged. This is a waste of time if you have a write-heavy workload. Increasing your query cache to 2.56GB makes it even worse! Set the cache size to 0 and the cache type to 0.
I have no idea what queries you run, or how well you have optimized them. MySQL's optimizer is limited, so it's frequently the case that you can get huge benefits from redesigning SQL queries. That is, changing the query syntax, as well as adding the right indexes.
You should do a query audit to find out which queries are accounting for your high load. A great free tool to do this is https://www.percona.com/doc/percona-toolkit/2.2/pt-query-digest.html, which can give you a report based on your slow query log. Download the RDS slow query log with the http://docs.aws.amazon.com/cli/latest/reference/rds/download-db-log-file-portion.html CLI command.
Set your long_query_time=0, let it run for a while to collect information, then change long_query_time back to the value you normally use. It's important to collect all queries in this log, because you might find that 75% of your load is from queries under 2 seconds, but they are run so frequently that it's a burden on the server.
After you know which queries are accounting for the load, you can make some informed strategy about how to address them:
Query optimization or redesign
More caching in the application
Scale out to more instances

I think the answer is "you're doing something wrong". It is very unlikely you have reached an RDS limitation, although you may be hitting limits on some parts of it.
Start by enabling detailed monitoring. This will give you some OS-level information which should help determine what your limiting factor really is. Look at your slow query logs and database stats - you may have some queries that are causing problems.
Once you understand the problem - which could be bad queries, I/O limits, or something else - then you can address them. RDS allows you to create multiple read replicas, so you can move some of your read load to slaves.
You could also move to Aurora, which should give you better I/O performance. Or use PIOPS (or allocate more disk, which should increase performance). You are using SSD storage, right?
One other suggestion - if your calculations (step 4 above) takes a significant amount of time, you might want look at breaking it into two or more transactions.

A query_cache_size of more than 50M is bad news. You are writing often -- many times per second per table? That means the QC needs to be scanned many times/second to purge the entries for the table that changed. This is a big load on the system when the QC is 2.5GB!
query_cache_type should be DEMAND if you can justify it being on at all. And in that case, pepper the SELECTs with SQL_CACHE and SQL_NO_CACHE.
Since you have the slowlog turned on, look at the output with pt-query-digest. What are the first couple of queries?
Since your typical operation involves writing, I don't see an advantage of using readonly Slaves.
Are the bots running at random times? Or do they all start at the same time? (The latter could cause terrible spikes in CPU, etc.)
How are you "archiving" "old" records? It might be best to use PARTITIONing and "transportable tablespaces". Use PARTITION BY RANGE and 21 partitions (plus a couple of extras).
Your typical transaction seems to work with one row. Can it be modified to work with 10 or 100 all at once? (More than 100 is probably not cost-effective.) SQL is much more efficient in doing lots of rows at once versus lots of queries of one row each. Show us the SQL; we can dig into the details.
It seems strange to insert a new row, then update it, all in one transaction. Can't you completely compute it before doing the insert? Hanging onto the inserted_id for so long probably interferes with others doing the same thing. What is the value of innodb_autoinc_lock_mode?
Do the "users" interactive with each other? If so, in what way?

Related

MySQL queries very slow - occasionally

I'm running MariaDB 10.2.31 on Ubuntu 18.4.4 LTS.
On a regular basis I encounter the following conundrum - especially when starting out in the morning, that is when my DEV environment has been idle for the night - but also during the day from time to time.
I have a table (this applies to other tables as well) with approx. 15.000 rows and (amongst others) an index on a VARCHAR column containing on average 5 to 10 characters.
Notably, most columns including this one are GENERATED ALWAYS AS (JSON_EXTRACT(....)) STORED since 99% of my data comes from a REST API as JSON-encoded strings (and conveniently I simply store those in one column and extract everything else).
When running a query on that column WHERE colname LIKE 'text%' I find query-result durations of i.e. 0.006 seconds. Nice. When I have my query EXPLAINed, I can see that the index is being used.
However, as I have mentioned, when I start out in the morning, this takes way longer (14 seconds this morning). I know about the query cache and I tried this with query cache turned off (both via SET GLOBAL query_cache_type=OFF and RESET QUERY CACHE). In this case I get consistent times of approx. 0.3 seconds - as expected.
So, what would you recommend I should look into? Is my DB sleeping? Is there such a thing?
There are two things that could be going on:
1) Cold caches (overnight backup, mysqld restart, or large processing job results in this particular index and table data being evicted from memory).
2) Statistics on the table go stale and the query planner gets confused until you run some queries against the table and the statistics get refreshed. You can force an update using ANALYZE TABLE table_name.
3) Query planner heisenbug. Very common in MySQL 5.7 and later, never seen it before on MariaDB so this is rather unlikely.
You can get to the bottom of this by enablign the following in the config:
log_output='FILE'
log_slow_queries=1
log_slow_verbosity='query_plan,explain'
long_query_time=1
Then review what is in the slow log just after you see a slow occurrence. If the logged explain plan looks the same for both slow and fast cases, you have a cold caches issue. If they are different, you have a table stats issue and you need to cron ANALYZE TABLE at the end of the over night task that reads/writes a lot to that table. If that doesn't help, as a last resort, hard code an index hint into your query with FORCE INDEX (index_name).
Enable your slow query log with log_slow_verbosity=query_plan,explain and the long_query_time sufficient to catch the results. See if occasionally its using a different (or no) index.
Before you start your next day, look at SHOW GLOBAL STATUS LIKE "innodb_buffer_pool%" and after your query look at the values again. See how many buffer pool reads vs read requests are in this status output to see if all are coming off disk.
As #Solarflare mentioned, backups and nightly activity might be purging the innodb buffer pool of cached data and reverting bad to disk to make it slow again. As part of your nightly activites you could set innodb_buffer_pool_dump_now=1 to save the pages being hot before scripted activity and innodb_buffer_pool_load_now=1 to restore it.
Shout-out and Thank you to everyone giving valuable insight!
From all the tips you guys gave I think I am starting to understand the problem better and beginning to narrow it down:
First thing I found was my default innodb_buffer_pool_size of 134 MB. With the sort and amount of data I'm processing this is ridiculously low - so I was able to increase it.
Very helpful post: https://dba.stackexchange.com/a/27341
And from the docs: https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool-resize.html
Now that I have increased it to close to 2GB and am able to monitor its usage and RAM usage in general (cli: cat /proc/meminfo) I realize that my 4GB RAM is in fact on the low side of things. I am nowhere near seeing any unused overhead (buffer usage still at 99% and free RAM around 100MB).
I will start to optimize RAM usage of my daemon next and see where this leads - but this will not free enough RAM altogether.
#danblack mentioned innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. This is an interesting approach to maybe use whenever the daemon accesses the DB as I would love to separate my daemon's buffer usage from the front end's (apparently this is not possible!). I will look into this further but as my daemon is running all the time (not only at night) this might not be feasible.
#Gordan Bobic mentioned "refreshing" DBtables by using ANALYZE TABLE tableName. I found this to be quite fast and incorporated it into the daemon after each time it does an extensive read/write. This increases daemon run times by a few seconds but this is no issue at all. And I figure I can't go wrong with it :)
So, in the end I believe my issue to be a combination of things: Too small buffer size, too small RAM, too many read/write operations for that environment (evicting buffered indexes etc.).
Also I will have to learn more about memory allocation etc and optimize this better (large-pages=1 etc).

When does a slow MySQL query on a given connection affect other connections?

I think I have a basic understanding of this, but am hoping that someone can give me more details as I am interested in learning more about database performance.
Lets say I have a very large database, with many millions of entries, the database supports many connections. Doing simple queries on the database will be slow as there's so much data. I'm trying to understand exactly when a query on a given connection starts to have a direct effect on the performance of queries running on other connections.
If one connection locks some elements, I understand that that will hold up queries running the other connections that need those elements . For example doing:
SELECT FOR UPDATE
will lock what you are selecting.
What happens when you do something simple like:
SELECT COUNT(*) FROM myTable
lets say we have a table with a billion rows so running the count is going to take some time (running on innodb). Will it affect queries running on other connections?
What if you select a large amount of data using SELECT and JOIN, like:
SELECT * FROM myTable1 JOIN myTable2 ON myTable1.id = myTable2.id;
does having a join lock anything for other queries?
I'm finding it hard to know which queries will have a direct effect on the performance of queries running on other connections.
Thanks
There are different angles:
Row locking: this shouldn't happen if you tune your architecture, so you should forget about it
Real performances issues and bottleneck. In our case, collateral effects.
About this second point, the problem is mainly divided in 3 areas:
Disk reads
Memory usage (buffer)
CPU usage.
About disk reads: the more data (in bytes) you will retrieve, the more the harddrive is going to be busy and slowdown any other activity using it. Reduce the size of selected rows to avoid disk overhead.
About memory usage: mysql manages an internal buffer, that can get stuck in some situations. I don't know enough about it to give you a proper answer, but I know this is definetly something you should keep an eye on.
About cpu usage: basically the cpu will get busy when it
has to calculate (joins, preparing statements, arithmetics...)
has to do all the peripheric stuff: moving bytes from disk to memory for instance.
Optimize your queries to reduce cpu overhead. (sounds silly but, well, it always turns out to be the problem anyway...)
So, now when to know when there's a collateral effect? By profiling your hardware...
How to profile?
absolute profiling: use SHOW INNODB STATUS or SHOW PROFILE to get useful informations about main mysql harddrive, cpu and memory watches.
relative profiling: use your favorite OS profiler. Under windows xp for instance, you can use the great perfmon.exe and watch for PRIVATE BYTES and VIRTUAL BYTES of the mysql process. I say relative, because afterall if a query is time consuming on your computer, it might not be on the NASA system...
Hope it helps, regards.
This is a very general question, so giving a precise answer is difficult.
You can think of the database as a pool of shared resources; especially because the underlying hardware your database runs on has physical limits. Most often the reason you see something like a select query that causes a performance impact on other queries it's because they're all competing for using those underlying physical resources like Disk IO or RAM access or CPU time and there isn't enough to go around.
So the actual results you wil see depend heavily on your database's physical hardware, and the configuration settings.
For instance in your select examples the variables might be: Is the data the query needs already in RAM? Can it look up the rows efficiently by an index? If it does have to do IO, how many other queries are asking to read data from disk? Are you using a secondary index and have to do multiple reads? Is the database doing read-ahead to buffer other pages? Is the query causing sequential or random io? Are any updates holding locks on the data? How much read IO can physical hardware support?
You would have to answer all those questions for all queries currently executing to know if they're going to affect performance of others queries.
This is why DBAs exist. Busy databases are complex system, and it's all about the interaction of a great many different operations, all with thousands of possible variables affecting them.
So what you generally do is optimize the things you can control as well as you know how (hardware, mysql configuration, schema and indexes) then start measuring the system as it runs to understand what is actually going on.
So in your case, I would say that it's infinitely more helpful to focus on simply optimizing your queries individually. The faster they execute, the less resources they are probably using and the less change they will impact others. Then you learn to analyze the system. Just look at one thing that's slow and ask "why is this slow?" Then fix it. That's the optimization process.
However, in the first case you wrote with SELECT ... FOR UPDATE explicit locks can and will be big performance issues. Be careful with those.
Read queries are only affected by isolation levels of other queries. They themselves do not block the table ever.
Isolation levels are designated transactional safety modes. If another query that uses locking does not allow dirty reads your reads will be held until the other query finishes writing or unlocks.
MVCC is a mechanism that allows databases to create a new version of the data when they need to update or delete. Which means that when you start a read on the current version of the data, it data won't get tainted by future updates/deletes.
When you start a write on current data despite the data being currently read by another process, you're in fact writing the new stuff somewhere else and marking them as the newest version. Which in the end means no blocking for the writing process (at least not because of the reading process).

MySQL tuning on a regular basis without restart

I have a 5GB database, all tables are MyISAM. It runs into heavy load time from 01:30AM to 8:30AM (100+ selects, 150+ updates, 200+ cache hits per second) to do data analysis, during other time, load is moderate (10 selects, 5 inserts per second).
Problem is after a few days, data analysis during heavy load time appears to be slow down maybe due to query cache prunes (iowait increases). Current query cache is set to 1.5G while total RAM is 4G. It runs fast again after manually restart mysql server.
Is there a way to do regular optimization or cleaning up on mysql server to keep it running in a efficiently without a restart
It sounds to me like your application is busy updating the tables and you might have table contention. Do you have mytop running, or does SHOW PROCESSLIST give you any insight as to what part of your application is doing the most work? Have you enabled --slow-query-log setting?
Also, your database table engine might be an issue. Are you using MyISAM or InnoDB? You want to look out for table locking during updates, and how much of a backup that can create.
If you are issuing FLUSH QUERY CACHE, that can lead to badness, many versions of MySQL exhibit near-lockup when running that command.
Also, running top and checking /var/log/cron for cronjobs that might be affecting system load could help. If you are running updatedb or logrotate on your server, that could affect iowait.
It seems like your query cache size is far too large. While the query cache is usually a good thing, if it is too large it can hurt more then it helps.
This behavior is discussed in this article:
The issue here was that the customer had a moderate level of write traffic, and the current query cache implementation invalidates all result sets for a given table whenever that table is updated. As the query cache grows in size, the number of entries that must be invalidated for a given table may grow as well. In addition, the coarse locking on the cache can lead to lock contention that can kill performance, particularly on multi-core hardware.
I would recommend lowering the size of your query cache to somewhere between 16-128MB and see how that effects performance.
Another possibility is that the queries are generating really small result sets which is causing memory fragmentation. More information on this is available here, look for the "query_cache_min_res_unit" setting.

How to predict MySQL tipping points?

I work on a big web application that uses a MySQL 5.0 database with InnoDB tables. Twice over the last couple of months, we have experienced the following scenario:
The database server runs fine for weeks, with low load and few slow queries.
A frequently-executed query that previously ran quickly will suddenly start running very slowly.
Database load spikes and the site hangs.
The solution in both cases was to find the slow query in the slow query log and create a new index on the table to speed it up. After applying the index, database performance returned to normal.
What's most frustrating is that, in both cases, we had no warning about the impending doom; all of our monitoring systems (e.g., graphs of system load, CPU usage, query execution rates, slow queries) told us that the database server was in good health.
Question #1: How can we predict these kinds of tipping points or avoid them altogether?
One thing we are not doing with any regularity is running OPTIMIZE TABLE or ANALYZE TABLE. We've had a hard time finding a good rule of thumb about how often (if ever) to manually do these things. (Since these commands LOCK tables, we don't want to run them indiscriminately.) Do these scenarios sound like the result of unoptimized tables?
Question #2: Should we be manually running OPTIMIZE or ANALYZE? If so, how often?
More details about the app: database usage pattern is approximately 95% reads, 5% writes; database executes around 300 queries/second; the table used in the slow queries was the same in both cases, and has hundreds of thousands of records.
The MySQL Performance Blog is a fantastic resource. Namely, this post covers the basics of properly tuning InnoDB-specific parameters.
I've also found that the PDF version of the MySQL Reference Manual to be essential. Chapter 7 covers general optimization, and section 7.5 covers server-specific optimizations you can toy with.
From the sound of your server, the query cache may be of IMMENSE value to you.
The reference manual also gives you some great detail concerning slow queries, caches, query optimization, and even disk seek analysis with indexes.
It may be worth your time to look into multi-master replication, allowing you to lock one server entirely and run OPTIMIZE/ANALYZE, without taking a performance hit (as 95% of your queries are reads, the other server could manage the writes just fine).
Section 12.5.2.5 covers OPTIMIZE TABLE in detail, and 12.5.2.1 covers ANALYZE TABLE in detail.
Update for your edits/emphasis:
Question #2 is easy to answer. From the reference manual:
OPTIMIZE:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows. [...] You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data table.
And ANALYZE:
ANALYZE TABLE analyzes and stores the key distribution for a table. [...] MySQL uses the stored key distribution to decide the order in which tables should be joined when you perform a join on something other than a constant. In addition, key distributions can be used when deciding which indexes to use for a specific table within a query.
OPTIMIZE is good to run when you have the free time. MySQL optimizes well around deleted rows, but if you go and delete 20GB of data from a table, it may be a good idea to run this. It is definitely not required for good performance in most cases.
ANALYZE is much more critical. As noted, having the needed table data available to MySQL (provided with ANALYZE) is very important when it comes to pretty much any query. It is something that should be run on a common basis.
Question #1 is a bit more of a trick. I would watch the server very carefully when this happens, namely disk I/O. My bet would be that your server is thrashing either your swap or the (InnoDB) caches. In either case, it may be query, tuning, or load related. Unoptimized tables could cause this. As mentioned, running ANALYZE can immensely help performance, and will likely help out too.
I haven't found any good way of predicting MySQL "tipping points" -- and I've run into a few.
Having said that, I've found tipping points are related to table size. But not merely raw table size, rather how big the "area of interest" is to a query. For example, in a table of over 3 million rows and about 40 columns, about three-quarters integers, most queries that would easily select a portion of them based on indices are fast. However, when one value in a query on one indexed column means two-thirds of the rows are now "interesting", the query is now about 5-times slower than normal. Lesson: try to arrange your data so such a scan isn't necessary.
However, such behaviour now gives you a size to look for. This size will be heavily dependant on your server setup, the MySQL server variables and the table's schema and data.
Similarly, I've seen reporting queries run in reasonable time (~45 seconds) if the period is two weeks, but take half-an-hour if the period is extended to four weeks.
Use slow query log that will help you to narrow down the queries you want to optimize.
For time critical queries it sometimes better to keep stable plan by using hints.
It sounds like you have a frustrating situation and maybe not the best code review process and development environment.
Whenever you add a new query to your code you need to check that it has the appropriate indexes ready and add those with the code release.
If you don't do that your second option is to constantly monitor the slow query log and then go beat the developers; I mean go add the index.
There's an option to enable logging of queries that didn't use an index which would be useful to you.
If there are some queries that "works and stops working" (but are "using and index") then it's likely that the query wasn't very good in the first place (low cardinality in the index; inefficient join; ...) and the first rule of evaluating the query carefully when it's added would apply.
For question #2 - On InnoDB "analyze table" is basically free to run, so if you have bad join performance it doesn't hurt to run it. Unless the balance of the keys in the table are changing a lot it's unlikely to help though. It almost always comes down to bad queries. "optimize table" rebuilds the InnoDB table; in my experience it's relatively rare that it helps enough to be worth the hassle of having the table unavailable for the duration (or doing the master-master failover stuff while it's running).

MySQL slow query log - how slow is slow?

What do you find is the optimal setting for mysql slow query log parameter, and why?
I recommend these three lines
log_slow_queries
set-variable = long_query_time=1
log-queries-not-using-indexes
The first and second will log any query over a second. As others have pointed out a one second query is pretty far gone if you are a shooting for a high transaction rate on your website, but I find that it turns up some real WTFs; queries that should be fast, but for whatever combination of data it was run against was not.
The last will log any query that does not use an index. Unless your doing data warehousing any common query should have the best index you can find so pay attention to its output.
Although its certainly not for production, this last option
log = /var/log/mysql/mysql.log
will log all queries, which can be useful if you are trying to tune a specific page or action.
Whatever time /you/ feel is unacceptably slow for a query on your systems.
It depends on the kind of queries you run and the kind of system; a query taking several seconds might not matter if it's some back-end reporting system doing complex data-mining etc where a delay doesn't matter, but might be completely unacceptable on a user-facing system which is expected to return results promptly.
Set it to whatever you like. The only problem is that in a stock MySQL, it can only be set in increments of 1 second, which is too slow for some people.
Most heavily used production servers execute far too many queries to log them all. The slow log is a way of filtering the log so that we can see the ones which take a long time (most queries are likely to be executed almost instantly). It's a bit of a blunt instrument.
Set it to 1 sec if you like, you're probably not going to run out of disc space or create a performance problem by doing that.
It's really about the risk of enabling the slow log- don't do it if you feel it's likely to cause further disc or performance problems.
Of course you could enable the slow log on a non-production server and put simulated load through, but that is never quite the same.
Peter Zaitsev posted a nice article about using the slow query log. One thing he notes is important is to also consider how often a certain query is used. Reports run once a day are not important to be fast. But something that is run very often might be a problem even if it takes half a second. And you cant detect that without the microslow patch.
Not only is it a blunt instrument as far as resolution is concerned, but also it is MySQL-instance wide, so that if you have different databases with differing performancy requirements you're kind of out of luck. Obviously there are ways around that, but it's important to keep that in mind when setting your slow log setting.
Aside from performance requirements of your application, another factor to consider is what you're trying to log. Are you using the log to catch queries that would threaten the stability of your db instance (ones that cause deadlocks or Cartesian joins, for instance) or queries that affect the performance for specific users and that might require a little tuning? That will influence where you set your threshold.