I have a MySQL database running in Amazon RDS that serves a site. Around 80MB of data in one of the tables gets replaced hourly. I load the data into another table and then use rename to swap the tables and then delete the other table.
Most of the time everything is fine. When the update happens hourly, there is a CPU spike to around to 10% and back to 2% or 3%, which is the CPU consumption with normal site traffic.
Once in a while (no pattern), however, the CPU spikes to almost 100% right at the time of the update and stays very high until the next update an hour later, long after the update is completed. The update takes seconds. During the entire hour, traffic to the site is normal. Also, during that time, show processlist is empty - no stuck processes.
A couple of times, I manually triggered the update before the hour ended and the CPU went back to normal.
My question is this: Assuming there is something inefficient going on with the data loading, can a bad query or update cause the CPU to stay high long after it completes executing?
I will run SHOW ENGINE INNODB STATUS next time it happens. When I do, what should I be looking for?
Some details if useful:
RDS m4.large instance running in multi-az
MySQL version 5.6.27
InnoDB engine
Connection pool used with default 8 connections
Related
I'm running MariaDB 10.2.31 on Ubuntu 18.4.4 LTS.
On a regular basis I encounter the following conundrum - especially when starting out in the morning, that is when my DEV environment has been idle for the night - but also during the day from time to time.
I have a table (this applies to other tables as well) with approx. 15.000 rows and (amongst others) an index on a VARCHAR column containing on average 5 to 10 characters.
Notably, most columns including this one are GENERATED ALWAYS AS (JSON_EXTRACT(....)) STORED since 99% of my data comes from a REST API as JSON-encoded strings (and conveniently I simply store those in one column and extract everything else).
When running a query on that column WHERE colname LIKE 'text%' I find query-result durations of i.e. 0.006 seconds. Nice. When I have my query EXPLAINed, I can see that the index is being used.
However, as I have mentioned, when I start out in the morning, this takes way longer (14 seconds this morning). I know about the query cache and I tried this with query cache turned off (both via SET GLOBAL query_cache_type=OFF and RESET QUERY CACHE). In this case I get consistent times of approx. 0.3 seconds - as expected.
So, what would you recommend I should look into? Is my DB sleeping? Is there such a thing?
There are two things that could be going on:
1) Cold caches (overnight backup, mysqld restart, or large processing job results in this particular index and table data being evicted from memory).
2) Statistics on the table go stale and the query planner gets confused until you run some queries against the table and the statistics get refreshed. You can force an update using ANALYZE TABLE table_name.
3) Query planner heisenbug. Very common in MySQL 5.7 and later, never seen it before on MariaDB so this is rather unlikely.
You can get to the bottom of this by enablign the following in the config:
log_output='FILE'
log_slow_queries=1
log_slow_verbosity='query_plan,explain'
long_query_time=1
Then review what is in the slow log just after you see a slow occurrence. If the logged explain plan looks the same for both slow and fast cases, you have a cold caches issue. If they are different, you have a table stats issue and you need to cron ANALYZE TABLE at the end of the over night task that reads/writes a lot to that table. If that doesn't help, as a last resort, hard code an index hint into your query with FORCE INDEX (index_name).
Enable your slow query log with log_slow_verbosity=query_plan,explain and the long_query_time sufficient to catch the results. See if occasionally its using a different (or no) index.
Before you start your next day, look at SHOW GLOBAL STATUS LIKE "innodb_buffer_pool%" and after your query look at the values again. See how many buffer pool reads vs read requests are in this status output to see if all are coming off disk.
As #Solarflare mentioned, backups and nightly activity might be purging the innodb buffer pool of cached data and reverting bad to disk to make it slow again. As part of your nightly activites you could set innodb_buffer_pool_dump_now=1 to save the pages being hot before scripted activity and innodb_buffer_pool_load_now=1 to restore it.
Shout-out and Thank you to everyone giving valuable insight!
From all the tips you guys gave I think I am starting to understand the problem better and beginning to narrow it down:
First thing I found was my default innodb_buffer_pool_size of 134 MB. With the sort and amount of data I'm processing this is ridiculously low - so I was able to increase it.
Very helpful post: https://dba.stackexchange.com/a/27341
And from the docs: https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool-resize.html
Now that I have increased it to close to 2GB and am able to monitor its usage and RAM usage in general (cli: cat /proc/meminfo) I realize that my 4GB RAM is in fact on the low side of things. I am nowhere near seeing any unused overhead (buffer usage still at 99% and free RAM around 100MB).
I will start to optimize RAM usage of my daemon next and see where this leads - but this will not free enough RAM altogether.
#danblack mentioned innodb_buffer_pool_dump_now and innodb_buffer_pool_load_now. This is an interesting approach to maybe use whenever the daemon accesses the DB as I would love to separate my daemon's buffer usage from the front end's (apparently this is not possible!). I will look into this further but as my daemon is running all the time (not only at night) this might not be feasible.
#Gordan Bobic mentioned "refreshing" DBtables by using ANALYZE TABLE tableName. I found this to be quite fast and incorporated it into the daemon after each time it does an extensive read/write. This increases daemon run times by a few seconds but this is no issue at all. And I figure I can't go wrong with it :)
So, in the end I believe my issue to be a combination of things: Too small buffer size, too small RAM, too many read/write operations for that environment (evicting buffered indexes etc.).
Also I will have to learn more about memory allocation etc and optimize this better (large-pages=1 etc).
I have a database with a few tables in it. One of the tables has ~3000 rows each with ~20 columns. Every 30 seconds one of the rows in the table is UPDATE'ed with new information. I'm having a problem where sometimes (infrequently) I will notice the memory being used by the process that is updating the rows will start increasing "indefinitely" (I stop the process before it stops growing, but I'm sure it stops at some upper limit). The database is not growing during this time. Only existing rows are being updated.
I'm looking for ideas on what could cause the memory usage to start going up so that I can prevent it from happening. Since most of the time the memory usage stays the same despite running the same update process I'm not sure what condition is triggering the failure state (growing memory usage) so that I can recreate the failure on demand.
The table is using the Memory engine and I've seen the same failure using the InnoDB engine.
The MEMORY_USAGE I'm looking at is in the table returned by the below query. Are there other mysql variables I can look at to get a better idea of what specifically is using up the memory?
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST
I found my bug. To anyone else who ends up here remember to call mysql_free_result() (I had a case where I wasn't).
For MySQL 5.6
Queries in our Dev VM linux box w/MySQL take over 4+ seconds to run the first time, then fast after sub 100ms. After a period of time query becomes slow again. I increased the RAM 2+ GB and buffer pool (it was really small using default install number)
The behavior remains, query runs slow then fast once cached. How do we monitor or check if the query is still cached or know when approximate time data gets evicted out of cache.
There's not a heavy load (far as I can tell) to expect data to be aged out.
I believe it's disk io, but I am open to suggestions. Thanks!
If the Query Cache is "ON" EVERY!!! Query & Result goes into the Cache and stays there. There a 2 situation when they Query goes out of the Cache.
1) If no more Memory for a new Query, the oldest Querys are purged
2) if you change the Table, so Result may be changed
If you switch the cache to "ON DEMAND" so you can say for each Query if it goes to the cache. So it is possible to put only important Querys into the cache
We have nightly load jobs that writes several hundred thousand records to an Mysql reporting database running in Amazon RDS.
The load jobs are taking several hours to complete, but I am having a hard time figuring out where the bottleneck is.
The instance is currently running with General Purpose (SSD) storage. By looking at the cloudwatch metrics, it appears I am averaging less than 50 IOPS for the last week. However, Network Receive Throughput is less than 0.2 MB/sec.
Is there anyway to tell from this data if I am being bottlenecked by network latency (we are currently loading the data from a remote server...this will change eventually) or by Write IOPS?
If IOPS is the bottleneck, I can easily upgrade to Provisioned IOPS. But if network latency is the issue, I will need to redesign our load jobs to load raw data from EC2 instances instead of our remote servers, which will take some time to implement.
Any advice is appreciated.
UPDATE:
More info about my instance. I am using an m3.xlarge instance. It is provisioned for 500GB in size. The load jobs are done with the ETL tool from pentaho. They pull from multiple (remote) source databases and insert into the RDS instance using multiple threads.
You aren't using up much CPU. Your memory is very low. An instance with more memory should be a good win.
You're only doing 50-150 iops. That's low, you should get 3000 in a burst on standard SSD-level storage. However, if your database is small, it is probably hurting you (since you get 3 iops per GB- so if you are on a 50gb or smaller database, consider paying for provisioned iops).
You might also try Aurora; it speaks mysql, and supposedly has great performance.
If you can spread out your writes, the spikes will be smaller.
A very quick test is to buy provisioned IOPS, but be careful as you may get fewer than you do currently during a burst.
Another quick means to determine your bottleneck is to profile your job execution application with a profiler that understands your database driver. If you're using Java, JProfiler will show the characteristics of your job and it's use of the database.
A third is to configure your database driver to print statistics about the database workload. This might inform you that you are issuing far more queries than you would expect.
Your most likely culprit accessing the database remotely is actually round-trip latency. The impact is easy to overlook or underestimate.
If the remote database has, for example, a 75 millisecond round-trip time, you can't possibly execute more than 1000 (milliseconds/sec) / 75 (milliseconds/round trip) = 13.3 queries per second if you're using a single connection. There's no getting around the laws of physics.
The spikes suggest inefficiency in the loading process, where it gathers for a while, then loads for a while, then gathers for a while, then loads for a while.
Separate but related, if you don't have the MySQL client/server compression protocol enabled on the client side... find out how to enable it. (The server always supports compression but the client has to request it during the initial connection handshake), This won't fix the core problem but should improve the situation somewhat, since less data to physically transfer could mean less time wasted in transit.
I'm not an RDS expert and I don't know if my own particular case can shed you some light. Anyway, hope this give you any kind of insight.
I have a db.t1.micro with 200GB provisioned (that gives be 600 IOPS baseline performance), on a General Purpose SSD storage.
The heaviest workload is when I aggregate thousands of records from a pool of around 2.5 million rows from a 10 million rows table and another one of 8 million rows. I do this every day. This is what I average (it is steady performance, unlike yours where I see a pattern of spikes):
Write/ReadIOPS: +600 IOPS
NetworkTrafficReceived/Transmit throughput: < 3,000 Bytes/sec (my queries are relatively short)
Database connections: 15 (workers aggregating on parallel)
Queue depth: 7.5 counts
Read/Write Throughput: 10MB per second
The whole aggregation task takes around 3 hours.
Also check 10 tips to improve The Performance of your app in AWS slideshare from AWS Summit 2014.
I don't know what else to say since I'm not an expert! Good luck!
In my case it was the amount of records. I was writing only 30 records per minute and had an Write IOPS of round about the same 20 to 30. But this was eating at the CPU, which reduced the CPU credit quite steeply. So I took all the data in that table and moved it to another "historic" table. And cleared all data in that table.
CPU dropped back down to normal measures, but Write IOPS stayed about the same, this was fine though. The problem: Indexes, I think because so many records needed to indexed when inserting it took way more CPU to do this indexing with that amount of rows. Even though the only index I had was a Primary Key.
Moral of my story, the problem is not always where you think it lies, although I had increased Write IOPS it was not the root cause of the problem, but rather the CPU that was being used to do index stuff when inserting which caused the CPU credit to fall.
Not even X-RAY on the Lambda could catch an increased query time. That is when I started to look at the DB directly.
Your Queue depth graph shows > 2 , which clearly indicate that the IOPS is under provisioned. (if Queue depth < 2 then IOPS is over provisioned)
I think you have used the default AUTOCOMMIT = 1 (autocommit mode). It performs a log flush to disk for every insert and exhausted the IOPS.
So,It is better to use (for performance tuning) AUTOCOMMIT = 0 before bulk inserts of data in MySQL, if the insert query looks like
set AUTOCOMMIT = 0;
START TRANSACTION;
-- first 10000 recs
INSERT INTO SomeTable (column1, column2) VALUES (vala1,valb1),(vala2,valb2) ... (val10000,val10000);
COMMIT;
--- next 10000 recs
START TRANSACTION;
INSERT INTO SomeTable (column1, column2) VALUES (vala10001,valb10001),(vala10001,valb10001) ... (val20000,val20000);
COMMIT;
--- next 10000 recs
.
.
.
set AUTOCOMMIT = 1
Using the above approach in t2.micro and inserted 300000 in 15 minutes using PHP.
I have a 5GB database, all tables are MyISAM. It runs into heavy load time from 01:30AM to 8:30AM (100+ selects, 150+ updates, 200+ cache hits per second) to do data analysis, during other time, load is moderate (10 selects, 5 inserts per second).
Problem is after a few days, data analysis during heavy load time appears to be slow down maybe due to query cache prunes (iowait increases). Current query cache is set to 1.5G while total RAM is 4G. It runs fast again after manually restart mysql server.
Is there a way to do regular optimization or cleaning up on mysql server to keep it running in a efficiently without a restart
It sounds to me like your application is busy updating the tables and you might have table contention. Do you have mytop running, or does SHOW PROCESSLIST give you any insight as to what part of your application is doing the most work? Have you enabled --slow-query-log setting?
Also, your database table engine might be an issue. Are you using MyISAM or InnoDB? You want to look out for table locking during updates, and how much of a backup that can create.
If you are issuing FLUSH QUERY CACHE, that can lead to badness, many versions of MySQL exhibit near-lockup when running that command.
Also, running top and checking /var/log/cron for cronjobs that might be affecting system load could help. If you are running updatedb or logrotate on your server, that could affect iowait.
It seems like your query cache size is far too large. While the query cache is usually a good thing, if it is too large it can hurt more then it helps.
This behavior is discussed in this article:
The issue here was that the customer had a moderate level of write traffic, and the current query cache implementation invalidates all result sets for a given table whenever that table is updated. As the query cache grows in size, the number of entries that must be invalidated for a given table may grow as well. In addition, the coarse locking on the cache can lead to lock contention that can kill performance, particularly on multi-core hardware.
I would recommend lowering the size of your query cache to somewhere between 16-128MB and see how that effects performance.
Another possibility is that the queries are generating really small result sets which is causing memory fragmentation. More information on this is available here, look for the "query_cache_min_res_unit" setting.