MySQL tune slow delete - mysql

I'm using MySQL InnoDB; one of the most important tables has over 700 million records and a total of 23 actively used indexes.
I am trying to delete records in batches of 2000, based on the record date (with an ORDER BY on the primary key column). Each date has around 6 million records, and I delete date by date with LIMIT 2000. Each batch takes around 25 seconds to complete. Since this is a production database, I want this delete operation to finish faster. Is there a better way to do this?

There are many solutions. See http://mysql.rjweb.org/doc.php/deletebig .
Which takes 25 seconds? A single batch of 2000 rows? Or several of those? Are they in a single, big transaction? Lots of little transactions would be slower, but less invasive. And there would be no ACID problem, since re-deleting the same 'date' is idempotent.
If the table is "locked" at some level for 25 seconds, then I understand your concern. If it is a bunch of sub-second deletes, then does it really matter that the whole job takes a long time?
Furthermore, instead of deleting once a day, you could delete once an hour. This might shrink each run to about 2 seconds instead of 25.
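As a rough sketch of the batched approach (table and column names here are made up; adjust to your schema), each batch can be its own small transaction, repeated until a date is exhausted:
-- Hypothetical names: mytable, record_date, id (the PRIMARY KEY)
DELETE FROM mytable
 WHERE record_date = '2023-01-15'
 ORDER BY id
 LIMIT 2000;
-- ROW_COUNT() reports how many rows the last statement deleted;
-- when it comes back under 2000, that date is finished.
SELECT ROW_COUNT();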
23 indexes is terribly many. Please provide SHOW CREATE TABLE. It is all too common that there are "redundant" indexes that could (should) be dropped. The main example: INDEX(a,b) takes care of INDEX(a) (but not INDEX(b)), so if you have both INDEX(a,b) and INDEX(a), drop INDEX(a).
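For example (index names are assumptions), if the table has both of these, the single-column one can go:
-- INDEX(a, b) already serves any query that filters on a alone
ALTER TABLE mytable DROP INDEX idx_a;   -- assumed name of the redundant INDEX(a)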
It may be impractical to make the change now, but PARTITION BY RANGE(TO_DAYS(...)) lets you DROP PARTITION, which is virtually instantaneous. (Adding partitioning to a 700M row table would take a long time.)
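A sketch of what that could look like, with purely illustrative names and dates (note that MySQL requires the partitioning column to be part of every unique key, including the PRIMARY KEY):
ALTER TABLE mytable
  PARTITION BY RANGE (TO_DAYS(record_date)) (
    PARTITION p20230114 VALUES LESS THAN (TO_DAYS('2023-01-15')),
    PARTITION p20230115 VALUES LESS THAN (TO_DAYS('2023-01-16')),
    PARTITION future    VALUES LESS THAN MAXVALUE
  );
-- Purging a whole day then becomes:
ALTER TABLE mytable DROP PARTITION p20230114;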

Related

Choosing the right MySQL structure for a very large time-based dataset

I have been using MySQL for the past few months and I have a good handle on smaller database structures. Now, however, I need to decide on how to create a database that can store a large set of time oriented data in either multiple tables, or a single table.
Using a single table, I have tried partitioning it into yearly segments, however, the load times and insert times are still quite long. Especially for searching. The data consists of roughly 8000 reporting stations with about 300-500 reports per day (several per hour). The reports go back all the way to 1980, so easily over 120 million data points and growing.
I am not sure what may provide the best results for searching such a vast amount of data, or if it would be better to separate the data into several tables. Each report has only a couple columns of information (time, temperature and wind).
I am sure this question has been asked many times, but any help would be appreciated.
Thank you!
120M rows is big enough to consider PARTITIONing, and that is good for time-based data if you need to delete "old" data, because DROP PARTITION is a lot faster and less invasive than DELETE.
I discuss this at length here.
Loading into a partitioned table should be only slightly slower (or faster in rare cases) than for a non-partitioned table.
Searching problems -- sounds like you did not index the table properly. Some tips:
(Usually) Put the "partition key" last in any index, if it is needed at all.
Use PARTITION BY RANGE(TO_DAYS(...)) only.
40 years? 40 partitions is reasonable.
Do not partition by station, but probably use that column at the start of some indexes.
Please show me the CREATE TABLE so I can be more specific in my tips.
If you won't be deleting 'old' rows, then partitioning is probably a waste. Let's see some of the queries.
On the other hand, if you often use a date range and several stations, then you have the "2D index problem". Partition by year; start the PRIMARY KEY with station
Do not use multiple tables. This is a common Question on this forum, and the answer is always the same.
Quite possibly you need some sort of "summary table". It might include high, low, average temp, etc for each week. For, say, a multi-year temperature graph, this is clearly 7 times as fast. More here.
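A minimal sketch of such a summary table (all names and columns below are assumptions based on the description):
CREATE TABLE weekly_summary (
  station_id INT NOT NULL,
  yr_wk      CHAR(6) NOT NULL,   -- year + week, e.g. '202331'
  lo_temp    FLOAT,
  hi_temp    FLOAT,
  avg_temp   FLOAT,
  PRIMARY KEY (station_id, yr_wk)
);
-- Populate (or re-populate recent weeks) from the detail table:
INSERT INTO weekly_summary
SELECT station_id, DATE_FORMAT(report_time, '%x%v'),
       MIN(temperature), MAX(temperature), AVG(temperature)
  FROM reports
 GROUP BY station_id, DATE_FORMAT(report_time, '%x%v');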
Inserting only 37 rows/second should not be a problem, even on a slow HDD. If they come in batches, then batch the INSERTs via multiple rows per INSERT statement or via LOAD DATA.
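For example, one statement carrying many rows (values are placeholders), or LOAD DATA for a whole file:
INSERT INTO reports (station_id, report_time, temperature, wind)
VALUES (101, '2023-07-01 00:00:00', 21.4, 3.2),
       (101, '2023-07-01 01:00:00', 20.9, 2.8),
       (102, '2023-07-01 00:00:00', 18.3, 5.1);
LOAD DATA INFILE '/path/to/reports.csv'
  INTO TABLE reports
  FIELDS TERMINATED BY ',';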

Sql insert query slow down while inserting 100k records

I have a table which contains 20 million records.
Daily, I delete 100K records and insert 100K records, but the inserts are taking more and more time.
The table has one clustered index (the primary key).
I have already tried sp_updatestats after deleting records.
So, you are keeping about 200 days' worth of data? (Perhaps 6 months?) And you are deleting the oldest day? What is the PRIMARY KEY? Perhaps an AUTO_INCREMENT? If not, then we need to study it. And you have some datetime or timestamp column.
PARTITION BY RANGE(TO_DAYS(datetime)) into about 28 weeks. Then DROP PARTITION weekly and REORGANIZE PARTITION future INTO next_week, future. More details here.
With that, the delete will be 'instantaneous', as will the creation of a new partition. And the other partitions will not be messed with, thereby avoiding whatever is currently causing it to "take more time".
If you need to discuss this further, please provide SHOW CREATE TABLE and tell us how you were doing the big delete.
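Under that scheme, the weekly maintenance could look roughly like this (partition names and dates are placeholders):
-- Drop the oldest week:
ALTER TABLE mytable DROP PARTITION p2023w01;
-- Split the catch-all 'future' partition so next week has its own slot:
ALTER TABLE mytable REORGANIZE PARTITION future INTO (
  PARTITION p2023w29 VALUES LESS THAN (TO_DAYS('2023-07-24')),
  PARTITION future   VALUES LESS THAN MAXVALUE
);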
You're experiencing table bloat. Disk space freed by deletes is not returned to the operating system by MySQL (it is only reused within the tablespace), so selects and inserts have to work around all the garbage you've deleted, causing your slowdowns. This is one of the main reasons why I choose to use Postgres for non-trivial projects. Your actual problem is your database setup choices.
Your best bet is partitioning your tables by date. I've found this speeds things up greatly in your situation. https://dev.mysql.com/doc/refman/5.5/en/partitioning-range.html
EDIT: this might be worth your time to read as well: https://www.jeffgeerling.com/blogs/jeff-geerling/reclaim-your-hard-drive-saving
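If the goal is just to reclaim the space already wasted by deletes (and assuming file-per-table tablespaces), rebuilding the table is the usual route; a sketch:
-- On InnoDB this maps to ALTER TABLE ... FORCE (recreate + analyze),
-- which rebuilds the table and all of its indexes.
OPTIMIZE TABLE mytable;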

Is adding and dropping indexes everyday on huge tables a good practice?

I'm building a Web Application that is connected to a MySQL database.
I've got two huge tables, each containing about 40 million rows at the moment, and they receive new rows every day (roughly 500,000 to 1,000,000 rows per day).
The process to add new rows runs during the night, while no one can use the application, and the new rows' content depends on the result of some basic SELECT queries on the current database.
In order to get the result of those SELECT statement fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.
The thing is, during the day, some totally different queries are run against those tables, including some "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)).
I found on stack this very helpful mini-cookbook that advises you on which INDEXes you should use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
They advise using a compound index: for my example query above that would be INDEX(a, b, date).
It did indeed increase the speed of the queries run during the day (from 1 minute to 8 seconds, so I was truly happy).
However, with those compound indexes, the time required to add new rows during the night explodes (it would take more than one day to add the daily content).
Here is my question: would that be ok to drop all the indexes every night, add the new content, and set back up the daily indexes?
Or would that be dangerous since indexes are not meant to be rebuilt every day, especially on such big tables?
I know such an operation would take approximately two hours in total (drop and recreate INDEXes).
I am aware of the existence of ALTER TABLE table_name DISABLE KEYS; but I'm using InnoDB and I believe it does not work on InnoDB tables.
I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.
I would ask about how you are inserting new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no index) and do a bulk insert:
insert into bigtable( . . .)
select . . .
from smalltable;
These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.
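If you do go the drop-and-recreate route, grouping the changes into a single ALTER per table is worth trying, since building several indexes in one statement should only need one pass over the data. A sketch with assumed index names and the columns from the example query:
-- Before the nightly load:
ALTER TABLE t1 DROP INDEX idx_a, DROP INDEX idx_b, DROP INDEX idx_a_b_date;
-- ... run the bulk INSERT ... SELECT / LOAD DATA here ...
-- After the load, recreate everything in one statement:
ALTER TABLE t1
  ADD INDEX idx_a (a),
  ADD INDEX idx_b (b),
  ADD INDEX idx_a_b_date (a, b, date);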
A digression... PARTITIONing by date should be very useful for you since you are deleting things over a year ago. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking it into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.
More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.
But back to your clever idea of dropping and rebuilding indexes. To others, I point out the caveat that you have the luxury of not otherwise touching the table for long enough to do the rebuild.
With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? This partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM, thereby be 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and RAM size, I can help you do the arithmetic to see if your 'last' partition will fit in RAM.
A note about index updates in InnoDB -- they are 'delayed' by sitting in the "Change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_max_size, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)
The default for that setting is 25, meaning that 25% of the buffer_pool is set aside for pending updates to indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, hence finish faster.
Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:
Just before the nightly INSERTs:
SET GLOBAL innodb_change_buffer_max_size = 50;   -- 50 is the maximum allowed
SET GLOBAL innodb_old_blocks_pct = 10;
Soon after the nightly INSERTs:
SET GLOBAL innodb_change_buffer_max_size = 25;   -- back to the default
SET GLOBAL innodb_old_blocks_pct = 37;
(I am not sure about that other setting, but it seems reasonable to push it out of the way.)
Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.
In a similar application, I had big, hourly, dumps to load into a table, and a 90-day retention. I stretched my Partition rules by having 90 daily partitions and 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions is a performance killer until 8.0, so don't even consider daily partitions for your 1-year retention.)
The Summary tables made it so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That would let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: don't have any indexes (other than the PRIMARY KEY) on the Fact table; use Summary tables for things that need messier indexes.

MySQL: Partition-like function for a single set of data?

I have a table that has millions of records, and they utilize EFF_FROM and EFF_TO date fields to version the records.
99% of the time, when this table is queried by an application, it is only concerned with records that have an EFF_TO of 2099-12-31, or records that are active and not historical.
I copied just the active records to a test version of the table and the application's SELECT query went from 60 seconds to 3 seconds.
I don't necessarily want to partition every EFF_TO date. I don't want to add that overhead especially to processes that populate the table. I only want the optimization for querying records with 2099-12-31, and I want the performance to be instant.
Is there a straightforward way to do this? Or do I have to resort to creating an active table and a historical table?
Partition like function for a single set of data?
This is something of an oxymoron; however, you are asking about partitioning into two sets of data: one where EFF_TO is in the future and one where it is in the past.
have an EFF_TO of 2099-12-31
Design fault - these should be null.
If they were NULL then the partitioning would be simple. As it stands, you will have to drop and recreate the partitions - which is a rather expensive operation (have a look at tools for doing online schema updates).
You could minimize the impact by creating multiple partitions covering the period around NOW, then adding an extra one onto the end and removing one from the beginning at regular intervals.
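A rough sketch of that two-bucket idea (this assumes EFF_TO can be made part of the partitioning key, which MySQL requires to be included in every unique key on the table; names are placeholders):
ALTER TABLE versions
  PARTITION BY RANGE (TO_DAYS(EFF_TO)) (
    PARTITION p_hist_2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION p_hist_2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
    PARTITION p_active    VALUES LESS THAN MAXVALUE   -- catches EFF_TO = '2099-12-31'
  );
-- Queries filtering on EFF_TO = '2099-12-31' are then pruned to p_active only.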
application's SELECT query went from 60 seconds to 3 seconds.
There are lots of other reasons why the performance improved beyond just the size of the table:
If it's doing a full table scan, that is a design fault in the application.
Your indexes may not be as up to date as they should be.
The logical structure of the indexes may be unbalanced and need optimizing.
The physical structure of the table and indexes may be fragmented and need optimizing.

MySQL alter table enable keys not as fast as promised

So I have a large table with a bit more than 2 billion records, and 5 multi-column keys.
There are two methods I can use for inserting data:
Method 1
load data infile ...;
Method 2
alter table disable keys;
load data infile ...;
alter table enable keys;
If I'm starting from an empty table, for 2 billion records, method 1 takes about 60 hours (estimated, may be more), while method 2 takes 12 hours to insert the data, and 3 hours recreating the keys. So far so good.
However, if I already have my 2 billion records, and attempt to insert an additional 5 million, method 1 takes about 3 hours, while method 2 takes 30 minutes inserting the data, and a whopping 7 hours recreating the keys. I confirmed that during the entire key regeneration, it used Repair by sorting, so it's not like it fell back to Repair with keycache.
I wonder why this is. MySQL claims that disabling keys is very good for inserting bulk data, but this is obviously dependent on the context. If it is about to regenerate all keys from scratch, why doesn't it take around 3 hours, as when I started with an empty table? or if it inserts keys one by one, why doesn't it take around 3 hours, which is what it took for method 1?
Comments are welcome
The time taken can vary quite a bit apparently.
http://www.mysqlperformanceblog.com/2007/07/05/working-with-large-data-sets-in-mysql/
If you're working with billions of records, and using MySQL 5.1 or above, then you might find partitioning will benefit performance... when working with indexes in a partitioned table, indexes are also partitioned; and because each index is only built against a partition/subset of your total data, the sorting overheads of rebuilding should be significantly less.
"not as fast as promised" - uh, you have 5000000 records, of course it will take a bit longer than inserting 20 records.
With the first method, it is changing the indexes a little bit on every row insert, so they are always consistent with the data.
With the second method, it is rebuilding the indexes by sorting the whole table (2,005,000,000 rows) - which means it's moving a large amount of existing index data to and fro (disk speed is likely to emerge as a bottleneck here), and the cost depends on 1) the amount of existing data, and 2) the amount of new data.
You could use method 3: drop the keys before the second insert (this could take some time, too) and recreate them afterwards. I suspect the time will be similar to recreating the keys after the initial insert.
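A sketch of that method 3 (key and column names are placeholders for your five multi-column keys):
ALTER TABLE bigtable DROP INDEX k1, DROP INDEX k2;   -- ... and the rest of the keys
LOAD DATA INFILE '/path/to/new_rows.txt' INTO TABLE bigtable;
ALTER TABLE bigtable
  ADD INDEX k1 (colA, colB),
  ADD INDEX k2 (colC, colD);                         -- ... and the rest of the keys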
The speeds you are describing are quite reasonable IMHO - just use the fastest method.