MySQL alter table enable keys not as fast as promised

So I have a large table with a bit more than 2 billion records, and 5 multi-column keys.
There are two methods I can use for inserting data:
Method 1
load data infile ...;
Method 2
alter table disable keys;
load data infile ...;
alter table enable keys;
If I'm starting from an empty table, for 2 billion records, method 1 takes about 60 hours (estimated, may be more), while method 2 takes 12 hours to insert the data, and 3 hours recreating the keys. So far so good.
However, if I already have my 2 billion records, and attempt to insert an additional 5 million, method 1 takes about 3 hours, while method 2 takes 30 minutes inserting the data, and a whopping 7 hours recreating the keys. I confirmed that during the entire key regeneration, it used Repair by sorting, so it's not like it fell back to Repair with keycache.
I wonder why this is. MySQL claims that disabling keys is very good for bulk inserts, but this obviously depends on the context. If it regenerates all the keys from scratch, why doesn't it take around 3 hours, as it did when I started with an empty table? Or, if it inserts the keys one by one, why doesn't it take around 3 hours, which is what method 1 took?
Comments are welcome.

The time taken can vary quite a bit apparently.
http://www.mysqlperformanceblog.com/2007/07/05/working-with-large-data-sets-in-mysql/

If you're working with billions of records, and using MySQL 5.1 or above, then you might find partitioning will benefit performance... when working with indexes in a partitioned table, the indexes are also partitioned; and because each index is only built against a partition/subset of your total data, the sorting overhead of rebuilding should be significantly less.
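For illustration only (the table, column, and partition names here are hypothetical, and MyISAM is assumed since DISABLE KEYS only applies to MyISAM's non-unique indexes), a range-partitioned layout might look like this:
CREATE TABLE measurements (
    id BIGINT NOT NULL,
    recorded_at DATETIME NOT NULL,
    value DOUBLE,
    KEY idx_time_id (recorded_at, id)
) ENGINE=MyISAM
PARTITION BY RANGE (YEAR(recorded_at)) (
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- Each partition maintains its own (smaller) index trees, so bulk loads
-- and rebuilds touch far less index data at a time.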

"not as fast as promised" - uh, you have 5000000 records, of course it will take a bit longer than inserting 20 records.
With the first method, it is changing the indexes a little bit on every row insert, so they are always consistent with the data.
With the second method, it is rebuilding the indexes by sorting the whole table (2,005,000,000 rows), which means it is moving a large amount of existing index data to and fro (disk speed is likely to emerge as the bottleneck here); the cost depends on 1) the amount of existing data and 2) the amount of new data.
You could use method 3: drop the keys before the second insert (this could take some time, too) and recreate them afterwards, as sketched below. I suspect the time will be similar to recreating the keys after the initial insert.
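A minimal sketch of that third method, with a hypothetical table big_table and index idx_a_b; the file path is just a placeholder:
-- Drop the secondary index, bulk-load, then rebuild the index once.
ALTER TABLE big_table DROP INDEX idx_a_b;
LOAD DATA INFILE '/tmp/new_rows.csv' INTO TABLE big_table;
ALTER TABLE big_table ADD INDEX idx_a_b (a, b);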
The speeds you are describing are quite reasonable IMHO - just use the fastest method.

Related

MySQL tune slow delete

I'm using MySQL InnoDB; one of the most important tables has over 700 million records and 23 actively used indexes in total.
I am trying to delete records in batches of 2000 based on the record date (with ORDER BY on the primary key column). Each date has around 6 million records, and I delete them date by date with LIMIT 2000. Each batch takes around 25 seconds to complete. Since it is a production database, I want this delete operation to complete faster. Is there a better way to do this?
There are many solutions. See http://mysql.rjweb.org/doc.php/deletebig .
Which takes 25 seconds? A batch of 2000 rows? Or several of those? Are they in a single, big, transaction? Lots of little transactions would be slower, but less invasive. And there would be no ACID problem since re-deleting the same 'date' should be idempotent.
If the table is "locked" at some level for 25 seconds, then I understand your concern. If it is a bunch of sub-second deletes, then does it really matter that it takes a long time?
Furthermore, instead of deleting once a day, you could delete once an hour. This might decrease each task to about 2 seconds instead of 25.
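As a sketch of that kind of chunked delete (table and column names are invented; run it repeatedly until no rows are affected):
-- Delete one day's rows in small batches, keeping each transaction short.
DELETE FROM events
WHERE record_date = '2023-01-15'
ORDER BY id
LIMIT 2000;
-- Repeat from application code or a stored procedure until ROW_COUNT() = 0.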
23 indexes is terribly large. Please provide SHOW CREATE TABLE. It is all too common that there are "redundant" indexes that could (should) be dropped. The main example: INDEX(a,b) takes care of INDEX(a) (but not of b alone), so if you have both, drop INDEX(a).
It may be impractical to make the change now, but PARTITION BY RANGE(TO_DAYS(...)) lets you DROP PARTITION, which is virtually instantaneous. (Adding partitioning to a 700M row table would take a long time.)
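Purely as an illustration of what that would give you (names are hypothetical, and MySQL requires the table's PRIMARY KEY to include the partitioning column):
ALTER TABLE events
PARTITION BY RANGE (TO_DAYS(record_date)) (
    PARTITION p20230101 VALUES LESS THAN (TO_DAYS('2023-01-02')),
    PARTITION p20230102 VALUES LESS THAN (TO_DAYS('2023-01-03')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
);
-- Removing a whole day is then nearly instantaneous:
ALTER TABLE events DROP PARTITION p20230101;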

Choosing the right MySQL structure for a very large time-based dataset

I have been using MySQL for the past few months and I have a good handle on smaller database structures. Now, however, I need to decide on how to create a database that can store a large set of time oriented data in either multiple tables, or a single table.
Using a single table, I have tried partitioning it into yearly segments; however, the load and insert times are still quite long, and searching is especially slow. The data consists of roughly 8000 reporting stations with about 300-500 reports per day (several per hour). The reports go back all the way to 1980, so easily over 120 million data points and growing.
I am not sure what may provide the best results for searching such a vast amount of data, or if it would be better to separate the data into several tables. Each report has only a couple columns of information (time, temperature and wind).
I am sure this question has been asked many times, but any help would be appreciated.
Thank you!
120M rows is big enough to consider PARTITIONing, and that is good for time-based data if you need to delete "old" data, because DROP PARTITION is a lot faster and less invasive than DELETE.
I discuss this at length here.
Loading into a partitioned table should be only slightly slower (or faster in rare cases) than for a non-partitioned table.
Searching problems -- sounds like you did not index the table properly. Some tips:
(Usually) Put the "partition key" last in any index, if it is needed at all.
Use PARTITION BY RANGE(TO_DAYS(...)) only.
40 years? 40 partitions is reasonable.
Do not partition by station, but probably use that column at the start of some indexes.
Please show me the CREATE TABLE so I can be more specific in my tips.
If you won't be deleting 'old' rows, then partitioning is probably a waste. Let's see some of the queries.
On the other hand, if you often use a date range and several stations, then you have the "2D index problem". Partition by year and start the PRIMARY KEY with station (a sketch follows at the end of this answer).
Do not use multiple tables. This is a common Question on this forum, and the answer is always the same.
Quite possibly you need some sort of "summary table". It might include high, low, average temp, etc for each week. For, say, a multi-year temperature graph, this is clearly 7 times as fast. More here.
Inserting only 37 rows/second should not be a problem, even on a slow HDD. If they come in batches, then batch the INSERTs via multiple rows per INSERT statement or via LOAD DATA.
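To make the tips above concrete, here is a sketch of one possible layout; the column names and types are assumptions for illustration, not a prescription:
CREATE TABLE reports (
    station_id SMALLINT UNSIGNED NOT NULL,
    observed_at DATETIME NOT NULL,
    temperature DECIMAL(4,1),
    wind DECIMAL(5,1),
    PRIMARY KEY (station_id, observed_at)  -- station first, partition key last
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(observed_at)) (
    PARTITION p1980 VALUES LESS THAN (TO_DAYS('1981-01-01')),
    PARTITION p1981 VALUES LESS THAN (TO_DAYS('1982-01-01')),
    -- ... one partition per year ...
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);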

Is adding and dropping indexes everyday on huge tables a good practice?

I'm building a Web Application that is connected to a MySQL database.
I've got two huge tables, each containing about 40 million rows at the moment, and they receive new rows every day (roughly 500,000-1,000,000 rows per day).
The process to add new rows runs during the night, while no one can use the application, and the new rows' content depends on the result of some basic SELECT queries on the current database.
In order to get the result of those SELECT statement fast enough, I'm using simple indexes (one column per index) on each column that appears at least once in a WHERE clause.
The thing is, during the day, some totally different queries are run against those tables, including some "range WHERE clause" (SELECT * FROM t1 WHERE a = a1 AND b = b1 AND (date BETWEEN d1 AND d2)).
I found on stack this very helpful mini-cookbook that advises you on which INDEXes you should use depending on how the database is queried: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
They advise using a compound index: for my example query above, that would give INDEX(a, b, date).
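For reference, the kind of index being recommended, using the column names from my example query above (a sketch, not my exact DDL):
ALTER TABLE t1 ADD INDEX idx_a_b_date (a, b, date);
-- Serves: SELECT * FROM t1 WHERE a = a1 AND b = b1 AND date BETWEEN d1 AND d2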
It indeed increased the speed of the queries run during the day (from 1 minute to 8 seconds, so I was truly happy).
However, with those compound indexes, the time required to add new rows during the night totally explodes (it would take more than one day to add the daily content).
Here is my question: would that be ok to drop all the indexes every night, add the new content, and set back up the daily indexes?
Or would that be dangerous since indexes are not meant to be rebuilt every day, especially on such big tables?
I know such an operation would take approximately two hours in total (drop and recreate INDEXes).
I am aware of the existence of ALTER TABLE table_name DISABLE KEYS; but I'm using InnoDB, and I believe it does not work on InnoDB tables.
I believe you have answered your own question: You need the indexes during the day, but not at night. Given what you describe, you should drop the indexes for the bulk inserts at night and re-create them afterwards. Dropping indexes for data loads is not unheard of, and seems appropriate in your case.
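A minimal sketch of that nightly cycle, with made-up table and index names:
-- Before the nightly bulk load:
ALTER TABLE t1 DROP INDEX idx_a_b_date;
-- ... run the nightly INSERTs / LOAD DATA here ...
-- After the load finishes:
ALTER TABLE t1 ADD INDEX idx_a_b_date (a, b, date);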
I would ask about how you are inserting new data. One method is to insert the values one row at a time. Another is to put the values into a temporary table (with no index) and do a bulk insert:
insert into bigtable( . . .)
select . . .
from smalltable;
These have different performance characteristics. You might find that using a single insert (if you are not already doing so) is fast enough for your purposes.
A digression... PARTITIONing by date should be very useful for you since you are deleting things over a year ago. I would recommend PARTITION BY RANGE(TO_DAYS(...)) and breaking it into 14 or 54 partitions (months or weeks, plus some overhead). This will eliminate the time it takes to delete the old rows, since DROP PARTITION is almost instantaneous.
More details are in my partition blog. Your situation sounds like both Use case #1 and Use case #3.
But back to your clever idea of dropping and rebuilding indexes. To others, I point out the caveat that you have the luxury of not otherwise touching the table for long enough to do the rebuild.
With PARTITIONing, all the rows being inserted will go into the 'latest' partition, correct? This partition is a lot smaller than the entire table, so there is a better chance that the indexes will fit in RAM, thereby be 10 times as fast to update (without rebuilding the indexes). If you provide SHOW CREATE TABLE, SHOW TABLE STATUS, innodb_buffer_pool_size, and RAM size, I can help you do the arithmetic to see if your 'last' partition will fit in RAM.
A note about index updates in InnoDB -- they are 'delayed' by sitting in the "Change buffer", which is a portion of the buffer_pool. See innodb_change_buffer_max_size, available since 5.6. Are you using that version, or newer? (If not, you ought to upgrade, for many reasons.)
The default for that setting is 25, meaning that up to 25% of the buffer_pool can be used for pending updates to secondary indexes, as caused by INSERT, etc. That acts like a "cache", such that multiple updates to the same index block are held there until they get bumped out. A higher setting should make index updates hit the disk less often, hence finish faster.
Where I am heading with this... By increasing this setting, you would make the inserts (direct, not rebuild) more efficient. I'm thinking that this might speed it up:
Just before the nightly INSERTs:
innodb_change_buffer_max_size = 50 (50 is the documented maximum for this setting)
innodb_old_blocks_pct = 10
Soon after the nightly INSERTs:
innodb_change_buffer_max_size = 25
innodb_old_blocks_pct = 37
(I am not sure about that other setting, but it seems reasonable to push it out of the way.)
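Both settings are dynamic, so a nightly job could flip them without a restart; a sketch using the values suggested above:
-- Just before the nightly load:
SET GLOBAL innodb_change_buffer_max_size = 50;  -- documented maximum
SET GLOBAL innodb_old_blocks_pct = 10;
-- ... nightly INSERTs ...
-- Soon after the load finishes, restore the defaults:
SET GLOBAL innodb_change_buffer_max_size = 25;
SET GLOBAL innodb_old_blocks_pct = 37;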
Meanwhile, what is the setting of innodb_buffer_pool_size? Typically, it should be 70% of available RAM.
In a similar application, I had big, hourly, dumps to load into a table, and a 90-day retention. I stretched my Partition rules by having 90 daily partitions and 24 hourly partitions. Every night, I spent a lot of time (but less than an hour) doing REORGANIZE PARTITION to turn the 24 hourly partitions into a new daily partition (and dropping the 90-day-old partition). During each hour, the load had the added advantage that nothing else was touching the 1-hour partition -- I could do normalization, summarization, and loading all in 7 minutes. The entire 90 days fit in 400GB. (Side note: a large number of partitions is a performance killer until 8.0, so don't even consider daily partitions for your 1-year retention.)
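For illustration, the nightly roll-up of hourly partitions into a daily one looked roughly like this (partition names here are invented):
ALTER TABLE log
    REORGANIZE PARTITION h00, h01, h02, h03, h04, h05, h06, h07, h08, h09, h10, h11,
                         h12, h13, h14, h15, h16, h17, h18, h19, h20, h21, h22, h23
    INTO (PARTITION d20230115 VALUES LESS THAN (TO_DAYS('2023-01-16')));
-- And the retention cut-off:
ALTER TABLE log DROP PARTITION d20221017;  -- the 90-day-old daily partition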
The summary tables made it so that 50-minute queries (in the prototype) shrank to only 2 seconds. Perhaps you need a summary table with PRIMARY KEY (a, b, date)? That would let you get rid of such an index on the 'Fact' table. Oops, that eliminates the entire premise of your original question! See the links at the bottom of my blogs; look for "Summary Tables". A general rule: don't have any indexes (other than the PRIMARY KEY) on the Fact table; use summary tables for things that need messier indexes.
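A hedged sketch of such a summary table, reusing the (a, b, date) names from the question; the aggregate columns and refresh window are examples only:
CREATE TABLE summary_abd (
    a INT NOT NULL,
    b INT NOT NULL,
    dy DATE NOT NULL,
    row_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (a, b, dy)
);
-- Refreshed after each nightly load, e.g. for yesterday's rows:
INSERT INTO summary_abd (a, b, dy, row_count)
SELECT a, b, DATE(`date`), COUNT(*)
FROM t1
WHERE `date` >= CURDATE() - INTERVAL 1 DAY
  AND `date` <  CURDATE()
GROUP BY a, b, DATE(`date`)
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count);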

Fill Factor And Insert Speed

I have 3 very large tables with clustered indexes on composite keys. No updates, only inserts. New inserts will not be within the existing index range, but they will not align with the clustered index either, and these tables get a lot of inserts (hundreds to thousands per second). What I would like to do is DBREINDEX with Fill Factor = 100, then set a Fill Factor of 5 and have that Fill Factor applied ONLY to inserts. Right now a Fill Factor applies to the whole table only. Is there a way to have a Fill Factor that applies to inserts (or inserts and updates) only? I don't care about select speed at this time; I am loading data. When the data load is complete I will DBREINDEX at 100. A Fill Factor of 10 versus 30 doubles the rate at which new data is inserted. This load will take a couple of days, and the system cannot go live until the data is loaded. The clustered indexes are aligned with the dominant query used by the end-user application.
My practice is to DBREINDEX daily, but the problem is that now that the tables are getting large, a DBREINDEX at fill factor 10 takes a long time. I have considered loading into "daily" tables and then inserting that data daily, sorted by the clustered index, into the production tables.
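For reference, the two rebuilds being weighed here look roughly like this in T-SQL (table and index names are placeholders; ALTER INDEX ... REBUILD is the newer equivalent of DBCC DBREINDEX):
-- During the load: leave plenty of free space per page so out-of-order
-- inserts cause fewer page splits.
ALTER INDEX PK_BigTable ON dbo.BigTable REBUILD WITH (FILLFACTOR = 10);
-- After the load completes: pack pages fully for read performance.
ALTER INDEX PK_BigTable ON dbo.BigTable REBUILD WITH (FILLFACTOR = 100);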
If you've read this far, there's even more. The indexes are all composite, and I am running 6 instances of the parser on an 8-core server (a lot of testing suggests that gives the best throughput). The data out of a SINGLE parser is in PK order, and I am doing the inserts 990 values at a time (SQL value limits). The 3 active tables only share data via a foreign key relationship with a single, relatively inactive 4th table. My thought at this time is to have holding tables for each parser, and then have another process that polls those tables for the next complete insert and moves the data into the production table in PK order. That is going to be a lot of work; I hope someone has a better idea.
The parses start in PK order but rarely finish in PK order. Some individual parses are so large that I could not hold all the data in memory until the end. Right now the SQL insert is slightly faster than the parse that creates the data. Within an individual parse I run the insert asynchronously and go on parsing, but I don't start the next insert until the prior one is complete.
I agree you should have holding tables for the parser data and only insert into the main tables when you're ready. I implemented something similar in a former life (the data was quasi-hashed into 10 tables based on mod 10 of the unique ID, then rolled into the primary table later, primarily to assist load speed). If you're going to use holding tables, then I see no need to have them at anything but FF = 100. The fewer pages you have to use, the better.
Apparently, too, you should test the difference between permanent tables, #temp tables, and table-valued parameters. :-)

MySQL - why not index every field?

Recently I've learned the wonder of indexes, and performance has improved dramatically. However, with all I've learned, I can't seem to find the answer to this question.
Indexes are great, but why couldn't someone just index all fields to make the table incredibly fast? I'm sure there's a good reason not to do this, but how about three fields in a thirty-field table? Ten in a thirty-field table? Where should one draw the line, and why?
Indexes take up space in memory (RAM); with too many or too-large indexes, the DB will have to swap them to and from disk. They also increase insert and delete time (each index must be updated for every row inserted, deleted, or updated).
You don't have infinite memory. Making it so all indexes fit in RAM = good.
You don't have infinite time. Indexing only the columns you need indexed minimizes the insert/delete/update performance hit.
Keep in mind that every index must be updated any time a row is updated, inserted, or deleted. So the more indexes you have, the slower performance you'll have for write operations.
Also, every index takes up further disk space and memory space (when called), so it could potentially slow read operations as well (for large tables).
You have to balance CRUD needs. Writing to the tables becomes slow. As for where to draw the line, that depends on how the data is being accessed (sorting, filtering, etc.).
Indexing takes up more allocated space, both on disk and in RAM, but it also improves performance a lot. Unfortunately, once the indexes no longer fit in memory, the system falls back to disk and performance suffers. In practice, you shouldn't index any field that is never involved in data traversal, neither in inserts nor in searching (WHERE clauses), but you should index the ones that are. Fields you can consider leaving unindexed are those queried only rarely (say, by a moderator), unless those queries need to be fast too.
It is not a good idea to index all the columns in a table. While that makes the table very fast to read from, it becomes much slower to write to. Writing to a table that has every column indexed means inserting the new record into the table and then inserting each column's value into its own index structure.
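A tiny illustration of that write amplification; the table is made up, and each secondary index is one more structure the engine must update on every write:
CREATE TABLE customer (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    email VARCHAR(100),
    city VARCHAR(50),
    INDEX idx_name (name),
    INDEX idx_email (email),
    INDEX idx_city (city)
);
-- This single insert updates the clustered index plus all three
-- secondary indexes:
INSERT INTO customer VALUES (1, 'Ada', 'ada@example.com', 'London');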
This answer is my personal opinion; I am reasoning about it with some simple math.
The second question was about where to draw the line. First, a rough calculation: suppose a table has N rows and L fields. If we index all the fields, we get L new index structures, each keeping the data of its field in a meaningful (sorted) order. At first glance, if your table weighs W, it becomes roughly 2W (1 TB becomes 2 TB). If you have 100 big tables (I have worked on a project where the table count was around 1,800), you waste 100 times that space (100 TB), which is far from wise.
If we index everything, we also have to think about index maintenance: a single update triggers an update of every index, which is roughly equivalent in time to an unordered scan of them all.
From this I conclude that, in this scenario, if you must lose time somewhere it is preferable to lose it in a SELECT rather than in an UPDATE, because selecting on an unindexed field does not trigger extra work on all the other unindexed fields.
What to index?
Foreign keys: indexing them is a must.
Primary key: I am not sure there is much to decide here; in practice it is always indexed (in InnoDB it is the clustered index).
Other fields: the first natural answer is half of the remaining fields. Why? If you should have indexed more, you are not far from the best answer; if you should have indexed fewer, you are also not far, because we know that indexing nothing is bad and indexing everything is also bad.
From these three points I conclude that, with L fields of which K are keys, the limit should be somewhere near ((L-K)/2)+K, plus or minus roughly L/10 (for example, with L = 30 fields and K = 2 keys, that is (30-2)/2 + 2 = 16, give or take 3).
This answer is based on my own logic and personal practice.
First of all, at least in SAP ABAP and its underlying database tables, we can create one index table covering all the required index fields, holding only their addresses. Other SQL-based database systems could likewise use a single table for all the fields to be indexed.
Secondly, how much does write performance really matter? Say a company records 50 sales orders in a day, and assume there is a sales order header table VBAK with 30 fields, each 20 characters long.
Writing to the real table takes seconds, while the index table can be maintained in the background. If a report is run at the same time, the database can make the report wait while an index write is still in progress (say 5 sales orders are being recorded at once and take maybe 5 seconds), so the report waits 5 seconds and then runs for 5 seconds, 10 seconds in total.
Without the index, the report does not wait those 5 seconds, but it runs for maybe 40 seconds.
So what does write performance really mean when nobody writes thousands of records at the same time, but plenty of people read them?
And reading from a second table means the fields are already sorted: with 3 selected fields I can find which sorted sets I need to search, then fetch the rows. As for RAM and memory, it is just a copied index table holding one piece of data per field, the address. How much memory can that take?
I think this is one of the secrets software companies hide from their customers, so as not to wake them up; otherwise they would not need to buy another expensive system in the future.