InnoDB performance when adding/altering a primary index - MySQL

So I have a huge update where I have to insert around 40 GB of data into an InnoDB table. It's taking quite a while, so I'm wondering which method would be the fastest (and more importantly why, as I could just do a split test).
Method 1)
a) Insert all rows
b) ALTER TABLE su_tmp_matches ADD PRIMARY KEY (id)
Method 2)
a) ALTER TABLE su_tmp_matches ADD PRIMARY KEY (id)
b) Insert all rows
Currently we are using method 1, but step b) seems to take a huge amount of time. So I'm wondering whether the size here (40 GB, 5 million rows) has any implications (both methods are sketched below).
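For reference, the two methods spelled out in SQL (just a sketch; I'm assuming the rows arrive via LOAD DATA, and the file name is illustrative):
-- Method 1: load first, add the primary key afterwards
LOAD DATA INFILE '/tmp/matches.tsv' INTO TABLE su_tmp_matches;
ALTER TABLE su_tmp_matches ADD PRIMARY KEY (id);
-- Method 2: add the primary key first, then load
ALTER TABLE su_tmp_matches ADD PRIMARY KEY (id);
LOAD DATA INFILE '/tmp/matches.tsv' INTO TABLE su_tmp_matches;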
---- so I decided to test this as well ---
Pretty quick, brand new MySQL server: loads of fast RAM, fast disks, and pretty well tuned (we handle more than 5000 requests per second on one of these):
1.6 million rows / 6 GB of data:
81 seconds to "delete" a primary index
550 seconds to "add" a primary index (after data is added)
120 seconds to create a copy of the table with the primary index create BEFORE data insert
80 seconds to create a copy of the table without the primary index (which then is 550 seconds to create afterwards)
Seems pretty absurd. The question is whether the resulting indexes are really the same thing.

From the documentation:
InnoDB does not have a special optimization for separate index
creation the way the MyISAM storage engine does. Therefore, it does
not pay to export and import the table and create indexes afterward.
The fastest way to alter a table to InnoDB is to do the inserts
directly to an InnoDB table.
It seems to me that adding the uniqueness constraint before the insert could only help the engine if the primary key column is an auto-incremented integer. But I really doubt there would be a notable difference.
A useful recommendation:
During the conversion of big tables, increase the size of the InnoDB
buffer pool to reduce disk I/O, to a maximum of 80% of physical
memory. You can also increase the sizes of the InnoDB log files.
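For example, on a dedicated 64 GB server (a sketch; the 80% figure comes from the quoted recommendation, the concrete number is an assumption):
-- Dynamic as of MySQL 5.7.5; on older versions set innodb_buffer_pool_size in my.cnf and restart.
SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024;
-- innodb_log_file_size is not dynamic, so larger log files have to be configured in my.cnf.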
EDIT: since, in my experience, MySQL doesn't always perform as the documentation suggests, I think any benchmark you do on this would be interesting, even if not a definitive answer per se.

Related

MySQL ADD COLUMN slow under AWS RDS

I have an RDS MySQL instance with the following settings:
Class: db.m5.xlarge
Storage: Provisioned 1000 IOPS (SSD)
I then want to add a few columns to a table that is about 20 GB in size (according to INFORMATION_SCHEMA.files). Here's my statement:
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL;
This query took 172 minutes to execute. Most of this time was spent copying the data to a temporary table.
During that operation, there were no other queries (read or write) being executed. I had the database just for myself. SHOW FULL PROCESSLIST was saying that State was equal to copy to tmp table for my query.
What I don't understand is that the AWS RDS console tells me the write throughput was between 30 MB/s and 35 MB/s for those 172 minutes.
Assuming a write throughput of 30 MB/s, I should have been able to write 30 * 60 * 172 = 309600 MB ≈ 302 GB. This is much bigger than the size of the temporary table that was created during the operation (20 GB).
So two questions:
what is MySQL/RDS writing besides my temp table? Is there a way to disable that so I can get the full bandwidth for creating the temp table?
is there any way to accelerate that operation? Taking 3 hours to write 20 GB of data seems pretty long.
I was using MySQL 5.7. According to this MySQL blog post, version 8.0 improved the situation: "InnoDB now supports Instant ADD COLUMN".
I therefore changed my query to use the new feature.
-- Completes in 0.375 seconds!
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
-- 'ALGORITHM=INSTANT' is not compatible with foreign keys.
-- The foreign key will need to be added in another statement
-- ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL,
-- the new option
ALGORITHM=INSTANT;
-- This completed in about 6 minutes.
-- Adding the foreign key creates an index under the hood.
-- This index was 1.5 GB in size.
SET FOREIGN_KEY_CHECKS=0;
ALTER TABLE MY_TABLE
ADD FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK);
SET FOREIGN_KEY_CHECKS=1;
So my conclusions:
upgrade to MySQL 8 if you can
make sure you use the ALGORITHM=INSTANT option whenever possible.
InnoDB is probably the storage engine you are using, since it's the default storage engine. InnoDB does some I/O that might seem redundant, to ensure there is no data loss.
For example (a few of the relevant settings can be inspected with the SHOW VARIABLES sketch after this list):
Data and index pages modified in the buffer pool must be written to the tablespace. The table may need to split some pages during the process of adding columns, because the rows become wider, and fewer rows fit per page.
During writing pages to the tablespace, InnoDB first writes those pages to the doublewrite buffer, to ensure against data loss if there's a crash during a page write.
Transactions are written to the InnoDB redo log, and this may even result in multiple overwrites to the same block in the log.
Transactions are also written to the binary log if it is enabled for purposes of replication. Though this shouldn't be a big cost in the case of an ALTER TABLE statement, because DDL statements are always written to the binary log in statement format, not in row format.
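For instance, the settings behind most of that extra I/O can be inspected like this (just what to look at, not recommended values):
SHOW VARIABLES LIKE 'innodb_doublewrite';   -- doublewrite buffer on/off
SHOW VARIABLES LIKE 'innodb_log_file_size'; -- redo log sizing
SHOW VARIABLES LIKE 'log_bin';              -- whether the binary log is enabled
SHOW VARIABLES LIKE 'binlog_format';        -- statement vs. row format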
You also asked what can be done to speed up the ALTER TABLE. The reason to want it to run faster is usually because during an ALTER TABLE, the table is locked and may block concurrent queries.
At my company, we use the free tool pt-online-schema-change, so we can continue to use the table more or less freely while it is being altered. It actually takes longer to complete the alter this way, but it's not so inconvenient since it doesn't block our access to the table.
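An illustrative invocation (connection details omitted; the database name is an assumption, and the --alter text is just the column-adding part of the statement above):
pt-online-schema-change \
  --alter "ADD COLUMN NEW_COLUMN_1 DECIMAL(39,30) NULL, ADD COLUMN NEW_COLUMN_4 DATE NULL" \
  D=my_database,t=MY_TABLE \
  --execute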

Create foreign key on MySQL table takes forever with copy to tmp table

I am trying to set a foreign key constraint on a 5.7 InnoDB table with 30M+ rows.
It has now been running for 45 minutes on a quad-core, 64 GB server. The processlist shows the state copy to tmp table for the issued ALTER TABLE command.
innodb_buffer_pool_size is set to 32G and has room to spare.
Why does the system create a tmp table, and can the performance of this somehow be improved?
It's likely that the time is being taken building an index for that foreign key. If you already had an index where the foreign key column(s) were the leftmost columns of the index, then it would use that index and not build a new one.
45 minutes doesn't sound like an unusual amount of time to build an index on such a large table. You haven't said what the data type of the foreign key column(s) is, so perhaps it's a large varchar or something and it is taking many gigabytes to build that index.
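A sketch of that idea with hypothetical table and column names: create (or verify) the supporting index first, so the foreign key can reuse it instead of building one implicitly.
-- Index whose leftmost column(s) match the foreign key column(s).
ALTER TABLE child_table ADD INDEX idx_parent_id (parent_id);
-- The foreign key can now use idx_parent_id rather than creating its own index.
-- With SET foreign_key_checks = 0 (as in a later answer) the FK addition can also avoid a table copy.
ALTER TABLE child_table
  ADD CONSTRAINT fk_child_parent
  FOREIGN KEY (parent_id) REFERENCES parent_table (id);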
Perhaps your server's disk is too slow. If you're using non-SSD storage, or remote storage (like Amazon EBS), it's slow by modern standards.
The number of CPU cores isn't going to make any difference, because the work is being done in one thread anyway. A faster CPU clock speed would help, but not more cores.
At my company, we use pt-online-schema-change to apply all schema changes or index builds. This allows clients to read and write the table concurrently, so it doesn't matter that it takes 45 minutes or 90 minutes or even longer. Eventually it finishes, and swaps the new table for the old table.
Attention! This disables key checking, so know what you are doing; in some cases it is not recommended, but it can help many people, so I think it's worth answering.
I had this problem this week. I have a client that still runs MySQL 5.5, so I had to make it work. You just need to disable key checking and put your application into maintenance mode (so there are no other operations running).
Before creating your FK or adding a column, use:
ALTER TABLE table_name DISABLE KEYS;
Then run your command; my table with 1M rows took only 57 seconds.
Then you run:
ALTER TABLE table_name ENABLE KEYS;

Handling huge MyISAM table for optimisation

I have a huge (and growing) MyISAM table (700 million rows = 140 GB).
CREATE TABLE `keypairs` (
`ID` char(60) NOT NULL,
`pair` char(60) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM
The table option was changed to ROW_FORMAT=FIXED, because both columns are always fixed length at the maximum (60). And yes, sadly, ID is a string and not an INT.
SELECT queries are pretty OK speed-wise.
The databases and the MySQL engine are all on 127.0.0.1/localhost (nothing remote).
Sadly, INSERT is slow as hell. I won't even talk about trying to LOAD DATA millions of new rows... it takes days.
There won't be any concurrent reads on it. All SELECTs are done one by one, only by my local server (it is not for clients' use).
(For info: file sizes are .MYD = 88 GB, .MYI = 53 GB, .TMM = 400 MB.)
How could I speed up inserts into that table?
Would it help to PARTITION that huge table? (And if so, how?)
I heard MyISAM uses the .frm files as a "structure cache", and that a line in the config file helps MySQL keep all the .frm files in memory (in case of partitioning). Would that help too? Actually, my .frm file is only 9 KB for 700 million rows.
What about a string-shortening/compression function for the ID string (same idea as rainbow tables)? Even if it lowers the maximum number of allowed unique IDs, I will never reach the max of 60 chars anyway, so maybe it's an idea? But before creating a new unique ID I would of course have to check that the shortened string doesn't already exist in the DB.
Same idea as shortening the ID strings: what about using MD5() on the ID? Does a shorter string mean faster inserts in that case?
Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
Smaller means more cacheable means less I/O means faster.
The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably slow things down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
When using InnoDB, it would be good to insert about 1000 rows per transaction. Fewer than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown. (A sketch of this and the key_buffer_size suggestion follows.)
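A quick sketch of those two suggestions (the concrete numbers are assumptions, not tested values):
-- MyISAM load: give the key buffer most of the RAM for the duration of the load
-- (here ~70% of an assumed 64 GB box, and not larger than the .MYI file).
SET GLOBAL key_buffer_size = 40 * 1024 * 1024 * 1024;
-- InnoDB load (if you convert): batch roughly 1000 rows per transaction.
-- The values below are dummies built with RPAD, just to keep the example runnable.
START TRANSACTION;
INSERT INTO keypairs (ID, pair) VALUES
  (RPAD('a', 60, '0'), RPAD('p', 60, '0')),
  (RPAD('b', 60, '0'), RPAD('q', 60, '0'));
  -- ... continue up to ~1000 rows, then
COMMIT;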
Hex ID
Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30):
CREATE TABLE new (
ID BINARY(30) NOT NULL,
`pair` char(60) NOT NULL
-- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
SELECT UNHEX(ID),
pair
FROM keypairs;
ALTER TABLE new ADD
PRIMARY KEY (`ID`); -- For InnoDB, I would do this differently
RENAME TABLE keypairs TO old,
new TO keypairs;
DROP TABLE old;
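Once converted, lookups go through UNHEX()/HEX() as described above; a usage sketch (the 60-hex-digit literal is just an illustration):
SELECT HEX(ID) AS ID, pair
FROM keypairs
WHERE ID = UNHEX('012345678901234567890123456789012345678901234567890123456789');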
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have such.
The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.
Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
LOAD the data into tmp.
ALTER TABLE tmp ORDER BY ID; This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
INSERT INTO keypairs SELECT * FROM tmp; This will maximize the caching by feeding rows to keypairs in ID order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.
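A sketch of those four steps (the input file path and its tab-separated format are assumptions):
-- 1) Staging table with no indexes.
CREATE TABLE tmp (
  ID   BINARY(30) NOT NULL,
  pair CHAR(60)   NOT NULL
) ENGINE=MyISAM;
-- 2) Load, packing the 60-hex-digit ID as it arrives.
LOAD DATA INFILE '/tmp/keypairs.tsv'
INTO TABLE tmp
FIELDS TERMINATED BY '\t'
(@hex_id, pair)
SET ID = UNHEX(@hex_id);
-- 3) Physically sort the staging table by ID (still no index).
ALTER TABLE tmp ORDER BY ID;
-- 4) Feed the target table in ID order to maximize caching.
INSERT INTO keypairs SELECT * FROM tmp;
DROP TABLE tmp;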
Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.
For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.
If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
Also make sure to tune MyISAM. There aren't many tuning options (a combined sketch follows this list):
Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default, I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience.
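A combined sketch of those options (the sizes are assumptions, not tested values):
SET GLOBAL key_buffer_size = 8 * 1024 * 1024 * 1024;      -- as much as you can spare for the indexes
SET GLOBAL bulk_insert_buffer_size = 256 * 1024 * 1024;   -- helps LOAD DATA INFILE
-- Pre-load the PRIMARY KEY into a dedicated key cache.
SET GLOBAL hot_cache.key_buffer_size = 4 * 1024 * 1024 * 1024;
CACHE INDEX keypairs IN hot_cache;
LOAD INDEX INTO CACHE keypairs;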
I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.
...is there a MySQL command to ALTER TABLE and add an auto-increment INT ID column automatically?
Yes, see my answer to https://stackoverflow.com/a/251630/20860
First, your primary key is not incrementable.
Which means, roughly: at every insert the index has to be rebalanced.
No wonder it slows to a crawl on a table of such a size.
And such an engine...
So, to the second point: what's the point of keeping that old MyISAM junk?
Like, for example, you don't mind losing a row or two (or a dozen) in case of an accident? And so on, even setting aside that the current MySQL maintainer (Oracle Corp) explicitly discourages the use of MyISAM.
So, here are possible solutions:
1) Switch to InnoDB;
2) If you can't surrender the char ID, then:
Add an auto-increment numeric key and make it the primary key - then the index would be clustered and the cost of inserts would drop significantly (see the sketch after this list);
Turn your current key into a secondary (unique) index;
3) If you can give up the char ID entirely, the solution is obvious.
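A sketch of options 1 and 2 combined in one statement (note that this rebuilds the whole 140 GB table, so it is itself a long operation):
ALTER TABLE keypairs
  ENGINE = InnoDB,
  DROP PRIMARY KEY,
  ADD COLUMN seq BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
  ADD PRIMARY KEY (seq),
  ADD UNIQUE KEY (ID);
-- seq becomes the clustered key; ID stays unique but is now a secondary index.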

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first I get around 600 insertions/sec, but after ~30000 rows I see a significant slowdown (around 200 insertions/sec), and then a further, slower but still noticeable, decline.
I can see that as the table grows, the I/O wait numbers get higher and higher. My first thought was memory taken by the index, but this is done on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30000 rows taking that much memory, let alone just the indexes (the whole MySQL data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect some configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM with 760 MB allocated to it.
Mysql 5.1, out of the box configuration with separate files for tables
[EDIT]
Many thanks to Eric Holmberg; he nailed it. Here are the graphs after setting innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE TABLE / OPTIMIZE TABLE
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
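A sketch of the LOAD DATA route for the events table above (the file path and the hex encoding of the binary id column are assumptions):
LOAD DATA INFILE '/tmp/events.tsv'
INTO TABLE events
FIELDS TERMINATED BY '\t'
(@hex_id, body)
SET id = UNHEX(@hex_id);
-- added_id is omitted from the column list, so it keeps auto-incrementing as usual.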

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20 GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding it is, when your current method is taking days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB efficiently: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the ID field into a column in the staging table. Then I update where the ID field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull my 22 GB file into staging in around 16 minutes) instead of working row by row?
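In MySQL terms the same staging pattern might look roughly like this (table and column names are illustrative assumptions; the real table is the 4-integer-column one from the question):
-- 1) Unconstrained staging table, bulk-loaded from a file instead of row-by-row INSERTs.
CREATE TABLE staging (
  hash_pk BIGINT NOT NULL,   -- assumed: the hash used as the primary key
  val     BIGINT NOT NULL
) ENGINE=MyISAM;
LOAD DATA INFILE '/tmp/mined_rows.csv'
INTO TABLE staging
FIELDS TERMINATED BY ','
(hash_pk, val);
-- 2) Collapse duplicates inside staging first, then merge into the real table,
--    reproducing INSERT ... ON DUPLICATE KEY UPDATE val = val + whatever.
INSERT INTO final_table (hash_pk, val)
SELECT * FROM (
  SELECT hash_pk, SUM(val) AS total
  FROM staging
  GROUP BY hash_pk
) AS dt
ON DUPLICATE KEY UPDATE val = val + total;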