Handling huge MyISAM table for optimisation - mysql

I have a huge (and growing) MyISAM table (700 million rows ≈ 140 GB).
CREATE TABLE `keypairs` (
  `ID` char(60) NOT NULL,
  `pair` char(60) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM
The table option was changed to ROW_FORMAT=FIXED, because both columns are always fixed length at the maximum (60). And yes, sadly, ID is a string and not an INT.
SELECT queries are reasonably fast.
The database and the MySQL server are both on 127.0.0.1/localhost (nothing remote).
Sadly, INSERT is slow as hell. I'm not even talking about trying to LOAD DATA millions of new rows... that takes days.
There won't be any concurrent reads on it. All SELECTs are done one by one, by my local server only (it is not for clients' use).
(For info: file sizes are .MYD = 88 GB, .MYI = 53 GB, .TMM = 400 MB.)
How could I speed up inserts into that table?
Would it help to PARTITION that huge table? (And how would I do that?)
I heard MyISAM uses a "structure cache" in the form of .frm files, and that a line in the config file helps MySQL keep all the .frm files in memory (in case of a partitioned table). Would that help too? (Actually, my .frm file is only 9 KB for 700 million rows.)
What about a string shortening/compression function on the ID string (same idea as rainbow tables)? Even if it lowers the maximum number of allowed unique IDs, I will never reach the max of 60 chars anyway, so maybe it's an idea? But before creating a new unique ID I would of course have to check that the shortened string doesn't already exist in the DB.
Same idea as shortening the ID strings: what about using md5() on the ID? Does a shorter string mean faster inserts in that case?

Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
Smaller means more cacheable means less I/O means faster.
The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file (see the sketch after this list of points). I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably slow down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
In using InnoDB, it would be good to try to insert 1000 rows per transaction. Less than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
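Here is a minimal sketch of the key_buffer_size change mentioned above, applied just for the duration of the LOAD (the sizes are illustrative and assume roughly 64 GB of RAM; key_buffer_size is dynamic, so no restart is needed):
SET GLOBAL key_buffer_size = 45 * 1024 * 1024 * 1024;  -- ~70% of RAM, capped by the .MYI size
-- ... run the LOAD DATA / INSERTs here ...
SET GLOBAL key_buffer_size = 8 * 1024 * 1024 * 1024;   -- drop back to a normal value afterwards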
Hex ID
Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30):
CREATE TABLE new (
  ID BINARY(30) NOT NULL,
  `pair` CHAR(60) NOT NULL
  -- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
  SELECT UNHEX(ID), pair
    FROM keypairs;
ALTER TABLE new
  ADD PRIMARY KEY (`ID`);   -- For InnoDB, I would do this differently
RENAME TABLE keypairs TO old,
             new TO keypairs;
DROP TABLE old;
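After the conversion, point lookups and reads would look like this (the 60-hex-digit value below is just a placeholder):
SELECT HEX(ID) AS ID, pair
  FROM keypairs
  WHERE ID = UNHEX('0123456789abcdef0123456789abcdef0123456789abcdef012345678901');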
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have them.
The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.
1. Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
2. LOAD the data into tmp.
3. ALTER TABLE tmp ORDER BY ID; This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
4. INSERT INTO keypairs SELECT * FROM tmp; This will maximize the caching by feeding rows to keypairs in ID order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.
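Putting those four steps into SQL, a sketch would be (the LOAD DATA file name and column handling are placeholders; adapt them to your input format):
CREATE TABLE tmp (
  ID BINARY(30) NOT NULL,
  pair CHAR(60) NOT NULL
) ENGINE=MyISAM;                           -- step 1: no indexes yet
LOAD DATA INFILE '/path/to/new_rows.csv'   -- step 2
  INTO TABLE tmp
  FIELDS TERMINATED BY ','
  (@hex_id, pair)
  SET ID = UNHEX(@hex_id);
ALTER TABLE tmp ORDER BY ID;               -- step 3: sort; still no index
INSERT INTO keypairs                       -- step 4: feed rows to keypairs in ID order
  SELECT * FROM tmp;
DROP TABLE tmp;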

Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.
For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.
If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
Also make sure to tune MyISAM. There aren't many tuning options:
Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default, I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience.
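As a rough sketch, those settings as runtime statements (all values here are illustrative, not tested recommendations):
SET GLOBAL key_buffer_size         = 8 * 1024 * 1024 * 1024;  -- as much as you can spare
SET GLOBAL bulk_insert_buffer_size = 256 * 1024 * 1024;       -- for LOAD DATA INFILE
SET SESSION read_buffer_size       = 2 * 1024 * 1024;         -- test different values with your queries
SET SESSION read_rnd_buffer_size   = 4 * 1024 * 1024;
LOAD INDEX INTO CACHE keypairs;    -- pre-load the key buffer (see the link above)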
I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.
...is there a mysql command to ALTER TABLE add INT ID increment column automatically?
Yes, see my answer to https://stackoverflow.com/a/251630/20860

First, your primary key is not incrementable.
Which means, roughly: at every insert the index has to be rebalanced.
No wonder it slows to a crawl on a table of that size.
And with such an engine...
So, to the second point: what's the point of keeping that old MyISAM junk?
As in, you don't mind losing a row or two (or a few dozen) in case of a crash? And so on, even setting aside that the current MySQL maintainer (Oracle Corp) explicitly discourages the use of MyISAM.
So, here are the possible solutions:
1) Switch to InnoDB;
2) If you can't surrender the char ID, then (a sketch follows this list):
Add an auto-increment numeric key and make it the primary key - then the index would be clustered and the cost of inserts would drop significantly;
Turn your current key into a secondary index;
3) In case you can surrender it - the fix is obvious (use a numeric key instead).
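A sketch of option 2 (column and index names here are illustrative, and on a 700M-row table this ALTER will itself take a long time and extra disk space):
ALTER TABLE keypairs
  DROP PRIMARY KEY,
  ADD COLUMN seq BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (seq),
  ADD UNIQUE KEY uk_id (ID);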

Related

mysql partitioning by key internal hashing function

We have a table partitioned by key (binary(16))
Is there any way to calculate, outside of MySQL, which partition a record will go to?
What is the hash function (the non-LINEAR one)?
The reason is to sort the CSV files outside MySQL, insert them in parallel into the right partitions with LOAD DATA INFILE, and then index in parallel too.
I can't find the function in the MySQL docs.
What's wrong with LINEAR? Are you trying to LOAD in parallel?
How many indexes do you have? If it's only that hash, sort the data, then load it into a non-partitioned InnoDB table with the PK already in place. Meanwhile, make sure every column uses the smallest possible datatype. How much RAM do you have?
If you are using MyISAM, consider MERGE - With that, you can load each partition-like table as in a separate thread. When finished, construct the "merge" table that combines them.
What types of queries will you be using? Single row lookups by the BINARY(16)? Anything else might have big performance issues.
How much RAM? We need to tune either key_buffer_size or innodb_buffer_pool_size.
Be aware of the limitations. MyISAM defaults to a 7-byte data pointer and a 6-byte index pointer. 15TB would need only a 6-byte data pointer if the rows are DYNAMIC (byte pointer), or 5 bytes if they are FIXED (row number). So that could be 1 or 2 bytes to be saved. If anything is variable length, go with Dynamic; it would waste too much space (and probably not improve speed) to go fixed. I don't know if the index pointer can be shrunk in your case.
Are you on 5.7? MySQL 8.0 drops support for partitioned MyISAM tables. Meanwhile, MariaDB still handles MyISAM.
Will you first split the data by "partition"? Or send off INSERTs to different "partitions" one by one. (This choice adds some more wrinkles and possibly optimizations.)
Maybe...
Sort the incoming data by the binary version of MD5().
Split into chunks based on the first 4 bits. (Or do the split without sorting first) Be sure to run LOAD DATA for one 4-bit value in only one thread.
Have PARTITION BY RANGE with 16 partitions:
VALUES LESS THAN 0x1000000000000000
VALUES LESS THAN 0x2000000000000000
...
VALUES LESS THAN 0xF000000000000000
VALUES LESS THAN MAXVALUE
I don't know of a limit on the number of rows in a LOAD DATA, but I would worry about ACID locks having problems if you go over, say, 10K rows at a time.
This technique may even work for a non-partitioned table.
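A sketch of that 16-way split expressed with RANGE COLUMNS directly on the BINARY(16) key (table and column names are assumed, and the boundaries below are full-width 16-byte values rather than the 8-byte prefixes listed above):
CREATE TABLE hashed (
  id  BINARY(16) NOT NULL,
  val VARBINARY(255) NOT NULL,
  PRIMARY KEY (id)               -- the partitioning column must be in every unique key
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (id) (
  PARTITION p00 VALUES LESS THAN (X'10000000000000000000000000000000'),
  PARTITION p01 VALUES LESS THAN (X'20000000000000000000000000000000'),
  -- ... p02 through p14 follow the same pattern ...
  PARTITION p15 VALUES LESS THAN (MAXVALUE)
);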

innodb_ft_min_token_size = 1 performance implications

If I change innodb_ft_min_token_size = 1 from the default of 3, will this cause a lot more disk usage? Any performance issues with search?
I want to be able to use fulltext search on one-character words.
Also, once I make this change, how would I rebuild the index? Will this put a lot of load on the server?
There are not that many 1- and 2- letter words, so the space change may not be that great.
Modifying innodb_ft_min_token_size, innodb_ft_max_token_size, or ngram_token_size [in my.cnf] requires restarting the server.
To rebuild FULLTEXT indexes for an InnoDB table, use ALTER TABLE with the DROP INDEX and ADD INDEX options to drop and re-create each index.
-- https://dev.mysql.com/doc/refman/8.0/en/fulltext-fine-tuning.html
The "Scope" of innodb_ft_min_token_size is "Global". That is, it applies to all InnoDB FT indexes.
-- https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_ft_min_token_size
Recreating the index will read the entire table and rebuild the FT index, which will "lock" the table at some level for some period of time. The time to rebuild will be roughly proportional to the size of the table. And it will consume a bunch of extra disk space until it is finished. (The table and all the indexes will be copied over and at least the FT index will be rebuilt.)
If you have a thousand rows, no big deal. If you have a billion rows, you will need a long "downtime".
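In SQL, the rebuild itself would look something like this (table and index names are hypothetical):
-- after editing my.cnf (innodb_ft_min_token_size = 1) and restarting the server:
ALTER TABLE articles DROP INDEX ft_body;
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);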
After changing innodb_ft_min_token_size, I would be afraid to do a short wildcard test like
AGAINST('a*' IN BOOLEAN MODE)
If you have a test server, simply try it.
I noticed that the documentation recommends a value of 1 for Chinese, etc.

InnoDB Performance on primary index added/altering

So I have a huge update where I have to insert around 40 GB of data into an InnoDB table. It's taking quite a while, so I'm wondering which method would be the fastest (and more importantly why, as I could just do a split test).
Method 1)
a) Insert all rows
b) create the primary key: ALTER TABLE su_tmp_matches ADD PRIMARY KEY (id)
Method 2)
a) ALTER TABLE su_tmp_matches ADD PRIMARY KEY ( id )
b) Insert all rows
Currently we are using method 1, but step b) seems to take a shitload of time. So I'm wondering if there is any implication of the size here (40 GB - 5 million rows).
---- so I decided to test this as well ---
Pretty quick brand-new MySQL server - loads and loads of fast RAM, fast disks as well, and pretty tuned up (we handle more than 5000 requests per second on one of these):
1.6 million rows / 6 GB data:
81 seconds to "delete" a primary index
550 seconds to "add" a primary index (after data is added)
120 seconds to create a copy of the table with the primary index create BEFORE data insert
80 seconds to create a copy of the table without the primary index (the index then takes 550 seconds to create afterwards)
Seems pretty absurd - the question is whether the resulting indexes are the same thing.
From the documentation:
InnoDB does not have a special optimization for separate index creation the way the MyISAM storage engine does. Therefore, it does not pay to export and import the table and create indexes afterward. The fastest way to alter a table to InnoDB is to do the inserts directly to an InnoDB table.
It seems to me that adding the uniqueness constraint before the insert could only help the engine if the primary key column is an auto-incremented integer. But I really doubt there would be a notable difference.
A useful recommendation:
During the conversion of big tables, increase the size of the InnoDB buffer pool to reduce disk I/O, to a maximum of 80% of physical memory. You can also increase the sizes of the InnoDB log files.
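As a sketch: on MySQL 5.7.5+ the buffer pool can be resized at runtime; on older versions the same value goes into my.cnf and needs a restart (the figure below is only an example of the documentation's 80%-of-RAM rule of thumb):
SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;  -- e.g. ~24 GB on a 32 GB dedicated box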
EDIT: since, in my experience, MySQL doesn't always perform as the documentation suggests performance-wise, I think any benchmark you run on this would be interesting, even if not a definitive answer per se.

Fastest MySQL performance updating a single field in a single indexed row

I'm trying to get the fastest performance from an application that updates indexed rows repeatedly, replacing data in a varchar field. This varchar field will be updated with data of equal size upon subsequent updates (so a single row never grows). To my utter confusion, I have found that the performance is directly related to the size of the field itself, and is nowhere near the performance of replacing data in a filesystem file directly - i.e. a 1 KB field size is orders of magnitude faster than a 50 KB field size (within the row size limit). If the row exists in the database and the size is not changing, why would an update incur so much overhead?
I am using InnoDB and have disabled binary logging. I've ruled out communications overhead by using SQL-generated strings. I tried MyISAM and it was roughly 2-3x faster, but still too slow. I understand the database has overhead, but again, I am simply replacing data in a single field with data of equal size. What is the DB doing other than directly replacing bits?
Rough performance numbers:
81 updates/sec (60k string)
1111 updates/sec (1k string)
filesystem performance:
1428 updates/sec (60k string)
The updates I'm doing are INSERT ... ON DUPLICATE KEY UPDATE. Straight UPDATEs are roughly 50% faster, but still ridiculously slow for what they are doing.
Can any experts out there enlighten me? Any way to improve these numbers?
I addressed a question in the DBA StackExchange concerning using CHAR vs VARCHAR. Please read all the answers, not just mine.
Keep something else in mind as well. InnoDB features the gen_clust_index, the internal row-id clustered index for all InnoDB tables, one per InnoDB table. If you change anything in the primary key, this will give the gen_clust_index a real workout getting reorganized.

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
  added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  id BINARY(16) NOT NULL,
  body MEDIUMBLOB,
  UNIQUE KEY (id)
) ENGINE InnoDB;
CREATE TABLE index_fpid (
  fpid VARCHAR(255) NOT NULL,
  event_id BINARY(16) NOT NULL UNIQUE,
  PRIMARY KEY (fpid, event_id)
) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30,000 rows, I get a significant slowdown (around 200 insertions/sec), and then a slower, but still noticeable, further slowdown.
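For reference, the per-object write described above is essentially this (the hex ID and values are placeholders):
START TRANSACTION;
INSERT INTO events (id, body)
  VALUES (UNHEX('00112233445566778899AABBCCDDEEFF'), 'body payload ...');
INSERT INTO index_fpid (fpid, event_id)
  VALUES ('fingerprint-123', UNHEX('00112233445566778899AABBCCDDEEFF'));
COMMIT;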
I can see that as the table grows, the I/O wait numbers get higher and higher. My first thought was memory taken by the index, but this is on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30,000 rows taking so much memory, even more so just the indexes (the whole MySQL data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect a configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32 bits, running on KVM, 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables.
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after fixing the innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE TABLE; OPTIMIZE TABLE.
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
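A minimal sketch of such a bulk load for the events table above (file path and format are placeholders):
LOAD DATA INFILE '/tmp/events.csv'
  INTO TABLE events
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  (@hex_id, body)
  SET id = UNHEX(@hex_id);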