Insertion speed slowdown as the table grows in MySQL

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first I get around 600 insertions/sec, but after ~30,000 rows there is a significant slowdown (to around 200 insertions/sec), followed by a slower but still noticeable decline.
I can see that as the table grows, the I/O wait numbers get higher and higher. My first thought was memory taken by the index, but this runs on a VM with 768 MB dedicated to this task alone (two thirds of the memory is unused). Also, I have a hard time seeing 30,000 rows taking that much memory, let alone just the indexes (the whole MySQL data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect a configuration issue, especially since I am relatively new to databases.
The insertion-rate pattern looks as shown in the graph (not reproduced here).
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM with 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files per table.
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs (not reproduced here) after setting innodb_buffer_pool_size to a reasonable value:

Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
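If you want to confirm that buffer-pool misses are behind the I/O wait, one rough check (using standard InnoDB status counters) is:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- If Innodb_buffer_pool_reads (reads that had to go to disk) keeps climbing relative to
-- Innodb_buffer_pool_read_requests (logical reads), the pool is too small for the working set.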
Good luck!

Your indexes may just need to be analyzed and optimized during the inserts; they gradually get out of shape as you go along. The other option, of course, is to disable the indexes entirely while you're inserting and rebuild them afterwards, which should give more consistent performance.
The MySQL manual's page on optimizing INSERT speed is a good reference here.
The relevant statements are ANALYZE TABLE and OPTIMIZE TABLE.
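For the tables in the question, that would be something like the following. Note that for InnoDB, OPTIMIZE TABLE is effectively a full table rebuild, so it can take a while on a large table.
ANALYZE TABLE events, index_fpid;
OPTIMIZE TABLE events, index_fpid;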

Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
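A rough sketch of the LOAD DATA INFILE route for the events table from the question; the file path and the tab-separated hex format are assumptions, added_id is left to AUTO_INCREMENT, and blob values containing tabs or newlines would need escaping:
LOAD DATA INFILE '/tmp/events.tsv'
    INTO TABLE events
    FIELDS TERMINATED BY '\t'
    (@hex_id, body)
    SET id = UNHEX(@hex_id);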


MySQL performance on large, write-only table

Thanks in advance for your answers, and sorry for my bad English; I'm not a native speaker.
We're developing a mobile game with a backend. In this game we have a money system, and we keep track of each transaction for verification purposes.
To read a user's balance we have an intermediary table in which the balance is updated on each transaction, so the transaction table is never read directly by users; this reduces load under high traffic.
The transaction table is only read from time to time, in the back office.
Here is the schema of the transaction table:
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users, and the transaction table could grow up to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Will MySQL be able to scale correctly up to a billion rows? Bear in mind that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
Will it affect the performance of other tables?
Tables don't have performance — queries have performance.
The buffer pool has a fixed size. If you have six tables that need to share it, their pages must fit into the same buffer pool. There's no way to set priorities per table, dedicate buffer pool space to certain tables, or "lock" them in RAM; all pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other, in the sense that frequently requested pages from one table may evict pages from another table.
Will MySQL be able to scale correctly up to a billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Bear in mind that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Indexes do add overhead to inserts. You can't eliminate the primary key index; it is a necessary part of every table. But, for example, you might find it worthwhile to drop your FOREIGN KEY, which brings an index along with it.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is pure overhead and gives no benefit to any query. A sketch of dropping it follows.
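For the schema in the question, that would look roughly like this. The supporting index that InnoDB created for the constraint is assumed here to carry the constraint's name; check SHOW CREATE TABLE money_money_transaction first, because it may instead be named after the column.
ALTER TABLE money_money_transaction
    DROP FOREIGN KEY money_money_transaction_userID;
-- Dropping the FK does not remove the index it was using; drop it separately
-- once you've confirmed its actual name:
ALTER TABLE money_money_transaction
    DROP INDEX money_money_transaction_userID;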
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it for your application instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.

Handling huge MyISAM table for optimisation

I have a huge (and growing) MyISAM table (700 million rows = 140 GB).
CREATE TABLE `keypairs` (
`ID` char(60) NOT NULL,
`pair` char(60) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM
The table option was changed to ROW_FORMAT=FIXED, because both columns are always fixed length at the maximum (60). And yes, sadly, ID is a string and not an INT.
SELECT queries are pretty OK in terms of speed.
The database and the MySQL server are both on 127.0.0.1/localhost (nothing remote).
Sadly, INSERT is slow as hell. Not to mention trying to LOAD DATA millions of new rows... that takes days.
There won't be any concurrent reads on it. All SELECTs are done one by one, only by my local server (it is not for clients' use).
(For info: file sizes are .MYD = 88 GB, .MYI = 53 GB, .TMM = 400 MB.)
How could I speed up inserts into that table?
Would it help to PARTITION that huge table? (If so, how?)
I heard MyISAM uses the .frm files as a "structure cache", and that a config line helps MySQL keep all the .frm files in memory (useful when partitioned). Would that help too? (Actually, my .frm file is only 9 KB for 700 million rows.)
What about a string-shortening/compression function for the ID string (same idea as rainbow tables)? Even if it lowers the maximum number of unique IDs, I will never reach the max of 60 chars anyway, so maybe it's an idea? But before creating a new unique ID I would of course have to check that the shortened string doesn't already exist in the DB.
In the same vein as shortening the ID strings, what about using MD5() on the ID? Does a shorter string mean faster in that case or not?
Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
Smaller means more cacheable means less I/O means faster.
The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
Switching to InnoDB will double, maybe triple, the disk space for the table (data + index). Therefore, it will probably slow down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
When using InnoDB, it would be good to try to insert about 1000 rows per transaction. Fewer than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
Hex ID
Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30):
CREATE TABLE new (
ID BINARY(30) NOT NULL,
`pair` char(60) NOT NULL
-- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
SELECT UNHEX(ID),
pair
FROM keypairs;
ALTER TABLE new ADD
PRIMARY KEY (`ID`); -- For InnoDB, I would do this differently
RENAME TABLE keypairs TO old,
new TO keypairs;
DROP TABLE old;
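After the conversion, lookups go through UNHEX on the way in and HEX on the way out; the 60-hex-digit value below is just a placeholder:
SELECT HEX(ID) AS ID, pair
    FROM keypairs
    WHERE ID = UNHEX(REPEAT('ab', 30));  -- REPEAT('ab', 30) stands in for a real 60-hex-digit ID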
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have such.
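A sketch of those settings for the MyISAM-only case; key_buffer_size can be changed at runtime, while innodb_buffer_pool_size (in MySQL of this era) has to be set in my.cnf and needs a restart:
SET GLOBAL key_buffer_size = 300 * 1024 * 1024;  -- the 300M suggested above for MyISAM-only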
The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.
1) Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
2) LOAD the data into tmp.
3) ALTER TABLE tmp ORDER BY ID; this will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
4) INSERT INTO keypairs SELECT * FROM tmp; this will maximize the caching by feeding rows to keypairs in ID order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.
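A sketch of those four steps in SQL; the input file path and the tab-separated hex format are assumptions:
CREATE TABLE tmp (
    ID BINARY(30) NOT NULL,
    pair CHAR(60) NOT NULL
) ENGINE=MyISAM;                            -- step 1: no indexes
LOAD DATA INFILE '/tmp/keypairs.tsv'
    INTO TABLE tmp
    FIELDS TERMINATED BY '\t'
    (@hex_id, pair)
    SET ID = UNHEX(@hex_id);                -- step 2: load, converting hex to BINARY(30)
ALTER TABLE tmp ORDER BY ID;                -- step 3: sort; still no index
INSERT INTO keypairs SELECT * FROM tmp;     -- step 4: feed rows to keypairs in ID order
DROP TABLE tmp;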
Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.
For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.
If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
Also make sure to tune MyISAM. There aren't many tuning options:
Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default; I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience. (A sketch of applying these settings follows this list.)
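A sketch of how those settings could be applied at runtime; the values are placeholders to adjust to your RAM and workload:
SET GLOBAL key_buffer_size          = 8 * 1024 * 1024 * 1024;  -- index cache: as much as you can spare
SET GLOBAL bulk_insert_buffer_size  = 256 * 1024 * 1024;       -- helps LOAD DATA INFILE / bulk inserts
SET SESSION read_buffer_size        = 2 * 1024 * 1024;         -- sequential-scan buffer, per connection
SET SESSION read_rnd_buffer_size    = 8 * 1024 * 1024;         -- post-sort row reads, per connection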
I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.
...is there a mysql command to ALTER TABLE add INT ID increment column automatically?
Yes, see my answer to https://stackoverflow.com/a/251630/20860
First, your primary key is not sequential (not auto-incrementing).
Which means, roughly: on every insert the index has to be rebalanced.
No wonder it slows to a crawl on a table of that size.
And on such an engine...
So, to the second point: what's the point of keeping that old MyISAM junk?
Do you, for example, not mind losing a row or two (or a dozen) in case of a crash? And so on, and so on, even setting aside that the current MySQL maintainer (Oracle Corp.) explicitly discourages the use of MyISAM.
So, here are possible solutions:
1) Switch to InnoDB;
2) If you can't surrender the char ID, then:
Add an auto-increment numeric key and make it the primary key; then the index would be clustered and the cost of inserts would drop significantly (see the sketch after this list);
Turn your current key into a secondary index;
3) If you can surrender the char ID, the solution is obvious.
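A rough sketch of option 2; the new column name seq is hypothetical, and on a 700-million-row table this ALTER is itself a long rebuild:
ALTER TABLE keypairs
    DROP PRIMARY KEY,
    ADD COLUMN seq BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (seq),
    ADD UNIQUE KEY (ID);        -- the old char key becomes a secondary index
-- Appending ", ENGINE=InnoDB" to the same statement would combine this with option 1.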

MySQL InnoDB INSERT performance for large tables

I have 8 tables, each with over 2 million rows, using an INT (4-byte) PK, used for frequent inserting and reading. The older 9/10 of the data is read only occasionally and it doesn't matter how long it takes to access it, while the newer 1/10 has to be fast for both INSERT and SELECT. The tables fall into 2 categories of requirements:
100 INSERTs/sec
20 INSERTs/sec, occasional UPDATE
Because it should be working with innodb_buffer_pool_size set to 32M and the old data is unimportant, I think the best solution will be to copy, say once per week, the older half of each table into big archive tables. Plus, I should use infile inserting (LOAD DATA INFILE) instead of the current transactions. Is that a good solution? I'll appreciate any advice and links about this problem.
If you are using InnoDB, the data is naturally "clustered" by the PRIMARY KEY of the table, if you defined one (i.e. "id INT NOT NULL PRIMARY KEY AUTO_INCREMENT"): the data is grouped by ID and will stay that way.
So your most recent INSERTed data is naturally grouped on some InnoDB buffers, and your old archive data doesn't matter at all.
I don't think you will benefit from splitting the data into archive tables/databases and a recent one; you will only make everything much more complex!
To speed up inserts/updates/deletes on InnoDB, you have to think about the physical location of your InnoDB log file: InnoDB needs to write each modification into it to complete the operation (whether it's an explicit transaction or an implicit one!); it doesn't wait for the data or index pages to be written back to disk. It's a pretty different strategy from MyISAM.
So if you can allocate fast sequential storage for the InnoDB log file (a 10k-rpm+ hard drive or an SSD) and keep the ibdata files on another drive or RAID array, you will be able to sustain an impressive number of DB modifications: it's I/O bound on the InnoDB log file (unless you use complex or heavy WHERE clauses on UPDATE/DELETE).
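To see where the log files currently live and how big they are, you can check the standard InnoDB variables; note these are startup options, so changes go in my.cnf and require a restart:
SHOW VARIABLES LIKE 'innodb_log%';   -- innodb_log_group_home_dir, innodb_log_file_size, ...
SHOW VARIABLES LIKE 'datadir';       -- where ibdata and the table files live by default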

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20 GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the use of the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys. The table would become over 1000 times larger if I were to and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that when your current method takes days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/re-add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
Sorry to keep throwing comments at you (last one, probably).
I just found an article that provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" MEMORY table built.
I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the ID field into a column in the staging table. Then I update where the ID column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 GB file I have into staging in around 16 minutes) instead of working row by row?

Alternatives to the MEMORY storage engine for MySQL

I'm currently running some intensive SELECT queries against a MyISAM table. The table is around 100 MiB (800,000 rows) and it never changes.
I need to increase the performance of my script, so I was thinking of moving the table from MyISAM to the MEMORY storage engine, so I could load it completely into memory.
Besides the MEMORY storage engine, what are my options to load a 100 MiB table into the memory?
A table with 800k rows shouldn't be any problem for MySQL, no matter what storage engine you are using. At a size of 100 MB, the full table (data and keys) should live in memory (the MySQL key cache, the OS file cache, or probably both).
First, check the indexes. In most cases, optimizing the indexes gives you the best performance boost; don't do anything else unless you are pretty sure they are already in shape. Run the queries with EXPLAIN and watch for cases where no index, or the wrong index, is used. This should be done with real-world data, not on a server with test data.
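A minimal sketch of that check; the table and column names are hypothetical, since the question doesn't show its queries:
EXPLAIN SELECT col_a, col_b
    FROM my_table
    WHERE col_a = 'some value';
-- In the output, a NULL "key" column or a very large "rows" estimate means no index,
-- or the wrong index, is being used for this query.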
After you have optimized your indexes, the queries should finish in a fraction of a second. If they are still too slow, try to avoid running them at all by using a cache in your application (memcached, etc.). Given that the data in the table never changes, there shouldn't be any problems with stale cache data.
Assuming the data rarely changes, you could potentially boost query performance significantly using MySQL query caching.
If your table is queried a lot it's probably already cached at the operating system level, depending on how much memory is in your server.
MyISAM also allows preloading table indexes into memory using a mechanism called the MyISAM key cache. After you've created a key cache, you can assign a table's indexes to it with CACHE INDEX and preload them with LOAD INDEX INTO CACHE.
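A sketch of that, with a hypothetical cache name and table name:
SET GLOBAL hot_cache.key_buffer_size = 128 * 1024 * 1024;  -- create a dedicated named key cache
CACHE INDEX my_table IN hot_cache;                         -- assign the table's indexes to it
LOAD INDEX INTO CACHE my_table;                            -- preload the index blocks into memory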
I assume you've analyzed your table and queries and optimized your indexes for the actual queries? Otherwise that's really something you should do before attempting to store the entire table in memory.
If you have enough memory allocated for MySQL's use (in the InnoDB buffer pool, or for use by MyISAM), you can read the table into memory (just a 'SELECT * FROM tablename'), and if there's no reason to evict it, it stays there.
You also get better use of your keys, as MEMORY tables default to hash-based keys rather than full B-tree access; for smaller, non-unique keys hash might be fast enough, but perhaps not with such a large table.
As usual, the best thing to do it to benchmark it.
Another idea, if you are using v5.1, is to use the ARCHIVE table type, which can be compressed and may also speed up access to the contents if they are easily compressible. This trades CPU time for decompression against I/O and memory access.
If the data never changes you could easily duplicate the table over several database servers.
This way you could offload some queries to a different server, gaining some extra breathing room for the main server.
The speed improvement depends on the current database load, there will be no improvement if your database load is very low.
PS:
Be aware that MEMORY tables forget their contents when the database restarts!