MySQL Partitioning: Simultaneous insertion to different partitions performance - mysql

I have a partitioned InnoDB mysql table, and I need to insert hundreds of millions of rows.
I am currently using LOAD DATA INFILE to load many (tens of thousands of) .csv files into said table.
What are the performance implications if I simultaneously insert large blocks of data into different distinct partitions?
Might I benefit from running multiple processes which each run batches of LOAD DATA INFILE statements?
Miscellaneous information:
Hardware: Intel i7, 24GB ram, Ubuntu 10.04 w/ MySQL 5.5.11, Raid 1 storage
#mysql on freenode IRC has told me that the performance implications will be the same as with a normal InnoDB or MyISAM table: InnoDB will do row-level locking and MyISAM will do table-level locking.
Table Structure:
CREATE TABLE `my_table` (
`short_name` varchar(10) NOT NULL,
`specific_info` varchar(20) NOT NULL,
`date_of_inquiry` datetime DEFAULT NULL,
`price_paid` decimal(8,2) DEFAULT NULL,
`details` varchar(255) DEFAULT '',
UNIQUE KEY `unique_record` (`short_name`,`specific_info`,`date_of_inquiry`),
KEY `short_name` (`short_name`),
KEY `underlying_quotedate` (`short_name`,`date_of_inquiry`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50500 PARTITION BY LIST COLUMNS(short_name)*/
(PARTITION pTOYS_R_US VALUES IN ('TOYS-R-US') ENGINE = InnoDB,
PARTITION pZAPPOS VALUES IN ('ZAPPOS') ENGINE = InnoDB,
PARTITION pDC VALUES IN ('DC') ENGINE = InnoDB,
PARTITION pGUCCI VALUES IN ('GUCCI') ENGINE = InnoDB,
...on and on...
);

Not a full list, but some pointers...
The fastest way to insert rows is to use LOAD DATA INFILE
See: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
If that's not an option and you want to speed up things, you'll need to find the bottleneck and optimize for that.
If the partitions are across a network, network traffic might kill you; the same goes for CPU, disk I/O and memory. Only profiling a sample load will tell.
Disable key updates
If you cannot do LOAD DATA INFILE, make sure you disable key updates:
ALTER TABLE table1 DISABLE KEYS
... lots of inserts
ALTER TABLE table1 ENABLE KEYS
Note that DISABLE KEYS only disables non-unique keys; unique keys are always updated. (It also only takes effect on MyISAM tables; InnoDB ignores it with a warning.)
Binary log
If you have the binary log enabled, it will record all of those inserts; consider disabling it. With MySQL running, you can disable it for the duration of the mass insert by replacing the log file with a symlink pointing to /dev/null.
If you want the binary log to persist, you can do a simultaneous insert to a parallel database with blackhole tables and binary log enabled.
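A less invasive alternative (not mentioned in the answer above, so treat it as a sketch): skip binary logging only for the loading session. This assumes the loading account has the SUPER privilege, and the CSV path shown here is a placeholder.
-- Run in the session that performs the bulk load (requires SUPER):
SET sql_log_bin = 0;   -- this session's statements are no longer written to the binary log
LOAD DATA INFILE '/path/to/one_of_the_files.csv' INTO TABLE my_table
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
SET sql_log_bin = 1;   -- restore binary logging for this session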
Autoincrement key
If you let MySQL calculate the auto-increment key, this will create contention around key generation. Consider feeding MySQL a precalculated auto-increment primary key value instead of NULL.
Unique keys
Unique keys are checked on every insert (for uniqueness) and they eat a lot of time, because MySQL needs to check the index on every insert.
If you know that the values you insert are unique, it's better to drop that requirement and add the unique index back after you are done (sketched below).
When you add it back, MySQL will take a lot of time checking, but at least it will do it only once, not on every insert.
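A minimal sketch against the table from the question, using the index name from its CREATE TABLE:
ALTER TABLE my_table DROP INDEX unique_record;
-- ... run all of the LOAD DATA INFILE batches ...
ALTER TABLE my_table
  ADD UNIQUE KEY unique_record (short_name, specific_info, date_of_inquiry);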

If you want to get maximum I/O performance, you'll want the different partitions on different disk volumes.
I'm not sure about the performance implications if all of the partitions are on the same physical disks, but obviously you're more likely to run out of I/O capacity that way.
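One way to place partitions on separate volumes is the DATA DIRECTORY clause. This is a sketch rather than part of the answer above: the paths are placeholders, the indexes are omitted for brevity, and InnoDB partitions only honor DATA DIRECTORY on MySQL 5.6+ with innodb_file_per_table enabled.
CREATE TABLE my_table (
  short_name varchar(10) NOT NULL,
  specific_info varchar(20) NOT NULL,
  date_of_inquiry datetime DEFAULT NULL,
  price_paid decimal(8,2) DEFAULT NULL,
  details varchar(255) DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY LIST COLUMNS(short_name)
(PARTITION pTOYS_R_US VALUES IN ('TOYS-R-US') DATA DIRECTORY = '/mnt/disk1/mysql' ENGINE = InnoDB,
 PARTITION pZAPPOS VALUES IN ('ZAPPOS') DATA DIRECTORY = '/mnt/disk2/mysql' ENGINE = InnoDB);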

It's likely to depend on your machine specs, but for what it's worth, I've tried this and it definitely speeds things up for my specific task. Loading all the data for one partition takes me about an hour; done serially, the 12 partitions would take 12 * 1 = 12 hours. However, on my machine with 24 cores, I can parallelize the task and complete it in just 1 hour.
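A sketch of what the parallel approach can look like; the file names are placeholders, and each process works through its own batch of files whose rows land in a different partition, so the sessions do not contend on the same partition:
-- Session / process 1:
LOAD DATA INFILE '/data/toys_r_us_0001.csv' INTO TABLE my_table
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- Session / process 2, running at the same time:
LOAD DATA INFILE '/data/zappos_0001.csv' INTO TABLE my_table
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';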

Related

MySQL ADD COLUMN slow under AWS RDS

I have an RDS MySql with the following settings:
Class: db.m5.xlarge
Storage: Provisioned 1000 IOPS (SSD)
I then want to add a few columns to a table that is about 20 GB in size (according to INFORMATION_SCHEMA.files). Here's my statement:
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL;
This query took 172 minutes to execute. Most of this time was spent copying the data to a temporary table.
During that operation, there were no other queries (read or write) being executed. I had the database just for myself. SHOW FULL PROCESSLIST was saying that State was equal to copy to tmp table for my query.
What I don't understand is that the AWS RDS Console tells me that the write throughput was between 30 MB/s and 35 MB/s for 172 minutes.
Assuming a write throughput of 30 MB/s, I should have been able to write 30 * 60 * 172 = 309,600 MB, roughly 302 GB. This is much bigger than the size of the temporary table that was created during the operation (20 GB).
So two questions:
What is mysql/rds writing besides my temp table? Is there a way to disable that so that I can get the full bandwidth to create the temp table?
Is there any way to accelerate that operation? Taking 3 hours to write 20 GB of data seems pretty long.
I was using MySQL 5.7. According to this MySQL blog post, version 8.0 improved the situation: "InnoDB now supports Instant ADD COLUMN".
I therefore changed my query to use the new feature.
-- Completes in 0.375 seconds!
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
-- 'ALGORITHM=INSTANT' is not compatible with foreign keys.
-- The foreign key will need to be added in another statement
-- ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL,
-- the new option
ALGORITHM=INSTANT;
-- This completed in about 6 minutes.
-- Adding the foreign key creates an index under the hood.
-- This index was 1.5 GB big.
SET FOREIGN_KEY_CHECKS=0;
ALTER TABLE MY_TABLE
ADD FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK);
SET FOREIGN_KEY_CHECKS=1;
So my conclusions:
upgrade to MySQL 8 if you can
make sure that you always use (when possible) the ALGORITHM=INSTANT option.
InnoDB is probably the storage engine you are using, since it's the default storage engine. InnoDB does some I/O that might seem redundant, to ensure there is no data loss.
For example:
Data and index pages modified in the buffer pool must be written to the tablespace. The table may need to split some pages during the process of adding columns, because the rows become wider, and fewer rows fit per page.
During writing pages to the tablespace, InnoDB first writes those pages to the doublewrite buffer, to ensure against data loss if there's a crash during a page write.
Transactions are written to the InnoDB redo log, and this may even result in multiple overwrites to the same block in the log.
Transactions are also written to the binary log if it is enabled for purposes of replication. This shouldn't be a big cost in the case of an ALTER TABLE statement, though, because DDL statements are always written to the binary log in statement format, not in row format.
You also asked what can be done to speed up the ALTER TABLE. The reason to want it to run faster is usually because during an ALTER TABLE, the table is locked and may block concurrent queries.
At my company, we use the free tool pt-online-schema-change, so we can continue to use the table more or less freely while it is being altered. It actually takes longer to complete the alter this way, but it's not so inconvenient since it doesn't block our access to the table.
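For completeness, MySQL's built-in online DDL can also be requested explicitly so the statement fails rather than silently taking a blocking lock. This is a sketch of that alternative (reusing a column name from the question), not what the answer above uses:
ALTER TABLE MY_TABLE
  ADD COLUMN NEW_COLUMN_4 DATE NULL,
  ALGORITHM=INPLACE, LOCK=NONE;  -- still rebuilds the table on 5.7, but allows concurrent reads and writes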

MariaDB InnoDB bulk INSERT slow over time

Currently, I have a Server A that is holding about 25 billion records (several terabytes size) with the following structure:
CREATE TABLE `table_x` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`a1` char(64) DEFAULT NULL,
`b1` int(11) unsigned DEFAULT NULL,
`c1` tinyint(1) DEFAULT NULL,
`LastUpdate` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
PRIMARY KEY (`id`),
UNIQUE KEY `idxb1a1` (`b1`,`a1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
As the data is growing too big, I am trying to migrate these records into Server B, which has the same schema structure, using bulk inserts of 10K records (e.g. INSERT INTO yourtable VALUES (1,2), (5,5), ...;) in ascending order by the column id.
Initially, the insertion rate was really quick; however, it gradually slowed down and now takes about 10 seconds to bulk insert 10K records (i.e. 1K rows/sec).
I am guessing it's because it needs to update the indexes after every insertion.
I applied the following configuration on Server B before starting the migration (consolidated in the sketch after this list):
innodb_flush_log_at_trx_commit=2
SET unique_checks=0;
autocommit=0 and commit every 50K
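A consolidated sketch of those session-level settings as the loading client might apply them; the 50K batch boundary is assumed to be enforced by whatever drives the inserts:
-- In the session performing the bulk inserts:
SET SESSION unique_checks = 0;
SET SESSION autocommit = 0;
-- ... issue the multi-row INSERTs, then commit every 50K rows ...
COMMIT;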
Server B hardware configuration :
300GB+ ram (240GB used for innodb_buffer_pool_size)
SSDs for data storage
Server B my.cnf :
innodb_buffer_pool_size=240G
innodb_buffer_pool_instances=64
innodb_page_cleaners=32
innodb_purge_threads=1
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_use_native_aio=0
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
innodb_file_per_table=1
max_connections=10000
skip_name_resolve=1
tmp_table_size=134217728
max_heap_table_size=134217728
back_log=1000
wait_timeout=900
innodb_log_buffer_size=32M
innodb_log_file_size=768M
Is there anything else I can do or configure to speed up the insertion?
Update #1 :
The reason I am trying to migrate the records over to Server B is that I would like to break/shard the data across a few servers (to use the MariaDB SPIDER engine sharding solution). As such, solutions that involve sending a snapshot of the data or directly copying over the data don't seem viable.
The reason it slows down is likely that your transaction (redo) log gets full and purging isn't keeping up. Increasing innodb_log_file_size (which requires a clean shutdown with innodb_fast_shutdown=0 and removing the old log files) and innodb_log_files_in_group will postpone the slowdown. Increasing innodb_io_capacity and innodb_io_capacity_max to match what your storage can achieve should also help.
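A sketch of those adjustments; the values are placeholders to be sized against your own storage, and only the I/O capacity variables are dynamic:
-- Dynamic, takes effect immediately (example values only):
SET GLOBAL innodb_io_capacity = 2000;
SET GLOBAL innodb_io_capacity_max = 4000;
-- Static, set in my.cnf and restart after a clean shutdown:
--   innodb_log_file_size = 4G
--   innodb_log_files_in_group = 2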
Why don't you use xtrabackup to take a point-in-time copy and replication to finish the sync? That will be orders of magnitude faster than INSERT-ing mysqldump style.
In addition to the answer from @gordan-bobić: removing the indexes and reapplying them at the end speeds things up a lot.
Agree with @gordan-bobić regarding the use of xtrabackup. If you are doing a data migration, a physical copy/backup is your best approach if you want speed. A logical copy, such as mysqldump or a query-based copy, can take much more time because it has to apply checks based on the configuration loaded at runtime. However, if bulk insert is your option, then consider setting innodb_autoinc_lock_mode=2 if applicable.
Another thing: if you are using INSERT statements, are you loading one value at a time? Consider using multiple-value lists, as this is proven to be faster. Alternatively, consider using LOAD DATA instead of INSERT.
Also set innodb_change_buffering=inserts (since you are using unique_checks=0 in MariaDB) and consider increasing innodb_change_buffer_max_size, for example from 30 to 50 (it is a percentage of the buffer pool). It is best to do this when there is not much activity on the target table, so that you can monitor the effect on the target database server, and make sure there is not much interference from other applications or daemons running on that server.
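A sketch of those two settings applied as dynamic globals; the value 50 is the one suggested above, and newer MariaDB releases deprecate the change buffer, so check your version's documentation first:
SET GLOBAL innodb_change_buffering = 'inserts';   -- buffer secondary-index insert work
SET GLOBAL innodb_change_buffer_max_size = 50;    -- percent of the buffer pool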

Benchmark MySQL with batch insert on multiple threads within same table

I want to test write-intensive workloads on the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench to benchmark. My requirements are:
multiple threads concurrently writing to the same table;
support for batch inserts (each insert transaction inserts a bulk of records).
I checked all the pre-made sysbench tests and I don't see any that satisfy my requirements.
oltp_write_only: supports multiple threads that write to the same table, but this test doesn't have a bulk insert option.
bulk_insert: support multiple threads, but each thread writes to a different table.
Are there any pre-made sysbench tests that satisfy my requirements? If not, can I find custom Lua scripts somewhere that already do this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those (sketched after this list) will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
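A sketch of the revised table under those two suggestions, using the columns from the question (the JSON default is dropped because plain literal defaults on JSON columns are not accepted; MySQL 8.0.13+ would need an expression default instead):
CREATE TABLE IF NOT EXISTS `tableA` (
  `user_id` VARCHAR(63) NOT NULL,
  `data` JSON NOT NULL,
  PRIMARY KEY (`user_id`)   -- replaces both the AUTO_INCREMENT id and the unique index
) ENGINE = InnoDB;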
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: if you promote user_id to the PK, then it is worth considering a file sort of the data by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, the extra sort would be wasted.)

Is the TokuDB engine a good fit when the average record length is 3K?

The table has two fields, id and content, and only one primary key (id).
The id field type is bigint. The content field type is TEXT; since this field is variable-length, some records may be 20K, and the average record length is 3K.
Table schema:
CREATE TABLE `events` (
`eventId` bigint(20) NOT NULL DEFAULT '0',
`content` text,
PRIMARY KEY (`eventId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
It's just used as a Key-Value storage.
My test results are:
InnoDB: 2200 records/second
TokuDB: 1300 records/second
BDB-JE: 12000 records/second
LevelDB-JNI: 22000 records/second (not stable, needs retesting)
This result is very, very bad.
Is 3K too big for TokuDB?
In my application, there are many inserts (>2000 records/second, about 100M records/day) and rare updates/deletes.
TokuDB version: mysql-5.1.52-tokudb-5.0.6-36394-linux-x86_64-glibc23.tar.gz
InnoDB version: mysql 5.1.34
OS: CentOS 5.4 x86_64
One reason we chose InnoDB/TokuDB is that we need partition support and something maintenance-friendly.
Maybe I will try LevelDB or another key-value store? Any suggestions are welcome.
===========
Thanks everybody. In the end, the performance of both TokuDB and InnoDB was not good enough for our use case.
We are now using a Bitcask-style solution as our storage.
Bitcask's append-only write performance is much better than we expected.
We just need to handle the memory footprint of the hash index.
If you are inserting your rows one at a time, I'd recommend changing to a multi-row insert statement if possible, as in "insert into events (eventId,content) values (1,'value 1'), (2,'value 2'), ...", since there can be quite a bit of per-transaction overhead (sketched below). Also, TokuDB has improved performance with each release; I'd recommend running a current release.
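A minimal sketch of such a batch against the events table from the question; the values are placeholders, and a few hundred rows per statement is a common starting point:
-- One round trip and one transaction instead of three:
INSERT INTO events (eventId, content) VALUES
  (1, 'value 1'),
  (2, 'value 2'),
  (3, 'value 3');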
The main features of TokuDB are:
Hot schema changes:
Hot index creation: TokuDB tables support insertions, deletions and queries with no downtime while indexes are being added to the table. TokuDB uses Fractal Tree indexes, while InnoDB uses B-trees.
Hot column addition and deletion: TokuDB tables support insertions, deletions and queries with minimal downtime while an ALTER TABLE adds or deletes columns.
More details at what is TokuDB? How to install TokuDB on mysql 5.1?

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding of insertion speed and performance patterns in MySQL for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30000 rows, I get a significant slowdown (around 200 insertions/sec), followed by a slower but still noticeable decline.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the index, but those are done on a VM which has 768 Mb, and is dedicated to this task alone (2/3 of memory are unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole mysql data dir < 100 Mb anyway). To confirm this, I allocated very little memory to the VM (64 Mb), and the slowdown pattern is almost identical (i.e. slowdown appears after the same numbers of insertions), so I suspect some configuration issues, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM with 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables.
[EDIT]
Thank you very much to Eric Holmberg; he nailed it. Here are the graphs after setting innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary key is a B-tree, so inserts will always take O(log N) time, and once you run out of cache, pages will start getting swapped in and out of memory like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable the indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE TABLE and OPTIMIZE TABLE (see the sketch below).
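A minimal sketch against the events table from the question; note that for InnoDB, OPTIMIZE TABLE performs a full table rebuild, so it is expensive on large tables:
ANALYZE TABLE events;    -- refresh the index statistics used by the optimizer
OPTIMIZE TABLE events;   -- rebuild the table and its indexes (maps to ALTER TABLE ... FORCE for InnoDB)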
Verifying that an insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat-out performance, using LOAD DATA INFILE will improve your insert speed considerably.