Currently, I have a Server A that is holding about 25 billion records (several terabytes size) with the following structure:
CREATE TABLE `table_x` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`a1` char(64) DEFAULT NULL,
`b1` int(11) unsigned DEFAULT NULL,
`c1` tinyint(1) DEFAULT NULL,
`LastUpdate` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
PRIMARY KEY (`id`),
UNIQUE KEY `idxb1a1` (`b1`,`a1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
As the data is growing too big, I am trying to migrate these records into Server B, which has the same schema, using bulk inserts of 10K records (e.g. INSERT INTO yourtable VALUES (1,2), (5,5), ...;) in ascending order of the column id.
Initially, the insertion rate was really quick; however, it gradually slowed down and now takes about 10 seconds to bulk insert 10K records (i.e. about 1K rows/sec).
I am guessing it's because it needs to update the indexes after every insertion.
I have done the following configuration on Server B before starting the migration (see the sketch after this list):
innodb_flush_log_at_trx_commit=2
SET unique_checks=0;
autocommit=0 and commit every 50K
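Roughly, each migration batch on Server B looks like this (a simplified sketch with truncated values; the real batches are 10K rows each and I commit every 50K):

SET SESSION unique_checks = 0;
SET SESSION autocommit = 0;

-- one batch, ascending by id (values truncated for illustration)
INSERT INTO table_x (id, a1, b1, c1) VALUES
  (1, 'aaaa...', 100, 1),
  (2, 'bbbb...', 101, 0),
  (3, 'cccc...', 102, 1);

-- ...more 10K-row batches...

COMMIT;   -- commit every ~50K rows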
Server B hardware configuration:
300GB+ ram (240GB used for innodb_buffer_pool_size)
SSDs for data storage
Server B my.cnf:
innodb_buffer_pool_size=240G
innodb_buffer_pool_instances=64
innodb_page_cleaners=32
innodb_purge_threads=1
innodb_read_io_threads=64
innodb_write_io_threads=64
innodb_use_native_aio=0
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
innodb_file_per_table=1
max_connections=10000
skip_name_resolve=1
tmp_table_size=134217728
max_heap_table_size=134217728
back_log=1000
wait_timeout=900
innodb_log_buffer_size=32M
innodb_log_file_size=768M
Is there anything else I can do or configure to speed up the insertion?
Update #1:
The reason why I am trying to migrate the records over to Server B is that I would like to break/shard the data across a few servers (to use the MariaDB SPIDER engine sharding solution). As such, solutions that involve sending a snapshot of the data or directly copying the data over don't seem viable.
The reason it slows down is likely because your transaction (redo) log gets full and the flushing/purging isn't keeping up. Increasing innodb_log_file_size (which requires a shutdown with innodb_fast_shutdown=0 and removing the old log files) and innodb_log_files_in_group will postpone the slowdown. Increasing innodb_io_capacity and innodb_io_capacity_max to match what your storage can achieve should also help.
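For example, something along these lines in my.cnf (illustrative numbers only; benchmark your SSDs and size accordingly, and remember the redo logs must be recreated after the clean shutdown):

innodb_log_file_size      = 4G       # a larger redo log postpones the flush stalls
innodb_log_files_in_group = 2
innodb_io_capacity        = 5000     # roughly the sustained IOPS the SSDs can deliver
innodb_io_capacity_max    = 10000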
Why don't you use xtrabackup to take a point-in-time copy and replication to finish the sync? That will be orders of magnitude faster than INSERT-ing mysqldump style.
In addition to the answer from @Gordan-Bobić, removing the indices and reapplying them at the end speeds things up a lot.
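For the table in the question, that would look roughly like this (a sketch; rebuilding the unique key afterwards takes one long pass over the table and will fail if duplicates crept in):

ALTER TABLE table_x DROP KEY idxb1a1;

-- ...run the bulk INSERT batches...

ALTER TABLE table_x ADD UNIQUE KEY idxb1a1 (b1, a1);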
Agree with @gordan-bobić regarding the use of xtrabackup. If you are doing a data migration, a physical copy/backup is your best approach if you want speed. A logical copy, such as mysqldump or a query-based copy, can take much longer because every row has to go through the checks implied by the configuration loaded at runtime. However, if bulk insert is your option, then consider setting innodb_autoinc_lock_mode = 2 if applicable.
Another thing: if you are using INSERT statements, are you loading one row at a time? Consider using multiple-row value lists, as these are proven to be faster. Alternatively, consider using LOAD DATA instead of INSERT.
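A LOAD DATA version of one batch might look like this (a sketch; the file path, field terminators and the secure_file_priv setup are assumptions):

LOAD DATA INFILE '/tmp/table_x_chunk_0001.csv'
INTO TABLE table_x
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id, a1, b1, c1);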
Also set innodb_change_buffering=inserts (since you are using unique_checks=0 in MariaDB) and consider increasing innodb_change_buffer_max_size, for example from 30 to 50. It is best to do this when there is not much activity on the target table you are inserting into, so that you can monitor the effect on the target database server, and make sure there is not much disturbance from other applications or daemons running on that server either.
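Both variables are dynamic, so they can be set without a restart (a sketch; the valid innodb_change_buffering values are none, inserts, deletes, changes, purges and all):

SET GLOBAL innodb_change_buffering = 'inserts';
SET GLOBAL innodb_change_buffer_max_size = 50;  -- percentage of the buffer pool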
Related
I have an RDS MySql with the following settings:
Class: db.m5.xlarge
Storage: Provisioned 1000 IOPS (SSD)
I then want to add a few columns to a table that is about 20 GB in size (according to INFORMATION_SCHEMA.files). Here's my statement:
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL;
This query took 172 minutes to execute. Most of this time was spent copying the data to a temporary table.
During that operation, there were no other queries (read or write) being executed. I had the database just for myself. SHOW FULL PROCESSLIST was saying that State was equal to copy to tmp table for my query.
What I don't understand is that the AWS RDS Console tells me that the write throughput was between 30 MB/s and 35 MB/s for 172 minutes.
Assuming a write throughput of 30 MB/s, I should have been able to write 30 * 60 * 172 = 309600 MB = 302 GB. This is much bigger than the size of the temporary table that was created during the operation (20 GB).
So two questions:
what is mysql/rds writing besides my temp table? Is there a way to disable that so that I can get the full bandwidth to create the temp table?
is there any way to accelerate that operation? Taking 3 hours to write 20 GB of data seems pretty long.
I was using MySQL 5.7. According to this MySQL blog post, version 8.0 improved the situation: "InnoDB now supports Instant ADD COLUMN".
I therefore changed my query to use the new feature.
-- Completes in 0.375 seconds!
ALTER TABLE MY_TABLE
ADD COLUMN NEW_COLUMN_1 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_2 DECIMAL(39, 30) NULL,
ADD COLUMN NEW_COLUMN_3 INT(10) UNSIGNED NULL,
-- 'ALGORITHM=INSTANT' is not compatible with foreign keys.
-- The foreign key will need to be added in another statement
-- ADD CONSTRAINT SOME_CONSTRAINT FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK),
ADD COLUMN NEW_COLUMN_4 DATE NULL,
-- the new option
ALGORITHM=INSTANT;
-- This completed in about 6 minutes.
-- Adding the foreign key creates an index under the hood.
-- This index was 1.5 GB big.
SET FOREIGN_KEY_CHECKS=0;
ALTER TABLE MY_TABLE
ADD FOREIGN KEY (NEW_COLUMN_3) REFERENCES SOME_OTHER_TABLE(SOME_OTHER_PK);
SET FOREIGN_KEY_CHECKS=1;
So my conclusions:
upgrade to MySQL 8 if you can
make sure that you always use (when possible) the ALGORITHM=INSTANT option.
InnoDB is probably the storage engine you are using, since it's the default storage engine. InnoDB does some I/O that might seem redundant, to ensure there is no data loss.
For example:
Data and index pages modified in the buffer pool must be written to the tablespace. The table may need to split some pages during the process of adding columns, because the rows become wider, and fewer rows fit per page.
During writing pages to the tablespace, InnoDB first writes those pages to the doublewrite buffer, to ensure against data loss if there's a crash during a page write.
Transactions are written to the InnoDB redo log, and this may even result in multiple overwrites to the same block in the log.
Transactions are also written to the binary log if it is enabled for purposes of replication. Though this shouldn't be a big cost in the case of an ALTER TABLE statement, because DDL statements are always written to the binary log in statement format, not in row format.
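One way to get a rough breakdown of where the extra bytes go is to sample the InnoDB counters before and after the ALTER (a sketch; the exact counters available vary by version):

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_data_written',              -- total bytes written to data files
   'Innodb_os_log_written',            -- bytes written to the redo log
   'Innodb_dblwr_pages_written',       -- pages that went through the doublewrite buffer
   'Innodb_buffer_pool_pages_flushed');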
You also asked what can be done to speed up the ALTER TABLE. The reason to want it to run faster is usually because during an ALTER TABLE, the table is locked and may block concurrent queries.
At my company, we use the free tool pt-online-schema-change, so we can continue to use the table more or less freely while it is being altered. It actually takes longer to complete the alter this way, but it's not so inconvenient since it doesn't block our access to the table.
I have installed MySQL 8.0.25 on top of Ubuntu 20.04, running on the C5.2xlarge instance.
Then I ran a script that fills 10 tables with data. The test took exactly 2 hours, during which it created 123146.5 MB of data.
That means that on average, 17.1MB/s were written to the database.
However, atop is reporting something weird: while it shows disk activity around 18-19 MB/s, it also shows that the process mysqld wrote 864 MB in a 10 second sample, which translates to 86.4 MB/s, about 5 times as much as the amount of data actually committed to the database.
Why such a discrepancy?
iotop also typically shows MySQL writing at about 5x, and pidstat shows the same.
I also tried to use pt-diskstats from the Percona toolkit, but it didn't show anything...
I also reproduced the issue on RDS. In both cases (EC2 and RDS), the Cloudwatch statistics also show 5x writes...
The database has 10 tables that were filled.
5 of them have this definition:
CREATE TABLE `shark` (
`height` int DEFAULT NULL,
`weight` int DEFAULT NULL,
`name` mediumtext,
`shark_id` bigint NOT NULL,
`owner_id` bigint DEFAULT NULL,
PRIMARY KEY (`shark_id`),
KEY `owner_id` (`owner_id`),
CONSTRAINT `shark_ibfk_1` FOREIGN KEY (`owner_id`) REFERENCES `shark_owners` (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
another 5 tables have this definition:
CREATE TABLE `shark_owners` (
`name` mediumtext,
`owner_id` bigint NOT NULL,
PRIMARY KEY (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
I could understand if the difference was about 2x (data is first written to a transaction log and then committed to the database), but 5x?
Is this a normal behavior for MySQL, or is something in my tables triggering this?
And why are there so many "cancelled writes" (about 12%)?
LOAD DATA runs very fast, with minimal I/O
Bulk INSERT with at least 100 rows per query runs 10 times as fast as single-row Inserts.
autocommit causes at least one extra I/O after each SQL (for transactional integrity).
50 1-line Inserts, then a COMMIT is something of a compromise.
FOREIGN KEY requires checking the other table.
If innodb_buffer_pool_size is too small, there will be disk churn.
owner_id is a "secondary index". It is done in a semi-optimized way, but may involve both reads and writes, depending on a variety of things.
The tables would be smaller if you could use smaller datatypes. (E.g., BIGINT takes 8 bytes and is usually overkill.) Smaller would lead to less I/O; see the sketch after this list.
How big is name? What ROW_FORMAT is used? They conspire to lead to more or less "off-record" storage, hence disk I/O.
Were you using multiple threads when doing the Inserts?
In other words, a lot more details are needed in order to analyze your problem.
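As an illustration of the datatype point above, a slimmer variant of the shark table might look like this (a sketch only; whether these narrower types actually fit your data, and whether you can shrink shark_owners.owner_id to keep the foreign key, are assumptions):

CREATE TABLE `shark_slim` (
  `height`   smallint unsigned DEFAULT NULL,  -- was int: 2 bytes instead of 4
  `weight`   smallint unsigned DEFAULT NULL,  -- was int
  `name`     varchar(100),                    -- was mediumtext
  `shark_id` int unsigned NOT NULL,           -- was bigint: 4 bytes instead of 8
  `owner_id` int unsigned DEFAULT NULL,       -- was bigint; the referenced column must match
  PRIMARY KEY (`shark_id`),
  KEY `owner_id` (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;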
MySQL writes data several times when you use InnoDB tables. Mostly this is worth it to prevent data loss or corruption, but if you need greater throughput you may need to reduce the durability.
If you don't need durability at all, another solution is to use the MEMORY storage engine. That would eliminate all writes except the binary log and query logs.
You already mentioned the InnoDB redo log (aka transaction log). This cannot be disabled, but you can reduce the number of file sync operations. Read https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit for details.
innodb_flush_log_at_trx_commit = 0
You might reduce the number of page flushes, or help MySQL consolidate page flushes, by increasing the RAM allocation to the InnoDB buffer pool. Do not overallocate this, because other processes need RAM too.
innodb_buffer_pool_size = XXX
The binary log is a record of all committed changes. You can reduce the number of file syncs. See https://www.percona.com/blog/2018/05/04/how-binary-logs-and-filesystems-affect-mysql-performance/ for description of how this impacts performance.
sync_binlog = 0
You can also disable the binary log completely, if you don't care about replication or point-in-time recovery. Turn off the binary log by commenting out the directive:
# log_bin
Or in MySQL 8.0, they finally have a directive to explicitly disable it:
skip_log_bin
Or
disable_log_bin
See https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#option_mysqld_log-bin for details.
The doublewrite buffer is used to protect against database corruption if your MySQL Server crashes during a page write. Think twice before disabling this, but it can give you some performance boost if you disable it.
See https://www.percona.com/blog/2006/08/04/innodb-double-write/ for discussion.
innodb_doublewrite = 0
MySQL also has two query logs: the general query log and the slow query log. Either of these causes some overhead, so disable the query logs if you need top performance. https://www.percona.com/blog/2009/02/10/impact-of-logging-on-mysql’s-performance/
There are ways to keep the slow query log enabled but only if queries take longer than N seconds. This reduces the overhead, but still allows you to keep a log of the slowest queries you may want to know about.
long_query_time = 10
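Pulled together, the durability-related settings above might look like this in my.cnf (a sketch; every line trades safety or diagnostics for speed, so only keep the ones whose consequences you accept):

innodb_flush_log_at_trx_commit = 0
sync_binlog                    = 0
skip_log_bin                            # or disable_log_bin; omit if you need replication or PITR
innodb_doublewrite             = 0
innodb_buffer_pool_size        = 16G    # illustrative; size to your machine
general_log                    = 0
slow_query_log                 = 1
long_query_time                = 10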
Another strategy is to forget about optimizing the number of writes, and just let them happen. But use faster storage. In an AWS environment, this means using instance storage instead of EBS storage. This comes with the risk that the whole database may be lost if the instance is terminated, so you should maintain good backups or replicas.
I have a production mysql 8 server that has a table for user sessions for a PHP application. I am using innodb_file_per_table. The table is small at any given time (about 300-1000 rows), but rows are constantly being deleted and added. Without interference, the sessions.ibd file slowly grows until it takes up all available disk space. This morning, the table was 300 records and took up over 90GB. This built up over the long term (months).
Running OPTIMIZE TABLE reclaims all of the disk space and brings the table back under 100M. An easy solution would be to make a cron script that runs OPTIMIZE TABLE once a week during our maintenance period. Another proposed suggestion is to convert the table to a MyISAM table, since it doesn't really require any of the features of InnoDB. Both of these solutions should be effective, but they are table specific and don't protect against the general problem. I'd like to know whether there is a solution to this problem that involves database configuration.
Here are the non-default innodb configuration options that we're using:
innodb-flush-log-at-trx-commit = 1
innodb-buffer-pool-size = 24G
innodb-log-file-size = 512M
innodb-buffer-pool-instances = 8
Are there other options we should be using so that the sessions.ibd file doesn't continually grow?
Here is the CREATE TABLE for the table:
CREATE TABLE `sessions` (
`id` varchar(255) NOT NULL DEFAULT '',
`data` mediumtext,
`expires` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
In addition to additions and subtractions, the data column is updated often.
MyISAM would have a different problem -- fragmentation. After a delete, there is a hole in the table. The hole is filled in first. Then a link is made to the next piece of the record. Eventually fetching a row would involve jumping around most of the table.
If 300 rows take 100MB, a row averages 333KB? That's rather large. And does that number vary a lot? Do you have a lot of text/blob columns? Do they change often? Or is it just delete and add? Would you care to share SHOW CREATE TABLE?
I can't think how a table could grow by a factor of 900 without having at least one very large, multi-GB row added and then deleted. Perhaps with the schema, I could think of some cause and/or workaround.
In summary, date range partitioning and memory configuration achieved my goal.
I needed to increase the memory allocated to innodb_buffer_pool_size, as the default 8M was far too low. Rick James recommends 70% of RAM for this setting; his page has a lot of great information.
Edlerd was correct with both suggestions :-)
I split my data into monthly partitions and then ran a query returning 6,000 rows, which originally took between 6 and 12 seconds. It now completes in less than a second (.984/.031). I ran this using the default InnoDB buffer size (innodb_buffer_pool_size = 8M) to make sure it wasn't just the memory increase.
I then set innodb_buffer_pool_size = 4G and ran the query with an even better response of .062/.032.
I'd also like to mention that increasing the memory has improved the overall speed of my web application and the service which receives and writes messages to this table. I am astounded at how much of a difference this configuration setting has made. The Time To First Byte (TTFB) from my web server, which at times would reach 20 seconds, is now almost on par with MySQL Workbench.
I also found that the slow query log file was an excellent tool for identifying issues; it was there that I saw the suggestion that my innodb_buffer_pool_size was low, and it highlighted all the poorly performing queries. This also identified areas where I needed to index other tables.
EDIT 2016-11-12 SOLUTION
I am in the process of refactoring a large table that logs telemetry data. It has been running for about 4-5 months and has generated approx. 54 million records with an average row size of approx. 380 bytes.
I have started to see some performance lag on one of my raw data queries that returns all logs for a device over a 24 hour period.
Initially I thought it was indexing, but I think it is the amount of I/O that needs to be processed by MySQL. A typical 24-hour query would return 3k to 9k records, and I'd actually like to support an export of about 7 days.
I am not experienced in database performance tuning so still just learning the ropes. I am considering a few strategies.
Tweak compound indexes according to query for raw data, although I think my indexes are OK as the explain plan is showing 100% hit rate.
Consider creating a covering index that includes all columns needed
Implement ranged partitioning by date:
a) Keep monthly partitions. E.g. last 6 months
b) Move anything older to archive table.
Create a separate table (vertical partitioning) with the raw data and join it with the IDs of the primary query table. Not sure this is my problem as my indexes are working.
Change my queries to pull data in batches with limits, then order by created date limit X and carry on until no more records are returned.
Review server configuration
1,2 (INDEXES):
I’ll rework my indexes with my queries, but I think I am good here as Explain is showing 100% hit, unless I am reading this wrong.
I'll try a covering index when they are rebuilt, but how do I determine the knock-on effects of a bad choice, e.g. insert speeds being compromised?
How would I best monitor the performance of my table in a live environment?
EDIT: I've just started using the slow log file which looks like a good tool for finding issues and I suppose a query on the performance_schema might be another option?
3 (PARTITIONING):
I have read a bit about partitions and not sure if the size of my data would make much of a difference.
Rick James suggests >1M records; I'm at 54M and would like to keep around 300M prior to archiving. Is my table complex enough to benefit?
I have to test this out myself as I do not have experience with any of this stuff and it is all theoretical to me. I just don't want to go down this path if it isn't suitable for my needs.
4 (Vertical partitioning via 'joined' detail table): I don't think I am having table scan issues, and I need all rows, so I am not sure this technique would be of benefit.
5 (Use limits and fetch again): Would this free up the server if I used less of its time in a single request? Would I see better I/O throughput at the cost of more commands on the same connection?
6 (Review Config): The other piece would be to review the default non developer configuration that is used when you install MySQL, perhaps there are some settings that can be adjusted? :-)
Thanks for reading, keen to hear any and all suggestions.
The following FYI:
TABLE:
CREATE TABLE `message_log` (
`db_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`db_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`created` datetime DEFAULT NULL,
`device_id` int(10) unsigned NOT NULL,
`display_name` varchar(50) DEFAULT NULL,
`ignition` binary(1) DEFAULT NULL COMMENT 'This is actually IO8 from the falcom device',
`sensor_a` float DEFAULT NULL,
`sensor_b` float DEFAULT NULL,
`lat` double DEFAULT NULL COMMENT 'default GPRMC format ddmm.mmmm \n',
`lon` double DEFAULT NULL COMMENT 'default GPRMC longitude format dddmm.mmmm ',
`heading` float DEFAULT NULL,
`speed` float DEFAULT NULL,
`pos_validity` char(1) DEFAULT NULL,
`device_temp` float DEFAULT NULL,
`device_volts` float DEFAULT NULL,
`satellites` smallint(6) DEFAULT NULL, /* TINYINT will suffice */
`navdist` double DEFAULT NULL,
`navdist2` double DEFAULT NULL,
`IO0` binary(1) DEFAULT NULL COMMENT 'Duress',
`IO1` binary(1) DEFAULT NULL COMMENT 'Fridge On/Off',
`IO2` binary(1) DEFAULT NULL COMMENT 'Not mapped',
`msg_name` varchar(20) DEFAULT NULL, /* Will be removed */
`msg_type` varchar(16) DEFAULT NULL, /* Will be removed */
`msg_id` smallint(6) DEFAULT NULL,
`raw` text, /* Not needed in primary query, considering adding to single table mapped to this ID or a UUID correlation ID to save on #ROWID query */
PRIMARY KEY (`db_id`),
KEY `Name` (`display_name`),
KEY `Created` (`created`),
KEY `DeviceID_AND_Created` (`device_id`,`created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DeviceID_AND_Created is the main index. I need the PK clustered index because I am using the record ID in a summary table that keeps track of the last message for a given device. Created would be the partition column, so I guess that would also be added to the PK cluster?
QUERY:
SELECT
ml.db_id, ml.db_created, ml.created, ml.device_id, ml.display_name, bin(ml.ignition) as `ignition`,
bin(ml.IO0) as `duress`, bin(ml.IO1) as `fridge`,ml.sensor_a, ml.sensor_b, ml.lat, ml.lon, ml.heading,
ml.speed,ml.pos_validity, ml.satellites, ml.navdist2, ml.navdist,ml.device_temp, ml.device_volts,ml.msg_id
FROM message_log ml
WHERE ml.device_id = #IMEI
AND ml.created BETWEEN #STARTDATE AND DATE_ADD(#STARTDATE,INTERVAL 24 hour)
ORDER BY ml.db_id;
This returns all logs for a given 24-hour period, which at the moment is approx. 3k to 9k rows with an average row size of 381 bytes; this will be reduced once I remove one of the TEXT fields (raw).
Implement ranged partitioning by date: a) Keep monthly partitions. E.g. last 6 months b) Move anything older to archive table.
This is a very good idea. I guess all the writes will be in the newest partition and you will query recent data only. You always want a situation where your data and indexes fit in memory, so there is no disk I/O on reads.
Depending on your use case it might even be wise to have one partition per week. Then you only have to keep max two weeks of data in memory for reading the last 7 days.
You might also want to tune your buffer sizes (i.e. innodb_buffer_pool_size) if you are using InnoDB as the engine, or the MyISAM key cache when using the MyISAM engine.
Also, adding RAM to the DB machine usually helps, as the OS can then keep the data files in memory.
If you have heavy writes, you can also tune other options (e.g. how often writes are persisted to disk, via innodb_log_buffer_size), in order to let dirty pages stay in memory for longer and avoid writing them back to disk too often.
For those who are curious, the following is what I used to create my partition and configure memory.
Creating the partitions
Updated the PK to include the range column used for partitioning.
ALTER TABLE message_log
CHANGE COLUMN created created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
DROP PRIMARY KEY,
ADD PRIMARY KEY (db_id, created);
Added the partitions using ALTER TABLE.
In hindsight, I should have created each partition with its own ALTER statement and used REORGANIZE PARTITION for the subsequent partitions (sketched after the ALTER below), as doing it in one hit consumed a lot of resources and time.
ALTER TABLE message_log
PARTITION BY RANGE(to_days(created)) (
partition invalid VALUES LESS THAN (0),
partition from201607 VALUES LESS THAN (to_days('2016-08-01')),
partition from201608 VALUES LESS THAN (to_days('2016-09-01')),
partition from201609 VALUES LESS THAN (to_days('2016-10-01')),
partition from201610 VALUES LESS THAN (to_days('2016-11-01')),
partition from201611 VALUES LESS THAN (to_days('2016-12-01')),
partition from201612 VALUES LESS THAN (to_days('2017-01-01')),
partition from201701 VALUES LESS THAN (to_days('2017-02-01')),
partition from201702 VALUES LESS THAN (to_days('2017-03-01')),
partition from201703 VALUES LESS THAN (to_days('2017-04-01')),
partition from201704 VALUES LESS THAN (to_days('2017-05-01')),
partition future VALUES LESS THAN (MAXVALUE)
);
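For later months, the future partition can be split with REORGANIZE PARTITION rather than rebuilding the whole scheme (a sketch following the naming pattern above):

ALTER TABLE message_log
REORGANIZE PARTITION future INTO (
partition from201705 VALUES LESS THAN (to_days('2017-06-01')),
partition future VALUES LESS THAN (MAXVALUE)
);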
NOTE: I am not sure if using to_days() or the raw column makes much difference, but I've seen it used in most examples so I've taken it on as an assumed best practice.
Setting the buffer pool size
To change the value of innodb_buffer_pool_size, you can find info in MySQL InnoDB Buffer Pool Resize and Rick James's page on memory.
You can also do it in MySQL Workbench, in the options file menu under the InnoDB tab. Any changes you make there are written to the config file, but you'll need to restart MySQL for them to be read; alternatively, you can set the global value to do it live.
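To set it live (assuming MySQL 5.7.5 or later, where the buffer pool can be resized online):

SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;   -- 4G
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';           -- watch the resize progress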
Such a deal! I get 4 mentions, even without writing a comment or answer. I'm writing an answer because I may have some further improvements...
Yes, PARTITION BY RANGE(TO_DAYS(...)) is the right way to go. (There may be a small number of alternatives.)
70% of 4GB of RAM is tight. Be sure there is no swapping.
You mentioned one query. If it is the main one of concern, then this would be slightly better:
PRIMARY KEY(device_id, created, db_id), -- desired rows will be clustered
INDEX(db_id) -- to keep AUTO_INCREMENT happy
If you are not purging old data, then the above key suggestion provides just as much efficiency even without partitioning.
lat/lon representation says that DOUBLE is overkill.
Beware of the inefficiency of UUID, especially for huge tables.
I have a partitioned InnoDB mysql table, and I need to insert hundreds of millions of rows.
I am currently using the LOAD DATA INFILE command for loading many (think 10's of thousands) of .csv files into said table.
What are the performance implications if I simultaneously insert large blocks of data into different distinct partitions?
Might I benefit from running multiple processes which each run batches of LOAD DATA INFILE statements?
Miscellaneous information:
Hardware: Intel i7, 24GB ram, Ubuntu 10.04 w/ MySQL 5.5.11, Raid 1 storage
#mysql on freenode IRC told me that the performance implications will be the same as with normal InnoDB or MyISAM: InnoDB will do row-level locking and MyISAM will do table-level locking.
Table Structure:
CREATE TABLE `my_table` (
`short_name` varchar(10) NOT NULL,
`specific_info` varchar(20) NOT NULL,
`date_of_inquiry` datetime DEFAULT NULL,
`price_paid` decimal(8,2) DEFAULT NULL,
`details` varchar(255) DEFAULT '',
UNIQUE KEY `unique_record` (`short_name`,`specific_info`,`date_of_inquiry`),
KEY `short_name` (`short_name`),
KEY `underlying_quotedate` (`short_name`,`date_of_inquiry`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50500 PARTITION BY LIST COLUMNS(short_name)*/
(PARTITION pTOYS_R_US VALUES IN ('TOYS-R-US') ENGINE = InnoDB,
PARTITION pZAPPOS VALUES IN ('ZAPPOS') ENGINE = InnoDB,
PARTITION pDC VALUES IN ('DC') ENGINE = InnoDB,
PARTITION pGUCCI VALUES IN ('GUCCI') ENGINE = InnoDB,
...on and on...
);
Not a full list, but some pointers...
The fastest way to insert rows is to use LOAD DATA INFILE
See: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
If that's not an option and you want to speed up things, you'll need to find the bottleneck and optimize for that.
If the partitions are across a network, network traffic might kill you; the same goes for CPU, disk I/O and memory. Only profiling a sample will tell.
Disable key updates
If you cannot do LOAD DATA INFILE, make sure you disable key updates:
ALTER TABLE table1 DISABLE KEYS
... lots of inserts
ALTER TABLE table1 ENABLE KEYS
Note that disabling key updates only disables non-unique keys; unique keys are always updated.
Binary log
If you have the binary log running, it will record all those inserts, so consider disabling it. You can disable it with MySQL running by replacing the log with a symlink pointing to /dev/null for the duration of the mass insert.
If you want the binary log to persist, you can do a simultaneous insert to a parallel database with blackhole tables and binary log enabled.
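A sketch of that approach (database and table names are illustrative, and the partitioning is dropped on the blackhole copy to keep it simple):

CREATE DATABASE binlog_only;
CREATE TABLE binlog_only.my_table LIKE my_db.my_table;
ALTER TABLE binlog_only.my_table REMOVE PARTITIONING;
ALTER TABLE binlog_only.my_table ENGINE=BLACKHOLE;
-- INSERTs into binlog_only.my_table store nothing, but they are still written to the binary log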
Autoincrement key
If you let MySQL calculate the autoincrement key, this will create contention around the key generation. Consider feeding MySQL a precalculated auto-increment primary key value instead of NULL.
Unique keys
Unique keys are checked on every insert (for uniqueness), and they eat a lot of time, because MySQL needs to consult that index on every insert.
If you know that the values that you insert are unique, it's better to drop that requirement and add it after you are done.
When you add it back in, MySQL will take a lot of time checking, but at least it will do it only once, not on every insert.
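For the partitioned table in the question, that would be (a sketch; re-adding the key will fail if duplicate rows were loaded in the meantime):

ALTER TABLE my_table DROP INDEX unique_record;

-- ...run the LOAD DATA INFILE batches...

ALTER TABLE my_table ADD UNIQUE KEY unique_record (short_name, specific_info, date_of_inquiry);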
If you want to get maximum I/O performance from it you'll want the different partitions on different disks volumes.
I'm not sure about the performance implications if all of the partitions are on the same physical disks but obviously you're more likely to run out of I/O capacity that way.
It's likely to depend on your machine specs, but for what it's worth I've tried this and it definitely speeds things up for my specific task. I.e., it takes me about an hour to load all the data into one partition. If I don't partition, I have to perform the task serially, so it takes 12 * 1 = 12 hours. However, on my machine with 24 cores, I can parallelize the task and complete it in just 1 hour.