MySQL index creation is slow (on EC2) - mysql

I have a fairly simple table:
CREATE TABLE requestparams (
requestid varchar(64) NOT NULL,
requestString text
) ENGINE=MyISAM;
After populating the table with "LOAD DATA", I am changing the schema and making "requestid" the primary key.
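The change is essentially along these lines (a minimal sketch of the statement):
ALTER TABLE requestparams ADD PRIMARY KEY (requestid);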
There are 11 million rows in this table and the data size is less than 2GB (the size of the MYD file). The index file size is around 600MB at the end of the process.
Creating the index takes less than 20 minutes on my laptop, but when I ran the process on Amazon EC2 (a medium instance), it took more than 12 hours. During the entire process the disk was extremely busy, with IO wait (as seen in top) between 40-100%. The CPU was mostly idle. I don't think the disks on EC2 are that slow.
On the MySQL mailing list, some people suggested changing the server variables myisam_sort_buffer_size and myisam_max_sort_file_size. I set them to 512MB and 4GB respectively, but index creation was equally slow. In fact, MySQL's memory usage rarely went beyond 40MB.
How do I fix this?
Solution: Increasing "key_buffer_size" helped. I set this value to 1GB and the process completed in 4 minutes. Make sure you verify the new settings with the mysqladmin variables command.

The suggestions you received are correct. I suspect the changes you made didn't actually take effect. Try making the changes in my.cnf and restart the server.
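For example, the settings could go into my.cnf along these lines (values sized for the machine described above; treat this as a sketch and adjust to your RAM):
[mysqld]
key_buffer_size = 1G
myisam_sort_buffer_size = 512M
myisam_max_sort_file_size = 4G
After restarting, confirm they took effect with: mysqladmin -u root -p variables | grep -E 'key_buffer_size|myisam_sort'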

Related

How to diagnose extremely slow AWS RDS MySQL Performance?

My DB has around 15 tables, each with 40 columns and 10,000 rows.
Most of the columns are VARCHAR; there are some indexes and foreign keys.
Sometimes I need to reconstruct my database (design flaw, working on it), which takes about 40 seconds locally. Now I'm trying to do the same on an AWS RDS MySQL 5.7 instance, but it takes forever, something like 40-50 minutes. The last time I had to do this same process it took no more than 5 minutes, still way more than the local 40 seconds, but I'm happy with it.
My internet speed is at about 35 Mbps Download / 5 Mbps Upload.
I know it's not fast, but it's consistent, and it hasn't changed since my last rebuild.
I enabled General Logs, but all I can see are the INSERT queries, occasionally some "SELECT 1".
I do have some room for improvement in my code, but still, going from about 40 seconds to 50 minutes suggests that something else is going on.
Any ideas on how to diagnose and find the bottleneck?
Thanks
--
Additional relevant information:
It is a Micro instance from AWS, and all of the relevant monitoring indicators are basically flat: CPU at 4%, Free Storage Space at 20,000 MB, Freeable Memory at 200 MB, Write IOPS at around 2.5. The server runs MySQL 5.7.25, with 1 vCPU, 1GB of RAM, and 20GB of SSD. This is the same as 3 months ago when I last rebuilt the database.
SHOW GLOBAL STATUS: https://pastebin.com/jSrAzYZP
SHOW GLOBAL VARIABLES: https://pastebin.com/YxD7dVhR
SHOW ENGINE INNODB STATUS: https://pastebin.com/r5wffB5t
SHOW PROCESS LIST: https://pastebin.com/kWwiyGwf
SELECT * FROM information_schema...: https://pastebin.com/eXGBmetP
I haven't made any big changes to the server configuration, except enabling logs, maxing out max_allowed_packet, and saving logs to file.
In my backend I have a Flask app running; when it receives the API call, it takes a bunch of pickled objects and adds them all to the database (appending the Flask-SQLAlchemy model instances to a list) and then runs db.session.add_all(entries), trying to do a bulk operation. The code is the same for both localhost and my remote server.
It does get slower on three specific tables, most of them with VARCHAR columns, but nothing different from my last inserts - it seems odd that the problem would be the data or the way the code is structured, or at least it doesn't seem reasonable that this would result in going from 20 seconds (localhost) to 40 minutes (hosted server), especially when the rest of the tables work mostly the same.
Enable the slow log, set long_query_time=0, run your code, then put the resulting log through mysqldumpslow.
Establish which queries contribute most to slowness and take it from there.
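A rough sketch of that workflow (on RDS these variables are normally changed through the DB parameter group rather than SET GLOBAL, and the log path below is a placeholder):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0;
SET GLOBAL log_output = 'FILE';
Then run the rebuild and summarize the log, e.g. mysqldumpslow -s t /var/lib/mysql/slow.log | head -40, to see which statements cost the most total time.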
Compare the config between your old server and your new one.
Also, are they the same version of MySQL? 5.6, 5.7 and 8.0 can produce very different execution plans (with 5.6 usually coming up with the sane one if they differ).
Rate Per Second = RPS
Suggestions to consider for your AWS RDS Parameters group
thread_cache_size=24 # from 8 to reduce threads_created count
innodb_io_capacity=1900 # from 200 to enable more use of SSD IOPS capacity
read_rnd_buffer_size=128K # from 512K to reduce handler_read_rnd_next RPS of 21
query_cache_size=0 # from 1M since you have QC turned off with query_cache_type=OFF
Determine why com_flush is running 13 times per hour and get it stopped to avoid table open thrashing.
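To see whether those counters are really moving, you can sample the status values yourself, for example:
SHOW GLOBAL STATUS LIKE 'Com_flush';
SHOW GLOBAL STATUS LIKE 'Opened_tables';
Run them a few minutes apart; if Com_flush and Opened_tables keep climbing, something is issuing FLUSH and forcing tables to be reopened.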
I found that after migrating to RDS all my database indexes were gone! They weren't migrated along with the schema and data. Make sure your indexes are there.
Also, MySQL query cache is OFF by default in RDS. This won't help the performance of your initial query, but it may speed things up in general.
You can set query_cache_type to 1 and define a value for query_cache_size. I also changed thread_cache_size from 8 to 24 and innodb_io_capacity from 200 to 1900; I don't know if that helps you.
Also, creating AWS DB Parameter Groups helped me a lot with configuring and tuning DB variables. You can read more here:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithParamGroups.html
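If you script your tuning, parameter group changes can also be applied from the AWS CLI (the group and instance names below are placeholders):
aws rds modify-db-parameter-group --db-parameter-group-name my-params --parameters "ParameterName=thread_cache_size,ParameterValue=24,ApplyMethod=immediate"
aws rds modify-db-instance --db-instance-identifier my-instance --db-parameter-group-name my-params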

Improve performance of mysql LOAD DATA / mysqlimport?

I'm bulk-loading a 15GB CSV (30 million rows) into a MySQL 8 database.
Problem: the task takes about 20 minutes, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a RAM disk of 20GB, which holds my csv. Import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
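For each file, mysqlimport essentially issues something along these lines (the table name is derived from the file name, so /tmp/mydata.csv targets a table called mydata; field and line terminators default to tab and newline unless overridden):
LOAD DATA LOCAL INFILE '/tmp/mydata.csv' INTO TABLE mydata;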
My target table does not have any indexes, but has approximately 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes? Or does it bypass them? Or did I even use the correct configuration file?
At least a select on the variables shows they were correctly loaded by the MySQL server. For example, show variables like 'innodb_doublewrite' shows OFF.
Anyways, how could I improve import speed further? Or is my database the bottleneck and there is no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly, if I import my csv from the hard drive into the ramdisk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same number of rows, but with only a few (5) columns, and there I get to about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engine has little parallelization when doing bulk inserts: it can only use one CPU core per LOAD DATA statement. You can monitor CPU utilization during the load to see that one core is fully utilized and can only produce so much output data, leaving disk throughput underutilized.
The most recent version of MySQL has a new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
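For reference, that utility is driven from MySQL Shell rather than the mysql client; a rough sketch (schema and table names are placeholders, and option names should be checked against your shell version):
// inside: mysqlsh --js root@localhost
util.importTable("/tmp/mydata.csv", {
  schema: "mydb",          // target schema (placeholder)
  table: "mytable",        // target table (placeholder)
  dialect: "csv-unix",     // comma-separated, newline-terminated input
  threads: 8,              // parallel worker threads
  bytesPerChunk: "200M"    // chunk size handed to each thread
});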
I saw various checklists on the internet that recommended higher values for the following config params: log_buffer_size, log_file_size, write_io_threads, bulk_insert_buffer_size. But the benefits were not very pronounced when I ran comparison tests (maybe 10-20% faster than just having innodb_buffer_pool_size be large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written to disk in a delayed manner (cf. 'change buffering'). The change buffer, by default, occupies 25% of the buffer_pool.
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.
Please provide SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
If copying the csv from hard disk to ram disk runs slowly, I would suspect the validity of 150 MB/s.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE '...' INTO TABLE t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;

Issues with wikipedia dump table pagelinks

I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/.
I unpacked the file; its uncompressed size is 37G.
The table structure is this:
SHOW CREATE TABLE wp_dump.pagelinks;
CREATE TABLE `pagelinks` (
`pl_from` int(8) unsigned NOT NULL DEFAULT '0',
`pl_namespace` int(11) NOT NULL DEFAULT '0',
`pl_title` varbinary(255) NOT NULL DEFAULT '',
`pl_from_namespace` int(11) NOT NULL DEFAULT '0',
UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`),
KEY `pl_backlinks_namespace` (`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
I imported the table into a new, empty database:
mysql -D wp_dump -u root -p < enwiki-latest-pagelinks.sql
The computer I am running the task on has 16G of RAM and the mysql database is located on a SSD, so I was assuming that despite the table's size the import would not take too long.
However, the task has now been running for over a day and is still not finished. There are no other processes accessing mysql and there is no workload on the computer.
The database file itself is now 79G.
ls -lh
-rw-r----- 1 mysql mysql 65 May 11 17:40 db.opt
-rw-r----- 1 mysql mysql 8,6K May 12 07:06 pagelinks.frm
-rw-r----- 1 mysql mysql 79G May 13 16:59 pagelinks.ibd
The table now has over 500 million rows.
SELECT table_name, table_rows FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'wp_dump';
+------------+------------+
| table_name | table_rows |
+------------+------------+
| pagelinks | 520919860 |
+------------+------------+
I am wondering:
Is the pagelinks table really over 79G large?
Does pagelinks really contain over 500 million rows?
Does it really take that long to import the pagelinks table?
Can you provide some metrics for the expected table size and row count, please?
Update: 14th May, 2017:
insert still running; the pagelinks.ibd file is now 130G; the row count is now almost 700 million
Update: 16th May, 2017:
insert still running; the pagelinks.ibd file is now 204G; the row count is now over 1.2 billion
I calculated the rows inserted per second over the last two days:
rows/sec = 3236
And: there are many thousands of rows per INSERT statement in the sql script (head -41 enwiki-latest-pagelinks.sql | tail -1 | grep -o "(" | wc -l gives 30471)
So, my follow-up / modified questions:
Are the number of rows and the ibd file size to be expected, given the sql file size of 37G and the table structure (as listed above)?
Is rows/sec = 3236 a good value (meaning it will take a few days to insert the table)?
What may be the limiting speed factor / how can I speed up the import?
Disable the indexes (and calculate them after the insert)?
Optimize transactions (commit (nothing set in script) / autocommit (now ON))?
Optimize variable settings (e.g. innodb_buffer_pool_size, now 134217728)?
@Sim Betren: I am currently importing the same table; I can get about 7,700 rows/s, which means about 600,000,000 rows a day. Probably the most important thing is to get the right settings on InnoDB:
https://dba.stackexchange.com/questions/83125/mysql-any-way-to-import-a-huge-32-gb-sql-dump-faster
innodb_buffer_pool_size = 4G
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
innodb_write_io_threads = 16
innodb_flush_log_at_trx_commit = 0
Those settings work well. From what I read and tried, InnoDB loves high memory settings. Ideally, one would use a 16Gb or even 32Gb machine, then increase these settings even more. But I got 7700 rows/s on a modest setup that's almost 10 years old:
Intel Q6700 quad
8 Gb DDR2 memory
I combined that 10 year old hardware with a 2017 model 500Gb SSD, which is dedicated to the job and handles both the reading and the writing. The reason for using the older hardware is that the SSD is the most important part of the setup (because of IOPS). Plus, by using older hardware I saved a bit of money. However, the hardware is limited to 8Gb of DDR2. A newer dedicated machine with 32Gb or 64Gb of internal memory could really fly, I reckon.
Software setup:
Linux Mint 64bit
MySQL Server 5.7.18 for Ubuntu
MySQL Workbench for importing
I also tried this on Windows 10 and the speed is almost the same on both. So you could try Windows too.
Note: I did try changing the engine to MyISAM. MyISAM can be pretty fast, also around 8000 rows/sec or more. But the import always became corrupted for some reason, so I would stick with InnoDB.
Update 17-06-2017:
Finished the import. The table "pagelinks" is about 214Gb large with 1200 million rows. About 112Gb is raw data, 102Gb are indexes. The original uncompressed file was about 37Gb.
It took about 2 days and 6 hours to import. Avg speed = 5350 rows/second. With high end equipment (huge memory, preferably 64Gb or more) and optimal settings, it can probably be done faster. But I let it run on a dedicated machine 24/7 and I wasn't in a hurry, so 2 days seems OK.
Update 18-06-2017:
Also imported "page.sql" because this contains the names connected to the ID's. The uncompressed file is about 5Gb, import took 1 hour. Which seems quick: the pagelink file is about 37Gb which is 7x bigger than "page.sql". Yet takes 50x longer to import. So, there are a few reasons why "pagelinks" takes so long: (A) probably because it doesn't fit in memory (B) The table structure, many data per insert (C) Settings. But most likely it's memory.
Conclusion: try get a PC with 32Gb or 64Gb internal memory. Maybe even more. And use SSD's that can keep up with that memory, 500Gb or more. The SSD is more important than memory so try that first.
@Sim Betren:
I want to open a completely new answer, since I have discovered a new solution. Splitting up the file is probably the best answer. As discussed in the other answer, InnoDB works best when the entire model fits in memory. The delays start when it needs to swap stuff to disk. The pagelinks file is 37Gb, and that's simply too big for most machines to fit easily into memory. Maybe a $1000+ dedicated machine with endless memory can do it, but most desktops can't. So what you can do:
The plan is to split the file. The first thing to do is to separate the SQL structure from the data.
There's probably better ways to do that, but a program I found was this:
SqlDumpSplitter2
That dump splitter program may be old, but it worked on pagelinks. It is Windows only though. I simply told it to split the unpacked 37Gb file into 37 chunks of 1Gb and it dutifully did. I checked the data and it seems to be working. You could also use 74 chunks of 500Mb.
The import of each file takes maybe 10 to 20 minutes per 1Gb.
Total time: about 1 to 2 hours for splitting the 37Gb file. About 6 to 12 hours for importing. That easily beats the previous answer I gave.
When importing, use the same big data settings as in the previous answer. And try to find a machine that has big memory 16Gb or 32Gb preferred.
What is most important here is: it doesn't really matter that much how you split it. Just split the file any way you can. Then build it up by re-creating the structure and the data separately. This way the import could go down from 2 days to maybe as little as a few hours. Given a big dedicated machine, it can probably be done in just 1 to 6 hours.
37GB of data --> 79GB of InnoDB table seems reasonable...
Title: 2 quotes and 1 comma --> 1 byte for length
Ints: several bytes, plus comma --> 4 bytes for INT (regardless of the (...) after INT; see MEDIUMINT).
20-30 bytes overhead per row
20-40% overhead for BTrees.
UNIQUE index becomes PRIMARY KEY and clusters with data --> very little overhead.
Other two indexes: Each is virtually the same size as the data. This more than allows for the increased size.
Adding it all together, I would expect the table to be more than 120GB. So, there are probably some details missing. A guess: The dump is one row per INSERT, instead of the less verbose many-rows-per-INSERT.
As for performance, it all depends on the SELECTs. Set innodb_buffer_pool_size to somewhere around 11G. This may work efficiently enough for caching the 79G.
More
Change UNIQUE to PRIMARY, for clarity and because InnoDB really needs a PK (a sketch of the change follows this list).
Check the source data. Is it in (pl_from,pl_namespace,pl_title) order? If not, can you sort the file before loading? If you can, that, alone, should significantly help the speed.
128MB for the buffer_pool is also significantly hampering progress.
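For the first point, the change itself is cheap if done before loading, while the table is still empty; a sketch:
ALTER TABLE pagelinks
DROP INDEX pl_from,
ADD PRIMARY KEY (pl_from, pl_namespace, pl_title);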

skip copying to tmp table on disk mysql

I have a question about large MySQL queries. Is it possible to skip the "copying to tmp table on disk" step that MySQL takes for large queries, or is it possible to make it go faster? This step is taking way too long to get the results of my queries back. I read on the MySQL page that MySQL performs this to save memory, but I don't care about saving memory, I just want to get the results of my queries back FAST; I have enough memory on my machine. Also, my tables are properly indexed, so that's not the reason why my queries are slow.
Any help?
Thank you
There are a few things you can do to lessen the impact of this
OPTION #1 : Increase the variables tmp_table_size and/or max_heap_table_size
These options govern how large an in-memory temp table can be before it is deemed too large and is paged to disk as a temporary MyISAM table. The larger these values are, the less likely you are to get 'copying to tmp table on disk'. Please make sure your server has enough RAM and that max_connections is moderately configured, should a single DB connection need a lot of RAM for its own temp tables.
OPTION #2 : Use a RAM disk for tmp tables
You should be able to configure a RAM disk in Linux and then set the tmpdir in mysql to be the folder that has the RAM disk mounted.
For starters, configure a RAM disk in the OS
Create a folder in Linux called /var/tmpfs
mkdir /var/tmpfs
Next, add this line to /etc/fstab (for example, if you want a 16GB RAM disk)
none /var/tmpfs tmpfs defaults,size=16g 1 2
and reboot the server.
Note : It is possible to make a RAM disk without rebooting. Just remember to still add the aforementioned line to /etc/fstab to have the RAM disk after a server reboot.
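For example, a 16GB RAM disk matching the fstab line above can be mounted on the fly like this:
mount -t tmpfs -o size=16g none /var/tmpfs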
Now for MySQL:
Add this line in /etc/my.cnf
[mysqld]
tmpdir=/var/tmpfs
and restart mysql.
OPTION #3 : Get tmp table into the RAM Disk ASAP (assuming you apply OPTION #2 first)
You may want to force tmp tables into the RAM disk as quickly as possible so that MySQL does not spin its wheels migrating large in-memory tmp tables into a RAM disk. Just add this to /etc/my.cnf:
[mysqld]
tmpdir=/var/tmpfs
tmp_table_size=2K
and restart mysql. This will cause even the tiniest temp table to be brought into existence right in the RAM disk. You could periodically run ls -l /var/tmpfs to watch temp tables come and go.
Give it a Try !!!
CAVEAT
If you see nothing but temp tables in /var/tmpfs 24/7, this could impact OS functionality/performance. To make sure /var/tmpfs does not get overpopulated, look into tuning your queries. Once you do, you should see fewer tmp tables materializing in /var/tmpfs.
You can also skip the "copying to tmp table on disk" part entirely (not covered in the selected answer)
if you avoid certain data types:
variable-length data types (including BLOB and TEXT) are not supported by the MEMORY engine.
from https://dev.mysql.com/doc/refman/8.0/en/memory-storage-engine.html
(or https://mariadb.com/kb/en/library/memory-storage-engine/ if you are using mariadb).
If your temporary table is small enough : as said in selected answer, you can
Increase the variables tmp_table_size and/or max_heap_table_size
But if you split your query into smaller queries (you did not post the query, which makes your problem harder to analyze), you can make it fit inside an in-memory temporary table.

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30,000 rows, I get a significant slowdown (around 200 insertions/sec), and then a slower but still noticeable further slowdown.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the index, but this is done on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30,000 rows taking so much memory, even more so just the indexes (the whole mysql data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect some configuration issues, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM with 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after setting innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
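For illustration, a rough sketch of range partitioning on the events table above. Note that MySQL requires every unique key to include the partitioning column, so the UNIQUE KEY on id would first have to be demoted to a plain index (or extended):
ALTER TABLE events DROP INDEX id, ADD KEY (id);  -- unique keys must contain the partition column
ALTER TABLE events PARTITION BY RANGE (added_id) (
PARTITION p0 VALUES LESS THAN (10000000),
PARTITION p1 VALUES LESS THAN (20000000),
PARTITION p2 VALUES LESS THAN MAXVALUE
);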
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE TABLE / OPTIMIZE TABLE.
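For the tables in the question, that would be something like:
ANALYZE TABLE events, index_fpid;
OPTIMIZE TABLE events, index_fpid;
(For InnoDB, OPTIMIZE TABLE is mapped to a full table rebuild, so it is best run during a quiet period.)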
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.