Optimizing Performance of Bulk Update of Existing MySQL Data

I'm trying to perform a bulk update of around 200,000 existing MySQL rows. More specifically, I need to update eight empty LONGBLOB fields in each of these rows with a ~0.5 MB file apiece (LONGBLOB is used because there are some special cases where significantly larger files are stored; however, those are not involved in this bulk update). The files that need to be inserted are stored locally on disk.
I'm using a MATLAB script I wrote that loops through each of the folders storing these files, reads the files in and converts them to hexadecimal representations, then executes an UPDATE query to fill the eight columns with the eight files for each row.
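For illustration, each generated statement looks roughly like this; a minimal sketch with hypothetical table and column names, where UNHEX() turns the hex string MATLAB builds back into binary:

UPDATE device_files                        -- hypothetical table name
SET blob1 = UNHEX('89504E470D0A1A0A'),     -- hex string read from one file
    blob2 = UNHEX('4D5A90000300'),         -- ... and so on for all eight columns
    blob8 = UNHEX('255044462D312E37')
WHERE id = 42;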
Initially, things ran fairly quickly; however, I noticed that after a couple thousand completed queries things really started to slow down. I did a bit of research on optimizing MySQL and InnoDB system variables and increased my innodb_buffer_pool_size to 25G and innodb_buffer_pool_instances to 25.
After this modification, things sped up again, but slowed down after another couple thousand queries. I did a little more research and experimented with some other variables, such as innodb_log_buffer_size and innodb_log_file_size, increasing both to 100M just to see what would happen. I also set innodb_write_io_threads and innodb_read_io_threads to 16, as I am running this all on a fairly high-end server with 32 GB of RAM. Unfortunately, these modifications didn't help much, and now I'm stuck with queries taking a few minutes each to complete.
Does anyone have any suggestions or ideas on how I can optimize this process and have it run as fast as possible?
Thanks,
Joe

innodb_buffer_pool_instances=8 will likely serve your requirements with less overhead.
innodb_log_buffer_size=10M buffers redo writes in memory and flushes to the InnoDB log file only AFTER 10M has accumulated. A ratio of buffer size * 10 = log file size is reasonable.
When innodb_log_buffer_size is the same as innodb_log_file_size, you effectively have NO buffering. Make the buffer size much smaller than the log file size.
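Put together, a minimal my.cnf sketch along those lines (the exact figures are illustrative; the point is the buffer-to-file ratio):

innodb_buffer_pool_instances = 8
innodb_log_buffer_size       = 10M
innodb_log_file_size         = 100M   # roughly 10x the log buffer, per the ratio above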

Related

Improve performance of mysql LOAD DATA / mysqlimport?

I'm batch-loading a 15 GB CSV file (30 million rows) into a MySQL 8 database.
Problem: the task takes about 20 min, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a 20 GB RAM disk which holds my CSV. I import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
My target table does not have any indexes, but it has approximately 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes? Or does it bypass them? Or did I even use the correct configuration file?
At least a select on the variables shows they are correctly loaded by the MySQL server. For example, SHOW VARIABLES LIKE 'innodb_doublewrite' shows OFF.
Anyway, how could I improve import speed further? Or is my database the bottleneck, with no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly, if I import my CSV from the hard drive rather than from the RAM disk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same number of rows but with only a few (5) columns, and there I get about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engine has little parallelization when making bulk inserts: it can use only one CPU core per LOAD DATA statement. If you monitor CPU utilization during the load, you will likely see one core fully utilized; it can produce only so much output data, leaving disk throughput underutilized.
The most recent version of MySQL has a new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
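If you want to experiment with it, the utility runs from MySQL Shell rather than the classic client; a sketch, with schema and table names assumed and the thread count a guess:

// In MySQL Shell (JS mode), after connecting to the server:
util.importTable("/tmp/mydata.csv", {schema: "mydb", table: "mytable", threads: 8})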
I saw various checklists on the internet that recommended higher values for the following config params: innodb_log_buffer_size, innodb_log_file_size, innodb_write_io_threads, bulk_insert_buffer_size. But the benefits were not very pronounced when I ran comparison tests (maybe 10-20% faster than with just innodb_buffer_pool_size being large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written to disk in a delayed manner (cf. 'change buffering'). The change buffer, by default, occupies 25% of the buffer_pool.
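That share is visible and tunable through innodb_change_buffer_max_size, a percentage of the buffer pool (25 is the default):

SHOW VARIABLES LIKE 'innodb_change_buffer_max_size';
SET GLOBAL innodb_change_buffer_max_size = 25;   -- default; only worth raising for very index-heavy loads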
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.
Please provide SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
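For a concrete sketch, on a box with tens of GB of RAM, something like this stays under that guideline (the exact number matters less than avoiding multi-GB per-query allocations):

tmp_table_size      = 256M
max_heap_table_size = 256M   # keep the two equal; the smaller of them is the effective limit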
If copying the csv from hard disk to ram disk runs slowly, I would suspect the validity of 150 MB/s.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE ... INTO TABLE t;    -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;

Problem improving MySQL. (innodb_log_file_size)

We are trying to improve the efficiency of our database server.
One of the recommendations of MySQLTuner is raising innodb_log_file_size to 12 GB. From what we have seen, this change could significantly improve the speed and performance of our queries. The problem is that when we increase this parameter above 1 GB, the service won't start. We deleted the logs, stopped it cleanly, and it still can't start with this parameter above 1 GB.
Some info:
mysql Ver 14.14 Distrib 5.5.62, for debian-linux-gnu
innodb_buffer_pool_size = 100G
innodb_file_per_table = ON
innodb_buffer_pool_instances = 64
innodb_stats_on_metadata = OFF
innodb_log_file_size = 1G
innodb_log_buffer_size = 8M
There's also enough space in the partitions for keeping this log size.
Thanks!
In MySQL 5.5, you can't increase the InnoDB log file size over 4GB total. innodb_log_file_size can only be 4GB / innodb_log_files_in_group (which is 2 by default, and there's no benefit to changing that). So you can set the log file to a max of 2GB.
See https://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_log_file_size
The max combined size of the log files increased to 512GB in 5.6.3. Again, innodb_log_file_size should be the size of one log file, so if you use multiple log files, the total cannot exceed 512GB.
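Expressed as my.cnf settings, the 5.5 ceiling looks like this (this is the maximum, not a recommendation):

innodb_log_files_in_group = 2    # the default
innodb_log_file_size      = 2G   # 2 files x 2G = 4G combined, the 5.5 limit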
I agree with Rick James' answer that increasing the log file size is not a magical solution to make queries run faster. It won't do that.
It's sometimes useful to increase the innodb log file size, if the bottleneck is that you run out of log space faster than dirty pages can be flushed to the tablespace, because you have very high write traffic. That's affected by the rate of writes, not the speed of individual writes.
For most apps, two 2GB log files is more than enough. If it isn't, it's probably time to run multiple MySQL instances, and distribute your write traffic over them as evenly as you can.
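A quick way to sanity-check whether log size is even the bottleneck is to measure the redo write rate; a common rule of thumb is to size the combined logs to hold about an hour of writes:

SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- run it again 60 seconds later; (delta in bytes) * 60 estimates redo bytes per hour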
It's tricky to change the log_file_size in that old version. For the steps, see https://dba.stackexchange.com/questions/1261/how-to-safely-change-mysql-innodb-variable-innodb-log-file-size/4103#4103
However, I predict that changing the log_file_size will not help much. "Performance" usually implies a few slow queries. Do you have the slowlog turned on, with a low value of long_query_time? Find the few worst queries; we can probably improve performance by tackling them. Steps on that: http://mysql.rjweb.org/doc.php/mysql_analysis
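Enabling the slowlog is a few lines in my.cnf (the log path is just an example):

slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1   # seconds; low enough to catch merely sluggish queries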
If you change the innodb_log_file_size parameter, you need to remove the old log files. Otherwise, InnoDB won't start successfully, because the existing files do not match the size specified in the config file.
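On 5.5 the usual sequence is something like the following (paths assume a default Debian layout; move the files rather than deleting them so you can roll back):

service mysql stop                                              # clean shutdown, so the logs are fully flushed
mv /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1 /tmp/  # set the old redo logs aside
service mysql start                                             # InnoDB recreates the logs at the configured size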
On the other hand, innodb_buffer_pool_size should be set to about 70% of available RAM if you are running InnoDB only.

Neo4j batch insertion with .CSV files taking huge amount of time to sort&index

I'm trying to create a database with data collected from Google n-grams. It's actually a lot of data, but after the creation of the CSV files the insertion was pretty fast. The problem is that, immediately after the insertion, the neo4j-import tool indexes the data, and this step is taking too much time. It's been more than an hour and it looks like it has reached only 10% progress.
Nodes
[*>:9.85 MB/s---------------|PROPERTIES(2)====|NODE:198.36 MB--|LABE|v:22.63 MB/s-------------] 25M
Done in 4m 54s 828ms
Prepare node index
[*SORT:295.94 MB-------------------------------------------------------------------------------] 26M
This is the console info atm. Does anyone have a suggestion about what to do to speed up this process?
Thank you. (:
Indexing takes a long time depending on the number of nodes. I tried indexing with 10 million nodes and it took around 35 minutes, but you can still try these settings:
Increase your page cache size, which is set in the '/var/lib/neo4j/conf/neo4j.properties' file (on my Ubuntu system). Edit the following line:
dbms.pagecache.memory=4g
according to your RAM; here, 4g means 4 GB of space. Also, you can try changing the Java memory size, which is set in neo4j-wrapper.conf:
wrapper.java.initmemory=1024
wrapper.java.maxmemory=1024
You can also read neo4j documentation on this - http://neo4j.com/docs/stable/configuration-io-examples.html

MySQL Optimising Table Cache & tmp disk tables

I'm trying to optimise my MySQL database.
I've got around 90 tables most of which are hardly ever used.
Only 10 or so do the vast bulk of the work running my website.
MySQL status statistics show approximately 2M queries over 2.5 days and report Opened_tables of 1.7k (with Open_tables at 256). I have table_cache set at 256, increased from 32.
I presume most of the opened tables are either multiple instances of the same tables from different connections or some temporary tables.
In the same period it reports Created_tmp_tables of 19.1k and, more annoyingly, Created_tmp_disk_tables of 5.7k. I have max_heap_table_size and tmp_table_size both set at 128M.
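Those counters come from a status query; the sample output below just mirrors the figures above, approximately:

SHOW GLOBAL STATUS LIKE 'Created_tmp%';
-- Created_tmp_disk_tables   5700
-- Created_tmp_tables        19100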
I've tried to optimise my indexes & joins as best I can, and I've tried to avoid BLOB and TEXT fields in the tables to avoid disk usage.
Is there anything you can suggest to improve things?
First of all, don't conclude your MySQL database is performing poorly based on these internal statistics. There's nothing wrong with tmp tables. In fact, queries involving ordering or summaries require their creation.
It's like trying to repair your vehicle after analyzing the amount of time it spent in second gear. Substantially less than 1% of your queries are generating tmp tables. That is good. That number is low enough that these queries might be for backups or some kind of maintenance operation, rather than production.
If you are having performance problems, you will know that because certain queries are working too slowly, and certain pages on your web app are slow. Can you figure out which queries have problems? There's a slow query log that might help you.
http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
You might try increasing tmp_table_size if you have plenty of RAM. Why not take it up to a couple hundred megabytes and see if things get better? But they probably won't change noticeably.

Reducing write I/O in MySQL for non-critical data?

I have a MySQL table with about 2 million rows, and a script that updates approximately 100 rows per second. I'd like to reduce the amount of disk write I/O that's going on. For this particular table, ACID isn't important: if I were to lose some rows in a crash, the script would just resume at the proper place. Even if I lost the past hour's work, it wouldn't be that big of a deal.
The table was using InnoDB, but I switched over to MyISAM because I figured that if it wasn't logging every write, that could cut the I/O in half.
But even with MyISAM there is a lot of write I/O going on. The table + index takes up about 1300 MB on disk, but MySQL is writing about 1600 MB to disk every hour. I've calculated that if each row could be written to disk perfectly efficiently, that would be about 160 MB written per hour. So it's writing about 10x as much data as it needs to. I realize there are some inefficiencies, but I'm guessing that most of the writes happen because it's writing an entire page out to disk.
Is there any way to make it write less often, so it waits until more rows on each page have been updated so that it can be more efficient with the writes (even though there would be more data lost in event of a crash)?
If you are using InnoDB, you could set innodb_flush_log_at_trx_commit to 2, for example. This greatly improved the I/O during updates on our system.
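It's a dynamic variable, so a minimal way to try it (add the same line to my.cnf to make it persist across restarts):

SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- 2 = write the log at each commit but fsync only about once per second,
--     so at most ~1 second of transactions is at risk in an OS crash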
Here is some clarification on the setting:
http://www.mysqlperformanceblog.com/?s=innodb_flush_log_at_trx_commit