I have a MySQL table with about 2 million rows, and a script that updates approximately 100 rows per second. I'd like to reduce the amount of disk write I/O that's going on. For this particular table, ACID isn't important: if I were to lose some rows in a crash, the script would just resume at the proper place. Even if I lost the past hour's work, it wouldn't be that big of a deal.
The table was using InnoDB, but I switched over to MyISAM because I figured that if it wasn't logging every write, that could cut the I/O in half.
But even with MyISAM there is a lot of write I/O going on. The table plus index takes up about 1300 MB on disk, but MySQL is writing about 1600 MB to disk every hour. I've calculated that if each row could be written to disk perfectly efficiently, that would be about 160 MB written per hour. So it's writing about 10x as much data as it needs to. I realize there are some inefficiencies, but I'm guessing that most of the writes happen because it's writing an entire page out to disk at a time.
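For reference, the arithmetic behind that 160 MB estimate (the roughly 450 bytes per row is an assumption implied by the figure, not something stated above):

$$100\ \text{rows/s} \times 3600\ \text{s/h} \times {\sim}450\ \text{B/row} \approx 160\ \text{MB/h}$$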
Is there any way to make it write less often, so it waits until more rows on each page have been updated so that it can be more efficient with the writes (even though there would be more data lost in event of a crash)?
If you are using InnoDB, you could set innodb_flush_log_at_trx_commit to 2, for example. This greatly improved the I/O during updates on our system.
Here's some clarification on the setting:
http://www.mysqlperformanceblog.com/?s=innodb_flush_log_at_trx_commit
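A minimal example (the variable is dynamic, so no restart is needed). With a value of 2, the log is written at each commit but only flushed to disk about once per second, so an OS crash or power failure can lose up to roughly a second of transactions -- which matches your tolerance for losing recent rows:

SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- or make it permanent in my.cnf under [mysqld]:
-- innodb_flush_log_at_trx_commit = 2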
I'm bulk-loading a 15 GB CSV (30 million rows) into a MySQL 8 database.
Problem: the task takes about 20 minutes, at an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a 20 GB RAM disk, which holds my CSV. I import it as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
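For reference, a sketch of roughly the statement that command turns into (the table name comes from the command above; the field and line terminators are assumptions, since mysqlimport's defaults depend on its --fields-* and --lines-* options):

LOAD DATA LOCAL INFILE '/tmp/mydata.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';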
My target table does not have any indexes, but it has approximately 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters in /etc/mysql/my.cnf as follows, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes, or does it bypass them? And did I use the correct configuration file at all?
At least a SELECT on the variables shows that they are correctly loaded by the MySQL server. For example, SHOW VARIABLES LIKE 'innodb_doublewrite' shows OFF.
Anyway, how could I improve import speed further? Or is my database the bottleneck, with no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly, if I import my CSV from the hard drive instead of from the RAM disk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same number of rows but with only a few (5) columns, and there I get about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engine has little parallelization when making bulk inserts: it can only use one CPU core per LOAD DATA statement. You can probably monitor CPU utilization during the load to see that one core is fully utilized and can only produce so much output data, thus leaving disk throughput underutilized.
The most recent version of MySQL has a new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising, but it probably hasn't received much feedback yet. I'm not sure it would help in your case.
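If that utility is not an option, here is a hedged sketch of manual parallelism: split the CSV beforehand (for example with the Unix split utility) and run one LOAD DATA per client session, since each statement can then use its own core. The part-file names are assumptions:

-- session 1:
LOAD DATA LOCAL INFILE '/tmp/mydata_part_aa' INTO TABLE mytable;

-- session 2, running concurrently in a separate connection:
LOAD DATA LOCAL INFILE '/tmp/mydata_part_ab' INTO TABLE mytable;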
I saw various checklists on the internet recommending higher values for the following config parameters: log_buffer_size, log_file_size, write_io_threads, bulk_insert_buffer_size. But the benefits were not very pronounced when I ran comparison tests (maybe 10-20% faster than just having innodb_buffer_pool_size large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written in a delayed manner to disk (cf 'Change buffering'). The change buffer, by default, occupies 25% of the buffer_pool.
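One way to watch that delayed flushing while the load runs (this is a standard InnoDB status counter, not specific to this workload): the dirty-page count climbs during the LOAD DATA and drains back down afterwards.

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';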
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time; the file could have been read from disk during the LOAD DATA, since that I/O can be overlapped.
Please run SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times! Keep those under 1% of RAM.
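A hedged illustration of saner values (the 64 MB figure is an assumption for a typical box, not a tuned recommendation; both variables are dynamic, so they can be set at runtime):

SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;  -- 64 MB
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;  -- 64 MB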
If copying the CSV from the hard disk to the RAM disk runs slowly, I would suspect the validity of the 150 MB/s figure.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE '...' INTO TABLE t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;
There are 10 InnoDB partitioned tables. MySQL is configured with the option innodb-file-per-table=1 (one InnoDB file per table/partition, for certain reasons). The tables are about 40GB each. They contain statistics data.
During normal operation, the system can handle the load: the accumulated data is processed every N minutes. However, if for some reason there is no processing for more than 30 minutes (e.g., system maintenance; this is rare, but changes have to be made about once a year), lock timeouts begin.
I won't go into how we arrived at such an architecture, but it is the best solution; the road to it was long.
Each time, making changes requires more and more time. Today, for example, a simple ALTER TABLE took 2 hours 45 minutes. This is unacceptable.
So, as I said, processing the accumulated data requires a lot of resources, and SELECT statements are beginning to return lock timeout errors. Of course, the tables being processed are not the ones involved in those queries; the queries work against the results produced from them. The total size of these 10 tables is about 400GB, plus a few dozen small tables whose total size is comparable to (and maybe smaller than) the size of one big table. There are no problems with the small tables.
My question is: how can I solve the lock timeout errors? The server is not bad: an 8-core Xeon with 64 GB of RAM, and this is only the database server; of course, the rest of the system is not on the same machine.
There is only one reason I get these errors: the process of transforming data from the big tables into the small ones.
Any ideas?
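As a starting point (a generic diagnostic sketch, not specific to this schema), the open transactions and their lock waits can be inspected while the timeouts are happening:

SELECT trx_id, trx_state, trx_started, trx_query
    FROM information_schema.INNODB_TRX;

SHOW ENGINE INNODB STATUS;  -- the TRANSACTIONS section lists current lock waits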
The Situation:
I use a (PHP) cronjob to keep my database up to date. The affected table contains about 40,000 records. Basically, the cronjob deletes all entries and then re-inserts them (with different values, of course). I have to do it this way because they really ALL change, since they are all interrelated. A minimal sketch of the pattern follows below.
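The sketch (the prices table and its columns are hypothetical, just to make the shape concrete, and it assumes an InnoDB table, since MyISAM ignores transactions). DELETE rather than TRUNCATE is used because TRUNCATE is DDL and would commit implicitly, whereas here the wipe and the re-insert commit as one unit:

START TRANSACTION;
DELETE FROM prices;                     -- remove all ~40k rows
INSERT INTO prices (id, price) VALUES   -- re-insert in large multi-row batches
    (1, 9.99),
    (2, 4.50),
    (3, 12.00);
COMMIT;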
The Problem:
Actually, everything works fine. The cronjob does its job within 1.5 to 2 seconds (again, for about 40k inserts; I think this is adequate). Mostly. But sometimes, the query takes up to 60, 90, or even 120 seconds!
I indexed my database, and I think the query performs well, given that it only needs 2 seconds most of the time. I close the connection via mysql_close();
Do you have any ideas? If you need more information, please tell me.
Thanks in advance.
Edit: Well, it seems there was no problem with the inserts after all; it was a complex SELECT query that was causing the trouble. Still, thanks to everyone who answered!
From what I read, I can conclude that your cronjob is using bulk-insert statements. If you know when the cronjob runs, I suggest you start a Database Engine Tuning Advisor session and see what other processes are running while the cronjob does its thing. A bulk insert has some restrictions on the number of fields and the number of rows at once. You could read the subsections of this MSDN article: http://technet.microsoft.com/en-us/library/ms188365.aspx
Performance Considerations

If the number of pages to be flushed in a single batch exceeds an internal threshold, a full scan of the buffer pool might occur to identify which pages to flush when the batch commits. This full scan can hurt bulk-import performance. A likely case of exceeding the internal threshold occurs when a large buffer pool is combined with a slow I/O subsystem. To avoid buffer overflows on large machines, either do not use the TABLOCK hint (which will remove the bulk optimizations) or use a smaller batch size (which preserves the bulk optimizations). Because computers vary, we recommend that you test various batch sizes with your data load to find out what works best for you.
How is it possible that adding an index to a column slowed down the execution time?
I'm trying to get the query out of the slow-query log.
My slow-query settings:
slow_query_log = 1
long_query_time = 1 # seconds
log_queries_not_using_indexes = 1
slow_query_log_file = /var/log/mysql-slow.log
Indexes do not always speed up execution. The effect of an index depends primarily on the "selectivity" of the query: how many rows are processed by the overall query.
In general, reading a database (a "full table scan") is an efficient operation. The database engine knows what pages it needs to read and can read ahead to get them. Such I/O often occurs in the background, while processing the pages is in the foreground. When the next page is needed though, there is a good chance it is already in the page cache.
The performance issue with full table scans is that tables are big. So even efficient reads take time. When you are looking for one row in a million ("needle-in-the-haystack" queries), the reads are a waste of time. This is where indexes fix things.
However, say you have 100 records per page and you are reading more than 1% of the records. On average, every page will need to be read -- whether you are using an index or a full-table scan. The problem is that index reads are less efficient than scan reads. A read-ahead mechanism doesn't help them, because the reads are random.
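A back-of-the-envelope check on that claim: if a fraction $s$ of the rows is selected and each page holds $r$ rows, the expected fraction of pages touched is

$$1 - (1 - s)^r$$

With $r = 100$, $s = 0.01$ already gives $1 - 0.99^{100} \approx 0.63$, and $s = 0.03$ gives about $0.95$, so "more than 1%" quickly means nearly every page.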
This problem can be further exacerbated through something called thrashing. If the table does not fit into memory, then each random read is likely to be a "cache miss", incurring the overhead of a read from disk. The full table scan would just read the data, and with a decent look-ahead system, there would be no cache misses.
In your example, you could increase the selectivity of the index by including both banner and event in the index (these are compared using equality) and one of the other fields.
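For instance, a hedged sketch (the table and third column names are assumptions, since the original query isn't shown): the equality-compared columns go first, followed by one range column.

ALTER TABLE clicks
    ADD INDEX idx_banner_event_time (banner, event, created_at);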
Depending on the structure of the data on disk, it might be faster to just load the entire db/column and sort/filter it in RAM (which is likely what happens when no index exists) than to traverse a sparse index on disk. I don't know whether this applies to your specific context, or whether you have another issue here, though.
We have an InnoDB table with 12,000,000+ records.
I use two ways to SELECT * from this table using JDBC.
Statement stmt = conn.createStatement(
        java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);  // Connector/J's signal to stream rows one at a time
This lets the driver stream the result set row by row, and it takes 7s to finish scanning.
Statement stmt = conn.createStatement();  // default statement: no streaming
With this, the result set is completely retrieved and stored in memory, and it takes 21s!
I'm just confused: why is fetching the result set row by row faster than retrieving the complete result set into client memory? Shouldn't the row-by-row approach take more time on network transfer?
Just to expand on my comment on the OP:
This is most likely a memory issue: reading 12M rows into memory can cause paging unless the client has a lot of RAM. As soon as you start thrashing the disk, performance will drop considerably. It's worth noting that if you do start increasing RAM, the JVM has some quirks in how it addresses heaps larger than about 32 GB (it switches to 64-bit pointers), which means that as you transition past 32 GB you actually lose some available memory, and you may hit other issues depending on how your code is written.
To put things into perspective, we're using Elasticsearch at the moment to index ~60 million documents. Admittedly, its memory usage is more involved, since it's handling indices, caches, etc., but we wouldn't consider giving it less than 16 GB of RAM to get performant responses. I've met people using >100 GB per shard for really big record sets.