What is MySQL "Key Efficiency"?

MySQL Workbench reports a value called "Key Efficiency" in association with server health. What does this mean and what are its implications?
From MySQL.com, "Key Efficiency" is:
...an indication of the number of key_read_requests that resulted in actual key_reads.
OK, so what does that mean? What does it tell me about how I'm supposed to tune the server?

"Key Efficiency" is an indication of how much value you are getting from the index caches held within MySQL's memory. If your key efficiency is high, then most often MySQL is performing key lookups from within memory space, which is much faster than having to retrieve the relevant index blocks from disk.
The way to improve key efficiency is to dedicate more of your system memory to MySQL's index caches. How you do this depends on the storage engine you use. For MyISAM, increase the value of key_buffer_size. For InnoDB, increase the value of innodb_buffer_pool_size (note, though, that the Key_reads and Key_read_requests counters behind "Key Efficiency" track only the MyISAM key cache).
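As a rough illustration of the ratio being described, here is a minimal Python sketch. It assumes the commonly quoted definition, 1 - Key_reads / Key_read_requests; the exact formula a given tool uses may differ, and the counter values below are hypothetical stand-ins for what SHOW GLOBAL STATUS would report.

```python
def key_efficiency(key_read_requests, key_reads):
    """Fraction of key block requests served from the key cache.

    Assumes the definition 1 - Key_reads / Key_read_requests.
    """
    if key_read_requests == 0:
        # No requests yet, so there is nothing to measure.
        return 1.0
    return 1.0 - key_reads / key_read_requests


# Hypothetical counter values, as they might appear in SHOW GLOBAL STATUS.
status = {"Key_read_requests": 2_000_000, "Key_reads": 10_000}
efficiency = key_efficiency(status["Key_read_requests"], status["Key_reads"])
print(f"Key efficiency: {efficiency:.2%}")
```

With these made-up numbers, 10,000 of 2,000,000 requests missed the cache, so 99.5% were served from memory.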
However, as Michael Eakins points out, the operating system also holds caches of disk blocks which it has accessed recently. The more memory that your operating system has available, the more disk blocks it can cache. Further, the disk drives themselves (and disk controllers in some cases), also have caches - which again can speed up retrieving data from disk. The hierarchy is a bit like this:
1. fastest - retrieving index data from within MySQL's index cache. The cost is a few memory operations.
2. retrieving index data that is held in the OS file system cache. The cost is a system call (for the read), and some memory operations.
3. retrieving index data that is held in the disk system cache (controller and drives). The cost is a system call (for the read), communication with the disk device, and some memory operations.
4. slowest - retrieving index data from the disk surface. The cost is a system call, communication with the device, physical movement of the disk (arm movement + rotation).
In practice, the difference between 1 and 2 is almost unnoticeable unless your system is very busy. Also, it is unlikely (unless your system has less spare RAM than your disk controller) that scenario 3 will come into play.
I have used servers with MyISAM tables with relatively small index caches (512MB), but massive system memory (64GB) and have found it difficult to demonstrate the value of increasing the size of the index cache. I guess it depends on what else is happening on your server. If all you are running is a MySQL database, it is quite likely that the OS cache will be quite effective. However, if you run other jobs on the same server and these use lots of memory / disk accesses, then these might evict valuable cached index blocks, leading to MySQL hitting disk more often.
An interesting exercise (if you have time) is to tinker with your system to make it run slower. Running a standard workload on large tables, reduce the MySQL buffers until the impact becomes noticeable. Flush your file system cache by pumping huge amounts (greater than RAM) of irrelevant data through your file system (cat large-file > /dev/null). Watch iostat as your queries run.
"Key Efficiency" is NOT a measure of how good your keys are. Well designed keys will have a much larger impact on performance than high "Key Efficiency". MySQL does not have much to help you there, unfortunately.

Key_read_requests is the number of requests to read a key block from the cache, while Key_reads is the number of physical reads of a key block from disk. So these 2 variables can increase independently.
(http://bugs.mysql.com/bug.php?id=28384)
Which is still as clear as mud.
On to the next bit of explanation:
A partially valid use of Key_reads
There is a partially valid reason to examine Key_reads, assuming that we care about the number of physical reads that occur, because we know that disks are very slow relative to other parts of the computer. And here's where I return to what I called "mostly factual" above, because Key_reads actually aren't physical disk reads at all. If the requested block of data isn't in the operating system's cache, then a Key_read is a disk read -- but if it is cached, then it's just a system call. However, let's make our first hard-to-prove assumption:
Hard-to-prove assumption #1: A Key_read might correspond to a physical disk read, maybe.
If we take that assumption as true, then what other reason might we have for caring about Key_reads? This assumption leads to "a cache miss is significantly slower than a cache hit," which makes sense. If it were just as fast to do a Key_read as a Key_read_request, what use would the key buffer be anyway? Let's trust MyISAM's creators on this one, because they designed a cache hit to be faster than a miss.
(http://planet.mysql.com/entry/?id=23679)

Related

Improve performance of mysql LOAD DATA / mysqlimport?

I'm batch-loading a 15 GB CSV (30 million rows) into a MySQL 8 database.
Problem: the task takes about 20 minutes, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a RAM disk of 20 GB, which holds my CSV. Import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
My target table does not have any indexes, but it has approx. 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes? Or does it bypass? Or did I use the correct configuration file at all?
At least a select on the variables shows they are correctly loaded by the mysql server. For example show variables like 'innodb_doublewrite' shows OFF
Anyways, how could I improve import speed further? Or is my database the bottleneck and there is no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly if I import my csv from harddrive into the ramdisk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same amount of rows, but only with a few (5) columns. And there I'm getting to about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engines have little parallelization when making bulk inserts. A LOAD DATA statement can only use one CPU core. If you monitor CPU utilization during the load, you will probably see one core fully utilized; it can only produce so much output data, which leaves disk throughput underutilized.
The most recent version of MySQL has a new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
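If one LOAD DATA statement is limited to one core, one thing that could be tried is splitting the input so that several statements run at once on separate connections. The Python sketch below only does the splitting. It assumes one record per line (no quoted newlines) and no header row; the function name and chunk file names are hypothetical.

```python
import os
import tempfile


def split_csv(path, parts, out_dir):
    """Distribute the lines of a CSV round-robin over `parts` chunk files.

    Streams the input, so the whole file never has to fit in memory.
    Assumes one record per line and no header row.
    """
    out_paths = [os.path.join(out_dir, f"chunk_{i:02d}.csv") for i in range(parts)]
    outputs = [open(p, "w", newline="") for p in out_paths]
    try:
        with open(path, "r", newline="") as src:
            for line_no, line in enumerate(src):
                outputs[line_no % parts].write(line)
    finally:
        for f in outputs:
            f.close()
    return out_paths


# Small demonstration with a throwaway 10-line file.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "mydata.csv")
    with open(source, "w", newline="") as f:
        f.writelines(f"{i},value{i}\n" for i in range(10))
    chunks = split_csv(source, parts=3, out_dir=tmp)
    line_counts = []
    for chunk in chunks:
        with open(chunk, newline="") as f:
            line_counts.append(sum(1 for _ in f))
    print(line_counts)
```

Each chunk would then be loaded by its own LOAD DATA statement. Whether this actually beats a single load depends on the table and the server, so it should be measured rather than assumed.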
I saw various checklists on the internet that recommended having higher values in the following config params: log_buffer_size, log_file_size, write_io_threads, bulk_insert_buffer_size . But the benefits were not very pronounced when I performed comparison tests (maybe 10-20% faster than just innodb_buffer_pool_size being large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written in a delayed manner to disk (cf 'Change buffering'). The change_buffer, by default occupies 25% of the buffer_pool.
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.
Please provide the output of SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
If copying the CSV from hard disk to RAM disk runs slowly, I would doubt the 150 MB/s figure.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE ... INTO TABLE t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;

Is search speed achieved with fast data access or fast index access?

From MySQL doc:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] tbl_name
(create_definition,...)
{DATA|INDEX} DIRECTORY [=] 'absolute path to directory'
My table is for search only and takes 8G of disk space (4G data + 4G index) with 80M rows
I can't use ENGINE = Memory to store the whole table into memory but I can store either the data or the index in a RAM drive through the DIRECTORY table options
From a theoretical standpoint, is it better to store the data or the index in RAM?
MySQL's default storage engine is InnoDB. As you run queries against an InnoDB table, the portion of that table or indexes that it reads are copied into the InnoDB Buffer Pool in memory. This is done automatically. So if you query the same table later, chances are it's already in memory.
If you run queries against other tables, it loads those into memory too. If the buffer pool is full, it will evict some data that belongs to your first table. This is not a problem, since it was only a copy of what's on disk.
There's no way to specifically "lock" a table or an index in memory. InnoDB will load either data or index if it needs to. InnoDB is smart enough not to evict data you used a thousand times just because one other table was requested one time.
Over time, this tends to balance out, using memory for your most-frequently queried subset of each table and index.
So if you have system memory available, allocate more of it to your InnoDB Buffer Pool. The more memory the Buffer Pool has, the more able it is to store all the frequently-queried tables and indexes.
Up to the size of your data + indexes, of course. The content copied from the data + indexes is stored only once in memory. So if you have only 8G of data + indexes, there's no need to give the buffer pool more and more memory.
Don't allocate more system memory to the buffer pool than your server can afford. Overallocating memory leads to swapping memory for disk, and that will be bad for performance.
Don't bother with the {DATA|INDEX} DIRECTORY options. Those are for when you need to locate a table on another disk volume, because you're running out of space. It's not likely to help performance. Allocating more system memory to the buffer pool will accomplish that much more reliably.
but I can store either the data or the index in a RAM drive through the DIRECTORY table options...
Short answer: let the database and OS do it.
Using a RAM disk might have made sense 10-20 years ago, but these days the software manages caching disk to RAM for you. The disk itself has its own RAM cache, especially if it's a hybrid drive. The OS will cache file system access in RAM. And then MySQL itself will do its own caching.
And if it's an SSD that's already extremely fast, so a RAM cache is unlikely to show much improvement.
So making your own RAM disk isn't likely to do anything that isn't already happening. What you will do is pull resources away from the OS and MySQL that they could have managed smarter themselves, likely slowing everything on that machine down.
What you're describing is a micro-optimization. This is attempting to make individual operations faster. They tend to add complexity and degrade the system as a whole. And there are limits to how much optimizing you can do with micro-optimizations. For example, if you have to search 1,000,000 rows, and it takes 1ms per row, that's 1,000,000 ms. If you make it 0.9ms per row then it's 900,000 ms.
What you want to focus on is algorithmic optimization, improvements to the algorithm. These tend to make the code simpler and less complex, though often the data structures need to be more thought out, because you're doing less work. Take those same 1,000,000 rows and add an index. Instead of looking at 1,000,000 rows you'll spend, say, 100 ms to look at the index.
The numbers are made up, but I hope you get the point. If "what you want is speed", algorithmic optimizations will take you where no micro-optimization will.
There's also the performance of the code using the database to consider. It is often the real bottleneck: unoptimized queries, poor patterns for fetching related data, and not taking advantage of caching.
Micro-optimizations, with their complexities and special configurations, tend to make algorithmic optimizations more difficult. So you might be slowing yourself down in the long run by worrying about micro-optimizations now. Furthermore, you're doing this at the very start when you only have fuzzy ideas about how this thing will be used or perform or where the bottlenecks will be.
Spend your time optimizing your data structures and indexes, not minute details of your database storage. Once you've done that, if it still isn't fast enough, then look at tweaking settings.
As a side note, there is one possible benefit to playing with DIRECTORY. You can put the data and index on separate physical drives. Then both can be accessed simultaneously with the full I/O throughput of each drive.
Though you've just made it twice as likely to have a disk failure, and complicated backups. You're probably better off with an SSD and/or RAID.
And consider whether a cloud database might actually out-perform any hardware you might be able to afford.

Will this result in two full table scans?

SELECT P_CODE, P_PRICE
FROM PRODUCT
WHERE P_PRICE >= (SELECT AVG(P_PRICE) FROM PRODUCT);
Will this query (under MySQL) result in two full table scans (from disk), or will the optimizer understand that it's faster to (if there is enough RAM to hold the result set) only do one full table scan? The table has no indexes.
Is it possible to read (somehow) this information from output of the EXPLAIN command in mysql?
The question is flawed based on a misunderstanding of what a table scan actually is:
A table scan iterates over all rows in the table (irrespective of how it obtains those rows).
It also differs slightly from an index scan in that it works with the "full row". Whereas an index scan has less overall data to process, because it works with a subset of columns.
But the question is actually asking about difference between physical and logical IO.
(from disk) or will the optimizer understand that it's faster too (if there is enough RAM to hold the result set)
Yes the query will do 2 table scans. That cannot be avoided:
the server has to process the full set of prices twice.
and it has to finish processing for AVG(P_PRICE) before it can start processing for the WHERE filter.
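The two points above can be sketched in Python. This is only an illustration of the dependency between the two passes, not how MySQL executes the query; the rows are hypothetical stand-ins for the PRODUCT table.

```python
# Hypothetical rows standing in for the PRODUCT table: (P_CODE, P_PRICE).
product = [("A1", 10.0), ("B2", 20.0), ("C3", 30.0), ("D4", 60.0)]

# Pass 1: the subquery. Every price must be seen before the average is known.
total = 0.0
count = 0
for _code, price in product:
    total += price
    count += 1
average = total / count

# Pass 2: the outer query. Each row is compared against the finished average,
# so this loop cannot start until pass 1 has completed.
result = [(code, price) for code, price in product if price >= average]
print(average, result)
```

No row can be accepted or rejected during the first loop, because the average is not final until the last row has been read.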
However, a "logical" table scan does not necessarily require reading the data from disk twice. If all the data is in memory, the server can perform the table scan in memory. So although the second stage of processing must still perform a table scan, it can be more efficient by avoiding secondary disk access.
Take a look at this question to see how to distinguish logical and physical IO on mysql:
For a MySQL query, how do you determine physical and logical I/O?
I'll add that in theory a server could choose to keep only the Price column in memory on the first pass. In which case it wouldn't need to perform a "full table scan" on the second pass.
However this is unlikely in practice as there's a benefit to keep all the data in memory for other future queries ... whatever columns they may wish to process.
Re your comment:
my assumption, when looking at the query, is that an optimizer should/would be able to determine that "this query reads the same data twice, after the first read i will put it into memory (if there is space) and use the in-memory data for the next part of the query, instead of asking the disk for it twice"
Well, at least in MySQL's InnoDB engine, something sort of like this happens. InnoDB can't really read pages directly from disk. It loads every requested page into RAM before doing data operations on it. The RAM is a preallocated area called the InnoDB buffer pool. This stores byte-for-byte copies of the pages from the on-disk tablespace, plus some metadata about them.
After reading a page, the buffer pool has no immediate need to evict it from RAM, unless other pages are requested and there's no space left in the buffer pool for them. So subsequent requests for the same pages may find the pages already residing in RAM. The more this happens, the better your performance overall.
You might have more data pages in your product table than can fit in your buffer pool. During a table-scan, InnoDB will evict pages as needed to load the remaining set of pages for the table. If you have a table that is many times larger than your buffer pool, you can imagine that this results in quite a bit of "churn" as pages come in and out. If you can afford it, allocating more RAM to the buffer pool is a good way to improve performance.
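The "churn" described above can be seen in a toy simulation. The Python sketch below uses a plain least-recently-used cache, which is a simplification: InnoDB's real replacement policy is a modified LRU, so actual hit rates will differ. The page counts are hypothetical.

```python
from collections import OrderedDict


def scan_twice(table_pages, pool_pages):
    """Scan pages 0..table_pages-1 twice through a plain LRU cache.

    Returns (hits, misses) counted during the second scan only.
    """
    pool = OrderedDict()
    hits = misses = 0
    for scan in range(2):
        for page in range(table_pages):
            if page in pool:
                pool.move_to_end(page)  # mark as most recently used
                if scan == 1:
                    hits += 1
            else:
                if scan == 1:
                    misses += 1
                pool[page] = True
                if len(pool) > pool_pages:
                    pool.popitem(last=False)  # evict least recently used
    return hits, misses


fits = scan_twice(table_pages=100, pool_pages=128)
churns = scan_twice(table_pages=100, pool_pages=50)
print("table fits in pool:", fits)
print("table larger than pool:", churns)
```

When the table fits, the second scan is served entirely from the cache. When it does not, a plain LRU has already evicted each page by the time the second scan asks for it, so every request misses.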
All these facts about the buffer pool don't change the fact that your query will perform two table-scans. It is true that it will be faster to read the pages from the buffer pool than reading pages from disk. You can experiment:
Shut down your MySQL Server and start it back up again. The buffer pool should be empty at this point (unless you are using the feature to save the buffer pool on shutdown).
Run your query. It might take many seconds, because each page requested has to be read from disk before it can be used.
Run the same query again. It's faster! I've seen cases where this difference makes the performance about 4x faster in tests. I understand that RAM is typically thousands of times faster than disk, but I/O speed is not the only code running. Also it depends on what other requests are occupying the disk bandwidth, and other factors.
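Why a "thousands of times faster" medium yields only about a 4x gain can be shown with a little arithmetic. The Python sketch below assumes a hypothetical split in which 75% of the cold run is spent waiting on disk and RAM reads are 1000x faster; both figures are made up for illustration.

```python
def overall_speedup(io_fraction, io_speedup):
    """Whole-query speedup when only the I/O share of the time gets faster."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)


# Hypothetical split: 75% of the cold run is disk wait,
# and reading from RAM is taken to be 1000x faster than reading from disk.
speedup = overall_speedup(io_fraction=0.75, io_speedup=1000)
print(f"{speedup:.2f}x")
```

The remaining 25% of the time (parsing, row processing, sending results) is untouched by the faster reads, so it caps the total gain at 4x no matter how fast the I/O becomes.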
The difference between disk speed and RAM speed is (more or less) an arithmetic factor. No matter how large your dataset, the speed difference gives the same advantage.
Indexes are much more important, because they turn a linear search, O(n), into a B-tree search, O(log n). As your dataset gets larger, the advantage of this becomes more dramatic. This is why there is so much emphasis on analyzing complexity of algorithms in computer science.
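To make the growth rates concrete, the Python sketch below counts the steps taken by a linear scan and by a binary search over a sorted list. The binary search is a stand-in for an index lookup, not an actual B-tree: a real B-tree has a much higher fanout and so needs even fewer steps.

```python
def linear_search(rows, target):
    """Count the rows examined by a scan from the start of the list."""
    steps = 0
    for value in rows:
        steps += 1
        if value == target:
            break
    return steps


def binary_search(rows, target):
    """Count the probes made by a binary search over a sorted list."""
    steps = 0
    lo, hi = 0, len(rows) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if rows[mid] == target:
            break
        if rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


results = {}
for n in (1_000, 1_000_000):
    rows = list(range(n))
    target = n - 1  # worst case for the linear scan
    results[n] = (linear_search(rows, target), binary_search(rows, target))
    print(n, results[n])
```

Growing the data a thousandfold multiplies the linear scan's work by a thousand, while the binary search needs only about ten more probes.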
Please explain how you could do this with only one table scan. It is not obvious.
The use of the AVG() function would typically result in two full scans. If you have an index, then one or both scans might use the index.

Keeping data plus index-data in memory - InnoDB vs. MyISAM

Assume a database consisting of 1 GB of data and 1 GB of index data.
To minimize disk IO and hence maximize performance I want to allocate memory to MySQL so that the entire dataset including indexes can be kept in RAM (assume that the machine has RAM in abundance).
The InnoDB parameter innodb_buffer_pool_size is used to specify the size of the memory buffer InnoDB uses to cache data and indexes of its tables. (Note: The memory is used for data AND indexes.)
The MyISAM parameter key_buffer_size is used to specify the size of the memory buffer MyISAM uses to cache indexes of its tables. (Note: The memory is used ONLY for indexes.)
If I want the 2 GB database (1 GB data and 1 GB index) to fit into memory under InnoDB, I'd simply configure the innodb_buffer_pool_size to be 2GB. The two gigabytes will hold both the data and the index.
However, when setting the MyISAM key_buffer_size to 2GB, that space will be used for the index, but not for the data.
My questions are:
Can MyISAM's "data buffer size" (not index data) be configured explicitly?
When will MyISAM read table data (excluding index data) from disk and when will it read from memory?
No, MyISAM has no general-purpose data cache. This is documented in the "key_buffer_size" description from the official documentation: "This is because MySQL relies on the operating system to perform file system caching for data reads, so you must leave some room for the file system cache."
Modern OSes, especially Linux, tend to have very smart virtual memory subsystems that will keep frequently accessed files in the page cache, so disk I/O is kept at a bare minimum when the working set fits in available memory.
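So to answer your second question: MyISAM never reads table data from a cache of its own. Every data read goes through the operating system, which serves it from the file system cache when the block is there and from disk otherwise.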
It's important not to fall into "buffer oversizing" too for the various myisam variables such as read_buffer_size, read_rnd_buffer_size, sort_buffer_size, join_buffer_size, etc as some are dynamically allocated, so bigger doesn't always mean faster - and sometimes it can even be slower - see this post on mysqlperformanceblog for a very interesting case.
If you're on 5.1 on a posix platform, you might want to benchmark myisam_use_mmap on your workload, it's supposed to help high contention cases by reducing the quantity of malloc() calls.

Alternatives to the MEMORY storage engine for MySQL

I'm currently running some intensive SELECT queries against a MyISAM table. The table is around 100 MiB (800,000 rows) and it never changes.
I need to increase the performance of my script, so I was thinking on moving the table from MyISAM to the MEMORY storage engine, so I could load it completely into the memory.
Besides the MEMORY storage engine, what are my options to load a 100 MiB table into the memory?
A table with 800k rows shouldn't be any problem for MySQL, no matter what storage engine you are using. With a size of 100 MB, the full table (data and keys) should live in memory (MySQL key cache, OS file cache, or probably in both).
First you check the indices. In most cases, optimizing the indices gives you the best performance boost. Never do anything else, unless you are pretty sure they are in shape. Invoke the queries using EXPLAIN and watch for cases where no or the wrong index is used. This should be done with real world data and not on a server with test data.
After you have optimized your indices, the queries should finish in a fraction of a second. If the queries are still too slow, then just try to avoid running them by using a cache in your application (memcached, etc.). Given that the data in the table never changes, there shouldn't be any problems with old cache data etc.
Assuming the data rarely changes, you could potentially boost the performance of queries significantly using the MySQL query cache (available up to MySQL 5.7; it was removed in 8.0).
If your table is queried a lot it's probably already cached at the operating system level, depending on how much memory is in your server.
MyISAM also allows for preloading table indices into memory through the MyISAM key cache. After you've created a key cache, you can assign an index to it with CACHE INDEX and preload it with LOAD INDEX INTO CACHE.
I assume that you've analyzed your table and queries and optimized your indices after the actual queries? Otherwise that's really something you should do before attempting to store the entire table in memory.
If you have enough memory allocated for MySQL's use (in the InnoDB buffer pool, or for use by MyISAM), you can read the database into memory (just a 'SELECT * from tablename') and if there's no reason to remove it, it stays there.
You also get different key behavior, as a MEMORY table uses hash-based keys by default rather than B-tree access, which for smaller, non-unique keys might be fast enough, or not so much with such a large table.
As usual, the best thing to do is to benchmark it.
Another idea is, if you are using v5.1, to use an ARCHIVE table type, which can be compressed, and may also speed access to the contents, if they are easily compressible. This swaps the CPU time to de-compress for IO/memory access.
If the data never changes you could easily duplicate the table over several database servers.
This way you could offload some queries to a different server, gaining some extra breathing room for the main server.
The speed improvement depends on the current database load, there will be no improvement if your database load is very low.
PS:
You are aware that MEMORY tables forget their contents when the database restarts?