Will this result in two full table scans? - mysql

SELECT P_CODE, P_PRICE
FROM PRODUCT
WHERE P_PRICE >= (SELECT AVG(P_PRICE) FROM PRODUCT);
Will this query (under MySQL) result in two full table scans (from disk), or will the optimizer understand that it's faster (if there is enough RAM to hold the result set) to do only one full table scan? The table has no indexes.
Is it possible to read (somehow) this information from output of the EXPLAIN command in mysql?

The question rests on a misunderstanding of what a table scan actually is:
A table scan iterates over all rows in the table (irrespective of how it obtains those rows).
It also differs slightly from an index scan in that it works with the "full row", whereas an index scan has less overall data to process because it works with a subset of columns.
But the question is actually asking about the difference between physical and logical IO:
(from disk) or will the optimizer understand that it's faster (if there is enough RAM to hold the result set)
Yes, the query will do two table scans (the EXPLAIN sketch below confirms it). That cannot be avoided:
the server has to process the full set of prices twice.
and it has to finish computing AVG(P_PRICE) before it can start processing the WHERE filter.
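To the EXPLAIN part of the question: the two scans are visible in the plan. On a table with no indexes, the output looks roughly like this (row counts will vary):
EXPLAIN SELECT P_CODE, P_PRICE
FROM PRODUCT
WHERE P_PRICE >= (SELECT AVG(P_PRICE) FROM PRODUCT);
-- id | select_type | table   | type | key  | rows
--  1 | PRIMARY     | PRODUCT | ALL  | NULL | ...
--  2 | SUBQUERY    | PRODUCT | ALL  | NULL | ...
type: ALL on both rows is MySQL's notation for a full table scan.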
However, a "logical" table scan does not necessarily require reading the data from disk twice. If all the data is in memory, the server can perform the table scan in memory. So although the second stage of processing must still perform a table scan, it can be more efficient by avoiding secondary disk access.
Take a look at this question to see how to distinguish logical and physical IO on MySQL:
For a MySQL query, how do you determine physical and logical I/O?
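As a minimal sketch, InnoDB's status counters let you separate the two:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_read_requests: logical reads (pages requested from the buffer pool)
-- Innodb_buffer_pool_reads: physical reads (requests that had to go to disk)
Comparing the counters before and after running the query shows how much of the work was served from RAM.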
I'll add that in theory a server could choose to keep only the price column in memory on the first pass, in which case it wouldn't need to perform a "full table scan" on the second pass.
However, this is unlikely in practice, as there's a benefit to keeping all the data in memory for future queries, whatever columns they may wish to process.

Re your comment:
my assumption, when looking at the query, is that an optimizer should/would be able to determine that "this query reads the same data twice; after the first read I will put it into memory (if there is space) and use the in-memory data for the next part of the query, instead of asking the disk for it twice"
Well, at least in MySQL's InnoDB engine, something sort of like this happens. InnoDB can't really operate on pages directly from disk; it loads every requested page into RAM before doing data operations on it. The RAM is a preallocated area called the InnoDB buffer pool. This stores byte-for-byte copies of the pages from the on-disk tablespace, plus some metadata about them.
After reading a page, the buffer pool has no immediate need to evict it from RAM, unless other pages are requested and there's no space left in the buffer pool for them. So subsequent requests for the same pages may find the pages already residing in RAM. The more this happens, the better your performance overall.
You might have more data pages in your product table than can fit in your buffer pool. During a table scan, InnoDB will evict pages as needed to load the remaining set of pages for the table. If you have a table that is many times larger than your buffer pool, you can imagine that this results in quite a bit of "churn" as pages come in and out. If you can afford it, allocating more RAM to the buffer pool is a good way to improve performance.
None of this changes the fact that your query will perform two table scans. It is true that it will be faster to read the pages from the buffer pool than from disk. You can experiment:
Shut down your MySQL server and start it back up again. The buffer pool should be empty at this point (unless you are using the feature that saves and restores the buffer pool across restarts).
Run your query. It might take many seconds, because each page requested has to be read from disk before it can be used.
Run the same query again. It's faster! I've seen cases where this difference makes the performance about 4x faster in tests. I understand that RAM is typically thousands of times faster than disk, but I/O is not the only cost; the query still has to execute. It also depends on what other requests are occupying the disk bandwidth, and other factors.
The difference between disk speed and RAM speed is (more or less) a constant factor: no matter how large your dataset, the speed difference gives the same relative advantage.
Indexes are much more important, because they turn a linear search, O(n), into a B-tree search, O(log n): for a million rows, that's roughly 20 comparisons instead of a million. As your dataset gets larger, the advantage becomes more dramatic. This is why there is so much emphasis on analyzing the complexity of algorithms in computer science.

Please explain how you could do this with only one table scan. It is not obvious.
The use of the AVG() function would typically result in two full scans. If you have an index, then one or both scans might use the index.
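For example (the index name here is made up), a single index on the price column could serve both passes from the much smaller index instead of the full rows:
ALTER TABLE PRODUCT ADD INDEX idx_p_price (P_PRICE);
-- The subquery can compute AVG(P_PRICE) with an index scan ("Using index"),
-- and the outer WHERE can become a range scan on idx_p_price.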

Related

MySQL Table with ~20Mil rows - Queries getting slow

I have a table in my MySQL (5.7.32) database which currently has 20 million rows. I have a few fairly complex queries written on that table, where I carry out full-text searches and join to other tables. The queries on the table are getting slow despite using appropriate indexes.
I understand that 20 million rows is not a lot for a DB table to handle, and would like to understand what factors (other than indexes) I should consider for performance improvements; for example, any DB defaults that I should consider changing that impact performance.
NOTE: Since the table has FTS indexes, partitioning is not an option.
There are a lot of factors that could hurt performance:
Buffer pool not large enough to hold the index. So as a query searches the index, it has to keep swapping parts of the index into RAM and back out. You may need to increase the innodb_buffer_pool_size.
I'd monitor the ratio of the two numbers reported by SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%' (read_requests vs. reads); see the sketch after this list.
CPU is too slow. Each query is single-threaded, so CPU speed is more important than number of cores.
Concurrent load. If you have many queries running at the same time, they compete with each other for CPU, buffer pool, and I/O. Check SHOW PROCESSLIST or SHOW GLOBAL STATUS LIKE 'Threads_running'.
Server is overloaded, either by MySQL or by other apps or processes. Use top to find out if the system load average is high (I would consider anything over 10 to be too high), or if the system is using swap space instead of RAM.
Is the query using indexes like you expect? Did you analyze them with EXPLAIN?
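Two quick checks, as a sketch (the table and column names in the EXPLAIN are hypothetical, and the FULLTEXT query assumes a FULLTEXT index on the column):
-- Buffer pool size in GB; on MySQL 5.7.5+ it can be resized online
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;
SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8GB, assuming the host has RAM to spare; the server rounds to chunk multiples

-- Verify the plan of a slow full-text query
EXPLAIN SELECT d.id, d.title
FROM documents d
WHERE MATCH(d.body) AGAINST ('some term');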

Is search speed achieved with fast data access or fast index access?

From the MySQL docs:
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] tbl_name
(create_definition,...)
{DATA|INDEX} DIRECTORY [=] 'absolute path to directory'
My table is for search only and takes 8G of disk space (4G data + 4G index) with 80M rows
I can't use ENGINE = Memory to store the whole table in memory, but I can store either the data or the index on a RAM drive through the DIRECTORY table options.
From a theoretical standpoint, is it better to store the data or the index in RAM?
MySQL's default storage engine is InnoDB. As you run queries against an InnoDB table, the portion of that table or indexes that it reads are copied into the InnoDB Buffer Pool in memory. This is done automatically. So if you query the same table later, chances are it's already in memory.
If you run queries against other tables, it loads those into memory too. If the buffer pool is full, it will evict some data that belongs to your first table. This is not a problem, since it was only a copy of what's on disk.
There's no way to specifically "lock" a table or an index in memory. InnoDB will load either data or index as it needs to. InnoDB is smart enough not to evict data you used a thousand times just because another table was requested one time.
Over time, this tends to balance out, using memory for your most-frequently queried subset of each table and index.
So if you have system memory available, allocate more of it to your InnoDB Buffer Pool. The more memory the Buffer Pool has, the more able it is to store all the frequently-queried tables and indexes.
Up to the size of your data + indexes, of course. The content copied from the data + indexes is stored only once in memory. So if you have only 8G of data + indexes, there's no need to give the buffer pool more and more memory.
Don't allocate more system memory to the buffer pool than your server can afford. Overallocating memory leads to swapping memory for disk, and that will be bad for performance.
Don't bother with the {DATA|INDEX} DIRECTORY options. Those are for when you need to locate a table on another disk volume, because you're running out of space. It's not likely to help performance. Allocating more system memory to the buffer pool will accomplish that much more reliably.
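For the 8G of data + indexes in the question, a my.cnf sketch might look like this (assuming the machine has comfortably more RAM than this left over for the OS and everything else):
[mysqld]
innodb_buffer_pool_size = 10G  # a little above data + indexes, so the whole table can stay resident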
but I can store either the data or the index in a RAM drive through the DIRECTORY table options...
Short answer: let the database and OS do it.
Using a RAM disk might have made sense 10-20 years ago, but these days the software manages caching disk to RAM for you. The disk itself has its own RAM cache, especially if it's a hybrid drive. The OS will cache file system access in RAM. And then MySQL itself will do its own caching.
And if it's an SSD, it's already extremely fast, so a RAM cache is unlikely to show much improvement.
So making your own RAM disk isn't likely to do anything that isn't already happening. What it will do is pull resources away from the OS and MySQL, resources they could have managed more intelligently themselves, likely slowing everything on that machine down.
What you're describing is a micro-optimization: an attempt to make individual operations faster. Micro-optimizations tend to add complexity and degrade the system as a whole, and there are limits to how much they can buy you. For example, if you have to search 1,000,000 rows, and it takes 1ms per row, that's 1,000,000 ms. If you make it 0.9ms per row, then it's 900,000 ms.
What you want to focus on is algorithmic optimization: improvements to the algorithm. These tend to make the code simpler, though often the data structures need more thought, because the win comes from doing less work. Take those same 1,000,000 rows and add an index: instead of looking at 1,000,000 rows, you'll spend, say, 100 ms looking at the index.
The numbers are made up, but I hope you get the point. If "what you want is speed", algorithmic optimizations will take you where no micro-optimization will.
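A sketch of the difference, with hypothetical table and column names:
-- Micro-optimization territory: every lookup still scans all rows
SELECT * FROM events WHERE user_id = 42;  -- EXPLAIN type: ALL, ~1,000,000 rows examined

-- Algorithmic optimization: a B-tree index turns the scan into a seek
ALTER TABLE events ADD INDEX idx_user (user_id);
SELECT * FROM events WHERE user_id = 42;  -- EXPLAIN type: ref, a handful of rows examined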
There's also the performance of the code using the database to consider; it is often the real bottleneck, with unoptimized queries, poor patterns for fetching related data, and a failure to take advantage of caching.
Micro-optimizations, with their complexities and special configurations, tend to make algorithmic optimizations more difficult. So you might be slowing yourself down in the long run by worrying about micro-optimizations now. Furthermore, you're doing this at the very start when you only have fuzzy ideas about how this thing will be used or perform or where the bottlenecks will be.
Spend your time optimizing your data structures and indexes, not minute details of your database storage. Once you've done that, if it still isn't fast enough, then look at tweaking settings.
As a side note, there is one possible benefit to playing with DIRECTORY. You can put the data and index on separate physical drives. Then both can be accessed simultaneously with the full I/O throughput of each drive.
Though you've just made it twice as likely to have a disk failure, and complicated backups. You're probably better off with an SSD and/or RAID.
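For completeness, the syntax looks like this (paths are hypothetical; MyISAM honours both options, while InnoDB supports only DATA DIRECTORY, and only with innodb_file_per_table enabled):
CREATE TABLE search_data (
  id INT PRIMARY KEY,
  body TEXT
) ENGINE = MyISAM
  DATA DIRECTORY = '/mnt/drive1/mysql'
  INDEX DIRECTORY = '/mnt/drive2/mysql';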
And consider whether a cloud database might actually out-perform any hardware you might be able to afford.

How to direct MySQL not to cache a table in memory?

Let's say I have several InnoDB tables:
1. table_a 20GB
2. table_b 10GB
3. table_c 1GB
4. table_d 0.5GB
And a server with limited memory (8GB)
I want fast access to table_c and table_d, and can allow slower access to table_a and table_b.
Is there a way to direct MySQL to cache c,d in memory, and NOT a,b?
(I'd move a,b to a different server, but sometimes I require a join on a,c)
InnoDB doesn't have any option to direct certain tables to stay in memory and other tables to stay out of memory. But it's kind of unnecessary.
InnoDB reads tables by loading them page-by-page into the buffer pool. Your usage of the tables guides InnoDB to keep pages in memory.
Reading a page once in a while is unlikely to kick out pages that you need to stay in memory. InnoDB keeps an area of the buffer pool reserved for recently-accessed pages. There's an algorithm for "promoting" pages into this reserved area, and pages that aren't promoted tend to get kicked out first.
Read details here: https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html
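That reserved area and the promotion rule are tunable; the values below are illustrative, not recommendations:
-- Percentage of the buffer pool used for not-yet-promoted ("old") pages
SET GLOBAL innodb_old_blocks_pct = 25;     -- default 37
-- Time (ms) a page must stay in the old sublist before it can be promoted
SET GLOBAL innodb_old_blocks_time = 2000;  -- default 1000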
If you really need to ensure that certain tables are not cached in the InnoDB buffer pool, the only certain way is to alter the storage engine for those tables. Non-InnoDB tables (e.g. MyISAM) are never cached in the InnoDB buffer pool. But this is probably not a good enough reason to switch storage engine.
Answer to question asked: No.
Answer to implied question: Probably. The implied question is "How can I make the queries run faster?" This may or may not have anything to do with what is cached.
If you fetch one row using an index, especially the PRIMARY KEY, then the query will be very fast, even if nothing is cached. If, on the other hand, you do a "table scan" of table_a, it will blow out the cache multiple times to scan through the 20GB.
So... Find out which query is the slowest, then let's focus on making it faster. It may be as simple as adding a "composite" index. Or maybe reformulating the query. Or maybe something else.
VIEWs will not help; they are syntactic sugar around a SELECT. Recomputing the statistics is not a 'real' fix.

Adding an index increases query execution time

How is it possible that when I added an index to a column, it slowed down the execution time?
I'm trying to get this query out of the slow query log.
My slow-query settings:
slow_query_log = 1
long_query_time = 1 # seconds
log_queries_not_using_indexes = 1
slow_query_log_file = /var/log/mysql-slow.log
Indexes do not always speed up execution. The effect of an index depends primarily on the "selectivity" of the query: how many rows are processed by the overall query.
In general, reading a table sequentially (a "full table scan") is an efficient operation. The database engine knows which pages it needs and can read ahead to fetch them. Such I/O often occurs in the background while processing of the pages happens in the foreground, so when the next page is needed, there is a good chance it is already in the page cache.
The performance issue with full table scans is that tables are big. So even efficient reads take time. When you are looking for one row in a million ("needle-in-the-haystack" queries), the reads are a waste of time. This is where indexes fix things.
However, say you have 100 records per page and you are reading more than 1% of the records. On average, every page will need to be read -- whether you are using an index or a full-table scan. The problem is that index reads are less efficient than scan reads. A read-ahead mechanism doesn't help them, because the reads are random.
This problem can be further exacerbated through something called thrashing. If the table does not fit into memory, then each random read is likely to be a "cache miss", incurring the overhead of a read from disk. The full table scan would just read the data, and with a decent look-ahead system, there would be no cache misses.
In your example, you could increase the selectivity of the index by including both banner and event in the index (these are compared using equality) and one of the other fields.
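A sketch of such a composite index (the table name and the third column are assumptions based on the description):
ALTER TABLE banner_stats
  ADD INDEX idx_banner_event_date (banner, event, event_date);
-- equality-compared columns first, then one range or sort column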
Depending on the structure of the data on disk, it might be faster to just load the entire table/column and sort/filter it in RAM (which is what will likely happen when no index exists) than to traverse a sparse index on disk. I don't know if this applies to your specific context, or if you have another issue here, though.

MySQL - How to determine if my table is stored in RAM?

I'm running:
MySQL v5.0.67
InnoDB engine
innodb_buffer_pool_size = 70MB
Question: What command can I run to ensure that my entire 50 MB database is stored entirely in RAM?
I am curious about why you want to store the entire table in memory; my guess is that you don't actually need to. What matters most is whether your queries are running well and whether you are tied up on disk access. It is also possible that the OS has cached the disk blocks you need, if there is memory available; in that case, even though MySQL might not have the data in memory, the OS will. If your queries are not running well, and you can do it, I highly recommend adding more memory if you want it all in RAM. If you are seeing slowdowns, it is more likely that you are running into contention.
SHOW TABLE STATUS
will show you some of the information.
If you get the server IO/buffer/cache statistics from
SHOW GLOBAL STATUS
and then run a query that requires each row to be accessed (say, summing the non-empty values of a column that is not indexed), you can check whether any IO has occurred.
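On a quiet server, that check might look like this (the column name is hypothetical):
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';  -- physical reads so far
SELECT SUM(amount) FROM yourtable;                   -- no index on amount: touches every row
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';  -- if unchanged, the scan came from RAM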
I doubt you are caching the entire thing in memory though with only 70MB. You have to take out a lot of cache, temp, and index buffers from that total.
If you run SELECT COUNT(*) FROM yourtable USE INDEX (PRIMARY), then InnoDB will put every page of the PRIMARY index into the buffer pool (assuming there is enough room in it). If the table has secondary indexes and you want to load them into the buffer pool too, then craft a similar query that reads from a secondary index to do the job.