MySQL LIMIT X, Y slows down as I increase X

I have a database with around 600,000 listings. While browsing them on a paginated page, I use this query to limit the records:
SELECT file_id, file_category FROM files ORDER BY file_edit_date DESC LIMIT 290580, 30
On the first pages (LIMIT 0, 30) it loads in a few ms, and the same goes for LIMIT 30,30, LIMIT 60,30, LIMIT 90,30, etc. But as I move toward the last pages, the query takes around 1 second to execute.
Indexes are probably not the issue; it also happens if I run this:
SELECT * FROM `files` LIMIT 400000,30
Not sure why.
Is there a way to improve this?
Unless there is a better solution, would it be bad practice to just load all the records and loop over them in the PHP page to see whether each record is inside the pagination range, and print it if so?
The server is an i7 with 16GB RAM;
MySQL Community Server 5.7.28;
files table is around 200 MB
Here is the my.cnf, if it matters:
query_cache_type = 1
query_cache_size = 1G
sort_buffer_size = 1G
thread_cache_size = 256
table_open_cache = 2500
query_cache_limit = 256M
innodb_buffer_pool_size = 2G
innodb_log_buffer_size = 8M
tmp_table_size=2G
max_heap_table_size=2G

You may find that adding the following index will help performance:
CREATE INDEX idx ON files (file_edit_date DESC, file_id, file_category);
If used, MySQL would only need a single index scan to retrieve the rows at the given offset. Note that we include the columns from the select clause so that the index can cover the entire query.
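A quick way to confirm the index is covering: run EXPLAIN on the original query and look for "Using index" in the Extra column (a sketch; exact output varies by version and data):
EXPLAIN SELECT file_id, file_category FROM files ORDER BY file_edit_date DESC LIMIT 290580, 30;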

LIMIT was invented to reduce the size of the result set; it can be used by the optimizer if you order the result set using an index.
When using LIMIT x,n the server needs to process x+n rows to deliver a result. The higher the value for x, the more rows have to be processed.
Here is the EXPLAIN output from a simple table, with a unique index on column a:
MariaDB [test]> explain select a,b from t1 order by a limit 0, 2;
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| 1 | SIMPLE | t1 | index | NULL | PRIMARY | 4 | NULL | 2 | |
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
1 row in set (0.00 sec)
MariaDB [test]> explain select a,b from t1 order by a limit 400000, 2;
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
| 1 | SIMPLE | t1 | index | NULL | PRIMARY | 4 | NULL | 400002 | |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
1 row in set (0.00 sec)
When running the statements above (without EXPLAIN), the execution time for LIMIT 0 is 0.01 secs; for LIMIT 400000 it is 0.6 secs.
Since MariaDB doesn't support LIMIT inside an IN subquery, you can split the work into two statements:
The first statement retrieves the IDs (and only needs to read the index), and the second statement uses the IDs retrieved by the first:
MariaDB [test]> select a from t1 order by a limit 400000, 2;
+--------+
| a |
+--------+
| 595312 |
| 595313 |
+--------+
2 rows in set (0.08 sec)
MariaDB [test]> select a,b from t1 where a in (595312,595313);
+--------+------+
| a | b |
+--------+------+
| 595312 | foo |
| 595313 | foo |
+--------+------+
2 rows in set (0.00 sec)
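If you would rather keep it to a single statement, the same trick can be written as a deferred join: LIMIT is permitted in a derived (FROM-clause) table even though it is rejected in an IN subquery. A sketch against the same test table:
SELECT t1.a, t1.b
FROM t1
JOIN (SELECT a FROM t1 ORDER BY a LIMIT 400000, 2) AS page USING (a)
ORDER BY t1.a;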

Caution: I am about to use some strong language. Computers are big and fast, and they can handle bigger stuff than they could even a decade ago. But, as you are finding out, there are limits. I'm going to point out multiple limits that you have threatened; I will try to explain why the limits may be a problem.
Settings
query_cache_size = 1G
is terrible. Whenever a table is written to, the QC scans the 1GB looking for any references to that table in order to purge entries in the QC. Decrease that to 50M. This, alone, will speed up the entire system.
sort_buffer_size = 1G
tmp_table_size=2G
max_heap_table_size=2G
are bad for a different reason. If you have multiple connections performing complex queries, lots of RAM could be allocated for each, thereby chewing up RAM, leading to swapping, and possibly crashing. Don't set them higher than about 1% of RAM.
In general, do not blindly change values in my.cnf. The most important setting is innodb_buffer_pool_size, which should be bigger than your dataset, but no bigger than 70% of available RAM.
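Putting those rules of thumb together for this 16GB server, a my.cnf starting point might look like the following (illustrative values only, not a definitive tuning):
query_cache_size = 50M
sort_buffer_size = 160M        # about 1% of 16GB RAM
tmp_table_size = 160M
max_heap_table_size = 160M
innodb_buffer_pool_size = 2G   # already well above the ~200MB dataset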
load all records
Ouch! The cost of shoveling all that data from MySQL to PHP is non-trivial. Once it gets to PHP, it will be stored in structures that are not designed for huge amounts of data -- 400030 (or 600000) rows might take 1GB inside PHP; this would probably blow out its "memory_limit", causing PHP to crash. (OK, just to die with an error message.) It is possible to raise that limit, but then PHP might push MySQL out of memory, leading to swapping, or maybe running out of swap space. What a mess!
OFFSET
As for the large OFFSET, why? Do you have a user paging through the data? And he is almost to page 10,000? Are there cobwebs covering him?
OFFSET must read and step over 290580 rows in your example. That is costly.
For a way to paginate without that overhead, see http://mysql.rjweb.org/doc.php/pagination .
If you have a program 'crawling' through all 600K rows, 30 at a time, then the tip about "remember where you left off" in that link will work very nicely for such use. It does not "slow down".
If you are doing something different; what is it?
Pagination and gaps
Not a problem. See also: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks which is more aimed at walking through an entire table. It focuses on an efficient way to find the 30th row going forward. (This is not necessarily any better than remembering the last id.)
That link is aimed at DELETEing, but it can easily be revised to SELECT.
Some math for scanning a 600K-row table 30 rows at a time:
My links: 600K rows are touched. Or twice that, if you peek forward with LIMIT 30,1 as suggested in the second link.
OFFSET ..., 30 must touch (600K/30)*600K/2 rows -- about 6 billion rows.
(Corollary: changing 30 to 100 would speed up your query, though it would still be painfully slow. It would not speed up my approach, but it is already quite fast.)
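Concretely, "remember where you left off" for the original files query might look like the sketch below. Since file_edit_date may repeat, file_id is used as a tie-breaker, which assumes an index on (file_edit_date, file_id); @last_date and @last_id hold the values from the last row of the previous page:
-- first page
SELECT file_id, file_category, file_edit_date
FROM files
ORDER BY file_edit_date DESC, file_id DESC
LIMIT 30;
-- subsequent pages
SELECT file_id, file_category, file_edit_date
FROM files
WHERE file_edit_date < @last_date
   OR (file_edit_date = @last_date AND file_id < @last_id)
ORDER BY file_edit_date DESC, file_id DESC
LIMIT 30;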

Related

MySQL SELECT * optimization

Is there a reason why there is an enormous difference between
1. SELECT * FROM data; -- 45000 rows
2. SELECT data.* FROM data; -- 45000 rows
SHOW PROFILES;
+----------+------------+-------------------------+
| Query_ID | Duration | Query |
+----------+------------+-------------------------+
| 1 | 0.10902800 | SELECT * FROM data |
| 2 | 0.11139200 | SELECT data.* FROM data |
+----------+------------+-------------------------+
2 rows in set, 1 warning (0.00 sec)
As far as I know, they both return the same number of rows and columns. Why the disparity in duration?
MySQL version 5.6.29
That's not much of a difference. Neither is optimized. Both do full table scans. Both look the same to the optimizer after parsing. You are talking about a couple of milliseconds of difference.
You can't optimize full table scans. The problem is not SELECT * or SELECT data.*. The problem is that there is no WHERE clause, because that's where optimization starts.
The particular examples specified would return the same result and have the same performance.
[TableName].[column] is usually used to pinpoint which table you mean when two tables are present in a join or a complex statement and both have a column with the same name.
Its most common use is in a join, though. For a basic statement such as the one above there is no difference, and the output will be the same.
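For illustration, a hypothetical join where both tables have an id column, so qualification is required to say which one you mean:
SELECT orders.id AS order_id, customers.id AS customer_id, customers.name
FROM orders
JOIN customers ON customers.id = orders.customer_id;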

Why does the query execute so much slower when all the columns involved are the same and only the where condition changes?

I have this query:
SELECT 1 AS InputIndex,
IF(TRIM(DeviceInput1Name) = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';
This completes fairly quickly (in up to 0.01 seconds). However, running the same query with the WHERE clause changed to
WHERE DeviceIMEI = 'Some_Other_Search_Value';
makes it run for upwards of 14 seconds! Some search values finish very quickly, while others run way too long.
If I run EXPLAIN on either query, I get the following:
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | PRIMARY | devices | ref | DeviceIMEI | DeviceIMEI | 28 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
Also, here's the actual number of records, just so it's clear:
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Search_Value';
+----------+
| count(*) |
+----------+
| 1017946 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Other_Search_Value';
+----------+
| count(*) |
+----------+
| 306100 |
+----------+
1 row in set (0.04 sec)
Any ideas why changing a search value in the WHERE clause would cause the query to execute so slowly, even when the number of physical records to search through is lower?
Note there is no need for you to rewrite the query, just explain why the above happens.
UPDATE: I have tried running two separate queries instead of one with dependent subqueries to get the information I need (first I select DeviceID from devices by DeviceIMEI, then select from devicevalues by DeviceID I got from the previous query) and all queries return instantly. I suppose the only solution is to run these queries in a transaction, so I'll be making a stored procedure to do this. This, however, still doesn't answer my question which puzzles me.
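For reference, the two-step rewrite described above might look like this sketch (@device_id is a session variable standing in for the value returned by the first query):
SELECT DeviceID INTO @device_id FROM devices WHERE DeviceIMEI = 'Some_Search_Value';
SELECT Value1_1, Value1_2
FROM devicevalues
WHERE DeviceID = @device_id
ORDER BY ValueTime DESC
LIMIT 1;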
I don't think that 1017946 is equivalent to the number of rows returned by your very first query. Your first query returns all rows from devices plus some correlated subqueries; your count query returns all matching rows across the two tables. If so, the problem might be a cardinality issue: 'Some_Other_Search_Value' rows may constitute a much larger proportion of the table than 'Some_Search_Value' rows, so MySQL chooses a table scan.
If I understand correctly, the query is the same, and only the searched value changes.
There are three real possibilities as I can see, the first much likelier than the others:
The fast query only appears to be fast, because its result is already in the MySQL query cache. Try disabling the cache, running with SQL_NO_CACHE, or running the slow query twice. If the second run takes 0.01s instead of 14s, you'll know this is the case.
One query has to look at far more records than the other. One IMEI may have lots of rows in devicevalues, another might have next to none. Apparently you are in such a situation, but what makes this explanation unlikely (apart from the times involved) is the fact that it is the slower IMEI that actually has fewer matches.
The slow query is indeed slow. This means that a particular subset of data is hard to locate or hard to retrieve. The former may be due to overdue reindexing or to filesystem fragmentation of very large indexes. The latter can also be due to fragmentation of the tablespace, or to some other condition that splits up records (for example, the database is partitioned). A search in a small partition tends to be faster than a search in a large partition.
But the time differences involved aren't equal in the three cases, and a 1400x difference seems to me an unlikely consequence of (2) or (3). The first possibility seems way more appealing.
Update: you don't seem to be using indexes in your subqueries. Do you have an index such as
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime);
If you can, you can try a covering index:
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime, Value1_1, Value1_2);

Can't optimise MySQL query

I am running a query to retrieve some game levels from a MySQL database. The query itself takes around 0.00025 seconds to execute on a database that contains 40 level strings. I thought that was satisfactory, until I got a message from the website host telling me to optimise the below-mentioned query, or the script would be removed since it is putting a lot of strain on their servers.
I tried optimising by using EXPLAIN and EXPLAIN EXTENDED and adjusting the columns accordingly (adding indexes), but I always get the same performance. What I also noticed is that MySQL didn't use indexes where they were available but instead did a full-table scan.
Results from EXPLAIN EXTENDED:
table   | id | select_type | type | possible_keys  | key     | key_len | ref           | rows | Extra
users   | 1  | SIMPLE      | ALL  | PRIMARY,id     | NULL    | NULL    | NULL          | 7    | Using temporary; Using filesort
AllTime | 1  | SIMPLE      | ref  | PRIMARY,userid | PRIMARY | 4       | Test.users.id | 1    |
query:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname, AllTime.levelstr
FROM AllTime
INNER JOIN users
ON AllTime.userid=users.id
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
The tables:
users
| id(int) | nickname(varchar) |
| (Primary, Auto_increment) | |
|---------------------------|-------------------|
| 1 | username1 |
| 2 | username2 |
| 3 | username3 |
| ... | ... |
and AllTime
| id(int) | userid(int) | levelname(varchar) | levelstr(text) |
| (Primary, Auto_increment) | (index) | | |
|---------------------------|-------------|--------------------|----------------|
| 1 | 2 | levelname1 | levelstr1 |
| 2 | 2 | levelname2 | levelstr2 |
| 3 | 3 | levelname3 | levelstr3 |
| 4 | 1 | levelname4 | levelstr4 |
| 5 | 1 | levelname5 | levelstr5 |
| 6 | 1 | levelname6 | levelstr6 |
| 7 | 2 | levelname7 | levelstr7 |
Is there a way to optimize this query or would I be better off by calling two consecutive queries from php just to avoid the warning?
I am just learning MySQL, so please take that information into account when replying, thank you :)
I'm assuming you're using InnoDB.
For an INNER JOIN, MySQL typically starts with the table with the fewest rows, in this case users. However, since you just want the latest 20 AllTime records joined with the corresponding user records, you actually should start with AllTime since with the LIMIT, it will be the smaller data set.
Use STRAIGHT_JOIN to force the join order:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname,
AllTime.levelstr
FROM AllTime
STRAIGHT_JOIN users
ON users.id = AllTime.userid
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
It should be able to use the primary key on the AllTime table and follow it in descending order. It'll grab all the data on the same pages as it goes.
It should also use the primary key on the users table to grab the id and nickname. If there are more than just two columns, you might add a multi-column covering index on (id, nickname) to improve the speed.
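A sketch of that covering index (the index name is arbitrary):
CREATE INDEX idx_users_id_nickname ON users (id, nickname);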
If you can, convert the levelstr column to VARCHAR so that the data is stored on the same page as the rest of the row; otherwise, MySQL has to fetch the TEXT columns separately. This assumes that your rows are under the ~8000 byte row limit for InnoDB. There is no way to avoid the Using temporary unless you get rid of the TEXT column.
Most likely, your host has identified this query by using the slow query log, which can identify all queries that don't use an index, or they may have red flagged it because of the Using temporary.
It doesn't look like the query itself has a problem.
Review the application code; most likely the issue is in the code.
Check the MySQL query execution plan; possibly you are missing an index.
Make sure you cache data in the application and in the database (FYI, sometimes you can load the whole database into application memory).
Make sure you use a connection pool.
Create a view (a very small chance of improvement).
Try removing the ORDER BY clause (again, a very small chance it will improve performance).
The query itself takes around 0.00025 seconds ... I got a message from the website host telling me to optimise the below-mentioned query, or the script will be removed since it is pushing a lot of strain onto their servers.
Ask the website host for more details about why this query has been flagged for attention. A query that trivial is not going to cause strain on anything unless it is being called very frequently.
Find out how many times that query is being run. I will bet you a nickel that your site is getting hammered by a bot and that this query is being executed hundreds or thousands of times per minute. If so, then that's your real problem.
LIMIT ($value_from_php),20; -- if $value_from_php is huge, then the query is slow. This is because all the 'old' pages need to be scanned before getting to the 20 you need.
By "remembering where you left off" you can make every page equally fast. See this for further details: http://mysql.rjweb.org/doc.php/pagination

Why is MySQL with InnoDB doing a table scan when key exists and choosing to examine 70 times more rows?

I'm troubleshooting a query performance problem. Here's an expected query plan from explain:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | table1 | range | tdcol | tdcol | 8 | NULL | 5437848 | Using where |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.
However, if I query for just one more minute of data, we get this query plan:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| 1 | SIMPLE | table1 | ALL | tdcol | NULL | NULL | NULL | 381601300 | Using where |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)
The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.
Also, the 'USE KEY tdcol' syntax does not change the query plan.
Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.
5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).
This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But, unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.
In the absence of that order-correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach would involve reading most or nearly all of the pages in the underlying table anyway (in which case the scan will be much less expensive than the index approach: fewer reads outright, and sequential instead of random reads).
MySQL's query optimizer has a cutoff when figuring out how to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?
I don't have an answer, but I suspect MySQL doesn't have enough memory to explore the index. I would be looking at the server memory settings, particularly the InnoDB buffer pool and some of the other key storage pools.
What's the distribution of your data like? Try running a min(), avg(), max() on it to see where it is. It's possible that that 1 minute makes the difference in how much information is contained in that range.
It can also just be the background configuration of InnoDB. There are a few factors like page size, and memory, as staticsan said. You may want to explicitly define a B+Tree index.
"so I have a hard time believing that the table scan is better."
True. YOU have a hard time believing it. But the optimizer seems not to.
I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.
That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?
If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently, compared to the case where it has to conclude that "oh, there are definitely some rows that won't satisfy the upper limit, so I'll use the index just to be on the safe side".
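A quick way to probe that hypothesis (MIN and MAX on an indexed column are cheap lookups): if MAX(tdcol) falls at or below the upper bound, the second range effectively covers the rest of the table, which could tip the optimizer toward a scan:
SELECT MIN(tdcol), MAX(tdcol) FROM table1;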

How to improve search speeds in this situation?

I have a search implemented on my site, it runs the following queries:
SELECT COUNT(mov_id) AS total_things
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'));
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
| 1 | SIMPLE | movies | ref | con_type | con_type | 12 | const,const,const | 11804 | Using where |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-------------+
64058 Queries
Total time: 200817, Average time: 3.13492459958163
Taking 2 to 25 seconds to complete
Rows analyzed 1882 - 12104
SELECT
con_id,
con_title,
con_desc,
MATCH (con_title) AGAINST ('search keyword') AS relevancy
FROM content
WHERE con_status = 1 AND con_incomplete = 0 AND con_type = 1
AND ((con_title) LIKE ('%search keyword%')
OR soundex(con_title) LIKE soundex('search keyword')
OR MATCH (con_title) AGAINST ('search keyword'))
ORDER BY relevancy DESC
LIMIT 0, 24;
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
| 1 | SIMPLE | movies | ref | con_type | con_type | 12 | const,const,const | 11803 | Using where; Using filesort |
+----+-------------+--------+------+---------------+----------+---------+-------------------+-------+-----------------------------+
78321 Queries
Total time: 200657, Average time: 2.56198209930925
Taking 2 to 16 seconds to complete
Rows analyzed 0 - 15752
This basically works like a ghetto "fuzzy search" to ignore typos people might make.
Unfortunately, it's very slow (even if I remove soundex() or FULLTEXT searching). How can I improve search speeds in this situation?
The part of the WHERE clause that hurts is the first % after LIKE. To speed it up, you could normalize the keywords, moving them to a separate table:
table moviekeywords: movieid, keyword
table movies: movieid, ...
This allows you to search the moviekeywords table using an = condition, or at least LIKE 'humphrey%'. Both variants can be made extremely fast with an index.
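A minimal sketch of that normalization (column types and index names are illustrative):
CREATE TABLE moviekeywords (
    movieid INT NOT NULL,
    keyword VARCHAR(64) NOT NULL,
    PRIMARY KEY (movieid, keyword),
    KEY idx_keyword (keyword)
);
SELECT m.*
FROM movies m
JOIN moviekeywords mk ON mk.movieid = m.movieid
WHERE mk.keyword LIKE 'humphrey%';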
As long as you keep using soundex and LIKE '%nnn%' you will be running a full scan of the intermediate result. To illustrate this: if you omitted your other predicates (on the con_status, con_incomplete and con_type columns) you would always be running a full table scan.
I suggest dropping or scaling back your fuzzy predicates. For example, just running LIKE 'nnn%' will be MUCH faster than '%nnn%' (if that column is indexed), but of course your search results will not be as fuzzy. Perhaps make soundex an advanced search option that does not always run.
If you can't compromise on any of those points, then at least make sure that your con_status, con_incomplete and con_type columns are all indexed.
Think about Andomar's solution again - most keyword searches allow you to specify multiple keywords. You can't do that with your current query. And there's no problem with "The Terminator" - for that, you'd just add one keyword, "Terminator".
And with an index on the keyword column, it will be fast.
I made my "fuzzy search" a fallback option if COUNT on the original stricter query returns no results. My results have been pretty fast so far using
SOUNDS LIKE ('blah')
So it looks like you only have around 15,000 rows. If you don't expect your table to grow past a hundred thousand entries or so, maybe you should just keep all the titles in memory and avoid hitting the database until you know which entries you want.
That is, at startup and at periodic intervals, just query all the titles out of the database, split each one into words, and keep a mapping of words to row keys. This should take less than 1MB of memory, accessing it should be quite fast, and most importantly you can add whatever fuzzy matching or heuristic scoring mechanisms you like (without modifying your schema).
Just a thought.