Why does this MySQL query not use the index properly? - mysql

Sorry that this is such a specific and probably cliche question, but it is really causing me major problems.
Everyday I have to do several hundred thousands select statements that look like these two (this is one example but they're all pretty much the same just with different word1):
SELECT pibn,COUNT(*) AS aaa FROM research_storage1
USE INDEX (word21pibn)
WHERE word1=270299 AND word2=0
GROUP BY pibn
ORDER BY aaa DESC
LIMIT 1000;
SELECT pibn,page FROM research_storage1
USE INDEX (word12num)
WHERE word1=270299 AND word2=0
ORDER BY num DESC
LIMIT 1000;
The first statement is quick-as-a-flash and takes a fraction of a second. The second statement takes about 2 seconds, which is way too long considering I have hundreds of thousands to do.
The indexes are:
word21pibn: word2, word1, pibn
word12num: word1, word2, num
The results of explain (for both extended and partitions are):
mysql> explain extended SELECT pibn,COUNT(*) AS aaa FROM research_storage1 USE INDEX (word21pibn) WHERE word1=270299 AND word2=0 GROUP BY pibn ORDER BY aaa DESC LIMIT 1000;
+----+-------------+-------------------+------+---------------+------------+---------+-------------+------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------------+---------+-------------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | research_storage1 | ref | word21pibn | word21pibn | 6 | const,const | 1549 | 100.00 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------------------+------+---------------+------------+---------+-------------+------+----------+-----------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain partitions SELECT pibn,COUNT(*) AS aaa FROM research_storage1 USE INDEX (word21pibn) WHERE word1=270299 AND word2=0 GROUP BY pibn ORDER BY aaa DESC LIMIT 1000;
+----+-------------+-------------------+------------+------+---------------+------------+---------+-------------+------+-----------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+------------+------+---------------+------------+---------+-------------+------+-----------------------------------------------------------+
| 1 | SIMPLE | research_storage1 | p99 | ref | word21pibn | word21pibn | 6 | const,const | 1549 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------------------+------------+------+---------------+------------+---------+-------------+------+-----------------------------------------------------------+
1 row in set (0.00 sec)
mysql> explain extended SELECT pibn,page FROM research_storage1 USE INDEX (word12num) WHERE word1=270299 AND word2=0 ORDER BY num DESC LIMIT 1000;
+----+-------------+-------------------+------+---------------+-----------+---------+-------------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+-----------+---------+-------------+------+----------+-------------+
| 1 | SIMPLE | research_storage1 | ref | word12num | word12num | 6 | const,const | 818 | 100.00 | Using where |
+----+-------------+-------------------+------+---------------+-----------+---------+-------------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
mysql> explain partitions SELECT pibn,page FROM research_storage1 USE INDEX (word12num) WHERE word1=270299 AND word2=0 ORDER BY num DESC LIMIT 1000;
+----+-------------+-------------------+------------+------+---------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+------------+------+---------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | research_storage1 | p99 | ref | word12num | word12num | 6 | const,const | 818 | Using where |
+----+-------------+-------------------+------------+------+---------------+-----------+---------+-------------+------+-------------+
1 row in set (0.00 sec)
The only difference I see is that the second statement does not have Using index in the extra column of describe. Though this does not make sense because the index was designed for that statement, so I don't see why it would not be used.
Any idea?

Try adding the pbin and page column to the word12num compound index. Then all the information you need for your query will be in the index, like it is in your first query.
Edit I missed the pbin column you're selecting; sorry about that.
If your compound index turns out to contain (word1, word2, num, pbin, page) then everything in your second query can come from the index.
If you look at the Extra column under your first query's EXPLAIN, one of the blurbs in there is Using index. #sebas pointed this out. This means, actually, Using index only. This means the server can satisfy your query by just consulting the index without having to consult the table. That's why it is so fast: the server doesn't have to bang the disk heads around random-accessing the table to get the extra columns. Using index is not present in your second query's EXPLAIN.
The columns mentioned in WHERE come first. Then we have the columns in ORDER BY. Finally we have the columns you're simply SELECTing. Why use this particular order for columns in the index? The server finds its way to the first index entry matching the SELECT, then can read the index sequentially to satisfy the query.
It is indeed expensive to construct and maintain a compound index on a big table. You are looking at a basic tradeoff in DBMS design: do you want to spend time constructing the table or looking things up in it? Only you know whether it's better to incur the cost when building the table or when looking things up in it.

Related

MariaDB SELECT with index used but looks like table scan

I have a MariaDB 10.4 with a hung table (about 100 million rows) for storing crawled posts. The table contains 4x columns, and one of them is lastUpadate (datetime) and indexed.
Recently I try to select posts by lastUpdate. Most of them returns fast with index used, but some takes minutes with fewer records returned and looks like a table scan.
This is the query explain without conditions.
> explain select 1 from SourceAttr;
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+-------------+
| 1 | SIMPLE | SourceAttr | index | NULL | idxCreateDate | 5 | NULL | 79830491 | Using index |
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+-------------+
This is the query explain and number of rows returned for the slow one. The number of rows in the explain is almost equals to the above one.
> select 1 from SourceAttr where (lastUpdate >= '2020-01-11 11:46:37' AND lastUpdate < '2020-01-12 11:46:37');
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | SourceAttr | index | idxLastUpdate | idxLastUpdate | 5 | NULL | 79827437 | Using where; Using index |
+------+-------------+------------+-------+---------------+---------------+---------+------+----------+--------------------------+
> select 1 from SourceAttr where (lastUpdate >= '2020-01-11 11:46:37' AND lastUpdate < '2020-01-12 11:46:37');
394454 rows in set (14 min 40.908 sec)
The is the fast one.
> explain select 1 from SourceAttr where (lastUpdate >= '2020-01-15 11:46:37' AND lastUpdate < '2020-01-16 11:46:37');
+------+-------------+------------+-------+---------------+---------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+-------+---------------+---------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | SourceAttr | range | idxLastUpdate | idxLastUpdate | 5 | NULL | 3699041 | Using where; Using index |
+------+-------------+------------+-------+---------------+---------------+---------+------+---------+--------------------------+
> select 1 from SourceAttr where (lastUpdate >= '2020-01-15 11:46:37' AND lastUpdate < '2020-01-16 11:46:37');
1352552 rows in set (2.982 sec)
Any reason what might cause this ?
Thanks a lot.
When you see type: index it's called an index scan. This is almost as bad as a table-scan.
Notice the rows: 79827437 in the EXPLAIN of the two slow queries. This means it's examining over 79 million items in the scanned index, either idxCreateDate or idxLastUpdate. So it's basically examining every index entry, which takes nearly as long as examining every row of the table.
Whereas the quick query says rows: 3699041 so it's estimating less than 3.7 million rows examined. More than 20x fewer.

How to avoid Using Filesort in "!=" mysql

Please, help me!
How to optimize a query like:
SELECT idu
FROM `user`
WHERE `username`!='manager'
AND `username`!='user1#yahoo.com'
ORDER BY lastdate DESC
This is the explain:
explain SELECT idu FROM `user` WHERE `username`!='manager' AND `username`!='ser1#yahoo.com' order by lastdate DESC;
+----+-------------+-------+------+----------------------------+------+---------+------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------------------+------+---------+------+--------+-----------------------------+
| 1 | SIMPLE | user | ALL | username,username-lastdate | NULL | NULL | NULL | 208478 | Using where; Using filesort |
+----+-------------+-------+------+----------------------------+------+---------+------+--------+-----------------------------+
1 row in set (0.00 sec)
To avoid file sorting in a big database.
Since this query is just scanning all rows, you need an index on lastdate to avoid MySQL from having to order the results manually (using filesort, which isn't always to disk/temp table).
For super read performance, add the following multi-column "covering" index:
user(lastdate, username, idu)
A "covering" index would allow MySQL to just scan the index instead of the actual table data.
If using InnoDB and any of the above columns are your primary key, you don't need it in the index.

Should I use derived table in this situation?

I need to fetch 10 random rows from a table, the query below will not do it as it is going to be very slow on a large scale (I've read strong arguments against it):
SELECT `title` FROM table1 WHERE id1 = 10527 and id2 = 37821 ORDER BY RAND() LIMIT 10;
EXPLAIN:
select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
------------+-------------+------+---------------+-------+---------+------+------+----------------+
SIMPLE | table1 | ref | id1,id2 | id2 | 5 | const| 7 | Using where; Using temporary; Using filesort
I tried the following workaround:
SELECT * FROM
(SELECT `title`, RAND() as n1
FROM table1
WHERE id1 = 10527 and id2 = 37821) TTA
ORDER BY n1 LIMIT 10;
EXPLAIN:
select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
------------+-------------+------+---------------+-------+---------+------+------+----------------+
PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 7 | Using filesort |
DERIVED | table1 | ref | id1,id2 | id2 | 5 |const | 7 | Using where |
But I’ve read also couple of statements against using derived tables.
Could you please tell me if the latter query is going to make any improvement?
You should try the first method to see if it works for you. If you have an index on table1(id1, id2) and there are not very many occurrences of any given value pair, then the performance is probably fine for what you want to do.
Your second query is going to have somewhat worse performance than the first. The issue with the performance of order by rand() is not the time taken to calculate random numbers. The issue is the order by, and your second query is basically doing the same thing, with the additional overhead of a derived table.
If you know that there were always at least, say, 1000 matching values, then the following would generally work faster:
SELECT `title`
FROM table1
WHERE id1 = 10527 and id2 = 37821 and rand() < 0.05
ORDER BY RAND()
LIMIT 10;
This would take a random sample of about 5% of the data and with 1,000 matching rows, you would almost always have at least 10 rows to choose from.

Why my query is slower if a date column is in the SELECT part of my query?

I have a strange behavior with my mysql query below:
SELECT domain_id, domain_name, domain_lastupdate
FROM domains
WHERE domain_id > 300000 LIMIT 2000
takes ~ 15seconds...
while
SELECT domain_id, domain_name
FROM domains
WHERE domain_id > 300000 LIMIT 2000
takes ~ 0.05seconds...
I've tried different ids with different limits doing one before the other and the other way around not to get cached results, but I end up with dramatic time differences.
I have 1 index on the domain_id, 1 on the domain_name, but none with both columns...
I just don't get it...
#
The domain_lastupdate is a simple Date column.
Here's the EXPLAIN output of both queries:
explain SELECT domain_id, domain_name, domain_lastupdate FROM domains WHERE domain_id > 255000 LIMIT 500;
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+-------------+
| 1 | SIMPLE | domains | range | UN_domainid | UN_domainid | 4 | NULL | 12575357 | Using where |
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+-------------+
1 row in set (0.00 sec)
second one:
explain SELECT domain_id, domain_name FROM domains WHERE domain_id > 255000 LIMIT 500;
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | domains | range | UN_domainid | UN_domainid | 4 | NULL | 12575369 | Using where; Using index |
+----+-------------+---------+-------+---------------+-------------+---------+------+----------+--------------------------+
1 row in set (0.01 sec)
Any idea why the first one doesn't use the index ?
When you are pulling out the non date columns that you have indexed the SQL Server is able to pull your data directly out of the index and needn't go to the table at all. To get the date it is having to hit the table. Add an index on the date column.
Also I suppose you could create a multi column index. Make sure you have domain_id as the first column in the index. Creating Indexes
What you want to use is what is called A Covering Index

Getting a Column's Max Value

Is there any tangible difference (speed/efficiency) between these statements? Assume the column is indexed.
SELECT MAX(someIntColumn) AS someIntColumn
or
SELECT someIntColumn ORDER BY someIntColumn DESC LIMIT 1
This depends largely on the query optimizer in your SQL implementation. At best, they will have the same performance. Typically, however, the first query is potentially much faster.
The first query essentially asks for the DBMS to inspect every value in someIntColumn and pick the largest one.
The second query asks the DBMS to sort all the values in someIntColumn from largest to smallest and pick the first one. Depending on the number of rows in the table and the existence (or lack thereof) of an index on the column, this could be significantly slower.
If the query optimizer is sophisticated enough to realize that the second query is equivalent to the first one, you are in luck. But if you retarget your app to another DBMS, you might get unexpectedly poor performance.
EDIT based on explain plan:
Explain plan shows that max(column) is more efficient. The explain plan say, “Select tables optimized away”.
EXPLAIN SELECT version from schema_migrations order by version desc limit 1;
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
| 1 | SIMPLE | schema_migrations | index | NULL | unique_schema_migrations | 767 | NULL | 1 | Using index |
+----+-------------+-------------------+-------+---------------+--------------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
EXPLAIN SELECT max(version) FROM schema_migrations ;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)