MySQL ORDER BY clause using "filesort" - mysql

I have a table structure like
comment_id primary key
comment_content
comment_author
comment_author_url
When I run a query like
explain SELECT * FROM comments ORDER BY comment_id
It outputs the results as
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE comments ALL NULL NULL NULL NULL 22563 Using filesort
Why is MySQL not able to use the index that I have defined as the primary key?

It's not because it can't use the index. It's because the optimizer thinks it's faster not to use the index and to do a filesort instead [1]. You should see different behaviour in MyISAM and InnoDB tables.
InnoDB creates the PRIMARY key as a clustered index (or uses the first UNIQUE index on NOT NULL columns if no primary key is defined). It can be used for queries that have ORDER BY pk or WHERE pk BETWEEN low AND high, because all the needed values are in this clustered key and in consecutive locations (the clustered key is the table).
MyISAM indexes are not clustered: they are B-trees stored separately from the table data. If the query used this index, it would have to read the entire index. It would get the comment_id values in the wanted order (that's really good), but it would then have to read the table as well (not so good) to get all the other wanted columns. So the optimizer thinks that, since it's going to read the table anyway, why not scan it all and do the filesort? You can test that by trying:
SELECT comment_id FROM comments ORDER BY comment_id ;
It will use the index and do no filesort because the query needs only the values that are stored in the index.
If you want behaviour similar to InnoDB's in MyISAM, you could try creating an index on (comment_id, comment_content, comment_author, comment_author_url) and then run your query again. All the needed values would be found in the index, in the correct order, so no filesort would be performed.
The additional index will, of course, need almost as much space on disk as the table.
[1] A filesort is not always bad, and it does not mean that a file is written to disk. If the data is small, the sort is performed in memory.
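A minimal sketch of the two tests suggested in this answer, assuming a MyISAM table with short VARCHAR columns. The table name comments_demo, the column sizes, and the index name idx_comments_covering are hypothetical, since the question does not give the real column types. If comment_content is a TEXT column, the index would need a prefix length and would no longer cover the query. MyISAM also limits a key to 1000 bytes in total.

```sql
-- Hypothetical stand-in for the question's table (types and sizes are assumptions).
CREATE TABLE comments_demo (
  comment_id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  comment_content    VARCHAR(100) NOT NULL,
  comment_author     VARCHAR(50)  NOT NULL,
  comment_author_url VARCHAR(50)  NOT NULL,
  PRIMARY KEY (comment_id)
) ENGINE=MyISAM;

-- Test 1: only the indexed column is selected, so the PRIMARY index
-- alone can satisfy the query.
EXPLAIN SELECT comment_id FROM comments_demo ORDER BY comment_id;

-- Test 2: add an index that contains every selected column
-- (idx_comments_covering is a hypothetical name), then re-check SELECT *.
ALTER TABLE comments_demo
  ADD INDEX idx_comments_covering
    (comment_id, comment_content, comment_author, comment_author_url);

EXPLAIN SELECT * FROM comments_demo ORDER BY comment_id;
```

The exact EXPLAIN output depends on the MySQL version and on the data in the table, so compare the Extra column before and after adding the index rather than expecting a fixed result.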

Any time a sort can't be performed from an index, it's a filesort.
The strange thing here is that the index on that field should exist, since it is the primary key (and a primary key column is implicitly indexed). Testing on a test database, I noticed that MySQL used a filesort whenever I ran a SELECT *. This behaviour makes no sense (I know), but if you rewrite your query this way:
SELECT comment_id, comment_content, comment_author, comment_author_url
FROM comments
ORDER BY comment_id
it will use the index correctly. This may be a MySQL bug.

Related

SQL: Avoid filesort when WHERE index IN (1,2) ORDER BY Primary

SELECT post_id FROM posts WHERE blog_id IN (15,16) ORDER BY post_id DESC
post_id is the PRIMARY key, blog_id is indexed, the table is InnoDB, and the database is MariaDB.
This causes a filesort because the blog_id index is used as the key.
blog_id has to be indexed so that a query searching for just one blog, such as blog_id=15, is fast. If blog_id is not indexed, or if I use FORCE INDEX (PRIMARY), the filesort goes away and the query is faster.
My first question: I think you should not use FORCE INDEX or USE INDEX in production applications. Can I force the index and call it solved?
My second question is why it does a filesort here. If I understand correctly, a secondary index entry holds two keys, the index key and the primary key, and entries are ordered by the primary key? I guess not, because if they were, that first query should be able to search by the index and order by the primary key without a filesort. But it does not use a filesort when searching for just one id, and I don't see why it's different with multiple ids. So I don't know why it happens.
Well, I think I know all the answers already. This is what the blog_id index may look like:
(blog_id,post_id)-> (1,55) (1,59) (1,69) (2,57) (2,71)
When searching for one blog_id, no filesort is needed because the primary ids within each blog_id are already in order.
When searching for several blog_ids, ASC or DESC, a filesort is needed because the primary ids are not in order across the whole index.
Regarding FORCE INDEX: without it, the DB finds all post ids that match the index and sorts them; if there are a lot of them, the query may be slow.
With it, the DB walks the PRIMARY key post_id by post_id from the end and checks each row's blog_id (in InnoDB the row is stored in the PRIMARY index, so no second index lookup is needed) until it has found the LIMIT amount, if there is a LIMIT. In that case it does not fetch and sort all post ids, but if the matching ids are far into the index, it may be slow too. It's a matter of what the average query looks like.
The option of a combined index (post_id,blog_id), forced, works just like the PRIMARY, so I don't see any other option. If anybody can add a hint about an index that would perform better, I will mark your answer as correct. For now, since there are no answers, this will do.
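To make the reasoning in this answer concrete, here is a small sketch. It assumes a stripped-down posts table with only the two columns mentioned in the question and the sample values from the index layout shown in the answer. The table name posts_demo and the LIMIT value are assumptions for illustration.

```sql
-- Hypothetical two-column version of the table described in the question.
CREATE TABLE posts_demo (
  post_id INT UNSIGNED NOT NULL,
  blog_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (post_id),
  KEY blog_id (blog_id)   -- InnoDB stores this as (blog_id, post_id)
) ENGINE=InnoDB;

INSERT INTO posts_demo (post_id, blog_id)
VALUES (55, 1), (59, 1), (69, 1), (57, 2), (71, 2);

-- One blog_id: the index entries (1,55) (1,59) (1,69) are already
-- ordered by post_id, so the index order can serve the ORDER BY.
EXPLAIN SELECT post_id FROM posts_demo
WHERE blog_id = 1 ORDER BY post_id DESC;

-- Two blog_ids: reading the index yields 55, 59, 69, 57, 71,
-- which is not ordered by post_id, hence the extra sort.
EXPLAIN SELECT post_id FROM posts_demo
WHERE blog_id IN (1, 2) ORDER BY post_id DESC;

-- Forced PRIMARY scan: rows are read in post_id order and filtered
-- on blog_id, stopping once LIMIT rows are found (LIMIT 2 is an assumption).
EXPLAIN SELECT post_id FROM posts_demo FORCE INDEX (PRIMARY)
WHERE blog_id IN (1, 2) ORDER BY post_id DESC LIMIT 2;
```

With only five rows the optimizer may choose differently than it does on a large table, so the plans are best compared on realistic data.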

MYSQL using JOIN TYPE 'ALL' when checked in explain plan

The query is as below,
select *
from lab this_
inner join visits v1_ on this_.visit_id=v1_.id
v1_.id is the primary key of the visits table.
It takes more than 1 minute to complete.
Below is the plan.
id select_type table type possible_keys key
1 SIMPLE v1_ ALL <null> <null>
1 SIMPLE this_ ALL <null> <null>
Not sure why the primary key is not picked as the key. Also, the type is ALL.
MySQL may ignore an index during the execution of a query if it believes that an alternative plan is more efficient. A couple of points it considers:
The size of the tables. If the visits table is small, then there is not much point in using the index.
Selectivity. You join the 2 tables, but there is no filtering and you want all fields from both tables. This could mean that MySQL has to return most of the records from the visits table anyway, and the index covers only the id column. MySQL would therefore have to scan most of the visits table to return the data, so there is not much to gain by using the index.
Index on the fields on the other side of the join. You do not mention whether the lab.visit_id field is indexed. If it is not, then again there is less to be gained from using the pk of the visits table.
The speed of producing the results does not depend only on the indexes used; it also depends on the size of the result set (both record and field count), the MySQL configuration, and the overall performance of the underlying system. Nevertheless, if you believe that MySQL should use the pk of the visits table, use an index hint in the query to emphasise that the index should be used. You can check with explain whether MySQL was influenced by the index hint.
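The two suggestions above (indexing the other side of the join and adding an index hint) could look roughly like this. It assumes that lab.visit_id is not indexed yet, which the question does not say, and the index name idx_lab_visit_id is hypothetical.

```sql
-- Assumption: lab.visit_id has no index yet; idx_lab_visit_id is a made-up name.
ALTER TABLE lab ADD INDEX idx_lab_visit_id (visit_id);

-- Same join as in the question, with a hint asking MySQL to use
-- the primary key of visits. EXPLAIN shows whether the hint had any effect.
EXPLAIN
SELECT *
FROM lab this_
INNER JOIN visits v1_ FORCE INDEX (PRIMARY)
  ON this_.visit_id = v1_.id;
```

If the type for v1_ changes from ALL to eq_ref, the hint was honoured; whether the query actually gets faster still depends on the factors listed in the answer.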

InnoDB has index problems when using COUNT() + WHERE

Recently, we switched from MyISAM to InnoDB. I tested the whole application and there are generally no problems except for one thing - using COUNT(*) in combination with 2 or more WHERE conditions.
So, here's the problem. The query below takes half a second which is not acceptable. After all InnoDB shouldn't be slower than MyISAM when using COUNT() + WHERE, but that's exactly what is happening here.
Both project_id and status_id are indexed columns. The table has 350K records.
SELECT COUNT(*) FROM respondents WHERE project_id='366' AND status_id='42'
And here is what EXPLAIN says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE respondents index_merge project_id,status_id project_id,status_id 4,1 NULL 8343 Using intersect(project_id,status_id); Using where...
When I use only one condition after WHERE (either project_id='366' or status_id='42'), it works fine.
I'm thinking, this whole intersecting thing could be the root of the problem. But then what can I do about it? What do you think?
The index merge can be avoided with a compound index:
ALTER TABLE respondents ADD KEY(project_id,status_id)
This index will be useful as long as the data distribution is not very skewed (that is, project_id='366' AND status_id='42' does not match more than about 50% of the rows).
Also make sure that your column types match the search. Are project_id and status_id really VARCHAR? If not, remove the quotes.
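As a sketch of how to check the result, assuming both columns are integer types (the key_len values of 4 and 1 in the question's EXPLAIN output point that way, but the schema is not shown). The index name idx_project_status is hypothetical.

```sql
-- Compound index as suggested above, here with an explicit (made-up) name.
ALTER TABLE respondents ADD KEY idx_project_status (project_id, status_id);

-- Numeric literals instead of quoted strings, assuming integer columns.
EXPLAIN
SELECT COUNT(*)
FROM respondents
WHERE project_id = 366 AND status_id = 42;
```

If the compound index is picked, the key column should show it instead of the index_merge of the two single-column indexes.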

What's the difference between "Using index" and "Using where; Using index" in the EXPLAIN

In the extra field of the explain in mysql you can get:
Using index
Using where; Using index
What's the difference between the two?
To explain my question better I'm going to use the following table:
CREATE TABLE `test` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`another_field` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8;
INSERT INTO test() VALUES(),(),(),(),();
Which ends up with the content like:
SELECT * FROM `test`;
id another_field
1 0
2 0
3 0
4 0
5 0
In my research I found
Why is this query using where instead of index?
The output of EXPLAIN can sometimes be misleading.
For instance, filesort has nothing to do with files, using where
does not mean you are using a WHERE clause, and using index can
show up on the tables without a single index defined.
Using where just means there is some restricting clause on the table
(WHERE or ON), and not all record will be returned. Note that
LIMIT does not count as a restricting clause (though it can be).
Using index means that all information is returned from the index,
without seeking the records in the table. This is only possible if all
fields required by the query are covered by the index.
Since you are selecting *, this is impossible. Fields other than
category_id, board_id, display and order are not covered by
the index and should be looked up.
and I also found
https://dev.mysql.com/doc/refman/5.1/en/explain-output.html#explain-extra-information
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional seek
to read the actual row. This strategy can be used when the query uses
only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
For InnoDB tables that have a user-defined clustered index, that index
can be used even when Using index is absent from the Extra column.
This is the case if type is index and key is PRIMARY.
(Look at the second paragraph)
My problem with this:
First: I didn't understand the second paragraph the way it's written.
Second:
The following query returns
EXPLAIN SELECT id FROM test WHERE id = 5;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE test const PRIMARY PRIMARY 8 const 1 Using index
And this other query returns:
EXPLAIN SELECT id FROM test WHERE id > 5;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE test range PRIMARY PRIMARY 8 NULL 1 Using where; Using index
Other than the fact that one query uses a range search and the other uses a constant lookup, both queries use some restricting clause on the table (WHERE or ON), and not all records will be returned.
What does the Using where; mean in the second query? And what does its absence from the first query mean?
EXTRA
What is the difference with Using index condition; Using where?
(I'm not adding an example of this because I have not been able to reproduce it in a small self-contained piece of code)
When you see Using index in the Extra column of an EXPLAIN, it means that the (covering) index is sufficient for the query.
In your example, SELECT id FROM test WHERE id = 5; the server doesn't need to access the actual table, as it can satisfy the query (you only select id) using only the index (as the EXPLAIN says). In case you are not aware, the PK is implemented via a unique index.
When you see Using where; Using index, it means that the index is first used to retrieve the records (an actual access to the table is not needed), and the filtering of the WHERE clause is then applied on top of that result set.
In this example, SELECT id FROM test WHERE id > 5; you still fetch id from the index and then apply the greater-than condition to filter out the records that do not match.

Distinct (or group by) using filesort and temp table

I know there are similar questions on this but I've got a specific query / question around why this query
EXPLAIN SELECT DISTINCT RSubdomain FROM R_Subdomains WHERE EmploymentState IN (0,1) AND RPhone='7853932120'
gives me this EXPLAIN output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains index NULL RSubdomain 767 NULL 3278 Using where
with an index on RSubdomain,
but if I add a composite index on EmploymentState/RPhone,
I get this output from EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains range EmploymentState EmploymentState 67 NULL 2 Using where; Using temporary
If I take away the DISTINCT on RSubdomain, Using temporary drops out of the EXPLAIN output. What I don't get is why, when I add the composite key (keeping the key on RSubdomain), the DISTINCT ends up using a temp table. And which index schema is better here? I see that far fewer rows are scanned with the combined key, but the query is of type range and it's also slower.
Q: why ... does the distinct end up using a temp table?
MySQL is doing a range scan on the index (i.e. reading index blocks) to locate the rows that satisfy the predicates (WHERE clause). Then MySQL has to look up the value of the RSubdomain column in the underlying table (it's not available in the index). To eliminate duplicates, MySQL needs to scan the values of RSubdomain that were retrieved. "Using temporary" indicates that MySQL is materializing a result set, which is processed in a subsequent step. (Likely, that's the set of RSubdomain values that was retrieved; given the DISTINCT, it's likely that MySQL is actually creating a temporary table with RSubdomain as a primary or unique key, and only inserting non-duplicate values.)
In the first case, it looks like the rows are being retrieved in order by RSubdomain (likely, that's the first column in the cluster key). That means that MySQL needn't compare all the RSubdomain values with each other; it only needs to check whether the last retrieved value matches the currently retrieved value to determine whether the value can be "skipped."
Q: which index schema is better here?
The optimum index for your query is likely a covering index:
... ON R_Subdomains (RPhone, EmploymentState, RSubdomain)
But with only 3278 rows, you aren't likely to see any performance difference.
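Written out in full, the covering index suggested above might be created and checked like this. The index name idx_rphone_state_subdomain is hypothetical, and the statement assumes the combined column sizes fit within the storage engine's index key length limit.

```sql
-- Covering index with RPhone as the leading column (index name is made up).
CREATE INDEX idx_rphone_state_subdomain
  ON R_Subdomains (RPhone, EmploymentState, RSubdomain);

-- Re-run the EXPLAIN from the question to see whether the new index
-- is chosen and whether "Using temporary" is still reported.
EXPLAIN
SELECT DISTINCT RSubdomain
FROM R_Subdomains
WHERE EmploymentState IN (0,1)
  AND RPhone = '7853932120';
```

Because every column the query touches is in the index, MySQL would not need to visit the table rows at all if it picks this index.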
FOLLOWUP
Unfortunately, MySQL does not provide the type of instrumentation provided in other RDBMS (like the Oracle event 10046 sql trace, which gives actual timings for resources and waits.)
Since MySQL is choosing to use the index when it is available, that is probably the most efficient plan. For the best efficiency, I'd perform an OPTIMIZE TABLE operation (for InnoDB tables and MyISAM tables with dynamic format, if there have been a significant number of DML changes, especially DELETEs and UPDATEs that modify the length of the row...) At the very least, it would ensure that the index statistics are up to date.
You might want to compare the plan of an equivalent statement that does a GROUP BY instead of a DISTINCT, i.e.
SELECT r.RSubdomain
FROM R_Subdomains r
WHERE r.EmploymentState IN (0,1)
AND r.RPhone='7853932120'
GROUP BY r.RSubdomain
For optimum performance, I'd go with a covering index with RPhone as the leading column; that's based on an assumption about the cardinality of the RPhone column (close to unique values), as opposed to only a few different values in the EmploymentState column. That covering index will give the best performance, i.e. the quickest elimination of rows that need to be examined.
But again, with only a couple thousand rows, it's going to be hard to see any performance difference. If the query was examining millions of rows, that's when you'd likely see a difference, and the key to good performance will be limiting the number of rows that need to be inspected.