I'm not clear how to search for this, if there's a duplicate feel free to point me to it.
I'm wondering if there's a way to tell MySQL that something will be sorted before filtering, so that it can perform the filter with a binary search instead of a linear search.
For example, consider a table with columns id, value, and created_at
id is an auto-increment column and created_at is a timestamp column with a default of CURRENT_TIMESTAMP,
then consider the following query:
SELECT *
FROM `table`
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY id
Because I have context on the data that MySQL doesn't, namely that if id is sorted then created_at will also be sorted, I can conclude that we can binary search on created_at. However, MySQL does a full table scan for the filter, as it's unaware of, or unwilling to assume, this fact. The EXPLAIN on the query on my test dataset shows that it's scanning all 50 rows to return the 24 that match the filter, when it's possible to do it by scanning only approximately log2(50) rows. This isn't a huge difference for my test dataset, but on larger data it can have an impact.
I'll note that the obvious answer here is to add an index on created_at, but in more real-life queries that's not always possible. For example, if you were filtering on another indexed column, it wouldn't be able to use that created_at index, but we might still be able to make assumptions about the ordering based on other ORDER BYs.
Anyway, after all that setup my question is: is there a way that I can tell MySQL that I know this data is already sorted, so that it need not perform a table scan? Something similar to FORCE INDEX that can be used to override the behaviour of picking an index for a query.
The answer is no in general.
InnoDB queries read rows in order by the index used. So in your case if there's an index on created_at, it'll read rows in that order, ascending. There's no way to tell the optimizer that this matches the order of id too, and the optimizer won't make that assumption.
So the bottom line is that it'll have to perform a filesort on the matching rows to guarantee they're in order by id.
The comment above suggests ORDER BY created_at would solve the problem in the example you show. That is, if you know that the order of created_at is the same as the order of id, then just ORDER BY created_at and the filesort can be skipped: the optimizer knows the ORDER BY you requested is actually the order it read the rows from the index, so sorting is a no-op.
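For the example query, that looks like this (assuming created_at is indexed, so the rows are read in that order):
SELECT *
FROM `table`
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY created_at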
But I assume your example was only one case among many potential cases, and it might not be the right solution to use in other cases.
But why was the query doing a table-scan?
In the example you give of a table-scan of 50 rows, it's possible the optimizer decided not to use the index because the table-scan is so little work that the extra indirection of using the index isn't worth it. This is why we need to fill a table with at least a few hundred rows during testing, before we know if an index improves the query.
Another possible reason for the table-scan is that the range of dates covers a significantly large part of the table. In my experience, if the condition matches over 20% of the table, then the optimizer says, "meh, not worth the effort to use the index, so I'll just do a table-scan." That 20% figure is not an officially documented threshold, it's just my observation.
FORCE INDEX might help the latter case. FORCE INDEX really just tells the optimizer to assume a table-scan is infinitely costly, so if the named index is relevant at all to the search conditions, then use the index instead of a table-scan. It's possible though to use FORCE INDEX and name an index that is irrelevant to the search conditions. In that case, the optimizer still won't use the index, it'll just feel shame over having to do a table-scan.
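For example (idx_created_at is a hypothetical name; use whatever the index on created_at is actually called):
SELECT *
FROM `table` FORCE INDEX (idx_created_at)
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY id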
Because I have context on the data that mysql doesn't, namely that if id is sorted then created_at will also be sorted...
Give MySQL that information: order by created_at, id and make sure created_at is indexed. This will allow MySQL to use the created_at index to filter and also do most of the ordering. If this is still slow, try adding a composite index of (created_at, id).
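A minimal sketch of both suggestions (the index name is illustrative):
ALTER TABLE `table` ADD INDEX idx_created_at_id (created_at, id);
SELECT *
FROM `table`
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY created_at, id;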
Also, upgrade MySQL. 5.6 reached the end of its life last year. MySQL 8 is considerably better at optimizing queries.
I have a query and it seems very slow
My Problem
select conversation_hash as search_hash
from conversation
where conversation_hash ='xxxxx'
and result_published_at between '1600064000' and '1610668799'
order by result_published_at desc
limit 5
There is a total of 773179 records when I run:
select count(*)
from conversation
where conversation_hash ='xxxxx'
After I run EXPLAIN on the query
explain select conversation_hash as search_hash
from conversation
where conversation_hash ='xxxxx'
and result_published_at between '1600064000' and '1610668799'
order by result_published_at desc
limit 5
I got this:
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,extra
1, SIMPLE, conversation, , range, idx_result_published_at,conversation_hash_channel_content_id_index,conversation_result_published_at_index,virtaul_ad_id_conversation_hash, idx_result_published_at, 5, , 29383288, 1.79, Using index condition;Using where
Possible Issues
Looking at the EXPLAIN output, I can see it returns more rows (29383288) than the total number of records (i.e. 773179).
key_len is 5. result_published_at is a timestamp field and its length is definitely more than 5, e.g. 1625836640.
What can I improve to make this query fast? Thanks in advance.
EDIT
Indexes for conversation
Table,Non_unique,Key_name,Seq_in_index,Column_name,Collation,Cardinality,Sub_part,Packed,Null,Index_type,Comment,Index_comment
conversation,0,PRIMARY,1,id,A,96901872,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_result_id_unique,1,conversation_hash_id,A,240485,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_result_id_unique,2,result_id,A,100693480,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_channel_content_id_unique,1,conversation_hash_id,A,232122,NULL,NULL,,BTREE,,
conversation,0,conversation_conversation_hash_id_channel_content_id_unique,2,channel_content_id,A,100693480,NULL,NULL,,BTREE,,
conversation,1,conversation_tool_id_foreign,1,tool_id,A,7788,NULL,NULL,,BTREE,,
conversation,1,idx_result_published_at,1,result_published_at,A,38164712,NULL,NULL,YES,BTREE,,
conversation,1,idx_user_name,1,user_name,A,10896208,NULL,NULL,YES,BTREE,,
conversation,1,conversation_hash_channel_content_id_index,1,conversation_hash,A,294048,NULL,NULL,,BTREE,,
conversation,1,conversation_hash_channel_content_id_index,2,channel_content_id,A,99699696,NULL,NULL,,BTREE,,
conversation,1,idx_parent_channel_content_id,1,parent_channel_content_id,A,3550741,NULL,NULL,YES,BTREE,,
conversation,1,idx_channel_content_id,1,channel_content_id,A,90350472,NULL,NULL,,BTREE,,
conversation,1,conversation_result_published_at_index,1,result_published_at,A,37177476,NULL,NULL,YES,BTREE,,
conversation,1,virtaul_ad_id_conversation_hash,1,conversation_hash,A,238906,NULL,NULL,,BTREE,,
conversation,1,virtaul_ad_id_conversation_hash,2,virtual_ad_id,A,230779,NULL,NULL,YES,BTREE,,
conversation,1,idx_ad_story_id,1,ad_story_id,A,167269,NULL,NULL,YES,BTREE,,
The query is correct; it seems you have to update the server configuration for MySQL, which is probably not possible in a shared hosting environment. However, if you have your own server, follow these steps:
Go to the my.cnf file; in my case it is located at /etc/mysql/my.cnf
Increase the values of query_cache_size, max_connections, innodb_buffer_pool_size, and innodb_io_capacity (a sketch follows these steps)
Switch from MyISAM to InnoDB (if possible)
Use latest MySQL version (if possible)
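For reference, those variables live under the [mysqld] section of my.cnf. The values below are purely illustrative and must be sized to your hardware (note that query_cache_size was removed in MySQL 8.0):
[mysqld]
query_cache_size        = 64M   # MySQL 5.x only; removed in 8.0
max_connections         = 300
innodb_buffer_pool_size = 4G    # often ~70% of RAM on a dedicated database server
innodb_io_capacity      = 1000  # depends on storage; higher for SSDs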
You can get more help from this article https://phoenixnap.com/kb/improve-mysql-performance-tuning-optimization
It's a bit hard to read the output of the Explain command because the possible_keys output is separated by commas.
Depending on the data access patterns, you might want to create a unique index on conversation_hash, in case rows are unique.
If conversation_hash is not a unique field you can create a compound index on conversation_hash, result_published_at so your query will be fulfilled from the index itself.
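A sketch of that compound index (the name is illustrative):
ALTER TABLE conversation
ADD INDEX idx_hash_published (conversation_hash, result_published_at);
Since the query selects only conversation_hash, this index would be covering: the equality filter, the range on result_published_at, the ORDER BY ... DESC, and the LIMIT 5 can all be satisfied from the index alone.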
EXPLAIN estimates the row counts. (It has no way to get the exact number of rows without actually running the query.) That estimate may be a lot lower or higher than the real count. It is rare for the estimate to be that far off, but I would not be worried, just annoyed.
The existence of Text and Blob columns sometimes adds to the imprecision of Explain.
Key_len:
The raw length, which is 4 bytes for TIMESTAMP (more below).
+1 if the column is NULLable (+0 for NOT NULL); that accounts for the 5 you see: 4 + 1 for the nullable result_published_at.
Not very useful for VARCHAR.
In older versions of MySQL, a TIMESTAMP took 4 bytes and a DATETIME took 8. When fractional seconds were added (in 5.6.4), DATETIME's base size became 5 bytes while TIMESTAMP stayed at 4, with up to 3 extra bytes in either case for the fractional part. DATETIME was also changed from packed decimal to an integer encoding.
Suggest you run ANALYZE TABLE. This might improve the underlying statistics that feed into the estimates.
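For this table that would simply be:
ANALYZE TABLE conversation;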
Please provide SHOW CREATE TABLE; it may give more insight.
The 'composite' INDEX(conversation_hash, result_published_at), in that order, is optimal for that query.
I have the following problem when running a MySQL query:
The query is very slow, and when I use EXPLAIN the key is NULL even though possible_keys are available and the order is correct. I also tried adding independent indexes on each column, but the key was still NULL.
You can see table, index and mysql explain here: https://snag.gy/vcChl6.jpg
The optimizer likely has just decided that there is no reason to use the index.
Since you are using SELECT *, if MySQL used the index it would have to take the primary key values from the index and then go back and look up all the remaining columns in the clustered index. That is referred to as a double lookup, and it is generally bad for performance. As there are so few records in this table, the optimizer likely decided that it can easily do a full table scan instead and get your result faster.
In short, this is expected behavior.
If you want to SELECT just some columns, add them to the t1 index and then just SELECT only the columns you need, with that given WHERE clause. It should use the index then. As your table grows in size, it may start using the index as well, once it estimates that the double lookup is cheaper than the full table scan.
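A sketch of that approach (the table name my_table and the column title are hypothetical, since the real schema is only visible in the screenshot; t1 is the index name used above):
ALTER TABLE my_table
DROP INDEX t1,
ADD INDEX t1 (id_project, id_lang, title);
SELECT id_project, id_lang, title
FROM my_table
WHERE id_project = 1 AND id_lang = 2;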
A guess: Most rows are of that 'project' and that 'lang'.
The Optimizer does not understand that fact, so it takes the index that is obviously the best:
(id_project, id_lang)
This one would be equally good: (id_lang, id_project).
No fair... The EXPLAIN mentions indexes named id_project and id_lang (not useful), but the list of indexes shows a composite index t1(id_project, id_lang) (useful).
Then, as Willem suggests, it has to bounce between the index and the table. Normally (that is, when it has adequate statistics), the Optimizer will say "Oh, more than ~20% of the table is being referenced; let's ignore any index."
Things you can do:
Get rid of that index.
Change * to a list of just the columns you need (see the sketch after this list). In particular, if you avoid the 3 TEXT columns, two optimizations kick in. Alternatively, any that will never be longer than 255 characters can be changed to VARCHAR(255).
Use some other filtering, ordering, limiting, etc. If this is a web application, do you really want to get ~534 rows?
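For the second item, a minimal sketch (table and column names are hypothetical, since the schema is only in the screenshot):
SELECT id, title, id_project, id_lang
FROM my_table
WHERE id_project = 1 AND id_lang = 2
LIMIT 100;
The LIMIT also addresses the last item, if you do not really need all ~534 rows at once.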
I am trying to optimize the query which we are using to generate reports.
I think I have managed to optimize it to some extent.
Below was the original query:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name, trat.name
order by field(trat.name,'create-account','change-attr',
'add-member-to-group',
'grant-access','disable-account','revoke-access',
'remove-member-from-group','update-license')
When I look at the execution plan, the Extra column says Using where; Using temporary; Using filesort.
So I optimized the query like this:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name,trat.name
order by null
This gives me an execution plan of Using where; Using temporary. So the filesort is no longer used and there is no extra overhead: the optimizer doesn't have to sort, as that is taken care of during the GROUP BY.
I then went ahead and created an index on the GROUP BY columns, in the same order as they appear in the GROUP BY (this is important, or the optimization won't happen), i.e. an index on (trat.asset_name, trat.name).
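A sketch of that statement (the index name is illustrative):
CREATE INDEX idx_asset_name_name
ON t_remote_agent_tasks (asset_name, name);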
Now this optimization leaves only Using where in the Extra column. The query execution time also dropped by almost half (earlier it was 0.568 sec and now 0.345 sec; it's not exact, as it varies from run to run, but it stays more or less in this range).
Now I want to optimize the range part of the query, shown below:
trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
I have been reading the MySQL reference guide on optimizing range queries, which says NOT IN is not a range condition, so I modified the query like this:
trat.status in ('completed','failedredundant')
and trat.name in ('add-member-to-group','change-attr','create-account',
'disable-account','grant-access', 'remove-member-from-group',
'update-license')
But it doesn't show any improvement in Extra (I mean Using index should be there; it is still showing Using where).
I also tried splitting both range parts into UNIONs (that changes the query result, but there was still no improvement in the execution plan).
I would like some help optimizing this query further, mostly the range (IN) part.
Are there any other optimizations I should make?
I appreciate your time. Thanks in advance.
EDIT 1: I forgot to mention that I also have an index on trat.status, so below are the indexes:
(trat.asset_name,trat.name)
(trat.status)
In virtually all cases, only one index is used per table in a SELECT. So, one must make the best one available.
Both of the first two queries will probably benefit most from the same 'composite' index:
INDEX(asset_name, name)
Normally, one would try to handle the WHERE conditions in the index, but they do not look amenable to an index. (More discussion below.) Second choice is the GROUP BY, which I am recommending. But, since (in the first case) the ORDER BY and the GROUP BY are different, there will necessarily be a tmp table created for the output of the GROUP BY so that it can be sorted according to the ORDER BY. (There may also be a tmp and sort for the GROUP BY; I cannot tell.)
"Using index" means that a "covering" index was used. A "covering" index is a composite index that includes all of the columns used anywhere in the SELECT. That would be about 5 columns, and probably not wise to attempt. (More below.)
Another thing to note that even something this simple:
WHERE x IN (11,22)
GROUP BY y
cannot use any index to handle both the WHERE and GROUP BY. So, there is no way for your query to consume both (except by 'covering').
A covering index, when used, is only partially useful. It says that all the work is done just in the BTree of the index. But that could include a full index scan -- which is not that much faster than a full table scan. This is another argument against recommending 'covering'.
In some situations, IN or OR can be sped up by turning it into UNION:
( SELECT ... WHERE status in ('completed') )
UNION ALL
( SELECT ... WHERE status in ('failedredundant') )
but this will only cause you to stumble into the NOT IN(...) clause, which is worse than an IN.
The goal of finding the best index is to find one that has the rows (in the index and/or in the table) consecutively sitting in the BTree.
To make any further improvements on this query will probably require re-thinking the schema -- it seems to be forcing you to have IN, NOT IN, FIELD and other hard-to-optimize constructs.
I'm dealing with a seemingly odd case where simply the amount of data being returned from a MySQL query causes an obvious index to be used or not used.
We have a table called "items" with an indexed column called type. Type is a tinyint(3), non-null value. The table is admittedly large both vertically and horizontally, and it has a rather long list of single-column indexes.
In many cases, selecting on this table while specifying type in the WHERE clause works as you'd expect, and uses the index on the type field.
EXPLAIN SELECT item.itemid, item.`type` FROM item WHERE item.`type` IN (1,40);
works just fine, for instance.
SIMPLE item range type type 2 null 1634830 Using where; Using index
However, add one more unrelated return field, and suddenly it no longer uses the index.
EXPLAIN SELECT item.itemid, item.`type`,item.dir FROM item WHERE item.`type` IN (1,40);
1 SIMPLE item ALL type null null null 3514503 Using where
The dir field isn't terribly interesting: it's just a boolean, and doesn't even have an index on it. Using any other field has the same effect. Now, if I swap out type 40, which I know has a lot of records, for a type that has fewer records, indexing once again works properly.
EXPLAIN SELECT item.itemid, item.`type`,item.dir FROM item WHERE item.`type` IN (1,2);
1 SIMPLE item range type type 2 79812 Using index condition
I realize that MySQL's optimizer isn't perfect, but it doesn't seem like this additional data should make the logic any different.
It's almost as if there's some kind of memory problem, whereby the additional data doesn't leave enough memory for MySQL to do its job.
Any thoughts are greatly appreciated.
MySQL is behaving reasonably -- perhaps not optimally, but reasonably.
When faced with one of your queries, the optimizer has basically two choices.
First, it can scan the index to find the relevant records. Then it may need to look up additional columns in the data pages (this is true of dir; it may also be true of itemid, depending on whether it is the primary key).
Second, it can scan the data pages, apply the where clause and pull the data that it needs directly from the data pages.
The balance between doing an index scan and then looking up information in the datapages versus simply scanning the data pages is a subtle one. Small differences in the query might make one preferable to the other.
You can probably force index usage by creating a covering index. Such an index contains all the columns of the query. For these queries, that would be type, itemid, dir.
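A sketch of such an index (the name is illustrative):
ALTER TABLE item ADD INDEX idx_type_itemid_dir (`type`, itemid, dir);
Note that in InnoDB the primary key is implicitly appended to every secondary index, so if itemid is the primary key, an index on (`type`, dir) would already be covering; listing it explicitly is harmless either way.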