Does index work when query like 'where created_at > ?' - mysql

I am using PostgreSQL and need to run a query like 'WHERE created_at > ?'. I am not sure whether an index is used for such a query.
I ran an experiment. After adding an index on the created_at column, I ran EXPLAIN on the following two queries.
1)
EXPLAIN SELECT * FROM categories WHERE created_at > '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
------------------------------------------------------------------------------------
Seq Scan on categories (cost=0.00..11.75 rows=47 width=528)
Filter: (created_at > '2014-05-03 21:34:27.427505'::timestamp without time zone)
2)
EXPLAIN SELECT * FROM categories WHERE created_at = '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using index_categories_on_created_at on categories (cost=0.14..8.16 rows=1 width=528)
Index Cond: (created_at = '2014-05-03 21:34:27.427505'::timestamp without time zone)
Note that the first one uses 'Filter' while the second one uses 'Index Cond'. According to the PostgreSQL docs, the former is a row-by-row scan while the latter uses the index.
Does this mean that a query like 'created_at > ?' will not be sped up by adding an index on the 'created_at' column?
Update
I am using Rails 4.0, and according to the console, the index is created by
CREATE INDEX "index_categories_on_created_at" ON "categories" ("created_at")

Indexes on timestamps normally work for range queries, that is, >, <, BETWEEN, <=, etc. However, as univerio points out, selectivity and cost estimation play a strong role.
PostgreSQL is only going to use an index if it thinks using the index is going to be faster than not using it (for that matter, it tries to pick the fastest index to use if several are available). How much of the table are those 47 rows it expects to get back from the > query? If the answer is "10% of the table" then Postgres is not going to bother with the index. For that matter, the query planner rarely uses indexes for scans of really small tables, because if your whole table fits on 3 data pages, it's faster to scan the entire table.
You can easily play with this if you want.
1) Use EXPLAIN ANALYZE instead of just EXPLAIN so you can compare what the query planner expected vs. what it actually got.
2) Discourage index and table scanning with any of these statements (the planner then avoids that scan type whenever an alternative exists):
SET enable_seqscan = false; -- discourages sequential (table) scans
SET enable_indexscan = false; -- discourages index scans
SET enable_bitmapscan = false; -- discourages bitmap index scans
If you play around, you can see where using an index is actually slower.
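The sequential-scan estimate in the question's first plan (cost=0.00..11.75) can be roughly reproduced by hand. The sketch below uses PostgreSQL's default cost settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025). The page and row counts (10 pages, 140 rows) are assumptions picked because they are consistent with the plan shown; they are not stated in the question. It is a simplification of the planner's arithmetic, not its implementation.

```python
# Default PostgreSQL planner cost settings.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def seq_scan_cost(pages, rows):
    """Simplified cost of a sequential scan that applies one filter condition."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * (CPU_TUPLE_COST + CPU_OPERATOR_COST)
    return io_cost + cpu_cost


# Assumed table size: 10 pages, 140 rows (hypothetical, see the note above).
estimate = seq_scan_cost(pages=10, rows=140)
print(f"{estimate:.2f}")  # prints 11.75
```

Since the planner chose this scan for the 47 rows it expected from the range query, it must have estimated the index path as more expensive than 11.75; on a table this small that is unsurprising.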

Using an index means reading the index plus reading the selected rows from the table. There is a trade-off in that it can be more efficient simply to read only the table. The algorithms used by a DBMS to choose which is better for any given query are usually pretty good (though not perfect).
It's easily possible (and likely) that not using the index is the better choice for this query.
Using the @Clockwork-Muse and @univerio suggestion for selectivity is generally a good idea, though it might not matter in this case due to table size. You might also use an ORDER BY created_at to see if it affects the plan.
Experimentation (per @FuzzyChef) can help find trade-off points. Use different table sizes and change other variables to see results.

Related

MySQL: giving the query hints about sort order

I'm not clear how to search for this, if there's a duplicate feel free to point me to it.
I'm wondering if there's a way to tell mysql that something will be sorted before filtering so that it can perform the filter with a binary search instead of a linear search.
For example, consider a table with columns id, value, and created_at
id is an auto-increment and created_at is a timestamp field with default of CURRENT_TIMESTAMP
then consider the following query:
SELECT *
FROM `table`
WHERE created_at BETWEEN '2022-10-05' AND '2022-10-06'
ORDER BY id
Because I have context on the data that mysql doesn't, namely that if id is sorted then created_at will also be sorted, I can conclude that we can binary search on created_at. However mysql does a full table scan for the filter as it's unaware of, or unwilling to assume this fact. The explain on the query on my test dataset shows that it's scanning all 50 rows to return the 24 that match the filter, when it's possible to do it by only scanning approximately log2(50) rows. This isn't a huge difference for my test dataset but on larger data it can have an impact.
I'll note that the obvious answer here is to add an index on created_at, but on more real life queries that's not always possible. For example if you were filtering on another indexed column it wouldn't be able to use that created_at index, but we might still be able to make assumptions about the ordering based on other order bys.
Anyway, after all that setup my question is: Is there a way that I can tell MySQL that I know that this data is already sorted so that it need not perform a table scan? Something similar to FORCE INDEX that can be used to overwrite the behaviour of picking an index for a query
The answer is no in general.
InnoDB queries read rows in order by the index used. So in your case if there's an index on created_at, it'll read rows in that order, ascending. There's no way to tell the optimizer that this matches the order of id too, and the optimizer won't make that assumption.
So the bottom line is that it'll have to perform a filesort on the matching rows to guarantee they're in order by id.
The comment above suggests ORDER BY created_at would solve the problem in the example you show. That is, if you know that the order of created_at is the same order as id, then just ORDER BY created_at, and the filesort can be skipped. That is, the optimizer knows the ORDER BY you requested is actually the order it read the rows from the index, so sorting is a no-op.
But I assume your example was only one case among many potential cases, and it might not be the right solution to use in other cases.
But why was the query doing a table-scan?
In the example you give of a table-scan of 50 rows, it's possible the optimizer decided not to use the index because the table-scan is so little work that the extra indirection of using the index isn't worth it. This is why we need to fill a table with at least a few hundred rows during testing, before we know if an index improves the query.
Another possible reason for the table-scan is that the range of dates covers a significantly large part of the table. In my experience, if the condition matches over 20% of the table, then the optimizer says, "meh, not worth the effort to use the index, so I'll just do a table-scan." That 20% figure is not an officially documented threshold, it's just my observation.
FORCE INDEX might help the latter case. FORCE INDEX really just tells the optimizer to assume a table-scan is infinitely costly, so if the named index is relevant at all to the search conditions, then use the index instead of a table-scan. It's possible though to use FORCE INDEX and name an index that is irrelevant to the search conditions. In that case, the optimizer still won't use the index, it'll just feel shame over having to do a table-scan.
"Because I have context on the data that mysql doesn't, namely that if id is sorted then created_at will also be sorted..."
Give MySQL that information: order by created_at, id and make sure created_at is indexed. This will allow MySQL to use the created_at index to filter and also do most of the ordering. If this is still slow, try adding a composite index of (created_at, id).
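What an index on (created_at, id) buys for the BETWEEN filter can be sketched in plain Python. A sorted list stands in for the index and bisect for the range lookup. This is an analogy under invented data (50 rows, five per day, created_at reduced to date strings), not a description of how InnoDB is implemented.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: 50 rows, ids 1..50, five rows per day, so created_at
# increases together with id (the property the question relies on).
rows = [(i, f"2022-10-{1 + (i - 1) // 5:02d}") for i in range(1, 51)]

# Stand-in for an index on (created_at, id): entries kept in sorted order.
index = sorted((created_at, row_id) for row_id, created_at in rows)


def ids_between(lo, hi):
    """Return ids with lo <= created_at <= hi using two binary searches."""
    start = bisect_left(index, (lo,))
    end = bisect_right(index, (hi, float("inf")))
    # Entries come back ordered by (created_at, id), so no extra sort is needed.
    return [row_id for _, row_id in index[start:end]]


matching = ids_between("2022-10-05", "2022-10-06")
print(matching)
```

Only the two boundary positions are searched for; the rows in between are read sequentially from the sorted structure, which is the behaviour the question was hoping to get from the table itself.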
Also, upgrade MySQL. 5.6 reached the end of its life last year. MySQL 8 is considerably better at optimizing queries.

Range query is not using indexes in mysql

I am trying to optimize a query that we use to generate reports.
I think I have managed to optimize it to some extent.
Below was the original query:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name, trat.name
order by field(trat.name,'create-account','change-attr',
'add-member-to-group',
'grant-access','disable-account','revoke-access',
'remove-member-from-group','update-license')
When I look at the execution plan, the Extra column says Using where; Using temporary; Using filesort.
So I optimized the query like this:
select trat.asset_name as group_name,trat.name as sub_group_name,
trat.asset_id as group_id,
if(trat.cause_task_type='AccessRequest',true,false) as is_request_task,
'' as grouped_on,
concat(trat.asset_name,' - {0} (',count(*),')') as table_heading
from t_remote_agent_tasks trat
where trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
group by trat.asset_name,trat.name
order by null
This gives an execution plan of Using where; Using temporary. So filesort is no longer used, and there is no extra overhead because the optimizer doesn't have to sort; the grouping takes care of it.
I then went further and created an index on the GROUP BY columns, in the same order as they appear in the GROUP BY (this is important, or the optimization won't happen), i.e. an index on (trat.asset_name, trat.name).
With this index, the Extra column shows only Using where. The query execution time was also reduced by almost half (from about 0.568 sec to about 0.345 sec; not exact, as it varies on every run, but more or less in this range).
Now I want to optimize the range part of the query, shown below:
trat.status in ('completed','failedredundant')
and trat.name not in ('collect-data','update-conn-params')
I am reading the MySQL reference manual on range optimization, which says that NOT IN is not a range condition, so I modified the query like this:
trat.status in ('completed','failedredundant')
and trat.name in ('add-member-to-group','change-attr','create-account',
'disable-account','grant-access', 'remove-member-from-group',
'update-license')
But it doesn't show any improvement in Extra (I expected Using index to appear, but it still shows Using where).
I also tried splitting both range parts into unions (that changes the query result, but there was still no improvement in the execution plan).
I would like some help optimizing this query further, mostly the range (IN) part.
Is there any other optimization I should make?
I appreciate your time. Thanks in advance.
EDIT 1: I forgot to mention that I also have an index on trat.status, so these are the indexes:
(trat.asset_name,trat.name)
(trat.status)
In virtually all cases, only one index is used in a SELECT. So the best one must be available.
Both of the first two queries will probably benefit most from the same 'composite' index:
INDEX(asset_name, name)
Normally, one would try to handle the WHERE conditions in the index, but they do not look amenable to an index. (More discussion below.) Second choice is the GROUP BY, which I am recommending. But, since (in the first case) the ORDER BY and the GROUP BY are different, there will necessarily be a tmp table created for the output of the GROUP BY so that it can be sorted according to the ORDER BY. (There may also be a tmp and sort for the GROUP BY; I cannot tell.)
"Using index" means that a "covering" index was used. A "covering" index is a composite index that includes all of the columns used anywhere in the SELECT. That would be about 5 columns, and probably not wise to attempt. (More below.)
Another thing to note is that even something this simple:
WHERE x IN (11,22)
GROUP BY y
cannot use any index to handle both the WHERE and GROUP BY. So, there is no way for your query to consume both (except by 'covering').
A covering index, when used, is only partially useful. It says that all the work is done just in the BTree of the index. But that could include a full index scan -- which is not that much faster than a full table scan. This is another argument against recommending 'covering'.
In some situations, IN or OR can be sped up by turning it into UNION:
( SELECT ... WHERE status in ('completed') )
UNION ALL
( SELECT ... WHERE status in ('failedredundant') )
but this will only cause you to stumble into the NOT IN(...) clause, which is worse than an IN.
The goal of finding the best index is to find one that has the rows (in the index and/or in the table) consecutively sitting in the BTree.
To make any further improvements on this query will probably require re-thinking the schema -- it seems to be forcing you to have IN, NOT IN, FIELD and other hard-to-optimize constructs.
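If the custom FIELD() ordering from the original query is still wanted after switching to ORDER BY NULL, one option is to apply it in the application instead. The following is a minimal Python sketch of that idea, not something the answer above proposes. It assumes the grouped rows arrive as dictionaries keyed by the query's column aliases; the sample rows and the "some-other-task" value are invented for illustration.

```python
# Custom order taken from the FIELD(...) list in the original query.
TASK_ORDER = [
    "create-account", "change-attr", "add-member-to-group", "grant-access",
    "disable-account", "revoke-access", "remove-member-from-group",
    "update-license",
]
# FIELD() is 1-based and returns 0 for values that are not in the list.
POSITION = {name: pos for pos, name in enumerate(TASK_ORDER, start=1)}


def field_rank(row):
    """Mimic FIELD(trat.name, ...) for one result row."""
    return POSITION.get(row["sub_group_name"], 0)


# Hypothetical grouped rows, as they might come back with ORDER BY NULL.
rows = [
    {"group_name": "ldap", "sub_group_name": "grant-access"},
    {"group_name": "ldap", "sub_group_name": "create-account"},
    {"group_name": "crm", "sub_group_name": "some-other-task"},
    {"group_name": "crm", "sub_group_name": "change-attr"},
]

ordered = sorted(rows, key=field_rank)
print([r["sub_group_name"] for r in ordered])
```

As with FIELD(), names that are not in the list get rank 0 and therefore sort first.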

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays user a set of data, a part of it actually. It also allows user to order it by a custom field. So in the end it all comes down to query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine as long as the table was relatively small (it was used only locally in our department). But now we have to scale this application. Let's assume the table has about a million records (we expect that to happen soon). What will happen with ordering? Do I understand correctly that, in order to run this query, MySQL will have to sort a million records each time and return just a part of them? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and don't let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order, I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
"In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. These cases include the following:
....
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;"
So yes, it does seem like mysql will have a problem with such a query? So, what do I do - don't use an order part at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
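The effect of such a composite index on the sort step can be tried out quickly with Python's built-in sqlite3 module. SQLite is only a stand-in here: its planner is not MySQL's, and the table contents are invented. The sketch compares the query plan before and after creating the (active, name) index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, active INTEGER, "
    "name TEXT, info TEXT, description TEXT)"
)
# Hypothetical data: 1000 rows, half of them active.
conn.executemany(
    "INSERT INTO mytable (active, name, info, description) VALUES (?, ?, ?, ?)",
    [(i % 2, f"name{i:04d}", "info", "description") for i in range(1000)],
)

QUERY = (
    "SELECT name, info, description FROM mytable "
    "WHERE active = 1 ORDER BY name LIMIT 50"
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(QUERY)  # no index yet: a separate sort step is needed
conn.execute("CREATE INDEX idx_test ON mytable (active, name)")
plan_after = plan(QUERY)   # the index supplies rows already ordered by name

print(plan_before)
print(plan_after)
```

In SQLite the separate sort shows up as a "USE TEMP B-TREE FOR ORDER BY" step that disappears once the index exists. In MySQL the analogous signal would be Using filesort in the Extra column of EXPLAIN, which has to be checked against the real server and data.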
Keep in mind though that there is no such a thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
The index will make your INSERT/UPDATE/DELETE statements (slightly) slower. Usually the effect is negligible, but only testing will show.
The index will require extra space in the database. Think of it as an additional (hidden) special table sitting next to your actual data. The index only holds the fields required plus the PK of the originating table, which usually is a lot less data than the entire table, but for 'millions of rows' it can add up.
If your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK values from the index first and then look up the other fields in the actual table by means of the PK. This is probably still (a lot) faster than not having the index, but keep it in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name, but from the text I gather that these might be 'dynamic', in which case you'd have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides above, and each time you add an index you incur them again (they are cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Run EXPLAIN on your query and check whether it uses filesort.
If the ORDER BY cannot use any index, or if the MySQL optimizer prefers not to use the existing index(es) for sorting, it falls back to filesort.
If you are getting filesort, then you should preferably either avoid the ORDER BY or create appropriate index(es).
If the data is small enough, the sort is done in memory; otherwise it goes to disk.
So you may also try changing the sort_buffer_size variable.
There are always trade-offs. One way to improve the performance of an ORDER BY query is to set the buffer size and then run the query:
set sort_buffer_size=100000;
If this size is increased further, performance will start to decrease.

Order of composite where clause (MySQL, Postgres)

I have a table t with columns a int, b int, c int; composite index i (b, c). I fetch some data with following query:
select * from t where c = 1 and b = 2;
So the question is: will MySQL and Postgres use the index i? And, more generally: does the query composite where clause order affect the possibility of index use?
What you need to do is use EXPLAIN in both to see what's going on. If it says it's using an index, then it is. One caveat is that in a small table with minimal data, it's very likely that PostgreSQL (and probably MySQL) will ignore the indexes in favor of a scan. To get a real result, insert quite a bit of dummy data (at least 20 rows, and I always do about 500) and be sure to analyze the table. Also, realize that if the search criteria will return a large percentage of the table, the index will likely not be used either (as a scan will be faster).
1) create table
2) generate data (perhaps using generate_series)
3) run explain select * from t where c=1 and b=2
4) create index: create index on t(b,c)
5) analyze table: analyze t
6) run explain select * from t where c=1 and b=2 and compare with first run
Hopefully this will help answer this, and other questions you might have in the future about when indexes will be used. To answer your original question though: yes, in general PostgreSQL will use the index, regardless of the order of the conditions, if the optimizer determines that to be the best way to get your results. Remember to analyze your table though, so the optimizer has an idea of what information is in your table, and analyze it any time a ton of data is added to or deleted from your table. Depending on your PG version and settings, some of this may be done automatically for you, but it won't hurt to analyze manually, especially when testing this kind of thing.
Edit: the index order may (especially if you don't use an ORDER BY in your query and the optimizer uses the index) affect the order of the results of your query: the returned rows may come back in index order.
No, the order doesn't matter.
Optimizer does a lot of smart things to perform a query in the most efficient way.
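The claim is easy to check on a scratch database. The sketch below uses Python's built-in sqlite3 module as a stand-in (neither MySQL nor Postgres, so it only illustrates the principle) with invented data, and compares the plans for both orderings of the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
# Hypothetical dummy data: 500 rows.
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(n, n % 20, n % 7) for n in range(500)],
)
conn.execute("CREATE INDEX i ON t (b, c)")


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_cb = plan("SELECT * FROM t WHERE c = 1 AND b = 2")
plan_bc = plan("SELECT * FROM t WHERE b = 2 AND c = 1")

print(plan_cb)
print(plan_bc)
```

Both statements produce the same plan, a search on index i using both columns. What matters is the order of the columns in the index, not the order of the conditions in the WHERE clause.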

MySQL: Optimize query with DISTINCT

In my Java application I have found a small performance issue, which is caused by such simple query:
SELECT DISTINCT a
FROM table
WHERE checked = 0
LIMIT 10000
I have index on the checked column.
In the beginning, the query is very fast (i.e. where almost all rows have checked = 0). But as I mark more and more rows as checked, the query becomes greatly inefficient (up to several minutes).
How can I improve the performance of this query ? Should I add a complex index
a, checked
or rather
checked, a?
My table has a lot of millions of rows, that is why I do not want to test it manually and hope to have lucky guess.
I would add an index on (checked, a). This means that the value you're returning has already been found in the index and there's no need to re-access the table to find it. Secondly, if you're doing lots of individual updates of the table, there's a good chance both the table and the index have become fragmented on disk. Rebuilding (compacting) the table and index can significantly increase performance.
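The "no need to re-access the table" part can be illustrated with Python's built-in sqlite3 module. SQLite is used purely as a stand-in for MySQL, and the table name items, the payload column and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" is a hypothetical name; the question's table is simply called "table".
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, a INTEGER, "
    "checked INTEGER, payload TEXT)"
)
# Hypothetical data: 5000 rows, one in ten still unchecked (checked = 0).
conn.executemany(
    "INSERT INTO items (a, checked, payload) VALUES (?, ?, ?)",
    [(n % 100, 0 if n % 10 == 0 else 1, "x") for n in range(5000)],
)
conn.execute("CREATE INDEX idx_checked_a ON items (checked, a)")

QUERY = "SELECT DISTINCT a FROM items WHERE checked = 0 LIMIT 10000"

# The last column of EXPLAIN QUERY PLAN holds the readable plan step.
details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
values = [row[0] for row in conn.execute(QUERY)]

print(details)
print(values)
```

SQLite reports a COVERING INDEX search here, meaning the query is answered from the index alone. In MySQL the corresponding hint is Using index in the Extra column of EXPLAIN, which should be confirmed on the real table.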
You can also use the query rewritten as (just in case the optimizer does not understand that it's equivalent):
SELECT a
FROM table
WHERE checked = 0
GROUP BY a
LIMIT 10000
Add an index on the DISTINCT column (a in this case). MySQL is able to use this index for the DISTINCT.
MySQL may also take advantage of a compound index on (a, checked) (the order matters: the DISTINCT column has to be at the start of the index). Try both and compare the results with your data and your queries.
(After adding this index you should see Using index for group-by in the EXPLAIN output.)
See GROUP BY optimization in the manual. (A DISTINCT is very similar to a GROUP BY.)
"The most efficient way to process GROUP BY is when an index is used to directly retrieve the grouping columns. With this access method, MySQL uses the property of some index types that the keys are ordered (for example, BTREE). This property enables use of lookup groups in an index without having to consider all keys in the index that satisfy all WHERE conditions."
"My table has a lot of millions of rows <...> where almost all rows have checked=0"
In this case it seems that the best index would be a simple (a).
UPDATE:
It was not clear how many rows get checked. From your comment below the question:
"At the beginning 0 is in 100% rows, but at the end of the day it will become 0%"
This changes everything. So @Ben has the correct answer.
I have found a completely different solution which would do the trick. I will simply create a new table with all possible unique "a" values. This will allow me to avoid DISTINCT.
You don't state it, but are you updating the index regularly? As changes occur to the underlying data, the index becomes less and less accurate and processing gets worse and worse. If you have an index on checked, and checked is being updated over time, you need to make sure your index is updated accordingly on a regular basis.