Order of composite where clause (MySQL, Postgres)

I have a table t with columns a int, b int, c int and a composite index i (b, c). I fetch some data with the following query:
select * from t where c = 1 and b = 2;
So the question is: will MySQL and Postgres use the index i? And, more generally: does the order of the conditions in a composite where clause affect whether an index can be used?

What you need to do is use EXPLAIN in both to see what's going on. If the plan says it's using an index, then it is. One caveat: on a small table with minimal data, it's very likely that PostgreSQL (and probably MySQL) will ignore the indexes in favor of a scan. To get a realistic result, insert quite a bit of dummy data (at least 20 rows; I always do about 500) and be sure to analyze the table. Also, realize that if the search criteria match a large percentage of the table, the planner will likely not use the index either (as a scan will be faster).
create table
generate data (perhaps using generate_series)
run explain select * from t where c=1 and b=2
create the index: create index on t(b,c)
analyze the table: analyze t
run explain select * from t where c=1 and b=2 and compare with first run
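The steps above can be sketched end to end. This is only a minimal illustration that uses Python's built-in sqlite3 module as a stand-in, so the planner, the EXPLAIN QUERY PLAN output, and the dummy data (5000 rows, chosen arbitrarily) are assumptions and will not match what MySQL or PostgreSQL print. The workflow is the same, though: explain, add the index, analyze, explain again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT)")
# Dummy data: 5000 rows with 50 distinct b values and 7 distinct c values.
conn.executemany(
    "INSERT INTO t (a, b, c) VALUES (?, ?, ?)",
    [(n, n % 50, n % 7) for n in range(5000)],
)

QUERY = "SELECT * FROM t WHERE c = 1 AND b = 2"


def plan(sql):
    """Return the detail strings reported by SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan(QUERY)                         # no index exists yet
conn.execute("CREATE INDEX i ON t (b, c)")   # the composite index
conn.execute("ANALYZE")                      # refresh planner statistics
after = plan(QUERY)                          # same query, index now available

print("before:", before)
print("after: ", after)
```

In SQLite the first plan is a full scan and the second one searches index i, even though the query lists c before b. Run the equivalent statements with EXPLAIN on your own MySQL or PostgreSQL instance to see what those planners decide.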
Hopefully this will help answer this and other questions you might have in the future about when indexes will be used. To answer your original question, though: yes, in general PostgreSQL will use the index, regardless of the order of the conditions, if the optimizer determines that to be the best way to get your results. Remember to analyze your table, though, so the optimizer has an idea of what data is in it, and analyze it again any time a lot of data is added or deleted. Depending on your PG version and settings, some of this may be done automatically for you, but it won't hurt to analyze manually, especially when testing this kind of thing.
Edit: the index order may affect the order of the results of your query, especially if you don't use an ORDER BY and the optimizer uses the index: the returned rows may come back in index order.

No, the order doesn't matter.
The optimizer does a lot of smart things to perform a query in the most efficient way.

Related

Can I use index in MySQL in this way? [duplicate]

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each value in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
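A small check of the "IN is an OR of equalities" point, using Python's built-in sqlite3 module purely as a stand-in. The table contents are made up for illustration, and SQL Server's optimizer and plan output are different, so treat this as a sketch of the idea rather than a statement about SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, EmployeeTypeId INT)"
)
# Hypothetical data: 10,000 employees spread evenly over 100 type ids.
conn.executemany(
    "INSERT INTO Employee (EmployeeTypeId) VALUES (?)",
    [(n % 100,) for n in range(10000)],
)
conn.execute("CREATE INDEX Index_EmployeeTypeId ON Employee (EmployeeTypeId)")
conn.execute("ANALYZE")

IN_QUERY = "SELECT EmployeeId FROM Employee WHERE EmployeeTypeId IN (1, 2, 3)"
OR_QUERY = (
    "SELECT EmployeeId FROM Employee "
    "WHERE EmployeeTypeId = 1 OR EmployeeTypeId = 2 OR EmployeeTypeId = 3"
)

# Both formulations return the same set of rows.
in_rows = sorted(conn.execute(IN_QUERY).fetchall())
or_rows = sorted(conn.execute(OR_QUERY).fetchall())

# Ask SQLite how it would execute the IN query.
in_plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + IN_QUERY)]

print(len(in_rows), in_rows == or_rows)
print(in_plan)
```

Here only 3% of the rows match, and SQLite reports a search on Index_EmployeeTypeId. Whether a cost-based optimizer such as SQL Server's still does so when most of the table matches is exactly the distribution question raised in this thread.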
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used depends less on the type of query than on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
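For readers without NHibernate at hand, the same kind of IN query can be built from a list of ids in a few lines. This is a hypothetical sketch using Python's built-in sqlite3 module with made-up rows; it is not how NHibernate builds its SQL internally, only an illustration of a list of values turning into an IN list at runtime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employeeid INTEGER PRIMARY KEY, name TEXT)")
# Made-up rows; id 32523 is deliberately absent.
conn.executemany(
    "INSERT INTO employee (employeeid, name) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (5, "c"), (23463, "d")],
)

employee_ids = [1, 5, 23463, 32523]

# One "?" placeholder per id, so the values are bound rather than pasted in.
placeholders = ", ".join("?" for _ in employee_ids)
sql = f"SELECT * FROM employee WHERE employeeid IN ({placeholders})"
rows = conn.execute(sql, employee_ids).fetchall()

print(sql)
print(sorted(rows))
```

Because the list is only known at runtime, the number of values (and therefore how selective the IN list is) can differ on every call, which is why the plan cannot be predicted ahead of time.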
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
In MySQL, this query will search using the index you have created. It works for me. Please give it a try.

Configure better index choices for MySQL

I have a MySQL table with ~17M rows where I end up doing a lot of aggregation queries.
For this example, let's say I have index_on_b, index_on_c, compound_index_on_a_b, compound_index_on_a_c
I run EXPLAIN on a query:
EXPLAIN SELECT SUM(revenue) FROM table WHERE a = some_value AND b = other_value
And I find that the selected index is index_on_b, but when I use a query hint
SELECT SUM(revenue) FROM table USE INDEX(compound_index_on_a_b)
The query runs way way faster. Is there anything I can do in MySQL config to make MySQL choose the compound indexes first?
There are 2 possible routes you can take:
A) When the optimizer considers several indexes to be equally good, it resolves the tie based on the order in which the indexes were created. You could drop index_on_b and recreate it, then check whether the optimizer was in a scenario where it just thought the indexes were equivalent.
Or
B) Use optimizer_search_depth (see https://mariadb.com/blog/setting-optimizer-search-depth-mysql). By altering this parameter you determine how much effort the optimizer is allowed to spend on a query plan, and it might come up with the much better solution of using the combined index.
A possible explanation:
If a has the same value throughout the table, then INDEX(b) is actually better than INDEX(a,b). This is because the former is smaller, hence faster to work with. Note that both will return the same number of rows, even without further checking of a.
Please provide:
SHOW CREATE TABLE
SHOW INDEXES -- to see cardinality
EXPLAIN SELECT
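The "possible explanation" above is easy to check on a copy of the data. The sketch below uses Python's built-in sqlite3 module and a hypothetical table named sales in which a is constant; COUNT(DISTINCT ...) plays roughly the role of the cardinality column that SHOW INDEXES reports in MySQL (MySQL's figure is an estimate, this one is exact).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (a INT, b INT, revenue INT)")
# Hypothetical data: a is identical in every row, b has 100 distinct values.
conn.executemany(
    "INSERT INTO sales (a, b, revenue) VALUES (?, ?, ?)",
    [(7, n % 100, n) for n in range(10000)],
)


def distinct_count(column):
    """Exact number of distinct values in a column of the sales table."""
    sql = f"SELECT COUNT(DISTINCT {column}) FROM sales"
    return conn.execute(sql).fetchone()[0]


rows_b_only = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE b = 42"
).fetchone()[0]
rows_a_and_b = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE a = 7 AND b = 42"
).fetchone()[0]

print("distinct a:", distinct_count("a"), "distinct b:", distinct_count("b"))
print("rows for b only:", rows_b_only, "rows for a and b:", rows_a_and_b)
```

When a has a single distinct value, adding it to the filter (or to the index) does not narrow the result at all, so the smaller single-column index has nothing to lose. If a turns out to have many distinct values in your table, this explanation does not apply.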

Does index work when query like 'where created_at > ?'

I am using PostgreSQL and need to run a query like 'WHERE created_at > ?'. I am not sure whether an index works for such a query.
I have done an experiment. After adding an index on the created_at column, I explained the following 2 queries.
1)
EXPLAIN SELECT * FROM categories WHERE created_at > '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
------------------------------------------------------------------------------------
Seq Scan on categories (cost=0.00..11.75 rows=47 width=528)
Filter: (created_at > '2014-05-03 21:34:27.427505'::timestamp without time zone)
2)
EXPLAIN SELECT * FROM categories WHERE created_at = '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using index_categories_on_created_at on categories (cost=0.14..8.16 rows=1 width=528)
Index Cond: (created_at = '2014-05-03 21:34:27.427505'::timestamp without time zone)
Note that the first one is using 'Filter' while the second one is using 'Index Cond'. According to the PostgreSQL docs, the former is just a row-by-row scan while the latter uses the index.
Does this indicate that a query like 'created_at > ?' will not be sped up by adding an index on the 'created_at' column?
Update
I am using Rails 4.0, and according to the console, the index is created by
CREATE INDEX "index_categories_on_created_at" ON "categories" ("created_at")
Indexes on timestamps are normally responsive to range queries, that is, >, <, between, <=, etc. However, as univerio points out, selectivity and cost estimation play a strong role.
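To see that a B-tree index can serve a range predicate at all, here is a minimal sketch with Python's built-in sqlite3 module. It is a stand-in only: the table contents are invented, timestamps are stored as text, and SQLite's planner makes different cost decisions than PostgreSQL, so it says nothing about which plan PostgreSQL will pick for the table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT)"
)
# Invented data: 2000 rows spread over 28 days in May 2014.
conn.executemany(
    "INSERT INTO categories (name, created_at) VALUES (?, ?)",
    [(f"cat{n}", f"2014-05-{1 + n % 28:02d} 12:00:00") for n in range(2000)],
)
conn.execute(
    'CREATE INDEX "index_categories_on_created_at" ON "categories" ("created_at")'
)
conn.execute("ANALYZE")


def plan(op):
    """Plan details for a created_at comparison using the given operator."""
    sql = f"SELECT * FROM categories WHERE created_at {op} '2014-05-27 00:00:00'"
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


range_plan = plan(">")
equality_plan = plan("=")
print(range_plan)
print(equality_plan)
```

Both plans search the index here, which shows that the > operator itself is not the obstacle. In PostgreSQL the deciding factor is the estimated cost, as the rest of this answer explains.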
PostgreSQL is only going to use an index if it thinks using the index is going to be faster than not using it (for that matter, it tries to pick the fastest index to use if several are available). How much of the table are those 47 rows it expects to get back from the > query? If the answer is "10% of the table" then Postgres is not going to bother with the index. For that matter, the query planner rarely uses indexes for scans of really small tables, because if your whole table fits on 3 data pages, it's faster to scan the entire table.
You can easily play with this if you want.
1) Use EXPLAIN ANALYZE instead of just EXPLAIN so you can compare what the query planner expected vs. what it actually got.
2) Turn off and on index and table scanning with any of these statements:
SET enable_seqscan = false; --turns off table scans
SET enable_indexscan = false; -- turns off index scans
SET enable_bitmapscan = false; -- turns off bitmap index scans
If you play around, you can see where using an index is actually slower.
Using an index means reading the index plus reading the selected rows from the table. There is a trade-off in that it can be more efficient simply to read only the table. The algorithms used by a DBMS to choose which is better for any given query are usually pretty good (though not perfect).
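The trade-off can be made concrete with a deliberately crude cost comparison. The two constants below are PostgreSQL's default seq_page_cost and random_page_cost settings; everything else (the function name, the formula, the sample numbers) is a simplifying assumption for illustration. The real planner also accounts for CPU costs, index pages, caching, and how well the index order matches the table order.

```python
# PostgreSQL's default planner cost settings.
SEQ_PAGE_COST = 1.0      # cost of reading one page sequentially
RANDOM_PAGE_COST = 4.0   # cost of reading one page at a random position


def cheaper_plan(table_pages, matching_rows):
    """Toy comparison: scan every page once, or fetch each match randomly.

    Assumes the worst case for the index: every matching row lives on a
    different page and costs one random read. Index pages are ignored.
    """
    seq_cost = table_pages * SEQ_PAGE_COST
    index_cost = matching_rows * RANDOM_PAGE_COST
    return "index scan" if index_cost < seq_cost else "seq scan"


# A tiny table that fits on 3 pages, with 47 matching rows.
print(cheaper_plan(table_pages=3, matching_rows=47))
# A large table where only a handful of rows match.
print(cheaper_plan(table_pages=10_000, matching_rows=5))
# The same large table where most rows match.
print(cheaper_plan(table_pages=10_000, matching_rows=900_000))
```

Even in this toy model the index only wins when the number of matching rows is small relative to the size of the table, which is the pattern described in the answers above.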
It's easily possible (and likely) that not using the index is the better choice for this query.
Following the suggestion from @Clockwork-Muse and @univerio to look at selectivity is generally a good idea, though it might not matter in this case due to the table size. You might also add an ORDER BY created_at to see if it affects the plan.
Experimentation (per @FuzzyChef) can help find the trade-off points. Use different table sizes and change other variables to see the results.

Does the order of indexes matter?

Let's say you have a table with columns A and B, among others. You create a multi-column index (A, B) on the table.
Does your query have to take the order of indexes into account? For example,
select * from MyTable where B=? and A in (?, ?, ?);
In the query we put B first and A second. But the index is (A, B). Does the order matter?
Update: I do know that the order of indexes matters significantly in terms of the leftmost prefix rule. However, does it matter which column comes first in the query itself?
In this case, no, but I recommend using the EXPLAIN keyword so you can see which optimizations MySQL will use (or not).
The order of columns in the index can affect the way the MySQL optimiser uses the index. Specifically, MySQL can use your compound index for queries on column A because it's the first part of the compound index.
However, your question refers to the order of column references in the query. Here, the optimiser will take care of the references appropriately, and the order is unimportant. The different clauses must come in a particular order to satisfy syntax rules, so you have little control anyway.
See the MySQL reference manual section on multiple-column indexes.
You can test specific queries if you think they are problems, but otherwise I wouldn't worry about this optimization. Your query will most likely be mangled from its original form by the query planner. That is to say, MySQL should do a good job of planning how it will use the indices to optimize speed. This may require the conditions to be in a different order, but I doubt it. If MySQL actually did have to reorder the conditions for optimization, it would be a very minor cost relative to the execution of the query (at least if the result set is large).
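The claim that predicate order in the query is irrelevant can be checked by explaining both orderings side by side. The sketch below does this with Python's built-in sqlite3 module as a stand-in; the index name idx_a_b, the payload column, and the data are hypothetical, and MySQL's EXPLAIN output looks different, but the same comparison works there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (A INT, B INT, payload TEXT)")
# Hypothetical data: 40 distinct A values, 25 distinct B values.
conn.executemany(
    "INSERT INTO MyTable (A, B, payload) VALUES (?, ?, ?)",
    [(n % 40, n % 25, "x") for n in range(4000)],
)
conn.execute("CREATE INDEX idx_a_b ON MyTable (A, B)")


def plan(where):
    """Plan details for SELECT * FROM MyTable with the given WHERE clause."""
    sql = "EXPLAIN QUERY PLAN SELECT * FROM MyTable WHERE " + where
    return [row[3] for row in conn.execute(sql)]


b_first = plan("B = 3 AND A IN (1, 2, 3)")   # order used in the question
a_first = plan("A IN (1, 2, 3) AND B = 3")   # order matching the index

print(b_first)
print(a_first)
```

Both orderings should report a search on idx_a_b. What does matter is the column order in the index definition, per the leftmost prefix rule mentioned in the question.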

MySQL: make a compound index of 3 fields, or make 3 separate indices?

I have a MySQL table that has, among other attributes, a timestamp, a type and a user_id.
All of them are searchable and/or sortable.
Is it better to create an index for each one, or create a single compound index with all three, or both?
If you are going to perform searches on those fields separately, you will probably need separate indexes to make your queries run faster.
If you have an index like this:
mysql> create index my_idx on my_table(tstamp, user_id, type);
And your query is:
mysql> select * from my_table where type = 'A';
Then my_idx won't be that helpful for your query and MySQL will end up doing a full table scan to resolve it.
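A quick way to see this leftmost-prefix effect is to explain a query that filters only on the last indexed column and compare it with one that filters on all three. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL; the id and payload columns are hypothetical additions so that the index does not cover the whole row, and the table is left empty because only the plans are of interest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table ("
    "id INTEGER PRIMARY KEY, tstamp INT, user_id INT, type TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX my_idx ON my_table (tstamp, user_id, type)")


def plan(where):
    """Plan details for SELECT * FROM my_table with the given WHERE clause."""
    sql = "EXPLAIN QUERY PLAN SELECT * FROM my_table WHERE " + where
    return [row[3] for row in conn.execute(sql)]


# Filters only on the third index column: the index prefix is not usable.
type_only = plan("type = 'A'")
# Filters on every index column: the whole index can be used.
all_three = plan("tstamp = 1 AND user_id = 2 AND type = 'A'")

print(type_only)
print(all_three)
```

In this setup the first query falls back to a table scan and the second one searches my_idx. Some engines can occasionally "skip" over leading index columns when the statistics favour it, so check EXPLAIN on your own server rather than relying on this sketch.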
Pablo's answer is correct, but it may leave you failing to realize that a compound index might still be justified.
You can have multiple indexes, and having idx1(tstamp, user_id) does not prevent you from also having idx2(tstamp, type) or idx1reverse(user_id, tstamp) and so on...
Compound indexes are most useful when they cover all of the conditions in your query, so the index you propose will be most useful for
SELECT * FROM my_table WHERE tstamp = @ts1 AND user_id = @uid AND type = @type
If you want to improve performance of such queries you can consider adding a composite index.
The downside of indexes is that they slow down all update operations. However, most general applications do many more selects than updates (both in terms of transactions, i.e. number of statements, and especially in terms of records affected/retrieved) and at the same time are much more tolerant of slower updates (users mostly judge the speed of the system not by the time it takes to update a record, but by the time it takes to retrieve records; again, YMMV, and there are applications that don't play by such rules).
The best would be if you had some way to test the database performance in terms of typical workloads (create some typical SQL scripts; independent and repeatable, or create unit tests at the application level) and then you can objectively tune your database.
EDIT
Also realize that indexes can be added and dropped without affecting the system in terms of functionality. Therefore, you can tune your indexes later, during actual usage of the system - and normally you would collect and profile the slow SQL queries looking for conditions that could benefit from adding indexes.