If I have a query like:
SELECT EmployeeId
FROM Employee
WHERE EmployeeTypeId IN (1, 2, 3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of work to try to optimize how queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in Query Analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1, 2, 3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved of late in ways I can't imagine, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each value in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
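Using the query from the question, that OR-ing is effectively:

SELECT EmployeeId
FROM Employee
WHERE EmployeeTypeId = 1
   OR EmployeeTypeId = 2
   OR EmployeeTypeId = 3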
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the subquery in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
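As a sketch of what I mean (the EmployeeType table and its IsActive column are invented for illustration):

-- Subquery in the IN clause:
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId
                           FROM EmployeeType t
                           WHERE t.IsActive = 1)

-- Usually better restated as a join:
SELECT e.EmployeeId
FROM Employee e
INNER JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsActive = 1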
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be as obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again in PostgreSQL, the ILIKE operator is like this).
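As a hypothetical PostgreSQL sketch (the items table, its integer qty column, and the index are all made-up names):

-- CREATE TABLE items (qty integer);
-- CREATE INDEX idx_items_qty ON items (qty);
EXPLAIN SELECT * FROM items WHERE qty = 10;            -- can use idx_items_qty
EXPLAIN SELECT * FROM items WHERE qty = 10.0::float8;  -- the float comparison may force a scan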
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
#Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds));
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (EmployeeTypeId)
This query searches using the index you have created. It works for me. Please give it a try.
The problem is that I need to do pagination. I want to use ORDER BY and LIMIT. But my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Assuming, of course, that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
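For the example above, that just means making the paging order explicit (same hypothetical table and :x: placeholder):

select *
from TABLE
where ID >= :x:
order by ID
limit 100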
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown on two different pages, because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, since both orderings complied with the requested condition.
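A minimal sketch of the fix, with made-up table and key names: add the primary key as a tie-breaker so the ordering is deterministic and no row can appear on two pages:

SELECT * FROM items ORDER BY type, id LIMIT 10;
SELECT * FROM items ORDER BY type, id LIMIT 10, 10;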
My thinking is that if I put the ANDs that filter out a greater number of rows before those that filter out just a few, my query should run quicker, since the selection set is much smaller between the AND conditions.
But does the order of the ANDs in the WHERE clause of an SQL statement really affect performance that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
Most SQL engines are optimized to do this work for you. However, I have found situations in which trying to carve down the largest table first can make a big difference - it doesn't hurt!
A lot depends how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter what order the conditions are listed.
If none of the conditions apply, the order the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query, the time spent fetching will likely dwarf the time spent evaluating the conditions.
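As a sketch of the first scenario, with made-up names: one composite index lets the optimizer answer both conditions with a single index search, regardless of the order they're written in:

CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);

SELECT order_id
FROM orders
WHERE customer_id = 42
  AND status = 'OPEN';  -- swapping the two conditions changes nothing here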
I was wondering if there is a performance gain between a SELECT query with a not very specific WHERE clause and another SELECT query with a more specific WHERE clause.
For instance is the query:
SELECT * FROM table1 WHERE first_name='Georges';
slower than this one:
SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
In other words, is there a time factor that is linked to the precision of the WHERE clause?
I'm not sure I'm being very clear, or even whether my question takes into account all the components involved in a database query (MySQL in my case).
My question is related to the Django framework because I would like to cache an evaluated queryset, and on a next request, take back this cached-evaluated queryset, filter it more, and evaluate it again.
There is no hard and fast rule about this.
There can be either an increase or decrease in performance by adding more conditions to the WHERE clause, as it depends on, among other things, the:
indexing
schema
data quantity
data cardinality
statistics
intelligence of the query engine
You need to test with your data set and determine what will perform the best.
MySQL must compare all columns in your WHERE clause (if they are all joined by AND).
So if you don't have an index on the nickname column, the second query will be slightly slower.
Here you can read how column indexes work (with examples similar to your question): http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
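As a sketch, a composite index that covers both AND-ed columns (the index name is made up):

CREATE INDEX idx_first_name_nickname ON table1 (first_name, nickname);  -- lets MySQL resolve both conditions from one index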
I think it is difficult to answer this question; too many aspects (e.g. indexes) are involved. I would say that the first query is faster than the second one, but I can't say for sure.
If this is crucial for you, why don't you run a simulation (e.g. run 1,000,000 queries) and check the time?
Yes, it can be slower. It will all depend on indexes you have and data distribution.
Check the link Understanding the Query Execution Plan for information on how to know what MySQL is going to do when executing your query.
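For example, prefixing either query from the question with EXPLAIN shows the chosen plan:

EXPLAIN SELECT * FROM table1 WHERE first_name='Georges';
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';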
I have a table with 510,085 rows, which is now pushing me to seek higher performance. One of the fields in this table is called 'photoStatus'.
In 'photoStatus', 510,045 rows contain the word 'Active' and the remaining 40 contain the word 'Suspended'.
Which of these two queries would be faster to search for 'Active' photos or doesn't it matter?
WHERE photoStatus = 'Active'
Or
WHERE photoStatus <> 'Suspended'
Obviously this is part of a massive query, it's not just one WHERE condition.
Database is MySQL (MyISAM)
Why not convert the column to a boolean or a numeric value, which would be much faster than a string compare? Then you could just do:
....
WHERE isActive;
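A sketch of that conversion, assuming the table is called photos (the column names besides photoStatus are made up):

ALTER TABLE photos ADD COLUMN isActive TINYINT(1) NOT NULL DEFAULT 1;
UPDATE photos SET isActive = (photoStatus = 'Active');  -- backfill from the string column
SELECT COUNT(*) FROM photos WHERE isActive = 1;         -- now a cheap integer comparison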
If you have an index on that column, WHERE photoStatus = 'Active' will be faster, since the server can just scan the range in the index matching 'Active'.
The second will be "a little" faster because it will not require comparing the whole string: a first-character comparison is enough to include a result, according to the database's comparison algorithm.
Be sure to use an index on that field. And EXPLAIN the query to see how efficient your query is.
Other than that, your query would filter out just 40 rows, so the rest of the query has to be efficient.
It's going to have to index the table and rows either way.
Personally, I would always match. Use equals:
WHERE photoStatus = 'Active'
And I would always use an int or boolean rather than matching a string.
A normal index won't help in this scenario, since the percentage of actually returned rows is too large.
So the database will have to look at each row. There might be some difference, depending on how fast an equal vs. not-equal comparison is, but that should be negligible.
So I expect the results to be pretty much the same speed.
You have posted too few details to find a shortcut for your query.
As it appears, you need a full scan. In this case you can try to read the table in parallel.
I don't know what DBMS you use, but in Oracle you can use a hint: SELECT /*+ PARALLEL(yourtable 8) */ * FROM yourtable
What are you trying to do with this data? What types of queries are slow? Can you give an example? There are many tricks, and many mistakes you can make. And not all queries need to be fast: if they are for a UI, they must respond in under a second, but an admin task may take a minute :)
WHERE photoStatus = 'Active' is better if you have an index on that column, based on small tests similar to your example.
I compared the execution plans in SQL Server; the shorter plan belongs to the equality comparison and reports better performance. If you don't have an index, the query costs are similar.
Firstly, .5M rows is not a large table - by ANY means.
A column like "Active" / "Inactive" is likely to be pretty useless as an index by itself, because it doesn't have enough selectivity to make an index scan beneficial (in fact, if it matches 50% of the rows in the table, a table scan would probably be better).
I suspect that in fact, "Active" has nothing to do with your problem - after all, you're not trying to return .5M rows to the client are you?
A query which returns .5M rows is not going to be fast, because just returning the rows takes a (relatively) long time.
Anyway my answer: It makes no difference, you need to check the other parts of your query. Post a question with the full query, table structure and explain output.
I have this MySQL query and I am not sure what the implications of indexing all the fields in the query are. I mean, is it OK to index all the fields in the CASE statement, the JOIN statement and the WHERE statement? Are there any performance implications of indexing fields?
SELECT roots.id as root_id, root_words.*,
CASE
WHEN root_words.title LIKE '%text%' THEN 1
WHEN root_words.unsigned_title LIKE '%normalised_text%' THEN 2
WHEN unsigned_source LIKE '%normalised_text%' THEN 3
WHEN roots.root LIKE '%text%' THEN 4
END as priorities
FROM roots INNER JOIN root_words ON roots.id=root_words.root_id
WHERE (root_words.unsigned_title LIKE '%normalised_text%') OR (root_words.title LIKE '%text%')
OR (unsigned_source LIKE '%normalised_text%') OR (roots.root LIKE '%text%') ORDER BY priorities
Also, How can I further improve the speed of the query above?
Thanks!
You index columns in tables, not queries.
None of the search criteria you've specified will be able to make use of indexes (since the search terms begin with a wild card).
You should make sure that the id column is indexed, to speed the JOIN. (Presumably, it's already indexed as a PRIMARY KEY in one table and a FOREIGN KEY in the other).
To speed up this query you will need to use full text search. Adding indexes will not speed up this particular query and will cost you time on INSERTs, UPDATEs, and DELETEs.
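A minimal sketch of that approach in MySQL, assuming the searched columns live in root_words (MyISAM supports FULLTEXT natively; InnoDB only gained it in MySQL 5.6):

ALTER TABLE root_words ADD FULLTEXT INDEX ft_root_words (title, unsigned_title);

SELECT * FROM root_words
WHERE MATCH(title, unsigned_title) AGAINST ('text');  -- can use the full-text index, unlike LIKE '%text%'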
Caveat: Indexes speed up retrieval time but cause inserts and updates to run slower.
To answer the implications of indexing every field, there is a performance hit when using indexes whenever the data that is indexed is modified, either through inserts, updates, or deletes. This is because SQL needs to maintain the index. It's a balance between how often the data is read versus how often it is modified.
In this specific query, the only index that could possibly help would be in your JOIN clause, on the fields roots.id and root_words.root_id.
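A sketch of the latter, in case it's missing (roots.id is presumably already covered by its primary key):

CREATE INDEX idx_root_words_root_id ON root_words (root_id);  -- speeds the INNER JOIN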
None of the checks in your WHERE clause could be indexed, because of the leading '%'. This causes SQL to scan every row in these tables for a matching value.
If you are able to remove the leading '%', you would then benefit from indexes on these fields... if not, you should look into implementing full-text search; but be warned, this isn't trivial.
Indexing won't help when used in conjunction with LIKE '%something%'.
It's like looking for words in a dictionary that have ae in them somewhere. The dictionary (or Index in this case) is organised based on the first letter of the word, then the second letter, etc. It has no mechanism to put all the words with ae in them close together. You still end up reading the whole dictionary from beginning to end.
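In SQL terms, with a made-up words table:

SELECT * FROM words WHERE word LIKE 'ae%';   -- anchored prefix: an index range scan is possible
SELECT * FROM words WHERE word LIKE '%ae%';  -- leading wildcard: the whole "dictionary" is read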
Indexing the fields used in the CASE clause will likely not help you. Indexing helps by making it easy to find records in a table. The CASE clause is about processing the records you have found, not finding them in the first place.
Optimisers can also struggle with optimising multiple unrelated OR conditions such as yours. The optimiser is trying to narrow down the amount of effort to complete your query, but that's hard to do when unrelated conditions could make a record acceptable.
All in all, your query would benefit from indexes on root_words(root_id) and/or roots(id), but not much else.
If you were to index additional fields though, the two main costs are:
- Increased write time (insert, update or delete) due to additional indexes to write to
- Increased space taken up on the disk