I was wondering if there is a performance difference between a SELECT query with a not very specific WHERE clause and another SELECT query with a more specific WHERE clause.
For instance, is the query:
SELECT * FROM table1 WHERE first_name='Georges';
slower than this one:
SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
In other words, is there a time factor that is linked to the precision of the WHERE clause?
I'm not sure I'm being very clear, or even whether my question takes into account all the components involved in a database query (MySQL in my case).
My question is related to the Django framework, because I would like to cache an evaluated queryset and, on a subsequent request, take back this cached, evaluated queryset, filter it further, and evaluate it again.
There is no hard and fast rule about this.
There can be either an increase or decrease in performance by adding more conditions to the WHERE clause, as it depends on, among other things, the:
indexing
schema
data quantity
data cardinality
statistics
intelligence of the query engine
You need to test with your data set and determine what will perform the best.
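As a starting point, you can ask MySQL what it plans to do for each version with EXPLAIN (the table and column names below are simply the ones from the question):
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges';
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
The key, rows and Extra columns of the output show which index (if any) is chosen and roughly how many rows MySQL expects to examine for each variant.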
The MySQL server must compare all the columns in your WHERE clause (if they are all joined by AND).
So if you don't have any index on the nickname column, the second query will be slightly slower.
Here you can read how column indexes works (with examples similar to your question): http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
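For example, assuming the table1 from the question, a composite index covering both columns lets MySQL resolve the two-condition query with a single index lookup (the index name idx_name_nickname is just illustrative):
CREATE INDEX idx_name_nickname ON table1 (first_name, nickname);
-- serves WHERE first_name = '...' on its own
-- as well as WHERE first_name = '...' AND nickname = '...'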
I think it is difficult to answer this question; too many aspects (e.g. indexes) are involved. I would say that the first query is faster than the second one, but I can't say for sure.
If this is crucial for you, why don't you run a simulation (e.g. run 1,000,000 queries) and check the time?
Yes, it can be slower. It will all depend on indexes you have and data distribution.
Check the link Understanding the Query Execution Plan for information on how to know what MySQL is going to do when executing your query.
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware putting a subquery in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
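As a rough sketch of that last point (the EmployeeType table and its IsActive column are made up for illustration):
-- Subquery form:
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId FROM EmployeeType t WHERE t.IsActive = 1);
-- Usually better restated as a join (assuming EmployeeTypeId is unique in EmployeeType):
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsActive = 1;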
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't vary so much with the type of query as with the type and distribution of data in the table(s), how up to date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will generally be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, a table scan may well be faster.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
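A minimal illustration of the datatype point, using a made-up PostgreSQL table (whether the planner actually skips the index depends on your DBMS and version, so treat this as something to verify with EXPLAIN rather than a guarantee):
CREATE TABLE prices (id serial PRIMARY KEY, amount integer);
CREATE INDEX idx_prices_amount ON prices (amount);
-- Comparing the integer column against a non-integer literal may force a cast and prevent index use:
EXPLAIN SELECT * FROM prices WHERE amount = 4.0;
-- A literal of the matching type keeps the comparison index-friendly:
EXPLAIN SELECT * FROM prices WHERE amount = 4;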
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (Index_EmployeeTypeId) WHERE EmployeeTypeId IN (1,2,3)
This query will search using the index you have created (note that USE INDEX is MySQL's hint syntax; SQL Server uses the WITH (INDEX(...)) form shown above). It works for me. Please give it a try.
I have a view (say 'v') that is the combination of 10 tables using several joins and complex calculations. That view has around 10 thousand rows.
I then select a single row from it, e.g. WHERE id = 23456.
Another possible approach would be a larger hand-written query in which I can cut the dataset down to 1% before the complex calculations start.
Question: Are SQL views optimized in some form?
MySQL views are just syntactic sugar. There is no special optimization. Think of views as being textually merged and then optimized; that is, you could get the same optimizations (or not) by manually writing the equivalent SELECT.
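In other words, for a view that MySQL can merge, a query like this (table and column names are invented for illustration):
CREATE VIEW v AS
SELECT o.id, o.customer_id, c.name AS customer_name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

SELECT * FROM v WHERE id = 23456;
is planned essentially as if you had written the underlying join yourself with WHERE o.id = 23456 appended. (A view MySQL cannot merge, e.g. one containing GROUP BY, may instead be materialized into a temporary table before your WHERE is applied, which is worth checking with EXPLAIN for a view with complex calculations.)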
If you would like to discuss the particular query further, please provide SHOW CREATE TABLE/VIEW and EXPLAIN SELECT .... It may be that you are missing a useful 'composite' index.
I have a huge database and my task is to improve its performance, to avoid timeout issues and minimize SELECT query durations.
Which areas do I need to concentrate on to improve the performance of stored procedures effectively?
How do sites like Facebook store huge amounts of data and still not lag in performance?
What can be done to improve the performance of SPs?
Ninety percent of slow queries can be fixed by adding/rebuilding indexes. Make sure that you have indexes on all the tables involved, and that your join clause criteria match those index keys.
Note that adding indexes can have its own performance cost, however, especially when you insert records. But it's usually worth it.
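As a sketch, if a slow query joins and filters on particular columns, an index whose key matches those columns is the usual fix (the table, column and index names here are hypothetical):
-- Hypothetical slow query:
SELECT o.OrderId, c.Name
FROM Orders o
JOIN Customers c ON c.CustomerId = o.CustomerId
WHERE o.Status = 'open';

-- Index whose keys match the filter and join columns on Orders:
CREATE INDEX IX_Orders_Status_CustomerId ON Orders (Status, CustomerId);
-- Customers.CustomerId is usually already the (indexed) primary key.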
If you want to improve stored procedure performance in SQL Server, I would recommend these three things:
Add SET NOCOUNT ON in the SP -- it can provide a significant performance boost, because network traffic is greatly reduced.
Try to use indexed columns in your WHERE conditions.
Verify the execution plan, and if you see excessive parallelism, try OPTION (MAXDOP N), where you set N as per the requirement.
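A minimal T-SQL sketch of those three points together (the procedure, table and index names are invented):
CREATE PROCEDURE dbo.GetOpenOrders
AS
BEGIN
    SET NOCOUNT ON;  -- 1) suppress "n rows affected" messages to cut network traffic

    SELECT OrderId, CustomerId, OrderDate
    FROM dbo.Orders
    WHERE Status = 'open'   -- 2) Status is assumed to be covered by an index
    OPTION (MAXDOP 4);      -- 3) cap parallelism if the plan shows it running away
END;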
The question is: what factors affect the performance of multiple joins?
There are many things that can negatively affect join performance, but the usual suspects are below.
Lack of Index on the joined columns
Inefficient join orders for OUTER JOIN
Use of Subquery
Modification of search arguments or of the join column (e.g. A.intColumn + 1 = B.intColumn; see the sketch below)
Clauses like ORDER BY will also impact performance in general.
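For the last item in that list, rewriting the condition so that the column you want indexed stays bare lets the optimizer use its index (the id columns and the choice of which side to rewrite are purely illustrative):
-- The expression on A.intColumn prevents an index on A.intColumn from being used for the join:
SELECT A.id, B.id
FROM A JOIN B ON A.intColumn + 1 = B.intColumn;

-- Rewritten so A.intColumn stays untouched; an index on A.intColumn is now usable
-- (at the cost of the expression moving to B's side):
SELECT A.id, B.id
FROM A JOIN B ON A.intColumn = B.intColumn - 1;
Ideally the join compares two bare columns so the optimizer can use an index on either side.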
(MySQL-centric answer)
JOINs are performed by tackling one table at a time. The optimizer picks which one it thinks is best to start with. Here are some criteria:
The table with the most filtering (WHERE ...) will probably be picked first.
If two tables look about the same, the smaller table will probably be picked first.
Something like that occurs when picking the 'next' table to use.
MySQL almost never uses more than one index per table in a SELECT (assuming there are no subqueries or UNIONs). A Composite INDEX is often useful. Sometimes a "covering" index is warranted.
See my index cookbook.
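A small illustration of the composite / covering idea (the table, columns and index names are hypothetical):
-- Hypothetical query:
SELECT total FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- Composite index on both WHERE columns:
CREATE INDEX idx_cust_status ON orders (customer_id, status);

-- "Covering" index that also carries the selected column, so the whole query
-- can be answered from the index alone (EXPLAIN shows "Using index"):
CREATE INDEX idx_cust_status_total ON orders (customer_id, status, total);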
Stored Routines do not help performance much -- unless you are accessing the server over a WAN. In that case, a SP cuts down on the number of roundtrips, thereby improving latency.
30K inserts per day? That is trivial. Where is the performance issue? On big SELECTs? Is this a Data Warehouse application? Do you have Summary Tables? They are the big performance boost.
Millions of rows? Or Billions?
Normalized? Over-normalized? (Do not normalize 'continuous' values such as FLOAT, DATE, etc.)
That's a lot of hand-waving. If you want some real advice, let's see a slow query.
In my experience, it all comes down to indexing. This is best illustrated by using an example. Suppose you have two tables T1 and T2 and you want to join them. Each table only has 1000 rows in it. Without indexing, the query execution plan will take the cross product of the two tables and then iterate through it sequentially, filtering out the results that don't match the where condition. For simplicity, let's just assume only one row matches the filter condition.
T1 X T2 = 1000 * 1000 = 1,000,000
Without indexing, filtering will require 1 million steps.
However, with indexing, only about 20 steps are required, because an index lookup costs roughly log2(n) comparisons -- about 10 per 1000-row table.
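Concretely, the indexed plan in that example relies on something like the following (the column names are made up; the tables are the 1000-row T1 and T2 above):
CREATE INDEX idx_t2_t1_id ON T2 (t1_id);

SELECT *
FROM T1
JOIN T2 ON T2.t1_id = T1.id
WHERE T1.id = 123;
-- With the index, the lookup into T2 is a B-tree descent of roughly log2(1000), about 10 steps,
-- instead of a scan of all 1000 rows, and likewise for finding T1.id = 123 via its primary key.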
My thinking is that if I put the AND conditions that filter out a greater number of rows before those that filter out just a few, my query should run quicker, since the intermediate selection set is much smaller between the AND conditions.
But does the order of ANDs in the WHERE clause of an SQL statement really affect performance that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
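Checking is cheap; in MySQL, for example (reusing the table and columns from the first question purely for illustration):
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
EXPLAIN SELECT * FROM table1 WHERE nickname='Gigi' AND first_name='Georges';
-- If both orderings report the same key, rows and Extra values, the optimizer has
-- normalized the predicate and the written order makes no practical difference.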
Most SQL engines are optimized to do this work for you. However, I have found situations in which trying to carve down the largest table first can make a big difference - it doesn't hurt !
A lot depends how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter what order the conditions are listed.
If none of the conditions apply, the order the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query, the time spent fetching will likely dwarf the time spent evaluating the conditions.
I ran the same query on a number of tables (containing different numbers of records):
SELECT * FROM `tblTest`
ORDER BY `tblTest`.`DateAccess` DESC;
Why does the first query behave erratically (take longer than the second, third, ...)?
I calculated the average of the second, third and fourth runs, excluding the first one.
So for example, in a table with 1,000,000 records, the first run takes 4.8410 s and the second only 0.8940 s. Why is this happening?
P.S. I use the phpMyAdmin tool.
DBMSs are really smart applications that maintain multiple internal catalogues to optimize their execution. When a query is run, it generates entries in these catalogues; depending on the DBMS, they become progressively better tuned, and some systems can even automatically suggest indexes for frequently run queries. They also all have what is called a query optimizer, which analyzes the query's execution plan in order to improve it.
In your specific case, you should look at query and result caching. The following articles should help you understand how MySQL natively tries to optimize query processing:
http://dev.mysql.com/doc/refman/5.5/en/query-cache.html
http://www.cyberciti.biz/tips/enable-the-query-cache-in-mysql-to-improve-performance.html
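On MySQL versions that still have the query cache (it was removed in 8.0), you can check whether it explains the difference and re-time the query while bypassing it:
SHOW VARIABLES LIKE 'query_cache%';  -- is the cache enabled, and how big is it?
SHOW STATUS LIKE 'Qcache%';          -- cache hits, inserts and free memory

SELECT SQL_NO_CACHE * FROM `tblTest`
ORDER BY `tblTest`.`DateAccess` DESC;  -- bypass the query cache for a fairer timing
If the timings stay fast even with SQL_NO_CACHE, the speed-up is more likely the InnoDB buffer pool having warmed up (the data pages are already in memory after the first run).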
Here is a comparison between Oracle, MySQL and Postgres (not a new article, but it will give you a basic idea of how different DBMSs handle complex queries on large databases):
http://dcdbappl1.cern.ch:8080/dcdb/archive/ttraczyk/db_compare/db_compare.html#Query+optimization
Cheers,