I have a query like the following:
SELECT SUM(`weight`) as totalgrams,
SUM(`weight`)/1000 as totalkilograms
FROM `item`
The second column's calculation requires the result of the first column's SUM, but since I can't reference totalgrams there, I have to repeat the SUM() call in the second column.
The query plan from EXPLAIN:
Now, with the second query:
SELECT totalgrams, totalgrams/1000 as totalkilograms
FROM (SELECT SUM(`weight`) as totalgrams
FROM `item`) prequery
I don't need to repeat the SUM but I ended up with a nested query.
The query plan from EXPLAIN:
At a glance, it seems better to go with the first query, as it only has one entry in the execution plan. But is SUM calculated twice there (which would be redundant and not scale well)?
Or does the system already have an optimization for this and calculate it only once, so that the first query is indeed better?
Right now there are only a few rows in the table, so perhaps the difference is not significant in actual milliseconds.
But if the table later becomes huge, which query would actually be better?
And does the answer apply to all DBMSs?
This is purely for understanding the SQL workflow; any insight is appreciated.
MySQL materializes subqueries in the FROM clause -- the so-called derived table. In this case, the summary has one row and one column, so it is really no big deal.
Including the SUM() twice in the SELECT does not have this overhead. It is unclear from the EXPLAIN output whether SUM() is calculated once or twice. Probably twice, but there could be an optimization step that eliminates that processing. In any case, SUM() is really cheap. The expensive part is arranging the aggregation, and all the aggregate functions are processed together.
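If you are on MySQL 8.0.18 or later, EXPLAIN ANALYZE shows the plan that was actually executed, including whether the derived table was materialized, so you can compare the two forms yourself:
EXPLAIN ANALYZE
SELECT SUM(`weight`) AS totalgrams,
       SUM(`weight`)/1000 AS totalkilograms
FROM `item`;

EXPLAIN ANALYZE
SELECT totalgrams, totalgrams/1000 AS totalkilograms
FROM (SELECT SUM(`weight`) AS totalgrams
      FROM `item`) prequery;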
You say this is purely for understanding the workflow, so I'll start my answer by saying MySQL does have means of optimizing these sorts of operations and will do so, but it isn't perfect and you shouldn't depend on it. [PICKY] The example is not the best, as a SUM operation is trivial anyhow. [/PICKY]
I would say your first solution is the better one, but better still would be to remove the need for the calculation altogether. Most of the time when a calculated column is used, it's simpler to do the calculation in the application that receives the result; i.e., if this is called from PHP, let PHP calculate the total kilograms instead of MySQL. It's a one-time calculation based on a single return value, and it doesn't matter whether MySQL optimizes it or not. As I said earlier, SUM is inexpensive, so for this particular example it isn't relevant, but if the operation were something more expensive it would be a factor; as a general policy, we should not assume the operation is trivial.
If the outside language is an issue, another possibility would be to create an intermediate table and then update that table with the result. In this case (a single row) the overhead makes this less desirable, but if there were many rows in the result table (such as with a GROUP BY), or if you want a general policy, the overhead becomes a non-issue.
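A minimal sketch of that intermediate-table approach (the item_totals table and its column types are invented here for illustration):
-- Compute the expensive aggregate exactly once into a scratch table...
CREATE TEMPORARY TABLE item_totals (
    totalgrams     DECIMAL(18,3),
    totalkilograms DECIMAL(18,6)
);
INSERT INTO item_totals (totalgrams)
SELECT SUM(`weight`) FROM `item`;

-- ...then derive the dependent column from the stored result.
UPDATE item_totals SET totalkilograms = totalgrams / 1000;
SELECT totalgrams, totalkilograms FROM item_totals;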
Related
Here is the tutorial for GROUP_CONCAT() on GeeksForGeeks.
In "Queries 2", the Output is ascending. But there is no ORDER BY clause.
here is the picture of "Queries 2"
Could anyone can tell me why?
Any help would be really appreciated!
This is one of those oddballs where there is likely an implicit sort happening behind the scenes, performed by MySQL to optimize the DISTINCT execution.
You can test this yourself pretty easily:
CREATE TABLE t1 (c1 VARCHAR(50));
INSERT INTO t1 VALUES ('zebra'),('giraffe'),('cattle'),('fox'),('octopus'),('yak');
SELECT GROUP_CONCAT(c1) FROM t1;
SELECT GROUP_CONCAT(DISTINCT c1) FROM t1;
GROUP_CONCAT(c1)
zebra,giraffe,cattle,fox,octopus,yak
GROUP_CONCAT(DISTINCT c1)
cattle,fox,giraffe,octopus,yak,zebra
It's not uncommon to find sorted results where no ORDER BY was specified. Window function output is a good example of this.
You can imagine, if you were tasked as a human with picking only the distinct items from a list, you would likely sort the list first and then weed out the duplicates, right? And when you hand the list back to the person who requested it, you wouldn't scramble the data back into an unsorted state, I would assume. Why do the extra work? What you are seeing here is a byproduct of the optimized execution path chosen by the MySQL server.
The key takeaway is "byproduct". If I specifically wanted the output of GROUP_CONCAT to be sorted, I would specify exactly what I want, and I would not rely on this implicit sorting behavior. We can't guess what the execution path will be. There are a lot of decisions an RDBMS makes when SQL is submitted, to optimize its execution; depending on data size and the other steps it needs to take in the SQL, this behavior may appear in one SQL statement and not another. Likewise, it may work one day and not another.
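For example, GROUP_CONCAT() accepts its own ORDER BY clause, so the ordering can be made explicit instead of relying on the side effect of DISTINCT:
SELECT GROUP_CONCAT(DISTINCT c1 ORDER BY c1) FROM t1;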
TL;DR Never omit an ORDER BY clause from a query if you rely on the order for something.
Does DISTINCT automatically sort the result in MySQL?
No. NO! Be careful!
SQL is all about sets of rows. Without ORDER BY clauses, SQL queries return the rows of their result sets in an "unpredictable" order. "Unpredictable" is like random, but worse: if the order were truly random, you would have a chance to catch any ordering problem while testing. Unpredictable means the server returns rows in whatever order is convenient for it. So everything works as you expect until some day in the future when it doesn't, without warning. (MySQL might start using some kind of parallel algorithm in the future, for example.)
Now, it is true that DISTINCT result sets from modestly sized tables are often generated using a sorting/deduplicating algorithm in the server. But that is an implementation detail, and MySQL and other table servers are complex enough that relying on implementation details is not wise. The good news: if you include an ORDER BY clause requesting the same order that the deduplication pass generates, performance is usually unchanged.
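A minimal sketch using the t1 table from the earlier answer: the explicit clause documents your intent, and when the server already sorts to deduplicate, it typically costs nothing extra:
SELECT DISTINCT c1 FROM t1 ORDER BY c1;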
SQL is declarative, not procedural. We specify what we want, not how to get it. It's probably the only declarative language most of us ever see, so it's easy to make the mistake of thinking it is procedural.
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL Server still use that index?
Yeah, that's right. If your Employee table has 10,000 records and only 5 records have an EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have an EmployeeTypeId in (1,2,3), then it will most likely just do a table scan to get the corresponding EmployeeIds, as it's faster to run through the whole table than to visit each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were non-unique, however (and I guess in the example a "TypeId" is a foreign key), then I'm more interested in the distribution. I wonder whether the optimiser will check the stats for each value in the list. Say it checks the first value and finds it's in 20% of the rows (of a table large enough to matter); it'll probably table scan. But will the same query plan be used for the other two values, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching: beware the query in the IN clause. It's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so: in these days of cost-based optimisation, you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
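A minimal sketch of that restatement (the EmployeeType table and its IsActive column are hypothetical, invented only to show the shape of the rewrite):
-- IN with a subquery (EmployeeType / IsActive are hypothetical names)...
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId
                           FROM EmployeeType t
                           WHERE t.IsActive = 1);

-- ...restated as a join on the same hypothetical table.
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsActive = 1;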
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of the data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used instead of a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this should vary between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, a table scan may also be faster.
The other variable that might not be so obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think indexes will be used if you're filtering on a float but your column is made up of ints. Some operators also don't support index use (again in PostgreSQL, the ILIKE operator is like this).
As noted, though: always check the query analyser when in doubt, and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds));
NHibernate then generates a query which looks like:
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
SELECT EmployeeId FROM Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created (note: USE INDEX is MySQL's hint syntax; SQL Server uses WITH (INDEX(...)) as shown above). It works for me. Please give it a try.
My thinking is that if I put the ANDs that filter out a greater number of rows before those that filter out just a few, my query should run quicker, since the selection set is much smaller between the AND conditions.
But does the order of ANDs in the WHERE clause of a SQL statement really affect performance that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
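As a sketch of such a test (the orders table and its columns are hypothetical): run EXPLAIN on both orderings and compare the plans; on most engines they come out identical:
-- orders / status / customer_id are hypothetical names
EXPLAIN SELECT * FROM orders WHERE status = 'shipped' AND customer_id = 42;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';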
Most SQL engines are optimized to do this work for you. However, I have found situations in which carving down the largest table first can make a big difference; it doesn't hurt!
A lot depends on how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise, if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter in what order the conditions are listed.
If none of those scenarios applies, the order in which the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query anyway, the time spent fetching will likely dwarf the time spent evaluating the conditions.
A query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT, and the aggregate function?
Maybe a different order would produce the same result but with different performance. I want to merge multiple results; I have the SQL and have parsed it, so I want to know the order that standard SQL uses.
This is bigger than just GROUP BY/aggregation/ORDER BY. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the WHERE clause, in conjunction with the indexes, to determine how much of each table will be needed, and how much work it will take to extract the needed portions (how well the query matches up with your indexes, or with the table as stored on disk).
4. Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables appear in the query and regardless of any ORDER BY clause. For this step, "most efficient" means the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the groups discovered so far before the engine can determine whether a new row should be added to the working set. Often, the most efficient way to do this is for the query engine to perform an effective ORDER BY step here: all the potential result rows are materialized into the working set, which is then ordered by the columns in the GROUP BY clause and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, so that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for window functions, common table expressions, CROSS APPLY, PIVOT, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
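As a rough illustration against the query from the question, here is the logical evaluation order the SQL standard prescribes (the engine may physically execute it differently, so long as the result is the same). Note that the DISTINCT is a no-op here, since GROUP BY area already yields one row per area:
-- 1. FROM T_USER      read the source rows
-- 2. GROUP BY area    partition the rows into groups
-- 3. MAX(age)         compute the aggregate within each group
-- 4. SELECT list      build the output columns
-- 5. DISTINCT         remove duplicate output rows (none remain here)
-- 6. ORDER BY area    sort the final result
SELECT DISTINCT MAX(age), area
FROM T_USER
GROUP BY area
ORDER BY area;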
I was wondering if there is a performance difference between a SELECT query with a not-very-specific WHERE clause and another SELECT query with a more specific WHERE clause.
For instance, is the query:
SELECT * FROM table1 WHERE first_name='Georges';
slower than this one:
SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
In other words, is there a time factor linked to the precision of the WHERE clause?
I'm not sure I'm being very clear, or even whether my question takes into account all the components involved in a database query (MySQL in my case).
My question is related to the Django framework, because I would like to cache an evaluated queryset and, on a subsequent request, take back this cached, evaluated queryset, filter it further, and evaluate it again.
There is no hard and fast rule about this.
There can be either an increase or decrease in performance by adding more conditions to the WHERE clause, as it depends on, among other things, the:
indexing
schema
data quantity
data cardinality
statistics
intelligence of the query engine
You need to test with your data set and determine what will perform the best.
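One concrete way to test with the queries from the question: compare what MySQL plans for each form, looking in particular at the key and rows columns of the output:
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges';
EXPLAIN SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';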
The MySQL server must compare all the columns in your WHERE clause (if they are all joined by AND).
So if you don't have an index on the nickname column, the second query will be slightly slower.
Here you can read how column indexes work (with examples similar to your question): http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
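A minimal sketch (the index name is invented for illustration): a composite index covering both columns lets MySQL resolve the more specific WHERE clause directly from the index:
-- idx_first_name_nickname is an illustrative name
CREATE INDEX idx_first_name_nickname ON table1 (first_name, nickname);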
I think it is difficult to answer this question; too many aspects (e.g. indexes) are involved. I would guess that the first query is faster than the second one, but I can't say for sure.
If this is crucial for you, why don't you run a simulation (e.g. run 1,000,000 queries) and check the time?
Yes, it can be slower. It all depends on the indexes you have and the data distribution.
Check the link Understanding the Query Execution Plan for information on how to know what MySQL is going to do when executing your query.