SQL Nested Query Processing - mysql

Suppose a nested query looks like this:
SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
So my question is: how do actual query processors evaluate this?
Do they first evaluate the innermost query, store the result temporarily, and then use it in the upper-level query? [If the result of the subquery is large, temporary storage may not be enough.]
Or do they evaluate the outer query once for each result of the inner query? [This requires too many evaluations of the outer query.]
Am I missing something, and which way is it actually implemented?

It depends. You're allowed to reference a value from the outer query inside the inner one. If you do this, you have what is referred to as a correlated sub-query or correlated derived table, and the engine has to recompute the subquery for every candidate row in the parent query. If you don't do this, you have what is referred to as an inline view or inline derived table, and most database engines are smart enough to compute the view only once.
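To make the distinction concrete, here is a sketch using the tables from the question (the Table1.id and Table2.ref columns are made up for illustration):

-- Uncorrelated: the subquery never references the outer table,
-- so its result set can be computed once and reused.
SELECT *
FROM Table1
WHERE Table1.val IN (SELECT Table2.val FROM Table2 WHERE Table2.val > 3);

-- Correlated: the subquery references Table1.id from the outer query,
-- so conceptually it must be re-evaluated for each outer row.
SELECT *
FROM Table1
WHERE Table1.val IN (SELECT Table2.val FROM Table2 WHERE Table2.ref = Table1.id);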

This might help you:
There is plenty of memory available to run fairly large queries, and there is little chance of the query's complexity creating a timeout. Nested queries have the advantage of being run almost exclusively in physical memory for the sake of speed, unless the amount of memory needed to perform the query exceeds the physical memory "footprint" allocated to SQL Server.
Also, because of a nested query's structure, the SQL Server query optimizer will attempt to find the most optimal way to derive the results from the query (i.e. the optimizer breaks it into lower-level operations and reorders them to run efficiently). [Source]
Because of this reordering, it is hard to say which query is run first. After optimization, however, some form of the inner query will most probably be run first (most of the time you can't do much without that result).
SQL queries don't run like nested function calls on a stack, the way you seem to think. A query is a single declarative statement that is parsed, optimized, and translated into an execution plan as a whole.

You can use EXPLAIN to get information about the execution plan like this:
EXPLAIN SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
For example, for your query the subquery will show up with a select_type of DEPENDENT SUBQUERY, and as the documentation states:
DEPENDENT SUBQUERY evaluation differs from UNCACHEABLE SUBQUERY evaluation. For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context. For UNCACHEABLE SUBQUERY, the subquery is re-evaluated for each row of the outer context.
So your first guess was the correct one.
For more information about EXPLAIN see:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
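As a sketch of the distinction the documentation draws, it is typically a non-deterministic expression inside the subquery that makes it uncacheable (the RAND() variant below is made up for illustration):

-- Deterministic subquery: typically reported as DEPENDENT SUBQUERY,
-- re-evaluated once per distinct set of outer values.
EXPLAIN SELECT * FROM Table1
WHERE Table1.val IN (SELECT Table2.val FROM Table2 WHERE Table2.val > 3);

-- Non-deterministic subquery: typically reported as UNCACHEABLE SUBQUERY,
-- re-evaluated for every row of the outer query.
EXPLAIN SELECT * FROM Table1
WHERE Table1.val IN (SELECT Table2.val FROM Table2 WHERE Table2.val > 3 * RAND());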

Related

Does the order of JOIN vs WHERE in SQL affect performance?

In SQL, how much does the order of JOIN vs WHERE affect the performance of a query?
a) SELECT […] FROM A JOIN ( SELECT […] FROM B WHERE CONDITION ) ON […]
b) SELECT […] FROM A JOIN ( SELECT […] FROM B ) ON […] WHERE CONDITION
My gut feeling tells me that option a) should be more performant: joining first and then filtering seems far less efficient than filtering one table first and joining from the results. But I'm not sure, as this depends on the internal optimizations of the SQL engine itself.
It would be nice to know whether the behavior is the same for both MySQL and PostgreSQL, and also whether it depends on other clauses such as GROUP BY or ORDER BY.
Postgres has a smart optimizer, so the two versions should have similar execution plans in most cases (I'll return to that in a moment).
MySQL has a tendency to materialize subqueries. Although this has gotten better in more recent versions, I still recommend avoiding it. Materializing subqueries prevents the use of indexes and can have a significant impact on performance.
One caveat: If the subquery is complicated, then it might be better to filter as part of the subquery. For instance, if it is an aggregation, then filtering before aggregating usually results in better performance. That said, Postgres is smart about pushing conditions into the subquery. So, if the outer filtering is on a key used in aggregation, Postgres is smart enough to push the condition into the subquery.
All other factors being equal, I would expect the A version to perform better than the B version, as you also seem to expect. The main reason for this is that the A version lets the database throw out rows using the WHERE clause in the subquery. Then the join only has to involve a smaller intermediate table. The exact difference in performance between the two would depend on the underlying data and the actual queries. Note that it is even possible that both queries could be optimized under the hood to the same or very similar execution plan.
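As a sketch of the aggregation caveat above (the tables A and B and their columns are made up for illustration):

-- Version a): filter inside the subquery, before aggregating.
SELECT A.id, agg.total
FROM A
JOIN (
    SELECT a_id, SUM(amount) AS total
    FROM B
    WHERE a_id < 100          -- rows discarded before the GROUP BY
    GROUP BY a_id
) agg ON agg.a_id = A.id;

-- Version b): filter outside. Because the condition is on the grouping
-- key (a_id), a smart optimizer such as Postgres can push it into the
-- subquery; an engine that materializes the derived table first
-- aggregates all of B and then throws most of the work away.
SELECT A.id, agg.total
FROM A
JOIN (
    SELECT a_id, SUM(amount) AS total
    FROM B
    GROUP BY a_id
) agg ON agg.a_id = A.id
WHERE agg.a_id < 100;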

The process order of SQL order by, group by, distinct and aggregation function?

Query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT and the aggregate function?
Maybe different orders give the same result but cause different performance. I want to merge multiple result sets; I have the SQL and have parsed it, so I want to know the order that standard SQL uses.
This is bigger than just group by/aggregation/order by. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the WHERE clause, in conjunction with indexes, to determine how much data from each table will be needed, and how much work it will take to extract the needed portions of each table (that is, how well the query matches up with your indexes or with the table as stored on disk).
4. Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the already-discovered groups before the engine can determine whether a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, ordered by the columns in the GROUP BY clause, and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
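Mapped onto the query from the question, the logical evaluation order looks roughly like this (a conceptual sketch of the SQL-defined semantics, not necessarily the physical plan the engine executes):

SELECT DISTINCT MAX(age), area    -- 4. compute select list; 5. remove duplicates
FROM T_USER                       -- 1. identify the source table (no WHERE to apply)
GROUP BY area                     -- 2. form the groups; 3. compute MAX(age) per group
ORDER BY area;                    -- 6. sort the final result

Note that because the grouping is by area, each group already yields exactly one row per area, so the DISTINCT removes nothing in this particular query.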

Aggregation Repetition vs Nested Query

I have a query as the following:
SELECT SUM(`weight`) as totalgrams,
SUM(`weight`)/1000 as totalkilograms
FROM `item`
which requires me to reuse the result of the first column's SUM; but since I can't reference the alias totalgrams in the same select list, I have to repeat the SUM function in the second column's calculation.
The query plan from EXPLAIN shows a single entry for this query.
Now, with the second query:
SELECT totalgrams, totalgrams/1000 as totalkilograms
FROM (SELECT SUM(`weight`) as totalgrams
FROM `item`) prequery
I don't need to repeat the SUM, but I end up with a nested query.
The query plan from EXPLAIN shows two entries here: one for the outer query and one for the derived table (prequery).
At a glance, it seems better to go with the first query, as it has only one entry in the execution plan; but is SUM calculated twice there (which would be redundant and would not scale)?
Or does the system already have an optimization for this and calculate it just once, so the first query is indeed better?
Right now there are only a few rows in the table, so the difference is probably not measurable in real milliseconds.
But if the table later becomes huge, which query would actually be better?
And does this apply to all DBMSs?
It is purely for understanding the SQL workflow, any insight is appreciated.
MySQL materializes subqueries in the from clause -- the so-called derived table. In this case, the summary has one row and one column, so it is really no big deal.
Including the sum() twice in the select does not have this overhead. It is unclear from the EXPLAIN output whether sum() is calculated once or twice. Probably twice, but there could be an optimization step that eliminates that processing. In any case, sum() is really cheap. The expensive part is arranging the aggregation, and all the aggregate functions are processed together.
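To see the materialization, running EXPLAIN on the nested version will list the subquery as a derived table (a sketch; the exact output depends on the MySQL version):

EXPLAIN SELECT totalgrams, totalgrams/1000 AS totalkilograms
FROM (SELECT SUM(`weight`) AS totalgrams FROM `item`) prequery;

-- select_type: PRIMARY    table: <derived2>
-- select_type: DERIVED    table: item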
You say this is purely for understanding the workflow, so I'll start my answer by saying MySQL does have means of optimizing this sort of operation and will do so, but it isn't perfect and you shouldn't depend on it. [PICKY]The example is not the best, as a SUM operation is trivial anyhow.[/PICKY]
I would say your first query is the better one, but better still would be to remove the need for the calculation altogether. Most of the time when a calculated column is used, it's simpler to do the calculation in the application that receives the result; i.e., if this is called from PHP, let PHP calculate the total kilograms instead of MySQL. It's a one-time calculation based on a single returned value, so it doesn't matter whether MySQL optimizes it or not. As I said earlier, SUM is inexpensive, so for this particular example it isn't relevant; but if the operation were something more expensive it would be a factor, and as a general policy we should not assume the triviality of the operation.
If doing it in the outside language is an issue, another possibility is to create an intermediate table and then update that table with the result. In this case (a single row) the overhead makes this less desirable, but if the result table had many rows (such as with a GROUP BY), or if you want a general policy, the overhead becomes a non-issue.
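A minimal sketch of that intermediate-table approach, using the item table from the question (the item_totals table name is made up):

-- Materialize the expensive aggregate once...
CREATE TEMPORARY TABLE item_totals AS
SELECT SUM(`weight`) AS totalgrams
FROM `item`;

-- ...then derive any dependent values from the stored result.
SELECT totalgrams, totalgrams / 1000 AS totalkilograms
FROM item_totals;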

Mysql SELECT query and performance

I was wondering if there is a performance difference between a SELECT query with a not very specific WHERE clause and another SELECT query with a more specific WHERE clause.
For instance is the query:
SELECT * FROM table1 WHERE first_name='Georges';
slower than this one:
SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';
In other words, is there a time factor linked to the precision of the WHERE clause?
I'm not sure I'm being very clear, or even whether my question takes into account all the components involved in a database query (MySQL in my case).
My question is related to the Django framework: I would like to cache an evaluated queryset and, on a later request, take back this cached evaluated queryset, filter it further, and evaluate it again.
There is no hard and fast rule about this.
There can be either an increase or decrease in performance by adding more conditions to the WHERE clause, as it depends on, among other things, the:
indexing
schema
data quantity
data cardinality
statistics
intelligence of the query engine
You need to test with your data set and determine what will perform the best.
The MySQL server must compare all columns in your WHERE clause (when they are all joined by AND).
So if you don't have an index on the nickname column, the second query will be slightly slower.
Here you can read how column indexes works (with examples similar to your question): http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
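As an illustration, one composite index can serve both queries, since MySQL can use either the leading column alone or both columns together (the index name is made up):

CREATE INDEX ix_table1_name_nick ON table1 (first_name, nickname);

-- Can use the index prefix on first_name:
SELECT * FROM table1 WHERE first_name='Georges';

-- Can use both index columns, so the extra condition
-- can make this faster rather than slower:
SELECT * FROM table1 WHERE first_name='Georges' AND nickname='Gigi';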
I think this is difficult to answer; too many aspects (e.g. indexes) are involved. I would say that the first query is faster than the second one, but I can't say for sure.
If this is crucial for you, why don't you run a simulation (e.g. run 1,000,000 queries) and measure the time?
Yes, it can be slower. It will all depend on the indexes you have and the data distribution.
Check the link Understanding the Query Execution Plan for information on how to find out what MySQL is going to do when executing your query.

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot, but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used, but whatever I pass to EXPLAIN, it is able to optimize the WHERE clause away ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell MySQL not to do constant optimizations in EXPLAIN?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that will result from the constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called), this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them. Even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? -- For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN for the full matrix of exist/non-exist cases. Really that isn't a very good solution, as it would be both tedious and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example say my client-side code is generating a query with a number in its WHERE clause.
Sometimes the number will result in an impossible WHERE clause, other times it won't.
How can I use EXPLAIN to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read the list of which MySQL optimizations happen at which stage:
Action                                        When
Query parse                                   PREPARE
Negation elimination                          PREPARE
Subquery re-writes                            PREPARE
Nested JOIN simplification                    First EXECUTE
OUTER->INNER JOIN conversions                 First EXECUTE
Partition pruning                             Every EXECUTE
COUNT/MIN/MAX elimination                     Every EXECUTE
Constant subexpression removal                Every EXECUTE
Equality propagation                          Every EXECUTE
Constant table detection                      Every EXECUTE
ref access analysis                           Every EXECUTE
range/index_merge analysis and optimization   Every EXECUTE
Join optimization                             Every EXECUTE
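A minimal sketch of the PREPARE/EXECUTE split that this table refers to (the statement name and variable are arbitrary):

-- Parsing, negation elimination and subquery rewrites happen once, here:
PREPARE stmt FROM 'SELECT * FROM tbl WHERE col = ?';

-- Range analysis, constant table detection, join optimization, etc.
-- happen here, using the concrete value bound to the placeholder:
SET @v = 42;
EXECUTE stmt USING @v;

DEALLOCATE PREPARE stmt;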
There is one more thing missing from the list above.
MySQL can rebuild a query plan on every JOIN iteration: the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
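When this kicks in, EXPLAIN advertises it in the Extra column. For the rewritten query you would expect something like the following (a sketch; the exact index map value depends on your indexes):

EXPLAIN
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound;

-- Extra: Range checked for each record (index map: 0x1)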
You can read about it in these articles on my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All of these employ RANGE CHECKING FOR EACH RECORD.
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes being chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only NESTED LOOPS. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only to the query but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on a combination of columns, if you always query the given columns together). If you have indexes, the query planner will potentially use them.
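For instance (hypothetical names), for the query from the original question:

-- Lets the planner resolve WHERE col = <constant> with an index
-- lookup instead of a table scan, whatever the constant turns out to be.
CREATE INDEX ix_tbl_col ON tbl (col);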
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (e.g. asking for a numeric column to be equal to a text value)
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since the conditions might have changed since the plan was generated and a better plan might exist.