NOTE: the original question is moot, but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used, but whatever I pass to EXPLAIN, it optimizes the WHERE clause away to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell MySQL not to do constant optimizations in EXPLAIN?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that results from constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called) this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? -- For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN on it for the full matrix of exists/doesn't-exist cases. Really, that isn't a very good solution, as it would be both tedious and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example, say my client-side code is generating a query with a number in its WHERE clause.
Sometimes the number will result in an impossible WHERE clause, other times it won't.
How can I use EXPLAIN to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read a list of when the MySQL optimizer does what:
Action | When
Query parse | PREPARE
Negation elimination | PREPARE
Subquery re-writes | PREPARE
Nested JOIN simplification | First EXECUTE
OUTER->INNER JOIN conversions | First EXECUTE
Partition pruning | Every EXECUTE
COUNT/MIN/MAX elimination | Every EXECUTE
Constant subexpression removal | Every EXECUTE
Equality propagation | Every EXECUTE
Constant table detection | Every EXECUTE
ref access analysis | Every EXECUTE
range/index_merge analysis and optimization | Every EXECUTE
Join optimization | Every EXECUTE
There is one more thing missing from this list.
MySQL can rebuild a query plan on every JOIN iteration: the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
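You can spot this mode in EXPLAIN: the Extra column reads "Range checked for each record (index map: ...)". A minimal sketch against the rewritten query (the exact index map bitmask depends on your table's indexes):
EXPLAIN SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound;
-- Extra: Range checked for each record (index map: 0x1)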
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD.
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes chosen by using STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only within the query but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on a combination of columns, if you always query the given columns together). If you have indexes, the query planner will potentially use them.
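For instance, a minimal sketch using the question's hypothetical tbl/col names (col2 here is an invented second column):
-- single-column index for WHERE col = ...
CREATE INDEX ix_tbl_col ON tbl (col);
-- composite index if you always filter on both columns together
CREATE INDEX ix_tbl_col_col2 ON tbl (col, col2);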
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes; see the sketch after this list)
if you are passing in the wrong type (for example, asking for a numeric column to be equal to a text value)
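A quick way to check those index endpoints yourself (assuming an index on col; tbl/col are the question's hypothetical names):
-- resolved directly from the two ends of the index;
-- EXPLAIN typically reports "Select tables optimized away" here
SELECT MIN(col), MAX(col) FROM tbl;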
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since the conditions might have changed since the query plan was generated and a better query plan might exist.
Related
I am doing a query with SELECT ... WHERE ... IN, and found something unexpected:
SELECT a,b,c FROM tbl1 where a IN (a1,a2,a3...,a5000) --> this query took 1.7ms
SELECT a,b,c FROM tbl1 where a IN (a1,a2,a3...,a20) --> this query took 6.4ms
What is the algorithm for IN? Why is a larger number of parameters faster than a smaller number?
The following is a guess...
For every SQL query, the optimizer analyzes the query to choose which index to use.
For multi-valued range queries (like IN(...)), the optimizer performs an "index dive" for each value in the short list, trying to estimate whether it's a good idea to use the index. If you are searching for values that are too common, it's more efficient to just do a table-scan, so there's no need to use the index.
MySQL 5.6 introduced a special optimization, to skip the index dives if you use a long list. Instead, it just guesses that the index on your a column may be worth using, based on stored index statistics.
You can control the list length at which the optimizer skips index dives with the eq_range_index_dive_limit option. The default is 10. Your example shows a list of length 20, so I'm not sure why it's more expensive.
Read the manual about this feature here: https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#equality-range-optimization
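A minimal sketch of inspecting and adjusting that threshold for the current session (200 is just an illustrative value):
SELECT @@eq_range_index_dive_limit;          -- current threshold
SET SESSION eq_range_index_dive_limit = 200; -- lists with fewer than 200 values still get index dives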
Given a query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of ORDER BY, GROUP BY, DISTINCT, and aggregate functions?
Maybe different orders will produce the same result but with different performance. I want to merge multiple result sets; I have the SQL and have parsed it, so I want to know the order standard SQL uses.
This is bigger than just GROUP BY/aggregation/ORDER BY. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller.
For very simple queries, or queries that are well matched to the table design (or table schemas that are well designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
Look at the query to determine which tables will be needed.
Look at joins and subqueries, to determine which of those tables depend on other tables.
Look at the conditions on the joins and in the WHERE clause, in conjunction with indexes, to determine how much of each table will be needed, and how much work it will take to extract the portions of each table that you need (how well the query matches up with your indexes or the table as stored on disk).
Based on the information collected in steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether or not a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, ordered by the columns in the GROUP BY clause, and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
Now, finally, the ORDER BY can be factored in, as well.
The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
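As a small side note on the query above: GROUP BY area already yields one row per area, so the DISTINCT is redundant and the query can be written as:
SELECT max(age), area
FROM T_USER
GROUP BY area
ORDER BY area;  -- DISTINCT dropped: grouping already guarantees unique rows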
I am trying to merge two tables on the condition that the value for a row in the first is greater than a row in the second table. Or in code:
select * from Computers join Locations
ON Computers.ip_number > Locations.ipLower;
All of the columns referenced (ip_number and ipLower) are indexed in their respective tables with very high cardinality. However, running EXPLAIN on the statement shows that no indices are used. How can I force MySQL to use indices on the join statement?
Additional info:
I am using MySQL version 5.6.17. The query correctly uses indices if the join condition is equality instead of greater than. The indices are of B-tree type.
Edit: The ip_number variable referenced is an integer which is derived from an IP address, not the IP address itself.
MySQL uses indexes when its query planner judges that there is performance to be gained by doing so.
It is not surprising that it doesn't make that judgement in this case; each row of your first table is, by your ON condition, joined to a great many rows of your second table.
Don't worry about what indexes get used by the query planner until you have a query that makes sense in your application. It's not clear the one you've shown makes sense. Its result set will be quite large.
This query might make more sense. It might also use range scans on the index on Computers.ip_number.
select *
from Computers
join Locations ON Computers.ip_number BETWEEN Locations.ipLower AND Locations.ipUpper
But, you probably should enumerate the columns you want in the result set and avoid SELECT * if you want decent performance.
Also, don't forget that IP addresses in dotted-quad form (192.168.167.66) don't collate reasonably. That is, using inequalities like < or BETWEEN to compare them with each other doesn't really work.
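The usual trick is to compare addresses as integers, which is apparently what the question's ip_number already is; a minimal sketch using MySQL's built-in conversion functions (ipUpper is the assumed upper-bound column):
-- dotted quad -> 32-bit integer, and back
SELECT INET_ATON('192.168.167.66');   -- 3232278338
SELECT INET_NTOA(3232278338);         -- '192.168.167.66'
-- integers compare correctly, so a range join works
SELECT Computers.*, Locations.*
FROM Computers
JOIN Locations
  ON Computers.ip_number BETWEEN Locations.ipLower AND Locations.ipUpper;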
Suppose a nested query is like:
SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
So my question is: how do actual query processors evaluate this?
Do they first evaluate the innermost query, store the result temporarily, and then use it in the upper-level query? [If the result of the subquery is large, temporary storage may not be enough.]
Or do they evaluate the inner query for each row of the outer query? [This requires too many evaluations of the inner query.]
Am I missing something, and which way is it actually implemented?
It depends. You're allowed to reference a value from the outer query inside the inner one. If you do this, you have what is referred to as a correlated subquery or correlated derived table. In this case, the engine has to recompute the subquery for every candidate row in the parent query. If you don't do this, you have what is referred to as an inline view or inline derived table, and most database engines are smart enough to compute the view only once.
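A minimal sketch of the distinction, reusing the question's tables:
-- Non-correlated: the inner query never references t1, so the engine
-- can materialize it once and reuse the result
SELECT *
FROM Table1 t1
WHERE t1.val IN (SELECT t2.val FROM Table2 t2 WHERE t2.val > 3);
-- Correlated: the inner query references t1.val, so it is logically
-- re-evaluated per outer row (or per distinct outer value)
SELECT *
FROM Table1 t1
WHERE EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.val = t1.val AND t2.val > 3);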
This might help you
There is plenty of memory available to run fairly large queries and there is little chance of the query's complexity creating a timeout. Nested queries have the advantage of being run almost exclusively in physical memory for the sake of speed, unless the amount of memory needed to perform the query exceeds the physical memory "footprint" allocated to SQL Server.
Also, because of a nested query's structure, the SQL Server query optimizer will attempt to find the most optimal way to derive the results from the query. (i.e. the optimizer will convert it into micro-ops and reorganize to run it efficiently) [Source]
Because of this re-ordering, it is hard to say which query is run first. After optimization, however, most probably some form of the inner query will be run first (most of the time you can't really do anything without that result).
SQL queries don't run like stack calls the way you seem to think. A query is a single declarative instruction that is analyzed and translated for the machine as a whole.
You can use EXPLAIN to get information about the execution plan like this:
EXPLAIN SELECT *
FROM Table1
WHERE Table1.val IN
(
    SELECT Table2.val
    FROM Table2
    WHERE Table2.val > 3
)
For example, for your query you'll get an EXPLAIN row for the subquery whose select_type column identifies how it will be evaluated, and as the documentation states:
DEPENDENT SUBQUERY evaluation differs from UNCACHEABLE SUBQUERY evaluation. For DEPENDENT SUBQUERY, the subquery is re-evaluated only once for each set of different values of the variables from its outer context. For UNCACHEABLE SUBQUERY, the subquery is re-evaluated for each row of the outer context.
So your first guess was the correct one.
For more information about EXPLAIN see:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
I've been told on several occasions that it is quite efficient to SELECT using math and that it is NOT very efficient to use math in a WHERE clause. Are these sentiments correct? And how does this apply to ORDER BY clauses?
Thanks!!
Example:
SELECT a.* FROM a ORDER BY (a.field_1*a.field_2)
Your query will have to sort the entire table using temporary files on disk if the result is larger than the sort_buffer_size.
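You can see this in the plan: MySQL (before functional indexes in 8.0) can't use an index to order by an expression, so EXPLAIN flags a filesort:
EXPLAIN SELECT a.* FROM a ORDER BY (a.field_1 * a.field_2);
-- Extra: Using filesort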
You probably want to add a column to your table that holds the value of field1*field2. This of course slightly denormalizes your data, BUT YOU CAN CREATE AN INDEX ON THE FIELD.
If you have an index on the new field, then MySQL can read the data pre-sorted using the index, because MySQL indexes are b*tree structures and b*tree structures are stored in pre-sorted order. This won't incur extra disk IO or CPU activity for the sort and you will scan the table only once.
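One way to get such a column without maintaining it by hand is a stored generated column, available in MySQL 5.7+ (the column and index names below are illustrative):
ALTER TABLE a
    ADD COLUMN field_3 INT AS (field_1 * field_2) STORED;
CREATE INDEX ix_a_field_3 ON a (field_3);
-- the ORDER BY can now be satisfied by reading the index in order
SELECT a.* FROM a ORDER BY a.field_3;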
It's a good idea, but I don't think using a mathematical function directly in the ORDER BY clause makes much sense.
You can use an alias instead:
select *, (intId * intId) as xalias from m_xxx_list order by xalias;
OR
select * from m_xxx_list order by (intId + intId);
If you are using one of MySQL's mathematical or aggregate functions, test it to be sure.
For MySQL to sort results by a computed value, it actually needs to calculate the value on the fly, after it has filtered out rows based on the WHERE clause. If the result set is quite large, then MySQL will need to compute the results for all the rows.
For a small result set, this should be fine. However, the larger your result set is (prior to the application of the LIMIT), the more calculations the server has to do simply to figure out the value by which to order the rows. If the calculation is deterministic, then you should cache it in a column on the table, and then index it. If it's computed on the fly, then you'll need to ensure your CPU is up to the task.
In the case provided, I would recommend creating a column, a.field_3, and storing the result of (a.field_1 * a.field_2) in it. Whenever the values of a.field_1 or a.field_2 change, you'll need to recalculate the result.
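On versions before MySQL 5.7 (where the generated-column sketch above isn't available), triggers can keep the column current; the trigger names here are illustrative:
CREATE TRIGGER a_field3_bi BEFORE INSERT ON a
FOR EACH ROW SET NEW.field_3 = NEW.field_1 * NEW.field_2;
CREATE TRIGGER a_field3_bu BEFORE UPDATE ON a
FOR EACH ROW SET NEW.field_3 = NEW.field_1 * NEW.field_2;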