How to use indices on a join in MySQL using inequality comparison

I am trying to join two tables on the condition that a value in the first table is greater than a value in the second table. Or in code:
select * from Computers join Locations
ON Computers.ip_number > Locations.ipLower;
Both of the referenced columns (ip_number and ipLower) are indexed in their respective tables and have very high cardinality. However, running EXPLAIN on the statement shows that no indexes are used in the call. How can I force MySQL to use indexes on the join statement?
Additional info:
I am using MySQL version 5.6.17. The query correctly uses indexes if the join condition is equality instead of greater-than. The indexes are B-tree indexes.
Edit: The ip_number column referenced is an integer derived from an IP address, not the IP address itself.

MySQL uses indexes when its query planner judges that there is performance to be gained by doing so.
It is not surprising that it doesn't make that judgement in this case; each row of your first table is, by your ON condition, joined to a great many rows of your second table.
Don't worry about what indexes get used by the query planner until you have a query that makes sense in your application. It's not clear the one you've shown makes sense. Its result set will be quite large.
This query might make more sense. It might also use range scans on the index on Computers.ip_number.
select *
from Computers
join Locations ON Computers.ip_number BETWEEN Locations.ipLower AND Locations.ipUpper
But, you probably should enumerate the columns you want in the result set and avoid SELECT * if you want decent performance.
Also, don't forget that IP addresses in dotted quad form 192.168.167.66 don't collate reasonably. That is, using inequalities like < or BETWEEN to compare them with each other doesn't really work.
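If the addresses are stored as dotted-quad strings, MySQL's built-in INET_ATON function (IPv4 only) converts them to integers that compare correctly. A small sketch, where ipLowerText is a hypothetical string column:
-- Convert a dotted-quad string to a comparable unsigned integer
SELECT INET_ATON('192.168.167.66');  -- returns 3232278338
-- e.g. populate an integer column once, then index and compare on it:
-- UPDATE Locations SET ipLower = INET_ATON(ipLowerText);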

Related

Does MySQL use only one index per query/sub-query?

Please assume the following query:
select dic.*, d.syllable
from dictionary dic
join details d on d.word = dic.word
As you know (and I heard), MySQL uses only one index per query. Now I want to know, in query above, which index would be better?
dictionary(word)
details(word)
In other words, when there is a join (two tables are involved), which table's index would be used? Should I create both of them (on the columns in the ON clause) and let MySQL decide which one is better?
As you know (and I heard), MySQL uses only one index per query.
In general, most databases will only use one index per table, per query. This isn't always the case, but is at least a decent rule of thumb. For your particular example, you can rely on this.
Now I want to know, in query above, which index would be better?
The query you wrote is an inner join. This means that either of the two tables could appear on the left side of the join and the result sets would be logically equivalent, so MySQL is free to execute the join in whichever order it chooses. The plan that gets chosen will likely place the larger table on the left hand side and the smaller table on the right hand side. If you knew the actual execution order of the tables, then you would just index the right table. Given that you may not know this, both of your suggested indices are reasonable:
CREATE INDEX dict_idx ON dictionary (word);
CREATE INDEX details_idx ON details (word);
We could even try to improve on the above indices by covering the columns which appear in the select clause. For example, the index on details could be expanded to:
CREATE INDEX details_idx ON details (word, syllable);
This would let MySQL use the above index exclusively to satisfy the query plan, without requiring a seek back to the original table. You select dictionary.*, so covering this with a single index might not be possible or desirable, but at least this gets the point across.
MySQL would use the most selective index (the one giving the fewest rows). This means it depends on the data, and also optimizations like this could change between versions of the database.
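To see which index the planner actually picks on your data, you can prefix the query with EXPLAIN:
-- Shows the join order and which index, if any, each table uses
EXPLAIN
select dic.*, d.syllable
from dictionary dic
join details d on d.word = dic.word;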

is where (A and B) and where (B and A) in SQL the same? [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName='F' and LastName='L'
select * from PEOPLE where LastName='L' and FirstName='F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
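As an illustration (the index name here is hypothetical), either ordering of the conditions can seek on the same composite index:
-- One composite index serves both forms of the WHERE clause
CREATE INDEX IX_People_LastName_FirstName ON PEOPLE (LastName, FirstName);
select * from PEOPLE where FirstName='F' and LastName='L'
select * from PEOPLE where LastName='L' and FirstName='F'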
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
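If you do need short-circuit-like evaluation in SQL Server, the usual workaround is a CASE expression, whose WHEN branches are documented to be evaluated in order (with some caveats around aggregates), or TRY_CAST on SQL Server 2012 and later:
select *
from INFORMATION_SCHEMA.TABLES
where (case when ISNUMERIC(table_name) = 0 then 0
            when CAST(table_name as int) <> 0 then 1
            else 0
       end) = 1
-- or, on SQL Server 2012+: TRY_CAST returns NULL instead of raising an error
-- where TRY_CAST(table_name as int) <> 0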
No, all RDBMSs start by analysing the query and optimizing it, which includes reordering the conditions in your WHERE clause.
Depending on which RDBMS you are using, you can display the result of that analysis (search for "explain plan" in Oracle, for instance).
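In MySQL, for example, prefixing the statement with EXPLAIN shows the chosen plan (a sketch, reusing the PEOPLE table from the question):
-- Reports the access type and which index, if any, is used
EXPLAIN select * from PEOPLE where LastName='L' and FirstName='F'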
It's true as far as it goes, assuming the names aren't indexed.
Different data could make it wrong, though. To find out which order is cheaper (and the answer could differ every time), the DBMS would have to run a distinct-count query for each column and compare the numbers, which would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL.
I guess you are confusing this with choosing the order of columns when creating an index, where you do have to put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any such optimization; it will use a trivial plan as long as the total cost of the plan is below the cost threshold for parallelism.

mysql: stored function call + left join = very very slow

I have two tables:
module_339 (id,name,description,etc)
module_339_schedule(id,itemid,datestart,dateend,timestart,timeend,days,recurrent)
module_339_schedule.itemid points to module_339
first table holds conferences
second one keeps the schedules of the conferences
module_339 has 3 items
module_339_schedule has 4000+ items - almost evenly divided between the 3 conferences
I have a stored function - "getNextDate_module_339" - which will compute the "next date" for a specified conference, in order to be able to display it, and also sort by it if the user wants to. It does one simple read from module_339_schedule, then loops through the schedule entries of the specified conference, comparing dates and times.
The problem: this query is very slow:
SELECT DISTINCT module_339.id,
       MIN(getNextDate_module_339(module_339.id, 1, false)) AS ND
FROM module_339
LEFT JOIN module_339_schedule ON module_339.id = module_339_schedule.itemid /* standard schedule adding */
WHERE 1=1 AND module_339.is_system_preview <= 0
GROUP BY module_339.id
ORDER BY module_339.id ASC
If I remove either the function call OR the LEFT JOIN, it is fast again.
What am I doing wrong here? Seems to be some kind of "collision" between the function call and the left join.
I think the group by part can be removed from this query, which lets you remove the min function as well. Also, there is not much point in WHERE 1=1 AND ..., so I've changed that too.
Try this:
SELECT DISTINCT module_339.id
,getNextDate_module_339(module_339.id,1,false) AS ND
FROM module_339
LEFT JOIN module_339_schedule ON module_339.id=module_339_schedule.itemid /* standard schedule adding */
WHERE module_339.is_system_preview<=0
ORDER BY module_339.id
Note that this might not have a lot of impact on performance.
I think that the worst part performance-wise is probably the getNextDate_module_339 function.
If you can find a way to get its functionality without calling a function for every row, your SQL statement will probably run a lot faster than it does now, with or without the left join.
If you need help doing this, please edit your question to include the function, and hopefully I (or someone else) can help you with that.
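As a rough illustration of the idea: if the "next date" is essentially the smallest upcoming datestart per conference, a set-based sketch (using the schedule columns shown in the question, and ignoring the time and recurrence handling the real function performs) could look like this:
-- Hypothetical set-based replacement: let the join and MIN() do the work
SELECT m.id, MIN(s.datestart) AS ND
FROM module_339 m
LEFT JOIN module_339_schedule s
       ON s.itemid = m.id
      AND s.datestart >= CURDATE()
WHERE m.is_system_preview <= 0
GROUP BY m.id
ORDER BY m.id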
From the MySQL reference manual:
The best way to improve the performance of SELECT operations is to create indexes on one or more of the columns that are tested in the query. The index entries act like pointers to the table rows, allowing the query to quickly determine which rows match a condition in the WHERE clause, and retrieve the other column values for those rows. All MySQL data types can be indexed.
Although it can be tempting to create an index for every possible column used in a query, unnecessary indexes waste space and waste time for MySQL to determine which indexes to use. Indexes also add to the cost of inserts, updates, and deletes because each index must be updated. You must find the right balance to achieve fast queries using the optimal set of indexes.
As a first step I suggest checking that the joined columns are both indexed. Since primary keys are always indexed by default, we can assume that module_339 is already indexed on the id column, so first verify that module_339_schedule is indexed on the itemid column. You can check the indexes on that table in MySQL using:
SHOW INDEX FROM module_339_schedule;
If the table does not have an index on that column, you can add one using:
CREATE INDEX itemid_index ON module_339_schedule (itemid);
That should speed up the join component of the query.
Since your query also references module_339.is_system_preview you might also consider adding an index to that column using:
CREATE INDEX is_system_preview_index ON module_339 (is_system_preview);
You might also be able to optimize the stored procedure, but you haven't included it in your question.

query optimizer operator choice - nested loops vs hash match (or merge)

One of my stored procedures was taking too long to execute. Taking a look at the query execution plan, I was able to locate the operation taking too long: a nested loops physical operator with an outer table of 65,991 rows and an inner table of 19,223 rows. The nested loops operator showed estimated rows = 1,268,544,993 (65,991 multiplied by 19,223).
I read a few articles on the physical operators used for joins and got a bit confused about whether a nested loop or a hash match would have been better in this case. From what I could gather:
Hash Match - used by the optimizer when no useful indexes are available, when one table is substantially smaller than the other, or when the tables are not sorted on the join columns. A hash match can also be a sign that a more efficient join method (nested loops or merge join) could be used instead.
Question: Would hash match be better than nested loops in this scenario?
ABSOLUTELY. A hash match would be a huge improvement. Creating the hash on the smaller 19,223 row table then probing into it with the larger 65,991 row table is a much smaller operation than the nested loop requiring 1,268,544,993 row comparisons.
The only reason the server would choose the nested loops is that it badly underestimated the number of rows involved. Do your tables have statistics on them, and if so, are they being updated regularly? Statistics are what enable the server to choose good execution plans.
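In SQL Server, statistics can be refreshed manually (the table names here are placeholders):
-- Refresh statistics so the optimizer's row estimates are accurate
UPDATE STATISTICS TableA;
UPDATE STATISTICS TableB;
-- or refresh all statistics in the database:
EXEC sp_updatestats;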
If you've properly addressed statistics and are still having a problem you could force it to use a HASH join like so:
SELECT *
FROM
    TableA A                 -- the smaller table
    LEFT HASH JOIN TableB B  -- the larger table
        ON A.id = B.a_id     -- hypothetical join columns; use your own
Please note that the moment you do this it will also force the join order. This means you have to arrange all your tables correctly so that their join order makes sense. Generally you would examine the execution plan the server already has and alter the order of your tables in the query to match. If you're not familiar with how to do this, the basics are that each "left" input comes first, and in graphical execution plans, the left input is the lower one. A complex join involving many tables may have to group joins together inside parentheses, or use RIGHT JOIN in order to get the execution plan to be optimal (swap left and right inputs, but introduce the table at the correct point in the join order).
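Alternatively, a query-level hint forces the hash physical join for every join in the statement without fixing the join order:
SELECT *
FROM TableA A
JOIN TableB B
    ON A.id = B.a_id         -- hypothetical join columns
OPTION (HASH JOIN);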
It is generally best to avoid using join hints and forcing join order, so do whatever else you can first! You could look into the indexes on the tables, fragmentation, reducing column sizes (such as using varchar instead of nvarchar where Unicode is not required), or splitting the query into parts (insert to a temp table first, then join to that).
I would not recommend trying to "fix" the plan by forcing hints in one direction or another. Instead, you need to look at your indexes, statistics and the TSQL code to understand why you have a table spool loading up 1.2 billion rows from 19,000.

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used, but whatever value I pass to EXPLAIN, it optimizes the where clause away to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell mysql to not do constant optimizations in explain?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that results from constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called), this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it re-use them. Even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its where clause. Sometimes the number will result in an impossible where clause; other times it won't. How can I use explain to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN on the full matrix of exist/non-exist cases. That isn't a very good solution, though, as it would be both hard and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example say my client side code is generating a query with a number in its where clause.
Sometimes the number will result in an impossible where clause, other times it won't.
How can I use explain to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
In this article you can read a list of when the MySQL optimizer does what:
Action - When
Query parse - PREPARE
Negation elimination - PREPARE
Subquery re-writes - PREPARE
Nested JOIN simplification - First EXECUTE
OUTER->INNER JOIN conversions - First EXECUTE
Partition pruning - Every EXECUTE
COUNT/MIN/MAX elimination - Every EXECUTE
Constant subexpression removal - Every EXECUTE
Equality propagation - Every EXECUTE
Constant table detection - Every EXECUTE
ref access analysis - Every EXECUTE
range/index_merge analysis and optimization - Every EXECUTE
Join optimization - Every EXECUTE
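The PREPARE and EXECUTE stages above correspond to MySQL's prepared statement interface; for example, reusing tbl and col from the original question:
-- Parsed once at PREPARE; most optimization happens at (each) EXECUTE
PREPARE stmt FROM 'SELECT * FROM tbl WHERE col = ?';
SET @v := 42;
EXECUTE stmt USING @v;
DEALLOCATE PREPARE stmt;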
There is one more thing missing from this list: MySQL can rebuild part of the query plan on every JOIN iteration, the so-called range checking for each record.
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1, and decide whether to use RANGE access on the fly.
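You can recognize this strategy in EXPLAIN output; MySQL reports it in the Extra column:
-- For the rewritten query, EXPLAIN's Extra column shows something like:
-- "Range checked for each record (index map: 0x1)"
EXPLAIN
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound;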
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD
Returning to your question: there is no way to tell which plan MySQL will use for every given constant, since there is no plan before the constant is given.
Unfortunately, there is no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE.
On the other hand, for all JOINs, MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only to the query but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on combinations of columns, if you always query the given columns together). If you have indexes, the query planner will potentially use them.
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (for example, asking for a numeric column to be equal to a text value)
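A small sketch tying this together (tbl and col come from the question; the index name is hypothetical):
-- With an index in place, the planner can reject out-of-range constants cheaply
CREATE INDEX idx_tbl_col ON tbl (col);
EXPLAIN SELECT * FROM tbl WHERE col = 42;          -- likely ref access on idx_tbl_col
EXPLAIN SELECT * FROM tbl WHERE col = 999999999;   -- may be recognized as matching no rows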
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since conditions might have changed since the plan was generated and a better plan might exist.