MySQL performance of LIKE (without wildcards) vs =

I'm going to start off by saying that I know you can't use indexes for LIKE queries if the value starts with a wildcard. This is NOT a question about that. I'm not using any wildcards.
In an application that allows users to pass wildcards into queries, the value is passed to a query's LIKE clause. I've done some testing and have come to the conclusion that when searching for an exact address (so no wildcards) the query runs slower than when I'm using an =. Take the following two queries:
SELECT id FROM users WHERE email LIKE 'user#host.tld'
vs
SELECT id FROM users WHERE email = 'user#host.tld'
Both queries return the exact same records. When doing EXPLAIN on both, I can see that they both use the index on the email field. The main difference is that the LIKE query uses a range access type, while the = query uses a ref access type. Also, the range query examines some 1,000 records, whereas the = query examines only 1 record (out of 2 million records in the table).
The profiles of the queries are the same, except that the LIKE query spends significantly more time in the "sending data" step, where it is actually examining those 1,000 records. So basically, the query is slower because it touches more data.
The thing I don't get is why it is doing that. Since the range query uses the exact same index, and exactly the same set of matches should be returned from the index, why is it examining more rows? This is probably a question about the internals of how a range query uses an index versus how a ref query does, but I can't seem to find any detailed information about it.

Q: why ... is [MySQL Optimizer] doing that?
A:
The short answer is that the optimizer is not converting the LIKE with no wildcards into an = operation.
The MySQL optimizer uses ref access only for = and <=> comparisons.
The MySQL optimizer can use range access for many more operations, including =, <=>, <, <=, >, >=, BETWEEN, ...
A predicate like col LIKE 'foo' is handled as equivalent to
col >= 'foo' AND col <= 'foo'
We look at that and say it's the same as col = 'foo', but the optimizer doesn't see it that way. The optimizer's approach probably makes more sense if we use a wildcard. For example:
col LIKE 'foo%bar'
MySQL could use the foo portion for the "range" part of the scan, akin to this:
col >= 'foo' AND col < 'fop'
MySQL optimizer can use an index range scan to satisfy the >= and < comparison.
(I use fop here as a simplistic representation of the lowest "higher weighted" string in the collating sequence. We don't need to dive into character sets and collating sequences; as a short justification of my use of 'fop', with the latin1_swedish_ci collating sequence:
SELECT HEX(WEIGHT_STRING(_latin1'foo' AS CHAR(3))) AS ws_foo
     , HEX(WEIGHT_STRING(_latin1'fop' AS CHAR(3))) AS ws_fop
)
And for the rows that are found by the index range scan, the rest of the matching can be performed, akin to
SUBSTR(col,4) LIKE '%bar'
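As an illustration of that prefix-to-range rewrite, here is a small Python sketch. This is my own simplification: it assumes plain code-point ordering and a single-character increment with no carry, not a real MySQL collation.

```python
def prefix_upper_bound(prefix: str) -> str:
    """Smallest string that sorts after every string starting with
    `prefix`. Simplified: assumes code-point ordering and that the
    last character is not at the top of the range (no carry)."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

def in_prefix_range(value: str, prefix: str) -> bool:
    """col LIKE 'foo%' expressed as the two range comparisons."""
    return prefix <= value < prefix_upper_bound(prefix)

print(prefix_upper_bound("foo"))           # fop
print(in_prefix_range("football", "foo"))  # True
print(in_prefix_range("fop", "foo"))       # False
```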
I'm not saying that this is exactly how the MySQL optimizer is operating. (I haven't reviewed the source code.)
I'm suggesting that the MySQL optimizer does not treat col LIKE 'foo' the same as col = 'foo', and the primary reason for that is the potential for wildcard characters.
If we want col = 'foo' performance, we should write col = 'foo'.
We pay a price for a range scan when we opt for the flexibility of the LIKE comparison.
And we pay an even higher price (a full index scan, shown as type index in the EXPLAIN output) when we use a regular expression: col REGEXP '^foo$'.
EDIT
Even with the difference shown in the EXPLAIN plan, I wouldn't expect any measurable difference in performance of these two statements:
SELECT SQL_NO_CACHE id FROM users WHERE email LIKE 'user#host.tld'
SELECT SQL_NO_CACHE id FROM users WHERE email = 'user#host.tld'
For evaluating performance, I would run the statements four (or more) times in a row, capturing the execution time of each statement run, and throw out the result from the first run. Average the execution time of the runs except for the first. (We'd expect the execution times of the subsequent runs to be very close to each other.)
Note that other concurrent operations on the database could impact the performance of the statement we're measuring.
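The measurement procedure described above can be sketched as a small Python harness. This is a generic illustration; `run_query` would wrap whatever client call actually executes the statement.

```python
import time

def benchmark(run_query, runs=4):
    """Time `run_query` several times, discard the first (cold-cache)
    run, and average the rest -- the approach described above.
    `run_query` is any zero-argument callable."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    warm = timings[1:]          # throw out the cold first run
    return sum(warm) / len(warm)

# Example with a trivial stand-in for a real query:
avg = benchmark(lambda: sum(range(10_000)))
print(f"avg warm time: {avg:.6f}s")
```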

The Optimizer...
(effectively) turns a LIKE without any wild cards into =.
turns IN (one-item) into =.
turns LIKE with a trailing % (as the only wildcard) into a range test.
cannot optimize LIKE in most other situations with wildcards.
These optimizations are useless without a relevant INDEX.
"Sending data" is a useless metric.
Running a query the first time may have to load stuff from disk; the second time it will find stuff cached in RAM, hence be much faster.
EXPLAIN's "Rows" is an estimate; don't jump to any conclusions if the value varies by less than a factor of 2.
An = drills down the BTree to find the first matching row. Then it scans forward to find any more matching rows.
Ditto for a "range" (BETWEEN or LIKE 'foo%' or ...) -- drill down to find the first (or last) item in the range, then scan forward (or backward). Backward scanning happens if the Optimizer can use ORDER BY .. DESC at the same time.
(spencer7593's Answer goes into more detail.)
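The drill-down-then-scan pattern described above can be sketched in Python, with a sorted list standing in for the index's ordered entries. This is an illustration only, not InnoDB's actual BTree code.

```python
import bisect

# A sorted list stands in for the index's leaf level: drilling down
# the BTree is modeled by bisect_left (O(log n)); the forward scan
# walks consecutive entries.
index = sorted(["alice@a.tld", "bob@b.tld", "bob@b.tld", "carol@c.tld"])

def index_lookup(index, key):
    """Find the first entry equal to `key`, then scan forward for
    duplicates -- the access pattern described for both = and range."""
    pos = bisect.bisect_left(index, key)   # drill down to first match
    matches = []
    while pos < len(index) and index[pos] == key:
        matches.append(pos)                # scan forward
        pos += 1
    return matches

print(index_lookup(index, "bob@b.tld"))    # [1, 2]
```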

Related

MySQL where on indexed column and not indexed behavior [duplicate]

Say that I have a long, expensive query, packed with conditions, searching a large number of rows. I also have one particular condition, like a company id, that will limit the number of rows that need to be searched considerably, narrowing it down to dozens from hundreds of thousands.
Does it make any difference to MySQL performance whether I do this:
SELECT * FROM clients WHERE
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar) AND
company = :ugh
or this:
SELECT * FROM clients WHERE
company = :ugh AND
(firstname LIKE :foo OR lastname LIKE :foo OR phone LIKE :foo) AND
(firstname LIKE :bar OR lastname LIKE :bar OR phone LIKE :bar)
Here is a demo showing that the order of WHERE clause conditions can make a difference due to short-circuiting. It runs the following queries:
-- query #1
SELECT myint FROM mytable WHERE myint >= 3 OR myslowfunction('query #1', myint) = 1;
-- query #2
SELECT myint FROM mytable WHERE myslowfunction('query #2', myint) = 1 OR myint >= 3;
The only difference between these is the order of operands in the OR condition.
myslowfunction deliberately sleeps for a second and has the side effect of adding an entry to a log table each time it is run. Here are the results of what is logged when running the two queries:
myslowfunction called for query #1 with value 1
myslowfunction called for query #1 with value 2
myslowfunction called for query #2 with value 1
myslowfunction called for query #2 with value 2
myslowfunction called for query #2 with value 3
myslowfunction called for query #2 with value 4
The above shows that a slow function is executed more times when it appears on the left side of an OR condition when the other operand isn't always true.
So IMO the answer to the question:
Does the order of conditions in a WHERE clause affect MySQL performance?
is "Sometimes it can do."
No, the order should not make a large difference. When finding which rows match the condition, the condition as a whole (all of the sub-conditions combined via boolean logic) is examined for each row.
Some intelligent DB engines will attempt to guess which parts of the condition can be evaluated faster (for instance, things that don't use built-in functions) and evaluate those first, and more complex (estimatedly) elements get evaluated later. This is something determined by the DB engine though, not the SQL.
The order of columns in your where clause shouldn't really matter, since MySQL will optimize the query before executing it. But I suggest you read the chapter on Optimization in the MySQL reference manual, to get a basic idea on how to analyze queries and tables, and optimize them if necessary. Personally though, I would always try to put indexed fields before non-indexed fields, and order them according to the number of rows that they should return (most restrictive conditions first, least restrictive last).
Mathematically, yes, it has an effect -- not only in SQL queries, but in all programming languages, whenever there is an expression involving and / or.
This is the theory of complete versus partial (short-circuit) evaluation.
If it's an and expression and the first operand evaluates to false, it will not check further, as and-ing false with anything yields false.
Similarly, in an or expression, if the first operand is true it will not check further.
A sophisticated DBMS should be able to decide on its own which WHERE condition to evaluate first. Some databases provide tools to display the "strategy" of how a query is executed. In MySQL, for example, you can put EXPLAIN in front of a query. The DBMS then prints the actions it performed to execute the query, e.g. an index or full-table scan. So you can see at a glance whether or not it uses the index on 'company' in both cases.
This shouldn't have any effect, but if you aren't sure, why don't you simply try it out? The order of WHERE clauses in a SELECT from a single table makes no difference, but if you join multiple tables, the order of the joins can (sometimes) affect the performance.
I don't think the order of the where clause has any impact. I think the MySQL query optimizer will reorganize where clauses as it sees fit so it filters away the largest subset first.
It's another deal when talking about joins. The optimizer tries to reorder here too, but doesn't always find the best way and sometimes doesn't use indexes. STRAIGHT_JOIN and FORCE INDEX let you take charge of the query.
No, it doesn't: the required tables are selected and then evaluated row by row, so the order can be arbitrary.

How does a MySQL `SELECT ... WHERE ... IN` query work? Why is a larger number of parameters faster?

I am doing a query with SELECT ... WHERE ... IN, and found something unexpected:
SELECT a,b,c FROM tbl1 where a IN (a1,a2,a3...,a5000) --> this query took 1.7ms
SELECT a,b,c FROM tbl1 where a IN (a1,a2,a3...,a20) --> this query took 6.4ms
What is the algorithm behind IN? Why is the larger parameter list faster than the smaller one?
The following is a guess...
For every SQL query, the optimizer analyzes the query to choose which index to use.
For multi-valued range queries (like IN(...)), the optimizer performs an "index dive" for each value in the short list, trying to estimate whether it's a good idea to use the index. If you are searching for values that are too common, it's more efficient to just do a table-scan, so there's no need to use the index.
MySQL 5.6 introduced a special optimization, to skip the index dives if you use a long list. Instead, it just guesses that the index on your a column may be worth using, based on stored index statistics.
You can control how long of a list causes the optimizer to skip index dives with the eq_range_index_dive_limit option. The default is 10. Your example shows a list of length 20, so I'm not sure why it's more expensive.
Read the manual about this feature here: https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#equality-range-optimization
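To see or change that threshold on a given server (variable name per the MySQL manual; I believe the default was raised from 10 to 200 in MySQL 5.7):

```sql
-- Check the current threshold
SHOW VARIABLES LIKE 'eq_range_index_dive_limit';

-- Raise it for the current session so longer IN() lists still get
-- index dives instead of statistics-based estimates
SET SESSION eq_range_index_dive_limit = 100;
```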

Big-O of MySQL Fuzzy Search

What is the Big-O of MySQL Fuzzy Search? Does it vary by index type, if so, what performs the best?
e.g. SELECT * FROM foo WHERE field1 LIKE '%ello Wo%';
I'm unsure of the underlying data type and what kind of magic it possesses. Something like a trie (https://en.wikipedia.org/wiki/Trie) would be nice for searches that are fuzzy only at the end, e.g. LIKE 'Hello Wo%'.
I'm guessing the Big-O is O(n) but wish to confirm. There may even be differences between fuzzy searches too, e.g. %ello Wo% vs. Hello W% vs. %lo World vs. %ell%o%W%or%
Are there different ways to index that give better performance? If yes, for particular cases, can you please share?
With a leading wildcard
MySQL will
Scan all the rows in the table (not an index). This is called a "table scan". (This assumes no other filtering is going on.)
For each row, scan the column in question for the LIKE;
Deliver the rows not filtered out.
Most of the time is spent in Step 1, which is O(N) where N is the number of rows. Much less time is spent in steps 2 and 3.
Without a leading wildcard
Use an index on that column, if you have one, to limit the number of rows to search. If you have an index on the column and are saying WHERE col LIKE 'Hello W%', it will find all the rows in the index starting with Hello W. They will be consecutive in the index, making this step faster.
For each of those, reach into the Data for the row and do whatever else is required.
There are a number of variables (caching, number of rows, randomness of rows, etc.) that determine whether the first step is more or less costly than the second. But this is likely to be much faster than the leading-wildcard case -- O(n), where n is now only the number of rows starting with 'Hello W'.
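As a rough, hands-on illustration of the difference, the snippet below uses SQLite rather than MySQL, since its EXPLAIN QUERY PLAN makes the scan-vs-search distinction easy to see; the same principle applies.

```python
import sqlite3

# SQLite stand-in for the MySQL behavior discussed above: a leading
# wildcard forces a scan, a trailing-only wildcard can use the index.
# (case_sensitive_like makes LIKE eligible for the index optimization
# in SQLite.)
con = sqlite3.connect(":memory:")
con.execute("PRAGMA case_sensitive_like = ON")
con.execute("CREATE TABLE foo (field1 TEXT)")
con.execute("CREATE INDEX ix_field1 ON foo(field1)")

def plan(sql):
    """Return the textual query plan (4th column of each plan row)."""
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT * FROM foo WHERE field1 LIKE '%ello Wo%'"))  # contains "SCAN"
print(plan("SELECT * FROM foo WHERE field1 LIKE 'Hello Wo%'"))  # contains "SEARCH"
```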

Does long query string affect the speed?

Suppose I have a long query string, e.g.
SELECT id from users where collegeid='1' or collegeid='2' . . . collegeid='1000'
Will it affect the speed or output in any way?
SELECT m.id,m.message,m.postby,m.tstamp,m.type,m.category,u.name,u.img
from messages m
join users u on m.postby=u.uid
where m.cid = '1' or m.cid = '2' . . . . . .
or m.cid = '1000'. . . .
I would prefer to use IN in this case. To compare performance, look at the execution plan of each query; that will give you an idea of the difference between the two.
Something like this:
SELECT id from users where collegeid IN ('1','2','3'....,'1000')
According to the MySQL manual:
If all values are constants, they are evaluated according to the type
of expr and sorted. The search for the item then is done using a
binary search. This means IN is very quick if the IN value list
consists entirely of constants.
The number of values in the IN list is only limited by the
max_allowed_packet value.
You may also check IN vs OR in the SQL WHERE Clause and MYSQL OR vs IN performance
The answer given by Ergec is very useful:
SELECT * FROM item WHERE id = 1 OR id = 2 ... id = 10000
This query took 0.1239 seconds
SELECT * FROM item WHERE id IN (1,2,3,...10000)
This query took 0.0433 seconds
IN is 3 times faster than OR
will it affect the speed or output in any way?
So the answer is: yes, the performance will be affected.
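The sort-then-binary-search behavior the manual describes can be sketched in Python. This is an illustration of the mechanism, not MySQL's actual implementation.

```python
import bisect

# The IN() constants are sorted once up front; each row's value is
# then checked with a binary search (O(log k)) instead of k separate
# OR comparisons (O(k)).
in_list = sorted([7, 3, 1000, 42, 1])

def in_check(value, sorted_consts):
    """Binary-search membership test over the sorted constant list."""
    pos = bisect.bisect_left(sorted_consts, value)
    return pos < len(sorted_consts) and sorted_consts[pos] == value

print(in_check(42, in_list))   # True
print(in_check(5, in_list))    # False
```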
Obviously, there is no direct correlation between the length of a query string and its processing time (a very short query can be tremendously complex and vice versa). For your specific example: it depends on how the query is processed. This is something you can check by looking at the query execution plan (syntax depends on your DBMS; something like EXPLAIN PLAN). If the DBMS has to perform a full table scan, performance will only be affected slightly, since the DBMS has to visit all pages that make up the table anyhow. If there is an index on collegeid, performance will likely suffer more the more entries you put into your disjunction, since there will be several (though very fast) index lookups. At some point, the DBMS will do a full index scan instead of individual lookups, at which point performance will not degrade significantly anymore.
However, the details depend on your DBMS and its execution planner.
I'm not sure you are facing the same issue I did.
Actually, string length is not the problem; how many values are in IN() matters more.
I've tested how many elements can be listed in IN().
The result: about 10,000 elements can be processed without performance loss.
The values in IN() have to be stored somewhere and searched during query evaluation, and beyond roughly 10k values things start getting slower.
So if you have, say, 100k values, split them into 10 groups and run the query 10 times, or save them in a temp table and JOIN against it.
Also, a long query uses more CPU, so IN() is better than column = 1 OR ...
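The split-into-groups suggestion can be sketched like this (a hypothetical helper, not a MySQL feature):

```python
def chunked(values, size=10_000):
    """Split a huge value list into batches of at most `size`;
    each batch would then be used to build one IN() query."""
    for i in range(0, len(values), size):
        yield values[i:i + size]

ids = list(range(100_000))
batches = list(chunked(ids))
print(len(batches))     # 10
print(len(batches[0]))  # 10000

# Each batch feeds one query, e.g.:
# SELECT ... WHERE id IN (...batch...)
```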

How do I optimize MySQL's queries with constants?

NOTE: the original question is moot but scan to the bottom for something relevant.
I have a query I want to optimize that looks something like this:
select cols from tbl where col = "some run time value" limit 1;
I want to know what keys are being used but whatever I pass to explain, it is able to optimize the where clause to nothing ("Impossible WHERE noticed...") because I fed it a constant.
Is there a way to tell mysql to not do constant optimizations in explain?
Am I missing something?
Is there a better way to get the info I need?
Edit: EXPLAIN seems to be giving me the query plan that results from constant values. As the query is part of a stored procedure (and IIRC query plans in sprocs are generated before they are called) this does me no good, because the values are not constant. What I want is to find out what query plan the optimizer will generate when it doesn't know what the actual value will be.
Am I missing something?
Edit2: Asking around elsewhere, it seems that MySQL always regenerates query plans unless you go out of your way to make it reuse them, even in stored procedures. From this it would seem that my question is moot.
However, that doesn't make what I really wanted to know moot: how do you optimize a query that contains values that are constant within any specific query, but where I, the programmer, don't know in advance what value will be used? For example, say my client-side code is generating a query with a number in its WHERE clause. Sometimes the number will result in an impossible WHERE clause, other times it won't. How can I use EXPLAIN to examine how well optimized the query is?
The best approach I'm seeing right off the bat would be to run EXPLAIN on the full matrix of exist/non-exist cases. That isn't a very good solution, though, as it would be both tedious and error-prone to do by hand.
You are getting "Impossible WHERE noticed" because the value you specified is not in the column, not just because it is a constant. You could either 1) use a value that exists in the column or 2) just say col = col:
explain select cols from tbl where col = col;
For example, say my client-side code is generating a query with a number in its WHERE clause.
Sometimes the number will result in an impossible WHERE clause, other times it won't.
How can I use EXPLAIN to examine how well optimized the query is?
MySQL builds different query plans for different values of bound parameters.
This article lists when the MySQL optimizer does what:
Action When
Query parse PREPARE
Negation elimination PREPARE
Subquery re-writes PREPARE
Nested JOIN simplification First EXECUTE
OUTER->INNER JOIN conversions First EXECUTE
Partition pruning Every EXECUTE
COUNT/MIN/MAX elimination Every EXECUTE
Constant subexpression removal Every EXECUTE
Equality propagation Every EXECUTE
Constant table detection Every EXECUTE
ref access analysis Every EXECUTE
range/index_merge analysis and optimization Every EXECUTE
Join optimization Every EXECUTE
There is one more thing missing in this list.
MySQL can rebuild a query plan on every JOIN iteration: the so-called "range checking for each record".
If you have a composite index on a table:
CREATE INDEX ix_table2_col1_col2 ON table2 (col1, col2)
and a query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 = t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, MySQL will NOT use an index RANGE access from (t1.value1, t1.value2_lowerbound) to (t1.value1, t1.value2_upperbound). Instead, it will use an index REF access on (t1.value1) and just filter out the wrong values.
But if you rewrite the query like this:
SELECT *
FROM table1 t1
JOIN table2 t2
ON t2.col1 <= t1.value1
AND t2.col1 >= t1.value1
AND t2.col2 BETWEEN t1.value2_lowerbound AND t1.value2_upperbound
, then MySQL will recheck index RANGE access for each record from table1 and decide whether to use RANGE access on the fly.
You can read about it in these articles in my blog:
Selecting timestamps for a time zone - how to use coarse filtering to filter out timestamps without a timezone
Emulating SKIP SCAN - how to emulate SKIP SCAN access method in MySQL
Analytic functions: optimizing LAG, LEAD, FIRST_VALUE, LAST_VALUE - how to emulate Oracle's analytic functions in MySQL
Advanced row sampling - how to select N records from each group in MySQL
All these things employ RANGE CHECKING FOR EACH RECORD
Returning to your question: there is no way to tell which plan MySQL will use for any given constant, since there is no plan before the constant is given.
Unfortunately, there is also no way to force MySQL to use one query plan for every value of a bound parameter.
You can control the JOIN order and the indexes chosen by using the STRAIGHT_JOIN and FORCE INDEX clauses, but they will not force a certain access path on an index or forbid the IMPOSSIBLE WHERE optimization.
On the other hand, for all JOINs MySQL employs only nested loops. That means that if you build the right JOIN order or choose the right indexes, MySQL will probably benefit from all IMPOSSIBLE WHEREs.
How do you optimize a query with values that are constant only within the query, but where I, the programmer, don't know in advance what value will be used?
By using indexes on the specific columns (or even on combination of columns if you always query the given columns together). If you have indexes, the query planner will potentially use them.
Regarding "impossible" values: the query planner can conclude that a given value is not in the table from several sources:
if there is an index on the particular column, it can observe that the particular value is larger or smaller than any value in the index (min/max values take constant time to extract from indexes)
if you are passing in the wrong type (if you are asking for a numeric column to be equal with a text)
PS. In general, creating a query plan is not expensive, and it is better to re-create plans than to re-use them, since conditions might have changed since the plan was generated and a better plan might exist.