Why would a NOT IN condition be slower than IN in MySQL?

I was asked this question a few days back, in an interview. What is the reason for this slowness?

According to the documentation, IN with a list of constants is implemented with a binary search:
The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
It is not explicitly stated that NOT IN is handled the same way, but it is reasonable to assume based on this:
expr NOT IN (value,...)
This is the same as NOT (expr IN (value,...)).
That would leave two possible differences in performance. The first would be index usage. Because NOT IN is documented as essentially using IN, I would expect the index optimizations to be the same. This is an expectation, but it might not be correct.
The second is that there are fewer comparisons with the NOT IN than with the IN. This would lead to the NOT IN being faster -- but the difference is less than microscopic given the binary search method used by MySQL. In other words, this should not make a difference.
Finally, if you are using a subquery, then all bets are off. The NOT IN has to process the entire list to check if the value is not really in the list. IN can stop processing at the first match. Under some circumstances, this could have an impact on performance.
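As a rough sketch of that difference (the table and column names here are invented for illustration):
-- Constant lists: the constants are sorted once and binary-searched per row.
SELECT id FROM orders WHERE status IN ('new', 'paid', 'shipped');
SELECT id FROM orders WHERE status NOT IN ('new', 'paid', 'shipped');
-- Subquery form: NOT IN must rule out every value the subquery produces
-- (and a single NULL among those values changes the result), while IN can stop at the first match.
SELECT id FROM orders WHERE customer_id NOT IN (SELECT customer_id FROM blacklist);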

It depends on your data.
And I found another answer for you:
Use "IN" as it will most likely make the DBMS use an index on the
corresponding column.
"NOT IN" could in theory also be translated into an index usage, but
in a more complicated way which DBMS might not "spend overhead time"
using.
From : https://stackoverflow.com/a/16892027/3444315

It's because NOT IN is simply NOT (IN). It's always one more step than IN: NOT IN takes the result of IN and then negates it.

Related

Is wildcard LIKE more performant than a multiple boolean search in MySQL?

I have been building an API for a website, and the objects I am searching for have a LOT of true/false fields. Instead of creating a huge db structure to manage the options, I thought about serializing them in a string similar to '001001000010101001', where 1 is true and 0 is false (I am talking about 100 different options). The other reason I am doing this is to have a clean database, so that all of those fields get grouped in a single field (I already have a serializer/deserializer).
Now in the search function, since not all of the options get searched at the same time, I was thinking about using a LIKE statement with wildcards.
For example I would do something like this:
WHERE options LIKE '1_1__1__1___1%' (The final wildcard is to reduce the number of _ wildcards so that only the beginning of the pattern gets matched. I would stop at the last option I need to check and % wildcard the rest.)
Would this (on average, because sometimes there might be 2 or 3 parameters selected and many times there might be all of them) be more performant than a series of AND xxx AND xxx AND ...?
Or is there a far more efficient (and cleaner to maintain) way of doing this that I am completely missing?
Let me discuss some of the items you bring up.
I understand the 0/1 approach. I find it interesting that you chose to go with strings instead of numbers. For up to 64 true/false values, a BIGINT (8 bytes) or something smaller would work.
Did you also want to test for false for some flags? Your string mechanism makes that rather obvious. (Your example should include a 0.)
Searching with LIKE will be efficient only if there is no leading wildcard (_ or %). So, I would expect most of your queries to involve a full table scan. Your serializer, plus the b prefix, would work for setting the strings.
The integer approach that I am countering with would involve & and other boolean operations to test. This would probably be noticeably faster, though it would still necessitate a full table scan.
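A rough sketch of the two flavours, using hypothetical table and column names:
-- String approach: one character per option, tested positionally (full table scan).
SELECT id FROM listings WHERE options LIKE '1_1__1%';
-- Integer approach: options 0, 2 and 5 packed into one integer column and tested with
-- a bitwise AND (still a full scan, but the per-row test is cheaper).
SELECT id FROM listings WHERE (flags & 0b100101) = 0b100101;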
If there are other columns being tested in the WHERE clause, let's see them. They will be important to performance and indexing.
Using numbers instead of strings would be slightly faster because of smaller size and faster testing. But not a lot.
You can get some of the benefits of numeric by using the SET datatype. (Again limited to 64 independent flags per column.) The advantage is that you talk to the database via strings, not cryptic bit positions.
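A sketch of the SET variant (the flag names are made up):
-- Up to 64 named flags per SET column; the values travel as readable strings.
ALTER TABLE listings ADD features SET('second_dishwasher', 'jacuzzi', 'pool') NOT NULL DEFAULT '';
-- Testing one flag; like the other variants, this is still a full-scan predicate.
SELECT id FROM listings WHERE FIND_IN_SET('jacuzzi', features) > 0;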
If this is a real estate app, consider a few columns like: kitchen, bedrooms, unusual_items (second dishwasher, jacuzzi), etc. No matter how it is implemented (string, integer, SET), this suggestion won't impact performance much.
Another performance note, again with a real estate model: since #bedrooms is almost always a criterion, make it a column by itself. This may allow for some use of it in a composite index.

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I had an interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one), which will contain only those scalars. My answer was accepted, and there was a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and too widely used for a modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop. No joins are done. So, I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position. But I didn't take much care about test purity, and that Postgres instance was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that should be a match performance-wise, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN () treatment. I didn't find any proofs of the inverse either, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where the list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
I think in some cases query parsing time (if it was not prepared) alone can kill performance.
Also, databases could be unable to prepare an IN(?) query, which would lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning are not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
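As a sketch of the two shapes being compared (names are invented; a real list would hold thousands of values):
-- Constant list: MySQL sorts the constants once and binary-searches them per row.
SELECT t.* FROM big_table t WHERE t.x IN (1, 7, 42 /* , ... thousands more */);
-- Temporary-table variant: index the list and join against it.
CREATE TEMPORARY TABLE tmp_x (x INT PRIMARY KEY);
INSERT INTO tmp_x VALUES (1), (7), (42); -- load the full list here
SELECT t.* FROM big_table t INNER JOIN tmp_x USING (x);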
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such a case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Either way it results in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as we have distinct numbers of IN values: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join as you mentioned (unless the loading overhead is too high), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is larger, I would consider using EXISTS (select 1 from mytable m where m.key = x.key) or EXISTS (select x from foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
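For the EXISTS variant against a list loaded into a table, a sketch might look like this (names are made up):
SELECT t.*
FROM big_table t
WHERE EXISTS (SELECT 1 FROM tmp_x i WHERE i.x = t.x);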

Force binary in query

I have a field in my database for which, the majority of the time, case sensitivity doesn't matter.
However there are specific circumstances where this isn't the case.
So the options as I see it are: add the BINARY keyword to the query,
select id from table where BINARY fieldname = 'tesT'
or alter the table to make it a case-sensitive field.
The question is: does using the BINARY keyword cause additional overhead on the query, or is this negligible?
(it would actually be used in an update where statement)
update table set field='tesT' where BINARY anotherfield='4Rtrtactually '
The short answer is that BINARY has been recommended in a similar SO question; however, whether the overhead is negligible depends on a few things.
Part one is how you define negligible. If you mean performance, then define the return time(s) you would consider negligible and try selects both with and without BINARY to see what kind of return times you get (these need to be cases where case won't matter). Go further by running an EXPLAIN on each to root out any difference.
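For example (using the question's own column name, with mytable standing in for the real table):
EXPLAIN SELECT id FROM mytable WHERE BINARY fieldname = 'tesT';
EXPLAIN SELECT id FROM mytable WHERE fieldname = 'tesT';
-- Differences in the type, key and rows columns show whether BINARY changed the plan,
-- e.g. whether an index on fieldname is no longer used.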
The second part is that it depends on how big the table is and how the WHERE clause column is indexed. Generally speaking, the more rows there are, the more prominent any difference will be. So if you're looking at, say, ten records in the table, then it's probably nothing, given you're doing a single-table update. With ten million records there is a better chance it will be noticeable.
The third part is based on Yann Neuhaus' comment in the MySQL 5.7 reference guide. It might depend on exactly where you put the BINARY keyword. Assuming his findings are accurate, putting BINARY on the constant side of the equals sign might be faster (or at least different). So test a rewrite of your update to:
update table set field='tesT' where anotherfield = BINARY '4Rtrtactually '
To be honest I have my doubts about this third part, but I've seen weirder MySQL magic. I'm kind of curious what the results would be for your scenario if you test it out.
The final part is what type of collation you are using. Based on your question it seems you are using a case-insensitive collation. Depending on how you are using this data, you might want to look into a case-sensitive or binary collation. This SO question was educational for me on this subject.
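If you go the collation route instead of sprinkling BINARY into queries, a sketch (the length and character set here are assumptions; the column names follow the question, with mytable standing in for the real table):
-- Make only this column case sensitive at the collation level:
ALTER TABLE mytable MODIFY anotherfield VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
-- Comparisons on anotherfield are now case sensitive without the BINARY keyword:
UPDATE mytable SET field = 'tesT' WHERE anotherfield = '4Rtrtactually ';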

Will the performance be worse if bitwise operations are used in MYSQL?

There are several boolean columns in one table. For convenience in my code, I want to use one column to replace all of them. In the new column, each bit represents an original boolean column. But if I do it like this, I must do a bitwise operation in the WHERE clause, so I can't use an index for it.
Will the performance be worse if bitwise operations are used in MYSQL?
You've already answered your question ("so I can't use an index for it"): yes, in most cases performance will be harmed.
I must do a bitwise operation in the WHERE clause, so I can't use an index for it.
You answered your own question. If your query does not use an index, MySQL will have to check every actual stored value to see if it matches, which will be a lot slower.
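A minimal illustration with hypothetical tables and columns:
-- flags packs the old boolean columns into one integer; the bitwise test cannot use
-- an index on flags, so every row has to be examined:
SELECT id FROM items WHERE (flags & 4) = 4;
-- A dedicated, indexed boolean column can be looked up directly:
SELECT id FROM items WHERE is_active = 1;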

MySQL - AND condition

Let's say I have a query like this:
SELECT bla WHERE foo LIKE '%bar%' AND boo = 'bar' AND whatvr IN ('foo', 'bar')...
I was wondering if MySQL continues to check all conditions when retrieving results.
For example, if foo is not LIKE '%bar%', will it continue to check whether boo = 'bar', and so on?
Would it be any faster if I put conditions that are less likely to be true at the end?
I'm sorry if this seems to be stupid question, I'm a complete noob when it comes to SQL :)
I don't think there are any guarantees about whether or not multiple conditions will be short-circuited, but...
In general, you should treat the query optimiser as a black box and assume -- unless you have evidence to the contrary -- that it will do its job properly. The optimiser's job is to ensure that the requested data is retrieved as efficiently as possible. If the most efficient plan involves short-circuiting then it'll do it; if it doesn't then it won't.
(Of course, query optimisers aren't perfect. If you have evidence that a query isn't being executed optimally then it's often worth re-ordering and/or re-stating the query to see if anything changes.)
What you're looking for is documentation on MySQL's short-circuit evaluation. I have, however, not been able to find anything better than people who also could not find the documentation but who claim to have tested it and found it to be true, i.e., that MySQL short-circuits.
Would it be any faster if I put conditions that are less likely to be true at the end?
No, the optimizer will try and optimize (!) the order of processing. So, as for the order of tests, you should not assume anything.
I would not count on that: see Where Optimisations. That link explains that other criteria take precedence over the order.
You can't rely on MySQL evaluating conditions from left to right (as opposed to any programming language). This is because the "WHERE clause optimizer" looks for columns that are indexed and will look for this subset first.
For query optimization see the chapter Optimizing SELECT Statements in the MySQL reference manual.
If it does short-circuit when the first condition fails (which is most likely), it would be best to put the conditions that are most likely to fail first! (See the query sketch after the two cases below.)
Let's say we have 3 conditions, and all must be true (separated by "AND").
slow case:
1. Never fails. All rows are looked through, and it succeeds.
2. Sometimes fails. All rows are looked through, and it still succeeds.
3. Often fails. All rows are looked through, and this time it fails.
Result: it took a while, but no match was found.
fast case:
1. Often fails. All rows are looked through, and the match fails.
2. Sometimes fails. NOT evaluated, because checking ended due to the short-circuit.
3. Never fails. NOT evaluated, because checking ended due to the short-circuit.
Result: quickly determined that there is no match.
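Using the original query's shape, the fast-case ordering would look roughly like this (t is a stand-in table name, and this only helps if MySQL really does evaluate left to right, which is not guaranteed):
SELECT bla FROM t
WHERE boo = 'bar'                -- often fails: cheap test first
  AND whatvr IN ('foo', 'bar')   -- sometimes fails
  AND foo LIKE '%bar%';          -- rarely fails: expensive test last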
Correct me if I'm wrong.
I can imagine that all conditions are checked for each row looked at, which would make this matter a lot less.
If your fields are ordered in the same order as your conditions, you could maybe measure the difference.