Why is it not recommended to use "LIKE" in SQL? - mysql

I was recently told that it is not recommended to use the "LIKE" keyword in SQL. Is this true? If so, why? And if it is true, are there any alternatives to it?

The reason is primarily performance. However, on the other side of the argument, LIKE is standard SQL and should work in all databases. Because LIKE has to parse the pattern string, it is a bit less efficient than looking for a substring in a longer string (using charindex or instr or your database's favorite function). However, processors are so fast that this rarely makes a difference now, except perhaps for the largest queries.
The one caution with LIKE is in a join statement (and this is true of the alternatives as well). In general, database engines will not use an index for a LIKE in a join. So, if you can express the join clause in a more index-friendly way, then you might see a substantial increase in performance.
By the way, I'm something of an old-timer with the SQL language, and I personally tend to avoid LIKE. However, this is not a habit that should be passed on, because there is little basis for avoiding it anymore.

Specifically in MySQL (and since this question has a MySQL tag, I'll assume that's what you are using): when using LIKE on an indexed column, be careful not to put a % at the front of the string you are matching unless you have to, because a leading wildcard kills any possibility of using the index for an efficient lookup. Otherwise there is no problem with using LIKE. e.g.
BAD:
col_with_index LIKE '%someText'
GOOD:
col_with_index LIKE 'someText%'
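You can see this in a query plan. The sketch below uses SQLite rather than MySQL (the table and column names are invented), but its LIKE optimization mirrors the point above: a prefix pattern gets an index range scan, while a leading wildcard forces a full scan. MySQL's EXPLAIN shows the equivalent difference (a range access vs. a full table scan).

```python
import sqlite3

# Illustrative sketch (SQLite; names invented). A prefix LIKE can use the
# index; a pattern starting with % cannot.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # required for SQLite's LIKE optimization
conn.execute("CREATE TABLE kw (keyword TEXT)")
conn.execute("CREATE INDEX kw_idx ON kw (keyword)")

def plan(sql):
    # Concatenate the 'detail' column of EXPLAIN QUERY PLAN output.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

good = plan("SELECT keyword FROM kw WHERE keyword LIKE 'someText%'")
bad = plan("SELECT keyword FROM kw WHERE keyword LIKE '%someText'")
print(good)  # an index range scan (SEARCH ... USING ... INDEX ...)
print(bad)   # a full scan (SCAN ...)
```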

There are no valid reasons not to use LIKE!
The only exception is when you can use the equality (=) operator to achieve the same result, i.e. when the pattern contains no wildcards (my_column LIKE 'XYZ').
If you genuinely need LIKE, any alternative that achieves the same result will cause the same (or even worse) performance problems!
So, in those cases, just check whether LIKE is really necessary, and then use it without hesitation.

Related

Optimization: WHERE x IN (1, 2, ..., 100,000) vs INNER JOIN tmp_table USING(x)?

I went to an interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) that contains only those scalars. My answer was accepted, with a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and too widely used for modern RDBMSs not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database, and my tests supported that conclusion. But I didn't take care over test purity, and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, performance should be comparable, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed exist in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN() treatment. Unfortunately, I didn't find any proof of the opposite either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there really are at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any constraints and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
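As an aside, the NULL gotcha mentioned in passing above is easy to demonstrate. A minimal sketch (SQLite here, with an invented table; the SQL semantics are standard three-valued logic, so MySQL behaves the same way):

```python
import sqlite3

# Minimal sketch of the NULL-in-list gotcha. x <> NULL is UNKNOWN, so a
# single NULL in the list makes NOT IN match nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

in_rows = conn.execute("SELECT x FROM t WHERE x IN (2, NULL)").fetchall()
not_in_rows = conn.execute("SELECT x FROM t WHERE x NOT IN (1, NULL)").fetchall()
print(in_rows)      # [(2,)] -- the NULL is harmless here
print(not_in_rows)  # []     -- NOT IN with a NULL never evaluates to TRUE
```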
Of course, in cases where a list of scalars is longer than allowed database protocol packet, INNER JOIN might be the only solution available.
I think that in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, some databases may be unable to prepare an IN(?) query, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases parsing and planning time is not huge compared to execution time.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries, that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
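What "sorted once, then binary searched" means, as a plain-Python sketch (the list values are invented; this is only an analogy for the evaluation strategy the documentation describes):

```python
from bisect import bisect_left

# Sketch of evaluating IN over a constant list: sort the constants once,
# then test each row's value with a binary search (O(log n) per row).
in_list = sorted([7, 3, 99, 42, 15])

def in_eval(value, sorted_values):
    i = bisect_left(sorted_values, value)
    return i < len(sorted_values) and sorted_values[i] == value

print(in_eval(42, in_list))  # True
print(in_eval(8, in_list))   # False
```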
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such cases we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Either way it results in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as there are distinct list lengths: IN (?), IN (?,?), ...).
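For example, the usual way to parameterise a dynamic IN list looks like this (sketched with SQLite and invented table names). Note that every distinct list length produces a distinct statement text, which is exactly the hard-parsing issue described above:

```python
import sqlite3

# Sketch: one placeholder per value keeps the query safe from injection,
# but IN (?), IN (?, ?), ... are all different statements to the parser.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(10)])

ids = [2, 5, 7]
placeholders = ", ".join("?" for _ in ids)  # "?, ?, ?"
rows = conn.execute(
    f"SELECT id FROM items WHERE id IN ({placeholders}) ORDER BY id",
    ids).fetchall()
print(rows)  # [(2,), (5,), (7,)]
```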
I would either load these values into a table and use a join as you mentioned (unless the loading overhead is too high), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT 1 FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.

Why would a NOT IN condition be slower than IN in MySQL?

I was asked this question a few days back in an interview. What is the reason for this slowness?
According to the documentation, IN with a list of constants is implemented as a binary search:
The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
It is not explicitly stated that NOT IN is handled the same way, but it is reasonable to assume based on this:
expr NOT IN (value,...)
This is the same as NOT (expr IN (value,...)).
That would leave two possible differences in performance. The first would be index usage. Because NOT IN is documented as essentially using IN, I would expect the index optimizations to be the same. This is an expectation, but it might not be correct.
The second is that there are fewer comparisons with the NOT IN than with the IN. This would lead to the NOT IN being faster -- but the difference is less than microscopic given the binary search method used by MySQL. In other words, this should not make a difference.
Finally, if you are using a subquery, then all bets are off. The NOT IN has to process the entire list to check if the value is not really in the list. IN can stop processing at the first match. Under some circumstances, this could have an impact on performance.
It depends on your data.
And I found another answer for you:
Use "IN" as it will most likely make the DBMS use an index on the corresponding column.
"NOT IN" could in theory also be translated into an index usage, but in a more complicated way which DBMSs might not "spend overhead time" using.
From : https://stackoverflow.com/a/16892027/3444315
It's because NOT IN is simply NOT (IN). It's always one more step than IN: NOT IN takes the result of IN and then negates it.

Looking for a better solution - long list of AND operators with string comparisons

I have a search function that looks for keywords in a large MySQL table, but since I need to filter out all the bad words, I have to do the following kind of AND comparison in MySQL against a long list of banned words (over 500). Because of this it is very slow:
SELECT * FROM keywords WHERE 1
AND keyword NOT LIKE '%love%'
AND keyword NOT LIKE '%hope%'
AND keyword NOT LIKE '%caring%'
AND keyword NOT LIKE '%x%'
AND keyword NOT LIKE '%happiness%'
AND keyword NOT LIKE '%forgiveness%'
AND keyword NOT LIKE '%good%'
AND keyword NOT LIKE '%great%'
AND keyword NOT LIKE '%positive%'
AND keyword NOT LIKE '%sharing%'
AND keyword NOT LIKE '%awesome%'
AND keyword NOT LIKE '%fantastic%'
Any better way of doing this?
Using LIKE pattern-matching here has terrible performance, because there's no way to use an index for it. Using regular expressions as #fuzic suggests is even worse.
You really need to use some fulltext indexing solution if you want good performance.
I cover this and compare several solutions in my presentation, Full Text Search Throwdown.
The brief answer: use Sphinx Search.
You could do worse than to build a Finite State Machine that recognizes the complete set of strings. Coding one by hand would be tedious, but fortunately tools such as LEX, and its descendants and kin, have been around for nearly 40 years to automate that process.
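An application-side sketch of that idea (not a MySQL query, and the word lists below are invented): compile all banned words into a single alternation, so each keyword is checked in one matching pass instead of 500+ separate NOT LIKE predicates. A generated lexer or an Aho-Corasick matcher would be the heavier-duty versions of the same approach.

```python
import re

# Sketch: filter keywords against many banned substrings in one compiled
# pattern. Both lists are invented examples.
banned = ["love", "hope", "caring", "happiness", "good"]
pattern = re.compile("|".join(map(re.escape, banned)))

keywords = ["lovely day", "database", "a good one", "sharing"]
clean = [k for k in keywords if not pattern.search(k)]
print(clean)  # ['database', 'sharing']
```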

How to force Django to do a case sensitive comparison?

Because of the way our databases are collated, they ignore case for usernames and passwords, which we're currently unable to fix at the database level. It appears from this thread that WHERE BINARY 'something' = 'Something' should fix the problem, but I haven't been able to figure out how to get Django to insert BINARY. Any tips?
I don't think there's an easy way to force Django to add something extra into the query it generates.
You may want to simply write a raw SQL query within the Django to get objects filtered with case-sensitive comparison and then use them in normal queries.
The other approach is to use Django's case-sensitive filters to achieve the same result. E.g. contains/startswith both use BINARY LIKE and may be used when comparing two strings of the same length (like a password hash). Finally, you can use a regexp to do the case-sensitive comparison. But these are rather ugly methods in this situation: they have unnecessary overhead and you should avoid them as long as possible.
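The underlying collation issue can be sketched outside Django. Below, SQLite's NOCASE collation stands in for the case-insensitive MySQL collation, and COLLATE BINARY plays the role of MySQL's BINARY operator (table and values are invented):

```python
import sqlite3

# Sketch of the collation problem: with a case-insensitive collation on the
# column, 'Admin' matches 'admin'; forcing a binary comparison rejects it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT COLLATE NOCASE)")
conn.execute("INSERT INTO users VALUES ('Admin')")

loose = conn.execute(
    "SELECT count(*) FROM users WHERE username = 'admin'").fetchone()[0]
strict = conn.execute(
    "SELECT count(*) FROM users WHERE username = 'admin' COLLATE BINARY").fetchone()[0]
print(loose)   # 1 -- case-insensitive match
print(strict)  # 0 -- case-sensitive comparison rejects it
```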

MySQL - Do's and Don'ts

I am currently learning MySQL and am noticing a lot of different do's and don'ts.
Is there anywhere I can find the absolute list of best practices that you go by or learned from?
Thanks for your time.
Do use InnoDB; don't use MyISAM.
(OK, OK, unless you absolutely have to, often due to fulltext matching not being available in InnoDB. Even then you're often better off putting the canonical data in InnoDB and the fulltext index on a separate MyISAM searchbait table, which you can then process for stemming.)
Do use BINARY columns when you want rigorous string matching, otherwise you get a case-insensitive comparison by default. Do set the collation correctly for your character set (best: UTF-8) or case-insensitive comparisons will behave strangely.
Do use ANSI SQL mode if you want your code to be portable. ANSI_QUOTES allows you to use standard double-quoted "identifier" (table, column, etc.) names to avoid reserved words; MySQL's default way of saying this is backquotes but they're non-standard and won't work elsewhere. If you can't control settings like this, omit any identifier quoting and try to avoid reserved words (which is annoying, as across the different databases there are many).
Do use your data access layer's MySQL string literal escaping or query parameterisation functions; don't try to create escaped literals yourself because the rules for them are a lot more complicated than you think and if you get it wrong you've got an SQL injection hole.
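A minimal parameterisation sketch (Python DB-API with SQLite; the table name and the malicious string are invented) showing why the driver, not hand-rolled escaping, should handle literals:

```python
import sqlite3

# Sketch: values passed as parameters stay data, so the attack string is
# stored verbatim instead of being executed as SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

evil = "x'; DROP TABLE users; --"
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))

found = conn.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchall()
print(found)  # [("x'; DROP TABLE users; --",)]
```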
Don't rely on MySQL's behaviour of returning a particular row when you select columns that don't have a functional dependency on the GROUP BY column(s). This is an error in other databases and can easily hide bugs that will only pop up when the internal storage in the database changes, causing a different row to be returned.
SELECT productid, MIN(cost)
FROM products
GROUP BY productcategory -- this doesn't do what you think
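One conventional fix, sketched in SQLite with invented data, is to join back against the per-group aggregate so that every selected column is actually determined by the grouping:

```python
import sqlite3

# Sketch of the fix: derive each category's minimum cost in a subquery,
# then join back to fetch the matching product rows deterministically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (productid INTEGER, productcategory TEXT, cost REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, 'a', 5.0), (2, 'a', 3.0), (3, 'b', 7.0)])

rows = conn.execute("""
    SELECT p.productid, p.cost
    FROM products AS p
    JOIN (SELECT productcategory, MIN(cost) AS mincost
          FROM products GROUP BY productcategory) AS m
      ON m.productcategory = p.productcategory AND p.cost = m.mincost
    ORDER BY p.productid
""").fetchall()
print(rows)  # [(2, 3.0), (3, 7.0)]
```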
Well, there won't be an absolute list of do's and don'ts, as the goalposts keep moving. MySQL moved on in leaps and bounds between versions 4 and 5, and some fairly essential bug fixes for MySQL seem to be around the corner (I'm thinking of the issue surrounding the use of count(distinct col1) from ...).
Here are a couple of issues off the top of my head:
don't rely on views to be able to use indexes on the underlying tables
http://forums.mysql.com/read.php?100,22967,66618#msg-66618
The order of columns in indexes intended to be used by GROUP BY is important:
http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
COUNT(DISTINCT) is slow:
http://www.delphifaq.com/faq/databases/mysql/f3095.shtml
although there might be a bug fix a-coming....
http://bugs.mysql.com/bug.php?id=17865
Here are some other questions from this site you might find useful:
Database optimization
Database design with MySql
Finetuning tips
DON'T WRITE YOUR SQL IN ALL CAPS, EVEN THOUGH THE OFFICIAL REFERENCE DOES IT. I MEAN, OK, IT MAKES IT PRETTY OBVIOUS TO DIFFERENTIATE BETWEEN IDENTIFIERS AND KEYWORDS. NO, WAIT, THAT'S WHY WE HAVE SYNTAX HIGHLIGHTING.
Do use SQL_MODE "Traditional".
SET SQL_MODE='TRADITIONAL'
Or put it in your my.cnf (even better, because you can't forget it; but ensure it gets deployed on to ALL instances including dev, test etc).
If you don't do this, inserting invalid values into columns will succeed anyway. This is not usually a Good Thing, as it may mean that you lose data.
It's important that it's turned on in dev as well, so you'll spot those problems early.
Oh, I need this list too... joking. No, seriously: the problem is that whatever works with a 1 MB database will never be good for a 1 GB database, and the same applies to a 1 GB database vs a 1 TB database, etc.