Performance of LIKE 'xyz%' v/s LIKE '%xyz' - mysql

I was wondering how the LIKE operator actually works.
Does it simply start from the first character of the string and try to match the pattern, moving one character to the right at a time? Or does it look at the placement of the %, i.e. if the % is the first character of the pattern, does it start from the rightmost character and match backwards, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow

If there is an index on the column, putting constant characters at the front of the pattern lets your DBMS use a more efficient seek instead of scanning every row. But even in the simplest form, the DBMS has to test characters. If it can tell early on that a value doesn't match, it can discard it and move on to the next one.
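To make the "test characters and discard early" idea concrete, here is a minimal Python sketch of a left-to-right LIKE matcher. It is not how MySQL or any other DBMS implements LIKE; the function name like_match is made up for illustration, and collations and ESCAPE handling are ignored (the comparison is plain case-sensitive character equality).

```python
def like_match(value: str, pattern: str) -> bool:
    """Naive LIKE matcher: '%' matches any run of characters, '_' exactly one."""
    v = p = 0
    star_p = star_v = -1  # position of the last '%' and the value index it started at
    while v < len(value):
        if p < len(pattern) and pattern[p] == "%":
            # Remember the wildcard; first try letting it match zero characters.
            star_p, star_v = p, v
            p += 1
        elif p < len(pattern) and (pattern[p] == "_" or pattern[p] == value[v]):
            v += 1
            p += 1
        elif star_p != -1:
            # Mismatch after a '%': let the '%' absorb one more character and retry.
            star_v += 1
            v = star_v
            p = star_p + 1
        else:
            # Mismatch with no '%' to fall back on: discard the value right away.
            return False
    # Only trailing '%' wildcards may remain in the pattern.
    while p < len(pattern) and pattern[p] == "%":
        p += 1
    return p == len(pattern)


print(like_match("abcxyz", "xyz%"))  # False, rejected at the first character
print(like_match("abcxyz", "%xyz"))  # True, after sliding past 'abc'
```

With 'xyz%' a non-matching value is usually rejected after one comparison, while '%xyz' forces this scanner to retry at every position, which is one reason the placement of the % matters even without an index.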

The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey', optionally followed by any number of characters. Whether the comparison is case sensitive or accent sensitive depends on the collation (see below), and you can use multiple %, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' in it (and also 'Mouse' or 'mousé' under a case- and accent-insensitive collation).
The % also matches zero characters, meaning that
WHERE name LIKE '%A%'
will return all values that start with an 'A', contain an 'A' or end with an 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
it will find any value beginning with a letter between c and f, inclusive, meaning it will return any value that starts with c, d, e or f. The [] syntax is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with a '-' (dash) in the 3rd position, use two underscores:
WHERE name LIKE '__-%'
It will return for example: 'Lo-Res'
To find values that end in 'xyz', use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
To find a % sign in a name, use brackets (this is T-SQL syntax; MySQL escapes it with a backslash instead, as in '%\%%'):
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
To search for [, put brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivity and the sort order for the range of characters. You can optionally use COLLATE to specify the collation used by the LIKE operator.

Usually the main performance bottleneck is I/O. The efficiency of the LIKE operator itself only matters if your whole table fits in memory; otherwise I/O will take most of the time.
AFAIK Oracle can use indexes for prefix matching (like 'abc%'), but these indexes cannot be used for more complex expressions.
Anyway, if you only have this kind of query, you should consider using a simple index on the related column. (This is probably true for other RDBMSs as well.)
Otherwise the LIKE operator is generally slow, but most RDBMSs have some kind of full-text search solution. I think the main reason for the slowness is that LIKE is too general. Full-text indexes usually have lots of options that tell the database what you really want to search for, and with this additional information the DB can do its task more efficiently.
As a rule of thumb, if you want to search in a text field and you think performance can be an issue, you should consider your RDBMS's full-text search solution. If the real goal is not text searching but the search is a "design side effect", for example XML/JSON/statuses stored in a field as text, then you should probably consider a more efficient way of storing the data (if there is any...).
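The point about indexes and prefix matching can be illustrated with a small Python sketch. It uses a sorted list as a stand-in for a B-tree index; the function names are hypothetical and no real DBMS is involved. A pattern like 'xyz%' turns into a seek plus a short range scan, while '%xyz' gives the sorted order nothing to work with, so every entry has to be tested.

```python
from bisect import bisect_left


def prefix_range_scan(sorted_values, prefix):
    """Emulate an index range scan for LIKE 'prefix%' on a sorted list."""
    start = bisect_left(sorted_values, prefix)  # binary search: the "seek"
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # values are sorted, so no later value can match either
        matches.append(value)
    return matches


def suffix_full_scan(values, suffix):
    """Emulate LIKE '%suffix': sort order does not help, so test every value."""
    return [v for v in values if v.endswith(suffix)]


index = sorted(["abc", "abcd", "abd", "bxyz", "xyz", "xyzzy", "zxyz"])
print(prefix_range_scan(index, "xyz"))  # ['xyz', 'xyzzy']
print(suffix_full_scan(index, "xyz"))   # ['bxyz', 'xyz', 'zxyz']
```

The range scan touches only the matching entries plus one, however large the list is; the suffix scan always touches all of them.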

Related

How do I create a SELECT conditional in MySQL where the conditional is the character length of the LIKE match?

I am working on a search function, where the matches are weighted based on certain conditions. One of the conditions I want to add weight to is matches where the character length of the query string in a LIKE match is longer than 4.
This is roughly what I want the query to look like. %s is meant to represent the actual match found by LIKE, but I don't think it does. I'm wondering if there is a special variable in MySQL that does represent the precise character match found by LIKE.
SELECT help.*,
IF(CHAR_LENGTH(%s) > 4, 2, 0) w
FROM help
WHERE (
(title LIKE '%this%' OR title LIKE '%testy%' OR title LIKE '%test%') OR
(content LIKE '%this%' OR content LIKE '%testy%' OR content LIKE '%test%')
) LIMIT 1000
edit: I could in the PHP split the search string array into two arrays based on the character length of the elements, with two separate queries that return different values for 'w', then combine the results, but I'd rather not do that, as it seems to me that would be awkward, messy, and slow.
Check out FULLTEXT as another way to discover rows. It will be faster, but won't address your question.
This probably has the effect you want.
SELECT ....
IF ( (title LIKE '%testy%' OR
content LIKE '%testy%'), 2, 0)
....
Note that the "match" in your LIKEs includes the %, so it is the entire length of the string. I don't think that is what you wanted.
REGEXP "(this|testy|that)" will match either 4 or 5 characters (in this example). It may be possible to do something with REGEXP_REPLACE to replace that with the empty string, then see how much it shrank.
I think the answer to my question is that what I wanted to do isn't possible. There is no special variable in MySQL representing the core character match in a WHERE conditional where LIKE is the operator. The match is the contents of the returned data row.
What I did to reach my objective was to take the original dynamic list of search tokens, iterate through that list, and perform a search on each token, with the SQL tailored to the conditions that matched each token.
As I did this I built an array of the search results, using the id for the database row as the index for the array. This allowed me to perform calculations with the array elements, while avoiding duplicates.
I'm not posting the PHP code because the original question was about the SQL.
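For readers who want the general shape of that approach, here is a rough sketch in Python rather than PHP. It is not the poster's code: the function name weigh_matches is invented, the per-token SQL search is replaced by an in-memory substring test over (id, title, content) tuples, and summing the weights per row is an assumption, since the post does not say how weights were combined.

```python
def weigh_matches(rows, tokens, long_weight=2, min_len=4):
    """Return {row_id: weight} for rows matching at least one token.

    rows   -- iterable of (id, title, content) tuples (stand-in for the help table)
    tokens -- search terms; a term longer than min_len characters earns long_weight
    """
    weights = {}
    for token in tokens:
        # Mirrors IF(CHAR_LENGTH(token) > 4, 2, 0) from the question.
        w = long_weight if len(token) > min_len else 0
        needle = token.lower()
        for row_id, title, content in rows:
            if needle in title.lower() or needle in content.lower():
                # Keying by row id keeps each row once, however many tokens hit it.
                weights[row_id] = weights.get(row_id, 0) + w
    return weights


rows = [
    (1, "this is a test", ""),
    (2, "testy thing", ""),
    (3, "nothing", "here"),
]
print(weigh_matches(rows, ["this", "testy", "test"]))  # {1: 0, 2: 2}
```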

MySQL dictionary query optimization

I have a dictionary query which I would like to optimize. Apparently the query is too long as the result page takes quite long to load. The query is as follows:
$var = @$_GET['q'];
$varup1 = strtoupper($var);
$varup = addslashes ($varup1);
$query1 = "select distinct $lang from $dict WHERE
UPPER ($lang) LIKE trim('$varup')
or UPPER($lang) LIKE replace('$varup',' ','')
or replace($lang,'ß','ss') LIKE trim('$varup')
or replace($lang,'ss','ß') LIKE trim('$varup')
or replace($lang,'ence','ance') LIKE trim('$varup')
or replace($lang,'ance','ence') LIKE trim('$varup')
or UPPER ($lang) like trim(trailing 'LY' from '$varup')
or UPPER ($lang) like trim(trailing 'Y' from '$varup')
or UPPER ($lang) like trim(trailing 'MENTE' from '$varup')
or UPPER ($lang) like trim(trailing 'EMENT' from '$varup')
or UPPER ($lang) like trim(trailing 'IN' from '$varup')
The purpose is that a search string shall also find different writings of the same word or the adverb of an adjective.
The table looks like this: [two table screenshots, not included here]
For instance "flawlessly" shall also display "flawless". "Fully" shall also find "full" and vice-versa.
"Feliz" should also find the entries for "Felizmente".
There are around twenty substitutes like the above which I eliminated as they do not make the question easier to understand.
The whole code is quite long and I wonder if I can make it smaller without losing functionality. Any ideas?
Where is the FROM clause in the query?
The REPLACE calls could be chained: REPLACE(REPLACE(..., 'a', 'b'), 'c', 'd'). Ditto for the TRIM calls.
As already mentioned, a suitable COLLATION eliminates all need for UPPER() and LOWER(). Avoid the ...general... collations, and you will be provided with this: ss=ß. Many, but not all, treat ij=ĳ and/or oe=œ and/or Aa=Å (etc); do you need them, too? Here is a rundown of most situations: http://mysql.rjweb.org/utf8_collations.html
Using a FULLTEXT index will take care of most of the endings you are testing for, thereby obviating most of your code.
You show multiple words in the second column. Is this simply for display? If you need to pick apart the words, then you have other nasty challenges.
This, alone, will speed up the query something like 10-fold:
WHERE english LIKE 'ha%'
AND ... (whatever else you have)
That is, filter on the first 2 letters with something that can use INDEX(english), specifically LIKE 'ha%' for the word hate. Since you seem to be using PHP, there should be no difficulty building this into the query.
Here's another thought on my substring($word, 0, 2)... Instead of specifically using "2", see if floor(strlen($word)/2) will work well enough. So, 'flawlessly' would be tested LIKE 'flawl%' and run a lot faster than even 10-fold.
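A minimal sketch of that prefix computation, written in Python rather than the asker's PHP. The helper name like_prefix is made up, and escaping %, _ and \ is an addition not in the answer, included so that user input cannot smuggle wildcards into the pattern (backslash is MySQL's default LIKE escape character).

```python
def like_prefix(word: str) -> str:
    """Build a LIKE pattern from the first half of the word (at least one character)."""
    keep = max(1, len(word) // 2)  # same idea as floor(strlen($word) / 2)
    prefix = word[:keep]
    # Escape the backslash first, then the two LIKE wildcards.
    for ch in ("\\", "%", "_"):
        prefix = prefix.replace(ch, "\\" + ch)
    return prefix + "%"


print(like_prefix("flawlessly"))  # flawl%
print(like_prefix("hate"))        # ha%
```

The result is meant to be passed as a bound parameter, e.g. WHERE english LIKE ? AND (the other conditions), so that the index on the column can narrow the candidate rows before the expensive REPLACE/TRIM tests run.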
But, another issue. Are you chopping both the word in the table and the word given? Try to avoid chopping the word in the table. To discuss this further, please provide the table entries for 'flaw', 'flaws', 'flawless', 'flawlessly', etc. I can't quite tell if you need to get all the way down to 'flaw', or have various rows for the various forms.
Beware of some very short words with odd forms. Perhaps you need to add extra entries to avoid making the SQL query too messy. These change the second letter: "LIE" and "LYING". Seems like there is even a common word that changes the first letter.

MySQL - search for patterns

I'm trying to figure out if someone has an elegant way to look for patterns in data stored in a varchar field where a value is not known -- meaning I can't use LIKE. For example, say a table called test looked like this:
id, str
and the data looked like this:
1, YUUUY
2, DDDMM
3, MMMMT
4, XMXMX
and I want to do a select that will return anything where the value of str has a pattern that matches the pattern ABABA. ABABA here shows a pattern and not literal letters. So the only one that matches this pattern would be id = 4. Is there a regular expression that I can use to pattern match like this? To make sure I'm clear regarding the patterns:
The pattern for id=1 is ABBBA.
The pattern for id=2 is AAABB.
The pattern for id=3 is AAAAB.
When running the query, all I will know is the pattern to search for.
Alternatively, if it makes it easier, I can have the table set up like:
id,c1,c2,c3,c4,c5
and the data would look like this:
1,Y,U,U,U,Y
2,D,D,D,M,M
3,M,M,M,M,T
4,X,M,X,M,X
Not sure if that makes it easier, but I think regexp is out the window if the data is set up like that.
No regular expression support in MySQL to do that kind of pattern matching, no.
SQL wasn't specifically designed for pattern matching of strings (or patterns of values in separate columns.)
But... we could come up with something workable, even if it's not a regular expression and it's not elegant.
Assuming we don't have a custom-built user-defined function, and we want to use native MySQL functions and expressions...
And assuming that the patterns we are looking for are guaranteed to consist of only two distinct characters...
And assuming that we're looking at exactly five character positions...
And assuming that the pattern string we're matching against will always begin with the letter 'A', and the "other" letter in the pattern will always be 'B'...
It wouldn't be overly ugly to do something like this:
SELECT t.id
, t.str
FROM mytable t
WHERE CONCAT('A'
,IF(MID(t.str,2,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,3,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,4,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,5,1)=MID(t.str,1,1),'A','B')
) = 'ABBBA'
The first character in the string is automatically converted to an 'A'.
The second character, if that matches the first character, then it's also an 'A' otherwise it's a 'B'.
We do the same thing for the third, fourth and fifth characters.
Concatenate the 'A' and 'B' characters into a single string, and we can now perform an equality comparison to a pattern string, consisting of 'A' and 'B', starting with an 'A'.
But that is going to fall apart if the stated assumptions aren't true, for example if str is less than five characters in length, or if it contains more than two distinct characters. In the latter case, str=XYYZX would be seen as matching pattern ABBBA: the first character is an automatic match to 'A', the fifth character matches the first, so it's an 'A', and all of the other characters don't match the first, so they are 'B', even though they aren't the same as each other.
And so on.
We could add some additional checks.
For example, to guarantee that str is exactly five characters in length...
AND CHAR_LENGTH(t.str)=5
Note that the default collation in MySQL is case insensitive. That means a str value of MmmmM would be converted to 'AAAAA', not 'ABBBA'. And a str value of MmmKk would match 'AAABB'.
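If doing this outside SQL is acceptable (an assumption; the question asks for a query), the same normalisation can be written without the two-letter and five-position restrictions. The Python sketch below, with the made-up name letter_pattern, maps each distinct character to A, B, C... in order of first appearance, so the result can be compared directly with the pattern being searched for. It compares characters case-sensitively, unlike MySQL's default collation, and it only produces letters for up to 26 distinct characters.

```python
def letter_pattern(s: str) -> str:
    """Map each distinct character to A, B, C... in order of first appearance."""
    seen = {}
    out = []
    for ch in s:
        if ch not in seen:
            # The next unused letter: 0 seen -> 'A', 1 seen -> 'B', and so on.
            seen[ch] = chr(ord("A") + len(seen))
        out.append(seen[ch])
    return "".join(out)


rows = [(1, "YUUUY"), (2, "DDDMM"), (3, "MMMMT"), (4, "XMXMX")]
print([row_id for row_id, s in rows if letter_pattern(s) == "ABABA"])  # [4]
print(letter_pattern("XYYZX"))  # ABBCA, so it no longer passes for ABBBA
```

One possible use is to compute this value when a row is written and store it in an extra indexed column, so that the lookup becomes a plain equality test.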
Unfortunately, it doesn't look like MySQL supports regex groups. I was hoping you could do something like this to match ABBBA for example:
([A-Z])([A-Z])\2\2\1
Example here: http://regexr.com/3d8gu
It looks like there is a MySQL plugin that might support it:
https://github.com/mysqludf/lib_mysqludf_preg
Here is a real hacky way to do it.
ABBBA (or YUUUY, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,5,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,3,1) = substring(name,4,1);
AAABB (or DDDMM, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,2,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,4,1) = substring(name,5,1);
AAAAB (or MMMMT, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,2,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,3,1) = substring(name,4,1) AND
substring(name,4,1) != substring(name,5,1);
You get the picture...
It would be similar if you separated the data into different columns. Instead of comparing substrings you would just compare the columns.

Finding small letter between two capital letters - MySQL

I've got a problem: I need to find every single phrase like AbC (small b, between two capital letters).
For Example a statement:
Little John had a ProBlEm and need to know how to do tHiS.
I need to select ProBlEm and tHiS (you see, BlE and HiS, one small letter in between two capital).
How can I select this?
In MySQL you can use a binary (to ensure case sensitivity) regular expression to filter for those records that contain such a pattern:
WHERE my_column REGEXP BINARY '[[:upper:]][[:lower:]][[:upper:]]'
However, it is not so straightforward to extract the substrings which match such a pattern from within MySQL. One can use a UDF, e.g. lib_mysqludf_preg, but it's probably a task more suited to being performed within your application layer. In either case, regular expressions can again help to simplify this task.
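As an illustration of doing the extraction in the application layer, here is a small sketch in Python (the question does not say which language the application uses, so that choice is an assumption, as is the helper name words_with_pattern). It handles ASCII letters only.

```python
import re

# One lower-case letter between two upper-case letters (ASCII only).
UPPER_LOWER_UPPER = re.compile(r"[A-Z][a-z][A-Z]")


def words_with_pattern(text):
    """Return the words in text that contain an upper-lower-upper run."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if UPPER_LOWER_UPPER.search(w)]


sentence = "Little John had a ProBlEm and need to know how to do tHiS."
print(words_with_pattern(sentence))  # ['ProBlEm', 'tHiS']
```

The REGEXP BINARY filter above can still be used in the WHERE clause to fetch only the rows worth scanning this way.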
First you have to split the string. Please refer to this SO question,
and then test each retrieved word like this:
substring(word,2) LIKE '[A-Z]' COLLATE latin1_general_cs

What field structure would be better for a definition table in MySQL?

I'm making a dictionary webapp. The user will search for words. Would it be faster to do this?
SELECT * from definition WHERE word LIKE "house";
or...
SELECT * from definition WHERE word_hash LIKE md5("house");
In the second example, I store the md5() value of words in the word_hash field. Of course, "word" and "word_hash" are indexes.
Update: sometimes, the word field could be more than 1 word. Example: Sacré Bleu
Skipping LIKE completely would be faster. Add a lower-case version of word as word_lc, index word_lc, and then do:
select * from definition where word_lc = lower(word_you_want)
Using LIKE without any % or _ wildcards is just a case insensitive equality test so you should go straight to a case insensitive comparison that can and will take advantage of an index. Also, as usual, say what you mean so the computer can do what you want it to do.