How do I create a SELECT conditional in MySQL where the conditional is the character length of the LIKE match? - mysql

I am working on a search function, where the matches are weighted based on certain conditions. One of the conditions I want to add weight to is matches where the character length of the query string in a LIKE match is longer than 4.
This is roughly what I want the query to look like. %s is meant to represent the actual match found by LIKE, but I don't think it does. I'm wondering whether MySQL has a special variable that represents the precise character match found by LIKE.
SELECT help.*,
IF(CHAR_LENGTH(%s) > 4, 2, 0) w
FROM help
WHERE (
(title LIKE '%this%' OR title LIKE '%testy%' OR title LIKE '%test%') OR
(content LIKE '%this%' OR content LIKE '%testy%' OR content LIKE '%test%')
) LIMIT 1000
edit: I could in the PHP split the search string array into two arrays based on the character length of the elements, with two separate queries that return different values for 'w', then combine the results, but I'd rather not do that, as it seems to me that would be awkward, messy, and slow.

Check out FULLTEXT as another way to discover rows. It will be faster, but won't address your question.
This probably has the effect you want.
SELECT ....
IF ( (title LIKE '%testy%' OR
content LIKE '%testy%'), 2, 0)
....
Note that the "match" in your LIKEs includes the %, so it is the entire length of the string. I don't think that is what you wanted.
REGEXP "(this|testy|that)" will match either 4 or 5 characters (in this example). It may be possible to do something with REGEXP_REPLACE to replace that with the empty string, then see how much it shrank.
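Since the search tokens are known before the query is built, the weight can be decided per token in application code rather than from the LIKE match. A minimal sketch in Python, assuming a MySQL driver that uses %s placeholders; the function name build_weighted_query and the min_len and bonus parameters are made up for illustration. Tokens containing % or _ would still act as wildcards here, since no escaping is done.

```python
def build_weighted_query(tokens, min_len=4, bonus=2):
    """Return (sql, params) for a weighted LIKE search over the help table.

    Hypothetical helper: tokens longer than min_len add `bonus` to w
    when they match; shorter tokens only take part in the WHERE clause.
    """
    if not tokens:
        raise ValueError("at least one search token is required")

    match = "(title LIKE %s OR content LIKE %s)"
    weight_terms, weight_params = [], []
    where_terms, where_params = [], []

    for token in tokens:
        pattern = f"%{token}%"
        where_terms.append(match)
        where_params += [pattern, pattern]
        if len(token) > min_len:
            # Only long tokens contribute to the weight column.
            weight_terms.append(f"IF({match}, {int(bonus)}, 0)")
            weight_params += [pattern, pattern]

    weight_sql = " + ".join(weight_terms) or "0"
    sql = (
        f"SELECT help.*, {weight_sql} AS w FROM help "
        f"WHERE {' OR '.join(where_terms)} LIMIT 1000"
    )
    # Placeholders in the SELECT list come before those in the WHERE clause.
    return sql, weight_params + where_params


sql, params = build_weighted_query(["this", "testy", "test"])
```

With these three tokens only 'testy' is longer than 4 characters, so it is the only one wrapped in an IF(...) weight term.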

I think the answer to my question is that what I wanted to do isn't possible. There is no special variable in MySQL representing the core character match in a WHERE conditional where LIKE is the operator. The match is the contents of the returned data row.
What I did to reach my objective was take the original dynamic list of search tokens, iterate through that list, and perform a search on each token, with the SQL tailored to the conditions that matched each token.
As I did this I built an array of the search results, using the id for the database row as the index for the array. This allowed me to perform calculations with the array elements, while avoiding duplicates.
I'm not posting the PHP code because the original question was about the SQL.
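The merging step described above might look roughly like the following. This is a Python sketch rather than the author's PHP, and the per-token result rows are hypothetical stand-ins for what each query would return.

```python
def merge_results(per_token_rows, min_len=4, bonus=2):
    """Combine per-token search results into one list, keyed by row id.

    per_token_rows maps each search token to the rows its query returned.
    Rows found by several tokens are kept once and their weights add up.
    """
    scores = {}
    for token, rows in per_token_rows.items():
        weight = bonus if len(token) > min_len else 0
        for row in rows:
            # The row id is the key, so duplicates collapse into one entry.
            entry = scores.setdefault(row["id"], {"row": row, "w": 0})
            entry["w"] += weight
    # Highest weight first; ties keep their first-seen order.
    return sorted(scores.values(), key=lambda e: e["w"], reverse=True)


# Hypothetical results of two per-token queries.
results = merge_results({
    "testy": [{"id": 1}, {"id": 2}],
    "test": [{"id": 1}, {"id": 3}],
})
```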

Related

MySQL MATCH() AGAINST() with reversed parameters

I have a database table that looks a bit like this:
id|words|url
1|+Word +Matching -Goodbye|/url-1
2|+Redirect +Me|/url-2
3|+Goodbye +Word|/url-3
When a user types a search for: "Hello I am matching a word", I would like the table.words field to be given a 'relevance' score against that string, a lot like the MATCH() AGAINST() function, but with the parameters reversed.
Effectively, the query I am looking to run would be along the lines of:
SELECT id, words, url,
MATCH ("Hello I am matching a word") AGAINST (words IN BOOLEAN MODE) AS relevance
FROM table
ORDER BY relevance DESC
But this does not work, unfortunately. I could do something in PHP where I create a function to loop through each inclusive/exclusive word, but I fear that this will be really slow when the table size grows.
Just to tie it up, I would expect the query to return id: 1 in that instance, as it includes "Word" and "Matching", and does not include "Goodbye". I should point out that these words could be in any order within the string, so I don't think I could really use LIKE.
If such a function does not exist, is there a better way I could approach this?
Thanks!
The documentation is pretty clear:
The search string must be a string value that is constant during query
evaluation. This rules out, for example, a table column because that
can differ for each row.
Hence, you cannot do what you want with a query.
You can use dynamic SQL. Or a loop in PHP to loop through the patterns in your table.
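The application-side loop could be sketched as below, in Python rather than PHP. It assumes that + means "must be present" and - means "must be absent", which is how the question reads; the relevance function and its scoring (one point per required word found) are illustrative choices, not MySQL's own relevance formula.

```python
import re


def relevance(rule, query):
    """Score a rule such as '+Word +Matching -Goodbye' against a query string.

    Returns 0 if a required (+) word is missing or an excluded (-) word
    is present; otherwise returns the number of rule words that matched.
    """
    words = set(re.findall(r"\w+", query.lower()))
    score = 0
    for term in rule.split():
        sign = term[0]
        word = term.lstrip("+-").lower()
        present = word in words
        if sign == "-" and present:
            return 0
        if sign == "+" and not present:
            return 0
        if sign != "-" and present:
            score += 1
    return score


# Rows copied from the question's example table: (id, words, url).
rows = [
    (1, "+Word +Matching -Goodbye", "/url-1"),
    (2, "+Redirect +Me", "/url-2"),
    (3, "+Goodbye +Word", "/url-3"),
]
query = "Hello I am matching a word"
ranked = sorted(rows, key=lambda r: relevance(r[1], query), reverse=True)
```

For the example query, only the rule with id 1 has all of its required words present and none of its excluded words, so it ranks first.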

MySQL - search for patterns

I'm trying to figure out if someone has an elegant way to look for patterns in data stored in a varchar field where a value is not known -- meaning I can't use LIKE. For example, say a table called test looked like this:
id, str
and the data looked like this:
1, YUUUY
2, DDDMM
3, MMMMT
4, XMXMX
and I want to do a select that will return anything where the value of str has a pattern that matches the pattern ABABA. ABABA here shows a pattern and not literal letters. So the only one that matches this pattern would be id = 4. Is there a regular expression that I can use to pattern match like this? To make sure I'm clear regarding the patterns:
The pattern for id=1 is ABBBA.
The pattern for id=2 is AAABB.
The pattern for id=3 is AAAAB.
When running the query, all I will know is the pattern to search for.
Alternatively, if it makes it easier, I can have the table set up like:
id,c1,c2,c3,c4,c5
and the data would look like this:
1,Y,U,U,U,Y
2,D,D,D,M,M
3,M,M,M,M,T
4,X,M,X,M,X
Not sure if that makes it easier, but I think regexp is out the window if the data is set up like that.
No regular expression support in MySQL to do that kind of pattern matching, no.
SQL wasn't specifically designed for pattern matching of strings (or patterns of values in separate columns.)
But... we could come up with something workable, even if it's not a regular expression and it's not elegant.
Assuming we don't have a custom built user-defined function, and we want to use native MySQL functions and expression...
And assuming that the patterns we are looking for are guaranteed to consist of only two distinct characters...
And assuming that we're looking at exactly five character positions...
And assuming that the pattern string we're matching to will always begin with the letter 'A', and the "other" letter in the pattern will always be 'B'...
It wouldn't be overly ugly to do something like this:
SELECT t.id
, t.str
FROM mytable t
WHERE CONCAT('A'
,IF(MID(t.str,2,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,3,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,4,1)=MID(t.str,1,1),'A','B')
,IF(MID(t.str,5,1)=MID(t.str,1,1),'A','B')
) = 'ABBBA'
The first character in the string is automatically converted to an 'A'.
The second character, if that matches the first character, then it's also an 'A' otherwise it's a 'B'.
We do the same thing for the third, fourth and fifth characters.
Concatenate the 'A' and 'B' characters into a single string, and we can now perform an equality comparison to a pattern string, consisting of 'A' and 'B', starting with an 'A'.
But that is going to fall apart if the stated assumptions aren't true: if str is less than five characters in length, or if it contains more than two distinct characters. For example, this would see str=XYYZX as matching pattern ABBBA. The first character is an automatic match to 'A', the fifth character matches the first, so it's also an 'A', and all of the other characters don't match the first, so they are 'B', even though they aren't the same as each other.
And so on.
We could add some additional checks.
For example, to guarantee that str is exactly five characters in length...
AND CHAR_LENGTH(t.str)=5
Note that the default collation in MySQL is case insensitive. That means a str value of MmmmM would be converted to 'AAAAA', not 'ABBBA'. And a str value of MmmKk would match 'AAABB'.
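If the comparison can be done outside SQL, the same idea generalizes to any length and any number of distinct characters. A minimal Python sketch; the function name shape and the case_sensitive flag are invented here, and it supports at most 26 distinct characters because it labels them A to Z.

```python
from string import ascii_uppercase


def shape(s, case_sensitive=True):
    """Reduce a string to its letter pattern, e.g. 'XMXMX' -> 'ABABA'.

    Each distinct character is given the next unused label (A, B, C, ...)
    in order of first appearance.
    """
    if not case_sensitive:
        # Mimic a case-insensitive collation by folding case first.
        s = s.casefold()
    labels = {}
    out = []
    for ch in s:
        if ch not in labels:
            labels[ch] = ascii_uppercase[len(labels)]
        out.append(labels[ch])
    return "".join(out)


# Sample data from the question.
data = {1: "YUUUY", 2: "DDDMM", 3: "MMMMT", 4: "XMXMX"}
matching_ids = [i for i, s in data.items() if shape(s) == "ABABA"]
```

Unlike the two-letter SQL expression, a third distinct character gets its own label here, so XYYZX becomes ABBCA and is not mistaken for ABBBA.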
Unfortunately, it doesn't look like MySQL supports regex groups. I was hoping you could do something like this to match ABBBA for example:
([A-Z])([A-Z])\2\2\1
Example here: http://regexr.com/3d8gu
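For comparison, here is how that backreference idea looks in a regex engine that does support groups. This is Python's re module, not MySQL. Note that the pattern above would also accept AAAAA, because nothing forces the two groups to differ; the sketch adds a negative lookahead for that, which is my addition rather than part of the original pattern.

```python
import re

# (.)      first character, captured as group 1
# (?!\1)   the next character must differ from group 1
# (.)      second character, captured as group 2
# \2\2\1   then B, B, A again
ABBBA = re.compile(r"(.)(?!\1)(.)\2\2\1")


def is_abbba(s):
    """True if the whole string has the shape ABBBA with A != B."""
    return ABBBA.fullmatch(s) is not None


# Sample data from the question.
data = {1: "YUUUY", 2: "DDDMM", 3: "MMMMT", 4: "XMXMX"}
abbba_ids = [i for i, s in data.items() if is_abbba(s)]
```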
It looks like there is a MySQL plugin that might support it:
https://github.com/mysqludf/lib_mysqludf_preg
Here is a real hacky way to do it.
ABBBA (or YUUUY, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,5,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,3,1) = substring(name,4,1);
AAABB (or DDDMM, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,2,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,4,1) = substring(name,5,1);
AAAAB (or MMMMT, etc):
SELECT id, name FROM table WHERE
substring(name,1,1) = substring(name,2,1) AND
substring(name,2,1) = substring(name,3,1) AND
substring(name,3,1) = substring(name,4,1) AND
substring(name,4,1) != substring(name,5,1);
You get the picture...
It would be similar if you separated the data into different columns. Instead of comparing substrings you would just compare the columns.

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually works.
Does it simply start from the first character of the string and try matching the pattern, moving one character to the right at a time? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the rightmost character and match from there, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters in the front will lead your dbms to use a more efficient searching/seeking algorithm. But even at the simplest form, the dbms has to test characters. If it is able to find it doesn't match early on, it can discard it and move onto the next test.
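To see why a constant prefix helps, a sorted list can stand in for an index. This is only an analogy in Python, not how any particular DBMS is implemented: with a known prefix, a binary search jumps to the first candidate and stops at the first non-match, while a leading wildcard leaves no choice but to test every value.

```python
from bisect import bisect_left


def prefix_search(sorted_values, prefix):
    """LIKE 'prefix%' over a sorted list: seek, then read a contiguous run."""
    i = bisect_left(sorted_values, prefix)
    out = []
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        out.append(sorted_values[i])
        i += 1
    return out


def suffix_search(values, suffix):
    """LIKE '%suffix': sort order does not help, so every value is tested."""
    return [v for v in values if v.endswith(suffix)]


values = sorted(["abcxyz", "xyz", "xyzabc", "xyzzy", "mxyz"])
starts = prefix_search(values, "xyz")
ends = suffix_search(values, "xyz")
```

All values sharing a prefix sit next to each other in sorted order, which is what makes the early exit in prefix_search safe.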
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey' optionally followed by any number of characters. Under a case-insensitive, accent-insensitive collation the match is neither case sensitive nor accent sensitive, and you can use multiple %, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' (or 'Mouse' or 'mousé') in it.
The % is inclusive, meaning that
WHERE name like '%A%'
will return all values that start with an 'A', contain an 'A', or end with an 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
it will find any value beginning with letter between c and f, inclusive. Meaning it will return any value that start with c, d, e or f. This [] is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with the 3rd position to be a '-' (dash) use two underscores:
WHERE name LIKE '__-%'
It will return for example: 'Lo-Res'
To find values whose names end in 'xyz', use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
To find a % sign in a name, use brackets:
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
Searching for [ use brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivity and the sort order for the range of characters. You can optionally use COLLATE to specify the collation sort order used by the LIKE operator.
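The two portable wildcards can be summarized by translating a LIKE pattern into a regular expression. A rough Python sketch covering only % and _; it does not handle the T-SQL [] ranges, escape characters, or accent insensitivity, and the function names like_to_regex and like are invented for illustration.

```python
import re


def like_to_regex(pattern, case_insensitive=True):
    """Translate a LIKE pattern using only % and _ into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any number of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL
    if case_insensitive:
        flags |= re.IGNORECASE
    return re.compile("".join(parts), flags)


def like(value, pattern, case_insensitive=True):
    """True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern, case_insensitive).fullmatch(value) is not None


assert like("Mickey Mouse", "Mickey%")
assert like("Batman", "_at%")
assert like("Lo-Res", "__-%")
```

The pattern must match the entire value, which is why fullmatch is used rather than search.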
Usually the main performance bottleneck is IO. The efficiency of the LIKE operator can be only important if your whole table fits in the memory otherwise IO will take most of the time.
AFAIK Oracle can use indexes for prefix matching (LIKE 'abc%'), but these indexes cannot be used for more complex expressions.
Anyway if you have only this kind of queries you should consider using a simple index on the related column. (Probably this is true for other RDBMS's as well.)
Otherwise the LIKE operator is generally slow, but most RDBMSs have some kind of full text searching solution. I think the main reason for the slowness is that LIKE is too general. Full text indexes usually have lots of different options which tell the database what you really want to search for, and with this additional information the DB can do its task more efficiently.
As a rule of thumb, if you want to search in a text field and you think performance can be an issue, you should consider your RDBMS's full text searching solution. If the real goal is not text searching and the search is some kind of "design side effect" (for example xml/json/statuses stored in a field as text), then you should probably consider a more efficient way of storing the data, if there is one.

Using REGEX to alter field data in a mysql query

I have two databases, both containing phone numbers. I need to find all instances of duplicate phone numbers, but the formats of database 1 vary wildly from the format of database 2.
I'd like to strip out all non-digit characters and just compare the two 10-digit strings to determine if it's a duplicate, something like:
SELECT b.phone as barPhone, sp.phone as SPPhone FROM bars b JOIN single_platform_bars sp ON sp.phone.REGEX = b.phone.REGEX
Is such a thing even possible in a mysql query? If so, how do I go about accomplishing this?
EDIT: Looks like it is, in fact, a thing you can do! Hooray! The following query returned exactly what I needed:
SELECT b.phone, b.id, sp.phone, sp.id
FROM bars b JOIN single_platform_bars sp ON REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(b.phone,' ',''),'-',''),'(',''),')',''),'.','') = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')',''),'.','')
MySQL doesn't support returning the "match" of a regular expression. The MySQL REGEXP function returns a 1 or 0, depending on whether an expression matched a regular expression test or not.
You can use the REPLACE function to replace a specific character, and you can nest those. But it would be unwieldy for all "non-digit" characters. If you want to remove spaces, dashes, open and close parens e.g.
REPLACE(REPLACE(REPLACE(REPLACE(sp.phone,' ',''),'-',''),'(',''),')','')
One approach is to create user defined function to return just the digits from a string. But if you don't want to create a user defined function...
This can be done in native MySQL. This approach is a bit unwieldy, but it is workable for strings of "reasonable" length.
SELECT CONCAT(IF(SUBSTR(sp.phone,1,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,1,1),'')
,IF(SUBSTR(sp.phone,2,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,2,1),'')
,IF(SUBSTR(sp.phone,3,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,3,1),'')
,IF(SUBSTR(sp.phone,4,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,4,1),'')
,IF(SUBSTR(sp.phone,5,1) REGEXP '^[0-9]$',SUBSTR(sp.phone,5,1),'')
) AS phone_digits
FROM single_platform_bars sp
To unpack that a bit... we extract a single character from the first position in the string, check if it's a digit, if it is a digit, we return the character, otherwise we return an empty string. We repeat this for the second, third, etc. characters in the string. We concatenate all of the returned characters and empty strings back into a single string.
Obviously, the expression above is checking only the first five characters of the string, you would need to extend this, basically adding a line for each position you want to check...
And unwieldy expressions like this can be included in a predicate (in a WHERE clause). (I've just shown it in the SELECT list for convenience.)
MySQL doesn't support such string operations natively. You will either need to use a UDF like this, or else create a stored function that iterates over a string parameter concatenating to its return value every digit that it encounters.
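The "iterate over the string and keep every digit" logic is easy to sketch outside the database. The following Python is an illustration of that loop and of the duplicate comparison, not a stored function; the helper names and sample phone numbers are made up.

```python
def digits_only(s):
    """Keep only the ASCII digits 0-9, dropping spaces, dashes, parens, dots."""
    return "".join(ch for ch in s if ch in "0123456789")


def find_duplicates(bars, sp_bars):
    """Return (bar_id, sp_id) pairs whose phone numbers have the same digits.

    Both arguments are lists of (id, phone) tuples.
    """
    by_digits = {}
    for bar_id, phone in bars:
        key = digits_only(phone)
        if key:  # ignore values that contain no digits at all
            by_digits.setdefault(key, []).append(bar_id)

    pairs = []
    for sp_id, phone in sp_bars:
        for bar_id in by_digits.get(digits_only(phone), []):
            pairs.append((bar_id, sp_id))
    return pairs


# Hypothetical rows from the two sources.
bars = [(1, "(555) 123-4567"), (2, "555.987.6543")]
sp_bars = [(10, "555-123-4567"), (11, "5550001111")]
duplicates = find_duplicates(bars, sp_bars)
```

Normalizing once and looking up by the normalized key avoids comparing every row against every other row.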

MySQL query - select postcode matches

I need to make a selection based on the first 2 characters of a field, so for example
SELECT * from table WHERE postcode LIKE 'rh%'
But this would select any record that contains those 2 characters at any point in the "postcode" field, right? I need a query that matches only on the first 2 characters. Any pointers?
Thanks
Your query is correct. It searches for postcodes starting with "rh".
In contrast, if you wanted to search for postcodes containing the string "rh" anywhere in the field, you would write:
SELECT * from table WHERE postcode LIKE '%rh%'
Edit:
To answer your comment, you can use either or both % and _ for relatively simple searches. As you have noticed already, % matches any number of characters whereas _ matches a single character.
So, in order to match postcodes starting with "RHx " (where x is any character) your query would be:
SELECT * from table WHERE postcode LIKE 'RH_ %'
(mind the space after _). For more complex search patterns, you need to read about regular expressions.
Further reading:
http://dev.mysql.com/doc/refman/5.1/en/pattern-matching.html
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
LIKE '%rh%' will return all rows with 'rh' anywhere
LIKE 'rh%' will return all rows with 'rh' at the beginning
LIKE '%rh' will return all rows with 'rh' at the end.
If you want to get only first two characters 'rh', use MySQL SUBSTR() function
http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_substr
Dave, your way seems correct to me (and works on my test data). Using a leading % as well will match anywhere in the string which obviously isn't desirable when dealing with postcodes.
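The patterns discussed in this thread can be tried out quickly with an in-memory database. The sketch below uses Python's sqlite3 module as a stand-in for MySQL. SQLite's LIKE treats % and _ the same way and is case-insensitive for ASCII by default, but it is a different engine, so treat this as an illustration only. The table name addresses and the sample postcodes are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, postcode TEXT)")
conn.executemany(
    "INSERT INTO addresses (postcode) VALUES (?)",
    [("RH1 2AB",), ("RH10 1XY",), ("BN1 3RH",), ("rh2 9zz",)],
)


def postcodes_like(pattern):
    """Return postcodes matching the LIKE pattern, in insertion order."""
    rows = conn.execute(
        "SELECT postcode FROM addresses WHERE postcode LIKE ? ORDER BY id",
        (pattern,),
    )
    return [row[0] for row in rows]


starts_with = postcodes_like("rh%")    # 'rh' at the beginning only
anywhere = postcodes_like("%rh%")      # 'rh' anywhere, including the end
three_char = postcodes_like("RH_ %")   # 'RH', one character, then a space
```

In this data, 'BN1 3RH' is returned only by the '%rh%' pattern, and 'RH10 1XY' is excluded by 'RH_ %' because two characters follow 'RH' before the space.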