I've noticed that LIKE 'a%' returns results such as 'árbol', but LIKE 'AE%' will not return results like 'Æther'. How smart is LIKE, exactly? Let's say I have these database entries:
Black Blue
Black Blew
Is there any way to have MySQL match both smartly with LIKE 'Black Bleu' (since they are quite close)? Or is LIKE '_%' only capable of matching exact characters, with the aforementioned exception?
MySQL's fulltext searching is pretty limited, as it's a database, not a search engine.
You should look into something like Apache Solr, which supports all sorts of things like "sounds like" matching, stemming (e.g. "smarter" and "smart" are treated as the same), etc.
Another option is to use the SOUNDEX() function or the SOUNDS LIKE operator:
http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#operator_sounds-like
http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_soundex
mysql> SELECT SOUNDEX('Blue'), SOUNDEX('Blew'), SOUNDEX('Blue') = SOUNDEX('Blew');
+-----------------+-----------------+-----------------------------------+
| SOUNDEX('Blue') | SOUNDEX('Blew') | SOUNDEX('Blue') = SOUNDEX('Blew') |
+-----------------+-----------------+-----------------------------------+
| B400            | B400            |                                 1 |
+-----------------+-----------------+-----------------------------------+
1 row in set (0.00 sec)
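To see why 'Blue' and 'Blew' both come out as B400, here is a sketch of the classic four-character Soundex algorithm in Python (MySQL's SOUNDEX() can return longer strings for long words, but it agrees on these inputs):

```python
def soundex(word: str) -> str:
    """Classic 4-character Soundex: first letter + up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")          # vowels, h, w, y have no code
        if code and code != prev:
            result += code
        if ch not in "hw":                # h/w do not reset the previous code
            prev = code
    return (result + "000")[:4]           # pad with zeros, truncate to 4
```

Vowels carry no code, so 'Blue' (B, l→4, u, e) and 'Blew' (B, l→4, e, w) both reduce to B400, which is exactly why SOUNDEX('Blue') = SOUNDEX('Blew') is true.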
Not sure about MySQL, but on SQL Server, LIKE's character classes are your friend:
SELECT * FROM Table WHERE Field LIKE 'Black Bl[eu]%'
Assuming that you're looking to match the 'similar' spellings, rather than on the fact that the two words sound the same :)
Related
I am trying to use a regex with MySQL that matches whole words in a JSON array string, but I don't want the regex to depend on the order of the words, because I don't know it in advance.
So I first wrote my regex on regex101 (https://regex101.com/r/wNVyaZ/1) and then tried to convert it for MySQL.
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Radiothérapie[[:>:]]).+';
WHERE `Wish`.`services` REGEXP '^([^>].*[[:<:]]Hygiène[[:>:]])([^>].*[[:<:]]Andrologie[[:>:]]).+';
The first query returns a result, because "Hygiène" comes before "Radiothérapie" in the data, but in the second query "Andrologie" comes before "Hygiène", not after it as written in the query. The problem is that the query is generated automatically from a list of services chosen in no particular order, and I want to match the words (as whole words) if they exist, no matter what order they appear in.
You can search for words in JSON like the following (I tested on MySQL 5.7):
select * from wish
where json_search(services, 'one', 'Hygiène') is not null
and json_search(services, 'one', 'Andrologie') is not null;
+------------------------------------------------------------+
| services |
+------------------------------------------------------------+
| ["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"] |
+------------------------------------------------------------+
See https://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-search
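The two JSON_SEARCH conditions above boil down to order-independent membership testing: each wanted word must appear somewhere in the array. The same idea, sketched outside the database in plain Python with the column value as a JSON string:

```python
import json

def matches_all(services_json: str, wanted: set) -> bool:
    """True when every wanted service appears in the JSON array, in any order."""
    return wanted <= set(json.loads(services_json))

# A row like the one in the result table above
row = '["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"]'
```

Because membership in a set ignores position, the order the services were chosen in no longer matters, which is exactly what the chained regexp could not express.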
If you can, use the JSON search queries (you need a MySQL with JSON support).
If it's advisable, consider changing the database structure and enter the various "words" as a related table. This would allow you much more powerful (and faster) queries.
JOIN has_service AS hh ON (hh.row_id = id)
JOIN services AS ss ON (hh.service_id = ss.id
    AND ss.name IN ('Hygiène', 'Angiologie', ...))
Otherwise, in this context, consider that you're not really doing a regexp search, and that you're doing a full table scan anyway (unless you're on MySQL 8.0+ or Percona Server 5.7+ (not sure) with an index covering the full extent of the 'services' column), so several LIKE clauses will actually cost you less:
WHERE (services LIKE '%"Hygiène"%'
OR services LIKE '%"Angiologie"%'
...)
or
IF(services LIKE '%"Hygiène"%', 1, 0)
+IF(services LIKE '%"Angiologie"%', 1, 0)
+ ... AS score
HAVING score > 0 -- or score = 5 if you only want rows matching all five
ORDER BY score DESC;
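The scoring variant just counts how many of the chosen keywords appear in the row. The logic of those summed IF(... LIKE ...) terms, sketched in Python with a hypothetical sample row:

```python
def score(services: str, keywords) -> int:
    # emulate IF(services LIKE '%"kw"%', 1, 0) summed over all keywords;
    # the surrounding quotes keep it a whole-word match inside the JSON text
    return sum(f'"{kw}"' in services for kw in keywords)

row = '["Andrologie", "Angiologie", "Hygiène", "Radiothérapie"]'
```

Rows can then be kept when the score is positive (any match) or equal to the number of keywords (all must match), mirroring the HAVING clause above.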
My table/model has a TEXT type column, and when filtering for records on the model itself, the AR where produces the correct SQL and returns correct results. Here is what I mean:
MyNamespace::MyValue.where(value: 'Good Quality')
Produces this SQL :
SELECT `my_namespace_my_values`.*
FROM `my_namespace_my_values`
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
Take another example where I'm joining MyNamespace::MyValue and filtering on the same value column, but from the other model (which has a relation to my_values). See this (query #2):
OtherModel.joins(:my_values).where(my_values: { value: 'Good Quality' })
This does not produce the correct query: it filters on the value column as if it were a String column and not Text, therefore producing incorrect results, like so (only pasting the relevant WHERE):
WHERE `my_namespace_my_values`.`value` = 'Good Quality'
Now I can get past this by doing LIKE inside my AR where, which will produce the correct result but a slightly different query. This is what I mean:
OtherModel.joins(:my_values).where('my_values.value LIKE ?', '%Good Quality%')
Finally, arriving at my questions: what is this, and how is it being generated for a where on the model (for a text column type)?
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
Maybe most importantly, what is the difference in terms of performance between using:
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
and this :
(my_namespace_my_values.value LIKE '%Good Quality%')
and, more importantly, how do I get my query with joins (query #2) to produce a WHERE like this:
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
(Partial answer -- approaching from the MySQL side.)
What will/won't match
Case 1: (I don't know where the extra backslashes and quotes come from.)
WHERE `my_namespace_my_values`.`value` = '\\\"Good Quality\\\"'
\"Good Quality\" -- matches
Good Quality -- does not match
The product has Good Quality. -- does not match
Case 2: (Find Good Quality anywhere in value.)
WHERE my_namespace_my_values.value LIKE '%Good Quality%'
\"Good Quality\" -- matches
Good Quality -- matches
The product has Good Quality. -- matches
Case 3:
WHERE `my_namespace_my_values`.`value` = 'Good Quality'
\"Good Quality\" -- does not match
Good Quality -- matches
The product has Good Quality. -- does not match
Performance:
If value is declared TEXT, all cases are slow.
If value is not indexed, all are slow.
If value is VARCHAR(255) (or smaller) and indexed, Cases 1 and 3 are faster. It can quickly find the one row, versus checking all rows.
Phrased differently:
LIKE with a leading wildcard (%) is slow.
Indexing the column is important for performance, but a TEXT column cannot be indexed without specifying a prefix length.
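Why the leading wildcard is the killer can be sketched in Python: a BTREE index behaves like a sorted list, so a prefix pattern (LIKE 'Good%') can be answered with two binary searches, while LIKE '%Good%' forces a look at every row. (The sample data below is hypothetical.)

```python
from bisect import bisect_left, bisect_right

names = sorted(["Good Quality", "Bad Quality", "Good Value", "Quality Goods"])

def prefix_match(items, prefix):
    # LIKE 'Good%': an index seek, i.e. two binary searches on sorted data
    return items[bisect_left(items, prefix):bisect_right(items, prefix + "\uffff")]

def substring_match(items, needle):
    # LIKE '%Good%': no index can help, every row must be examined
    return [s for s in items if needle in s]
```

The first function touches O(log n) entries to locate the matching range; the second is inherently O(n), which is the "checking all rows" cost described above.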
What is this and how is it being generated for where on the model (for text column type)?
That's generated by Active Record's query-building engine (Arel).
See my answer below on your second question as to why.
What is the difference in terms of performance using...
The "=" operator matches by whole-string comparison, while LIKE matches character by character.
In my projects I have tables with millions of rows; from my experience it's really faster to use the "=" comparator or REGEXP than to use LIKE in a query.
How do I get my query with joins (query #2) produce where like this...
Can you try this,
OtherModel.joins(:my_values).where(OtherModel[:value].eq('\\\"Good Quality\\\"'))
I think it might be helpful.
To search for \n, specify it as \\n. To search for \, specify it as \\\\; this is because the backslashes are stripped once by the parser and again when the pattern match is made, leaving a single backslash to be matched against.
LIKE and = are different operators.
= is a comparison operator that operates on numbers and strings. When comparing strings, the comparison operator compares whole strings.
LIKE is a string operator that compares character by character.
mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+-----------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+-----------------------------------------+
| 0 |
+-----------------------------------------+
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+--------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+--------------------------------------+
| 1 |
+--------------------------------------+
The '=' op looks for an exact match, while the LIKE op works more like pattern matching, with '%' playing a role similar to '.*' in regular expressions.
So if you have entries with
Good Quality
More Good Quality
only LIKE will get both results.
Regarding the escape string, I am not sure where it is generated, but it looks like some standardized escaping to make the value valid for SQL.
As I am sure a lot of people here are aware, having to deal with German umlauts and UTF8 collations can be problematic to say the least. Stuff like a = ä, o = ö, u = ü is not only capable of affecting the sort order of the results but the actual results as well. Here is an example which clearly demonstrates how things can go wrong by simply trying to make a distinction between a singular and plural version of a noun (Bademantel - singular, Bademäntel - plural).
CREATE TABLE keywords (
id INT (11) PRIMARY KEY AUTO_INCREMENT,
keyword VARCHAR (255) NOT NULL
) ENGINE = MyISAM
DEFAULT CHARACTER SET = utf8 COLLATE = utf8_unicode_ci;
INSERT INTO keywords (keyword) VALUES ('Bademantel'), ('Bademäntel');
SELECT * FROM keywords WHERE keyword LIKE ('%Bademäntel%');
Results should be
+----+------------+
| id | keyword |
+----+------------+
|  2 | Bademäntel |
+----+------------+
yet with utf8_unicode_ci the output is
+----+------------+
| id | keyword |
+----+------------+
| 1 | Bademantel |
| 2 | Bademäntel |
+----+------------+
which is clearly not the required result.
The actual problem comes up in my current project. It involves writing a keyword parser which is basically supposed to replace every occurrence of a keyword on the website with a link to the appropriate product page. To avoid an unnecessary waste of resources, only distinct keywords are fetched, but using either
SELECT keyword FROM keywords GROUP BY keyword ORDER BY LENGTH(keyword) DESC
or
SELECT DISTINCT keyword FROM keywords ORDER BY LENGTH(keyword) DESC
will fail to process (link) all the non-umlaut versions of the words, simply because they are not fetched by the query (i.e. all the keywords containing Bademäntel will be fetched, but Bademantel will be omitted).
Now I realize that I have a couple of options to resolve this problem.
1) Use utf8_swedish_ci for the keywords table or during the queries which would effectively save me from having to modify a lot of existing code.
SELECT DISTINCT keyword COLLATE utf8_swedish_ci AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
Unfortunately, I am rather reluctant to abandon utf8_unicode_ci because a) it offers a really nice feature of handling "Eszett" (ss and ß are considered the same), and b) it simply feels wrong to use a Swedish collation to handle German-related stuff.
2) Modify the existing code to make use of utf8_bin.
SELECT DISTINCT keyword COLLATE utf8_bin AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
This works as intended, but it has the nasty drawback that all comparison is case-sensitive, which means that if I relied on utf8_bin to solve the problem, I would have a hard time doing case-insensitive queries like LIKE '%Mäntel%', which would most definitely omit records like Bademäntel.
I know this question pops up every now and then on SO, but some of the answers are pretty old by now, and I just want to know whether some other solution has emerged in the meantime. I mean, I really can't get around the thought that a simple collation is allowed to completely change the results of a query. The sort order, yes, but the results themselves?
Sorry for a bit longer post and thanks in advance for any kind of advice or comment.
For anyone else encountering this problem, it's worth noting that since MySQL 5.6 there is official support for the utf8_german2_ci collation, which solves all of the above problems. Better late than never, I guess.
You could do a binary comparison using the BINARY keyword: WHERE BINARY keyword = 'Bademantel'. The result would be the expected one.
Check out this sqlfiddle, which shows this:
SELECT * FROM stackoverflow WHERE BINARY keyword = 'Bademantel';
| id | keyword |
|----|------------|
| 1 | Bademantel |
SELECT * FROM stackoverflow WHERE keyword = 'Bademantel';
| id | keyword |
|----|------------|
| 1 | Bademantel |
| 2 | Bademäntel |
More about this behavior here: What effects does using a binary collation have? and here: What is the best MySQL collation for German language
So for applications with German umlauts, French grave accents, or special characters in Czech/Polish, you have to decide which behavior is best for your application.
Most cases will be fine with utf8_general_ci, but sometimes you have to use utf8_bin for cases like your Bademantel.
The collation-aware comparison isn't bad at all, and can even help you: under utf8_unicode_ci, if you have saved a string like Straße, searching for Strasse will also return Straße.
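What BINARY actually does is compare the raw encoded bytes, so both case and accents become significant. A minimal Python sketch of that comparison:

```python
def binary_equal(a: str, b: str) -> bool:
    # WHERE BINARY a = b compares raw UTF-8 bytes:
    # no collation is applied, so case and accents both count
    return a.encode("utf-8") == b.encode("utf-8")
```

This is why BINARY separates Bademantel from Bademäntel, but also why it refuses to match across case differences.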
How would I get MySQL to be more strict with character matching?
A quick example of what I mean: say I have a table with a single column `name`. In this column, I have two names: 'Jorge' and 'Jorgé'. The only difference between these names is the ´ over the e. If I run the query SELECT * FROM table WHERE name = 'Jorge', it will return
+--------+
| name |
+--------+
| Jorge |
| Jorgé |
+--------+
and if I run the query SELECT * FROM table WHERE name = 'Jorgé', it returns the same result table. How would I set MySQL to be stricter, so that it would not return both names?
Thanks ahead.
Quick Edit: I'm using the UTF-8 character encoding
If you want to make sure that no similar characters (like e and é) are considered the same, you should use the utf8_bin collation on that column. I assume that you're using utf8_general_ci now, which will consider some similar characters to be the same. utf8_bin only matches on the exact same characters.
@G-Nugget is correct, but since you are looking at Spanish data you might also be interested in utf8_spanish_ci or utf8_spanish2_ci, which correspond to modern and traditional Spanish. "ñ" is considered a separate letter, and in traditional Spanish "ch" and "ll" are also treated as separate letters.
More here: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
I have a MySQL db, db engine InnoDB, collation set to utf8/utf8_general_ci (I also tried utf8_unicode_ci). I would like the db to treat č and c, ž and z, ć and c, š and s, and đ and d as equal.
E.g,
table1
-------------
id | name
-------------
1 | mačka
2 | đemper
-------------
if I run query:
SELECT * FROM table1 WHERE name LIKE '%mac%'
or
SELECT * FROM table1 WHERE name LIKE '%mač%'
I will get the result:
-------------
id | name
-------------
1 | mačka
Which is OK, that is exactly what I want.
But if run query:
SELECT * FROM table1 WHERE name LIKE '%de%'
I get zero results.
And if I run query:
SELECT * FROM table1 WHERE name LIKE '%đe%'
I will get:
-------------
id | name
-------------
2 | đemper
This is not the behaviour I would want or expect. I would like both of the last two queries to return:
-------------
id | name
-------------
2 | đemper
How can I accomplish this?
Any kind of help is appreciated, thanks in advance :) !
This can't be done without the use of regular expressions, as there is no collation in MySQL that considers đ equivalent to d.
The collation you are using determines things like this -- what characters are considered 'equal', and what order they should sort in. But first off, you need to know what encoding your table is using.
The command SHOW TABLE STATUS LIKE 'table1'\G should show you that. That will help you determine the collation you need to use.
If it's Unicode (e.g. UTF-8), then you need to set a Unicode collation. There doesn't appear to be a built-in one in MySQL for Croatian. You can check the MySQL Character Set manual page to see if anything there is going to be 'close enough'.
If it's iso-latin-2 (iso-8859-2), then you can use 'latin2_croatian_ci' collation.
If it's CP-1250, then there is also a 'cp1250_croatian_ci' collation.
The non-unicode collations are in the manual here.
EDIT
As Ignacio Vazquez-Abrams correctly points out, none of the MySQL collations consider 'đ' to be equivalent to 'd'. (Reference for MySQL collations)
If you are really eager to put a lot of time into this, you can also read up on how to install your own custom collation
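Short of installing a custom collation, the folding can also be done in the application, e.g. by storing a folded copy of the text in an extra searchable column. A Python sketch using Unicode decomposition: č/ž/ć/š decompose under NFD into a base letter plus a combining mark, but đ (U+0111) has no decomposition, so it needs an explicit mapping.

```python
import unicodedata

# đ/Đ do not decompose under NFD, so map them by hand
EXTRA = str.maketrans({"đ": "d", "Đ": "D"})

def fold(s: str) -> str:
    # decompose accented letters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", s.translate(EXTRA))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Querying the folded column with LIKE '%de%' would then match both đemper and any plain-letter spellings, regardless of collation.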