Croatian diacritic signs in MySQL DB - LIKE clause

I have a MySQL database (InnoDB engine) with character set utf8 and collation utf8_general_ci (I also tried utf8_unicode_ci). I would like the database to treat č and c, ž and z, ć and c, š and s, and đ and d as equal.
E.g.,
table1
-------------
id | name
-------------
1 | mačka
2 | đemper
-------------
If I run the query:
SELECT * FROM table1 WHERE name LIKE '%mac%'
or
SELECT * FROM table1 WHERE name LIKE '%mač%'
I will get the result:
-------------
id | name
-------------
1 | mačka
Which is OK, that is exactly what I want.
But if I run the query:
SELECT * FROM table1 WHERE name LIKE '%de%'
I get zero results.
And if I run the query:
SELECT * FROM table1 WHERE name LIKE '%đe%'
I will get:
-------------
id | name
-------------
2 | đemper
This is not the behaviour I want or expect. I would like both of the last two queries to return:
-------------
id | name
-------------
2 | đemper
How can I accomplish this?
Any kind of help is appreciated, thanks in advance :) !

This can't be done without the use of regular expressions, as there is no collation in MySQL that considers đ equivalent to d.

The collation you are using determines things like this -- what characters are considered 'equal', and what order they should sort in. But first off, you need to know what encoding your table is using.
The command SHOW TABLE STATUS LIKE 'table1'\G should show you that. That will help you determine the collation you need to use.
If it's Unicode (UTF8, e.g.), then you need to set a Unicode collation. There doesn't appear to be one built-in to MySQL for Croatian. You can check the MySQL Character Set manual page to see if anything there is going to be 'close enough'.
If it's iso-latin-2 (iso-8859-2), then you can use 'latin2_croatian_ci' collation.
If it's CP-1250, then there is also a 'cp1250_croatian_ci' collation.
The non-unicode collations are in the manual here.
EDIT
As Ignacio Vazquez-Abrams correctly points out, none of the MySQL collations consider 'đ' to be equivalent to 'd'. (Reference for MySQL collations)
If you are really eager to put a lot of time into this, you can also read up on how to install your own custom collation.
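Since no built-in collation folds đ to d, one workaround not mentioned in the answers above is to fold the diacritics in the application and run the LIKE search against a pre-folded copy of the column. Below is a minimal Python sketch of that idea. The helper name fold_croatian and the separate folded column are assumptions for illustration, not something MySQL provides. It also shows why đ is the odd one out: č, ć, š and ž decompose into a base letter plus a combining mark, while đ has no Unicode decomposition and needs an explicit mapping.

```python
import unicodedata

# đ/Đ have no Unicode decomposition, so they need an explicit mapping.
EXPLICIT = str.maketrans({"đ": "d", "Đ": "D"})


def fold_croatian(text: str) -> str:
    """Return text with Croatian diacritics removed (hypothetical helper)."""
    # Map đ/Đ first, then split the other letters into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text.translate(EXPLICIT))
    # Drop the combining marks (caron, acute) and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_croatian("mačka"))   # macka
print(fold_croatian("đemper"))  # demper
# NFD alone leaves đ untouched, which is why the explicit table is needed.
print(unicodedata.normalize("NFD", "đ") == "đ")  # True
```

The same function would be applied to the search term, so that both '%de%' and '%đe%' become '%de%' before the query is built.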

Related

how to handle non ascii characters in where clause

I am having a problem with non-ASCII characters in a WHERE clause in Oracle, MySQL, and Snowflake queries.
SELECT * FROM TABLE WHERE col = 'Niño Pobre, Niño Rico';
This query returns no results.
If there is any way to handle non-ASCII characters in a WHERE clause, please let me know.
Thanks.
Maurcin and user3278684 made comments about the Snowflake data warehouse.
In Snowflake, when working with data in multiple languages, the COLLATE() and COLLATION() functions are very helpful.
https://docs.snowflake.net/manuals/sql-reference/functions/collate.html
https://docs.snowflake.net/manuals/sql-reference/functions/collation.html
Considerations, limitations, and the functions that support collation when searching are listed here: https://docs.snowflake.net/manuals/sql-reference/collation.html#limited-support-for-collation-in-built-in-functions
So say, for instance, you have a table called feedback with two columns:
| id | feedback_string |
| 1 | 'Niño Pobre, Niño Rico'|
SELECT collate(feedback_string) from feedback
WHERE feedback_string like '%Niño Pobre, Niño Rico%';
If you wanted to create a table to search for strings in a specific language, you can create the same table in Snowflake like this:
CREATE TABLE feedback (id NUMBER, feedback_string varchar(50) collate 'sp');
INSERT INTO feedback (id, feedback_string) VALUES (1, 'Niño Pobre, Niño Rico');
Then you can search with LIKE, but be aware that a search for N will be close to ñ.

German umlauts and UTF8 collations, revisited

As I am sure a lot of people here are aware, having to deal with German umlauts and UTF8 collations can be problematic to say the least. Stuff like a = ä, o = ö, u = ü is not only capable of affecting the sort order of the results but the actual results as well. Here is an example which clearly demonstrates how things can go wrong by simply trying to make a distinction between a singular and plural version of a noun (Bademantel - singular, Bademäntel - plural).
CREATE TABLE keywords (
id INT (11) PRIMARY KEY AUTO_INCREMENT,
keyword VARCHAR (255) NOT NULL
) ENGINE = MyISAM DEFAULT CHARACTER SET = utf8 COLLATE = utf8_unicode_ci;
INSERT INTO keywords (keyword) VALUES ('Bademantel'), ('Bademäntel');
SELECT * FROM keywords WHERE keyword LIKE ('%Bademäntel%');
Results should be
+----+------------+
| id | keyword |
+----+------------+
| 2 | Bademäntel |
+----+------------+
yet with utf8_unicode_ci the output is
+----+------------+
| id | keyword |
+----+------------+
| 1 | Bademantel |
| 2 | Bademäntel |
+----+------------+
which is clearly not the required result.
The actual problem is tied to my current project. It involves writing a keyword parser which is basically supposed to replace every occurrence of a keyword on the website with a link to the appropriate product page. In order to avoid an unnecessary waste of resources, only distinct keywords are fetched, but using either
SELECT keyword FROM keywords GROUP BY keyword ORDER BY LENGTH(keyword) DESC
or
SELECT DISTINCT keyword FROM keywords ORDER BY LENGTH(keyword) DESC
will result in failing to process (link) all the non-umlaut versions of the words simply because they are not fetched during the query (i.e. all the keywords containing Bademäntel will be fetched but Bademantel will be omitted).
Now I realize that I have a couple of options to resolve this problem.
1) Use utf8_swedish_ci for the keywords table or during the queries which would effectively save me from having to modify a lot of existing code.
SELECT DISTINCT keyword COLLATE utf8_swedish_ci AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
Unfortunately I am reluctant to abandon utf8_unicode_ci because a) it offers a really nice feature of sorting "Eszett" (ss and ß are considered the same), and b) somehow it simply feels wrong to use a Swedish collation to handle German-related stuff.
2) Modify the existing code to make use of utf8_bin.
SELECT DISTINCT keyword COLLATE utf8_bin AS keyword FROM keywords ORDER BY LENGTH(keyword) DESC;
This works as intended but it has a nasty drawback that all comparison is case-sensitive which means that if I decided to rely on utf8_bin as a solution for the problem I would have a hard time doing case-insensitive queries like LIKE('%Mäntel%') which would most definitely omit records like Bademäntel.
I know that this question pops up every now and then on SO, but some of the answers are now pretty old and I just want to know if there is some other solution that might have emerged in the meantime. I mean, I really can't get around the thought that a simple collation is allowed to completely change the results of a query. Sorting order yes, but the results themselves?
Sorry for a bit longer post and thanks in advance for any kind of advice or comment.
For anyone else encountering this problem it's worth noting that since MySQL 5.6 there is official support for utf8_german2_ci collation which solves all of the above problems. Better late, than never I guess.
You could do a binary comparison with the BINARY operator, as in WHERE BINARY keyword = 'Bademantel'. The result would be the expected one.
Check out this sqlfiddle, which shows this:
SELECT * FROM stackoverflow WHERE BINARY keyword = 'Bademantel';
| id | keyword |
|----|------------|
| 1 | Bademantel |
SELECT * FROM stackoverflow WHERE keyword = 'Bademantel';
| id | keyword |
|----|------------|
| 1 | Bademantel |
| 2 | Bademäntel |
More about this behavior here: What effects does using a binary collation have? and here: What is the best MySQL collation for German language
So for applications with German umlauts, French grave accents, or special characters in Czech or Polish, you have to decide which behavior is best for your application.
Most cases will be OK with utf8_general_ci, but sometimes you have to use utf8_bin for cases like your Bademantel.
Accent-insensitive comparison isn't bad at all; it will help you sometimes. With utf8_unicode_ci, if you have saved a string like Straße, you can search for Strasse and the query will also return Straße (utf8_general_ci instead compares ß as equal to a single s).
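To help decide which behavior fits, the three kinds of comparison discussed in this thread can be mimicked in Python. This is only a rough analogy built on Unicode normalization and case folding, not MySQL's actual collation algorithm, and the function names are made up for illustration.

```python
import unicodedata


def strip_marks(s: str) -> str:
    """Remove combining marks, so that ä becomes a."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def equal_binary(a: str, b: str) -> bool:
    # Roughly what utf8_bin does: exact match only.
    return a == b


def equal_case_insensitive(a: str, b: str) -> bool:
    # Case-insensitive but accent-sensitive; casefold() also maps ß to ss.
    norm_a = unicodedata.normalize("NFC", a).casefold()
    norm_b = unicodedata.normalize("NFC", b).casefold()
    return norm_a == norm_b


def equal_accent_insensitive(a: str, b: str) -> bool:
    # Roughly what the question observed with utf8_unicode_ci: ä equals a.
    return strip_marks(a).casefold() == strip_marks(b).casefold()


print(equal_accent_insensitive("Bademantel", "Bademäntel"))  # True
print(equal_case_insensitive("Bademantel", "Bademäntel"))    # False
print(equal_case_insensitive("bademäntel", "Bademäntel"))    # True
print(equal_binary("bademäntel", "Bademäntel"))              # False
```

The middle function is the behavior the question asks for: singular and plural stay distinct, while upper and lower case still match.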

MySQL fulltext does not search for short emails

I have a fulltext index on a number of columns and I'm trying to do a MATCH AGAINST IN BOOLEAN MODE on those columns, trying to find an email address. Here are the results:
if I search for "test@email.com" (with quotes), the query returns correct results
if I search for "a@b.com" (with quotes), the query does not return anything
Can someone tell me why a short email address like a@b.com does not get returned, and how I would solve this?
Here's the query I'm using:
SELECT MATCH(email, phone, title, description) AGAINST('"a@b.com"' IN BOOLEAN MODE) AS score
FROM thetable WHERE MATCH(email, phone, title, description)
AGAINST('"a@b.com"' IN BOOLEAN MODE) ORDER BY `status` DESC, score DESC
This is a combination of two problems:
@ isn't considered a 'word character', and neither is ., so searching for a@b.com actually comes down to searching for the words a, b and com
a and b are shorter than ft_min_word_len
The solution would be to make @ and . be considered word characters. There are several methods listed on http://dev.mysql.com/doc/refman/5.6/en/fulltext-fine-tuning.html
The most practical one would be adding a custom collation as described in
http://dev.mysql.com/doc/refman/5.6/en/full-text-adding-collation.html
Update:
a) You need to set ft_min_word_len = 1 in my.cnf.
b) Output of show variables:
ft_min_word_len | 1
c) Ran the query below:
mysql> SELECT name,email FROM jos_users WHERE MATCH (email) AGAINST ('a@b.com') limit 1;
+--------+---------+
| name | email |
+--------+---------+
| kap | a@b.com |
+--------+---------+
+--------+---------+
1 row in set (0.00 sec)
Hope this will help.
~K
I think you need to change ft_min_word_len.
As specified in the MySQL docs on fine-tuning:
The minimum and maximum lengths of words to be indexed are defined by
the ft_min_word_len and ft_max_word_len system variables. (See
Section 5.1.4, “Server System Variables”.) The default minimum value
is four characters; the default maximum is version dependent. If you
change either value, you must rebuild your FULLTEXT indexes. For
example, if you want three-character words to be searchable, you can
set the ft_min_word_len variable by putting the following lines in an
option file:
[mysqld]
ft_min_word_len=3
Then restart the server and rebuild your FULLTEXT indexes. Note particularly the remarks regarding myisamchk in the instructions
following this list.
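The two effects described in this thread (splitting on non-word characters, then dropping short words) can be illustrated with a small Python sketch. It only approximates the full-text parser by splitting on anything that is not a letter, digit or underscore; the real parser's rules differ in detail. The function name indexed_words is made up for illustration, and the default of 4 is taken from the manual excerpt quoted above.

```python
import re

FT_MIN_WORD_LEN = 4  # default minimum quoted from the manual above


def indexed_words(text: str, min_len: int = FT_MIN_WORD_LEN) -> list[str]:
    """Approximate which words of text would end up in the index."""
    # Split on anything that is not a letter, digit or underscore.
    words = re.findall(r"\w+", text)
    # Words shorter than the minimum length are not indexed.
    return [w for w in words if len(w) >= min_len]


print(indexed_words("test@email.com"))      # ['test', 'email']
print(indexed_words("a@b.com"))             # []
print(indexed_words("a@b.com", min_len=1))  # ['a', 'b', 'com']
```

With the default minimum, the short address leaves nothing to match, which is consistent with the empty result in the question. Lowering the minimum to 1 makes all three parts indexable once the index is rebuilt.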

MySQL - Characters matching

How would I get MySQL to be more strict with character matching?
A quick example of what I mean: say I have a table with a single column `name`. In this column, I have two names: 'Jorge' and 'Jorgé'. The only difference between these names is the ´ over the e. If I run the query SELECT * FROM table WHERE name = 'Jorge', it will return
+--------+
| name |
+--------+
| Jorge |
| Jorgé |
+--------+
and if I run the query SELECT * FROM table WHERE name = 'Jorgé', it returns the same result table. How would I set MySQL to be more strict in that so that it would not return both names?
Thanks ahead.
Quick Edit: I'm using the UTF-8 character encoding
If you want to make sure that no similar characters (like e and é) are considered the same, you should use the utf8_bin collation on that column. I assume that you're using utf8_general_ci now, which will consider some similar characters to be the same. utf8_bin only matches on the exact same characters.
@G-Nugget is correct, but since you are looking at Spanish stuff you might also be interested in utf8_spanish_ci or utf8_spanish2_ci. They correspond to modern and traditional Spanish. "ñ" is considered a separate letter, and in traditional Spanish "ch" and "ll" are also treated as separate letters.
More here: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Can I make MySQL SELECT LIKE syntax smarter?

I've noticed that LIKE 'a%' would return results with 'árbol' but LIKE 'AE%' will not return results like "Æther". What is the extent to which LIKE is smart? Let's say I have these database entries:
Black Blue
Black Blew
Is there any way to have MySQL match both smartly with LIKE 'Black Bleu' (Since they are quite close)? Or is LIKE '_%' only capable of matching exact characters, with the aforementioned exception?
MySQL's fulltext searching is pretty limited, as it's a database, not a search engine.
You should look into something like Apache Solr, which supports all sorts of things like "sounds like" matching, stemming (i.e. "smarter" and "smart" are the same), etc.
Another option is to use SOUNDEX() or the SOUNDS LIKE operator:
http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#operator_sounds-like
http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_soundex
mysql> SELECT SOUNDEX('Blue'), SOUNDEX('Blew'), SOUNDEX('Blue') = SOUNDEX('Blew');
+-----------------+-----------------+-----------------------------------+
| SOUNDEX('Blue') | SOUNDEX('Blew') | SOUNDEX('Blue') = SOUNDEX('Blew') |
+-----------------+-----------------+-----------------------------------+
| B400 | B400 | 1 |
+-----------------+-----------------+-----------------------------------+
1 row in set (0.00 sec)
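To show why Blue and Blew end up with the same code, here is a sketch of the classic four-character Soundex algorithm in Python. It is an illustration of the general technique only: MySQL's SOUNDEX() differs in some details (for example, it can return codes longer than four characters), so results will not always agree with the server.

```python
# Consonant groups of classic Soundex; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the classic 4-character Soundex code of an ASCII word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Keep the first letter, pad with zeros, and cut to four characters.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Blue"), soundex("Blew"))  # B400 B400
```

Only the B and the L contribute to the code here, since vowels and W carry no digit. That is why both spellings collapse to B400.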
Not sure about MySQL, but on SQL Server, LIKE character classes are your friend:
SELECT * FROM Table WHERE Field LIKE 'Black Bl[eu]%'
This assumes that you're looking to match the 'similar' spellings, rather than the fact that the two words sound the same :)