How to avoid invalid errors when comparing integers with explicit collation in MySQL 5.6 - mysql

Okay, not the clearest title ever; feel free to improve.
I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.
I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.
I figured this should be a fairly easy task; just do:
SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin
But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.
No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:
For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.
Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.
Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?

SET NAMES utf8mb4;
CREATE TABLE x(s VARCHAR(11) COLLATE utf8mb4_unicode_520_ci NOT NULL);
INSERT INTO x (s)
VALUES ('aha'), ('áha̱'), ('ā̧́ḫà̀'),
('i'), ('i̯');
SELECT s FROM x GROUP BY s HAVING COUNT(*) > 1;
Comes back with
aha
i
without any complaints about numeric stuff.
I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.
What am I doing differently from your case?
When adding an explicit COLLATE clause, put it on the component of the query that needs it. (COLLATE does not apply to the query as a whole; different parts can be collated differently.)
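As a minimal sketch of that advice, assuming the table x from the example above: the COLLATE clause is attached to the string column being grouped rather than to the numeric COUNT() comparison, so no number has to be cast to a string. The aliases exact_s and n are only illustrative names.

```sql
-- Group on the byte-exact form of s; COLLATE binds to the column, not to COUNT(*).
SELECT s COLLATE utf8mb4_bin AS exact_s,
       COUNT(*)              AS n
FROM x
GROUP BY exact_s
HAVING n > 1;
```

With the five sample rows this should return nothing, since under utf8mb4_bin all of them are distinct; only byte-for-byte duplicates would be listed.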

Related

Migrating from SQL server to MySQL using pentaho unicode issue

I have a problem migrating data from SQL Server to MySQL. I have nvarchar columns in SQL Server and am exporting them to a Unicode text file. But when I import the column into a UTF-8 table in MySQL, I get a duplicate-value error: MySQL sees no difference between 'Kaneko, Shûsuke' and 'Kaneko, Shusuke'. I am trying to get these values into a unique column.
What's wrong?
Must I use another charset in MySQL?
I also tried converting the textfile to utf8 before importing into MySQL, but still getting the same error.
The problem seems to be in how your MySQL table was created. First run SHOW CREATE TABLE at the mysql prompt and look at the table structure: have you used the right character set and collation? The MySQL docs cover both.
Many collations are not only case-insensitive but also partly accent-insensitive, so á = a. (I originally also claimed ñ = n; as Joni Salonen points out, that is incorrect!)
So we can use a binary collation, but it has its own drawback. A binary collation compares strings exactly as strcmp() in C would: two strings are different if any characters differ, be it just in case or in diacritics. The downside is that the sort order is not natural.
An example of unnatural sort order (as in "binary"): A, B, a, b. A natural sort order in this case would be e.g. A, a, B, b (lowercase and capital variants of the same letter are sorted next to each other).
The practical advantage of a binary collation is its speed, as string comparison is very simple and fast. In the general case, indexes on binary-collated columns might not produce the expected sort order; for exact matches, however, they can be useful.
Use a binary collation for the specific column (possibly your best bet).
For example:
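-- Start from a clean slate; IF EXISTS avoids an error when cc is absent.
DROP TABLE IF EXISTS cc;
CREATE TABLE cc ( c CHAR(100) PRIMARY KEY ) DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
-- Both rows are accepted: utf8_bin treats û and u as different characters.
INSERT INTO cc VALUES ( 'Kaneko, Shûsuke' );
INSERT INTO cc VALUES ( 'Kaneko, Shusuke' );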
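The example above sets utf8_bin as the default for the whole table. A column-level variant might look like the sketch below. The table name cc_col, the column names and the key name are made up for illustration, and it assumes the connection character set is utf8.

```sql
SET NAMES utf8;  -- assumption: the client sends UTF-8

-- Only `name` is binary; other text columns keep the accent-insensitive table default.
CREATE TABLE cc_col (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY uq_name (name)
) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO cc_col (name) VALUES ('Kaneko, Shûsuke'), ('Kaneko, Shusuke');

-- An accent-insensitive lookup is still possible by overriding the collation per query.
SELECT name FROM cc_col WHERE name COLLATE utf8_general_ci = 'Kaneko, Shusuke';
```

The unique key then distinguishes the two spellings, while the last query should still match both rows, at the cost of not using the index for that comparison.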

Mysql common ascii character collation

I am getting an error like this:
COLLATION 'latin1_swedish_ci' is not valid for CHARACTER SET 'utf8'
Whenever I try to run a particular query. The problem in my case is, that I need this query to be able to run - without modification - against two separate databases, which have a different character collation (one is latin1, the other is utf8).
Since the strings I am trying to match are guaranteed to be basic letters (a-z), I was wondering if there was any way to force the comparison to work irrespective of the specific encoding?
I mean, an A is an A no matter how it is encoded - is there some way to tell MySQL to compare the content of the strings as letters rather than as whatever binary thing it does internally? I don't even understand why it can't auto-convert collations, since it is quite capable of doing it when explicitly told to.

Accent insensitive search query in MySQL

Is there any way to make search query accent insensitive?
the column's and table's collation are utf8_polish_ci and I don't want to change them.
example word : toruń
select * from pages where title like '%torun%'
It doesn't find "toruń". How can I do that?
You can change the collation at runtime in the sql query,
...where title like '%torun%' collate utf8_unicode_ci
but beware that changing the collation on the fly at runtime forgoes the possibility of mysql using an index, so performance on large tables may be terrible.
Or, you can copy the column to another column, such as searchable_title, but change the collation on it. It's actually common to do this type of stuff, where you copy data but have it in some slightly different form that's optimized for some specific workload/purpose. You can use triggers as a nice way to keep the duplicated columns in sync. This method has the potential to perform well, if indexed.
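A rough sketch of the duplicated-column approach, assuming a table pages whose title column fits in VARCHAR(255). The column type, the index name and the trigger names are assumptions, not taken from the question.

```sql
-- Add an accent-insensitive copy of title and index it.
ALTER TABLE pages
  ADD COLUMN searchable_title VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
  ADD INDEX idx_searchable_title (searchable_title);

-- Backfill existing rows once.
UPDATE pages SET searchable_title = title;

-- Keep the copy in sync on every insert and update.
CREATE TRIGGER pages_bi BEFORE INSERT ON pages
FOR EACH ROW SET NEW.searchable_title = NEW.title;

CREATE TRIGGER pages_bu BEFORE UPDATE ON pages
FOR EACH ROW SET NEW.searchable_title = NEW.title;

-- A prefix search can use the index; a pattern with a leading % cannot.
SELECT * FROM pages WHERE searchable_title LIKE 'torun%';
```

Note that the index only helps patterns without a leading wildcard, so '%torun%' would still scan the table, just without the per-row collation conversion.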
Note - Make sure that your db really has those characters and not html entities.
Also, the character set of your connection matters. The above assumes it's set to utf8, for example via SET NAMES, as in SET NAMES utf8.
If not, you need an introducer for the literal value
...where title like _utf8'%torun%' collate utf8_unicode_ci
and of course, the value in the single quotes must actually be utf8 encoded, even if the rest of the sql query isn't.
This won't work in extreme circumstances, but try changing the column collation to utf8_unicode_ci (character set UTF-8). Then accented characters will be equal to their non-accented counterparts.
You could try SOUNDEX:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
This compares two strings by how they sound. But it obviously returns many more results.

Accent-insensitive searches / problems with utf8_general_ci collation

Edit: if you're here because you're confused by the polish collation in MySQL, read this.
I'm trying to perform a full-text search on a table of Polish cities, and many of them contain accented characters. It's meant to be used in an ajax call for auto-completion, so it would be nice if the search was accent-insensitive. I've set the collation of the rows to utf8_polish_ci. Now, given the city "Zelów", I query the database like this
SELECT * FROM `cities` WHERE MATCH( city ) AGAINST ("zelow")
but to no avail. Mysql returns an empty result. I've tried different accents, tried adding different collations to the query but nothing helped. I'm not sure how I should approach this because accent-sensitivity seems to be poorly documented. Any ideas?
EDIT
So I found out that the case-insensitive full-text searches are performed only IN BOOLEAN MODE, so the correct query would be
SELECT * FROM `cities` WHERE MATCH( city ) AGAINST ( "zelow" IN BOOLEAN MODE )
Previously I thought otherwise due to a misleading comment on dev.mysql.com. There might be more to it but I'm just really confused right now.
Anyway, as mentioned in the comments below, I have UNIQUE index on the cities column so changing the collation of the table to accent-insensitive utf8_general_ci is out of the question.
I realized however, that the following query works quite well on a table with utf8_polish_ci collation:
SELECT * FROM `cities` WHERE city LIKE 'zelow' COLLATE utf8_general_ci
It would seem now that the most reasonable solution would be to do a full-text search in a similar fashion:
SELECT * FROM `cities` WHERE MATCH( city ) AGAINST ( 'zelow' IN BOOLEAN MODE ) COLLATE utf8_general_ci
This however yields the following error:
#1253 - COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
This is really starting to get on my nerves. Might as well abandon full-text search in favour of a simple where-like approach but it doesn't seem sensible in a table with almost 50k records which will be intensively queried...
LAST EDIT
Ok, the thing with boolean mode was partly bullshit. Only partly because it indeed works as I said, however, on a utf8_general_ci it works the other way around. I'm utterly perplexed and have no will to study this issue further. I decided to drop the UNIQUE index (no further cities will be added anyway so no need to make a big deal out of it) and stick with the utf8_general_ci table collation. I appreciate all the help, it steered me in the right direction.
Change your collation to utf8_general_ci. It ignores accents when searching and ordering but still stores them correctly.
MySQL is very flexible in the encoding/collation area, maybe too flexible. When changing your encoding/collation, make sure you are converting the table, not just changing the encoding/collation types.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
You can also convert individual fields, so your table can have a collation setting of utf8_general_ci, but you can change one or more fields so they use some other collation. Based on the "binary" error you are seeing, it seems your text field might have a collation of utf8_bin (or be a blob). Can you post the result of SHOW CREATE TABLE?
Remember, the CHARACTER SET (encoding) is how the data is stored; the collation is how it is compared, sorted and therefore indexed. Not all combinations work: a collation can only be used with the character set it belongs to.
My original problem, and question, might help a little:
Converting mysql tables from latin1 to utf8
If you try :
select * from cities where cityname like 'zelow'
Change your collation from binary to utf8_bin. utf8_bin should be compatible with utf8_general_ci, but will still allow you to store city names with differing accents.

UTF8 string comparisons in MySQL

We have issues with utf8-string comparisons in MySQL 5, regarding case and accents :
From what I gathered, MySQL implements collations by considering that "groups of characters should be considered equal".
For example, in the utf8_unicode_ci collation, all the letters "EÉÈÊeéèê" are in the same box (together with other variants of "e").
So if you have a table containing ["video", "vidéo", "vidÉo", "vidÊo", "vidêo", "vidÈo", "vidèo", "vidEo"] (in a varchar column declared with utf8_general_ci collation) :
when asking MySQL to sort the rows according to this column, the sorting is random (MySQL does not enforce a sorting rule between "é" and "É" for example),
when asking MySQL to add a Unique Key on this column, it raises an error because it considers all the values are equal.
What setting can we fiddle with to fix these two points ?
PS : on a related note, I do not see any case sensitive collation for the utf8 charset. Did I miss something ?
[edit] I think my initial question still holds some interest, and I will leave it as is (and maybe one day get a positive answer).
It turned out, however, that our problems with string comparisons regarding accents were not linked to the collation of our text columns. They were linked to a configuration problem with the character_set_client parameter when talking to MySQL - which defaulted to latin1.
Here is the article that explained it all to us, and allowed us to fix the problem :
Getting out of MySQL character set hell
It is lengthy, but trust me, you need this length to explain both the problem and the fix.
Use a collation that considers these characters to be distinct. Maybe utf8_bin (it's case-sensitive, since it does a binary comparison of characters).
http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html
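A small sketch of how that suggestion could address both points from the question. The table name words, the column w and the key name are hypothetical, and the connection is assumed to be UTF-8.

```sql
SET NAMES utf8;  -- assumption: the client sends UTF-8

CREATE TABLE words (
  w VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  UNIQUE KEY uq_w (w)  -- accepted: under utf8_bin all eight values differ
);

INSERT INTO words (w) VALUES
  ('video'), ('vidéo'), ('vidÉo'), ('vidÊo'),
  ('vidêo'), ('vidÈo'), ('vidèo'), ('vidEo');

-- Deterministic but not linguistic: rows come back in byte order.
SELECT w FROM words ORDER BY w;

-- Linguistic order first, with the binary value as a stable tie-breaker.
SELECT w FROM words ORDER BY w COLLATE utf8_unicode_ci, w;
```

The second ORDER BY is one way to keep a human-friendly ordering while still getting a repeatable order among values the case-insensitive collation considers equal.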