MySQL Handling of ß in utf8 group by primary key - mysql

I have a utf8 table in mysql 5.5.19.
When I select data from it, I get data such as Roßdorf and Rosdorf on different rows.
When I group by on the data however, or select distinct, I get only Roßdorf.
My basic problem is that the data is exported to Hive (which gets the correct group by, keeping both alternatives distinct) and then imported into MySQL again. The import then fails because MySQL treats Roßdorf and Rosdorf as the same key (and the PK constraint breaks). I guess this happens for the same reason as the grouping behaviour above, so an answer to that would be very helpful.

Use utf8_bin as the collation; it differentiates accented and non-accented characters.
See MySQL collation charts
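As a rough illustration of the difference, assuming the column currently uses utf8_general_ci (the default collation for utf8, under which ß compares equal to s), the two spellings can be compared directly. The table and column names in the ALTER statement (places, name) are hypothetical and only stand in for the real key column.

```sql
SET NAMES utf8;

-- Compare the two spellings under each collation.
-- utf8_general_ci treats 'ß' as equal to 's'; utf8_bin compares the bytes.
SELECT 'Roßdorf' = 'Rosdorf' COLLATE utf8_general_ci AS same_general_ci,
       'Roßdorf' = 'Rosdorf' COLLATE utf8_bin        AS same_bin;

-- Hypothetical table and column: switch only the key column to utf8_bin,
-- so the primary key keeps both spellings apart.
ALTER TABLE places
  MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL;
```

The first column should come back as 1 and the second as 0. Note that utf8_bin is also case-sensitive, so Rosdorf and rosdorf become different keys as well.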

Related

How to avoid invalid errors when comparing integers with explicit collation in MySQL 5.6

Okay, not the clearest title ever; feel free to improve.
I have a table representing thousands of linguistic forms. Many of these make heavy use of diacritics, so all of aha, áha̱ and ā̧́ḫà̀ may appear. The table (and the database) uses UTF-8 as character set and utf8mb4_unicode_520_ci as the default collation scheme, since searching should be case- and diacritic-agnostic (so searching for aha should bring up all three). These forms have all been entered in manually by human beings, though, so there are inevitably duplicates.
I’m currently trying to get a list of exactly identical forms in order to get rid of duplicates (manually – each token would have to be checked before being removed), but in this case I need to search in a diacritic-aware manner – that is, given the three tokens listed above, I would expect a search to yield no results, since they are three different forms because of the diacritics.
I figured this should be a fairly easy task; just do:
SELECT token FROM table GROUP BY token HAVING COUNT(token) > 1 COLLATE utf8mb4_bin
But alas, that does not work. Instead, it gives me an error message that “COLLATION utf8mb4_bin is not valid for CHARACTER SET latin1”. I should note that I have absolutely nothing Latin-1 anywhere – no character sets, no collations, no server charsets, nothing. There are also no stored procedures or anything else where Latin-1 might creep in.
No, this is because of this bug, which is apparently fixed from 5.7 onwards; see the description at the bottom:
For constructs such as ORDER BY numeric_expr COLLATE collation_name, the character set of the expression was treated as latin1, which resulted in an error if the collation specified after COLLATE is incompatible with latin1. Now when a numeric expression is implicitly cast to a character expression in the presence of COLLATE, the character set used is the one associated with the named collation.
Unfortunately, I’m on 5.6, and I don’t have the option of upgrading (annoyingly). Converting the data to Latin-1 is also not an option, nor is changing the collation on the table.
Is there a way to run my query or an equivalent one yielding the result set I’m after, without getting the collation error?
SET NAMES utf8mb4;
CREATE TABLE x(s VARCHAR(11) COLLATE utf8mb4_unicode_520_ci NOT NULL);
INSERT INTO x (s)
VALUES ('aha'), ('áha̱'), ('ā̧́ḫà̀'),
('i'), ('i̯');
SELECT s FROM x GROUP BY s HAVING COUNT(*) > 1;
Comes back with
aha
i
without any complaints about numeric stuff.
I ran it on 5.6.46, 5.7.26, 8.0.16 and several MariaDB versions.
What am I doing differently from your case?
When adding an explicit COLLATE clause, put it on the component of the query that needs it. (COLLATE does not apply to the query as a whole; different parts can be collated differently.)
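A minimal sketch of that advice, reusing the sample table x from above: the COLLATE clause is attached to the string column being grouped rather than to the numeric comparison in HAVING, so the numeric-expression bug should not come into play. The aliases exact_form and n are made up for the example.

```sql
-- Group on the byte-exact value of s instead of its
-- case- and diacritic-insensitive default collation.
SELECT s COLLATE utf8mb4_bin AS exact_form,
       COUNT(*)              AS n
FROM x
GROUP BY exact_form
HAVING n > 1;
```

With the five sample rows, which all differ byte for byte, this should return an empty result; only rows that are exactly identical would be reported as duplicates.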

Migrating from SQL server to MySQL using pentaho unicode issue

I have a problem migrating data from SQL server to MySQL. I have nvarchar columns in SQL server and am exporting them to a Unicode textfile. But when I am importing the column into an utf-8 table of MySQL I get an error for duplicate value: Mysql sees no difference between 'Kaneko, Shûsuke' and 'Kaneko, Shusuke'. I am trying to get these values into a unique column.
What's wrong?
Must I use another charset in MySQL?
I also tried converting the textfile to utf8 before importing into MySQL, but still getting the same error.
It seems the problem is in your MySQL table creation. First run SHOW CREATE TABLE at the mysql prompt and look at the table structure. Have you used the right charset and collation? You can read more in the MySQL docs.
Often a collation is not only case-insensitive but also partly accent-insensitive, so ñ = n (as Joni Salonen points out, this is incorrect!) but á = a.
So we can use a binary collation, but it has its own drawback. A binary collation compares your strings exactly as strcmp() in C would: two strings are unequal if any character differs, be it just a case or diacritic difference. The downside is that the sort order is not natural.
An example of unnatural sort order (as in binary): A, B, a, b. A natural sort order would in this case be e.g. A, a, B, b (lowercase and capital variations of the same letter are sorted next to each other).
The practical advantage of a binary collation is its speed, as string comparison is very simple and fast. In the general case, indexes with a binary collation might not produce the expected sort order; for exact matches, however, they can be useful.
Use a binary collation for the specific column (possibly your best bet).
For example:
DROP TABLE IF EXISTS cc;
-- utf8_bin compares byte by byte, so 'û' and 'u' are different keys
CREATE TABLE cc ( c CHAR(100) PRIMARY KEY ) DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;
INSERT INTO cc VALUES ( 'Kaneko, Shûsuke' );
INSERT INTO cc VALUES ( 'Kaneko, Shusuke' );
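If the unnatural sort order is the main worry, one option is to keep the binary collation on the column and override it only when sorting. This is a sketch reusing the table cc from the example; the choice of utf8_general_ci for the ordering is an assumption, and any other utf8 collation could be substituted.

```sql
-- Uniqueness and exact matches still use utf8_bin,
-- but this result is ordered with a case- and accent-insensitive collation.
SELECT c
FROM cc
ORDER BY c COLLATE utf8_general_ci;
```

Because the ordering collation differs from the column's own, MySQL most likely cannot use the primary key index for this sort, which may matter on large tables.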

Accent-insensitive searches / problems with utf8_general_ci collation

Edit: if you're here because you're confused by the Polish collation in MySQL, read this.
I'm trying to perform a full-text search on a table of Polish cities and many of them contain accented characters. It's meant to be used in an ajax call for auto completion so it would be nice if the search was accent-insensitive. I've set the collation of the rows to utf8_polish_ci. Now, given the city "Zelów", I query the database like this
SELECT * FROM cities WHERE MATCH( city ) AGAINST ("zelow")
but to no avail. Mysql returns an empty result. I've tried different accents, tried adding different collations to the query but nothing helped. I'm not sure how I should approach this because accent-sensitivity seems to be poorly documented. Any ideas?
EDIT
So I found out that the case-insensitive full-text searches are performed only IN BOOLEAN MODE, so the correct query would be
SELECT * FROM cities WHERE MATCH( city ) AGAINST ( "zelow" IN BOOLEAN MODE )
Previously I thought otherwise due to a misleading comment on dev.mysql.com. There might be more to it but I'm just really confused right now.
Anyway, as mentioned in the comments below, I have UNIQUE index on the cities column so changing the collation of the table to accent-insensitive utf8_general_ci is out of the question.
I realized however, that the following query works quite well on a table with utf8_polish_ci collation:
SELECT * FROM cities WHERE city LIKE 'zelow' COLLATE utf8_general_ci
It would seem now that the most reasonable solution would be to do a full-text search in a similar fashion:
SELECT * FROM cities WHERE MATCH( city ) AGAINST ( 'zelow' IN BOOLEAN MODE ) COLLATE utf8_general_ci
This however yields the following error:
#1253 - COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
This is really starting to get on my nerves. Might as well abandon full-text search in favour of a simple where-like approach but it doesn't seem sensible in a table with almost 50k records which will be intensively queried...
LAST EDIT
Ok, the thing with boolean mode was partly bullshit. Only partly because it indeed works as I said, however, on a utf8_general_ci it works the other way around. I'm utterly perplexed and have no will to study this issue further. I decided to drop the UNIQUE index (no further cities will be added anyway so no need to make a big deal out of it) and stick with the utf8_general_ci table collation. I appreciate all the help, it steered me in the right direction.
Change your collation to utf8_general_ci. It ignores accents when searching and ordering but still stores them correctly.
MySQL is very flexible in the encoding/collation area, maybe too flexible. When changing your encoding/collation, make sure you are converting the table, not just changing the encoding/collation types.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
You can also convert individual fields, so your table can have a collation setting of utf8_general_ci, but you can change one or more fields so they use some other collation. Based on the "binary" error you are seeing, it seems your text field might have a collation of utf8_bin (or be a BLOB). Can you post the result of SHOW CREATE TABLE?
Remember, the CHARACTER SET (encoding) is how the data is stored; the collation is how it is compared and sorted, and therefore how it is indexed. Not all combinations work.
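To see which character set and collation each column actually has, something along these lines should work. It assumes the table is called cities, as in the question, and that the current default database is the one containing it.

```sql
-- Full table definition, including per-column overrides
SHOW CREATE TABLE cities;

-- One row per column, with a Collation field
SHOW FULL COLUMNS FROM cities;

-- The same information from the data dictionary
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'cities';
```

A column-level collation takes precedence over the table default, so the table-level setting alone does not tell the whole story.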
My original problem, and question, might help a little:
Converting mysql tables from latin1 to utf8
If you try :
select * from cities where cityname like 'zelow'
Change your collation from binary to utf8_bin. utf8_bin should be compatible with utf8_general_ci, but will still allow you to store city names with differing accents.

Is this a safe way to convert MySQL tables from latin1 to utf-8?

I need to change all the tables in one of my databases from latin1 to utf-8 (with utf8_bin collation).
I have dumped the database, created a test database from it, and run the following without any errors or warnings for each table:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Is it safe for me to repeat this on the real database? The data seems fine by inspection...
There are 3 different cases to consider:
The values are indeed encoded using Latin1
This is the consistent case: declared charset and content encoding match. This was the only case I covered in my initial answer.
Use the command you suggested:
ALTER TABLE tablename CONVERT TO CHARSET utf8 COLLATE utf8_bin
Note that the CONVERT TO CHARACTER SET command only appeared in MySQL 4.1.2, so anyone using a database installed before 2005 had to use an export/import trick. This is why there are so many legacy scripts and documents on the Internet doing it the old way.
The values are already encoded using utf8
In this case, you don't want mysql to convert any data, you only need to change the column's metadata.
For this, you have to change the type to BLOB first, then to TEXT utf8 for each column, so that there are no value conversions:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
This is the recommended way, and it is explicitly documented in the ALTER TABLE syntax documentation.
The values use a different encoding
The default encoding was Latin1 for several years on some Linux distributions. In this case, you have to use a combination of the two techniques:
Fix the table meta-data, using the BLOB type trick
Convert the values using CONVERT TO.
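To get a feel for which of the three cases applies before touching anything, one possible check is to look at the raw bytes of rows that contain non-ASCII data. This is only a sketch; it reuses the placeholder names t1 and c1 from above and assumes c1 is currently declared as latin1.

```sql
-- Rows whose stored value contains at least one byte >= 0x80
-- (the regex looks for a hex pair whose first digit is 8-F)
SELECT c1,
       HEX(c1)                                AS raw_bytes,
       CONVERT(CAST(c1 AS BINARY) USING utf8) AS read_as_utf8
FROM t1
WHERE HEX(c1) REGEXP '^(..)*[89A-F]'
LIMIT 20;
```

If read_as_utf8 shows the expected text, the bytes are probably already UTF-8 and the BLOB route applies. If it looks broken while c1 itself looks right, the data is probably genuine Latin1 and CONVERT TO is the appropriate tool. What c1 displays also depends on the connection character set, so the hex column is the more reliable guide.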
A straightforward conversion will potentially break any strings with non-ASCII characters.
If you don't have any of those (i.e. all of your text is english), you'll usually be fine.
If you've any of those, however, you need to convert all char/varchar/text fields to blob in an initial run, and to convert them to utf8 in a subsequent run.
See this article for detailed procedures:
http://codex.wordpress.org/Converting_Database_Character_Sets
I've done this a few times on production databases in the past (converting from the old standard encoding swedish to latin1), and when MySQL encounters a character that cannot be translated to the target encoding, it aborts the conversion and the table remains unchanged. Therefore, I'd deem the ALTER TABLE statement safe.

Funny characters in MySQL database after import

I exported and then imported a wordpress mysql database from one server to another.
On the new server, a lot of the apostrophes have turned to question mark symbols ?. When I look at the data in the database itself, the apostrophes are normal like this ' so what would be causing those characters to look messed up?
Thanks
Perhaps run SHOW CREATE TABLE tablename on both servers. The difference might be related to the CHARSET.
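A sketch of what could be compared on the two servers. The table name wp_posts and the columns ID and post_title assume a WordPress install with the default wp_ prefix; adjust them if the prefix differs.

```sql
-- Server and connection character sets and collations
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Declared charset of a table that shows the problem
SHOW CREATE TABLE wp_posts;

-- Raw bytes of a few recent titles
SELECT post_title, HEX(post_title)
FROM wp_posts
ORDER BY ID DESC
LIMIT 5;
```

In the hex output, a typographic apostrophe stored as UTF-8 appears as E28099, whereas 3F is a literal question mark. If the bytes match on both servers, the stored data is intact and the difference is more likely in the connection character set used by the application.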
You should use the corresponding collation in the select, for example
SELECT k
FROM t1
ORDER BY k COLLATE latin1_german2_ci;
One common collation name could be SQL_Latin1_General_Cp1254_CS_AS
Here is a list of collation names: http://msdn.microsoft.com/en-us/library/ms180175.aspx