MySQL char length for utf8_swedish_ci

I created a table with this field:
chr CHAR(1)
The charset and collate are:
DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci
Well, even though the field can contain only a single character, if I insert a value like:
insert into tbl values ('ö');
then the length of the field is reported as more than 1. Thus, the SQL:
select length(chr) from tbl where id = 1
returns 2. Why? I'm aware that the charset/collation thing can be a real pain. I didn't have utf8_swedish_ci from the beginning, which wasn't good because I could not sort alphabetically (ö was sorted right after o, which is wrong because ö is the last character in the alphabet).
So I guess it would be best for me to continue using utf8_swedish_ci. But then this bad thing happens. Does anyone know how to solve this issue? Thanks in advance.

The MySQL LENGTH() function returns the length in bytes, and ö occupies two bytes in utf8. Use CHAR_LENGTH() to get the number of characters.
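The byte/character distinction is not specific to MySQL. As a rough analogy only (this is Python, not MySQL, and the variable names are made up for illustration), the same two numbers can be reproduced like this:

```python
# Python analogy for MySQL's LENGTH() (bytes) vs CHAR_LENGTH() (characters).
# "\u00f6" is the single precomposed character "ö".
value = "\u00f6"

char_count = len(value)                  # characters, comparable to CHAR_LENGTH(chr)
byte_count = len(value.encode("utf-8"))  # UTF-8 bytes, comparable to LENGTH(chr)

print(char_count, byte_count)
```

The CHAR(1) limit in the table definition counts characters, which is why the two-byte value still fits in the column.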

Related

MySQL string comparison with special characters

I have created an autocomplete that matches against a list of names in a database.
The database that I'm working with contains a ton of names with special characters, but the end users are most likely going to search with the English equivalent of those names, e.g. Bela Bartok for Béla Bartók and Dvorak for Dvořák, etc. Currently, doing the English searches returns no results.
I have come across threads saying that the way to solve this is to change your MySQL collation to utf8 (which I have done to no avail).
I think that this may be because I used utf8_unicode_ci, but the one that would get the results that I want is utf8_general_ci. The problem with the latter though is that all the comments say to no longer use it.
Does anyone know how I can solve this problem?
If you know the list of special characters and what the equivalents in plain English are, then you can do the following:
lower case the string
replace the characters with the lower case equivalents
search against that "plain English" column
You will need to use the full text searching of MySQL in order to search against the text or come up with a home grown solution for how you're going to handle that.
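The lower-case-and-replace steps above could be done in application code before filling the "plain English" column. The following is only a sketch in Python, assuming Unicode decomposition is an acceptable stand-in for a hand-written replacement list; `to_plain` is a hypothetical helper name, not something from the question.

```python
import unicodedata


def to_plain(text: str) -> str:
    """Lower-case the text, then drop combining marks (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["B\u00e9la Bart\u00f3k", "Dvo\u0159\u00e1k"]  # Béla Bartók, Dvořák
plain_names = [to_plain(name) for name in names]
print(plain_names)
```

Letters that have no decomposition (for example ø or ł) pass through unchanged, so an explicit replacement table would still be needed for those.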
Just tested with both utf8_general_ci and utf8_unicode_ci collations and it worked like a charm in both cases.
Here is the MySQL code I used to run my test:
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`text` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO `test` (`id`, `text`) VALUES (NULL, 'Dvořák'), (NULL, 'Béla Bartók');
SELECT * FROM `test` WHERE `text` LIKE '%dvorak%';
The above SELECT statement returns:
id text
--------------
1 Dvořák
Note: During my test I set all the collations to the desired one. The database collation, the table collation and the column collation as well.
Could it be that there's a bug in your PHP application?
I found the solution to my problem. Changing the collation to utf8_unicode_ci works perfectly fine. My problem was that I needed to use REGEXP in my query instead of LIKE, but REGEXP obviously doesn't work in this situation!
So in short, changing your collation to utf8_unicode_ci will allow you to compare Dvorak and Dvořák using = or LIKE, but not one of the REGEXP equivalents.
First, let's see if the data is stored correctly. Do
SELECT name, HEX(name) FROM ... WHERE ...;
Béla may come out (ignoring the spaces)
42 C3A9 6C 61 -- if correctly encoded with utf8 (é = C3A9)
42 E9 6C 61 -- if encoded with latin1 (é = E9)
The "Collation" (utf8_general_ci or utf8_unicode_ci) makes no difference for the examples you gave. Both treat é = e. See extensive list of equivalences for utf8 collations.
After you determine the encoding, we can proceed to prescribe a cure.
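To see where the two hex patterns come from, here is a small Python sketch (an illustration outside MySQL; the variable names are assumptions) that encodes the same four-letter name both ways:

```python
# Encode "Béla" as UTF-8 and as latin1 and compare the hex dumps,
# mirroring what HEX(name) would show for each stored encoding.
name = "B\u00e9la"  # "Béla" with a precomposed é

utf8_hex = name.encode("utf-8").hex().upper()
latin1_hex = name.encode("latin-1").hex().upper()

print(utf8_hex)    # é becomes the two bytes C3 A9
print(latin1_hex)  # é becomes the single byte E9
```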
Taking a hint from Rick James, using:
SELECT * FROM `test` WHERE HEX(`column`) = HEX('Dvořák');
Should work. If you need a case insensitive query, then you'll need to lower/upper both sides in addition to the HEX check.
A more up to date collation is utf8mb4_unicode_520_ci.
Note, it does NOT work for utf8mb4_unicode_ci. See the comparison here: https://stackoverflow.com/a/59805600/857113

How to setup MySQL to handle unicode diacriticals properly?

This is an odd puzzle. AFAIK utf8_bin should guarantee that every accent is stored in the database properly, i.e. without some strange conversion to ASCII. So I have a table with:
DEFAULT CHARSET=utf8 COLLATE=utf8_bin
and yet when I try to compare or query entries such as "Krąków" and "Kraków", MySQL treats them as the same string.
Out of curiosity I also tried utf8_polish, and MySQL claims that for Polish guys "a" and "ą" do not make any difference.
So how do I set up a MySQL table so that I can store Unicode strings safely, without losing accents and the like?
Server: MySQL 5.5 + openSUSE 11.4, client: Windows 7 + MySQL Workbench 5.2.
Update -- CREATE TABLE
CREATE TABLE `Cities` (
`city_Name` VARCHAR(145) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`city_Name`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Please note that I cannot set utf8_bin explicitly for the column, because the entire table is utf8_bin, so in effect the column's collation is reset to the default.
All credits of the solution go to bobince, so please upvote his comment to my question.
The solution to the problem is somewhat strange, and I would risk saying MySQL is broken in this regard.
So, let's say I created a table with utf8 and did nothing for the column. Later I realize I need strict comparison of characters, so I change the collation of the table AND the columns to utf8_bin. Solved?
No. MySQL now sees that the table is utf8_bin and the column is also utf8_bin, which means the column uses the DEFAULT collation of the table. However, MySQL does not realize that the previous default is not the same as the current default, and so the comparison still does not work.
So you have to shake off that default for the column by setting it to some alien value outside the collation "family" (for "utf8xxx", that means anything that is not another "utf8xxx"). Once it is shaken off, and the column collation entry no longer says "default", you can set utf8_bin. It again evaluates to the default, but since we are coming from a non-default collation, everything kicks in as expected.
Do not forget to apply the changes at each step.
The MySQL default charset and collation (which are server-wide but can be changed per connection) apply at the time a table is created. Changing the defaults after the table is created doesn't affect existing tables.
Character sets and collations are attributes of individual columns. They can be set from a table-wide default but they do belong to columns.
A charset of utf8 should be sufficient to allow all European languages to be represented correctly. You should definitely be able to store "a" and "ą" as two different characters.
A collation of utf8_bin yields a case-sensitive and accent-sensitive collation.
Here are some examples of the difference between text value and collation behavior. I'm using three sample strings: 'abcd', 'ĄBCD' , and 'ąbcd'. The last two have the A-ogonek letter.
This first example says that with utf8 character representation and utf8_general_ci collation, that the three strings each display as specified by the user, but that they compare equal. That's to be expected in a collation that doesn't distinguish between a and ą. That's a typical case insensitive collation, where all the variant characters are sorted equal to the character without any diacritical marks.
SET NAMES 'utf8' COLLATE 'utf8_general_ci';
SELECT 'abcd', 'ąbcd' , 'abcd' < 'ąbcd', 'abcd' = 'ąbcd';
false true
This next example shows that in the case-insensitive Polish-language collation, a comes before ą. I don't know Polish, but I suspect Polish telephone books have the As and the Ą's separated.
SET NAMES 'utf8' COLLATE 'utf8_polish_ci';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD';
true true true
This next example shows what happens with the utf8_bin collation.
SET NAMES 'utf8' COLLATE 'utf8_bin';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD';
true true false
There's one non-intuitive thing to notice in this case. 'abcd' < 'ĄBCD' is true (whereas 'abcd' < 'ABCD' with pure ASCII is false). That's a strange result if you're thinking linguistically. It happens because both A-ogonek characters have binary values in utf8 that are higher than all the abc and ABC characters. So if you use the utf8_bin collation for ORDER BY operations, you'll get linguistically strange results.
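The binary ordering can be reproduced outside MySQL by comparing the raw UTF-8 bytes. This Python sketch is only an approximation of what utf8_bin does (it ignores details such as trailing-space handling), and `bin_less` is a hypothetical helper name:

```python
def bin_less(left: str, right: str) -> bool:
    """Compare two strings by their UTF-8 bytes (rough stand-in for utf8_bin)."""
    return left.encode("utf-8") < right.encode("utf-8")


upper_ogonek = "\u0104BCD"  # 'ĄBCD'
lower_ogonek = "\u0105bcd"  # 'ąbcd'

results = {
    "abcd < ĄBCD": bin_less("abcd", upper_ogonek),  # 0x61 sorts before 0xC4
    "abcd < ąbcd": bin_less("abcd", lower_ogonek),  # 0x61 sorts before 0xC4
    "abcd < ABCD": bin_less("abcd", "ABCD"),        # 0x61 sorts after 0x41
    "ąbcd = ĄBCD": lower_ogonek == upper_ogonek,    # different bytes, not equal
}
print(results)
```

Both A-ogonek letters start with the byte C4 in UTF-8, which is higher than any ASCII letter, and that is what produces the ordering described above.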
You're saying that 'Krąków' and 'Kraków' compare equal, and that you're puzzled by that. They do compare equal when the collation in use is utf8_general_ci. But they don't with either utf8_bin or utf8_polish_ci. According to the Polish-language support in MySQL, these two spellings of the city's name are different.
As you design your application, you need to sort out how you want all this to work linguistically. Are 'Krąków' and 'Kraków' the same place? Are 'Ąaron' and 'Aaron' the same person? If so, you want utf8_general_ci.
You could consider altering the table you've shown like this:
ALTER TABLE Cities
MODIFY COLUMN city_Name
VARCHAR(145)
CHARACTER SET utf8
COLLATE utf8_general_ci;
This will set the column in your table the way you want it.

mysql querying a utf charset table for c returns ç

I managed to insert special characters into a table by setting the charset with
CHARSET=utf8;
Thing is, when I run the following query on the table
SELECT * FROM `table` WHERE word = 'francais';
it returns both "francais" and "français"!
This is not desirable in my situation. I have no idea why it does this, because the two words are simply different.
Can anyone tell me how to avoid this? Would be much appreciated.
Try using collation, e.g.,
select *
from `table`
where word = 'francais' collate utf8_bin;

MySQL LOWER() function not multi-byte safe for the º character?

When I encode the following character to UTF-8:
º
I get:
Âº
Then with Âº stored as a field value, I select the field with the LOWER() function and get
âº
I was expecting it to respect that the value is a multi-byte character and thus will not perform the LOWER on it.
Expected:
º
Am I misunderstanding the manual when it states that the LOWER() function is supposed to be multi-byte safe? (http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_lower)
Or am I doing something wrong here?
I am running MySQL 5.1.
EDIT
The encoding on the table is set to UTF-8. The session encoding is default latin1.
Here are my repro steps.
CREATE TABLE test_table (
test_field VARCHAR(1000) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
INSERT INTO test_table(test_field) VALUES('º');
SELECT LOWER(test_field) FROM test_table;
INSERT INTO test_table(test_field) VALUES('º');
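With the session character set left at latin1, the two UTF-8 bytes of º are read as the two characters "Âº", so this inserts a 2-character string, for which the correct LOWER() is "âº":
LOWER("Â") is "â"
LOWER("º") is "º"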
If you want to insert "º" then make sure you have
SET NAMES 'utf8';
and
INSERT INTO test_table(test_field) VALUES('º');
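The effect of a latin1 session receiving UTF-8 bytes can be imitated in Python. This is only an illustration of the encoding mix-up, not of MySQL internals, and the variable names are assumptions:

```python
# "º" (U+00BA) is two bytes in UTF-8: C2 BA.
original = "\u00ba"
utf8_bytes = original.encode("utf-8")

# A latin1 reader interprets each byte as its own character: "Â" then "º".
misread = utf8_bytes.decode("latin-1")

# Lower-casing the misread value changes "Â" to "â" and leaves "º" alone.
lowered_misread = misread.lower()

# Lower-casing the real single character leaves it unchanged.
lowered_original = original.lower()

print(misread, lowered_misread, lowered_original)
```

So LOWER() is behaving correctly on the data it was given. The fix is to make the connection character set match the bytes the client actually sends.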

MySQL case insensitive string matching using =

I'm trying to search records using an alphanumeric "short_code" column. Something like:
SELECT * FROM items WHERE short_code = "1AV9"
With no collation and with column type set to varchar(), this query is case-insensitive, so it returns records with short_codes 1av9, 1Av9, etc. I don't want this.
So I tried changing the collation of the short_code column to utf8_bin, but now the query isn't returning anything at all. However, if I change the query to:
SELECT * FROM items WHERE short_code LIKE "1AV9%"
Then I get the exact row I want. Is it possible that by converting my column's collation, it somehow appended invisible chars at the end of all my shortcodes? How can I verify/fix this?
EDIT: It looks that by changing my column type to binary and trying a bunch of other stuff, it somehow padded all my short_codes with null bytes, which explains why the query wouldn't return any result. After starting over and setting the utf8_bin collation, everything's working as expected.
Here's a wild guess. I think the table originally had no collation set. Then you set the collation to utf8_bin, and that caused confusion in the stored length of the field.
First back up your table. Then try:
ALTER TABLE items
CHANGE COLUMN short_code short_code VARCHAR(48)
CHARACTER SET 'utf8'
COLLATE 'utf8_unicode_ci' ;
Adding some characters (that are not in your data):
UPDATE items
SET short_code = CONCAT('++F++F', short_code, '++F++F') ;
Removing them:
UPDATE items
SET short_code = REPLACE(short_code, '++F++F', '') ;
Back to length 8:
ALTER TABLE items
CHANGE COLUMN short_code short_code VARCHAR(8) ;
And back again to binary collation:
ALTER TABLE items
CHANGE COLUMN short_code short_code VARCHAR(8)
CHARACTER SET 'utf8'
COLLATE 'utf8_bin' ;
Perhaps this will fix the incorrect length. (perhaps a shorter change - from varchar to char and back to varchar - will fix it).
Try
SELECT LENGTH(short_code) FROM items WHERE short_code LIKE "1AV9%"
and see if you get something other than 4 as the result.
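Why an exact match fails while a prefix match succeeds can be shown with a small Python sketch. It assumes, as the question's edit suggests, that the stored values were padded with null bytes. The count of four is made up for illustration.

```python
# Hypothetical stored value: the code followed by four invisible null bytes.
stored = "1AV9" + "\x00" * 4
wanted = "1AV9"

exact_match = stored == wanted            # like: short_code = "1AV9"
prefix_match = stored.startswith(wanted)  # like: short_code LIKE "1AV9%"
stored_length = len(stored.encode("utf-8"))  # like: LENGTH(short_code)

print(exact_match, prefix_match, stored_length)
```

A length larger than 4 is the sign that something invisible follows the visible characters.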
Edit: Hmm, your values might have trailing spaces. Try
SELECT * FROM items WHERE short_code = "1AV9    "
(that's 1AV9 plus four spaces) and see if you get any results.
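If you can change the collation then try "utf8_general_cs". Check SHOW COLLATION first: many MySQL builds do not include it, in which case utf8_bin is the case-sensitive choice.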
or maybe
WHERE '1AV9' COLLATE utf8_general_cs = short_code