When I add UTF-8 words to a table column and execute an ordered SELECT, the sort order is wrong. With a DESC sort the order is correct, but with an ASC sort it is wrong. How can I fix that? Let me explain with an example. Let's take a MySQL table with a Slovak collation:
CREATE TABLE IF NOT EXISTS test (
aaa varchar(255) CHARACTER SET utf8 COLLATE utf8_slovak_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_slovak_ci;
Now let's insert some values with UTF-8 words:
INSERT INTO test (aaa) VALUES
('Leco'),
('Lečo'),
('Ledo'),
('Chovatelstvo'),
('Chovateľstvo')
Here is the Slovak alphabet explained; you can see which letters come after which: http://en.wikipedia.org/wiki/Slovak_orthography
Now when I select with order, I expect to get the following result:
SELECT aaa FROM test ORDER BY aaa ASC
Chovatelstvo
Chovateľstvo
Leco
Lečo
Ledo
I also expect exactly the opposite order for DESC. But here is what I actually get:
SELECT aaa FROM test ORDER BY aaa ASC
Chovateľstvo
Chovatelstvo
Leco
Lečo
Ledo
and DESC:
SELECT aaa FROM test ORDER BY aaa DESC
Ledo
Lečo
Leco
Chovateľstvo
Chovatelstvo
You can see that the pair
Chovateľstvo
Chovatelstvo
is always in the same order regardless of ASC or DESC. I noticed that if I insert the rows in the opposite order, the result may end up as
Chovatelstvo
Chovateľstvo
meaning that the actual order is reversed, but again it is the same for ASC and DESC. It is as if MySQL considered the two letters 'l' and 'ľ' equal.
I tried this with an older version of MySQL as well as the newest version of MariaDB on another server; the result is the same.
Any idea what causes that and how to fix it?
In both the utf8_slovak_ci and utf8_general_ci collations, the letter ľ and the letter l are considered the same.
You can see this by observing that this query returns true (1)
select _utf8 'Chovateľstvo' collate utf8_slovak_ci = _utf8 'Chovatelstvo'
The designers of that collation obviously believe that ľ and l belong together in the dictionary. The only collations I can find that do not do that are latin2_hungarian_ci and cp1250_czech_cs. But to use either one of those you'll have to change your character set choice.
If you must have them be different, you could try the utf8_bin collation. But that will be entirely case sensitive.
The way ORDER BY works is basically correct for the rules in the collation.
Maybe there's a defect in the collation? You could submit a defect report to the MySQL team at https://bugs.mysql.com/
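To see why two rows can keep the same relative position under both ASC and DESC, here is a minimal Python sketch of the effect, not MySQL's actual implementation. The helper `fold_key` is a hypothetical stand-in for a collation that gives "ľ" and "l" the same weight. Python's sort happens to be stable, so tied rows keep their input order; MySQL makes no such promise, which would explain why the insertion order seemed to matter.

```python
def fold_key(word):
    # Hypothetical stand-in for a collation that weighs "ľ" and "l" equally.
    return word.lower().replace("ľ", "l")

rows = ["Chovateľstvo", "Chovatelstvo", "Ledo"]

# The two "Chovate..." rows produce identical keys, so they are ties.
ascending = sorted(rows, key=fold_key)
descending = sorted(rows, key=fold_key, reverse=True)

print(ascending)
print(descending)
```

In both results the tied pair appears in the same relative order, because the sort has no basis for reversing rows it considers equal.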
I want to perform case-insensitive ORDER BY in MySQL.
I have the data in my database like
A, C, b, e, D etc
I'm getting the result as
A, C, D, b, e
But, I want the result as
A, b, C, D, e
How can I get that?
Choose a case-insensitive collation
select * from your_table
order by your_column COLLATE utf8_general_ci
That way indexes still work and the query is fast.
You can use
Select col
from myTable
order by lower(col)
That way it will compare all by lower values.
As @juergen d commented, this prevents the use of indexes and will therefore perform slowly.
There are (at least) 3 solutions. Two (LOWER() and ORDER BY .. COLLATE ..) have already been given. Here is a third.
If the COLLATION of the column in question is changed to be some ..._ci collation, then the ORDER BY will do what you want without any special syntax in the query, itself.
See the reference manual on "collation".
PS: Changing the column definition to a suitable collation is more efficient than LOWER or the COLLATE clause, especially for large tables.
Use utf8_unicode_ci or, on MySQL 8.0 and later, utf8mb4_0900_ai_ci in your case. I suggest utf8mb4_0900_ai_ci because the utf8mb4 character set covers more characters (including 4-byte ones).
SELECT * FROM <table> ORDER BY <column> COLLATE utf8mb4_0900_ai_ci;
If you have no reason to choose a schema collation, select utf8mb4_0900_ai_ci
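As an illustration outside MySQL, the difference between the two orderings in the question can be sketched in Python: a plain sort compares code points (uppercase before lowercase), while a case-folded key behaves roughly like a `_ci` collation. This is only an analogy for the idea, not how MySQL collations are implemented.

```python
values = ["A", "C", "b", "e", "D"]

# Code-point order: every uppercase letter sorts before any lowercase one.
by_code_point = sorted(values)

# Case-insensitive order: compare on a case-folded key instead.
case_insensitive = sorted(values, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```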
Note that this question is NOT about searching for (non)accented characters.
Suppose I have a table where there is a column name, with collation utf8mb4_unicode_ci.
This collation works perfectly for the purpose of selecting the base selection
in a case-insensitive, accent-insensitive way.
The problem is that I need to order the results in an accent-sensitive and case-insensitive way.
The purpose of this is to select every name starting with some character/string and sort them "alphabetically", first should be not-accented, then accented.
From selection e.g.:
Črpw
Cewo
céag
čefw
The final results should be:
Cewo
céag -- because accented e is more than non-accented
čefw
Črpw -- because r is more than e
Note that c/C < č/Č , but lower/upper cases are handled as equals.
I tried searching for this problem, but I only find similar questions or questions about searching, which is not my case; the searching itself is fine.
Based on the above, I tried this test query:
SELECT * FROM
(SELECT 'Črpw' as t
UNION SELECT 'Cewo'
UNION SELECT 'céag'
UNION SELECT 'čefw')virtual
ORDER BY t COLLATE utf8mb4_czech_ci ASC
Which produces something very similar to what I want
céag
Cewo
čefw
Črpw
But note that é gets ordered before e.
Is there a way to get the result order I want?
Using: MySQL 5.5.54 (Debian)
I have a database table which represent people and the records have people's names in them. Some of the names have accented characters in them. Some do not. Some are non-accented duplicates of the accented version.
I need to generate a report of all of the potential duplicates by finding names that are the same (first, middle, last) except for the accents so that someone else can go through this list and verify which are true duplicates, and which are actually different people (I'm assuming they have some other way of knowing).
For example: Jose DISTINCT-LAST-NAME and José DISTINCT-LAST-NAME should be picked up as potential duplicates because they have the same characters, but one has an accented character.
How can this type of query be written in MySQL?
This question: How to remove accents in MySQL? is not the same. It is asking about de-accenting strings in-place and the poster already has a second column of data that has been de-accented. Also, the accepted answer to that question is to set the character set and collation. I have already set the character set and collation.
I am trying to generate a report that finds strings in different records that are the same except for their accents.
I found your question very interesting.
According to this article Accents in text searches, using "like" condition with some character collation adjustments will solve your problem. I have not tested this solution, so if it helps you, please come back and tell us.
Here is a similar question: Accent insensitive search query in MySQL,
according to that, you can use something like:
where 'José' like 'Jose' collate utf8_general_ci
Well, I found something that seems to work (the real query involves a few more other fields, but the same basic idea):
select distinct p1.person_id, p1.first_name, p1.last_name, p2.last_name
from people as p1, people as p2
where binary p1.last_name <> binary p2.last_name
and p1.last_name = p2.last_name
and p1.first_name = p2.first_name
order by p1.last_name, p1.first_name, p2.last_name, p2.first_name;
The results look like this:
12345 Bob Jose José
56789 Bob José Jose
...
This makes sense as there are 2 records for Bob José and I know that in this case, it is the same person but one record is missing the accent.
The trick is to do both a binary and a non-binary comparison on the "last_name" field, as well as matching on all other fields. This way we can find everything that is "equal" but not binary-equal. This works because with the current character set/collation (utf8/utf8_general_ci), Jose and José are equal but are not binary-equal. You can try it out like this:
select 'Jose' = 'José', 'Jose' like 'José', binary 'Jose' = binary 'José';
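The same "equal under the collation but not byte-for-byte equal" idea can be sketched in Python. The helper names `fold` and `potential_duplicates` and the sample rows are assumptions made for illustration; `fold` only approximates an accent- and case-insensitive collation by dropping combining marks, and it will not match utf8_general_ci in every case.

```python
import unicodedata
from itertools import permutations

def fold(text):
    """Approximate an accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def potential_duplicates(people):
    """Pair up rows whose names fold to the same key but are spelled differently."""
    rows = []
    for (id1, first1, last1), (_, first2, last2) in permutations(people, 2):
        same_person_key = fold(last1) == fold(last2) and fold(first1) == fold(first2)
        if last1 != last2 and same_person_key:  # "equal", yet not exactly equal
            rows.append((id1, first1, last1, last2))
    return rows

# Hypothetical sample data: (person_id, first_name, last_name)
people = [(12345, "Bob", "Jose"), (56789, "Bob", "José"), (11111, "Ann", "Jose")]
print(potential_duplicates(people))
```

Like the self-join, this reports each pair twice, once from each side.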
The Bane of Character Encodings
There are a wide variety of character-sets and encodings that may be used in MySQL, and when dealing with encoding it is important to learn what you can about them. In particular, take a close look at the differences between:
utf8_unicode_ci
utf8_general_ci
utf8_unicode_520_ci
utf8mb4_general_ci
Some character sets are built to include as many printable characters as possible, to support a wider range of uses, while others are built with the intent of portability and compatibility between systems. In particular, utf8_unicode_ci maps most accented characters to non-accented equivalents. Alternatively, you could use uft8_ascii_ci which is even more restrictive.
Take a look at the utf8_unicode_ci collation chart, and What's the difference between utf8_general_ci and utf8_unicode_ci .
The best answer is from a similar question, "How to remove accents in MySQL?"
If you set an appropriate collation for the column then the value
within the field will compare equal to its unaccented equivalent
naturally.
mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
| 1 |
+------------+
1 row in set (0.05 sec)
How to apply this to your situation?
SELECT id, last_name
FROM people
WHERE last_name COLLATE utf8_unicode_ci IN
(
SELECT last_name
FROM people
GROUP BY last_name COLLATE utf8_unicode_ci
HAVING COUNT(last_name)>1
)
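For comparison, the grouping idea can be sketched in Python as follows. The names `fold` and `groups_with_variants` are hypothetical, and `fold` is only a rough approximation of an accent-insensitive collation. One deliberate difference from the query above: this sketch keeps a group only when it contains more than one distinct spelling, so exact duplicates alone are not reported.

```python
import unicodedata
from collections import defaultdict

def fold(text):
    """Approximate an accent- and case-insensitive grouping key."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def groups_with_variants(last_names):
    """Group names by folded key; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for name in last_names:
        groups[fold(name)].add(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

print(groups_with_variants(["Jose", "José", "Smith", "Smith"]))
```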
Duh right off the bat you'd think, "use ORDER BY and then the column" but I have values of:
A
Z
B
a
z
And when I sort them using this query:
SELECT * FROM Diaries ORDER BY title ASC;
I then get this:
A
B
Z
a
z
When I want to get something like this, first issue:
A
a
B
Z
z
I had the same sorting issue elsewhere (the second issue), but I was able to fix it by temporarily putting all characters in lowercase:
for (NSString *key in [dicGroupedStories allKeys]) {
[dicGroupedStories setValue: [[dicGroupedStories objectForKey: key] sortedArrayUsingComparator:^NSComparisonResult(id a, id b) {
NSString *stringA = [[a objectStory_title] lowercaseString];
NSString *stringB = [[b objectStory_title] lowercaseString];
return [stringA compare: stringB];
}] forKey: key];
}
The only reason I don't use this comparator for my first issue is that I don't want to execute my query, then sort the results, then use the array.
Question: Is there a way to sort them the way I want, as I did for my second issue, directly in the SQL query?
objects id a and id b are arrays that contain other objects like title, date created, description, etc. objectDiary_title returns a NSString
In SQL, you can use the lower() or upper() functions in the ORDER BY clause; a second key on the original column breaks ties between values that differ only in case:
ORDER BY lower(title), title
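The two-key ordering (lowercased value first, original value as a tie-breaker) can be sketched in Python with a tuple key. This is just an illustration of the idea using the values from the question, not a statement about how any database evaluates it.

```python
titles = ["A", "Z", "B", "a", "z"]

# Primary key: the lowercased title. Secondary key: the original title,
# which places "A" before "a" when the primary keys tie.
ordered = sorted(titles, key=lambda t: (t.lower(), t))

print(ordered)
```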
You can use COLLATE with xxx_ci where ci means case insensitive. For example:
SELECT * FROM Diaries ORDER BY title COLLATE 'latin1_general_ci' ASC;
There's more information regarding case sensitivity in MySQL here: https://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html. It's useful for doing searches and comparisons as well.
Use a case-insensitive collation, such as:
ORDER BY title COLLATE utf8_unicode_ci;
However, changing the collation on the fly, like any conversion on the fly, makes the query unable to use an index (which is acceptable if the data set to be sorted is small enough).
If performance is an issue, then you had better redefine the column with the target collation, so that its index is built with it (adjust the type and length to your column):
ALTER TABLE Diaries MODIFY COLUMN title VARCHAR(10) COLLATE utf8_unicode_ci;
ORDER BY will then be case-insensitive by default and can use an index on this column.
utf8_unicode_ci is just an example. Just make sure you use a collation *_ci (for Case-Insensitive) which is compatible with the column's encoding
I have a simple table of a single column with rows of char(12) like:
DRF4482
DRF4497
DRF451
DRF4515
EHF452
FJF453
GKF4573
I want to select all of the rows that are between D and F, and have 4 numbers at the end. Like DRF4482, DRF4497, DRF4515, etc. I've tried a number of different wildcard combinations but I get no rows. I'm using:
SELECT * FROM `expired` WHERE id like '%[D-F][A-Z][A-Z]____';
I've even tried to broaden it to:
SELECT * FROM `expired` WHERE id like '%[D-F]%';
and that returns nothing as well.
I've even tried COLLATE latin1_bin based on some other posts but that didn't work either. My table is utf8, but I've created a second table as latin1 and tried a few different collations with the same results - no rows.
Where is my error?
You need to use REGEXP instead of LIKE. Notice that the syntax is a little different; it doesn't do anything with the SQLish % wildcard characters.
So, you want
id REGEXP '[D-F][A-Z][A-Z][0-9]{4}'
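for this app. Add ^ and $ anchors if the pattern must match the whole value rather than any part of it. Hopefully you don't have multibyte characters in these strings, because MySQL's regexp (at least before version 8.0, which switched to a Unicode-aware implementation) works byte by byte and doesn't handle them correctly.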
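To check the pattern against the sample values from the question, here is a small Python sketch. Python's `re` syntax is close to, but not identical with, MySQL's REGEXP; `fullmatch` is used here as an assumption about the intent (the whole id must match), which is stricter than an unanchored REGEXP.

```python
import re

# One letter D-F, two more capitals, then exactly four digits.
PATTERN = re.compile(r"[D-F][A-Z][A-Z][0-9]{4}")

ids = ["DRF4482", "DRF4497", "DRF451", "DRF4515", "EHF452", "FJF453", "GKF4573"]

# fullmatch requires the entire string to fit the pattern.
matches = [value for value in ids if PATTERN.fullmatch(value)]

print(matches)
```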