SQL query with Unicode codepoints in SQLite and MySQL

What is the right way to use unicode codepoints in a query? (SQLite and MySQL):
sqlite> select name from city where name like '%rich';
Zürich
Zurich
I tried using the codepoint but nothing worked so far:
sqlite> select * from city where name like 'Z\u00fc%';
(empty)
Anybody? Thanks
EDIT: I created this sql fiddle which also doesn't work http://sqlfiddle.com/#!9/0faee/2

(Referring to MySQL; I do not know about SQLite.)
Do you really have backslash u 0 0 f c? If you are going with Unicode, you need to go all the way, with the unhex of 005A00FC. That is, Z would take two bytes and ü would take two bytes. My point is that MySQL does not see Z\u00fc as anything other than 7 ASCII characters. (Or maybe 6 if the \ is treated as an escape.)
005A00FC is the hex for Zü when using CHARACTER SET ucs2.
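To see where 005A00FC comes from, here is a small sketch in Python (used only as a calculator; it is not part of the original answer). It encodes Zü as big-endian UTF-16, which coincides with UCS-2 for characters in the Basic Multilingual Plane, and prints the UTF-8 bytes for comparison.

```python
# Encode "Zü" the way a 2-bytes-per-character UCS-2 column would store it.
text = "Z\u00fc"  # Python, unlike MySQL, does interpret \u00fc as U+00FC

ucs2_hex = text.encode("utf-16-be").hex().upper()
utf8_hex = text.encode("utf-8").hex().upper()

print(ucs2_hex)  # 005A00FC
print(utf8_hex)  # 5AC3BC
```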
If you are using PHP 5.4+ and JSON, you need an extra parameter:
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
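For the SQLite half of the question, the following is a minimal sketch using Python's built-in sqlite3 module. It assumes a SQLite build that has the char() function (available in reasonably recent versions) and uses an in-memory table that only mimics the city table from the question. It shows that the backslash sequence is passed through as literal text, and two ways to get the real character into the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT)")
conn.executemany("INSERT INTO city VALUES (?)", [("Z\u00fcrich",), ("Zurich",)])

def names(sql, params=()):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql, params)]

# 1. The backslash sequence reaches SQLite as seven literal characters: no match.
literal = names(r"SELECT name FROM city WHERE name LIKE 'Z\u00fc%'")

# 2. Build the character from its code point inside SQL with char().
via_char = names("SELECT name FROM city WHERE name LIKE 'Z' || char(252) || '%'")

# 3. Let the host language produce the character and bind it as a parameter.
via_param = names("SELECT name FROM city WHERE name LIKE ?", ("Z\u00fc%",))

print(literal, via_char, via_param)
```

Binding the pattern as a parameter is usually the simpler route, because the escape is resolved by the host language before the string ever reaches the database.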

Related

Special characters show as 'BLOB' when typing SELECT CHAR(128,129,130,131,132,133,134,135,136,137);

I'm using MySQL 8.0.31 and learning with the Sakila dataset. I tried running
SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
but the result is displayed as 'BLOB'.
I also checked the default character set and it is 'utf8mb4'
I don't see a lot of answers and I'm a beginner. Please help
Edit:
I am expecting the result below, which is taken from the book Learning SQL by Alan B.
From the book:
"The following examples show the location of the accented characters along with other special characters, such as currency symbols:"
mysql> SELECT CHAR(128,129,130,131,132,133,134,135,136,137);
result: Çüéâäàåçêë

"BLOB" is a datatype used in databases to contain binary data (that is, not representable as text).
The byte string you built is not valid in the default character set (utf8mb4), so MySQL cannot print it as text and the client just reports it as binary content.
The example in the book you are reading presumably assumes a single-byte default character set. Bytes 128 to 137 lie outside 7-bit ASCII; the expected output Çüéâäàåçêë matches a DOS code page such as cp850, so you must specify the character set explicitly:
SELECT CHAR(128,129,130,131,132,133,134,135,136,137 USING cp850);
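As a quick check outside MySQL, the sketch below decodes the same ten byte values under a few encodings in Python. The choice of cp850 is an assumption made for this illustration, picked because it reproduces the output quoted from the book; whether your server and client behave the same way should be verified there.

```python
# The ten byte values passed to CHAR() in the book's example.
raw = bytes(range(128, 138))

def try_decode(encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

print(try_decode("utf-8"))  # None: 0x80-0x89 are lone continuation bytes
print(try_decode("ascii"))  # None: ASCII stops at 127
print(try_decode("cp850"))  # Çüéâäàåçêë
```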

MySQL string contains only certain unicode characters

I need to query the database for entries that contain only a certain set of Unicode Japanese characters and nothing else.
I've tried using WHERE word RLIKE '^([あいうえお])+$' but that doesn't work with Japanese because of lack of Unicode support in MySQL's regex.
Is there any other way to accomplish this?

MySQL is looking at each character as a byte sequence, so あ is 0xE3, 0x81, 0x82 and your [あいうえお] is actually looking for any sequence of bytes 0xE3, 0x81, 0x82, 0x84, 0x86, 0x88 and 0x8A. That will match あ fine, but it will also match other sequences that don't correspond to a single character in the list, for example 0xE3, 0x82, 0x81 which is め.
An alternative way of saying [あいうえお] that would still work when each character is considered by the regex engine as being more than one symbol would be (あ|い|う|え|お).
SELECT 'あ' RLIKE '^([あいうえお])+$'; -- 1
SELECT 'め' RLIKE '^([あいうえお])+$'; -- 1
SELECT 'あ' RLIKE '^(あ|い|う|え|お)+$'; -- 1
SELECT 'め' RLIKE '^(あ|い|う|え|お)+$'; -- 0
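The same effect can be reproduced outside MySQL. This Python sketch runs byte-oriented regular expressions over UTF-8 bytes, which is how the answer describes the old MySQL engine working; it illustrates the mechanism and is not a model of MySQL itself.

```python
import re

kana = "あいうえお"

# Byte-wise character class: any of the bytes E3 81 82 84 86 88 8A, in any order.
byte_class = re.compile(b"^([" + kana.encode("utf-8") + b"])+$")

# Byte-wise alternation: each alternative is one complete 3-byte sequence.
byte_alt = re.compile(b"^(" + b"|".join(c.encode("utf-8") for c in kana) + b")+$")

def matches(pattern, text):
    """Match a compiled bytes pattern against the UTF-8 encoding of text."""
    return pattern.match(text.encode("utf-8")) is not None

print(matches(byte_class, "あ"), matches(byte_class, "め"))  # True True (false positive)
print(matches(byte_alt, "あ"), matches(byte_alt, "め"))      # True False
```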

MySQL difficulties - Hiragana and Katakana are treated as the same

I tried to fetch ピース from a MySQL database:
SELECT * FROM edict WHERE japanese = 'ピース'
However I got 3 results which are:
ヒース
ビーズ
ピース
I also tried ぴーす as the query and it returns the same result.
SELECT * FROM edict WHERE japanese = 'ぴーす'
How can I solve this problem?
Thank you

I'm not sure about Japanese scripts, but you could use a BINARY comparison:
WHERE BINARY japanese = 'ピース'
The BINARY keyword casts the string to its binary representation, so you get an exact, byte-by-byte comparison.
Also, if that behaviour should be the default for the japanese column, you could change its collation to a _bin one (that is more efficient than casting in every query).
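To illustrate why a loose comparison can lump these words together while a byte comparison cannot, here is a rough Python sketch. Stripping combining marks after NFD normalization is only an approximation chosen for this example; it is not MySQL's collation algorithm, and it does not model the hiragana/katakana folding seen in the question.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop combining marks (dakuten, handakuten, accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

words = ["ヒース", "ビーズ", "ピース"]

# Loose comparison: all three collapse to the same base string.
loose = {strip_marks(w) for w in words}
print(len(loose))  # 1

# Byte comparison: all three stay distinct, as with BINARY or a _bin collation.
exact = {w.encode("utf-8") for w in words}
print(len(exact))  # 3
```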

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Besides ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIER LETTER VERTICAL LINE) and U+02D0 (MODIFIER LETTER TRIANGULAR COLON).
word | ipa
--------+----------
Hallo | haˈloː
IPA | ˌiːpeːˈʔaː
I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.
SELECT * FROM pronunciation WHERE ipa LIKE '%ha?lo%'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha??lo%'; -- 1 row
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row
I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.
How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.
Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.
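The byte-wise behaviour described above can be reproduced in Python by matching against the UTF-8 bytes instead of the decoded string. This is only a sketch of the mechanism, not of MySQL's regexp engine.

```python
import re

ipa = "ha\u02c8lo\u02d0"   # haˈloː
raw = ipa.encode("utf-8")  # what a byte-oriented engine sees

print(len(ipa), len(raw))  # 6 characters, 8 bytes

# Character-aware matching: one '.' covers the stress mark.
char_one = re.search(r"ha.lo", ipa) is not None

# Byte-wise matching: the stress mark is two bytes, so one '.' is not enough.
byte_one = re.search(rb"ha.lo", raw) is not None
byte_two = re.search(rb"ha..lo", raw) is not None

print(char_one, byte_one, byte_two)  # True False True
```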

EDITED to incorporate a fix in response to valid criticism
Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:
select * from mytable
where hex(ipa) rlike concat('^(..)*', hex('needle')); -- looking for 'needle' in haystack; the anchored '^(..)*' keeps the match aligned to hex pairs
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
P.S. @Kieren's (valid) point is addressed by using rlike to enforce character pairs.
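The pair alignment only holds if the pattern is anchored at the start of the hex string. The Python sketch below compares an unanchored search with one prefixed by '^(..)*'. The two-byte haystack is a made-up value chosen to provoke a misaligned hit.

```python
import re

def hex_upper(data):
    """Hex-encode bytes the way MySQL's HEX() prints them (uppercase pairs)."""
    return data.hex().upper()

needle = hex_upper(b"A")                   # "41"
haystack = hex_upper(bytes([0x14, 0x10]))  # "1410": no 0x41 byte in it

# Unanchored: "41" is found straddling two bytes, a false positive.
unanchored = re.search(needle, haystack) is not None

# Anchored on whole pairs from the start: the straddling match is rejected.
aligned = re.search("^(..)*" + needle, haystack) is not None

# A real needle is still found when it sits on a pair boundary.
ipa_hex = hex_upper("ha\u02c8lo\u02d0".encode("utf-8"))
mark_hex = hex_upper("\u02c8".encode("utf-8"))
found = re.search("^(..)*" + mark_hex, ipa_hex) is not None

print(unanchored, aligned, found)  # True False True
```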

"I'm not dead-set on using MySQL"
Postgres seems to handle it quite fine:
test=# select 'ˌˈʔ' like '___';
?column?
----------
t
(1 row)
test=# select 'ˌˈʔ' ~ '^.{3}$';
?column?
----------
t
(1 row)
If you go down that road, note that Postgres' ilike operator behaves like MySQL's like. (In Postgres, like is case-sensitive.)
For a MySQL-specific solution, you might be able to work around it by binding a user-defined function (maybe one that wraps the ICU library?) into MySQL.

You have problems with UTF8? Eliminate them.
How many special characters do you use? You are using only lowercase letters, am I right? So, my tip is: write a function which converts special chars to regular chars, e.g. "æ" -> "A" and so on, and add a column to the table which stores the converted value (you have to convert all existing values first, and then again upon each insert/update). When searching, you just have to convert the search string with the same function and use it on that column with regexp.
If there are too many kinds of special chars, you should convert each one to a multi-char token. 1. To avoid finding "aa" in the "ba ab" sequence, use some prefix, like "#ba#ab". 2. To avoid finding "#a" in "#ab", use fixed-length tokens, say, of length 2.
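A minimal sketch of the token idea, written in Python for illustration. The token format ('#' followed by six hex digits of the code point) and the function names are hypothetical choices for this example, not something the answer prescribes. The converted string is what would be stored in the extra column.

```python
import re

def to_tokens(text):
    """Map every character to a fixed-length ASCII token such as '#0002c8'."""
    return "".join(f"#{ord(c):06x}" for c in text)

def to_token_regex(pattern, wildcard="?"):
    """Translate a simple pattern in which `wildcard` stands for exactly one character."""
    any_token = "#[0-9a-f]{6}"
    return "".join(any_token if c == wildcard else to_tokens(c) for c in pattern)

stored = to_tokens("ha\u02c8lo\u02d0")  # value for the extra, ASCII-only column

one = re.search(to_token_regex("ha?lo"), stored) is not None
two = re.search(to_token_regex("ha??lo"), stored) is not None

print(one, two)  # True False
```

Because '#' occurs only at the start of a token and every token has the same length, a match can never begin or end in the middle of a character, even in a byte-wise regex engine.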

Unicode (hexadecimal) character literals in MySQL

Is there a way to specify Unicode character literals in MySQL?
I want to replace a Unicode character with an ASCII character, something like the following:
Update MyTbl Set MyFld = Replace(MyFld, "ẏ", "y")
But I'm using even more obscure characters which are not available in most fonts, so I want to be able to use Unicode character literals, something like
Update MyTbl Set MyFld = Replace(MyFld, "\u1e8f", "y")
This SQL statement is being invoked from a PHP script - the first form is not only unreadable, but it doesn't actually work!

You can specify hexadecimal literals (or even binary literals) using 0x, x'', or X'':
select 0xC2A2;
select x'C2A2';
select X'C2A2';
But be aware that the return type is a binary string, so each and every byte is considered a character. You can verify this with char_length:
select char_length(0xC2A2)
2
If you want UTF-8 strings instead, you need to use convert:
select convert(0xC2A2 using utf8mb4)
And we can see that C2 A2 is considered 1 character in UTF-8:
select char_length(convert(0xC2A2 using utf8mb4))
1
Also, you don't have to worry about invalid bytes because convert will remove them automatically:
select char_length(convert(0xC1A2 using utf8mb4))
0
As can be seen, the output is 0 because C1 A2 is an invalid UTF-8 byte sequence.
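The same byte-versus-character counts can be checked in Python. Note that Python raises an error on invalid UTF-8 by default; errors="ignore" is used here only to mimic the dropping behaviour described above, so treat it as an analogy, not as a statement about MySQL internals.

```python
cent = bytes.fromhex("C2A2")
cent_text = cent.decode("utf-8")

print(len(cent))       # 2 bytes
print(len(cent_text))  # 1 character (the cent sign)

# C1 can never start a valid UTF-8 sequence, and A2 cannot stand alone.
bad = bytes.fromhex("C1A2")
bad_text = bad.decode("utf-8", errors="ignore")

print(len(bad_text))   # 0
```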

Thanks for your suggestions, but I think the problem was further back in the system.
There's a lot of levels to unpick, but as far as I can tell, (on this server at least) the command
set names utf8
makes the utf-8 handling work correctly, whereas
set character set utf8
doesn't.
In my environment, these are being called from PHP using PDO, for what difference that may make.
Thanks anyway!

You can use the hex and unhex functions, e.g.:
update mytable set myfield = unhex(replace(hex(myfield),'C383','C3'))

The MySQL string literal syntax is specified in the MySQL reference manual; as you can see there, it has no provision for numeric escape sequences.
However, as you are embedding the SQL in PHP, you can compute the right bytes in PHP. Make sure the bytes you put into the SQL actually match your client character set.
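The idea of computing the bytes in the host language can be sketched as follows, in Python here instead of PHP. The helper name is hypothetical, and the sketch assumes the column and the connection both use utf8mb4; the table and column names are the ones from the question.

```python
def utf8_hex_literal(code_point):
    """Return a MySQL hexadecimal literal holding the UTF-8 bytes of one code point."""
    return "0x" + chr(code_point).encode("utf-8").hex().upper()

literal = utf8_hex_literal(0x1E8F)  # U+1E8F, the character from the question

# Assumption: utf8mb4 column and connection, hence CONVERT(... USING utf8mb4).
sql = (
    "UPDATE MyTbl SET MyFld = "
    f"REPLACE(MyFld, CONVERT({literal} USING utf8mb4), 'y')"
)

print(literal)  # 0xE1BA8F
print(sql)
```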

There is also the char function, which does what you wanted: you give it byte numbers and a charset name, and you get a character back.
