Select Unicode character U+2028 in MySQL 5.1 - mysql

I am trying to select the Unicode character U+2028 in MySQL 5.1. MySQL 5.1 does support utf8 and ucs2.
In newer versions of MySQL I could select the character just by using the utf16 or utf32 character set:
SELECT char(0x2028 using utf16);
SELECT char(0x00002028 using utf32);
But MySQL 5.1 does not support utf16 and utf32. How can I select the Unicode character then?
Perhaps a few words about my use case: I have a third-party application which stores data in a MySQL database and uses JavaScript for its user interface. The application does not handle the problem that the Unicode characters U+2028 and U+2029 are valid JSON but will break JavaScript code. (For details see http://timelessrepo.com/json-isnt-a-javascript-subset) So I would like to know how much data is affected by that issue and perhaps use REPLACE() in MySQL to fix it.
To demonstrate the problem:
CREATE TABLE IF NOT EXISTS `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`string` varchar(100) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
INSERT INTO `test` (`id`, `string`) VALUES
(1, 'without U+2028'),
(2, 'with U+2028 at this "
 "point');
SELECT * FROM test WHERE string LIKE CONCAT("%", char(0x2028 using utf16), "%");
-- returns row 2 as expected
SELECT * FROM test WHERE string LIKE CONCAT("%", char(??? using utf8), "%");
-- U+2028 in utf8 is 0xE2 0x80 0xA8, isn't it?
-- But how do I pass this to the CHAR() function?

The Unicode character U+2028 is encoded in UTF-8 as the hexadecimal bytes E2 80 A8. So the answer is to use the UNHEX() function in MySQL to find it.
SELECT * FROM test WHERE string LIKE CONCAT("%", UNHEX('e280a8'), "%");
MySQL 5.1 can only handle characters encoded in UTF-8 up to three bytes long. So searching for U+2028 using UNHEX() will work, but searching for U+1F600 won't, as that one takes up four bytes.
Use UNHEX('e280a9') to search for U+2029.
To find other characters, visit https://fileformat.info/info/unicode/char/2028/index.htm, replacing '2028' with the code point you are looking for, and look for the number in brackets in the 'UTF-8 (hex)' row.
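Tying this back to the original use case (measuring and fixing the affected data), a minimal sketch along these lines should work on 5.1; whether to strip the separators or substitute something else depends on the application:
-- Count the rows containing either separator:
SELECT COUNT(*) FROM test
WHERE string LIKE CONCAT('%', UNHEX('e280a8'), '%')
   OR string LIKE CONCAT('%', UNHEX('e280a9'), '%');
-- One possible cleanup: replace U+2028/U+2029 with ordinary newlines.
UPDATE test
SET string = REPLACE(REPLACE(string, UNHEX('e280a8'), '\n'), UNHEX('e280a9'), '\n');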

Related

Convert MySQL LONGTEXT column to JSON

I have an existing table that holds valid JSON data but is stored as LONGTEXT (character set utf8mb4 with collation utf8mb4_bin). Doing JSON queries on this is highly inefficient and I want to change the data type of this column to JSON.
When I do so in HeidiSQL I get an error: COLLATION 'utf8mb4_bin' is not valid for CHARACTER SET 'binary'. I can fix this by resetting the collation to empty.
I mention this because I thought the character set for a JSON column is utf8mb4 and its default collation is utf8mb4_bin. The JSON contains some GUIDs, and when I try to query them it appears that I have to use LIKE. If I use a 'normal' = I get no results.
Works: SELECT * FROM MyTable WHERE Data->>"$.Shop[*].ContactPerson.UserId" LIKE "%b1b9ad95-1098-4e6c-a697-50c2a47dc301%";
Doesn't work: SELECT * FROM MyTable WHERE Data->>"$.Shop[*].ContactPerson.UserId" = "b1b9ad95-1098-4e6c-a697-50c2a47dc301";
Is that a problem with my syntax, or is it related to collation? (I'd say no, but it's the only discrepancy I can find.) BTW: leaving out the % in the LIKE also produces no results.
Create/insert:
CREATE TABLE Temp ( `Data` LONGTEXT COLLATE UTF8MB4_BIN );
INSERT INTO Temp( `Data` ) VALUES ("{\"Shop\":[{\"ContactPerson\": {\"UserId\":\"b1b9ad95-1098-4e6c-a697-50c2a47dc301\"}},{\"ContactPerson\": {\"UserId\":\"B27DA5A7-D678-4513-8A44-BD76CC4651CC\"}}]}");
INSERT INTO Temp( `Data` ) VALUES ("{\"Shop\":[{\"ContactPerson\": {\"UserId\":\"82899A81-2024-4F68-917A-710764296A21\"}},{\"ContactPerson\": {\"UserId\":\"AE59DCA7-32AF-4131-93C7-A1BB698DF8E0\"}}]}");
INSERT INTO Temp( `Data` ) VALUES ("{\"Shop\":[{\"ContactPerson\": {\"UserId\":\"4154477B-1B70-4F25-9E4B-2CFBBF4F678F\"}},{\"ContactPerson\": {\"UserId\":\"B27DA5A7-D678-4513-8A44-BD76CC4651CC\"}}]}");
Trying the update: ALTER TABLE `Temp` MODIFY `Data` JSON;
And now it works... (of course) :(
So apparently when using the table-editor in HeidiSQL it fails (I've retried this). Using a SQL statement actually does what I want.
That still leaves me with having to use LIKE as opposed to = in my query.
Thank you to @Rick James for the "aha moment": Where clause for JSON field on array of objects
So in my case:
SELECT * FROM `Temp` WHERE JSON_SEARCH(`Data`, 'one', 'b1b9ad95-1098-4e6c-a697-50c2a47dc301', NULL, '$.Shop[*].ContactPerson.UserId') IS NOT NULL;
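As far as I can tell, this also explains why = failed in the first place: a path containing [*] can match several values, so MySQL wraps the matches in a JSON array, and ->> then yields the array text rather than a bare scalar. With the rows above:
SELECT Data->>"$.Shop[*].ContactPerson.UserId" FROM Temp LIMIT 1;
-- ["b1b9ad95-1098-4e6c-a697-50c2a47dc301", "B27DA5A7-D678-4513-8A44-BD76CC4651CC"]
= compares the GUID against that whole array text, which is why only LIKE '%...%' or JSON_SEARCH() finds it.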
Update:
The issue was reported and will be fixed in the next build (https://github.com/HeidiSQL/HeidiSQL/issues/1652).

MySQL trying to apply UTF-8 encoding when inserting bytes into a BLOB field

I'm using MySQL 8.0.4 (rc4). I need MySQL 8 because it's the only version of MySQL that supports CTEs.
My database is created thus:
CREATE DATABASE IF NOT EXISTS TestDB
DEFAULT CHARACTER SET utf8mb4
DEFAULT COLLATE utf8mb4_general_ci;
USE TestDB;
SET sql_mode = 'STRICT_TRANS_TABLES';
CREATE TABLE IF NOT EXISTS MyTable (
(...)
Body LONGBLOB NOT NULL,
(...)
);
When I try to insert raw byte data into this BLOB field, I receive this error:
Error 1366: Incorrect string value: '\x8B\x08\x00\x00\x00\x00...' for column 'Body' at row 1.
This is the insert statement I'm using.
REPLACE INTO MyTable
SELECT Candidate.* FROM
(SELECT :Id AS Id,
(...)
:Body AS Body,
(...)
) AS Candidate
LEFT JOIN MyTable ON Candidate.Id = MyTable.Id
WHERE (
(...)
);
How could there be an incorrect string value for BLOB? Doesn't BLOB mean I can insert quite literally anything?
What's the : stuff? Why have the nested query? May we see actual SQL? What language are you using? It sounds like the "binding" tried to apply character set rules, when it should not. May we see the code that did the substitution of the : stuff?
BLOBs have no character set. As long as you can get the bytes past the parser, there should be no problem.
However, I find this to be a better way to do it...
In the app language, generate a hex string, then use that in
INSERT INTO ... VALUES (..., UNHEX(the-hex-string), ...)
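For illustration, a minimal sketch of that approach against the table above (assuming the elided columns are omitted or nullable; the hex literal is a made-up placeholder for the string the application would generate):
-- The application hex-encodes the raw bytes and sends plain ASCII;
-- UNHEX() turns that string back into bytes, so no character set rules apply.
INSERT INTO MyTable (Id, Body)
VALUES (1, UNHEX('1F8B08000000000000'));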

MySQL string comparison with special characters

I have created an autocomplete that matches against a list of names in a database.
The database that I'm working contains a ton of names with special characters, but the end users are most likely going to search with the English equivalent of those names, e.g. Bela Bartok for Béla Bartók and Dvorak for Dvořák, etc. Currently, doing the English searches returns no results.
I have come across threads saying that the way to solve this is to change your MySQL collation to utf8 (which I have done, to no avail).
I think that this may be because I used utf8_unicode_ci, whereas the one that would get the results I want is utf8_general_ci. The problem with the latter, though, is that all the comments say not to use it any more.
Does anyone know how I can solve this problem?
If you know the list of special characters and what their plain-English equivalents are, then you can do the following (see the sketch below):
lower case the string
replace the characters with the lower case equivalents
search against that "plain English" column
You will need to use MySQL's full-text search to search against the text, or come up with a home-grown solution for how you're going to handle that.
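A minimal sketch of that idea, assuming a hypothetical names table and a small, known set of substitutions:
-- Hypothetical: keep a lower-cased, ASCII-folded shadow column and search it.
ALTER TABLE names ADD COLUMN name_ascii VARCHAR(255);
-- Extend the REPLACE chain with one entry per special character you care about.
UPDATE names SET name_ascii =
    REPLACE(REPLACE(REPLACE(LOWER(name), 'é', 'e'), 'ó', 'o'), 'ř', 'r');
SELECT * FROM names WHERE name_ascii LIKE '%bela bartok%';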
Just tested with both utf8_general_ci and utf8_unicode_ci collations and it worked like a charm in both cases.
Here is the MySQL code I used to run my test:
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`text` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
INSERT INTO `test` (`id`, `text`) VALUES (NULL, 'Dvořák'), (NULL, 'Béla Bartók');
SELECT * FROM `test` WHERE `text` LIKE '%dvorak%';
The above SELECT statement returns:
id text
--------------
1 Dvořák
Note: during my test I set all the collations to the desired one: the database collation, the table collation, and the column collation as well.
Could it be that there's a bug in your PHP application?
I found the solution to my problem. Changing the collation to utf8_unicode_ci works perfectly fine. My problem was that I was using REGEXP in my query instead of LIKE, and REGEXP obviously doesn't work in this situation!
So in short, changing your collation to utf8_unicode_ci will allow you to compare Dvorak and Dvořák using = or LIKE, but not one of the REGEXP equivalents.
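To illustrate with the test table above (this reflects MySQL 5.x, where, as far as I know, REGEXP works on bytes and ignores the collation):
SELECT * FROM `test` WHERE `text` = 'Dvorak';       -- matches: the collation folds ř to r
SELECT * FROM `test` WHERE `text` LIKE '%dvorak%';  -- matches
SELECT * FROM `test` WHERE `text` REGEXP 'dvorak';  -- no match: REGEXP ignores the collation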
First, let's see if the data is stored correctly. Do
SELECT name, HEX(name) FROM ... WHERE ...;
Béla may come out as one of these (spaces added for readability):
42 C3A9 6C 61 -- if correctly encoded with utf8 (é = C3A9)
42 E9 6C 61 -- if encoded with latin1 (é = E9)
The "Collation" (utf8_general_ci or utf8_unicode_ci) makes no difference for the examples you gave. Both tread é = e. See extensive list of equivalences for utf8 collations.
After you determine the encoding, we can proceed to prescribe a cure.
Taking a hint from Rick James, the following should work:
SELECT * FROM `test` WHERE HEX(`column`) = HEX('Dvořák');
If you need a case-insensitive query, then you'll need to lower/upper both sides in addition to the HEX check.
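A sketch of that case-folded variant; note that the HEX comparison is byte-exact, so it stays accent-sensitive:
SELECT * FROM `test` WHERE HEX(LOWER(`text`)) = HEX(LOWER('Dvořák'));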
A more up-to-date collation is utf8mb4_unicode_520_ci.
Note, it does NOT work for utf8mb4_unicode_ci. See the comparison here: https://stackoverflow.com/a/59805600/857113
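One quick way to check how a given collation folds a character is to compare literals directly (assuming the connection character set is utf8mb4):
SELECT 'Dvořák' = 'Dvorak' COLLATE utf8mb4_unicode_520_ci; -- 1 if ř folds to r, 0 otherwise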

MySQL LOWER() function not multi-byte safe for the º character?

When I encode the following character to UTF-8:
º
I get:
º
Then, with º stored as a field value, I select the field with the LOWER() function and get:
âº
I was expecting it to respect that the value is a multi-byte character and thus not perform the LOWER() on it.
Expected:
º
Am I not understanding correctly that the LOWER() function is supposed to be multi-byte safe, as stated in the manual? (http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_lower)
Or am I doing something wrong here?
I am running MySQL 5.1.
EDIT
The encoding on the table is set to UTF-8. The session encoding is the default, latin1.
Here are my repro steps.
CREATE TABLE test_table (
test_field VARCHAR(1000) DEFAULT NULL
) ENGINE=INNODB DEFAULT CHARSET=utf8;
INSERT INTO test_table(test_field) VALUES('º');
SELECT LOWER(test_field) FROM test_table;
INSERT INTO test_table(test_field) VALUES('º');
will insert a two-character string, which has the correct LOWER() of "âº":
LOWER("Â") is "â"
LOWER("º") is "º"
If you want to insert "º" then make sure you have
SET NAMES 'utf8'; -- note: MySQL's name for the character set has no hyphen
and
INSERT INTO test_table(test_field) VALUES('º');
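As a sanity check after re-inserting under the right connection character set, a correctly stored 'º' occupies the two bytes C2 BA and, having no case mapping, is unchanged by LOWER():
SELECT test_field, LOWER(test_field), HEX(test_field) FROM test_table;
-- expect: º, º, C2BA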

MySQL: char_length(), wrong value for Russian

I am using CHAR_LENGTH() to measure the size of "Русский": strangely, instead of telling me that it's 7 characters, it tells me there are 14. Interestingly, if the query is simply...
SELECT CHAR_LENGTH('Русский')
...the answer is correct. However, if I query the DB instead, the answer is 14:
SELECT CHAR_LENGTH(text) FROM locales WHERE lang = 'ru-RU' AND name = 'lang_name'
Anybody got any ideas what I might be doing wrong? I can confirm that the collation is utf8_general_ci and the table is MyISAM.
Thanks,
Adrien
EDIT: My end objective is to be able to measure the lengths of records in a table containing single- and double-byte characters (e.g. English and Russian, but not limited to these two languages).
Because two bytes are used for each UTF-8 Cyrillic character, and if the data was inserted under the wrong connection character set, each byte ends up stored as its own character, so CHAR_LENGTH() reports 14 instead of 7.
See http://dev.mysql.com/doc/refman/5.5/en/string-functions.html#function_char-length
mysql> set names utf8;
mysql> SELECT CHAR_LENGTH('Русский'); result - 7
mysql> SELECT CHAR_LENGTH('test'); result - 4
create table test123 (
text VARCHAR(255) NOT NULL DEFAULT '',
text_text TEXT) Engine=Innodb default charset=UTF8;
insert into test123 VALUES('русский','test русский');
SELECT CHAR_LENGTH(text),CHAR_LENGTH(text_text) from test123; result - 7 and 12
I have tested with set names koi8r; (creating the table and so on) and got an invalid result.
So the solution is to recreate the table and re-insert all the data after running set names utf8.
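A diagnostic sketch for telling the two cases apart: compare the character count with the byte count of the stored value.
SELECT CHAR_LENGTH(text), LENGTH(text), HEX(text)
FROM locales WHERE lang = 'ru-RU' AND name = 'lang_name';
-- correctly stored UTF-8 Cyrillic: CHAR_LENGTH = 7, LENGTH = 14
-- double-encoded data: CHAR_LENGTH = 14, LENGTH = 28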
The function returns its answer guided by the most immediate character set available:
in the case of a column, the column definition;
in the case of a literal, the connection default.
Review the column's character set with:
SELECT CHARACTER_SET_NAME FROM information_schema.`COLUMNS`
WHERE table_name = 'locales'
AND column_name = 'text';
Be careful: as written, this is not filtered by table_schema.
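A version filtered to the current schema avoids picking up a same-named column in another database:
SELECT CHARACTER_SET_NAME FROM information_schema.`COLUMNS`
WHERE table_schema = DATABASE()
AND table_name = 'locales'
AND column_name = 'text';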