MySQL regex with UTF-8 characters

I am trying to fetch data from a MySQL database via REGEXP, matching words with or without special UTF-8 characters.
Let me explain with an example:
If the user enters a word like sirena, the query should return rows that include words like sirena, siréna, šíreňá, and so on.
It should also work the other way around: when the user enters siréná, it should return the same results.
I am trying to search via REGEXP; my query looks like this:
SELECT * FROM `content` WHERE `text` REGEXP '[sšŠ][iíÍ][rŕŔřŘ][eéÉěĚ][nňŇ][AaáÁäÄ0]'
It works only when the database contains the word sirena, but not when it contains the word siréňa.
Is this something to do with UTF-8 and MySQL? (The collation of the MySQL column is utf8_general_ci.)
Thank you!

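MySQL's regular expression library does not support UTF-8; it compares byte by byte.
See Bug #30241 Regular expression problems, which has been open since 2007. They will have to change the regular expression library they use before that can be fixed, and I haven't found any announcement of when or if they will do this. (MySQL 8.0.4 and later use the ICU library for regular expressions, which is multibyte-safe, so the limitation applies to earlier versions.)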
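To see why the bracket pattern from the question fails, here is a minimal sketch that uses Python's `re` module on raw bytes as a stand-in for a byte-wise regex engine. That substitution is an assumption made for illustration; it is not MySQL's actual library. Each `[...]` class consumes exactly one byte, while é and ň occupy two bytes each in UTF-8.

```python
import re

# The character-class pattern from the question, encoded to UTF-8 bytes.
# A byte-wise engine treats every byte inside [...] as a separate candidate.
CLASS_PATTERN = "[sšŠ][iíÍ][rŕŔřŘ][eéÉěĚ][nňŇ][AaáÁäÄ0]".encode("utf-8")


def bytewise_match(pattern: bytes, text: str) -> bool:
    """Search the UTF-8 bytes of `text`, one byte per regex 'character'."""
    return re.search(pattern, text.encode("utf-8")) is not None


# Six ASCII bytes line up with the six one-byte classes.
print(bytewise_match(CLASS_PATTERN, "sirena"))
# Here é and ň take two bytes each, so the six classes cannot line up.
print(bytewise_match(CLASS_PATTERN, "siréňa"))
```

The first call finds a match and the second does not, which mirrors the behaviour described in the question.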
The only workaround I've seen is to search for specific HEX strings:
mysql> SELECT * FROM `content` WHERE HEX(`text`) REGEXP 'C3A9C588';
+----------+
| text |
+----------+
| siréňa |
+----------+
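The hex string in that query is simply the UTF-8 encoding of "éň". A small sketch of how such a string could be derived outside the database (the helper name `utf8_hex` is made up for this example):

```python
def utf8_hex(text: str) -> str:
    """Uppercase hex of the UTF-8 bytes, comparable to HEX() on a utf8 column."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex("éň"))      # C3A9C588
print(utf8_hex("siréňa"))  # 736972C3A9C58861
```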
Re your comment:
No, I don't know of any solution with MySQL.
You might have to switch to PostgreSQL, because that RDBMS supports \u escape codes for Unicode characters in its regular expression syntax.

Try something like ... REGEXP '(a|b|[ab])'
SELECT * FROM `content` WHERE `text` REGEXP '(s|š|Š|[sšŠ])(i|í|Í|[iíÍ])(r|ŕ|Ŕ|ř|Ř|[rŕŔřŘ])(e|é|É|ě|Ě|[eéÉěĚ])(n|ň|Ň|[nňŇ])(A|a|á|Á|ä|Ä|0|[AaáÁäÄ0])'
It works for me!
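Writing such a pattern by hand gets tedious, so it could be generated from the user's input. The sketch below is one possible way to do that; the `VARIANTS` table and the function name are hypothetical, and the input is assumed to contain no regex metacharacters. Alternation works byte-wise because each alternative is a whole byte sequence, which the last lines check with Python's `re` on bytes (again only a stand-in for MySQL's engine).

```python
import re

# Hypothetical variant table: base letter -> every form that should match.
VARIANTS = {
    "s": "sšŠ", "i": "iíÍ", "r": "rŕŔřŘ",
    "e": "eéÉěĚ", "n": "nňŇ", "a": "AaáÁäÄ",
}
# Reverse lookup so that accented input maps back to its base letter.
BASE_OF = {v: base for base, forms in VARIANTS.items() for v in forms}


def alternation_pattern(word: str) -> str:
    """Build '(s|š|Š)(i|í|Í)...' for a word, with or without accents."""
    groups = []
    for ch in word:
        base = BASE_OF.get(ch)
        options = VARIANTS[base] if base else ch  # unknown letters stay literal
        groups.append("(" + "|".join(options) + ")")
    return "".join(groups)


pattern = alternation_pattern("siréná")
print(pattern == alternation_pattern("sirena"))  # same pattern either way
print(re.search(pattern.encode("utf-8"), "šíreňá".encode("utf-8")) is not None)
```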

Use the lib_mysqludf_preg library from the mysql UDF repository for PCRE regular expressions directly in mysql
Although MySQL's regular expression library does not support UTF-8, the lib_mysqludf_preg library from the MySQL UDF repository provides UTF-8-compatible PCRE regular expressions directly in MySQL.
http://www.mysqludf.org/
https://github.com/mysqludf/lib_mysqludf_preg#readme

Related

Search for replacement character (no TSQL)

I'm trying to find a way to search for the replacement character U+FFFD with SQL (I'm using MariaDB), but I cannot make it work. I tried:
SELECT id FROM tablename WHERE content LIKE "%\ufffd%";
SELECT id FROM tablename WHERE content LIKE "%�%"
Neither query works for me. Some topics suggest UNICODE(), but that is a T-SQL function and I cannot use it in MariaDB. Is there a solution?
What CHARACTER SET are you using? FFFD is the hex for the Unicode "codepoint". The UTF-8 encoding for it is EFBFBD.
Here's another way to look for it:
WHERE HEX(col) REGEXP '^(....)*FFFD'
or perhaps
WHERE HEX(col) REGEXP '^(..)*EFBFBD'
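The two hex strings correspond to two encodings of the same code point. A short sketch that confirms the byte values and mimics the second pattern on a hex dump (the function name is invented for this example, and Python's `re` stands in for the database's regex engine):

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

print(REPLACEMENT.encode("utf-8").hex().upper())      # EFBFBD
print(REPLACEMENT.encode("utf-16-be").hex().upper())  # FFFD


def hex_has_replacement_char(hex_utf8: str) -> bool:
    """Same idea as HEX(col) REGEXP '^(..)*EFBFBD' for UTF-8 data."""
    # '^(..)*' keeps the match aligned on whole bytes (two hex digits each).
    return re.search(r"^(..)*EFBFBD", hex_utf8) is not None


print(hex_has_replacement_char("a\ufffdb".encode("utf-8").hex().upper()))
print(hex_has_replacement_char("abc".encode("utf-8").hex().upper()))
```

The four-dot variant fits a column stored with two bytes per character, which is why the character set matters.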
What are your results? Do you get any error? Try this simple query, or change your column type.
select a from (select '�' as a) t where a like '%�%'

MySQL regexp for emoji / unicode

I want to search my database for any string which contains the Butterfly Emoji - 🦋 - using regexp.
For example
SELECT *
FROM `table`
WHERE `text`
REGEXP '🦋'
I'm using REGEXP because I might want to search for Hello[[:space:]]world or similar.
I get the error
Got error 'nothing to repeat at offset 0' from regexp
This works:
SELECT *
FROM `table`
WHERE `text`
LIKE '%🦋%'
But then I lose the ability to search for, say, flying[[:space:]]🦋
My Collation is utf8mb4_unicode_ci. The database is 10.0.36-MariaDB
Honestly I don't know why, but escaping your butterfly will give the desired output. (At least in my version, MariaDB 10.3.10, which gave the same error without escaping).
SELECT * FROM `table` WHERE `text` REGEXP '\\🦋'
(Note the double backslash: the first one escapes the second within the string literal, so the regular expression receives \🦋.)
SHOW VARIABLES LIKE 'char%';
It sounds like you have not told MySQL which encoding the client uses for characters. This is best done via the connection parameters, or via mysqli_set_charset (if using mysqli, not PDO).
Also, run this on your version:
SELECT 'ab' REGEXP '?';
I suspect it will give you the same error message.
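A plausible reading of that suspicion is that the emoji reaches the server as a lone `?` once it passes through a connection charset that cannot represent it. The sketch below illustrates the idea in Python; `latin-1` is only a stand-in for whatever charset the connection actually uses, and Python's `re` is not the PCRE library that MariaDB uses, although it rejects the pattern for the same reason.

```python
import re

butterfly = "\U0001F98B"  # the butterfly emoji

# Four bytes in UTF-8, so it needs a 4-byte-capable charset such as utf8mb4.
print(butterfly.encode("utf-8").hex().upper())  # F09FA68B

# A charset that cannot represent the character degrades it to '?'.
degraded = butterfly.encode("latin-1", errors="replace").decode("latin-1")
print(degraded)  # ?

# A bare '?' is a quantifier with nothing in front of it to repeat.
try:
    re.compile(degraded)
    compiled = True
except re.error as exc:
    compiled = False
    print("regex rejected:", exc)
```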

MYSQL/Coldfusion replace registration symbol not working

I'd like to make all registration symbols superscript by wrapping them with a <sup> HTML tag. So, I can do this in SQL no problem:
SELECT s.id,
Replace(s.name,'®','<sup>®</sup>') AS name
FROM staff s
WHERE name LIKE '%®%'
Result:
id | name
1 | Name1 CFP<sup>®</sup>, CDFA
2 | Jeffrey test CFP<sup>®</sup>
3 | Matthew hello CFP<sup>®</sup> CFA
But when I run it in ColdFusion from a cfquery tag, it looks as if the ® character is interpreted as a different, mis-encoded character.
<cfquery name="getStaff" dataSource="#this.dsn#">
SELECT s.id,
Replace(s.name,'®','<sup>®</sup>') AS name
FROM staff s
WHERE 1=1
<cfif isDefined("arguments.permalink")>
AND s.permalink=<cfqueryparam value="#arguments.permalink#" />
</cfif>
</cfquery>
Is there a better way to approach this? I originally did this in Coldfusion using <cfset getStaff.name = Replace(getStaff.name,Chr(174),'<sup>®</sup>') />, which worked fine until I switched to Mustache templating.
I'd definitely prefer to use the CHAR() function if I could figure out what numeric character ® is in Mysql. (Note, using utf8_general_ci on this and all DB tables) I tried CHAR(174) in Mysql, but it won't work because (as far as I can tell) Mysql isn't using the same character set - SELECT CHAR(174) returns a blob.
UPDATE:
I'd definitely prefer to use the CHAR() function if I could figure out
what numeric character ® is in Mysql. (Note, using utf8_general_ci on
this and all DB tables) I tried CHAR(174) in Mysql, but it won't work
because (as far as I can tell) Mysql isn't using the same character
set - SELECT CHAR(174) returns a blob.
As mentioned in the comments, it sounds like the default charset for your database is utf8. So presumably it failed because the decimal 174 is not the correct way to represent the registered sign in utf8. That symbol requires two bytes. Using the proper hex or decimal value for your default charset (ie utf8) it works as expected:
Hex: CHAR(0xC2AE)
Decimal: CHAR(194,174)
Though it would be better to specify the charset explicitly with USING:
Hex: CHAR(0xC2AE USING utf8)
Decimal: CHAR(194,174 USING utf8)
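The byte values above can be checked outside the database. This short sketch shows where 0xC2AE and 194,174 come from, and why 174 alone only works in a single-byte charset such as Latin-1:

```python
symbol = "\u00ae"  # the registered sign

utf8_bytes = symbol.encode("utf-8")
print(utf8_bytes.hex().upper())        # C2AE
print(list(utf8_bytes))                # [194, 174]

# 174 is the code point, and also the single byte used by Latin-1.
print(ord(symbol))                     # 174
print(list(symbol.encode("latin-1")))  # [174]

# Reading the two UTF-8 bytes as Latin-1 yields two characters instead of one.
print(utf8_bytes.decode("latin-1"))    # Â®
```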
Is the symbol hard-coded into the .cfm script? If so, it is probably an issue with the character encoding of the script. When interpreting literal characters within the file, the page encoding is what matters. Try:
Adding <cfprocessingdirective pageEncoding="utf-8"> to the top of the script.
Note: For CFC's, the cfprocessingdirective tag must follow the cfcomponent tag
If the default charset for your database is utf8, try using the CF equivalent function, i.e. #chr(174)#. However, IMO it is better to use the MySQL CHAR() function instead.
Side note about cfqueryparam, it is a good practice to always specify a cfsqltype. If omitted, it defaults to CF_SQL_CHAR, which may force implicit conversion and cause wrong/unintended results in some cases (numbers, dates, etcetera). Even for string values it is a good idea to specify the type, as there may be slight differences with how CHAR and VARCHAR types are treated on the database side.
It is possible to do something like ColdFusion's Chr() in SQL:
<cfquery name="getStaff" dataSource="#this.dsn#">
SELECT s.id,
REPLACE(s.name, CHAR(174), '<sup>®</sup>') AS name
FROM staff s
WHERE 1=1
<cfif isDefined("arguments.permalink")>
AND s.permalink=<cfqueryparam value="#arguments.permalink#" />
</cfif>
</cfquery>
For MySQL:
See: http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_char
For SQL Server:
See: https://msdn.microsoft.com/en-us/library/ms187323.aspx

MySQL query with non-printing characters (left-to-right mark)

I just found myself in the interesting situation of needing to query MySQL for fields containing a so-called Left-to-right mark.
Since this character is non-printing, and thus invisible, I'm unable to simply copy/paste it into a query.
As mentioned in the linked Wikipedia article, the Left-to-right mark is Unicode character U+200E, a fact that I'm sure is the key to success in my current adventure.
My question is: How do I use raw Unicode in a MySQL query? Something along the lines of:
SELECT * FROM users WHERE username LIKE '%\U+200E%'
or
SELECT * FROM users WHERE username REGEXP '\U+200E'
or whatever the correct syntax for Unicode in MySQL is and depending on whether this is supported with LIKE and/or REGEXP.
To get a unicode char, something like this should work:
SELECT CHAR(<number> USING utf8);
Also, don't use REGEXP, because the regexp lib used by MySQL is very old, and doesn't support multi-byte charsets.
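The number passed to CHAR(... USING utf8) has to be the character's UTF-8 byte sequence, not its code point. A sketch of how that value could be worked out (the helper `mysql_char_literal` is made up for this example):

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, its counterpart


def mysql_char_literal(ch: str) -> str:
    """Build a CHAR(0x... USING utf8) expression for a single character."""
    return "CHAR(0x" + ch.encode("utf-8").hex().upper() + " USING utf8)"


for ch in (LRM, RLM):
    print(unicodedata.name(ch), "->", mysql_char_literal(ch))
# LEFT-TO-RIGHT MARK -> CHAR(0xE2808E USING utf8)
# RIGHT-TO-LEFT MARK -> CHAR(0xE2808F USING utf8)
```

The resulting expression could then be placed in a LIKE pattern, for example LIKE CONCAT('%', CHAR(0xE2808E USING utf8), '%').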

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Besides ASCII, the second field also contains Unicode codepoints like U+02C8 (MODIFIER LETTER VERTICAL LINE) and U+02D0 (MODIFIER LETTER TRIANGULAR COLON).
word | ipa
--------+----------
Hallo | haˈloː
IPA | ˌiːpeːˈʔaː
I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.
SELECT * FROM pronunciation WHERE ipa LIKE '%ha_lo%'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha__lo%'; -- 1 row
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row
I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.
How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.
Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.
EDITED to incorporate a fix for valid criticism
Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:
select * from mytable
where hex(ipa) rlike concat('^(..)*', hex('needle')); -- looking for 'needle' in haystack; the leading ^(..)* keeps the match aligned on hex pairs.
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
P.S. @Kieren's (valid) point is addressed by using rlike to enforce char pairs.
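One detail worth checking with hex-pair patterns is that the pair group only enforces alignment when it is anchored at the start of the string. The sketch below uses Python's `re` as a stand-in for RLIKE to show a false positive at an odd offset; the byte values are chosen purely for illustration.

```python
import re


def hex_of(data: bytes) -> str:
    return data.hex().upper()


haystack = hex_of(b"\x12\x34")  # "1234"
needle = hex_of(b"\x23")        # "23": not a byte of the haystack

# Unanchored: '(..)*' may match nothing, so '23' is found straddling two bytes.
unanchored = re.search("(..)*" + needle + "(..)*", haystack) is not None
# Anchored: the needle can only start after a whole number of hex pairs.
anchored = re.search("^(..)*" + needle, haystack) is not None

print(unanchored, anchored)  # True False
```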
I'm not dead-set on using MySQL
Postgres seems to handle it quite fine:
test=# select 'ˌˈʔ' like '___';
?column?
----------
t
(1 row)
test=# select 'ˌˈʔ' ~ '^.{3}$';
?column?
----------
t
(1 row)
If you go down that road, note that Postgres' ilike operator behaves like MySQL's like. (In Postgres, like is case-sensitive.)
For a MySQL-specific solution, you might be able to work around the problem by binding some user-defined function (maybe bind the ICU library?) into MySQL.
You have problems with UTF8? Eliminate them.
How many special characters do you use? You are only using lowercase letters, right? So my tip is: write a function that converts special characters to regular ones, e.g. "æ" -> "A" and so on, and add a column to the table that stores the converted value (you have to convert all existing values first, and then again on each insert/update). When searching, convert the search string with the same function and run the regexp against that column.
If there are too many kinds of special characters, convert each one to a multi-character token instead. Two precautions: 1. To avoid finding "aa" inside the sequence "ba ab", put a prefix before each token, like "#ba#ab". 2. To avoid finding "#a" inside "#ab", use fixed-length tokens, say of length 2.
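For accented Latin letters, one way to write such a conversion function is Unicode decomposition followed by dropping the combining marks. This is a different technique from the hand-written mapping suggested above and is offered only as a sketch; the function name is made up, and letters without a decomposition (such as "æ" or IPA symbols) pass through unchanged and would still need an explicit mapping.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after NFD decomposition, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


print(fold_accents("šíreňá"))  # sirena
print(fold_accents("Siréna"))  # sirena
print(fold_accents("æ"))       # æ (no decomposition, left as-is)
```

The same function would be applied both when filling the extra column and to the search string before querying.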