MySQL REGEXP does not work properly with Russian chars - mysql

SELECT 'Hello' REGEXP '^[^aeiouAEIOU][A-Za-z]*$' -> 1
SELECT 'Привет' REGEXP '^[^аеиоуыэюяАЕИОУЫЭЮЯ][А-Яа-я]*$' -> 0 - it must return 1.

Before MySQL 8.0, REGEXP works on bytes rather than characters, and in utf8 each Russian character takes 2 bytes.
For limiting to Cyrillic, this seems to be correct:
SELECT HEX('Привет') REGEXP '^((D0|D1)..)+$'; -- > 1
(I'll get to the issue of avoiding leading vowels in a minute.)
To explain:
All Russian characters are 2 bytes, the first byte being hex D0 or D1. (There may be non-Russian characters starting that way; I'm ignoring that problem.)
(...|...) - | means 'OR'.
.. matches two hex digits, i.e. one byte, saying that the second byte can be anything (this is overkill but may not hurt).
(...)+ - The plus means one or more occurrence.
The ^ and $ 'anchor' the regexp to encompass the entire string.
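To see why the hex pattern works, here is a minimal Python sketch that mimics what HEX() returns for a utf8 string and applies the same pattern to it. Python is used only for illustration, and the helper name hex_utf8 is hypothetical.

```python
import re

def hex_utf8(text):
    """Uppercase hex of the UTF-8 bytes, similar to MySQL's HEX() on a utf8 string."""
    return text.encode("utf-8").hex().upper()

# Same pattern as in the query above: byte pairs whose first byte is D0 or D1.
CYRILLIC_HEX = re.compile(r"^((D0|D1)..)+$")

print(hex_utf8("Привет"))                            # D09FD180D0B8D0B2D0B5D182
print(bool(CYRILLIC_HEX.match(hex_utf8("Привет"))))  # True
print(bool(CYRILLIC_HEX.match(hex_utf8("Hello"))))   # False
```

Every character in the first output is a 4-hex-digit group starting with D0 or D1, which is exactly what the pattern tests for.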
Back to the no-leading-vowel issue. Now we need to play some painful games to list the vowels; their HEX values seem to be
D0, followed by any of B0 B5 B8 BE (а е и о)
90 95 98 9E A3 AB AD AE AF (А Е И О У Ы Э Ю Я), or
D1, followed by any of 83 8B 8D 8E 8F (у ы э ю я)
Example: select hex('э'); --> D18D.
Putting it all together will be messy because this REGEXP does not have the (? lookahead tools to say "not". So, I will start by testing for a leading vowel. Note that the ^ must apply to the whole alternation, so both branches sit inside one group:
SELECT HEX('Привет')
REGEXP '^(D0(B0|B5|B8|BE|90|95|98|9E|A3|AB|AD|AE|AF)|D1(83|8B|8D|8E|8F))'
correctly fails.
Now to put things together:
SELECT NOT HEX('Привет')
REGEXP '^(D0(B0|B5|B8|BE|90|95|98|9E|A3|AB|AD|AE|AF)|D1(83|8B|8D|8E|8F))'
AND HEX('Привет')
REGEXP '^((D0|D1)..)+$';
The first part checks for NOT a leading vowel; the second part checks for all characters being Russian.
That test case works, and 'э' came back with 0, but I may have goofed somewhere.
(That was a challenge.)
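The vowel byte pairs can also be derived from the encoding instead of being looked up by hand, which avoids transcription slips. The following is a rough Python sketch of the same two-part test, not a MySQL feature; the function and constant names are hypothetical, and the vowel list is the one from the question (so ё/Ё are not included).

```python
import re

VOWELS = "аеиоуыэюяАЕИОУЫЭЮЯ"  # vowel list taken from the question

def hex_utf8(text):
    """Uppercase hex of the UTF-8 bytes, similar to MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()

# One alternation of 4-hex-digit codes; ^ applies to the whole group.
LEADING_VOWEL = re.compile("^(" + "|".join(hex_utf8(v) for v in VOWELS) + ")")
ALL_CYRILLIC = re.compile(r"^((D0|D1)..)+$")

def consonant_initial_cyrillic(word):
    """True if every character has lead byte D0/D1 and the first is not a listed vowel."""
    h = hex_utf8(word)
    return not LEADING_VOWEL.match(h) and bool(ALL_CYRILLIC.match(h))

print(hex_utf8("э"), hex_utf8("Э"))          # D18D D0AD
print(consonant_initial_cyrillic("Привет"))  # True
print(consonant_initial_cyrillic("э"))       # False
```

The second printed value shows that the uppercase vowel Э is encoded with lead byte D0, not D1.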

Related

How to create arabic character class in MySQL using RegExp?

For example, I want to create a character class for the Arabic digits, [٩-٠], that contains all the digits.
The corresponding Unicode is [U+0660-U+0669], I tried this:
Select * FROM employees WHERE ID REGEXP [\u{0660}-\u{0669}];
I get this error
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '[\u{0660}-\u{0669}] LIMIT 0, 25' at line 1
https://dev.mysql.com/doc/refman/5.7/en/regexp.html says
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multibyte safe and may produce unexpected results with multibyte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
That is, if you use à in a regexp, it will treat the 2-byte utf8 code as 2 separate bytes, hex C3 and A0. If this gives you the 'right' answer, it will be more by 'luck' than design.
This does work:
mysql> SELECT '١' REGEXP '[٩-٠]';
+-----------------------+
| '١' REGEXP '[٩-٠]' |
+-----------------------+
| 1 |
+-----------------------+
But, it is just coincidence. The regexp is something like [x0-x9] where x is the D9 byte, 0 is A0 and 9 is A9. The regexp then means "the byte x, or any byte between 0 and x, or the byte 9", which is not what you wanted.
This might work for 'all' of Arabic: REGEXP UNHEX('5BD82DDD5D'), but only because 'all' start with hex D8 through DD. (However, there may be other things in that range.) Furthermore, that will only check "does the string contain an Arabic letter?"; it cannot be used for anything more complex, such as phrases or a subset of letters.
Back to the digit range. Just checking for hex D9 is not safe, because that will include the percent sign, superscript letters, and other characters. This may work: REGEXP UNHEX('D95BA02DA95D').
Caveat: Most of what I have said in this answer is untested; I'm inventing a solution in an area where I have no experience (REGEXP with utf8).
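The byte-level reasoning can be checked outside MySQL. Below is a small Python sketch that uses Python's re module on bytes as a stand-in for a byte-wise regexp engine; this is an assumption for illustration, not MySQL's actual implementation, and the variable names are hypothetical.

```python
import re

ZERO, NINE = "\u0660", "\u0669"  # ARABIC-INDIC DIGIT ZERO and NINE

# What a byte-wise engine sees for the class [zero-nine]: 5B D9 A0 2D D9 A9 5D
naive = ("[" + ZERO + "-" + NINE + "]").encode("utf-8")
print(naive.hex().upper())  # 5BD9A02DD9A95D

naive_re = re.compile(naive)              # byte D9, or bytes A0..D9, or byte A9
safe_re = re.compile(b"\xD9[\xA0-\xA9]")  # lead byte D9, then A0..A9

one = "\u0661".encode("utf-8")      # Arabic-Indic digit one: D9 A1
e_acute = "\u00e9".encode("utf-8")  # Latin e with acute: C3 A9
percent = "\u066a".encode("utf-8")  # Arabic percent sign: D9 AA

# The naive class matches the digit, but also a Latin accented letter.
print(bool(naive_re.search(one)), bool(naive_re.search(e_acute)))  # True True
# The two-byte pattern matches only the digit.
print(bool(safe_re.search(one)), bool(safe_re.search(e_acute)),
      bool(safe_re.search(percent)))  # True False False
```

The second pattern has the same bytes as UNHEX('D95BA02DA95D') in the answer above.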

MySQL select UTF-8 string with '=' but not with 'LIKE'

I have a table with some words that come from medieval books and have some accented letters that no longer exist in the modern latin1 alphabet. I can represent these letters easily with UTF-8 combining characters. For example, to create a "J" with a tilde, I use the sequence \u004A+\u0303 and the J becomes accented with a tilde.
The table uses utf8 encoding and the field collation is utf8_unicode_ci.
My problem is the following: If I try to select the entire string, I receive the correct answer. If I try to select using 'LIKE', I receive the wrong answer.
For example:
mysql> select word, hex(word) from oldword where word = 'hua';
+--------+--------------+
| word | hex(word) |
+--------+--------------+
| hũa | 6875CC8361 |
| huã | 6875C3A3 |
| hua | 687561 |
| hũã | 6875CC83C3A3 |
+--------+--------------+
4 rows in set (0,04 sec)
mysql> select word, hex(word) from oldword where word like 'hua';
+-------+------------+
| word | hex(word) |
+-------+------------+
| huã | 6875C3A3 |
| hua | 687561 |
+-------+------------+
2 rows in set (0,04 sec)
I don't want to search only for the entire word. I want to search for words that start with some substring. Occasionally the search term is the entire word.
How could I select the partial string using like and match all the strings?
I tried to create a custom collation using this information, but the server became unstable and only after a lot of trials and errors I was able to revert to the utf8_unicode_ci collation again and the server returned to normal condition.
EDIT: There's a problem with this site and some characters don't display correctly. Please see the results on these pastebins:
http://pastebin.com/mckJTLFX
http://pastebin.com/WP87QvgB
After seeing Marcus Adams' answer I realized that the REPLACE function could be the solution for this problem, although he didn't mention this function.
I have only two different combining characters (acute and tilde), combined with other ASCII characters, for example j with tilde, j with acute, m with tilde, s with tilde, and so on. So I just have to remove these two characters when using LIKE.
After searching the manual, I learned about the UNHEX function that helped me to properly represent the combining characters alone in the query to remove them.
The combining tilde is represented by CC83 in HEX code and the acute is represented by CC81 in HEX.
So, the query that solves my problem is this one.
SELECT word, REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
FROM oldword WHERE REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
LIKE 'hua%';
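The same idea (remove the combining marks, then do a prefix match) can be sketched in Python for comparison. This is only an illustration with a hypothetical helper name; it is broader than the REPLACE query, because it decomposes precomposed letters such as ã and drops every combining mark, not only the tilde and the acute.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# The four words from the question (written with escapes), plus a non-match.
words = ["hu\u0303a", "hu\u00e3", "hua", "hu\u0303\u00e3", "casa"]
matches = [w for w in words if strip_marks(w).startswith("hua")]
print(len(matches))  # 4
```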
The problem is that LIKE performs the comparison character-by-character, and when using the "combining tilde", it literally is two characters, though it displays as one (assuming your client supports displaying it as such).
There will never be a case where comparing e.g. hu~a to hua character-by-character will match because it's comparing ~ with a for the third character.
Collations (and coercions) work in your favor and handle such things when comparing the string as a whole, but not when comparing character-by-character.
Even if you considered using SUBSTRING() as a hack instead of using LIKE with a wildcard % to perform a prefix search, consider the following:
SELECT SUBSTRING('hũa', 1, 3) = 'hua'
-> 0
SELECT SUBSTRING('hũa', 1, 4) = 'hua'
-> 1
You kind of have to know the length you're going for or brute force it like this:
SELECT * FROM oldword
WHERE SUBSTRING(word, 1, 3) = 'hua'
OR SUBSTRING(word, 1, 4) = 'hua'
OR SUBSTRING(word, 1, 5) = 'hua'
OR SUBSTRING(word, 1, 6) = 'hua'
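The reason the required SUBSTRING length varies is that each combining mark counts as a character of its own. A quick Python check of the code point counts for the words in the question (illustration only; the strings are written with escapes so the encoding is unambiguous):

```python
# (label, string) pairs; the hex values correspond to the hex(word) column in the question.
samples = [
    ("hua", "hua"),
    ("u + combining tilde, a", "hu\u0303a"),
    ("u, precomposed a-tilde", "hu\u00e3"),
    ("u + combining tilde, precomposed a-tilde", "hu\u0303\u00e3"),
]
for label, word in samples:
    # len() counts code points, which is what a character-based SUBSTRING counts.
    print(len(word), word.encode("utf-8").hex().upper(), label)
```

The first and third words have 3 characters and the other two have 4, so a fixed prefix length of 3 cannot cover all of them.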
According to this:
ũ collates equal to plain U in all utf8 collations on 5.6.
j́ collates equal to plain J in most collations; exceptions:
utf8_general*ci because it is actually j plus an accent. And the "general" collations only look at one character (as distinguished from byte) at a time. Most collations take into consideration multiple characters, such as ch or ll in Spanish or ss in German.
utf8_roman_ci, which is quite an oddball. j́=i=j
(LIKE does not exactly follow the regular rules of collation. I am not versed in the details, but I think the fact that the accented J is represented as 2 characters causes it to work differently in LIKE than in WHERE or ORDER BY. Furthermore, I don't know whether REPLACE() collates like LIKE or like the other places.)
You can use the % symbol like a wildcard character. For example this:
SELECT word
FROM myTable
WHERE word LIKE 'hua%';
This will pull all records that start with hua and have 0+ characters following it. Here is an SQL Fiddle example.

MySQL: compare a mixed field containing letters and numbers

I have a field in the mysql database that contains data like the following:
Q16
Q32
L16
Q4
L32
L64
Q64
Q8
L1
L4
Q1
And so forth. What I'm trying to do is pull out, let's say, all the values that start with Q which is easy:
field_name LIKE 'Q%'
But then I want to filter, let's say, all the values that have a number higher than 32. As a result I'm supposed to get only 'Q64'; however, I also get Q4, Q8 and so forth, because I'm comparing them as strings: the values are compared digit by digit against 3 and 2, not as whole integers.
As this makes perfect sense, I'm struggling to find a solution on how to perform this operation without pulling all the data out of the database, stripping out the Qs and parsing it all to integers.
I did play around with the CAST operator; however, it only works if the value is stored as a string AND it contains only digits. The parsing fails if there's another character in there.
Extract the number from the string and convert it to a number with *1 or CAST:
select * from your_table
where substring(field_name, 1, 1) = 'Q'
and substring(field_name, 2) * 1 > 32
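For comparison, the same split (one leading letter, then a number compared numerically) can be sketched in Python. The helper name split_code is hypothetical, and the sketch assumes every value has exactly the letter-plus-digits shape shown in the question.

```python
import re

def split_code(value):
    """Split 'Q64' into ('Q', 64); return None if the value has another shape."""
    m = re.fullmatch(r"([A-Za-z])(\d+)", value)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

def q_above(values, prefix, threshold):
    """Values with the given letter whose numeric part exceeds the threshold."""
    result = []
    for v in values:
        parts = split_code(v)
        if parts is not None and parts[0] == prefix and parts[1] > threshold:
            result.append(v)
    return result

values = ["Q16", "Q32", "L16", "Q4", "L32", "L64", "Q64", "Q8", "L1", "L4", "Q1"]
print(q_above(values, "Q", 32))  # ['Q64']
```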

Store only numbers after Comma mysql

How to save only the numbers after the comma with mysql functions?
eg : 16.03 ----> 03
If I understand correctly, you want the fractional part of the number. In MySQL, you can do this with mod():
select mod(14.03, 1)
Yields "0.03".
EDIT:
Juhana makes a very good point. MySQL freely converts between numbers and strings, so you can use substr() on this:
select substr(14.03, locate('.', 14.03)+1)
If you want the digits after a comma, replace '.' with ','.
If your values come from a table, you will have to be prepared for the fact that some values could be integers (i.e. they have no decimal separator when converted to a string):
SELECT IF(LOCATE('.', value), SUBSTRING_INDEX(value,'.', -1), "0") FROM tbl;
See http://sqlfiddle.com/#!2/062c8/11
Or simply:
14.03 % 1
--> 0.03
Also works for negative numbers:
-14.03 % 1
--> -0.03
If you are after exactly the two digits after the decimal point (and you are dealing with positive numbers only) you can do the following:
substring(format( num %1,2),3,2) -- num being the column (integer or float, double ...)
--> 03 -- for num=14.03
--> 00 -- for num=14
--> 05 -- for num=14.046739
Of course, if you want all digits, you could leave out the format() and the length argument:
substring(num %1,3)
--> 03 -- for num=14.03
--> -- for num=14
--> 046739 -- for num=14.046739
but that makes the integer case quite ugly.
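The two approaches above (remainder modulo 1 versus cutting the string at the decimal point) can be compared in a short Python sketch. It uses the decimal module so that 14.03 is not distorted by binary floating point; the function names are hypothetical and the inputs are assumed to be plain decimal strings.

```python
from decimal import Decimal

def fractional_part(value):
    """Numeric fractional part; the sign follows the input, as with % 1 above."""
    return Decimal(value) % 1

def digits_after_point(value):
    """Digits after the decimal point as text, or '0' when there is no point."""
    text = str(Decimal(value))
    return text.split(".", 1)[1] if "." in text else "0"

print(fractional_part("14.03"))         # 0.03
print(fractional_part("-14.03"))        # -0.03
print(digits_after_point("14.03"))      # 03
print(digits_after_point("14"))         # 0
print(digits_after_point("14.046739"))  # 046739
```

The string-based version keeps the leading zero ("03"), which the numeric remainder cannot represent.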

Need a help for sort in mysql

Hi, I want to sort a table. The field contains numbers, letters, and numbers followed by letters, i.e.
1
2
1a
11a
a
6a
b
I want to sort this to,
1
1a
2
6a
11a
a
b
My code is, SELECT * FROM t ORDER BY CAST(st AS SIGNED), st
But the result is,
a
b
1
1a
2
6a
11a
I found this code in this url "http://www.mpopp.net/2006/06/sorting-of-numeric-values-mixed-with-alphanumeric-values/"
Anyone please help me
This would do your required sort order, even in the presence of 0 in the table;
SELECT * FROM t
ORDER BY
st REGEXP '^[[:alpha:]].*',
st+0,
st
An SQLfiddle to test with.
As a first sort criterion, it sorts anything that starts with a letter after anything that doesn't. That's what the regexp does.
As a second sort criterion, it sorts by the numerical value the string starts with (st+0 adds 0 to the numerical part the string starts with and returns a number).
As a last resort, it sorts by the string itself to get the alphabetical ones in order.
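The three sort criteria map naturally onto a tuple sort key. Here is a minimal Python sketch of the same ordering, for illustration only: the name sort_key is hypothetical, the leading number is simplified to unsigned integers (MySQL's st+0 also accepts signs and decimals), and the final string comparison is by code point rather than by a MySQL collation.

```python
import re

def sort_key(st):
    """(starts with a letter?, leading number or 0, the string itself)."""
    starts_with_letter = st[:1].isalpha()
    m = re.match(r"\d+", st)
    leading_number = int(m.group()) if m else 0
    return (starts_with_letter, leading_number, st)

values = ["1", "2", "1a", "11a", "a", "6a", "b"]
print(sorted(values, key=sort_key))
# ['1', '1a', '2', '6a', '11a', 'a', 'b']
```

Because False sorts before True, values starting with a digit come first, and a row containing just 0 still sorts ahead of the letters.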
You can use this:
SELECT *
FROM t
ORDER BY
st+0=0, st+0, st
Using st+0, the varchar column will be cast to a number. Ordering by st+0=0 will put the rows that start with a letter at the bottom (st+0=0 will be 1 if the string does not start with a number, otherwise it will be 0). Note that a row containing just 0 also gives st+0=0, so it will sort with the letters.
Please see fiddle here.
The reason you are getting this output is that values such as 'a' and 'b' are converted to 0 by the cast, so with ascending order they appear at the top.
SELECT CAST(number AS SIGNED) from tbl
is returning
1
2
1
11
0
6
0
Look at this fiddle:- SQL FIDDLE
I did small change in your query -
SELECT *, CAST(st AS SIGNED) as casted_column
FROM t
ORDER BY casted_column ASC, st ASC
this should work.
In theory your syntax should work, but I am not sure why MySQL doesn't accept these expressions after the FROM clause,
so I created a temporary column and then sorted on that one.
This should work in my experience, and you can check it.