How to create an Arabic character class in MySQL using REGEXP? - mysql

For example, I want to create a character class for the Arabic digits, [٩-٠], that contains all the digits.
The corresponding Unicode range is [U+0660-U+0669]. I tried this:
Select * FROM employees WHERE ID REGEXP [\u{0660}-\u{0669}];
I get this error:
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '[\u{0660}-\u{0669}] LIMIT 0, 25' at line 1

https://dev.mysql.com/doc/refman/5.7/en/regexp.html says
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multibyte safe and may produce unexpected results with multibyte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
That is, if you use à in a regexp, it will treat the 2-byte utf8 encoding as two separate bytes, hex C3 and A0. If this gives you the 'right' answer, it will be more by 'luck' than by design.
This does work:
mysql> SELECT '١' REGEXP '[٩-٠]';
+-----------------------+
| '١' REGEXP '[٩-٠]' |
+-----------------------+
| 1 |
+-----------------------+
But it is just a coincidence. Byte-wise, the regexp is something like [x0-x9], where x is the D9 byte, 0 is A0, and 9 is A9. So the regexp means "any byte x, or between 0 and x, or 9", which is not what you wanted.
This might work for 'all' of Arabic: REGEXP UNHEX('5BD82DDD5D'), but only because 'all' start with hex D8 through DD. (However, there may be other things in that range.) Furthermore, that will only check "Does the string contain an Arabic letter?"; it cannot be used for anything more complex, such as phrases or a subset of letters.
Back to the digit range. Just checking for hex D9 is not safe, because that will include the percent sign, superscript letters, and other characters. This may work: REGEXP UNHEX('D95BA02DA95D'), which is the byte D9 followed by one byte in the range A0 through A9.
Caveat: Most of what I have said in this answer is untested; I'm inventing a solution in an area where I have no experience (REGEXP with utf8).
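To make the byte-wise behaviour easier to see, here is a minimal Python sketch (not MySQL itself) that emulates a byte-oriented regexp on UTF-8 input. It assumes the data is stored as UTF-8; the names naive, safe and matches are made up for illustration. The digits are written as escapes (U+0660 to U+0669) to avoid display-order problems.

```python
import re

# Arabic-Indic digits U+0660..U+0669 and their UTF-8 bytes in hex.
# Every digit is the lead byte D9 followed by one byte in A0..A9.
digit_hex = {chr(cp): chr(cp).encode("utf-8").hex().upper()
             for cp in range(0x0660, 0x066A)}

# Byte-wise view of a bracket expression holding "zero - nine":
# the bytes are D9, A0, '-', D9, A9, i.e. the set {D9, A0..D9, A9}.
naive = re.compile(b"[" + "\u0660".encode("utf-8") + b"-"
                   + "\u0669".encode("utf-8") + b"]")

# Byte pattern equivalent to UNHEX('D95BA02DA95D'): D9, then [A0-A9].
safe = re.compile(b"\xd9[\xa0-\xa9]")


def matches(pattern, text):
    """Search the UTF-8 bytes of text, as a byte-wise engine would."""
    return pattern.search(text.encode("utf-8")) is not None


print(digit_hex["\u0661"])        # D9A1
print(matches(naive, "\u00e9"))   # True: e-acute is C3 A9, a false positive
print(matches(safe, "\u00e9"))    # False
print(matches(safe, "\u0661"))    # True
print(matches(safe, "\u066a"))    # False: Arabic percent sign is D9 AA
```

The naive class accepts any string containing a byte between A0 and D9, which covers a large share of non-ASCII UTF-8 text, while the two-byte pattern only accepts the ten digit encodings.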

Related

How to search by specific pattern in MySQL?

I have below pattern of data in field
XXX-XX-XXX
Some of the data don't have that patterns.
So need to search those records.
SELECT * FROM table WHERE `name` NOT REGEXP '^.{10}$'
SELECT * FROM table WHERE `name` NOT REGEXP '^..........$'
The above two queries work, but not 100%.
Can I filter by {3}-{2}-{3}?
You want to match a string with 3 alphanumeric chars followed by -, then 2 alphanumeric chars, then another hyphen, and finally 3 alphanumeric chars at the end of the string.
Use
'^[[:alnum:]]{3}-[[:alnum:]]{2}-[[:alnum:]]{3}$'
Details
^ - start of the string
[[:alnum:]]{3} - 3 alphanumeric chars
- - a hyphen
[[:alnum:]]{2}- - 2 alphanumeric chars and a -
[[:alnum:]]{3} - 3 alphanumeric chars
$ - end of string.
See MySQL "Regular Expression Syntax":
Character Class Name | Meaning
alnum | Alphanumeric characters
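If you want to try the pattern outside the database, a rough Python equivalent is below. Python's re module does not support POSIX classes such as [[:alnum:]], so this sketch assumes ASCII data and uses [A-Za-z0-9] instead; has_expected_shape and the sample names are hypothetical.

```python
import re

# ASCII stand-in for [[:alnum:]], which Python's re module does not support.
PATTERN = re.compile(r"[A-Za-z0-9]{3}-[A-Za-z0-9]{2}-[A-Za-z0-9]{3}")


def has_expected_shape(name):
    """True when name is exactly XXX-XX-XXX with alphanumeric X."""
    return PATTERN.fullmatch(name) is not None


names = ["abc-12-xyz", "abcd-12-xyz", "ab-123-xyz", "abcdefghij"]
bad = [n for n in names if not has_expected_shape(n)]
print(bad)  # ['abcd-12-xyz', 'ab-123-xyz', 'abcdefghij']
```

Note that 'ab-123-xyz' and 'abcdefghij' are both 10 characters long, so a length-only check such as '^.{10}$' would accept them.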

MYSQL - Order column by ASCII Value

I have a column with texts, sorted by ASCII it should be ordered as:
- (hyphen)
0
1 (numbers)
2
A (uppercase)
B
_ (underscore)
a
b (lowercase)
c
However it is being ordered as:
- (hyphen)
0
1 (numbers)
2
a
b (lowercase)
c
A
B (uppercase)
C
_ (underscore)
How can I do the sorting by ASCII value?
The sort order is controlled by the collation. You can use the BINARY collation to sort by raw bytes, which in the case of ASCII data will cause it to sort by ASCII value. See https://dev.mysql.com/doc/refman/5.7/en/sorting-rows.html
SELECT ...
FROM mytable
ORDER BY BINARY mycolumn
This will be more flexible than using the ASCII() function because that function only returns the ASCII value of the first character. Using the BINARY collation allows sorting by the full string.
You could use ASCII:
SELECT *
FROM tab
ORDER BY ASCII(col_name) ASC
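The difference between the two approaches can be sketched in Python (an illustration only, assuming pure ASCII data; the sample values are made up). Sorting on the encoded bytes plays the role of ORDER BY BINARY, and sorting on the first character's code plays the role of ORDER BY ASCII(col_name).

```python
values = ["b", "_", "A", "2", "a", "-", "B", "0", "c", "1"]

# Sorting on the raw bytes mirrors ORDER BY BINARY for ASCII data.
by_bytes = sorted(values, key=lambda s: s.encode("ascii"))
print(by_bytes)  # ['-', '0', '1', '2', 'A', 'B', '_', 'a', 'b', 'c']

# Sorting on the first character's code only compares one character,
# so strings sharing an initial are not ordered among themselves.
words = ["ab", "aa", "Ab"]
first_char_only = sorted(words, key=lambda s: ord(s[0]))
full_bytes = sorted(words, key=lambda s: s.encode("ascii"))
print(first_char_only)  # ['Ab', 'ab', 'aa']
print(full_bytes)       # ['Ab', 'aa', 'ab']
```

Python's sort is stable, so the tied strings keep their input order here; in SQL the order of rows that tie on the ORDER BY expression is not guaranteed at all.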

MySQL REGEXP does not work properly with Russian chars

SELECT 'Hello' REGEXP '^[^aeiouAEIOU][A-Za-z]*$' -> 1
SELECT 'Привет' REGEXP '^[^аеиоуыэюяАЕИОУЫЭЮЯ][А-Яа-я]*$' -> 0 - it must return 1.
MySQL's REGEXP only works with bytes. Russian characters are each 2 bytes.
For limiting to Cyrillic, this seems to be correct:
SELECT HEX('Привет') REGEXP '^((D0|D1)..)+$'; -- > 1
(I'll get to the issue of avoiding leading vowels in a minute.)
To explain:
All Russian characters are 2 bytes, the first byte being hex D0 or D1. (There may be non-Russian characters starting that way; I'm ignoring that problem.)
(...|...) - | means 'OR'.
.. matches two hex digits, that is, the second byte of the character, which can be anything (this is overkill but may not hurt).
(...)+ - The plus means one or more occurrence.
The ^ and $ 'anchor' the regexp to encompass the entire string.
Back to the no-leading-vowel issue. Now we need to play some painful games to list the vowels; their HEX values are
D0, followed by any of B0 B5 B8 BE (lowercase а е и о)
90 95 98 9E A3 AB AD AE AF (uppercase А Е И О У Ы Э Ю Я), or
D1, followed by any of 83 8B 8D 8E 8F (lowercase у ы э ю я)
Example: select hex('э'); --> D18D.
Putting it all together will be messy because MySQL does not have the (?...) tools to say "not". So, I will start by testing for a leading vowel:
SELECT HEX('Привет')
REGEXP '^(D0(B0|B5|B8|BE|90|95|98|9E|A3|AB|AD|AE|AF)|D1(83|8B|8D|8E|8F))'
correctly fails (returns 0). Both alternatives sit inside one group, so the ^ anchor applies to each of them.
Now to put things together:
SELECT NOT HEX('Привет')
REGEXP '^(D0(B0|B5|B8|BE|90|95|98|9E|A3|AB|AD|AE|AF)|D1(83|8B|8D|8E|8F))'
AND HEX('Привет')
REGEXP '^((D0|D1)..)+$';
The first part checks for NOT a leading vowel; the second part checks that all characters are Russian.
That test case works, and 'э' came back with 0, but I may have goofed somewhere.
(That was a challenge.)
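Since hand-listing hex values is error-prone, here is a small Python sketch that computes them and runs the same two tests on the hex string. It is only a cross-check, not MySQL; it assumes UTF-8 data, and mysql_hex and is_cyrillic_without_leading_vowel are made-up names. One thing it makes visible: the uppercase У Ы Э Ю Я encode with a D0 lead byte (for example У is D0A3), while only the lowercase у ы э ю я use D1.

```python
import re

VOWELS = "аеиоуыэюяАЕИОУЫЭЮЯ"


def mysql_hex(text):
    """Uppercase hex of the UTF-8 bytes, like MySQL's HEX() on a utf8 string."""
    return text.encode("utf-8").hex().upper()


# Computed rather than typed by hand, e.g. 'э' -> D18D, 'У' -> D0A3.
vowel_hex = {v: mysql_hex(v) for v in VOWELS}

# Both alternatives are inside one group so that ^ anchors each of them.
LEADING_VOWEL = re.compile(
    r"^(D0(B0|B5|B8|BE|90|95|98|9E|A3|AB|AD|AE|AF)|D1(83|8B|8D|8E|8F))")
ALL_CYRILLIC = re.compile(r"^((D0|D1)..)+$")


def is_cyrillic_without_leading_vowel(word):
    h = mysql_hex(word)
    return LEADING_VOWEL.search(h) is None and ALL_CYRILLIC.search(h) is not None


print(vowel_hex["э"], vowel_hex["У"])               # D18D D0A3
print(is_cyrillic_without_leading_vowel("Привет"))  # True
print(is_cyrillic_without_leading_vowel("э"))       # False
print(is_cyrillic_without_leading_vowel("Ура"))     # False
print(is_cyrillic_without_leading_vowel("Hello"))   # False
```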

mysql where string ends with numbers

My table column contains values like:
id | item
-------------
1 | aaaa11a112
2 | aa1112aa2a
3 | aa11aa1a11
4 | aaa2a222aa
I want to select only rows where value of item ends with numbers.
Is there something like this?
select * from table where item like '%number'
You can use REGEXP and character class
select * from table where item REGEXP '[[:digit:]]$'
Explanation:
[[:digit:]] >> Match digit characters
$ >> Match at the end of the string
Within a bracket expression (written using [ and ]), [:character_class:] represents a character class that matches all characters belonging to that class.
SIDENOTE:
Other helpful mysql character classes to use with REGEXP, taken from the documentation:
Character Class Name | Meaning
alnum | Alphanumeric characters
alpha | Alphabetic characters
blank | Whitespace characters
cntrl | Control characters
digit | Digit characters
graph | Graphic characters
lower | Lowercase alphabetic characters
print | Graphic or space characters
punct | Punctuation characters
space | Space, tab, newline, and carriage return
upper | Uppercase alphabetic characters
xdigit | Hexadecimal digit characters
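For a quick sanity check of the "digit at the end of the string" idea against the sample rows, the same test can be sketched in Python (an illustration, not MySQL; [0-9] stands in for [[:digit:]], which Python's re module does not support, and \Z is Python's end-of-string anchor).

```python
import re

items = ["aaaa11a112", "aa1112aa2a", "aa11aa1a11", "aaa2a222aa"]

# A digit immediately before the end of the string.
ENDS_WITH_DIGIT = re.compile(r"[0-9]\Z")

matching = [item for item in items if ENDS_WITH_DIGIT.search(item)]
print(matching)  # ['aaaa11a112', 'aa11aa1a11']
```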
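You can use REGEXP:
select * from table where RIGHT(item ,1) REGEXP '^-?[0-9]+$';
Yes, you can use LIKE, but only for one specific trailing digit:
select * from table where item like '%1'
This matches only items ending in 1. LIKE has no wildcard for "any digit", so covering all digits takes one condition per digit.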

MySQL select UTF-8 string with '=' but not with 'LIKE'

I have a table with some words that come from medieval books and have some accented letters that no longer exist in the modern latin1 alphabet. I can represent these letters easily with Unicode combining characters. For example, to create a "J" with a tilde, I use the sequence \u004A+\u0303, and the J becomes accented with a tilde.
The table uses utf8 encoding and the field collation is utf8_unicode_ci.
My problem is the following: If I try to select the entire string, I receive the correct answer. If I try to select using 'LIKE', I receive the wrong answer.
For example:
mysql> select word, hex(word) from oldword where word = 'hua';
+--------+--------------+
| word | hex(word) |
+--------+--------------+
| hũa | 6875CC8361 |
| huã | 6875C3A3 |
| hua | 687561 |
| hũã | 6875CC83C3A3 |
+--------+--------------+
4 rows in set (0,04 sec)
mysql> select word, hex(word) from oldword where word like 'hua';
+-------+------------+
| word | hex(word) |
+-------+------------+
| huã | 6875C3A3 |
| hua | 687561 |
+-------+------------+
2 rows in set (0,04 sec)
I don't want to search only the entire word. I want to search words that start with some substring. Eventually the searched word is the entire word.
How could I select the partial string using like and match all the strings?
I tried to create a custom collation using this information, but the server became unstable and only after a lot of trials and errors I was able to revert to the utf8_unicode_ci collation again and the server returned to normal condition.
EDIT: There's a problem with this site and some characters don't display correctly. Please see the results on these pastebins:
http://pastebin.com/mckJTLFX
http://pastebin.com/WP87QvgB
After seeing Marcus Adams' answer I realized that the REPLACE function could be the solution for this problem, although he didn't mention this function.
I have only two different combining characters (acute and tilde), combined with other ASCII characters: for example, j with tilde, j with acute, m with tilde, s with tilde, and so on. So I just have to remove these two characters when using LIKE.
After searching the manual, I learned about the UNHEX function that helped me to properly represent the combining characters alone in the query to remove them.
The combining tilde is represented by CC83 in HEX code and the acute is represented by CC81 in HEX.
So, the query that solves my problem is this one.
SELECT word, REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
FROM oldword WHERE REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
LIKE 'hua%';
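What the nested REPLACE does to the stored bytes can be sketched in Python (illustration only; strip_marks is a made-up name, and the four words are the ones from the question, written with escapes so the combining marks are unambiguous).

```python
COMBINING_TILDE = "\u0303"  # UTF-8 bytes CC 83
COMBINING_ACUTE = "\u0301"  # UTF-8 bytes CC 81


def strip_marks(word):
    """Remove the two combining marks, like the nested REPLACE() calls."""
    return word.replace(COMBINING_TILDE, "").replace(COMBINING_ACUTE, "")


def hex_of(word):
    return word.encode("utf-8").hex().upper()


words = ["hu\u0303a", "hu\u00e3", "hua", "hu\u0303\u00e3"]
for w in words:
    print(hex_of(w), "->", hex_of(strip_marks(w)))
# 6875CC8361 -> 687561
# 6875C3A3 -> 6875C3A3
# 687561 -> 687561
# 6875CC83C3A3 -> 6875C3A3
```

The precomposed ã (C3 A3) is a single character and is left untouched by the replacement; in the question's output, LIKE already matched it against a under utf8_unicode_ci.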
The problem is that LIKE performs the comparison character-by-character and when using the "combining tilde", it literally is two characters, though it displays as one (assuming your client supports displaying it as such).
There will never be a case where comparing e.g. hu~a to hua character-by-character will match because it's comparing ~ with a for the third character.
Collations (and coercions) work in your favor and handle such things when comparing the string as a whole, but not when comparing character-by-character.
Even if you considered using SUBSTRING() as a hack instead of using LIKE with a wildcard % to perform a prefix search, consider the following:
SELECT SUBSTRING('hũa', 1, 3) = 'hua'
-> 0
SELECT SUBSTRING('hũa', 1, 4) = 'hua'
-> 1
You kind of have to know the length you're going for or brute force it like this:
SELECT * FROM oldword
WHERE SUBSTRING(word, 1, 3) = 'hua'
OR SUBSTRING(word, 1, 4) = 'hua'
OR SUBSTRING(word, 1, 5) = 'hua'
OR SUBSTRING(word, 1, 6) = 'hua'
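The reason the prefix length varies is that each combining mark counts as its own character. A small Python sketch (illustrative only; code_points_for_prefix is a hypothetical helper) shows how many code points are needed to cover a given number of visible letters:

```python
import unicodedata


def code_points_for_prefix(word, letters):
    """Number of code points covering the first `letters` visible letters,
    counting each base character together with its combining marks."""
    seen = 0
    for index, ch in enumerate(word):
        if unicodedata.combining(ch) == 0:  # a base (non-combining) character
            if seen == letters:
                return index
            seen += 1
    return len(word)


word = "hu\u0303a"  # h, u, combining tilde, a
print(len(word))                        # 4
print(code_points_for_prefix(word, 3))  # 4
print(code_points_for_prefix("hua", 3)) # 3
```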
According to this:
ũ collates equal to plain U in all utf8 collations on 5.6.
j́ collates equal to plain J in most collations; exceptions:
utf8_general*ci because it is actually j plus an accent. And the "general" collations only look at one character (as distinguished from byte) at a time. Most collations take into consideration multiple characters, such as ch or ll in Spanish or ss in German.
utf8_roman_ci, which is quite an oddball. j́=i=j
(LIKE does not exactly follow the regular rules of collation. I am not versed in the details, but I think the fact that the accented J is represented as 2 characters causes it to work differently in LIKE than in WHERE or ORDER BY. Furthermore, I don't know whether REPLACE() collates like LIKE or like the other places.)
You can use the % symbol like a wildcard character. For example this:
SELECT word
FROM myTable
WHERE word LIKE 'hua%';
This will pull all records that start with hua and have 0+ characters following it. Here is an SQL Fiddle example.