MySQL select UTF-8 string with '=' but not with 'LIKE' - mysql

I have a table with some words that come from medieval books and have some accented letters that don't exist anymore in the modern latin1 alphabet. I can represent these letters easily with Unicode combining characters. For example, to create a "J" with a tilde, I use the sequence \u004A+\u0303 and the J is displayed with a tilde.
The table uses utf8 encoding and the field collation is utf8_unicode_ci.
My problem is the following: If I try to select the entire string, I receive the correct answer. If I try to select using 'LIKE', I receive the wrong answer.
For example:
mysql> select word, hex(word) from oldword where word = 'hua';
+--------+--------------+
| word | hex(word) |
+--------+--------------+
| hũa | 6875CC8361 |
| huã | 6875C3A3 |
| hua | 687561 |
| hũã | 6875CC83C3A3 |
+--------+--------------+
4 rows in set (0,04 sec)
mysql> select word, hex(word) from oldword where word like 'hua';
+-------+------------+
| word | hex(word) |
+-------+------------+
| huã | 6875C3A3 |
| hua | 687561 |
+-------+------------+
2 rows in set (0,04 sec)
I don't want to search only for the entire word. I want to search for words that start with some substring. Sometimes the search term happens to be the entire word.
How could I select the partial string using like and match all the strings?
I tried to create a custom collation using this information, but the server became unstable, and only after a lot of trial and error was I able to revert to the utf8_unicode_ci collation and return the server to its normal condition.
EDIT: There's a problem with this site and some characters don't display correctly. Please see the results on these pastebins:
http://pastebin.com/mckJTLFX
http://pastebin.com/WP87QvgB
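The hex dumps in the query output above can be reproduced outside MySQL. The following is a minimal Python sketch, not part of the original question; the helper name utf8_hex is hypothetical, and it only assumes that the stored words use U+0303 (combining tilde) and U+00E3 (precomposed ã), as the HEX() column suggests.

```python
import unicodedata

# Assumption taken from the HEX() output: U+0303 is the combining tilde in use.
COMBINING_TILDE = "\u0303"

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as upper-case hex, similar to MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()

words = {
    "u + combining tilde, then a": "hu" + COMBINING_TILDE + "a",
    "precomposed a-tilde": "hu\u00e3",
    "plain": "hua",
}

for label, word in words.items():
    # len() counts code points, so the combining form is one longer than it looks.
    print(f"{label:30} {len(word)} code points  {utf8_hex(word)}")

# J with a tilde has no precomposed code point, so NFC normalization
# cannot fold it into a single character.
j_tilde = unicodedata.normalize("NFC", "J" + COMBINING_TILDE)
print(len(j_tilde))  # 2
```

The last line is the reason combining characters are needed here at all: normalizing to a precomposed form is not an option for letters such as J with a tilde.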

After seeing Marcus Adams' answer I realized that the REPLACE function could be the solution for this problem, although he didn't mention this function.
I have only two different combining characters (acute and tilde), combined with other ASCII characters: for example j with tilde, j with acute, m with tilde, s with tilde, and so on. So I just have to remove these two characters when using LIKE.
After searching the manual, I learned about the UNHEX function that helped me to properly represent the combining characters alone in the query to remove them.
The combining tilde is represented by CC83 in HEX code and the acute is represented by CC81 in HEX.
So, the query that solves my problem is this one.
SELECT word, REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
FROM oldword WHERE REPLACE(REPLACE(word, UNHEX("CC83"), ""), UNHEX("CC81"), "")
LIKE 'hua%';
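As a rough illustration of what the nested REPLACE calls do, here is a small Python sketch. It is not the MySQL implementation; the function name strip_marks is hypothetical, and the only assumption carried over from the answer is that CC83 and CC81 are the UTF-8 bytes of the two combining marks.

```python
# Assumption from the answer above: these hex strings are the UTF-8 encodings
# of the combining tilde and the combining acute accent.
COMBINING_TILDE = bytes.fromhex("CC83").decode("utf-8")  # U+0303
COMBINING_ACUTE = bytes.fromhex("CC81").decode("utf-8")  # U+0301

def strip_marks(word: str) -> str:
    """Mimic REPLACE(REPLACE(word, UNHEX('CC83'), ''), UNHEX('CC81'), '')."""
    return word.replace(COMBINING_TILDE, "").replace(COMBINING_ACUTE, "")

words = ["hu\u0303a", "hu\u00e3", "hua", "hu\u0303\u00e3"]
for word in words:
    stripped = strip_marks(word)
    # Show the code points before and after removing the combining marks.
    print([f"U+{ord(c):04X}" for c in word], "->", [f"U+{ord(c):04X}" for c in stripped])
```

Note that the precomposed ã (U+00E3) is left untouched by this step. In the query it is the collation, not REPLACE, that makes it compare equal to a plain a.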

The problem is that LIKE performs the comparison character-by-character, and a letter followed by the "combining tilde" literally is two characters, though it displays as one (assuming your client supports displaying it as such).
There will never be a case where comparing e.g. hu~a to hua character-by-character will match because it's comparing ~ with a for the third character.
Collations (and coercions) work in your favor and handle such things when comparing the string as a whole, but not when comparing character-by-character.
Even if you considered using SUBSTRING() as a hack instead of using LIKE with a wildcard % to perform a prefix search, consider the following:
SELECT SUBSTRING('hũa', 1, 3) = 'hua'
-> 0
SELECT SUBSTRING('hũa', 1, 4) = 'hua'
-> 1
You kind of have to know the length you're going for or brute force it like this:
SELECT * FROM oldword
WHERE SUBSTRING(word, 1, 3) = 'hua'
OR SUBSTRING(word, 1, 4) = 'hua'
OR SUBSTRING(word, 1, 5) = 'hua'
OR SUBSTRING(word, 1, 6) = 'hua'
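The position problem described in this answer can be seen by looking at the code points directly. This is a Python sketch only: Python compares raw code points and knows nothing about MySQL collations, so it shows where a positional comparison breaks, not what the server returns. The helper first_mismatch is a hypothetical name.

```python
word = "hu\u0303a"  # h, u, U+0303 COMBINING TILDE, a: four code points, three glyphs
pattern = "hua"

def first_mismatch(left: str, right: str) -> int:
    """Index of the first position where the code points differ, or -1 if none differ."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    return -1

print(len(word))                      # 4
print(first_mismatch(word, pattern))  # 2: the combining tilde is compared with 'a'

# The first three code points stop before the 'a'; four are needed to include it.
print([f"U+{ord(c):04X}" for c in word[:3]])
print([f"U+{ord(c):04X}" for c in word[:4]])
```

This is why the SUBSTRING length has to grow by one for every combining mark in the stored word.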

According to this:
ũ collates equal to plain U in all utf8 collations on 5.6.
j́ collates equal to plain J in most collations; exceptions:
utf8_general*ci because it is actually j plus an accent. And the "general" collations only look at one character (as distinguished from byte) at a time. Most collations take into consideration multiple characters, such as ch or ll in Spanish or ss in German.
utf8_roman_ci, which is quite an oddball. j́=i=j
(LIKE does not exactly follow the regular rules of collation. I am not versed in the details, but I think the fact that j́ is represented as 2 characters causes it to work differently in LIKE than in WHERE or ORDER BY. Furthermore, I don't know whether REPLACE() collates like LIKE or like the other places.)

You can use the % symbol like a wildcard character. For example this:
SELECT word
FROM myTable
WHERE word LIKE 'hua%';
This will pull all records that start with hua and have 0+ characters following it. Here is an SQL Fiddle example.

Related

How to create arabic character class in MySQL using RegExp?

For example, I want to create a character class for Arabic digits, [٩-٠], that contains all the digits.
The corresponding Unicode is [U+0660-U+0669], I tried this:
Select * FROM employees WHERE ID REGEXP [\u{0660}-\u{0669}];
I get this error
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '[\u{0660}-\u{0669}] LIMIT 0, 25' at line 1
https://dev.mysql.com/doc/refman/5.7/en/regexp.html says
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are
not multibyte safe and may produce unexpected results with multibyte
character sets. In addition, these operators compare characters by
their byte values and accented characters may not compare as equal
even if a given collation treats them as equal.
That is, if you use à in a regexp, it will treat the 2-byte utf8 code as 2 separate bytes, hex C3 and A0. If this gives you the 'right' answer, it will be more by 'luck' than design.
This does work:
mysql> SELECT '١' REGEXP '[٩-٠]';
+-----------------------+
| '١' REGEXP '[٩-٠]' |
+-----------------------+
| 1 |
+-----------------------+
But it is just a coincidence. The regexp is something like [x0-x9], where x is the D9 byte, 0 is A0 and 9 is A9. But then the regexp means "any byte x, or between 0 and x, or 9", which is not what you wanted.
This might work for 'all' of Arabic: REGEXP UNHEX('5BD82DDD5D'), but only because 'all' start with hex D8 through DD. (However, there may be other things in that range.) Furthermore, that will only check "Does the string contain an Arabic letter?"; it cannot be used for anything more complex, such as phrases or a subset of letters.
Back to the digit range. Just checking for hex D9 is not safe, because that will include percent sign, superscript letters, and other characters. This may work: REGEXP UNHEX('D95BA02DA95D').
Caveat: Most of what I have said in this answer is untested; I'm inventing a solution in an area where I have no experience (REGEXP with utf8).
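The byte arithmetic behind the UNHEX suggestion can be checked with a short Python sketch. It uses Python's re module on bytes as a stand-in for a byte-wise regexp engine, which is an assumption about how closely the two behave; the function name has_arabic_digit is hypothetical.

```python
import re

zero, nine = "\u0660", "\u0669"  # ARABIC-INDIC DIGIT ZERO and NINE

print(zero.encode("utf-8").hex().upper())  # D9A0
print(nine.encode("utf-8").hex().upper())  # D9A9

# UNHEX('D95BA02DA95D') decodes to: byte D9, then a class covering bytes A0..A9.
pattern = bytes.fromhex("D95BA02DA95D")
print(pattern)

def has_arabic_digit(text: str) -> bool:
    """Byte-wise search for D9 followed by one byte in the range A0..A9."""
    return re.search(pattern, text.encode("utf-8")) is not None

print(has_arabic_digit("\u0661"))  # True: ARABIC-INDIC DIGIT ONE is D9 A1
print(has_arabic_digit("\u066a"))  # False: ARABIC PERCENT SIGN is D9 AA
print(has_arabic_digit("abc"))     # False
```

The percent sign case shows why testing for the D9 byte alone would be too broad, as the answer points out.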

MYSQL - Like statement not working with special characters

I am having an issue with the following:
Inside my table I have the following:
ID Long Latt city
1 n/a n/a Newcastle-upon-Tyne
2 n/a n/a Newcastle Upon Tyne
3 n/a n/a Stoke-on-Trent
4 n/a n/a Stoke on Trent
If someone enters "Newcastle Upon Tyne" in the search, I want both of them to show. My sql statement is:
select * from `properties` where `city` LIKE '%Newcastle Upon Tyne%'
But only one shows. It's a LIKE statement, and "Newcastle-Upon-Tyne" and "Newcastle Upon Tyne" are similar, so why is only the exact match showing in this instance?
The LIKE comparison is returning TRUE only for that one row.
When the pattern contains no wildcards, the LIKE comparison is essentially equivalent to an equality comparison:
SELECT 'ab d' = 'ab d' --> TRUE
, 'ab d' LIKE 'ab d' --> TRUE
The difference is that LIKE supports two wildcard characters in the value on the right side: the percent sign character (%) and the underscore character (_). The % character matches zero, one or more of any character. The _ character matches any one single character.
Compare the results from
city LIKE 'Newcastle_Upon_Tyne'
city LIKE 'Newcastle%Upon%Tyne'
Both of those would evaluate to true for values of city
'Newcastle-Upon-Tyne'
'Newcastle7Upon4Tyne'
Additionally, the one with the percent signs would also evaluate to TRUE for values of city such as
'NewcastleUpon56789Tyne'
'Newcastle FEE- UponFI Tyne'
If you want more precise matching than is provided by the LIKE comparison, you could use a regular expression instead:
city REGEXP 'Newcastle[ -]Upon[ -]Tyne'
This would return TRUE for city values
'Newcastle Upon Tyne'
'Newcastle-Upon Tyne'
'Newcastle Upon-Tyne'
'Newcastle-Upon-Tyne'
Because a space is not a wildcard character. A space is just like any other letter in a like statement. If you do something like this:
select * from `properties` where `city` LIKE '% %'
It will find all of the records that contain a space.
If you want any records that contain the words in that order, regardless of the characters between them, you can do this:
select * from `properties` where `city` LIKE '%Newcastle%Upon%Tyne%'
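The wildcard rules from these answers can be sketched in Python by translating a LIKE pattern into a regular expression. This is an illustration only: the helper names like_to_regex and like are hypothetical, escape sequences such as \% are ignored, and case-insensitive matching is an assumption standing in for a _ci collation.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a LIKE pattern: % is any run of characters, _ is exactly one character."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            # Everything else, including a space or a hyphen, must match literally.
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def like(value: str, pattern: str) -> bool:
    """LIKE has to match the whole value, hence fullmatch rather than search."""
    return like_to_regex(pattern).fullmatch(value) is not None

cities = ["Newcastle-upon-Tyne", "Newcastle Upon Tyne", "Stoke-on-Trent", "Stoke on Trent"]
for pattern in ["%Newcastle Upon Tyne%", "%Newcastle_Upon_Tyne%", "%Newcastle%Upon%Tyne%"]:
    print(pattern, [city for city in cities if like(city, pattern)])
```

With the space written literally only the second row is found. Replacing each space with _ or % lets the hyphenated spelling match as well.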

MySQL: Optimized query to find matching strings from set of strings

I have 10 sets of strings, each set containing 9 strings. Of these 10 sets, all strings in the first set have length 10, those in the second set have length 9, and so on. Finally, all strings in the 10th set have length 1.
There is a common prefix of (length-2) characters in each set, and the prefix length reduces by 1 in the next set. Thus, the first set has 8 characters in common, the second has 7, and so on.
Here is what a sample of 10 sets looks like:
pu3q0k0vwn
pu3q0k0vwp
pu3q0k0vwr
pu3q0k0vwq
pu3q0k0vwm
pu3q0k0vwj
pu3q0k0vtv
pu3q0k0vty
pu3q0k0vtz
pu3q0k0vw
pu3q0k0vy
pu3q0k0vz
pu3q0k0vx
pu3q0k0vr
pu3q0k0vq
pu3q0k0vm
pu3q0k0vt
pu3q0k0vv
pu3q0k0v
pu3q0k0y
pu3q0k1n
pu3q0k1j
pu3q0k1h
pu3q0k0u
pu3q0k0s
pu3q0k0t
pu3q0k0w
pu3q0k0
pu3q0k2
pu3q0k3
pu3q0k1
pu3q07c
pu3q07b
pu3q05z
pu3q0hp
pu3q0hr
pu3q0k
pu3q0m
pu3q0t
pu3q0s
pu3q0e
pu3q07
pu3q05
pu3q0h
pu3q0j
pu3q0
pu3q2
pu3q3
pu3q1
pu3mc
pu3mb
pu3jz
pu3np
pu3nr
pu3q
pu3r
pu3x
pu3w
pu3t
pu3m
pu3j
pu3n
pu3p
pu3
pu9
pud
pu6
pu4
pu1
pu0
pu2
pu8
pu
pv
0j
0h
05
pg
pe
ps
pt
p
r
2
0
b
z
y
n
q
Requirement:
I have a table PROFILES having columns SRNO (type bigint, primary key) and UNIQUESTRING (type char(10), unique key). I want to find 450 SRNOs for matching UNIQUESTRINGs from those 10 sets.
First find strings like in the first set. If we don't get enough results (ie. 450), find strings like in second set. If we still don't get enough results (450 minus results of first set) find strings like in third set. And so on.
Existing Solution:
I've written query something like:
select srno from profiles
where ( (uniquestring like 'pu3q0k0vwn%')
or (uniquestring like 'pu3q0k0vwp%') -- all those above uniquestrings after this and finally the last one
or (uniquestring like 'n%')
or (uniquestring like 'q%')
)
limit 450
However, after getting feedback from Rick James in this answer I realized this is not an optimized query, as it touches many more rows than it needs to.
So I plan to rewrite the query like this:
(select srno from profiles where uniquestring like 'pu3q0k0vwn%' LIMIT 450)
UNION DISTINCT
(select srno from profiles where uniquestring like 'pu3q0k0vwp%' LIMIT 450); -- and more such clauses after this for each uniquestring
I would like to know if there are any better solutions to do this.
SELECT ...
WHERE str LIKE 'pu3q0k0vw%' AND -- the 10-char set, first shared prefix
str REGEXP '^pu3q0k0vw[nprqmj]' -- the 6 final letters under that prefix
LIMIT ...
# then check for 450; if not enough, continue...
SELECT ...
WHERE str LIKE 'pu3q0k0vt%' AND -- the 10-char set, second shared prefix
str REGEXP '^pu3q0k0vt[vyz]' -- the 3 final letters under that prefix
LIMIT 450
# then check for 450; if not enough, continue...
etc.
SELECT ...
WHERE str LIKE 'pu3q0k0v%' AND -- the 9-char set
str REGEXP '^pu3q0k0v[wyzxrqmtv]' -- the 9 next letters
LIMIT ...
# check, etc; for a total of 10 SELECTs or 450 rows, whichever comes first.
This will be 10+ selects. Each select will be somewhat optimized by first picking rows with a common prefix with LIKE, then it double checks with a REGEXP.
(If you don't like splitting the first set into the inconsistent pu3q0k0vw vs. pu3q0k0vt prefixes, we can discuss things further.)
You say "prefix"; I have coded the LIKE and REGEXP to assume arbitrary text after the prefix given.
UNION is not viable, since it will (I think) gather all the rows before picking 450. Each SELECT will stop at the LIMIT if there is no DISTINCT, GROUP BY, or ORDER BY that requires gathering everything first.
REGEXP is not smart enough to avoid scanning the entire table; adding the LIKE avoids such (except when more than, say, 20% of the rows match the LIKE).
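The "query one set at a time and stop once enough rows are found" idea can be sketched in Python against a sorted list, which loosely plays the role of the unique index on UNIQUESTRING. This is a hypothetical illustration, not the SQL from the answer: the names rows_with_prefix and collect, the tiny data set, and the use of bisect are all assumptions made for the example.

```python
from bisect import bisect_left

def rows_with_prefix(sorted_keys, prefix):
    """Yield keys starting with prefix, touching only the matching range of the sorted list."""
    index = bisect_left(sorted_keys, prefix)
    while index < len(sorted_keys) and sorted_keys[index].startswith(prefix):
        yield sorted_keys[index]
        index += 1

def collect(sorted_keys, tiers, limit):
    """Walk the tiers from longest prefixes to shortest and stop once limit keys are found."""
    found, seen = [], set()
    for prefixes in tiers:
        for prefix in prefixes:
            for key in rows_with_prefix(sorted_keys, prefix):
                if key in seen:
                    continue  # already returned by a longer prefix
                seen.add(key)
                found.append(key)
                if len(found) == limit:
                    return found
    return found

keys = sorted(["pu3q0k0vwn", "pu3q0k0vwp", "pu3q0k0vyx", "pu3q0k1aaa", "zzzzzzzzzz"])
tiers = [
    ["pu3q0k0vwn", "pu3q0k0vwp"],  # first set: full 10-character strings
    ["pu3q0k0vw", "pu3q0k0vy"],    # second set: 9-character prefixes
    ["pu3q0k0v", "pu3q0k1"],       # third set: 8-character prefixes
]
print(collect(keys, tiers, 3))
print(collect(keys, tiers, 450))
```

The seen set matters because every row matched by a long prefix is matched again by the shorter prefixes in later sets.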

MySQL REGEXP matching NO exclamation point and then a word

I have a problem putting together the right REGEXP in MySQL. I have a database that could have something like this:
id | geo
---+--------
1 | NL
2 | US NL
3 | !US
4 | US
these are entries for geo-targeting or geo-blocking. #3 is not US, #1 is NL only. If I want to look for everything for the US I am using:
SELECT * FROM db WHERE geo REGEXP '[[:<:]]US[[:>:]]'
This would return 2, 3 and 4, but I don't want 3. I tried this:
SELECT * FROM db WHERE geo REGEXP '^![[:<:]]US[[:>:]]'
But that looks for everything starting with an exclamation point. I'm looking for a REGEXP to have the word 'US' and NO exclamation point. I just can't figure out how to make a 'doesn't contain' instead of a 'starts with' since they're both done with '^'
You can use this regex:
SELECT * FROM db WHERE geo REGEXP '(^|[^!])[[:<:]]US[[:>:]]';
This matches US as a whole word when it is either at the start of the string or preceded by any character other than !.
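The same pattern can be tried against the sample rows in Python. This is a sketch that assumes Python's \b word boundary is a close enough stand-in for MySQL's [[:<:]] and [[:>:]] markers; the name ALLOWS_US is hypothetical.

```python
import re

# \b is Python's word boundary; it stands in for [[:<:]] and [[:>:]] here.
ALLOWS_US = re.compile(r"(^|[^!])\bUS\b")

rows = {1: "NL", 2: "US NL", 3: "!US", 4: "US"}
matching_ids = [row_id for row_id, geo in rows.items() if ALLOWS_US.search(geo)]
print(matching_ids)  # [2, 4]
```

Row 3 is rejected because the only character before US is the exclamation point, and the start-of-string alternative cannot apply at that position.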

Mysql get values from column having more than certain characters without punctuation

How can I get all the values from mysql table field, having more than 10 characters without any special characters (space, line breaks, colons, etc.)
Let's say I have table name myTable and the field I want to get values from is myColumn.
myColumn
--------
1234
------
123 456
------
123:456
-------
1234
5678
--------
123-456
----------------
1234567890123
So here I would like to get all the field values except first one i.e. 1234
Any help is much appreciated.
Thanks
UPDATE:
Sorry if I was unable to give proper description of my problem. I have tried it again:
If there is count of more than 10 characters without punctuation, then retrieve that as well.
Retrieve all the values which have special characters like line break, spaces, etc.
Yes, I have primary key in this table if this helps.
The logic seems to be "more than 10 characters OR has special punctuation":
where length(mycol) > 10 or
mycol regexp '[^a-zA-Z0-9]'
SELECT MyColumn
FROM MyTable
WHERE MyColumn RLIKE '([a-z0-9].*){10}'
[a-z0-9] matches a normal character.
([a-z0-9].*) matches a normal character followed by anything.
{10} matches the preceding regexp 10 times.
The result is that this matches any value containing at least 10 normal characters, with anything between them.
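Both suggested conditions can be tried on the sample values with a small Python sketch. The function names first_answer and second_answer are hypothetical, and Python's len() counts characters where MySQL's LENGTH() counts bytes, which makes no difference for the ASCII samples used here.

```python
import re

def first_answer(value: str) -> bool:
    """length(mycol) > 10 OR mycol REGEXP '[^a-zA-Z0-9]'"""
    return len(value) > 10 or re.search(r"[^a-zA-Z0-9]", value) is not None

def second_answer(value: str) -> bool:
    """mycol RLIKE '([a-z0-9].*){10}': at least 10 letters or digits, anything in between."""
    # DOTALL lets '.' cross line breaks, an assumption made so multi-line values count too.
    return re.search(r"([a-z0-9].*){10}", value, re.DOTALL) is not None

samples = ["1234", "123 456", "123:456", "1234\n5678", "123-456", "1234567890123"]
for value in samples:
    print(repr(value), first_answer(value), second_answer(value))
```

On these samples the first condition selects every value except 1234, which is what the question asks for, while the second condition on its own only selects the 13-digit value.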