How can I find non-keyboard characters in MySQL? - mysql

This is related to the question How can I find non-ASCII characters in MySQL?.
I want to check col1 and col2 in the table below for rows that contain non-keyboard characters.
+------------+----------+
| col1 | col2 |
+------------+----------+
| rewweew\s | 4rtrt |
| é | é |
| 123/ | h|h |
| ëû | û |
| ¼ | ¼ |
| *&^ | *%$ |
| #$ | ~!` |
+------------+----------+
My desired result will look like
+--------+-------+
| é | é |
| ëû | û |
| ¼ | ¼ |
+--------+-------+
In my case all characters present on an English keyboard are allowed; I only need to find the rows that contain characters not present on an English keyboard, such as Chinese characters.
I got the below mentioned query from the link How can I find non-ASCII characters in MySQL?
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
But it's not working, because the characters ~`#!#$%^&*()_-+=|}]{[':;?/>.<, are also allowed and the query does not account for them.

This may be worth a try.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns unconvertible characters into
replacement characters. The converted and unconverted text will then be unequal.
Of course it's based on what is and isn't in the ASCII character repertoire, not what's on a particular keyboard. But it should probably do the trick for you. See this for more discussion.
http://dev.mysql.com/doc/refman/5.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
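The same round-trip idea can be tried outside the database. This is only a rough Python sketch of the comparison, not what MySQL does internally; the function name has_non_ascii and the sample rows (taken from col1 in the question) are illustrative assumptions.

```python
def has_non_ascii(text: str) -> bool:
    # Round-trip through ASCII; characters that cannot be converted
    # become '?', so the result no longer equals the input.
    converted = text.encode("ascii", errors="replace").decode("ascii")
    return converted != text


# Sample values from col1 in the question.
rows = ["rewweew\\s", "é", "123/", "ëû", "¼", "*&^", "#$"]
flagged = [r for r in rows if has_non_ascii(r)]
print(flagged)
```

Swapping "ascii" for another codec name (for example "cp1257") mirrors the CONVERT(... USING cp1257) variant.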
Edit
Your comment mentioned that you also need to detect some characters that are in the ASCII character set. I think you're asking about the so-called control characters, which have values from 0x00 to 0x1f, and then 0x7f. @Joni Salonen's approach helps get us there, but we need to do it in a way that's multibyte-character safe.
SELECT whatever
FROM tableName
WHERE CONVERT(columnToCheck USING ASCII) <> columnToCheck
OR CONVERT(columnToCheck USING ASCII) RLIKE '[[.NUL.]-[.US.][.DEL.]]'
If you look at http://www.asciitable.com/, you'll see that the OR clause here detects characters in the first column of the ASCII table, and the last character in the fourth column.
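For checking the logic away from MySQL, the two conditions (non-ASCII, or an ASCII control character) can be sketched in Python. The helper name needs_attention is made up for this example, and the character class simply spells out the 0x00-0x1f and 0x7f ranges mentioned above.

```python
import re

# ASCII control characters: 0x00-0x1f plus DEL (0x7f).
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def needs_attention(text: str) -> bool:
    # True when the text has a non-ASCII character or an ASCII control character.
    return (not text.isascii()) or CONTROL_CHARS.search(text) is not None


for sample in ["~!`", "é", "bell\x07", "tab\there"]:
    print(repr(sample), needs_attention(sample))
```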

This query will return the rows that have characters outside of the ASCII range 0 - 127:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '^[[.NUL.]-[.DEL.]]*$'
By English keyboard, do you mean an American or a UK keyboard? The UK keyboard includes some non-ASCII characters, like the pound sterling symbol (£). If you want to accept those too, you have to add them to the regular expression.

Related

MySQL - Special characters in column value

I have a large data set (approximately 600,000 rows).
I want rows with the value "word's" to appear in the results.
Special characters should be completely ignored.
TABLE:
| column_value |
| ------------- |
| word's |
| hello |
| world |
QUERY: select * from table where column_value like '%words%'
RESULTS:
| column_value |
| ------------- |
| word's |
I want rows containing special characters to appear, with the special characters ignored when matching.
Can you please help me achieve this with a fast runtime?
You can use replace to remove the "special" character prior to matching.
SELECT *
FROM table
WHERE replace(column_value, '''', '') LIKE '%words%';
Nest the replace() calls for other characters.
Or you try it with regular expressions.
SELECT *
FROM table
WHERE column_value REGEXP 'w[^a-zA-Z]*o[^a-zA-Z]*r[^a-zA-Z]*d[^a-zA-Z]*s';
[^a-zA-Z]* matches optional characters, that are not a, ..., y and z and not A, ..., Y and Z, so this matches your search word also with any non alphas between the letters.
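Writing that pattern by hand gets tedious for longer search words. Here is a small Python sketch that builds it from the search word; loose_pattern is a hypothetical helper name, and the result would still have to be passed to MySQL as the REGEXP argument by your application.

```python
import re


def loose_pattern(word: str) -> str:
    # Put "[^a-zA-Z]*" between every pair of letters so that any
    # non-alphabetic characters between them are tolerated.
    return "[^a-zA-Z]*".join(re.escape(ch) for ch in word)


pattern = loose_pattern("words")
print(pattern)

# Quick local check against the sample column values.
values = ["word's", "hello", "world"]
matches = [v for v in values if re.search(pattern, v)]
print(matches)
```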
Or you have a look at the options full text search brings with it. Maybe that can help too.
For full-text search you must add a FULLTEXT index on your column_value.
MySQL doc

MySql Regexp result word part of known word

I've been struggling with this for a while.
Is there a way to find all rows in my table where the word in the column 'word' is a part of a search word?
+---------+-----------------+
| id_word | word |
+---------+-----------------+
| 177041 | utvälj |
| 119270 | fonders |
| 39968 | flamländarens |
| 63567 | hänvisningarnas |
| 61244 | hovdansers |
+---------+-----------------+
I want to extract the row 119270, fonders. I want to do this by passing in the word 'plafonders'.
SELECT * FROM words WHERE word REGEXP 'plafonders$'
That query will of course not work in this case, would've been perfect if it had been the other way around.
Does anyone know a solution to this?
SELECT * FROM words WHERE 'plafonders' REGEXP concat(word, '$')
should accomplish what you want. Your regex:
plafonders$
is looking for plafonders at the end of the column. This is looking for everything the column has until its end, e.g. the regexp is fonders$ for 119270.
See https://regex101.com/r/Ytb3kg/1/ compared to https://regex101.com/r/Ytb3kg/2/.
MySQL's REGEXP does not handle accented letters very well. Perhaps it will work OK in your limited situation.
Here's a slightly faster approach (though it still requires a table scan):
SELECT * FROM words
WHERE RIGHT('PLAutvälj', CHAR_LENGTH(word)) = word;
(To check the accents, I picked a different word from your table.)
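Both answers test whether the stored word is a suffix of the search word. A minimal Python sketch of that check, using the sample rows from the question (the dictionary layout and the function name suffix_matches are assumptions made for the example):

```python
# id_word -> word, copied from the sample table in the question.
words = {
    177041: "utvälj",
    119270: "fonders",
    39968: "flamländarens",
    63567: "hänvisningarnas",
    61244: "hovdansers",
}


def suffix_matches(search: str, table: dict) -> list:
    # Keep the ids whose word is the tail end of the search word.
    return [id_word for id_word, word in table.items() if search.endswith(word)]


print(suffix_matches("plafonders", words))
print(suffix_matches("PLAutvälj", words))
```

Note that str.endswith is case-sensitive, whereas the MySQL comparison follows the column's collation, so the two can differ on mixed-case data.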

How to identify a language in utf-8 column in MySQL

My question is how to find specific character set from utf-8 column in MySQL server?
Please note that this is NOT a duplicate question; please read carefully what's asked, not what you think is asked.
Currently MySQL works perfectly with utf-8 and shows all the different languages, and I don't have any problem seeing different languages in the database. I use SQLyog to connect to the MySQL server and all SELECT results are perfect: I can see Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic or any other language mixed together, and they all display perfectly. my.ini and the scripts are also configured correctly and working well.
Here, How can I find non-ASCII characters in MySQL?, I see that some people answered the question, and their answers work well for finding non-ASCII text. My question is similar, but a little different. I want to find a specific character set in a utf-8 column in MySQL server.
let's say,
select * from TABLE where COLUMN regexp '[^ -~]';
It returns all non-ASCII characters, including Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic or any other language, but what I want is
SELECT * from TABLE WHERE COLUMN like or regexp'Japanese text only?'
In other words, I want to SELECT only Japanese-encoded text. Currently I can see all types of languages with this:
select * from TABLE where COLUMN regexp '[^ -~]';
But I want to select only Japanese or Russian or Arabic or French text. How can I do that?
The database contains rows in all languages mixed together, in UTF-8. I am not sure whether this is possible in MySQL Server. If it is not possible, how else can I do this?
Thanks a lot!
Well, let's start with a table I put in here. It says, for example, that E381yy is the utf8 encoding for Hiragana and E383yy is Katakana (Japanese). (Kanji is another matter.)
To see if a utf8 column contains Katakana, do something like
WHERE HEX(col) REGEXP '^(..)*E383'
Cyrillic might be
WHERE HEX(col) REGEXP '^(..)*D[0-4]'
Chinese is a bit tricky, but this might usually work for Chinese (and Kanji?):
WHERE HEX(col) REGEXP '^(..)*E[4-9A]'
(I'm going to change your Title to avoid the keyword 'character set'.)
Western Europe (including, but not limited to, French) C[23], Turkish (approx, and some others) (C4|C59), Greek: C[EF], Hebrew: D[67], Indian, etc: E0, Arabic/Farsi/Persian/Urdu: D[89AB]. (Always prefix with ^(..)*.)
You may notice that these are not necessarily very specific. This is because of overlaps. British English and American English cannot be distinguished except by spelling of a few words. Several accented letters are shared in various ways in Europe. India has many different character sets: Devanagari, Bengali, Gurmukhi, Gujarati, etc.; these are probably distinguishable, but it would take more research. I think Arabic/Farsi/Persian/Urdu share one character set.
Some more:
| SAMARITAN | E0A080 | E0A0BE |
| DEVANAGARI | E0A480 | E0A5BF |
| BENGALI | E0A681 | E0A7BB |
| GURMUKHI | E0A881 | E0A9B5 |
| GUJARATI | E0AA81 | E0ABB1 |
| ORIYA | E0AC81 | E0ADB1 |
| TAMIL | E0AE82 | E0AFBA |
| TELUGU | E0B081 | E0B1BF |
| KANNADA | E0B282 | E0B3B2 |
| MALAYALAM | E0B482 | E0B5BF |
| SINHALA | E0B682 | E0B7B4 |
| THAI | E0B881 | E0B99B |
| LAO | E0BA81 | E0BB9D |
| TIBETAN | E0BC80 | E0BF94 |
So, for DEVANAGARI, '^(..)*E0A[45]'
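To see why the ^(..)* prefix matters, here is a Python sketch that mimics HEX(col) REGEXP '^(..)*...'. The helper name hex_has_lead and the sample strings are assumptions for illustration. The prefix keeps the match aligned to whole bytes (two hex digits each), so a pattern cannot match across the boundary between two unrelated bytes.

```python
import re


def hex_has_lead(text: str, lead_pattern: str) -> bool:
    # Rough equivalent of: HEX(col) REGEXP '^(..)*<lead_pattern>'
    hex_str = text.encode("utf-8").hex().upper()
    return re.search(r"^(?:..)*(?:" + lead_pattern + ")", hex_str) is not None


print(hex_has_lead("ト", "E383"))      # Katakana letter, encoded as E38388
print(hex_has_lead("Я", "D[0-4]"))     # Cyrillic letter, encoded as D0AF
print(hex_has_lead("क", "E0A[45]"))    # Devanagari letter, encoded as E0A495

# Plain ASCII "N83" has hex 4E3833, which contains "E383" at an odd offset.
# The byte-aligned pattern correctly rejects it.
print("E383" in "N83".encode("utf-8").hex().upper())
print(hex_has_lead("N83", "E383"))
```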

How to match hyphen delimited in any order

I need to match a set of characters delimited by a hyphen - for example:
B-B/w-W/Br-W-Br
Where the / are part of what I need, up to 20 spaces.
G-R-B, G/R-B-B/W-O
So I need a regex that covers between the -'s in any order (G-R-B could also be R-B-G)
I've been playing around with a bunch of combo's, but I can't come up with something that will match any order.
The plan is to search this way using mysql. So, it'll be something like
select * from table1 where pinout REGEXP '';
I just can't get the regex right :/
Description
This expression will match the string providing each of the hyphen delimited values are included in the string. The color values can appear in the string in any order so this expression will match W/Br-b-B/w and B/w-W/Br-b... or any other combinations which include those colors.
^ # match the start of the string
(?=.*?(?:^|-)W\/Br(?=-|$)) # require the string to have a w/br
(?=.*?(?:^|-)b(?=-|$)) # require the string to have a b
(?=.*?(?:^|-)B\/w(?=-|$)) # require the string to have a b/w
.* # match the entire string
MySQL doesn't really support lookarounds, so this will need to be broken into a group of WHERE conditions
mysql> SELECT * FROM dog WHERE ( color REGEXP '.*(^|-)W\/Br(-|$)' and color REGEXP '.*(^|-)b(-|$)' and color REGEXP '.*(^|-)B\/w(-|$)' );
+-------+--------+---------+------+------------+---------------------+
| name | owner | species | sex | birth | color |
+-------+--------+---------+------+------------+---------------------+
| Claws | Gwen | cat | m | 1994-03-17 | B-B/w-W/Br-W-Br |
| Buffy | Harold | dog | f | 1989-05-13 | G-R-B, G/R-B-B/W-O |
+-------+--------+---------+------+------------+---------------------+
See also this working sqlfiddle: http://sqlfiddle.com/#!2/943af/1/0
Documentation on using a regex in a MySQL WHERE clause can be found here: http://dev.mysql.com/doc/refman/5.1/en/pattern-matching.html
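If the filtering can happen in application code instead of in SQL, the "any order" requirement reduces to a set comparison on the hyphen-separated parts. This is only a sketch of that alternative, not the regex approach above; has_all is a hypothetical name, and the comparison ignores case to resemble MySQL's default case-insensitive REGEXP.

```python
def has_all(pinout: str, required: list) -> bool:
    # Split on hyphens and compare case-insensitively; order is irrelevant.
    parts = {part.strip().lower() for part in pinout.split("-")}
    return {r.lower() for r in required} <= parts


print(has_all("W/Br-b-B/w", ["W/Br", "b", "B/w"]))
print(has_all("B/w-W/Br-b", ["W/Br", "b", "B/w"]))
print(has_all("G-R-B", ["R", "B", "G"]))
print(has_all("G-R", ["B"]))
```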
I might have misunderstood from your example, try this:
-*([a-zA-Z/]+)-*
The capture region can be altered to include your specific letters of interest, e.g. [GRBWOgrbwo/].
Edit: I don't think this will help you in the context you're using it, but I'll leave it here for posterity.

MySQL - UNHEX(HEX(UTF-8)) issue

I've got a database with UTF-8 characters in it, which are improperly displayed. I figured that I could use UNHEX(HEX(column)) != column condition to know what fields have UTF-8 characters in them. The results are rather interesting:
id | content | HEX(content) | UNHEX(HEX(content)) LIKE '%c299%' | UNHEX(HEX(content)) LIKE '%FFF%' | UNHEX(HEX(content))
49829102 | | C299 | 0 | 0 | c299
874625485 | FFF | 464646 | 0 | 1 | FFF
How is this possible and, possibly, how can I find the row with this character in it?
-- edit(2): since my edit has been removed (probably when JamWaffles was fixing my beautiful data table), here it is again: as editor strips out UTF-8 characters, the content in first row is \uc299 (if that's not clear ;) )
-- edit(3): I've figured out what the issue is - the actual representation of UNHEX(HEX(content)) is WRONG - to display my multibyte character I had to do the following: SELECT UNHEX(SUBSTR(HEX(content),1)). Sadly UNHEX(C299) doesn't work as UNHEX(C2)+UNHEX(99) so it's back to the drawing board.
There are two ways to determine if a string contains UTF-8 specific characters. The first is to see if the string has values outside the ASCII character set:
SELECT _utf8 'amńbcd' REGEXP '[^[.NUL.]-[.DEL.]]';
The second is to compare the binary and character lengths:
SELECT LENGTH(_utf8 'amńbcd') <> CHAR_LENGTH(_utf8 'amńbcd');
Both return TRUE.
See http://sqlfiddle.com/#!2/d41d8/9811
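The same two checks are easy to reproduce in Python if you want to verify the behaviour on sample strings first. The function names below are made up for this sketch; len(text.encode("utf-8")) plays the role of LENGTH() and len(text) the role of CHAR_LENGTH().

```python
import re


def outside_ascii(text: str) -> bool:
    # First approach: any character outside the ASCII range 0x00-0x7f.
    return re.search(r"[^\x00-\x7f]", text) is not None


def has_multibyte(text: str) -> bool:
    # Second approach: byte length differs from character length.
    return len(text.encode("utf-8")) != len(text)


for sample in ["amńbcd", "FFF", "\u0099"]:
    print(repr(sample), outside_ascii(sample), has_multibyte(sample))
```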