How to identify a language in a utf-8 column in MySQL

My question is: how can I find a specific character set (script) within a utf-8 column on a MySQL server?
Please note that this is NOT a duplicate question; please read carefully what is asked, not what you think is asked.
MySQL currently works perfectly with utf-8 and shows all kinds of languages, and I have no problem viewing different languages in the database. I use SQLyog to connect to the MySQL server and all SELECT results are perfect: Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic and any other languages are mixed together and display perfectly. My my.ini and scripts are also properly configured and working well.
In How can I find non-ASCII characters in MySQL? some people answered the question, and their answers are perfect for finding non-ASCII text. My question is similar, but a little different: I want to find a specific character set within a utf-8 column on a MySQL server.
Let's say
select * from TABLE where COLUMN regexp '[^ -~]';
returns all rows with non-ASCII characters, whether Cyrillic, Japanese, Chinese, Turkish, French, Italian, Arabic or any other language. But what I want is
SELECT * from TABLE WHERE COLUMN like or regexp 'Japanese text only?'
In other words, I want to SELECT only Japanese-encoded text. Currently I can see all languages with this:
select * from TABLE where COLUMN regexp '[^ -~]';
But I want to select only Japanese, Russian, Arabic or French text. How do I do that?
The database contains rows in all of these languages mixed together, in UTF-8. I am not sure whether this is even possible in MySQL Server; if it is not, how else could I do it?
Thanks a lot!

Well, let's start with a table I put in here. It says, for example, that E381yy is the utf8 encoding for Hiragana and E383yy is Katakana (Japanese). (Kanji is another matter.)
To see if a utf8 column contains Katakana, do something like
WHERE HEX(col) REGEXP '^(..)*E383'
Cyrillic might be
WHERE HEX(col) REGEXP '^(..)*D[0-4]'
Chinese is a bit tricky, but this might usually work for Chinese (and Kanji?):
WHERE HEX(col) REGEXP '^(..)*E[4-9A]'
(I'm going to change your Title to avoid the keyword 'character set'.)
Western Europe (including, but not limited to, French): C[23], Turkish (approximate, and some others): (C4|C59), Greek: C[EF], Hebrew: D[67], Indian, etc.: E0, Arabic/Farsi/Persian/Urdu: D[89AB]. (Always prefix the pattern with ^(..)*.)
You may notice that these are not necessarily very specific. This is because of overlaps. British English and American English cannot be distinguished except by spelling of a few words. Several accented letters are shared in various ways in Europe. India has many different character sets: Devanagari, Bengali, Gurmukhi, Gujarati, etc.; these are probably distinguishable, but it would take more research. I think Arabic/Farsi/Persian/Urdu share one character set.
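Putting it together, a full "Japanese only" query might look like the following. This is only a sketch: my_table and my_column are made-up names, and it uses the Hiragana/Katakana prefixes from above, so Kanji-only text would not match.
SELECT *
FROM my_table
WHERE HEX(my_column) REGEXP '^(..)*E38[13]';   -- E381 = Hiragana, E383 = Katakana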
Some more (script name, with the UTF-8 hex of roughly the first and last characters in its block):
| SAMARITAN | E0A080 | E0A0BE |
| DEVANAGARI | E0A480 | E0A5BF |
| BENGALI | E0A681 | E0A7BB |
| GURMUKHI | E0A881 | E0A9B5 |
| GUJARATI | E0AA81 | E0ABB1 |
| ORIYA | E0AC81 | E0ADB1 |
| TAMIL | E0AE82 | E0AFBA |
| TELUGU | E0B081 | E0B1BF |
| KANNADA | E0B282 | E0B3B2 |
| MALAYALAM | E0B482 | E0B5BF |
| SINHALA | E0B682 | E0B7B4 |
| THAI | E0B881 | E0B99B |
| LAO | E0BA81 | E0BB9D |
| TIBETAN | E0BC80 | E0BF94 |
So, for DEVANAGARI, '^(..)*E0A[45]'

Related

MySQL - Searching for CC numbers

I inherited a MySQL server that has CC numbers stored in plaintext. Due to PCI requirements, I need to find the numbers and mask them. The trick is that they are stored in a field with other text as well. I need a way to search for CC numbers and change just those, not the rest of the text.
I have tried the masking feature in MySQL, but it doesn't work for this version. I also looked up a few different sites but can't seem to find anything that will really help with my particular instance.
Edit
To explain better: the previous admin didn't tell the operators not to take CC info through the live chat system. The system uses SSL, but the chat history is stored in plain text in a MySQL DB. The company isn't PCI compliant (as far as getting scanned and the SAQ are concerned), so we cannot have CC numbers stored anywhere, but the numbers are given in the middle of a conversation. If they were in their own column, it wouldn't be a big deal.
EDIT
I have tried using REGEXP just to search for CC numbers, but now I am getting an operand error, which I believe is caused by the lazy quantifiers.
SELECT * FROM table_name Where text regexp '^4[0-9]{12}(?:[0-9]{3})?$'
Any Ideas?
You could potentially use a regular expression to search for 16-19 consecutive digits (using LIKE if the numbers are separated from the text, or just REGEXP).
An example is given here (where {5} is the number of characters to match, and ^...$ anchors the match to the whole value):
mysql> SELECT * FROM pet WHERE name REGEXP '^.{5}$';
+-------+--------+---------+------+------------+-------+
| name | owner | species | sex | birth | death |
+-------+--------+---------+------+------------+-------+
| Claws | Gwen | cat | m | 1994-03-17 | NULL |
| Buffy | Harold | dog | f | 1989-05-13 | NULL |
+-------+--------+---------+------+------------+-------+
Would end up something like:
REGEXP '^[0-9]{16,19}$'
https://dev.mysql.com/doc/refman/8.0/en/pattern-matching.html
And lookie here too:
Regex to match a digit two or four times
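If an upgrade is an option, MySQL 8.0 (and MariaDB 10.0.5+) ships REGEXP_REPLACE, which can mask the digits inside the surrounding chat text. This is only a sketch under that assumption; chat_history and message are made-up names, and the 16-19 digit range follows the suggestion above:
-- find the offending rows first
SELECT * FROM chat_history WHERE message REGEXP '[0-9]{16,19}';
-- then mask the digit runs in place (MySQL 8.0+ / MariaDB 10.0.5+ only)
UPDATE chat_history
SET message = REGEXP_REPLACE(message, '[0-9]{16,19}', '[CC REMOVED]')
WHERE message REGEXP '[0-9]{16,19}';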

Suggested character set for non utf8 columns in mysql

Currently I'm using VARCHAR/TEXT with utf8_general_ci for all character columns in mysql. Now I want to improve database layout/performance.
What I have figured out so far is that it is better to use:
CHAR instead of VARCHAR for fixed length columns as GUIDs or session ids
also use CHAR for small columns having length of 1 or maybe 2?
As I do not want to go so far as to store my GUIDs as BINARY(16), because of handling issues, I'd rather store them as CHAR(32), especially to improve keys. (I would even save 2/3 of the space by switching from utf8 to some 1-byte charset.)
So what will the best character set be for such columns? ASCII? latin1? BINARY? Which collation?
Which character set/collation should I use for other columns where I do not need utf8 support but do need proper sorting? Would BINARY fail there?
Is it good practice to mix different character sets in the same MySQL (InnoDB) table? Or do I get better performance when all columns have the same charset within the same table, or even the same database?
GUID/UUID/MD5/SHA1 values are all hex digits and dashes. For them, use
CHAR(..) CHARACTER SET ascii COLLATE ascii_general_ci
That will allow for A=a when comparing hex strings.
For Base64 things, use either of
CHAR(..) CHARACTER SET ascii COLLATE ascii_bin
BINARY(..)
since A is not semantically the same as a.
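As a sketch of the above (the table and column names are only illustrative):
CREATE TABLE key_examples (
    uuid_txt  CHAR(36) CHARACTER SET ascii COLLATE ascii_general_ci,  -- hex + dashes, A = a
    md5_txt   CHAR(32) CHARACTER SET ascii COLLATE ascii_general_ci,  -- hex only
    b64_token CHAR(24) CHARACTER SET ascii COLLATE ascii_bin          -- Base64, case matters
);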
Further notes...
utf8 spits at you if you give it an invalid 8-bit value.
ascii spits at you for any 8-bit value.
latin1 accepts anything -- hence your problems down the road.
It is quite OK to have different columns in a table having different charsets and/or collations.
The charset/collation on the table is just a default, ripe for overriding at the column definition.
BINARY may be a tiny bit faster than any _bin collation, but not enough to notice.
Use CHAR for columns that are truly fixed length; don't mislead the user by using it for other cases.
%_bin is faster than %_general_ci, which is faster than other collations. Again, you would be hard-pressed to measure a difference.
Never use TINYTEXT or TINYBLOB.
For proper encoding, use the appropriate charset.
For "proper sorting", use the appropriate collation. See example below.
For "proper sorting" where multiple languages are represented, and you are using utf8mb4, use utf8mb4_unicode_520_ci (or utf8mb4_900_ci if using version 8.0). The 520 and 900 refer to Unicode standards; new collations are likely to come in the future.
If you are entirely in Czech, then consider these charsets and collations. I list them in preferred order:
mysql> show collation like '%czech%';
+------------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+------------------+---------+-----+---------+----------+---------+
| utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | -- opens up the world
| utf8_czech_ci | utf8 | 202 | | Yes | 8 | -- opens up most of the world
| latin2_czech_cs | latin2 | 2 | | Yes | 4 | -- kinda like latin1
The rest are "useless":
| cp1250_czech_cs | cp1250 | 34 | | Yes | 2 |
| ucs2_czech_ci | ucs2 | 138 | | Yes | 8 |
| utf16_czech_ci | utf16 | 111 | | Yes | 8 |
| utf32_czech_ci | utf32 | 170 | | Yes | 8 |
+------------------+---------+-----+---------+----------+---------+
7 rows in set (0.00 sec)
More
The reason for using smaller datatypes (where appropriate) is to shrink the dataset, which leads to less I/O, which leads to things being more cacheable, which makes the program run faster. This is especially important for huge datasets; it is less important for small- or medium-sized datasets.
ENUM is 1 byte, yet acts like a string. So you get the "best of both worlds". (There are drawbacks, and there is a 'religious war' among advocates for ENUM vs TINYINT vs VARCHAR.)
Usually columns that are "short" are always the same length. A country_code is always 2 letters, always ascii, always could benefit from case insensitive collation. So CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci is optimal. If you have something that is sometimes 1-char, sometimes 2, then flip a coin; whatever you do won't make much difference.
VARCHAR (up to 255) has an extra 1-byte length attached to it. So, if your strings vary in length at all, VARCHAR is at least as good as CHAR. So simplify your brain processing: "variable length --> VARCHAR".
BIT, depending on version, may be implemented as a 1-byte TINYINT UNSIGNED. If you have only a few bits in your table, it is not worth worrying about.
One of my Rules of Thumb says that if you aren't likely to get a 10% improvement, move on to some other optimization. Much of what we are discussing here is under 10% (space in this case). Still, get in the habit of thinking about it when writing CREATE TABLE. I often see tables with BIGINT and DOUBLE (each 8 bytes) that could easily use smaller columns. Sometimes saving more than 50% (space).
How does "space" translate into "speed". Tiny tables -> a tiny percentage. Huge tables -> In some cases 10x. (That's 10-fold, not 10%.) (UUIDs are one way to get really bad performance on huge tables.)
ENUM
Acts and feels like a string, yet takes only one byte. (One byte translates, indirectly, into a slight speed improvement.)
Practical when fewer than, say, 10 different values.
Impractical if frequently adding a new value -- requires ALTER TABLE, though it can be "inplace".
Suggest starting the list with 'unknown' (or something like that) and making the column NOT NULL (versus NULL).
The character set for the enum needs to be whatever is otherwise being used for the connection. The choice does not matter much unless you have options that collate equal (eg, A versus a).
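A minimal sketch of such an ENUM column (table and value names are made up):
CREATE TABLE tickets (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    status ENUM('unknown','open','closed','on_hold')
           NOT NULL DEFAULT 'unknown'   -- list starts with 'unknown', column is NOT NULL
);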

MySql Regexp result word part of known word

I've been struggling with this for a while.
Is there a way to find all rows in my table where the word in the column 'word' is a part of a search word?
+---------+-----------------+
| id_word | word |
+---------+-----------------+
| 177041 | utvälj |
| 119270 | fonders |
| 39968 | flamländarens |
| 63567 | hänvisningarnas |
| 61244 | hovdansers |
+---------+-----------------+
I want to extract the row 119270, fonders. I want to do this by passing in the word 'plafonders'.
SELECT * FROM words WHERE word REGEXP 'plafonders$'
That query will of course not work in this case; it would have been perfect if it had been the other way around.
Does anyone know a solution to this?
SELECT * FROM words WHERE 'plafonders' REGEXP concat(word, '$')
should accomplish what you want. Your regex:
plafonders$
is looking for plafonders at the end of the column value. The suggested query instead checks whether the search word ends with the column value; e.g. the generated regexp is fonders$ for row 119270.
See https://regex101.com/r/Ytb3kg/1/ compared to https://regex101.com/r/Ytb3kg/2/.
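With the sample rows above, the query generates one pattern per row and keeps only the matching one; something like this (the expected output is based on the rows shown in the question):
SELECT id_word, word
FROM words
WHERE 'plafonders' REGEXP CONCAT(word, '$');
-- expected: 119270 | fonders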
MySQL's REGEXP does not handle accented letters very well. Perhaps it will work OK in your limited situation.
Here's a slightly faster approach (though it still requires a table scan):
SELECT * FROM words
WHERE RIGHT('PLAutvälj', CHAR_LENGTH(word)) = word;
(To check the accents, I picked a different word from your table.)

How can I find the Non keyboard characters in MySQL?

Related question to How can I find non-ASCII characters in MySQL?.
I want to check col1 and col2 in my table below for rows where non-keyboard characters are present.
+------------+----------+
| col1 | col2 |
+------------+----------+
| rewweew\s | 4rtrt |
| é | é |
| 123/ | h|h |
| ëû | û |
| ¼ | ¼ |
| *&^ | *%$ |
| #$ | ~!` |
+------------+----------+
My desired result will look like
+--------+-------+
| é | é |
| ëû | û |
| ¼ | ¼ |
+--------+-------+
In my case all characters present on an English keyboard are allowed; I only have to find rows that have characters not present on an English keyboard, like Chinese characters etc.
I got the query below from How can I find non-ASCII characters in MySQL?:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
But it's not working, because the characters ~`#!#$%^&*()_-+=|}]{[':;?/>.<, are also allowed, but the pattern does not account for them.
This may be worth a try.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters. Then the converted and unconverted text will be unequal.
Of course it's based on what is and isn't in the ASCII character repertoire, not what's on a particular keyboard. But it should probably do the trick for you. See this for more discussion.
http://dev.mysql.com/doc/refman/5.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
Edit
Your comment mentioned that you also need to detect some characters that are in the ASCII character set. I think you're asking about the so-called control characters, which have values from 0x00 to 0x1f, plus 0x7f. @Joni Salonen's approach helps get us there, but we need to do it in a way that's multibyte-character safe.
SELECT whatever
FROM tableName
WHERE CONVERT(columnToCheck USING ASCII) <> columnToCheck
OR CONVERT(columnToCheck USING ASCII) RLIKE '[[.NUL.]-[.US.][.DEL.]]'
If you look at http://www.asciitable.com/, you'll see that the OR clause here detects characters in the first column of the ASCII table, and the last character in the fourth column.
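Applied to the two columns in the question (the table name myTable is assumed), the same idea would look something like the following, which should return the é, ëû/û and ¼ rows from the desired result:
SELECT col1, col2
FROM myTable
WHERE CONVERT(col1 USING ASCII) <> col1
   OR CONVERT(col2 USING ASCII) <> col2;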
This query will return the rows that have characters outside of the ASCII range 0 - 127:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '^[[.NUL.]-[.DEL.]]*$'
By English keyboard do you mean American or UK keyboard? The UK keyboard includes some non-ASCII characters, like the sterling pound symbol. If you want to accept those too you have to add them to the regular expression.

MySQL - UNHEX(HEX(UTF-8)) issue

I've got a database with UTF-8 characters in it, which are displayed improperly. I figured that I could use an UNHEX(HEX(column)) != column condition to find which fields have UTF-8 characters in them. The results are rather interesting:
id | content | HEX(content) | UNHEX(HEX(content)) LIKE '%c299%' | UNHEX(HEX(content)) LIKE '%FFF%' | UNHEX(HEX(content))
49829102 | | C299 | 0 | 0 | c299
874625485 | FFF | 464646 | 0 | 1 | FFF
How is this possible and, possibly, how can I find the row with this character in it?
-- edit(2): since my edit has been removed (probably when JamWaffles was fixing my beautiful data table), here it is again: as the editor strips out UTF-8 characters, the content in the first row is \uc299 (if that's not clear ;) )
-- edit(3): I've figured out what the issue is - the actual representation of UNHEX(HEX(content)) is WRONG - to display my multibyte character I had to do the following: SELECT UNHEX(SUBSTR(HEX(content),1)). Sadly UNHEX(C299) doesn't work as UNHEX(C2)+UNHEX(99), so it's back to the drawing board.
There are two ways to determine if a string contains UTF-8 specific characters. The first is to see if the string has values outside the ASCII character set:
SELECT _utf8 'amńbcd' REGEXP '[^[.NUL.]-[.DEL.]]';
The second is to compare the binary and character lengths:
SELECT LENGTH(_utf8 'amńbcd') <> CHAR_LENGTH(_utf8 'amńbcd');
Both return TRUE.
See http://sqlfiddle.com/#!2/d41d8/9811
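In practice, against a real column, both checks might be combined like this (a sketch; my_table and content are assumed names):
SELECT id, content
FROM my_table
WHERE content REGEXP '[^[.NUL.]-[.DEL.]]'
   OR LENGTH(content) <> CHAR_LENGTH(content);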