Find non-English characters in MySQL

I have a MySQL table which stores email contents as a blob data type. A few of the rows have non-English characters in them. Is there any way to find only the rows which contain non-English characters?

...
where some_column regexp '.*[^\\w.#].*'

descr LIKE '%[' + CHAR(127)+ '-' +CHAR(255)+']%' COLLATE Latin1_General_100_BIN2

select * from TABLE where COLUMN like '%\0%';
This works for me on MySQL 5.6, and I am happy to use it without knowing how it works. I would be thankful if someone added an explanation of how it works.
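One possible explanation, offered as a guess rather than a confirmed answer: in a MySQL string literal, \0 is the escape sequence for the NUL byte (0x00), so the pattern matches any row whose blob contains at least one zero byte. ASCII or UTF-8 text does not normally contain that byte, while text stored in a 16-bit encoding does, so the query may be picking up rows saved in a different encoding rather than non-English text as such. The Python sketch below only illustrates that byte-level difference; the sample strings and the helper name contains_nul are made up for illustration.

```python
# Illustration only: which encodings put a NUL (0x00) byte into the stored bytes?
NUL = b"\x00"


def contains_nul(data: bytes) -> bool:
    """Return True if the byte string holds at least one 0x00 byte."""
    return NUL in data


# Hypothetical sample values, not taken from the original table.
utf8_english = "hello".encode("utf-8")
utf8_russian = "привет".encode("utf-8")
utf16_english = "hello".encode("utf-16-le")

for label, data in [
    ("UTF-8 English", utf8_english),
    ("UTF-8 Russian", utf8_russian),
    ("UTF-16 English", utf16_english),
]:
    print(label, data.hex(), contains_nul(data))
```

If this reading is right, the query is better suited to finding rows with embedded NUL bytes than to finding non-English text in general.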

Related

Removing Cyrillic text from MySQL Select string

Recently I got stuck on a problem with MySQL queries. I have a table that contains records in multiple languages. For example, the columns are ID and Description, and the data looks like this: 1 test with Кирилица; 2 Test without Cyrillic. I need to remove all Cyrillic symbols in a SELECT query. The result must look like this: 1 test with; 2 Test without Cyrillic. It seems I need to use a SELECT with REPLACE, but is there a faster way than replacing 66 characters (upper-case and lower-case letters) in the query?
I have tried something like the following, but of course it isn't working. I'm hoping for help from the MySQL gurus. Thank you for your attention.
SELECT id,SUBSTRING_INDEX(title, REGEXP "[а-яА-Я]", 1)
FROM Test
AFAIK there is no faster method than
SELECT id, REGEXP_REPLACE(title, '[Ѐ-ӿ]+', '') AS title FROM test;
("Ѐ" and "ӿ" are the first and last characters, respectively, of the Unicode Cyrillic block. If you go with [а-яА-Я], you can miss Cyrillic letters of languages other than Russian, and even the Russian Ё.)
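The difference between the two character ranges can be checked outside the database. The Python sketch below is only an illustration of the ranges, not of MySQL's REGEXP_REPLACE itself; the function names and the sample strings other than the one from the question are assumptions.

```python
import re

# U+0400..U+04FF is the Unicode "Cyrillic" block (Ѐ through ӿ).
CYRILLIC_BLOCK = re.compile("[\u0400-\u04FF]+")
# Narrower range: only the basic Russian letters А..Я and а..я.
RUSSIAN_BASIC = re.compile("[а-яА-Я]+")


def strip_cyrillic(text: str) -> str:
    """Remove every run of characters from the Cyrillic block."""
    return CYRILLIC_BLOCK.sub("", text)


def strip_russian_basic(text: str) -> str:
    """Remove only the letters covered by the narrower range."""
    return RUSSIAN_BASIC.sub("", text)


print(repr(strip_cyrillic("test with Кирилица")))
print(repr(strip_cyrillic("Ёлка tree")))
print(repr(strip_russian_basic("Ёлка tree")))
```

The narrower range leaves Ё (U+0401) in place because it sits before А (U+0410) in the block. Note that a trailing space remains after stripping, so a TRIM may still be wanted on the result.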

Search for replacement character (no TSQL)

I'm trying to find a way to search for the replacement character U+FFFD with SQL (I'm using MariaDB), but I cannot make it work. I tried with:
SELECT id FROM tablename WHERE content LIKE "%\ufffd%";
SELECT id FROM tablename WHERE content LIKE "%�%"
Neither query works for me. Some topics say to use UNICODE(), but it's a T-SQL function and I cannot use it here in MariaDB. Is there any solution?
What CHARACTER SET are you using? FFFD is the hex for the Unicode "codepoint". The UTF-8 encoding for it is EFBFBD.
Here's another way to look for it:
WHERE HEX(col) REGEXP '^(....)*FFFD'
or perhaps
WHERE HEX(col) REGEXP '^(..)*EFBFBD'
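The hex values used in these patterns can be verified with a short Python sketch. It also shows why the leading ^(..)* matters: it keeps the match aligned to whole bytes, so a hex string that only contains EFBFBD across a byte boundary is not reported. The helper name has_replacement_char and the sample byte strings are assumptions made for illustration.

```python
import re

REPLACEMENT = "\ufffd"

# Hex of U+FFFD in the two encodings mentioned in the answer.
utf8_hex = REPLACEMENT.encode("utf-8").hex().upper()
ucs2_hex = REPLACEMENT.encode("utf-16-be").hex().upper()
print(utf8_hex, ucs2_hex)


def has_replacement_char(data: bytes) -> bool:
    """Mimic HEX(col) REGEXP '^(..)*EFBFBD' for UTF-8 encoded bytes."""
    return re.search(r"^(..)*EFBFBD", data.hex().upper()) is not None


aligned = "a\ufffdb".encode("utf-8")   # hex 61 EF BF BD 62
misaligned = bytes.fromhex("0EFBFBD0")  # 'EFBFBD' appears only at an odd hex offset

print(has_replacement_char(aligned))
print(has_replacement_char(misaligned))
print("EFBFBD" in misaligned.hex().upper())
```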
What are your results? Do you have any error? Try this simple working query or change your col type.
select '�' a from dual where a like '%�%'

How to query MySQL for fields containing null characters

I have a MySQL table with a text column. Some rows have null characters (0x00) as part of this text column (along with other characters).
I am looking for a query that will return all rows containing any null characters in this column, but I cannot figure out the proper syntax for my "where column like '%...%'" clause.
Thank you!
Right after I submitted the question, a suggested link to this related question provided my answer:
Query MySQL with unicode char code
WHERE column LIKE CONCAT("%", CHAR(0x00 using utf8), "%");
The following worked for me...
WHERE column LIKE "%\0%";

MySQL SELECT with LIKE clause is not working with Chinese characters

I have data stored in a single column, in both English and Chinese.
The data is wrapped in separators, e.g.
for Chinese
<!--:zh-->日本<!--:-->
for English
<!--:en-->English Characters<!--:-->
I want to show the content according to the user's selected language.
I wrote a query like this:
SELECT * FROM table WHERE content LIKE '<!--:zh-->%<!--:-->'
The query above runs but returns an empty result set.
The collation of the content column is utf8_general_ci.
I have also tried using the CONVERT function, as below:
SELECT * FROM table WHERE CONVERT(content USING utf8)
LIKE CONVERT('<!--:zh-->%<!--:-->' USING utf8)
But this does not work either.
I also tried running SET NAMES UTF8 first, but it still does not work.
I am running the queries in phpMyAdmin, if that matters.
qTranslate did not change the database schema used by WordPress. Translation data is stored in the original fields. As a result, each field contains all translations for that field, and the data looks like this:
<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
http://wpml.org/documentation/related-projects/qtranslate-importer/
Test table data for content
<!--:zh-->日本<!--:--><!--:en-->English Characters<!--:-->
<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
<!--:zh-->日本<!--:-->
<!--:en-->English Characters<!--:-->
Given that, and your statement that "I have data stored in single column which are in English and Chinese", your SELECT should look like this:
SELECT * FROM tab
WHERE content LIKE '%<!--:zh-->%<!--:-->%'
The following also shows how to extract the part for a specific language from content:
SET @PRE = '<!--:zh-->', @SUF = '<!--:-->';
SELECT
  content,
  SUBSTR(
    content,
    LOCATE( @PRE, content ) + LENGTH( @PRE ),
    LOCATE( @SUF, content, LOCATE( @PRE, content ) ) - LOCATE( @PRE, content ) - LENGTH( @PRE )
  ) langcontent
FROM tab
WHERE content LIKE CONCAT( '%', @PRE, '%', @SUF, '%' );
This follows the LIKE behavior described in the MySQL documentation, which gives this example:
SELECT 'David!' LIKE '%D%v%';
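The effect of the leading and trailing % can be reproduced with the four test rows from this answer. The sketch below uses Python's built-in sqlite3 module purely for convenience; that is an assumption, since SQLite is not MySQL, although % behaves the same way in LIKE for this case. The table layout with an id column and the helper name matching_ids are also made up for illustration.

```python
import sqlite3

# The four sample rows from the answer, in the same order.
rows = [
    "<!--:zh-->日本<!--:--><!--:en-->English Characters<!--:-->",
    "<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->",
    "<!--:zh-->日本<!--:-->",
    "<!--:en-->English Characters<!--:-->",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO tab (id, content) VALUES (?, ?)",
    list(enumerate(rows, start=1)),
)


def matching_ids(pattern: str) -> list:
    """Return the ids of rows whose content matches the LIKE pattern."""
    cur = conn.execute(
        "SELECT id FROM tab WHERE content LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in cur]


# Pattern from the question: must start with the zh tag and end with the close tag.
anchored = matching_ids("<!--:zh-->%<!--:-->")
# Pattern from the answer: the zh part may appear anywhere in the field.
unanchored = matching_ids("%<!--:zh-->%<!--:-->%")

print(anchored)
print(unanchored)
```

Row 2, where the English part comes first, is only found by the unanchored pattern, which is the layout qTranslate produces according to the answer above.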
As others have pointed out, your queries seem to be fine, so I'd look somewhere else. This is something you can try:
I'm not sure about Chinese input, but for Japanese, many symbols have full-width and half-width variants. For example, "ｈｅｌｌｏ" and "hello" look similar, but the code points of their characters are different, so they won't compare as equal. It's very easy to mistype something in full-width, and very difficult to detect, especially for whitespace. Compare "　" and " ".
You may be storing your data in half-width and querying it in full-width. If even one character is different (spaces are especially difficult to detect), the query will not find your desired data.
There are many ways to detect this. For instance, try copying the data and the query verbatim into text files and viewing them with a hex editor. If there is even a single-bit difference in the relevant parts, you may be dealing with this problem.
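As an alternative to a hex editor, the comparison can be done in a few lines of Python. This is a minimal sketch, assuming the suspect strings have been copied out of the database and the query; the helper names code_points and fold_width are hypothetical. It lists the code points of each string and uses Unicode NFKC normalization, which maps full-width forms to their ordinary counterparts.

```python
import unicodedata


def code_points(text: str) -> list:
    """List the code points of a string, e.g. ['U+0068', ...]."""
    return [f"U+{ord(ch):04X}" for ch in text]


def fold_width(text: str) -> str:
    """Fold full-width characters to their ordinary forms using NFKC."""
    return unicodedata.normalize("NFKC", text)


full_width = "\uff48\uff45\uff4c\uff4c\uff4f"  # full-width 'hello'
half_width = "hello"

print(code_points(full_width))
print(code_points(half_width))
print(full_width == half_width)
print(fold_width(full_width) == half_width)

# The ideographic space U+3000 folds to an ordinary space in the same way.
print(code_points("\u3000"), code_points(fold_width("\u3000")))
```

NFKC also changes other compatibility characters, so it is better treated as a diagnostic here than as something to apply blindly to stored data.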
Assuming you're using MySQL, you can use wildcards in LIKE:
% matches any number of characters, including zero characters.
_ matches exactly one character
Here's an example search for values containing the character 日 in the content column of your table:
SELECT * FROM table WHERE `content` LIKE '%日%'
The search fails because of the way you store the data.
You are using the utf8_general_ci collation, which is tailored to fast searching in some European languages. It is not perfect even for some of those. People tend to use it just because it is fast and they don't care about some search inaccuracy in, say, Scandinavian languages.
Change this to big5_chinese_ci or some other Chinese-tuned collation.
UPD.
Another thing.
I see, you use a kind of markup in your DB records.
<!--:zh-->日本<!--:-->
<!--:en-->English Characters<!--:-->
So, if you're searching for Chinese, you may just use
SELECT * FROM table WHERE content LIKE '<!--:zh-->%'
instead of
SELECT * FROM table WHERE content LIKE '<!--:zh-->%<!--:-->'
I have tried to reproduce the problem. The query is OK, I have got the result, even using SET NAMES latin1.
Check the content of the field. There may be leading or trailing white space; remove it first, or try this query -
SELECT * FROM table
WHERE TRIM(content) LIKE '<!--:zh-->%<!--:-->'
Example with your string -
CREATE TABLE table1(
column1 VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci
);
INSERT INTO table1 VALUES
('<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->');
SELECT * FROM table1 WHERE column1 LIKE '%<!--:zh-->%<!--:-->';
=> <!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
Can I ask what version of MySQL you're using? From what I see your code seems fine, which gets me thinking you're not running the most up to date version of MySQL.

How to match UTF8 characters in MySQL regular expression?

So I want to find all the rows that have UTF8 characters in a specific field, in this manner:
SELECT * FROM table1 WHERE field1 REGEXP '[[:utf8:]]';
Searched through MySQL docs but found nothing. Is this possible?
I meant non-ASCII characters.
Managed to find a way to do that: http://www.kavoir.com/2011/03/mysql-find-non-ascii-characters.html
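The linked page is not reproduced here. As a rough sketch of what "non-ASCII" means in practice, the Python snippet below flags any string containing a character outside the range U+0000 to U+007F, using a negated character class. It illustrates the idea only and is not the method from the link; the function name has_non_ascii and the sample values are assumptions.

```python
import re

# Anything outside the 7-bit ASCII range counts as non-ASCII.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def has_non_ascii(text: str) -> bool:
    """Return True if the text contains at least one non-ASCII character."""
    return NON_ASCII.search(text) is not None


# Hypothetical field values.
samples = ["plain text", "naïve", "日本", "tab\tand newline\n"]
for value in samples:
    print(repr(value), has_non_ascii(value))
```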