Removing Cyrillic text from MySQL Select string - mysql

I recently got stuck on a problem with MySQL queries. I have a table that contains records in multiple languages. For example, the columns are ID and Description, and the data looks like this: 1 test with Кирилица; 2 Test without Cyrillic. I need to remove all Cyrillic symbols in the SELECT query, so the result should be: 1 test with; 2 Test without Cyrillic. It seems I need a SELECT with REPLACE, but is there a faster way than replacing 66 characters (the uppercase and lowercase letters) one by one in the query?
I have tried something like the following, but of course it isn't working. Hoping for help from the MySQL gurus. Thank you for your attention.
SELECT id,SUBSTRING_INDEX(title, REGEXP "[а-яА-Я]", 1)
FROM Test

AFAIK there is no faster method than REGEXP_REPLACE (available in MySQL 8.0 and later):
SELECT id, REGEXP_REPLACE(title, '[Ѐ-ӿ]+', '') AS title FROM test;
("Ѐ" and "ӿ" are the first and last characters, respectively, of Unicode Cyrillic block. If you go with [а-яА-Я], you can miss Cyrillic letters of languages outside Russian, and even the Russian Ё.)
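If the cleanup has to happen in application code instead of in the query, the same character range can be used there. The following is a minimal Python sketch, not part of the original answer; the names strip_cyrillic, CYRILLIC_BLOCK and RUSSIAN_ONLY are made up for illustration. It also shows why the narrower [а-яА-Я] class leaves "Ё" behind.

```python
import re

# U+0400..U+04FF is the Unicode Cyrillic block ("Ѐ" through "ӿ").
CYRILLIC_BLOCK = re.compile("[\u0400-\u04FF]+")
# Narrower class from the question; it skips letters such as "Ё" (U+0401).
RUSSIAN_ONLY = re.compile("[а-яА-Я]+")


def strip_cyrillic(text: str) -> str:
    """Remove every run of characters that belongs to the Cyrillic block."""
    return CYRILLIC_BLOCK.sub("", text)


if __name__ == "__main__":
    print(repr(strip_cyrillic("test with Кирилица")))  # trailing space is kept
    print(repr(RUSSIAN_ONLY.sub("", "Ёлка")))           # "Ё" survives the narrow class
```

Note that the space before the removed word is kept, just as with REGEXP_REPLACE; trim the result if that matters.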

Related

mysql MATCH AGAINST weird characters query

I have a table where the field "company_name" has weird characters, like "à","ö","¬","©","¬","†", etc. I want to return all "company_name"s that contain these characters anywhere within the string. My current query looks like this:
SELECT * FROM table WHERE
MATCH (company_name) AGAINST ('"Ä","à","ö","¬","©","¬","†"' in natural language mode);
But I keep getting no data from the query. I know this can't be the case, as there are definitely examples of them I can find manually. To be clear, the query itself isn't throwing any errors, just not returning any data.
The default minimum word length for full-text search is 3 (InnoDB) or 4 (MyISAM), so single characters are not indexed by default.
You can change it; see the manual:
https://dev.mysql.com/doc/refman/8.0/en/fulltext-fine-tuning.html
Or use regular expressions:
SELECT * FROM table WHERE
company_name REGEXP '[Äàö¬©¬†]+';
SELECT *
FROM table
WHERE company_name REGEXP '[^ -~]'
This will find any wacky stuff outside printable ASCII, including wide characters or 'euro-ASCII' or emoji. (MySQL's LIKE only understands the % and _ wildcards, not bracket character classes, so REGEXP is used here.)

MYSQL Find entries that contain more than 7 numbers

I need to find entries that contain more than 7 numbers in one of my mysql tables BUT the numbers are separated by letters or anything else.
What I have is this little piece of code I use to find entries like dsc123456789:
select * from crawl where title regexp '[0-9]{7}'
How can I find entries like dsc-123-456_78B9? I tried different things but without success so far.
Thanks
You can use the following solution:
SELECT *
FROM crawl
WHERE title REGEXP '(([^[:digit:]])?[[:digit:]]){8,}';
Why doesn't the original query from this answer work?
-- this query doesn't work!
SELECT *
FROM crawl
WHERE title REGEXP '\d([^\d]?\d){7,}'
Before version 8.0, MySQL's regular expression engine doesn't support shorthand character classes like \d (digits), so the query never matches as intended. In PHP and other languages the regular expression would look like this:
\d([^\d]?\d){7,}
but in older MySQL versions this isn't valid. So you have to use POSIX character classes to solve this:
(([^[:digit:]])?[[:digit:]]){8,}
Hint: Make sure you use {8} or {8,} instead of {7} since you want to find all entries with more than 7 numbers / digits.
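To check the pattern against sample titles before running it on the table, the same logic can be tried outside MySQL. This is a small Python sketch under that assumption; has_more_than_7_digits is a hypothetical helper name, and \D and \d stand in for [^[:digit:]] and [[:digit:]].

```python
import re

# A digit, optionally preceded by exactly one non-digit, repeated 8 or more times.
MORE_THAN_7_DIGITS = re.compile(r"(\D?\d){8,}")


def has_more_than_7_digits(title: str) -> bool:
    """True if the title holds at least 8 digits with at most one separator between them."""
    return MORE_THAN_7_DIGITS.search(title) is not None


if __name__ == "__main__":
    for sample in ["dsc123456789", "dsc-123-456_78B9", "dsc-123-456"]:
        print(sample, has_more_than_7_digits(sample))
```

As with the SQL version, two or more separator characters in a row break the run of digits.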

Isolate an email address from a string using MySQL

I am trying to isolate an email address from a block of free field text (column name is TEXT).
There are many different variations of preceding and succeeding characters in the free text field, i.e.:
email me! john@smith.com
e:john@smith.com m:555-555-5555
john@smith.com--personal email
I've tried variations of INSTR() and SUBSTRING_INDEX() to first isolate the "@" (probably the one reliable constant in finding an email...) and extracting the characters to the left (up until a space or non-qualifying character like "-" or ":") and doing the same thing with the text following the @.
However - everything I've tried so far hasn't filtered out the noise to the level I need.
Obviously 100% accuracy isn't possible but would someone mind taking a crack at how I can structure my select statement?
There is no easy way to do this within MySQL. However, you can do it easily with regular expressions after you have retrieved the data.
Here is an example of how to use it in your case: Regex example
If you want to select all e-mail addresses from one string: Regex Example
You can use a regex in MySQL to select the rows that contain an e-mail address, but that still doesn't extract the matched group from the string. That has to be done outside MySQL.
SELECT * FROM table
WHERE column RLIKE '[[:alnum:]_]+@[[:alnum:]_]+\\.[[:alnum:]_]+'
RLIKE only tests for a match. You can use REGEXP in the SELECT list too, but it only returns 1 or 0 depending on whether a match was found.
If you do want to extract it in MySQL maybe this other stackoverflow post helps you out. But it seems like a lot of work instead of doing it outside MySQL
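For the "do it outside MySQL" route, the extraction step might look like the following Python sketch. It is only an illustration: extract_emails and EMAIL_RE are made-up names, the pattern is deliberately loose, and it will not cover every valid address.

```python
import re

# Loose e-mail pattern: local part, "@", domain labels, then a dot and a TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return every e-mail-like substring found in a block of free text."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    rows = [
        "email me! john@smith.com",
        "e:john@smith.com m:555-555-5555",
        "john@smith.com--personal email",
    ]
    for row in rows:
        print(extract_emails(row))
```

Characters such as ":" before the address are not part of the allowed local-part set, so the match starts right after them.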
In MySQL 8.0 and later you can use REGEXP_SUBSTR to isolate just the email from a block of free text. (The backslash before the dot is doubled because MySQL string literals consume one level of escaping.)
SELECT *, REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable`;
If you want to get just the records with emails and remove duplicates ...
SELECT DISTINCT REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable` WHERE `TEXT` REGEXP '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})';

Use REPLACE() to ORDER BY a mySQL SELECT alphanumerically when special characters are present

I had done several different searches on SO looking for a simple solution to sorting MySQL results alphanumerically where some fields may have special characters present. The solution:
"SELECT *, REPLACE(title, '\"', '') AS indexTitle ORDER BY indexTitle ASC";
In this case I'm searching for strings that begin with a double quote, escaped.
This probably wouldn't be a great solution where the types of special characters are not known, but for a simple sort it works nicely.
Hopefully this helps someone.
One way to do this would be to write your own function to strip non-alphanumeric characters from a String. Google found me this example (I've not checked it!). Then you could write something like:
SELECT *, remove_non_alphanum_char_f(title) AS indexTitle ORDER BY indexTitle ASC;
Though of course as @arkascha has pointed out in the comments above this is slow and not scalable. A better solution is to go back a step and, if possible, ensure the data in your table is in the correct format to begin with. If you really need the special characters, it may be less of an overhead to add an extra column to your table which is the title column with the special characters stripped - then you could just order by that column. You could perform the stripping at the point when you insert into the table.
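The "strip at insert time" idea can be sketched in application code. The Python below is an assumption-laden illustration, not the function linked above: index_title is a hypothetical helper that produces the value you would store in the extra sort column.

```python
import re

# Anything that is not an ASCII letter or digit is dropped from the sort key.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z]+")


def index_title(title: str) -> str:
    """Build a sort key: strip non-alphanumeric characters and ignore case."""
    return _NON_ALNUM.sub("", title).lower()


if __name__ == "__main__":
    titles = ['"Zebra"', "apple", '"Banana"']
    # Sorting by the key matches what ORDER BY on the stripped column would give.
    print(sorted(titles, key=index_title))
```

Computing the key once per row on insert avoids running a string function over every row at query time.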

Mysql Select with LIKE clause is not working Chinese characters

I have data stored in single column which are in English and Chinese.
the data is separated by the separators e.g.
for Chinese
<!--:zh-->日本<!--:-->
for English
<!--:en-->English Characters<!--:-->
I would show the content according to users selected language.
I made a query like this
SELECT * FROM table WHERE content LIKE '<!--:zh-->%<!--:-->'
The query above runs but returns an empty result set.
Collation of content column is utf8_general_ci
I have also tried using the convert function like below
SELECT * FROM table WHERE CONVERT(content USING utf8)
LIKE CONVERT('<!--:zh-->%<!--:-->' USING utf8)
But this also does not work.
I also tried running the query SET NAMES UTF8 but still it does not work.
I am running the queries in phpMyAdmin, if that matters.
qTranslate did not change the database used by WordPress. Translation data is stored in original fields. For that reason there is each field containing all translations for that special field and the data is like this
<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
http://wpml.org/documentation/related-projects/qtranslate-importer/
Test table data for content
<!--:zh-->日本<!--:--><!--:en-->English Characters<!--:-->
<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
<!--:zh-->日本<!--:-->
<!--:en-->English Characters<!--:-->
Given that, as you wrote, "I have data stored in single column which are in English and Chinese", your SELECT should look like this:
SELECT * FROM tab
WHERE content LIKE '%<!--:zh-->%<!--:-->%'
SQL Fiddle DEMO (also with demo how to get the special language part out of content)
SET @PRE = '<!--:zh-->', @SUF = '<!--:-->';
SELECT
  content,
  SUBSTR(
    content,
    LOCATE( @PRE, content ) + LENGTH( @PRE ),
    LOCATE( @SUF, content, LOCATE( @PRE, content ) ) - LOCATE( @PRE, content ) - LENGTH( @PRE )
  ) langcontent
FROM tab
WHERE content LIKE CONCAT( '%', @PRE, '%', @SUF, '%' );
This works as stated in the MySQL documentation, following the example of
SELECT 'David!' LIKE '%D%v%';
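If the whole content field is fetched and the language part is picked out in application code instead, the prefix/suffix logic is the same. Here is a minimal Python sketch of that idea; extract_lang is a hypothetical helper name, and the marker format is taken from the sample data in the question.

```python
from typing import Optional


def extract_lang(content: str, lang: str) -> Optional[str]:
    """Return the text between <!--:lang--> and the next <!--:-->, or None if absent."""
    prefix = f"<!--:{lang}-->"
    suffix = "<!--:-->"
    start = content.find(prefix)
    if start == -1:
        return None
    start += len(prefix)                 # first character after the opening marker
    end = content.find(suffix, start)    # closing marker that follows it
    if end == -1:
        return None
    return content[start:end]


if __name__ == "__main__":
    row = "<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->"
    print(extract_lang(row, "zh"))
    print(extract_lang(row, "en"))
```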
As others have pointed out, your queries seem to be fine, so I'd look somewhere else. This is something you can try:
I'm not sure about Chinese input, but for Japanese, many symbols have full-width and half-width variants, for example: "ｈｅｌｌｏ" and "hello" look similar, but the code points of their characters are different, and therefore they won't compare as equal. It's very easy to mistype something in full-width, and very difficult to detect, especially for whitespace. Compare "　" and " ".
You may be storing your data in half-width and querying it in full-width. Even if only one character differs (spaces are especially hard to spot), the query will not find your desired data.
There are many ways to detect this, for instance try copying the data and query into text files verbatim, and view them with hex editors. If there is a single bit difference in the relevant parts, you may be dealing with this problem.
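Besides a hex editor, a few lines of script can reveal the difference. The Python sketch below is one possible check, with made-up helper names: it prints the code points of each string and uses NFKC normalization, which folds full-width forms into their ordinary counterparts, to test whether two strings differ only by width.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List the Unicode code point of every character, e.g. 'U+FF48'."""
    return [f"U+{ord(ch):04X}" for ch in text]


def same_after_width_folding(a: str, b: str) -> bool:
    """True if the strings are equal once compatibility forms are normalized (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    full_width = "\uff48\uff45\uff4c\uff4c\uff4f"  # full-width "hello"
    half_width = "hello"
    print(full_width == half_width)
    print(code_points(full_width), code_points(half_width))
    print(same_after_width_folding(full_width, half_width))
```

If the stored value and the search string only become equal after normalization, a width mismatch is the likely cause.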
Assuming you're using MySQL, you can use wildcards in LIKE:
% matches any number of characters, including zero characters.
_ matches exactly one character
Here's an example search for values containing the character 日 in the content column of your table:
SELECT * FROM table WHERE `content` LIKE '%日%'
The search fails because of the way you store the data.
You are using the utf8_general_ci collation, which is tailored for fast searching in some European languages, and it is not perfect even for some of those. People tend to use it just because it is fast and they don't care about some search inaccuracy in, say, Scandinavian languages.
Change this to big5_chinese_ci or some other collation tuned for Chinese (a collation must belong to the column's character set, so this also means converting the column to the matching character set).
UPD.
Another thing.
I see, you use a kind of markup in your DB records.
<!--:zh-->日本<!--:-->
<!--:en-->English Characters<!--:-->
So, if you're searching for Chinese, you may just use
SELECT * FROM table WHERE content LIKE '<!--:zh-->%'
instead of
SELECT * FROM table WHERE content LIKE '<!--:zh-->%<!--:-->'
I have tried to reproduce the problem. The query is OK, I have got the result, even using SET NAMES latin1.
Check the content of the field: there may be leading or trailing whitespace. Remove it first, or try this query -
SELECT * FROM table
WHERE TRIM(content) LIKE '<!--:zh-->%<!--:-->'
Example with your string -
CREATE TABLE table1(
column1 VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci
);
INSERT INTO table1 VALUES
('<!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->');
SELECT * FROM table1 WHERE column1 LIKE '%<!--:zh-->%<!--:-->';
=> <!--:en-->English Characters<!--:--><!--:zh-->日本<!--:-->
Can I ask what version of MySQL you're using? From what I see your code seems fine, which gets me thinking you're not running the most up to date version of MySQL.