MySQL remove special characters using Regex - mysql

I want to remove special characters from a MySQL table but LEAVE UTF8 characters such as arabic.
This is to remove common special characters such as " ' # ! * $ etc.
I have used the following in PHP which works great.
preg_replace('/(?=\P{Nd})\P{L}/u', '', $name);

You can UPDATE your rows using the very same regular expression for comparison.
The string regular expression operators in MySQL are REGEXP or RLIKE.
I hope, this should solve your problem. :)

Related

SQL Regex last character search not working

I'm using regex to find specific search but the last separator getting ignore.
Must search for |49213[A-Z]| but searches for |49213[A-Z]
SELECT * FROM table WHERE (data REGEXP '/\|49213[A-Z]+\|/')
Why are you using | in the pattern? Why the +?
SELECT * FROM table WHERE (data REGEXP '\|49213[A-Z]\|')
If you want multiple:
SELECT * FROM table WHERE (data REGEXP '\|49213[A-Z]+\|')
or:
SELECT * FROM table WHERE (data REGEXP '[|]49213[A-Z][|]')
Aha. That is rather subtle.
\ escapes certain characters that have special meaning.
But it does not seem to do so for | ("or") or . ("any byte"), etc.
So, \| is the same as |.
But the regexp parser does not like having either side of "or" being empty. (I suspect this is a "bug"). Hence the error message.
https://dev.mysql.com/doc/refman/5.7/en/regexp.html says
To use a literal instance of a special character in a regular expression, precede it by two backslash () characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
The best fix seems to be [|] or \\| instead of \| when you want the pipe character.
Someday, the REGEXP parser in MySQL will be upgraded to PCRE as in MariaDB. Then a lot more features will come, and this 'bug' may go away.

MySQL regex matching at least 2 dots

Consider the following regex
#(.*\..*){2,}
Expected behaviour:
a#b doesnt match
a#b.c doesnt match
a#b.c.d matches
a#b.c.d.e matches
and so on
Testing in regexpal it works as expected.
Using it it in a mysql select doesn't work as expected. Query:
SELECT * FROM `users` where mail regexp '#(.*\..*){2,}'
is returning lines like
foo#example.com
that should not match the given regex. Why?
I think the answer to your question is here.
Because MySQL uses the C escape syntax in strings (for example, “\n”
to represent the newline character), you must double any “\” that you
use in your REGEXP strings.
MYSQL Reference
Because your middle dot wasn't properly escaped it was treated as just another wildcard and in the end your expression was effectively collapsed to #.{2,} or #..+
#anubhava's answer is probably a better substitute for what you tried to do though I would note #dasblinkenlight's comment about using the character class [.] which will make it easy to drop in a regex you've already tested in at RegexPal.
You can use:
SELECT * FROM `users` where mail REGEXP '([^.]*\\.){2}'
to enforce at least 2 dots in mail column.
I would match two dots in MySQL using like:
where col like '%#.%.%'
The problem with your code is that .* (match-everything dot) matches dot '.' character. Replacing it with [^.]* fixes the problem:
SELECT *
FROM `users`
where mail regexp '#([^.]*[.]){2,}'
Note the use of [.] in place of the equivalent \.. This syntax makes it easier to embed the regex into programming languages that use backslash as escape character in their string literals.
Demo.

Mysql regex error #1139 using literal -

I tried running this query:
SELECT column FROM table WHERE column REGEXP '[^A-Za-z\-\']'
but this returns
#1139 - Got error 'invalid character range' from regexp
which seems to me like the - in the character class is not being escaped, and instead read as an invalid range. Is there some other way that it's suppose to be escaped for mysql to be the literal -?
This regex works as expected outside of mysql, https://regex101.com/r/wE8vY5/1.
I came up with an alternative to that regex which is
SELECT column FROM table WHERE column NOT REGEXP '([:alpha:]|-|\')'
so the question isn't how do I get this to work. The question is why doesn't the first regex work?
Here's a SQL fiddle of the issue, http://sqlfiddle.com/#!9/f8a006/1.
Also, there is no language being used here, query is being run at DB level.
Regex in PHP: http://sandbox.onlinephpfunctions.com/code/10f5fe2939bdbbbebcc986c171a97c0d63d06e55
Regex in JS: https://jsfiddle.net/6ay4zmrb/
Just change the order.
SELECT column FROM table WHERE column REGEXP '[^-A-Za-z\']'
#Avinash Raj is correct the - must be first (or last). The \ is not an escape character in POSIX, which is what mysql uses, https://dev.mysql.com/doc/refman/5.1/en/regexp.html.
One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression.
-http://www.regular-expressions.info/posixbrackets.html
What special characters must be escaped in regular expressions?
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally

MySQL REPLACE unicode characters

I'm trying to replace a Unicode character with another one.
I'm running this but it's not working.
UPDATE items SET `data` = REPLACE(`data`, '\u030C', '\u0306');
I've tried REPLACE without the \ or \u and also with multiple slashes like \\\\\\\\\\\\u030C. I've pretty much run out of random combinations to try to make this work.
How can I get this replace working.
Can we back up a step and avoid getting the \u encoding? If you are using PHP:
$t = json_encode($s, JSON_UNESCAPED_UNICODE);
Addenda
In mysql commandline tool, use 2 backslashes:
UPDATE items SET `data` = REPLACE(`data`, '\\u030C', '\\u0306');
Are you replacing a combining caron with a combining breve?
Don't you really want the utf8 character instead of the unicode code?

Mysql replace all special unicode characters with their ascii counterpart

I have a field with encoding utf8-general-ci in which many values contain non-ascii characters. I want to
Search for all fields with any non-ascii characters
Replace all non-ascii characters with their corresponding ascii version.
For example: côte-d'ivoire should be replaced with cote-d-i'voire, são-tomé should be replaced with sao-tome, etc.
How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using
RLIKE '%[^a-z]%'
For example
SELECT columname
FROM tablename
WHERE NOT columname REGEXP '[a-z]';
returns an empty set.
Thanks
An sql fiddle example is at
http://www.sqlfiddle.com/#!2/c1d90/1/0
the query to select is
select * from test where maintext rlike '[^\x00-\x7F]'
Hope this helps
I'm presuming from your previous questions that you're using PHP.
https://github.com/silverstripe-labs/silverstripe-unidecode
You could then use skv's answer to return the object's you wish to use and then use unidecode to attempt to convert the object to it's ascii equivalents.
In Perl, you can use Text::Unidecode.
In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.
There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.