I use regular expressions in MySQL on multibyte-encoded (UTF-8) data, but I need them to match case-insensitively. Since MySQL has a bug (unresolved for many years) that prevents it from matching multibyte-encoded strings case-insensitively, I am trying to simulate the case insensitivity by lowercasing both the value and the regexp pattern. Is it safe to lowercase a regexp pattern this way? Are there any edge cases I have missed?
Could the following cause any problems?
LOWER('šárKA') REGEXP LOWER('^Šárka$')
Update: I edited the question to be more concrete.
MySQL documentation:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
This is a bug in MySQL, filed in 2007 and still unresolved. However, I can't just switch databases to solve this issue. I need MySQL to somehow consider 'Š' equal to 'š', even if that means hacking around it with a not-so-elegant solution. Characters other than accented (multi-byte) ones match without any issues.
The i option for the Regex will make sure it matches case insensitively.
Example:
'^(?i)Foo$' // (?i) will turn on case insensitivity for the rest of the regex
'/^Foo$/i' // the i option turns off case sensitivity
Note that these may not work in your particular flavour of regex (which you haven't mentioned), so consult your manual for the correct syntax.
Update:
From here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
REGEXP is not case sensitive, except when used with binary strings.
As no one actually answered my original question, I did my own research and realized that it is not safe to lowercase or uppercase a regular expression without any other processing. To be precise, it is safe to do this with theoretically pure regular expressions, but every sane implementation adds character classes and special directives that are vulnerable to case changes:
Escape sequences like \n, \t, etc.
Character classes like \W (non-alphanumeric) and \w (alphanumeric).
Character classes like [.characters.], [=character_class=], or [:character_class:] (MySQL regular expressions dialect).
Lowercasing or uppercasing \W and \w could completely change a regular expression's meaning. This leads to the following conclusions:
The presented solution is a no-go as it stands.
The presented solution is possible, but the regular expression must be lowercased in a more sophisticated way than just applying LOWER or something similar. It has to be parsed, and the case has to be changed carefully.
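To illustrate what "changed carefully" could mean, here is a minimal sketch in Python (not MySQL) that lowercases only the literal characters of a pattern and copies every backslash escape through unchanged. The function name lower_pattern is made up for this example, and the sketch deliberately ignores other dialect features: bracket expressions such as [[:upper:]] or ranges such as [A-Z] would still need separate handling.

```python
def lower_pattern(pattern: str) -> str:
    """Lowercase the literal characters of a regex, leaving escapes untouched.

    Hypothetical helper for illustration; it only understands backslash escapes.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy the backslash and the escaped character exactly as written,
            # so that \W stays \W instead of turning into \w.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(ch.lower())
            i += 1
    return "".join(out)
```

With this, lower_pattern(r'^Šárka\W$') keeps the \W intact while the literal Š becomes š. The lowercased pattern would then be passed to REGEXP together with LOWER() applied to the column value.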
Related
I'm having a really hard time figuring out how to replace a special character with another in SQL (MySQL syntax). I've already tried the REPLACE function without success. What I would like to do is:
From this string:
"C:\foo\bar\file.txt"
Obtain this string:
"C:\\foo\\bar\\file.txt"
As I thought - this is an XY problem. MySQL does not require anything from the path. What it does require, though, is that its input be syntactically valid. In input, a string literal interprets a backslash followed by another character as an escape sequence, which changes how that next character is read. Since the backslash is itself such a special character, it can be escaped to remove its special significance: one writes \\ to get a string with a single backslash.
What this means is, if you write 'C:\\foo\\bar\\file.txt' in an SQL command, MySQL will understand it as the string 'C:\foo\bar\file.txt' (like in my comment under your question). If you write 'C:\foo\bar\file.txt', MySQL will treat each backslash as the start of an escape sequence: \f is not a recognised sequence, so the backslash is simply dropped and the f is kept, while \b is the escape for a backspace character. The string it ends up with is 'C:foo', a backspace, then 'arfile.txt', which is not the path you meant.
Once the string is inside MySQL, it is correct and no replacements are necessary. Thus, you cannot use MySQL's REPLACE to prepare the string for input to MySQL - it is way too late for that. It is kind of like punching the baby in the stomach to pre-chew its food after it has already eaten it: it doesn't work that way and it hurts the baby.
Rather than that, use the language that you use to interface with the database (you didn't tag it, so I can't give you the details) to properly handle the strings. Many languages have functions that will correctly escape strings for you for use by MySQL. Even better, learn about prepared statements and parametrised queries, which completely remove the need for explicit escaping.
The best reference on parametrised queries I can recommend, with remedies for multiple languages, is the Bobby Tables site.
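Since the interface language wasn't specified, here is a minimal sketch in Python of what a parametrised query looks like. It uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server; that substitution, the table name files and the function name store_and_fetch are all assumptions for the example. MySQL drivers for Python commonly use %s rather than ? as the placeholder, but the idea is the same.

```python
import sqlite3


def store_and_fetch(path: str) -> str:
    """Insert a string through a placeholder and read it back unchanged."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE files (path TEXT)")
        # The value travels separately from the SQL text, so no escaping
        # of backslashes or quotes is needed on our side.
        conn.execute("INSERT INTO files (path) VALUES (?)", (path,))
        row = conn.execute("SELECT path FROM files").fetchone()
        return row[0]
    finally:
        conn.close()
```

Calling store_and_fetch(r"C:\foo\bar\file.txt") gives back the path with its single backslashes intact, because the driver never parses the value as part of a string literal.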
The REPLACE function should do the job for you - see https://dev.mysql.com/doc/refman/8.0/en/replace.html
How are you passing the string into REPLACE function?
I am trying to create a regex to validate usernames which should match the following :
Only one special char (._-) allowed and it must not be at the extremes of the string
The first character cannot be a number
All the other characters allowed are letters and numbers
The total length should be between 3 and 20 chars
This is for an HTML validation pattern, so sadly it must be one big regex.
So far this is what I've got:
^(?=(?![0-9])[A-Za-z0-9]+[._-]?[A-Za-z0-9]+).{3,20}
But the positive lookahead can apparently be satisfied more than once, which allows more than one special character, and that is not what I want. I don't know how to correct that.
You should split your regex into two parts (not two expressions!) to make your life easier:
First, match the format the username needs to have:
^[a-zA-Z][a-zA-Z0-9]*[._-]?[a-zA-Z0-9]+$
Now, we just need to validate the length constraint. In order not to mess with the pattern we already have, you can use a non-consuming match that only validates the number of characters (it's literally a hack for creating an "and" pattern for your regular expression): (?=^.{3,20}$)
The regex will only try to match the valid format if the length constraint is satisfied. The lookahead is non-consuming, so after it succeeds, the engine is still at the start of the string.
So, all together:
(?=^.{3,20}$)^[a-zA-Z][a-zA-Z0-9]*[._-]?[a-zA-Z0-9]+$
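To sanity-check the combined pattern outside the browser, here is a small sketch in Python. Python's re module is only a stand-in for the HTML pattern attribute here, and the helper name is_valid_username is made up for the example.

```python
import re

# The combined pattern from above: length lookahead first, then the format.
USERNAME = re.compile(r"(?=^.{3,20}$)^[a-zA-Z][a-zA-Z0-9]*[._-]?[a-zA-Z0-9]+$")


def is_valid_username(name: str) -> bool:
    """Return True if the whole string satisfies the username rules."""
    return USERNAME.fullmatch(name) is not None
```

Names such as "a_b" or "user.name1" pass, while "1abc" (leading digit), "abc_" (special character at the end), "a.b.c" (two special characters) and "ab" (too short) are rejected.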
I think you need to use ? instead of +, so that the special character is matched once or not at all.
^(?=(?![0-9])?[A-Za-z0-9]?[._-]?[A-Za-z0-9]+).{3,20}
I just found out about placeholders in DBI (https://metacpan.org/pod/DBI#Placeholders-and-Bind-Values) and they seem to handle various kinds of input pretty well.
Should I be forcing escape regardless? Are there any scenarios where the placeholders would fail based on the input?
If you escape them and then use bound placeholders, they will end up double escaped, which is not what you want. Just use placeholders. (I frequently use them even when the input is trusted, because it looks cleaner.)
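The double-escaping effect is easy to see in a small sketch. This one is in Python with the standard-library sqlite3 module rather than Perl DBI, purely so it runs on its own; the table name people and the helper manual_escape are made up, and the quote-doubling stands in for whatever escaping routine you might have been tempted to apply first.

```python
import sqlite3


def stored_value(value: str) -> str:
    """Bind a value through a placeholder and return what the database stored."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT)")
        conn.execute("INSERT INTO people (name) VALUES (?)", (value,))
        return conn.execute("SELECT name FROM people").fetchone()[0]
    finally:
        conn.close()


def manual_escape(value: str) -> str:
    # Hypothetical hand-rolled escaping: double every single quote.
    return value.replace("'", "''")
```

Binding "O'Brien" directly stores O'Brien, whereas binding manual_escape("O'Brien") stores O''Brien with both quotes, because the placeholder treats the already-escaped text as literal data.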
There is rarely a reason to use escaping instead of placeholders. An example would be dynamically generating and manipulating a query as an SQL string, but you really shouldn't do that anyway (there are plenty of libraries on CPAN for generating SQL).
The only case I know of in which a placeholder would fail where string interpolation would not is when you are interpolating column names, LIMIT clauses, or some such into the query (but again, that is dynamically generating SQL, as I mentioned above).
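For the column-name case, one common approach is to check the identifier against a fixed allow-list before interpolating it, and to keep placeholders for the actual values. A minimal sketch in Python (not Perl), where the table users, its columns and the function name build_query are all invented for the example:

```python
ALLOWED_SORT_COLUMNS = {"name", "created_at"}  # hypothetical column names


def build_query(sort_column: str) -> str:
    """Build a query whose ORDER BY column comes from a fixed allow-list."""
    if sort_column not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_column!r}")
    # The identifier is interpolated only after the allow-list check;
    # the value to filter on is still passed as a placeholder.
    return f"SELECT id, name FROM users WHERE status = ? ORDER BY {sort_column}"
```

build_query("name") returns a statement that still has one ? to bind, while anything outside the allow-list raises ValueError instead of reaching the database.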
Placeholders >> manual escaping
I want to write a C program that gets some strings from input. I want to save them in a MySQL database.
For security, I would like to check whether the input is a valid UTF-8 string, count the number of characters, and also use some regular expressions to validate the input.
Is there a library that offers me that functionality?
I thought about using wide characters, but as far as I understand, whether they support UTF-8 depends on the implementation and is not defined by a standard.
And also I would be missing the regular expressions.
PCRE supports UTF-8. To validate the string before any processing, the W3C suggests this expression, which I re-implemented in plain C, but PCRE already checks for valid UTF-8 automatically, in accordance with RFC 3629.
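The validate-then-count step itself is small. As a rough sketch of the logic, shown in Python rather than C and without PCRE, strict decoding rejects malformed sequences and the length of the decoded string gives the number of code points. The function name count_utf8_chars is made up for the example.

```python
from typing import Optional


def count_utf8_chars(raw: bytes) -> Optional[int]:
    """Return the number of code points, or None if raw is not valid UTF-8."""
    try:
        text = raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None
    # This counts Unicode code points, not user-perceived characters.
    return len(text)
```

For the bytes of "Šárka" this returns 5, and for an overlong sequence such as b"\xc0\xaf" it returns None. A C program would do the same two steps with whatever library it uses for decoding.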
When using MySQL full text search in boolean mode there are certain characters like + and - that are used as operators. If I do a search for something like "C++" it interprets the + as an operator. What is the best practice for dealing with these special characters?
The current method I am using is to convert all + characters in the data to _plus. It also converts &,#,/ and # characters to a textual representation.
There's no way to do this nicely using MySQL's full text search. What you're doing (substituting special characters with a pre-defined string) is the only way to do it.
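A minimal sketch of that substitution, in Python since the question doesn't say which language talks to the database. Only the + to _plus mapping comes from the question; the other replacement strings and the function name encode_term are made up for the example. The same function has to be applied both to the text being indexed and to the search terms.

```python
# Only "+" -> "_plus" is from the question; the other names are hypothetical.
REPLACEMENTS = {"+": "_plus", "&": "_amp", "#": "_hash", "/": "_slash"}


def encode_term(text: str) -> str:
    """Replace full-text operator characters with plain-word substitutes."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in text)
```

With this, "C++" is stored and searched as "C_plus_plus", so the + characters never reach the boolean-mode parser.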
You may wish to consider using Sphinx Search instead. It apparently supports escaping special characters, and by all reports is significantly faster than the default full text search.
MySQL is fairly brutal in what tokens it ignores when building its full text indexes. I'd say that where it encountered the term "C++" it would probably strip out the plus characters, leaving only C, and then ignore that because it's too short. You could probably configure MySQL to include single-letter words, but it's not optimised for that, and I doubt you could get it to treat the plus characters how you want.
If you have a need for a good internal search engine where you can configure things like this, check out Lucene which has been ported to various languages including PHP (in the Zend framework).
Or if you need this more for 'tagging' than text search then something else may be more appropriate.