python pattern => ^(?=.\bABDUL\b)(?=.\bHAI\b.)(?=.\bMANSOOR\b).*$
need equalent mysql pattern
can you please help me out ?
The regex in question is a quite strange way how to match simple words. It is not clear what is the expected input. Maybe, the input justifies this approach.
^(?=.\bABDUL\b)(?=.\bHAI\b.)(?=.\bMANSOOR\b).*$
Which means: At the beginning there must be any character which is not a part of a word, then ABDUL, a non word character, HAI, a non word character, MANSOOR, a non word character or the end of the string.
^[^[:alnum:]]ABDUL[^[:alnum:]]HAI[^[:alnum:]]MANSOOR([^[:alnum:]]?.*)?$
Which is: At the beginning, not a number or alphabet character (alphanumerical), ABDUL, one non-alphanumerical, HAI, one non-alphanumerical, MANSOOR one non-alphanumerical or the end of the string.
I did not test it and did not intended to make it 100% the same as the first one, but it should be close enough.
For anyone who would like to copy it to their code:
Matching the first character is not very common and can be a bug in the original regexp.
(?=...) is an "lookahead assertion" which does not consume any characters, the POSIX version does not have it, but for a simple string searching it may not be important.
Both versions should match strings like !ABDUL$HAI)MANSOOR - make sure that this is what you want.
For someone who would like to understand the regular expressions I used
https://dev.mysql.com/doc/refman/8.0/en/regexp.html for mysql (POSIX syntax) and https://docs.python.org/3/library/re.html for python (PCRE = Perl compatible syntax)
I'm making a dialogue system in gdscript and am struggling with escape characters, specifically '\n'.
I'm using CastleDB as, although not perfect, it has allowed me to have almost everything stored in data and will allow the person doing the writing for the game to do everything outside the engine, without me having to copy and paste stuff in.
I've hit a stumbling block with escape characters. A single text entry in CastleDB doesn't support spaces, and '\n' within the string prints to '\n', not a space, in the dialog box.
I've tried using the format string function with 'some text here {space} some more text', with the space referencing a string consisting of just \n. This still prints \n. If I feed some constant string with \n in the middle directly into the function which displays the dialog text, it adds a space so I'm not really sure what is going on here.
I don't have a computer science background (I've done some C up until pointers, at which point I decided to return later).
Is there something going on in the background with my string in gdscript? It prints out just like you would expect a string to, apart from ignoring my escape characters.
Could it be something to do with the fact that it comes in as a JSON? As far as I'm aware, even if a string is chopped up and reassembled, it should still just behave like a string...?!
Anyway, I haven't included any code because I don't know what code you'd need to see. I'm hoping it's something simple that because I'm teaching myself as I go I just wasn't aware of, but can post code if it helps.
Thanks,
James
Escape sequences are a way of getting around issues with syntax. When you type a string in most programming languages, it starts with " and ends with another ". And it needs to stay on one line. Simple, right?
What if you want to put an actual " in your string? Or a new line? We need some way of telling the compiler, "hey, we want to insert a newline here, even though we can't use an actual newline character". So we use a \ to let the compiler know that the next character is part of an escape sequence.
This causes another problem: What if we literally want to put a backslash in a string? That's where the double backslash comes from: \\ is the escape sequence for \, since \ by itself has a special meaning.
But CastleDB (apparently, I'm not familiar with it) doesn't recognize escape sequences, so when you type \n it thinks you literally want \ followed by n. When it converts this to JSON, it inserts the \\ because JSON does recognize escape sequences.
GDScript also recognizes escape sequences, so print("Hello\nworld!") prints
Hello
world!
You could try input_string.replace("\\n", "\n") to replace the \n escape sequences.
I've solved this by looking at the way CastleDB data is stored on the project's github page.
For some reason "\n" was stored as "\\n" behind the scenes. Now that I know why it was printing weirdly I can change it, even though it feels like a messy solution!
To add even more weirdness to this whole backslash business, stack overflow displays a double backslash as a single backslash so I have to write \ \ \n minus the spaces to get \\n...
I'm sure there must be a reason, but it eludes me.
I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?
Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.PCRE compatibility may vary
Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
Outside a character class: . ^ $ * + ? ( ) [ { } \ |
Inside a character class: ^ - [ ]
BRE/ed/grep/sed
Outside a character class: . ^ $ * [ \
Inside a character class: ^ - [ ]
For literals, don't escape: + ? ( ) { } |
For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.
POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.
Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.
Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:
https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/
For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()
To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.
Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...
Each of this context assigned some characters with special functionality.
When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s).
Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.
E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file.
For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).
You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).
For Ionic (Typescript) you have to double slash in order to scape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.
to avoid having to worry about which regex variant and all the bespoke peculiarties, just use this generic function that covers every regex variant other than BRE (unless they have unicode multi-byte chars that are meta) :
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-#{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[#] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\#ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator
Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ ~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/
I'm trying to recognize quoting (citing) somebody's else sentence in a markdown text, which I have in my local copy of MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
it matches some unwanted data, which according to https://regex101.com/, it should not with this particular regular expression.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL workbench didn't show those return carriage and new line symbols unless copy-pasted here.
Can I normalize (remove \r and \n) with some update query ?
Is MySQL regex implementation different from POSIX standard regex ?
Do you have by any chances maximally clean solution for recognizing quoting in a markdown text ?
Thanks!
You've got an awful lot of parens in there. Try this as functionally what you have above:
select * from github_discussions where body rlike '.*[:blank:]+>[:blank:]+.+'
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[:blank:]*>[:blank:]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)
I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).
Since PHP doesn't directly handle UTF-8, I'm using:
$value = iconv ('UTF-8', 'ISO-8859-1', $value);
to convert the name to ISO-8859-1, followed by a massive strstr statement to replace any accented character by its unaccented equivalent (à becomes a, for example).
However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:
Unknown error type: [8]
iconv() [function.iconv]: Detected an illegal character in input string
To get rid of the smart quotes before using iconv, I have tried using three statements like:
$value = str_replace('’', "'", $value);
(’ is the raw value of a UTF-8 smart single quote)
Because the text file is so long, these str_replace's cause the script to time out every single time.
What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?
Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?
Glibc (and the GNU libiconv) supports //TRANSLIT and //IGNORE suffixes.
Thus, on Linux, this works just fine:
$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'
I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
What do you mean by "link-friendly"? Only way that makes sense to me, since the text between <a>...</a> tags can be anything, is actually "URL-friendly", similar to SO's URLs where everything is converted to [a-z-].
If that's what you're going for, you'll need a transliteration library, not a character set conversion library. (I've had no luck getting iconv() to do the work in the past, but I haven't tried in a while.) There's a beta PHP extension translit that probably does the job.
If you can't add extensions to your PHP install, you'll have to look for a PHP library that does the same thing. I haven't used it, but the PHP UTF-8 library implements a utf8_to_ascii library that I assume does something like what you need.
(Also, if iconv() is failing like you said, it means that your input isn't actually valid UTF-8, so no amount of replacing valid UTF-8 with anything else will help the problem. EDIT: I may take that back: if ephemient's answer is correct, the iconv error you're seeing may very well be because there's no direct representation of the character in the destination character set. So, nevermind.)
Have you considered using MySQL's REPLACE string function to change the offending strings into apostrophes, or whatever? You may be able to put together the "string to be replaced" part e.g. by using CONCAT on CHAR calls...