When use parameters in JSON POST data in Jmeter, another json entity is changed [duplicate] - json

I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.PCRE compatibility may vary
    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]
BRE/ed/grep/sed
    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live

Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.

POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.

Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.

Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.

Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:

https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/

For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()

To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.
Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...
Each of this context assigned some characters with special functionality.
When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s).
Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.
E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file.
For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).
You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).

For Ionic (Typescript) you have to double slash in order to scape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.

to avoid having to worry about which regex variant and all the bespoke peculiarties, just use this generic function that covers every regex variant other than BRE (unless they have unicode multi-byte chars that are meta) :
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-#{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[#] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\#ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator

Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ ~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/

Related

how can I pass a json as a command after -x in wscat [duplicate]

Should or should I not wrap quotes around variables in a shell script?
For example, is the following correct:
xdg-open $URL
[ $? -eq 2 ]
or
xdg-open "$URL"
[ "$?" -eq "2" ]
And if so, why?
General rule: quote it if it can either be empty or contain spaces (or any whitespace really) or special characters (wildcards). Not quoting strings with spaces often leads to the shell breaking apart a single argument into many.
$? doesn't need quotes since it's a numeric value. Whether $URL needs it depends on what you allow in there and whether you still want an argument if it's empty.
I tend to always quote strings just out of habit since it's safer that way.
In short, quote everything where you do not require the shell to perform word splitting and wildcard expansion.
Single quotes protect the text between them verbatim. It is the proper tool when you need to ensure that the shell does not touch the string at all. Typically, it is the quoting mechanism of choice when you do not require variable interpolation.
$ echo 'Nothing \t in here $will change'
Nothing \t in here $will change
$ grep -F '#&$*!!' file /dev/null
file:I can't get this #&$*!! quoting right.
Double quotes are suitable when variable interpolation is required. With suitable adaptations, it is also a good workaround when you need single quotes in the string. (There is no straightforward way to escape a single quote between single quotes, because there is no escape mechanism inside single quotes -- if there was, they would not quote completely verbatim.)
$ echo "There is no place like '$HOME'"
There is no place like '/home/me'
No quotes are suitable when you specifically require the shell to perform word splitting and/or wildcard expansion.
Word splitting (aka token splitting);
$ words="foo bar baz"
$ for word in $words; do
> echo "$word"
> done
foo
bar
baz
By contrast:
$ for word in "$words"; do echo "$word"; done
foo bar baz
(The loop only runs once, over the single, quoted string.)
$ for word in '$words'; do echo "$word"; done
$words
(The loop only runs once, over the literal single-quoted string.)
Wildcard expansion:
$ pattern='file*.txt'
$ ls $pattern
file1.txt file_other.txt
By contrast:
$ ls "$pattern"
ls: cannot access file*.txt: No such file or directory
(There is no file named literally file*.txt.)
$ ls '$pattern'
ls: cannot access $pattern: No such file or directory
(There is no file named $pattern, either!)
In more concrete terms, anything containing a filename should usually be quoted (because filenames can contain whitespace and other shell metacharacters). Anything containing a URL should usually be quoted (because many URLs contain shell metacharacters like ? and &). Anything containing a regex should usually be quoted (ditto ditto). Anything containing significant whitespace other than single spaces between non-whitespace characters needs to be quoted (because otherwise, the shell will munge the whitespace into, effectively, single spaces, and trim any leading or trailing whitespace).
When you know that a variable can only contain a value which contains no shell metacharacters, quoting is optional. Thus, an unquoted $? is basically fine, because this variable can only ever contain a single number. However, "$?" is also correct, and recommended for general consistency and correctness (though this is my personal recommendation, not a widely recognized policy).
Values which are not variables basically follow the same rules, though you could then also escape any metacharacters instead of quoting them. For a common example, a URL with a & in it will be parsed by the shell as a background command unless the metacharacter is escaped or quoted:
$ wget http://example.com/q&uack
[1] wget http://example.com/q
-bash: uack: command not found
(Of course, this also happens if the URL is in an unquoted variable.) For a static string, single quotes make the most sense, although any form of quoting or escaping works here.
wget 'http://example.com/q&uack' # Single quotes preferred for a static string
wget "http://example.com/q&uack" # Double quotes work here, too (no $ or ` in the value)
wget http://example.com/q\&uack # Backslash escape
wget http://example.com/q'&'uack # Only the metacharacter really needs quoting
The last example also suggests another useful concept, which I like to call "seesaw quoting". If you need to mix single and double quotes, you can use them adjacent to each other. For example, the following quoted strings
'$HOME '
"isn't"
' where `<3'
"' is."
can be pasted together back to back, forming a single long string after tokenization and quote removal.
$ echo '$HOME '"isn't"' where `<3'"' is."
$HOME isn't where `<3' is.
This isn't awfully legible, but it's a common technique and thus good to know.
As an aside, scripts should usually not use ls for anything. To expand a wildcard, just ... use it.
$ printf '%s\n' $pattern # not ``ls -1 $pattern''
file1.txt
file_other.txt
$ for file in $pattern; do # definitely, definitely not ``for file in $(ls $pattern)''
> printf 'Found file: %s\n' "$file"
> done
Found file: file1.txt
Found file: file_other.txt
(The loop is completely superfluous in the latter example; printf specifically works fine with multiple arguments. stat too. But looping over a wildcard match is a common problem, and frequently done incorrectly.)
A variable containing a list of tokens to loop over or a wildcard to expand is less frequently seen, so we sometimes abbreviate to "quote everything unless you know precisely what you are doing".
Here is a three-point formula for quotes in general:
Double quotes
In contexts where we want to suppress word splitting and globbing. Also in contexts where we want the literal to be treated as a string, not a regex.
Single quotes
In string literals where we want to suppress interpolation and special treatment of backslashes. In other words, situations where using double quotes would be inappropriate.
No quotes
In contexts where we are absolutely sure that there are no word splitting or globbing issues or we do want word splitting and globbing.
Examples
Double quotes
literal strings with whitespace ("StackOverflow rocks!", "Steve's Apple")
variable expansions ("$var", "${arr[#]}")
command substitutions ("$(ls)", "`ls`")
globs where directory path or file name part includes spaces ("/my dir/"*)
to protect single quotes ("single'quote'delimited'string")
Bash parameter expansion ("${filename##*/}")
Single quotes
command names and arguments that have whitespace in them
literal strings that need interpolation to be suppressed ( 'Really costs $$!', 'just a backslash followed by a t: \t')
to protect double quotes ('The "crux"')
regex literals that need interpolation to be suppressed
use shell quoting for literals involving special characters ($'\n\t')
use shell quoting where we need to protect several single and double quotes ($'{"table": "users", "where": "first_name"=\'Steve\'}')
No quotes
around standard numeric variables ($$, $?, $# etc.)
in arithmetic contexts like ((count++)), "${arr[idx]}", "${string:start:length}"
inside [[ ]] expression which is free from word splitting and globbing issues (this is a matter of style and opinions can vary widely)
where we want word splitting (for word in $words)
where we want globbing (for txtfile in *.txt; do ...)
where we want ~ to be interpreted as $HOME (~/"some dir" but not "~/some dir")
See also:
Difference between single and double quotes in Bash
What are the special dollar sign shell variables?
Quotes and escaping - Bash Hackers' Wiki
When is double quoting necessary?
I generally use quoted like "$var" for safe, unless I am sure that $var does not contain space.
I do use $var as a simple way to join lines:
lines="`cat multi-lines-text-file.txt`"
echo "$lines" ## multiple lines
echo $lines ## all spaces (including newlines) are zapped
Whenever the https://www.shellcheck.net/ plugin for your editor tells you to.

two back slash in SQL procedure with Regular expression

i'm working on a SQL procedure where i came across a Regular expression which contains this piece -
REGEXP '\\s*count\\s*\\('
i do not understand why there are two backslash ? what can be inferred from this code.
Backslash is processed at multiple levels.
It's the escape prefix in regular expressions: it makes special characters like . and ( be treated literally, and is used to created escape sequences like \s (which means any whitespace character).
But it's also the escape prefix in string literals, used for things like \n (newline) and \b (backspace). So in order for the regular expression character to get a literal backslash, you need to escape the backslash itself.
This is common in many programming languages, although a few have "raw string literals" where escape sequences are not processed, specifically to avoid having to double the slashes so much.

PowerShell 5.1: how to extract text from nested HTML elements? [duplicate]

For example, this regex
(.*)<FooBar>
will match:
abcde<FooBar>
But how do I get it to match across multiple lines?
abcde
fghij<FooBar>
Try this:
((.|\n)*)<FooBar>
It basically says "any character or a newline" repeated zero or more times.
It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
/(.*)<FooBar>/s
The s at the end causes the dot to match all characters including newlines.
The question is, can the . pattern match any character? The answer varies from engine to engine. The main difference is whether the pattern is used by a POSIX or non-POSIX regex library.
A special note about lua-patterns: they are not considered regular expressions, but . matches any character there, the same as POSIX-based engines.
Another note on matlab and octave: the . matches any character by default (demo): str = "abcde\n fghij<Foobar>"; expression = '(.*)<Foobar>*'; [tokens,matches] = regexp(str,expression,'tokens','match'); (tokens contain a abcde\n fghij item).
Also, in all of boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m (source).
As for oracle (it is POSIX based), use the n option (demo): select regexp_substr('abcde' || chr(10) ||' fghij<Foobar>', '(.*)<Foobar>', 1, 1, 'n', 1) as results from dual
POSIX-based engines:
A mere . already matches line breaks, so there isn't a need to use any modifiers, see bash (demo).
The tcl (demo), postgresql (demo), r (TRE, base R default engine with no perl=TRUE, for base R with perl=TRUE or for stringr/stringi patterns, use the (?s) inline modifier) (demo) also treat . the same way.
However, most POSIX-based tools process input line by line. Hence, . does not match the line breaks just because they are not in scope. Here are some examples how to override this:
sed - There are multiple workarounds. The most precise, but not very safe, is sed 'H;1h;$!d;x; s/\(.*\)><Foobar>/\1/' (H;1h;$!d;x; slurps the file into memory). If whole lines must be included, sed '/start_pattern/,/end_pattern/d' file (removing from start will end with matched lines included) or sed '/start_pattern/,/end_pattern/{{//!d;};}' file (with matching lines excluded) can be considered.
perl - perl -0pe 's/(.*)<FooBar>/$1/gs' <<< "$str" (-0 slurps the whole file into memory, -p prints the file after applying the script given by -e). Note that using -000pe will slurp the file and activate 'paragraph mode' where Perl uses consecutive newlines (\n\n) as the record separator.
gnu-grep - grep -Poz '(?si)abc\K.*?(?=<Foobar>)' file. Here, z enables file slurping, (?s) enables the DOTALL mode for the . pattern, (?i) enables case insensitive mode, \K omits the text matched so far, *? is a lazy quantifier, (?=<Foobar>) matches the location before <Foobar>.
pcregrep - pcregrep -Mi "(?si)abc\K.*?(?=<Foobar>)" file (M enables file slurping here). Note pcregrep is a good solution for macOS grep users.
See demos.
Non-POSIX-based engines:
php - Use the s modifier PCRE_DOTALL modifier: preg_match('~(.*)<Foobar>~s', $s, $m) (demo)
c# - Use RegexOptions.Singleline flag (demo): - var result = Regex.Match(s, #"(.*)<Foobar>", RegexOptions.Singleline).Groups[1].Value;- var result = Regex.Match(s, #"(?s)(.*)<Foobar>").Groups[1].Value;
powershell - Use the (?s) inline option: $s = "abcde`nfghij<FooBar>"; $s -match "(?s)(.*)<Foobar>"; $matches[1]
perl - Use the s modifier (or (?s) inline version at the start) (demo): /(.*)<FooBar>/s
python - Use the re.DOTALL (or re.S) flags or (?s) inline modifier (demo): m = re.search(r"(.*)<FooBar>", s, flags=re.S) (and then if m:, print(m.group(1)))
java - Use Pattern.DOTALL modifier (or inline (?s) flag) (demo): Pattern.compile("(.*)<FooBar>", Pattern.DOTALL)
kotlin - Use RegexOption.DOT_MATCHES_ALL : "(.*)<FooBar>".toRegex(RegexOption.DOT_MATCHES_ALL)
groovy - Use (?s) in-pattern modifier (demo): regex = /(?s)(.*)<FooBar>/
scala - Use (?s) modifier (demo): "(?s)(.*)<Foobar>".r.findAllIn("abcde\n fghij<Foobar>").matchData foreach { m => println(m.group(1)) }
javascript - Use [^] or workarounds [\d\D] / [\w\W] / [\s\S] (demo): s.match(/([\s\S]*)<FooBar>/)[1]
c++ (std::regex) Use [\s\S] or the JavaScript workarounds (demo): regex rex(R"(([\s\S]*)<FooBar>)");
vba vbscript - Use the same approach as in JavaScript, ([\s\S]*)<Foobar>. (NOTE: The MultiLine property of the RegExp object is sometimes erroneously thought to be the option to allow . match across line breaks, while, in fact, it only changes the ^ and $ behavior to match start/end of lines rather than strings, the same as in JavaScript regex)
behavior.)
ruby - Use the /m MULTILINE modifier (demo): s[/(.*)<Foobar>/m, 1]
rtrebase-r - Base R PCRE regexps - use (?s): regmatches(x, regexec("(?s)(.*)<FooBar>",x, perl=TRUE))[[1]][2] (demo)
ricustringrstringi - in stringr/stringi regex funtions that are powered with the ICU regex engine. Also use (?s): stringr::str_match(x, "(?s)(.*)<FooBar>")[,2] (demo)
go - Use the inline modifier (?s) at the start (demo): re: = regexp.MustCompile(`(?s)(.*)<FooBar>`)
swift - Use dotMatchesLineSeparators or (easier) pass the (?s) inline modifier to the pattern: let rx = "(?s)(.*)<Foobar>"
objective-c - The same as Swift. (?s) works the easiest, but here is how the option can be used: NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionDotMatchesLineSeparators error:&regexError];
re2, google-apps-script - Use the (?s) modifier (demo): "(?s)(.*)<Foobar>" (in Google Spreadsheets, =REGEXEXTRACT(A2,"(?s)(.*)<Foobar>"))
NOTES ON (?s):
In most non-POSIX engines, the (?s) inline modifier (or embedded flag option) can be used to enforce . to match line breaks.
If placed at the start of the pattern, (?s) changes the bahavior of all . in the pattern. If the (?s) is placed somewhere after the beginning, only those .s will be affected that are located to the right of it unless this is a pattern passed to Python's re. In Python re, regardless of the (?s) location, the whole pattern . is affected. The (?s) effect is stopped using (?-s). A modified group can be used to only affect a specified range of a regex pattern (e.g., Delim1(?s:.*?)\nDelim2.* will make the first .*? match across newlines and the second .* will only match the rest of the line).
POSIX note:
In non-POSIX regex engines, to match any character, [\s\S] / [\d\D] / [\w\W] constructs can be used.
In POSIX, [\s\S] is not matching any character (as in JavaScript or any non-POSIX engine), because regex escape sequences are not supported inside bracket expressions. [\s\S] is parsed as bracket expressions that match a single character, \ or s or S.
If you're using Eclipse search, you can enable the "DOTALL" option to make '.' match any character including line delimiters: just add "(?s)" at the beginning of your search string. Example:
(?s).*<FooBar>
In many regex dialects, /[\S\s]*<Foobar>/ will do just what you want. Source
([\s\S]*)<FooBar>
The dot matches all except newlines (\r\n). So use \s\S, which will match ALL characters.
We can also use
(.*?\n)*?
to match everything including newline without being greedy.
This will make the new line optional
(.*?|\n)*?
In Ruby you can use the 'm' option (multiline):
/YOUR_REGEXP/m
See the Regexp documentation on ruby-doc.org for more information.
"." normally doesn't match line-breaks. Most regex engines allows you to add the S-flag (also called DOTALL and SINGLELINE) to make "." also match newlines.
If that fails, you could do something like [\S\s].
For Eclipse, the following expression worked:
Foo
jadajada Bar"
Regular expression:
Foo[\S\s]{1,10}.*Bar*
Note that (.|\n)* can be less efficient than (for example) [\s\S]* (if your language's regexes support such escapes) and than finding how to specify the modifier that makes . also match newlines. Or you can go with POSIXy alternatives like [[:space:][:^space:]]*.
Use:
/(.*)<FooBar>/s
The s causes dot (.) to match carriage returns.
Use RegexOptions.Singleline. It changes the meaning of . to include newlines.
Regex.Replace(content, searchText, replaceText, RegexOptions.Singleline);
In notepad++ you can use this
<table (.|\r\n)*</table>
It will match the entire table starting from
rows and columns
You can make it greedy, using the following, that way it will match the first, second and so forth tables and not all at once
<table (.|\r\n)*?</table>
In a Java-based regular expression, you can use [\s\S].
This works for me and is the simplest one:
(\X*)<FooBar>
Generally, . doesn't match newlines, so try ((.|\n)*)<foobar>.
In JavaScript you can use [^]* to search for zero to infinite characters, including line breaks.
$("#find_and_replace").click(function() {
var text = $("#textarea").val();
search_term = new RegExp("[^]*<Foobar>", "gi");;
replace_term = "Replacement term";
var new_text = text.replace(search_term, replace_term);
$("#textarea").val(new_text);
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<button id="find_and_replace">Find and replace</button>
<br>
<textarea ID="textarea">abcde
fghij<Foobar></textarea>
Solution:
Use pattern modifier sU will get the desired matching in PHP.
Example:
preg_match('/(.*)/sU', $content, $match);
Sources:
Pattern Modifiers
In the context of use within languages, regular expressions act on strings, not lines. So you should be able to use the regex normally, assuming that the input string has multiple lines.
In this case, the given regex will match the entire string, since "<FooBar>" is present. Depending on the specifics of the regex implementation, the $1 value (obtained from the "(.*)") will either be "fghij" or "abcde\nfghij". As others have said, some implementations allow you to control whether the "." will match the newline, giving you the choice.
Line-based regular expression use is usually for command line things like egrep.
Try: .*\n*.*<FooBar> assuming you are also allowing blank newlines. As you are allowing any character including nothing before <FooBar>.
I had the same problem and solved it in probably not the best way but it works. I replaced all line breaks before I did my real match:
mystring = Regex.Replace(mystring, "\r\n", "")
I am manipulating HTML so line breaks don't really matter to me in this case.
I tried all of the suggestions above with no luck. I am using .NET 3.5 FYI.
I wanted to match a particular if block in Java:
...
...
if(isTrue){
doAction();
}
...
...
}
If I use the regExp
if \(isTrue(.|\n)*}
it included the closing brace for the method block, so I used
if \(!isTrue([^}.]|\n)*}
to exclude the closing brace from the wildcard match.
Often we have to modify a substring with a few keywords spread across lines preceding the substring. Consider an XML element:
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>81</PercentComplete>
</TASK>
Suppose we want to modify the 81, to some other value, say 40. First identify .UID.21..UID., then skip all characters including \n till .PercentCompleted.. The regular expression pattern and the replace specification are:
String hw = new String("<TASK>\n <UID>21</UID>\n <Name>Architectural design</Name>\n <PercentComplete>81</PercentComplete>\n</TASK>");
String pattern = new String ("(<UID>21</UID>)((.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
String replaceSpec = new String ("$1$2$440$6");
// Note that the group (<PercentComplete>) is $4 and the group ((.|\n)*?) is $2.
String iw = hw.replaceFirst(pattern, replaceSpec);
System.out.println(iw);
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>40</PercentComplete>
</TASK>
The subgroup (.|\n) is probably the missing group $3. If we make it non-capturing by (?:.|\n) then the $3 is (<PercentComplete>). So the pattern and replaceSpec can also be:
pattern = new String("(<UID>21</UID>)((?:.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
replaceSpec = new String("$1$2$340$5")
and the replacement works correctly as before.
Typically searching for three consecutive lines in PowerShell, it would look like:
$file = Get-Content file.txt -raw
$pattern = 'lineone\r\nlinetwo\r\nlinethree\r\n' # "Windows" text
$pattern = 'lineone\nlinetwo\nlinethree\n' # "Unix" text
$pattern = 'lineone\r?\nlinetwo\r?\nlinethree\r?\n' # Both
$file -match $pattern
# output
True
Bizarrely, this would be Unix text at the prompt, but Windows text in a file:
$pattern = 'lineone
linetwo
linethree
'
Here's a way to print out the line endings:
'lineone
linetwo
linethree
' -replace "`r",'\r' -replace "`n",'\n'
# Output
lineone\nlinetwo\nlinethree\n
Option 1
One way would be to use the s flag (just like the accepted answer):
/(.*)<FooBar>/s
Demo 1
Option 2
A second way would be to use the m (multiline) flag and any of the following patterns:
/([\s\S]*)<FooBar>/m
or
/([\d\D]*)<FooBar>/m
or
/([\w\W]*)<FooBar>/m
Demo 2
RegEx Circuit
jex.im visualizes regular expressions:

Remove backslash from nested json [duplicate]

When I create a string containing backslashes, they get duplicated:
>>> my_string = "why\does\it\happen?"
>>> my_string
'why\\does\\it\\happen?'
Why?
What you are seeing is the representation of my_string created by its __repr__() method. If you print it, you can see that you've actually got single backslashes, just as you intended:
>>> print(my_string)
why\does\it\happen?
The string below has three characters in it, not four:
>>> 'a\\b'
'a\\b'
>>> len('a\\b')
3
You can get the standard representation of a string (or any other object) with the repr() built-in function:
>>> print(repr(my_string))
'why\\does\\it\\happen?'
Python represents backslashes in strings as \\ because the backslash is an escape character - for instance, \n represents a newline, and \t represents a tab.
This can sometimes get you into trouble:
>>> print("this\text\is\not\what\it\seems")
this ext\is
ot\what\it\seems
Because of this, there needs to be a way to tell Python you really want the two characters \n rather than a newline, and you do that by escaping the backslash itself, with another one:
>>> print("this\\text\is\what\you\\need")
this\text\is\what\you\need
When Python returns the representation of a string, it plays safe, escaping all backslashes (even if they wouldn't otherwise be part of an escape sequence), and that's what you're seeing. However, the string itself contains only single backslashes.
More information about Python's string literals can be found at: String and Bytes literals in the Python documentation.
As Zero Piraeus's answer explains, using single backslashes like this (outside of raw string literals) is a bad idea.
But there's an additional problem: in the future, it will be an error to use an undefined escape sequence like \d, instead of meaning a literal backslash followed by a d. So, instead of just getting lucky that your string happened to use \d instead of \t so it did what you probably wanted, it will definitely not do what you want.
As of 3.6, it already raises a DeprecationWarning, although most people don't see those. It will become a SyntaxError in some future version.
In many other languages, including C, using a backslash that doesn't start an escape sequence means the backslash is ignored.
In a few languages, including Python, a backslash that doesn't start an escape sequence is a literal backslash.
In some languages, to avoid confusion about whether the language is C-like or Python-like, and to avoid the problem with \Foo working but \foo not working, a backslash that doesn't start an escape sequence is illegal.

html Regex can't select content between quote tags [duplicate]

For example, this regex
(.*)<FooBar>
will match:
abcde<FooBar>
But how do I get it to match across multiple lines?
abcde
fghij<FooBar>
Try this:
((.|\n)*)<FooBar>
It basically says "any character or a newline" repeated zero or more times.
It depends on the language, but there should be a modifier that you can add to the regex pattern. In PHP it is:
/(.*)<FooBar>/s
The s at the end causes the dot to match all characters including newlines.
The question is, can the . pattern match any character? The answer varies from engine to engine. The main difference is whether the pattern is used by a POSIX or non-POSIX regex library.
A special note about lua-patterns: they are not considered regular expressions, but . matches any character there, the same as POSIX-based engines.
Another note on matlab and octave: the . matches any character by default (demo): str = "abcde\n fghij<Foobar>"; expression = '(.*)<Foobar>*'; [tokens,matches] = regexp(str,expression,'tokens','match'); (tokens contain a abcde\n fghij item).
Also, in all of boost's regex grammars the dot matches line breaks by default. Boost's ECMAScript grammar allows you to turn this off with regex_constants::no_mod_m (source).
As for oracle (it is POSIX based), use the n option (demo): select regexp_substr('abcde' || chr(10) ||' fghij<Foobar>', '(.*)<Foobar>', 1, 1, 'n', 1) as results from dual
POSIX-based engines:
A mere . already matches line breaks, so there isn't a need to use any modifiers, see bash (demo).
The tcl (demo), postgresql (demo), r (TRE, base R default engine with no perl=TRUE, for base R with perl=TRUE or for stringr/stringi patterns, use the (?s) inline modifier) (demo) also treat . the same way.
However, most POSIX-based tools process input line by line. Hence, . does not match the line breaks just because they are not in scope. Here are some examples how to override this:
sed - There are multiple workarounds. The most precise, but not very safe, is sed 'H;1h;$!d;x; s/\(.*\)><Foobar>/\1/' (H;1h;$!d;x; slurps the file into memory). If whole lines must be included, sed '/start_pattern/,/end_pattern/d' file (removing from start will end with matched lines included) or sed '/start_pattern/,/end_pattern/{{//!d;};}' file (with matching lines excluded) can be considered.
perl - perl -0pe 's/(.*)<FooBar>/$1/gs' <<< "$str" (-0 slurps the whole file into memory, -p prints the file after applying the script given by -e). Note that using -000pe will slurp the file and activate 'paragraph mode' where Perl uses consecutive newlines (\n\n) as the record separator.
gnu-grep - grep -Poz '(?si)abc\K.*?(?=<Foobar>)' file. Here, z enables file slurping, (?s) enables the DOTALL mode for the . pattern, (?i) enables case insensitive mode, \K omits the text matched so far, *? is a lazy quantifier, (?=<Foobar>) matches the location before <Foobar>.
pcregrep - pcregrep -Mi "(?si)abc\K.*?(?=<Foobar>)" file (M enables file slurping here). Note pcregrep is a good solution for macOS grep users.
See demos.
Non-POSIX-based engines:
php - Use the s modifier PCRE_DOTALL modifier: preg_match('~(.*)<Foobar>~s', $s, $m) (demo)
c# - Use RegexOptions.Singleline flag (demo): - var result = Regex.Match(s, #"(.*)<Foobar>", RegexOptions.Singleline).Groups[1].Value;- var result = Regex.Match(s, #"(?s)(.*)<Foobar>").Groups[1].Value;
powershell - Use the (?s) inline option: $s = "abcde`nfghij<FooBar>"; $s -match "(?s)(.*)<Foobar>"; $matches[1]
perl - Use the s modifier (or (?s) inline version at the start) (demo): /(.*)<FooBar>/s
python - Use the re.DOTALL (or re.S) flags or (?s) inline modifier (demo): m = re.search(r"(.*)<FooBar>", s, flags=re.S) (and then if m:, print(m.group(1)))
java - Use Pattern.DOTALL modifier (or inline (?s) flag) (demo): Pattern.compile("(.*)<FooBar>", Pattern.DOTALL)
kotlin - Use RegexOption.DOT_MATCHES_ALL : "(.*)<FooBar>".toRegex(RegexOption.DOT_MATCHES_ALL)
groovy - Use (?s) in-pattern modifier (demo): regex = /(?s)(.*)<FooBar>/
scala - Use (?s) modifier (demo): "(?s)(.*)<Foobar>".r.findAllIn("abcde\n fghij<Foobar>").matchData foreach { m => println(m.group(1)) }
javascript - Use [^] or workarounds [\d\D] / [\w\W] / [\s\S] (demo): s.match(/([\s\S]*)<FooBar>/)[1]
c++ (std::regex) Use [\s\S] or the JavaScript workarounds (demo): regex rex(R"(([\s\S]*)<FooBar>)");
vba vbscript - Use the same approach as in JavaScript, ([\s\S]*)<Foobar>. (NOTE: The MultiLine property of the RegExp object is sometimes erroneously thought to be the option to allow . match across line breaks, while, in fact, it only changes the ^ and $ behavior to match start/end of lines rather than strings, the same as in JavaScript regex)
behavior.)
ruby - Use the /m MULTILINE modifier (demo): s[/(.*)<Foobar>/m, 1]
rtrebase-r - Base R PCRE regexps - use (?s): regmatches(x, regexec("(?s)(.*)<FooBar>",x, perl=TRUE))[[1]][2] (demo)
ricustringrstringi - in stringr/stringi regex funtions that are powered with the ICU regex engine. Also use (?s): stringr::str_match(x, "(?s)(.*)<FooBar>")[,2] (demo)
go - Use the inline modifier (?s) at the start (demo): re: = regexp.MustCompile(`(?s)(.*)<FooBar>`)
swift - Use dotMatchesLineSeparators or (easier) pass the (?s) inline modifier to the pattern: let rx = "(?s)(.*)<Foobar>"
objective-c - The same as Swift. (?s) works the easiest, but here is how the option can be used: NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionDotMatchesLineSeparators error:&regexError];
re2, google-apps-script - Use the (?s) modifier (demo): "(?s)(.*)<Foobar>" (in Google Spreadsheets, =REGEXEXTRACT(A2,"(?s)(.*)<Foobar>"))
NOTES ON (?s):
In most non-POSIX engines, the (?s) inline modifier (or embedded flag option) can be used to enforce . to match line breaks.
If placed at the start of the pattern, (?s) changes the bahavior of all . in the pattern. If the (?s) is placed somewhere after the beginning, only those .s will be affected that are located to the right of it unless this is a pattern passed to Python's re. In Python re, regardless of the (?s) location, the whole pattern . is affected. The (?s) effect is stopped using (?-s). A modified group can be used to only affect a specified range of a regex pattern (e.g., Delim1(?s:.*?)\nDelim2.* will make the first .*? match across newlines and the second .* will only match the rest of the line).
POSIX note:
In non-POSIX regex engines, to match any character, [\s\S] / [\d\D] / [\w\W] constructs can be used.
In POSIX, [\s\S] is not matching any character (as in JavaScript or any non-POSIX engine), because regex escape sequences are not supported inside bracket expressions. [\s\S] is parsed as bracket expressions that match a single character, \ or s or S.
If you're using Eclipse search, you can enable the "DOTALL" option to make '.' match any character including line delimiters: just add "(?s)" at the beginning of your search string. Example:
(?s).*<FooBar>
In many regex dialects, /[\S\s]*<Foobar>/ will do just what you want. Source
([\s\S]*)<FooBar>
The dot matches all except newlines (\r\n). So use \s\S, which will match ALL characters.
We can also use
(.*?\n)*?
to match everything including newline without being greedy.
This will make the new line optional
(.*?|\n)*?
In Ruby you can use the 'm' option (multiline):
/YOUR_REGEXP/m
See the Regexp documentation on ruby-doc.org for more information.
"." normally doesn't match line-breaks. Most regex engines allows you to add the S-flag (also called DOTALL and SINGLELINE) to make "." also match newlines.
If that fails, you could do something like [\S\s].
For Eclipse, the following expression worked:
Foo
jadajada Bar"
Regular expression:
Foo[\S\s]{1,10}.*Bar*
Note that (.|\n)* can be less efficient than (for example) [\s\S]* (if your language's regexes support such escapes) and than finding how to specify the modifier that makes . also match newlines. Or you can go with POSIXy alternatives like [[:space:][:^space:]]*.
Use:
/(.*)<FooBar>/s
The s causes dot (.) to match carriage returns.
Use RegexOptions.Singleline. It changes the meaning of . to include newlines.
Regex.Replace(content, searchText, replaceText, RegexOptions.Singleline);
In notepad++ you can use this
<table (.|\r\n)*</table>
It will match the entire table starting from
rows and columns
You can make it greedy, using the following, that way it will match the first, second and so forth tables and not all at once
<table (.|\r\n)*?</table>
In a Java-based regular expression, you can use [\s\S].
This works for me and is the simplest one:
(\X*)<FooBar>
Generally, . doesn't match newlines, so try ((.|\n)*)<foobar>.
In JavaScript you can use [^]* to search for zero to infinite characters, including line breaks.
$("#find_and_replace").click(function() {
var text = $("#textarea").val();
search_term = new RegExp("[^]*<Foobar>", "gi");;
replace_term = "Replacement term";
var new_text = text.replace(search_term, replace_term);
$("#textarea").val(new_text);
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<button id="find_and_replace">Find and replace</button>
<br>
<textarea ID="textarea">abcde
fghij<Foobar></textarea>
Solution:
Use pattern modifier sU will get the desired matching in PHP.
Example:
preg_match('/(.*)/sU', $content, $match);
Sources:
Pattern Modifiers
In the context of use within languages, regular expressions act on strings, not lines. So you should be able to use the regex normally, assuming that the input string has multiple lines.
In this case, the given regex will match the entire string, since "<FooBar>" is present. Depending on the specifics of the regex implementation, the $1 value (obtained from the "(.*)") will either be "fghij" or "abcde\nfghij". As others have said, some implementations allow you to control whether the "." will match the newline, giving you the choice.
Line-based regular expression use is usually for command line things like egrep.
Try: .*\n*.*<FooBar> assuming you are also allowing blank newlines. As you are allowing any character including nothing before <FooBar>.
I had the same problem and solved it in probably not the best way but it works. I replaced all line breaks before I did my real match:
mystring = Regex.Replace(mystring, "\r\n", "")
I am manipulating HTML so line breaks don't really matter to me in this case.
I tried all of the suggestions above with no luck. I am using .NET 3.5 FYI.
I wanted to match a particular if block in Java:
...
...
if(isTrue){
doAction();
}
...
...
}
If I use the regExp
if \(isTrue(.|\n)*}
it included the closing brace for the method block, so I used
if \(!isTrue([^}.]|\n)*}
to exclude the closing brace from the wildcard match.
Often we have to modify a substring with a few keywords spread across lines preceding the substring. Consider an XML element:
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>81</PercentComplete>
</TASK>
Suppose we want to modify the 81, to some other value, say 40. First identify .UID.21..UID., then skip all characters including \n till .PercentCompleted.. The regular expression pattern and the replace specification are:
String hw = new String("<TASK>\n <UID>21</UID>\n <Name>Architectural design</Name>\n <PercentComplete>81</PercentComplete>\n</TASK>");
String pattern = new String ("(<UID>21</UID>)((.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
String replaceSpec = new String ("$1$2$440$6");
// Note that the group (<PercentComplete>) is $4 and the group ((.|\n)*?) is $2.
String iw = hw.replaceFirst(pattern, replaceSpec);
System.out.println(iw);
<TASK>
<UID>21</UID>
<Name>Architectural design</Name>
<PercentComplete>40</PercentComplete>
</TASK>
The subgroup (.|\n) is probably the missing group $3. If we make it non-capturing by (?:.|\n) then the $3 is (<PercentComplete>). So the pattern and replaceSpec can also be:
pattern = new String("(<UID>21</UID>)((?:.|\n)*?)(<PercentComplete>)(\\d+)(</PercentComplete>)");
replaceSpec = new String("$1$2$340$5")
and the replacement works correctly as before.
Typically searching for three consecutive lines in PowerShell, it would look like:
$file = Get-Content file.txt -raw
$pattern = 'lineone\r\nlinetwo\r\nlinethree\r\n' # "Windows" text
$pattern = 'lineone\nlinetwo\nlinethree\n' # "Unix" text
$pattern = 'lineone\r?\nlinetwo\r?\nlinethree\r?\n' # Both
$file -match $pattern
# output
True
Bizarrely, this would be Unix text at the prompt, but Windows text in a file:
$pattern = 'lineone
linetwo
linethree
'
Here's a way to print out the line endings:
'lineone
linetwo
linethree
' -replace "`r",'\r' -replace "`n",'\n'
# Output
lineone\nlinetwo\nlinethree\n
Option 1
One way would be to use the s flag (just like the accepted answer):
/(.*)<FooBar>/s
Demo 1
Option 2
A second way would be to use the m (multiline) flag and any of the following patterns:
/([\s\S]*)<FooBar>/m
or
/([\d\D]*)<FooBar>/m
or
/([\w\W]*)<FooBar>/m
Demo 2
RegEx Circuit
jex.im visualizes regular expressions: