In OpenGrok how do you do a full search for special non-alphanumeric characters - opengrok

I am trying to search my codebase for code that calls a function named "foo" so I am searching for "foo(" but the results I'm getting includes everything with the word foo in it which includes css, comments and strings that don't even have the trailing open parenthesis.
Anyone know how to do a search for strings that include special characters like ),"'?

When searching for special characters, try using escape character before the character, i.e. \, e.g. "foo\(".
Additionally, I found a reply for a similar question (see http://marc.info/?l=opensolaris-opengrok-discuss&m=115776447032671). It seems that frequently occurring special characters are not indexed because of performance issues, therefore it might not be possible to effectively search for such pattern.

Opengrok supports escaping special characters that are part of the query syntax. Current special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character. For example to search for (1+1):2 use the query \(1\+1)\:2

Related

MySQL 8.0.30 Regular Expression Word Matching with Special Characters

While there's a told of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') matches . ^ $ * + - ? ( ) [ ] { } \ | chars (adds a \ before them) and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase start/ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

MySQL search and replace specific character combinations

NOTE: This is hilarious - I've had to update this post multiple times because it doesn't properly display combinations of the \ character :)
I need to be able to search and replace a specific set of characters without compromising specific allowed combinations or the redundancy of those characters. Let's take a core escape character \ for example.
A string that needs processing may look like this:
"This is \ a test! \\r\\n Let's have \r \n \\\ Some fun!"
Now I need to address each \ for translation to JSON (each \ needs to be \\\\), but at the same time, I don't want to touch the \r or \n. I also want to be able to differentiate between \r and \n vs \r and \n. Is there a method using REGEXP I could adopt that would take the above and convert it to:
"This is \\\\ a test! \\r\\n Let's have \\r \\n \\\\\\\\\\\\ Some fun!"
Note I want each escape \ to be 4x \, and also to ensure special escaped characters are double escaped, but not compound them (the 4x only needs to be applied to stand alone \'s).
What it comes down to is I can't control the data that's coming in, but I can control scrubbing it. I could get weird data with \/////\ by somebody who was just having fun. I need to be able to scrub that as a TEXT value and prepare it for insertion into the database via a dynamically created SQL statement that's executed, which means a single \ needs to be \\\\ and the / is ignored (for example).
I'm thinking I first need to do a specific scan of special escape sequences (such as \', \b, \\, \r, etc.) but at the same time verify they aren't already double escaped. I need need to ensure the additional scans of \ don't meet any special standards (and are just escaped on their own).
I'm hoping somebody has already dealt with this and there's an existing function or SP designed to do this sort of thing so I'm not reinventing the wheel.
Thanks!

When use parameters in JSON POST data in Jmeter, another json entity is changed [duplicate]

I am tired of always trying to guess, if I should escape special characters like '()[]{}|' etc. when using many implementations of regexps.
It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?
Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.
For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:
.^$*+?()[{\|
and these inside character classes:
^-]\
For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):
.^$*+?()[{\|
Escaping any other characters is an error with POSIX ERE.
Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:
[]^-]
In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:
.^$*[\
Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and +. Escaping a character other than .^$*(){} is normally an error with BREs.
Inside character classes, BREs follow the same rule as EREs.
If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.
Modern RegEx Flavors (PCRE)
Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.PCRE compatibility may vary
    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |
Legacy RegEx Flavors (BRE/ERE)
Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.PCRE support may be enabled in later versions or by using extensions
ERE/awk/egrep/emacs
    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]
BRE/ed/grep/sed
    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|
Notes
If unsure about a specific character, it can be escaped like \xFF
Alphanumeric characters cannot be escaped with a backslash
Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
Unfortunately there really isn't a set set of escape codes since it varies based on the language you are using.
However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.
POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.
There isn't a simple rule for when to use which notation, or even which notation a given command uses.
Check out Jeff Friedl's Mastering Regular Expressions book.
Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.
So you really have to know what style you are trying to quote.
Really, there isn't. there are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely
sed -e 's/foo\(bar/something_else/'
I tend to just use a simple character class definition instead, so the above expression becomes
sed -e 's/foo[(]bar/something_else/'
which I find works for most regexp implementations.
BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.
You might like to look at "the shiny ball book" aka Effective Perl (sanitised Amazon link), specifically the chapter on regular expressions, to get a feel for then difference in regexp engine evaluation types.
Not all the world's a PCRE!
Anyway, regexp's are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.
Ah the joys of studying at UNSW in the late '70's! (-:
https://perldoc.perl.org/perlre.html#Quoting-metacharacters and https://perldoc.perl.org/functions/quotemeta.html
In the official documentation, such characters are called metacharacters. Example of quoting:
my $regex = quotemeta($string)
s/$regex/something/
For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.
Except if it's a " or '. :/
To escape regex pattern variables (or partial variables) in PHP use preg_quote()
To know when and what to escape without attempts is necessary to understand precisely the chain of contexts the string pass through. You will specify the string from the farthest side to its final destination which is the memory handled by the regexp parsing code.
Be aware how the string in memory is processed: if can be a plain string inside the code, or a string entered to the command line, but a could be either an interactive command line or a command line stated inside a shell script file, or inside a variable in memory mentioned by the code, or an (string)argument through further evaluation, or a string containing code generated dynamically with any sort of encapsulation...
Each of this context assigned some characters with special functionality.
When you want to pass the character literally without using its special function (local to the context), than that's the case you have to escape it, for the next context... which might need some other escape characters which might additionally need to be escaped in the preceding context(s).
Furthermore there can be things like character encoding (the most insidious is utf-8 because it look like ASCII for common characters, but might be optionally interpreted even by the terminal depending on its settings so it might behave differently, then the encoding attribute of HTML/XML, it's necessary to understand the process precisely right.
E.g. A regexp in the command line starting with perl -npe, needs to be transferred to a set of exec system calls connecting as pipe the file handles, each of this exec system calls just has a list of arguments that were separated by (non escaped)spaces, and possibly pipes(|) and redirection (> N> N>&M), parenthesis, interactive expansion of * and ?, $(()) ... (all this are special characters used by the *sh which might appear to interfere with the character of the regular expression in the next context, but they are evaluated in order: before the command line. The command line is read by a program as bash/sh/csh/tcsh/zsh, essentially inside double quote or single quote the escape is simpler but it is not necessary to quote a string in the command line because mostly the space has to be prefixed with backslash and the quote are not necessary leaving available the expand functionality for characters * and ?, but this parse as different context as within quote. Then when the command line is evaluated the regexp obtained in memory (not as written in the command line) receives the same treatment as it would be in a source file.
For regexp there is character-set context within square brackets [ ], perl regular expression can be quoted by a large set of non alfa-numeric characters (E.g. m// or m:/better/for/path: ...).
You have more details about characters in other answer, which are very specific to the final regexp context. As I noted you mention that you find the regexp escape with attempts, that's probably because different context has different set of character that confused your memory of attempts (often backslash is the character used in those different context to escape a literal character instead of its function).
For Ionic (Typescript) you have to double slash in order to scape the characters.
For example (this is to match some special characters):
"^(?=.*[\\]\\[!¡\'=ªº\\-\\_ç##$%^&*(),;\\.?\":{}|<>\+\\/])"
Pay attention to this ] [ - _ . / characters. They have to be double slashed. If you don't do that, you are going to have a type error in your code.
to avoid having to worry about which regex variant and all the bespoke peculiarties, just use this generic function that covers every regex variant other than BRE (unless they have unicode multi-byte chars that are meta) :
jot -s '' -c - 32 126 |
mawk '
function ___(__,_) {
return substr(_="",
gsub("[][!-/_\140:-#{-~]","[&]",__),
gsub("["(_="\\\\")"^]",_ "&",__))__
} ($++NF = ___($!_))^_'
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
[!]["][#][$][%][&]['][(][)][*][+][,][-][.][/]
0 1 2 3 4 5 6 7 8 9 [:][;][<][=][>][?]
[#] ABCDEFGHIJKLMNOPQRSTUVWXYZ [[]\\ []]\^ [_]
[`] abcdefghijklmnopqrstuvwxyz [{][|][}][~]
square-brackets are much easier to deal with, since there's no risk of triggering warning messages about "escaping too much", e.g. :
function ____(_) {
return substr("", gsub("[[:punct:]]","\\\\&",_))_
}
\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/ 0123456789\:\;\<\=\>\?
\#ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]\^\_\`abcdefghijklmnopqrstuvwxyz \{\|\}\~
gawk: cmd. line:1: warning: regexp escape sequence `\!' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\"' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\%' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\&' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\,' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\:' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\=' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\#' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\_' is not a known regexp operator
gawk: cmd. line:1: warning: regexp escape sequence `\~' is not a known regexp operator
Using Raku (formerly known as Perl_6)
Works (backslash or quote all non-alphanumeric characters except underscore):
~$ raku -e 'say $/ if "#.*?" ~~ m/ \# \. \* \? /; #works fine'
「#.*?」
There exist six flavors of Regular Expression languages, according to Damian Conway's pdf/talk "Everything You Know About Regexes Is Wrong". Raku represents a significant (~15 year) re-working of standard Perl(5)/PCRE Regular Expressions.
In those 15 years the Perl_6 / Raku language experts decided that all non-alphanumeric characters (except underscore) shall be reserved as Regex metacharacters even if no present usage exists. To denote non-alphanumeric characters (except underscore) as literals, backslash or escape them.
So the above example prints the $/ match variable if a match to a literal #.*? character sequence is found. Below is what happens if you don't: # is interpreted as the start of a comment, . dot is interpreted as any character (including whitespace), * asterisk is interpreted as a zero-or-more quantifier, and ? question mark is interpreted as either a zero-or-one quantifier or a frugal (i.e. non-greedy) quantifier-modifier (depending on context):
Errors:
~$ ~$ raku -e 'say $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!'
===SORRY!===
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Regex not terminated.
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
Couldn't find terminator / (corresponding / was at line 1)
at -e:1
------> y $/ if "#.*?" ~~ m/ # . * ? /; #ERROR!⏏<EOL>
expecting any of:
/
https://docs.raku.org/language/regexes
https://raku.org/

Can somebody help me understand this regex pattern? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
I have some knowledge of MySQL regexp syntax but I am not proficient. I was looking for a way to construct a pattern that selects all names from a mysql table that contain strange characters due to international input such as spanish names that have the symbol on top of the n.
I came upon this following pattern which I tried and it worked.
[^a-zA-Z0-9#:. \'\-`,\&]
The query is:
SELECT *
FROM orders_table
WHERE customers_name REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
However I would like to understand how this pattern was constructed and what each part means.
The idea is that anything between [...] is a character class, which matches any single character that's in the set between the [ and ].
Adding the ^ to the start of the list of characters means (as noted above) it's negated, which means it matches any character NOT in the set. Putting a ^ anywhere but the start of the [ ... ] means it's just a regular ^ character to match, and in no case does ^ inside a character class mean a start-of-line anchor.
Ranges work, such as a-z, and if you want a literal dash in the set, you either put it first (possibly after the ^), or quote it with \
Edit: the other special characters - # : etc. - are not special in this context, they just match as regular characters.
[a-zA-Z0-9#:. \'-`,\&] - Let us consider all the important parts of this pattern.
[] - denotes a character class which matches any single character within [].
[a-zA-Z0-9] -- One character (lowercase or uppercase) that is in the range of a-z,A-Z OR 0-9.
All other characters which are present within the [] match for a particular single character.
single quotes "'" and ampersand & are escaped by "\" because in sql, they are used for specific purposes.
To match a pattern which has a character other than any of these symbols within [], we use a negation "^" at the beginning.
Thus, as you have told ,your pattern selects all names from a mysql table that contain other strange characters.

Is it possible to search for a phrase in opengrok containing curly brackets?

I have tried using something like "struct a {" and "struct a {" to look for the declaration of "a". But it seems opengrok just ignores the curly brackets. Is there a way to search for the phrase "struct a {"?
Grok supports escaping special characters that are part of the query syntax.
The current list of special characters are
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
To escape these character use the \ before the character.
For example to search for (1+1):2 use the query: \(1\+1\)\:2
You should be able to search with "struct a {" (with quotes)
From OpenGrok documentation:
Escaping special characters:
Opengrok supports escaping special characters that are part of the query syntax. Current special characters are:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
To escape these character use the \ before the character. For example to search for (1+1):2 use the query: (1+1):2
NOTE on analyzers: Indexed words are made up of Alpha-Numeric and Underscore characters. One letter words are usually not indexed as symbols!
Most other characters (including single and double quotes) are treated as "spaces/whitespace" (so even if you escape them, they will not be found, since most analyzers ignore them).
The exceptions are: # $ % ^ & = ? . : which are mostly indexed as separate words.
Because some of them are part of the query syntax, they must be escaped with a reverse slash as noted above.
So searching for +1 or + 1 will both find +1 and + 1.
Valid FIELDs are
full
Search through all text tokens (words,strings,identifiers,numbers) in
index.
defs
Only finds symbol definitions (where e.g. a variable (function, ...)
is defined).
refs
Only finds symbols (e.g. methods, classes, functions, variables).
path
path of the source file (no need to use dividers, or if, then use "/"
Windows users, "" is an escape key in Lucene query syntax! Please don't use "", or replace it with "/"). Also note that if you want
just exact path, enclose it in "", e.g. "src/mypath", otherwise
dividers will be removed and you get more hits.
hist
History log comments.
type
Type of analyzer used to scope down to certain file types (e.g. just C
sources). Current mappings: [ada=Ada, asm=Asm, bzip2=Bzip(2), c=C,
clojure=Clojure, csharp=C#, cxx=C++, eiffel=Eiffel, elf=ELF,
erlang=Erlang, file=Image file, fortran=Fortran, golang=Golang,
gzip=GZIP, haskell=Haskell, jar=Jar, java=Java, javaclass=Java class,
javascript=JavaScript, json=Json, kotlin=Kotlin, lisp=Lisp, lua=Lua,
mandoc=Mandoc, pascal=Pascal, perl=Perl, php=PHP, plain=Plain Text,
plsql=PL/SQL, powershell=PowerShell script, python=Python, ruby=Ruby,
rust=Rust, scala=Scala, sh=Shell script, sql=SQL, swift=Swift,
tar=Tar, tcl=Tcl, troff=Troff, typescript=TypeScript,
uuencode=UUEncoded, vb=Visual Basic, verilog=Verilog, xml=XML,
zip=Zip] The term (phrases) can be boosted (making it more relevant)
using a caret ^ , e.g. help^4 opengrok - will make term help boosted
Opengrok search is powered by Lucene, for more detail on query syntax refer to Lucene docs.