Bison shift/reduce conflict / reduce/reduce conflict warnings

When I run this bison code on Ubuntu Linux I get these warnings:
- shift/reduce conflict [-Wconflicts-sr]
- reduce/reduce conflicts [-Wconflicts-rr]
Here's a screenshot for more clarity:
http://i.imgur.com/iznzSsn.png
Edit: the reduce/reduce errors are in
line 86 : typos_dedomenwn
line 101: typos_synartisis
and the shift/reduce error is in:
line 129: entoli_if
I can't figure out how to fix them; could someone help?
Here's the bison code below:
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int totalerrors=0;
extern int yylex();
extern FILE *yyin;
extern int lineno; // line number currently being parsed
//error handling
void yyerror(const char *msg) {
}
//filling the error array
void printError(char y[], int x) {
    //param 1: error string
    //param 2: line number
    char temp[32];  /* enlarged so "In Line " plus a long line number cannot overflow */
    char temp2[16];
    char final[256];
    sprintf(temp2, "%d: ", x);
    strcpy(temp, "In Line ");
    strcat(temp, temp2);
    strcpy(final, "");
    strcat(final, temp);
    strcat(final, y);
    printf("%d) %s\n", totalerrors + 1, final);
    totalerrors++;
}
%}
%start start
%token T_sigkritikos_telestis
%token T_typos_dedomenwn
%token T_typos_synartisis
%token T_stathera
%token T_newline
%token T_kefalida_programmatos
%token T_extern
%token T_void
%token T_return
%token T_if
%token T_else
%token T_plus
%token T_minus
%token T_mult
%token T_div
%token T_percentage
%token T_int
%token T_bool
%token T_string
%token T_true
%token T_false
%token T_id
%token T_semic
%token T_comma
%token T_openpar
%token T_closepar
%token T_ampersand
%token T_begin
%token T_end
%token T_excl
%token T_or
%token T_equals
%token T_semileft
%token T_semiright
%%
start: exwterikes_dilwseis T_kefalida_programmatos tmima_orismwn tmima_entolwn;
exwterikes_dilwseis: exwteriko_prwtotypo exwterikes_dilwseis
| ;
exwteriko_prwtotypo: T_extern prwtotypo_synartisis;
tmima_orismwn: orismos tmima_orismwn
| ;
orismos: orismos_metavlitwn
| orismos_synartisis
| prwtotypo_synartisis;
orismos_metavlitwn: typos_dedomenwn lista_metavlitwn T_semic;
typos_dedomenwn: T_int
| T_bool
| T_string;
loop1: T_comma T_id
| ;
lista_metavlitwn: T_id loop1;
orismos_synartisis: kefalida_synartisis tmima_orismwn tmima_entolwn;
prwtotypo_synartisis: kefalida_synartisis T_semic;
kefalida_synartisis: typos_synartisis T_id T_openpar lista_typikwn_parametrwn T_closepar
| typos_synartisis T_id T_openpar T_closepar;
typos_synartisis: T_int
| T_bool
| T_void;
lista_typikwn_parametrwn: typikes_parametroi loop2;
loop2: T_comma typikes_parametroi
| ;
typikes_parametroi: typos_dedomenwn T_ampersand T_id;
tmima_entolwn: T_begin loop3 T_end;
loop3: entoli loop3
| ;
entoli: apli_entoli T_semic
| domimeni_entoli
| sintheti_entoli;
sintheti_entoli: T_semileft loop3 T_semiright;
domimeni_entoli: entoli_if;
apli_entoli: anathesi
| klisi_sunartisis
| entoli_return
| ;
entoli_if: T_if T_openpar geniki_ekfrasi T_closepar entoli else_clause
| T_if T_openpar geniki_ekfrasi T_closepar entoli;
else_clause: T_else entoli;
anathesi: T_id T_equals geniki_ekfrasi;
klisi_sunartisis: T_id T_openpar lista_pragmatikwn_parametrwn T_closepar
| T_id T_openpar T_closepar;
lista_pragmatikwn_parametrwn: pragmatiki_parametros loop4;
loop4: T_semic pragmatiki_parametros loop4
| ;
pragmatiki_parametros: geniki_ekfrasi;
entoli_return: T_return geniki_ekfrasi
| T_return;
geniki_ekfrasi: genikos_oros loop5;
loop5: T_or T_or genikos_oros loop5
| ;
genikos_oros: genikos_paragontas loop6;
loop6: T_ampersand T_ampersand loop6
| ;
genikos_paragontas: T_excl genikos_protos_paragontas
| genikos_protos_paragontas;
genikos_protos_paragontas: apli_ekfrasi tmima_sigrisis
| apli_ekfrasi;
tmima_sigrisis: T_sigkritikos_telestis apli_ekfrasi;
apli_ekfrasi: aplos_oros loop7;
loop7: T_plus aplos_oros loop7
| T_minus aplos_oros loop7
| ;
aplos_oros: aplos_paragontas loop8;
loop8: T_mult aplos_paragontas loop8
| T_div aplos_paragontas loop8
| T_percentage aplos_paragontas loop8
| ;
aplos_paragontas: T_plus aplos_prot_oros
| T_minus aplos_prot_oros
| aplos_prot_oros;
aplos_prot_oros: T_id
| stathera
| klisi_sunartisis
| T_openpar geniki_ekfrasi T_closepar;
stathera: T_true
|T_false;
%%
int main(int argc, char *argv[]) {
    ++argv; --argc; // skip the executable's name
    if (argc == 1) {
        FILE *fp = fopen(argv[0], "r");
        if (fp != NULL) {
            printf("Reading input from file: %s\n", argv[0]);
            printf("Output:\n\n");
            yyin = fp;
            yyparse();
        } else {
            printf("File doesn't exist\n");
            return 1;
        }
    } else if (argc > 1) {
        printf("Only one file allowed for input...\n");
        return 1;
    } else {
        printf("Parsing from stdin..\n");
        yyparse();
    }
    if (totalerrors == 0) {
        printf("All good!\n");
        printf("===================================\n");
        printf("Parsing complete! No errors found!!\n");
    } else {
        printf("===================================\n");
        printf("Total Errors: %d\n", totalerrors);
    }
    return 0;
}

A. Redundant non-terminals
The reduce/reduce conflicts are because you have two non-terminals which exist only to gather together different types:
typos_dedomenwn: T_int
| T_bool
| T_string;
typos_synartisis: T_int
| T_bool
| T_void;
Where these non-terminals are used, it is impossible for the parser to know which one applies; it cannot tell until further along in the declaration. However, it doesn't matter. You could just define a single typos non-terminal covering all the types and use it throughout; the few extra combinations it admits (such as a void variable declaration) can be rejected later, in a semantic action:
typos: T_int
| T_bool
| T_string
| T_void;
orismos_metavlitwn: typos lista_metavlitwn T_semic;
kefalida_synartisis: typos T_id T_openpar lista_typikwn_parametrwn T_closepar
| typos T_id T_openpar T_closepar;
typikes_parametroi: typos T_ampersand T_id;
B. Dangling else
The shift/reduce conflict is the classic problem with "C" style if statements. These statements are difficult to describe in a way which is not ambiguous. Consider:
if (expr1) if (expr2) statement1; else statement2;
We know that the else must match the second if, so the above is equivalent to:
if (expr1) { if (expr2) statement1; else statement2; }
But the grammar also matches the other possible parse, equivalent to:
if (expr1) { if (expr2) statement1; } else statement2;
There are three possible solutions to this problem:
Do nothing. Bison does the right thing here, by design: it always prefers "shift" over "reduce". What that means is that if an else could match an open if statement, bison will always do that, rather than holding onto the else to match some outer if statement. There is a pretty good description of this in the Dragon book, amongst other places.
The problem with this solution is that you still end up with a warning about shift/reduce conflicts, and it is hard to distinguish between "OK" conflicts, and newly-created "not OK" conflicts. Bison provides the %expect declaration so you can tell it how many conflicts you expect, which will suppress the warning if the right number are found, but that is still pretty fragile.
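For example, once you have verified that the single dangling-else conflict is harmless, you could declare:
%expect 1
in the declarations section, and bison will then complain only if the number of shift/reduce conflicts changes.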
Use precedence declarations. These are described in the bison manual, and their use in solving the dangling-else problem is a running example in that chapter. In your case, it would look something like this:
%precedence T_then /* Fake terminal, needed for %prec */
%precedence T_else
/* ... */
%%
/* ... */
entoli_if: T_if T_openpar geniki_ekfrasi T_closepar entoli T_else entoli
| T_if T_openpar geniki_ekfrasi T_closepar entoli %prec T_then;
Here, I have eliminated the unnecessary non-terminal else_clause because it hides the else token. If you wanted to keep it, for whatever reason, you would need to add a %prec T_else to the end of the entoli_if production which uses it.
The %precedence declaration is only available from bison 3.0 onwards. If you have an earlier version of bison, you can use the %nonassoc declaration instead, but this may hide some other errors.
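With %nonassoc, the sketch above would instead start with:
%nonassoc T_then
%nonassoc T_else
and the rules themselves would stay the same.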
Fix the grammar. It is actually possible to make an unambiguous grammar, but it is a bit of work.
The important point is that in:
if (expr) statement1 else statement2
statement1 cannot be an unmatched if statement. If statement1 is an if statement, it must include an else clause; otherwise, the else in the outer if would match the inner if. And that applies recursively to any trailing statements in statement1, such as
if (e2) statement2;
else if (e3) statement3
else /* must be present */ statement;
We can express this by dividing statements into "matching" statements (where every if is matched by an else) and "non-matching" statements. (I haven't tried to preserve the Greek non-terminal names here, sorry; you'll have to adapt the idea to your grammar.)
statement: matching_statement | non_matching_statement ;
matching_statement: call_statement | assignment_statement | ...
| matching_if_statement
non_matching_statement: non_matching_if_statement
/* might be others, see below */
if_condition: "if" '(' expression ')' ;
matching_if_statement:
if_condition matching_statement "else" matching_statement ;
non_matching_if_statement:
if_condition statement
| if_condition matching_statement "else" non_matching_statement
;
In C, there are other compound statements which can end with a statement (while, for). Each of these will also have a "matching" and "non-matching" version, depending on whether the final statement is matching or non-matching:
while_condition: "while" '(' expression ')' ;
matching_while_statement: while_condition matching_statement ;
non_matching_while_statement: while_condition non_matching_statement ;
As far as I can see, this does not apply to your language, but you might want to extend it in the future to include such statements.
C. Some notes about bison style
Bison allows you to use single character tokens as themselves, surrounded by single quotes. So instead of declaring T_openpar and then writing verbose rules which use it, you can just write '('; you don't even need to declare it. (In your flex -- or other -- scanner, you would just return '('; instead of return T_openpar, which is why you don't need to declare the token.) This usually makes grammars more readable.
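For example, the function-call rule from the grammar above could be written as:
klisi_sunartisis: T_id '(' lista_pragmatikwn_parametrwn ')'
| T_id '(' ')';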
Bison also lets you specify a human-readable name for a token. (This feature is not in all yacc derivatives, but it is pretty common.) That can also make grammars more readable. For example, you can give names to the if and else tokens as follows:
%token T_if "if"
%token T_else "else"
and then you could use the quoted strings in your grammar rules. (I did that in my last example for the dangling-else problem.) In the flex scanner, you still need to use the token symbols T_if and T_else.
If you have a two-symbol token like &&, it is usually better if the scanner recognizes it and returns a single token, instead of the parser recognizing two consecutive & tokens. In the second case, the parser will recognize:
boolean_expr1 & & boolean_expr2
as though it had been written
boolean_expr1 && boolean_expr2
although the first one was most likely an error which should be reported.
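In a flex scanner, recognizing && as one token is a one-line rule. T_and here is a hypothetical token name; the loop6 rule in the grammar would then use a single T_and instead of two consecutive T_ampersands:
"&&"    { return T_and; }
"&"     { return T_ampersand; }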
Bison is a bottom-up LALR(1) parser generator. It is not necessary to remove left-recursion. Bottom-up parsers prefer left-recursion, and left-recursive grammars are usually more accurate and easier to read. For example, it is better all round to declare:
apli_ekfrasi: aplos_oros
| apli_ekfrasi '+' aplos_oros
| apli_ekfrasi '-' aplos_oros;
than to use LL-style repeated suffixes (loop7 in your grammar). The left-recursive grammar can be parsed without extending the parser stack, and more accurately represents the syntactic structure of the expression, making parser actions easier to write.
There are a number of other places in your grammar which you might want to revisit.
(This advice comes straight from the bison manual: "you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space.")

Related

Sphinx 3 Search engine: Having problems reading JSON from CSV source

When I try to read JSON content from a field I get:
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
Here are the details:
This is the (super simplified) CSV file I'm trying to read:
1,hello world, document number one,a:foo
22,hello again, document number two,foo:bar
23,hello now, This is some stuff,foo:{bar:baz}
24,hello cow, more test stuff and things,{foo:bar}
55,hello suess, box and sox and goats and moats,[a]
56,hello raven, nevermore said the thing,foo:bar
When I run the indexer this is the result I get:
../bin/indexer --config /home/ec2-user/sphinx/etc/sphinx.conf --all --rotate
Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/home/ec2-user/sphinx/etc/sphinx.conf'...
indexing index 'csvtest'...
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
WARNING: document 22, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
WARNING: document 23, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:{bar:baz}'
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
WARNING: document 56, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 0.1 Kb
total 0.0 sec, 17.7 Kb/sec, 1709 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=14393).
This is the entire config file:
source csvsrc
{
type = csvpipe
csvpipe_delimiter = ,
csvpipe_command = cat /home/ec2-user/sphinx/etc/example.csv
csvpipe_field_string =t
csvpipe_attr_string =c
csvpipe_attr_json =assorted
}
index csvtest
{
source = csvsrc
path = /var/data/test7
morphology = stem_en
rt_field = t
rt_field = c
rt_field = assorted
}
indexer
{
mem_limit = 128M
}
searchd
{
listen = 9312
listen = 9306:mysql41
log = /var/log/searchd.log
query_log = /var/log/query.log
pid_file = /var/log/searchd.pid
binlog_path = /var/data
}
And if I log in and query, it's pretty obvious that the JSON was not, in fact, indexed (as expected from the warnings):
select * from csvtest;
+------+-------------+----------------------------------+----------+
| id | t | c | assorted |
+------+-------------+----------------------------------+----------+
| 1 | hello world | document number one | NULL |
| 22 | hello again | document number two | NULL |
| 23 | hello now | This is some stuff | NULL |
| 24 | hello cow | more test stuff and things | NULL |
| 55 | hello suess | box and sox and goats and moats | NULL |
| 56 | hello raven | nevermore said the thing | NULL |
+------+-------------+----------------------------------+----------+
6 rows in set (0.00 sec)
I have tried a few things, but I'm just groping in the dark:
1. Alternate formats of JSON. I have tried using {foo:bar} and {[foo:bar]} and [{foo,bar}], based on some experiences with other JSON inputs where they want either an array or a dict at the top level. These actually generate slightly different errors:
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
2. I have tried adding a trailing comma, thinking that might be the $end token that the parser is looking for. This generates an actual error, ERROR: index 'csvtest': source 'csvsrc': not all columns found (found=5, total=4, line=1), which prevents index generation. That makes sense to me.
2a. I tried adding a whole other column after the JSON so I could have the ending comma but not get an error that would prevent the index from generating. This did generate the index, but did not provide the $end token that the JSON parser was looking for.
I'm totally stumped.
Well, as such, a:foo isn't a valid JSON value AFAIK. It looks like it's meant to be an object? So it would need {...} surrounding it.
But even {foo:bar} is not valid either. At the very least the value should be quoted: {foo:"bar"}. But really the keys need quoting too: {"foo":"bar"}.
JavaScript objects technically allow unquoted key names, but JSON requires the quoting.
... but also remember it's CSV. Quotes are typically used for quoting (e.g. when columns contain commas), so the quotes need double encoding! It ends up a bit messy:
24,hello cow, more test stuff and things,"{""foo"":""bar""}"
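Applied to the whole sample file (assuming each assorted value is meant to be a JSON object, and the [a] row an array), it would look something like:
1,hello world, document number one,"{""a"":""foo""}"
22,hello again, document number two,"{""foo"":""bar""}"
23,hello now, This is some stuff,"{""foo"":{""bar"":""baz""}}"
24,hello cow, more test stuff and things,"{""foo"":""bar""}"
55,hello suess, box and sox and goats and moats,"[""a""]"
56,hello raven, nevermore said the thing,"{""foo"":""bar""}"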

Why does Octave's huffmandeco function error about large index types?

I've got a little MATLAB script which I'm trying to understand. It doesn't do very much: it only reads a text from a file and encodes and decodes it with the Huffman functions.
But it throws an error while decoding:
"error: out of memory or dimension too large for Octave's index type
error: called from huffmandeco>dict2tree at line 95 column 19"
I don't know why, because when I debugged it I didn't see any index that large.
I've added the part which calculates p from the input text below.
%text is a random input text file in ASCII
%calculate the relative frequency of every symbol
for i = 0:127
  nlet = length(find(text == i));
  p(i+1) = nlet / length(text);
end
symb = 0:127;
dict = huffmandict(symb, p);         % Create dictionary
compdata = huffmanenco(fdata, dict); % Encode the data
dsig = huffmandeco(compdata, dict);  % Decode the Huffman code
I can only use Octave instead of MATLAB, so I don't know if this is an unexpected Octave problem. I'm using Octave 6.2.0 on Windows 10. I tried the build for large data; it didn't change anything.
Does anyone recognize the error in this context?
EDIT:
I debugged the code again. In the function huffmandeco I found the following function:
function tree = dict2tree (dict)
  L = length (dict);
  lengths = zeros (1, L);
  ## the depth of the tree is limited by the maximum word length.
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);
  tree = zeros (1, 2^(m+1)-1)-1;
  for i = 1:L
    pointer = 1;
    word = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor
endfunction
The maximum word length m in this case is 82, so the function computes:
tree = zeros (1, 2^(82+1)-1)-1;
That is an array of 2^83 - 1 ≈ 9.7 × 10^24 elements (roughly 77 yottabytes of doubles), so it's obvious why the error complains about Octave's index type.
But there must be a solution or another error, because the code has been tested before.
I haven't weeded through the code enough to know why yet, but huffmandict is not ignoring zero-probability symbols the way it claims to. Nor have I been able to find a bug report on Savannah, but again I haven't searched thoroughly.
A workaround is to limit the symbol list and their probabilities to only the symbols that actually occur. Using containers.Map would be ideal, but in Octave you can do that with a couple of the outputs from unique:
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
For the string
textstr = 'Random String Input.';
the result is:
>> symbols
symbols = .IRSadgimnoprtu
>> inds
inds =
Columns 1 through 19:
4 6 11 7 12 10 1 5 15 14 9 11 8 1 3 11 13 16 15
Column 20:
2
So the first symbol in the input string is symbols(4), the second is symbols(6), and so on.
From there, you just use symbols and inds to create the dictionary and encode/decode the signal. Here's a quick demo script:
textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);
dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);
fprintf("Decoded string: %s\n", symbols(dsig));
And the output:
Starting string: Random String Input.
Decoded string: Random String Input.
To encode strings other than the original input string, you would have to map the characters to symbol indices (ensuring that all symbols in the string are actually present in the symbol table, obviously):
>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
15 14 12 8 7 12 14
>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor

CSV Parsing Issue with Attoparsec

Here is my code that does CSV parsing, using the text and attoparsec
libraries:
import Control.Applicative ((<|>), many)
import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

-- | Parse a field of a record.
field :: A.Parser T.Text -- ^ parser
field = fmap T.concat quoted <|> normal A.<?> "field"
  where
    normal = A.takeWhile (A.notInClass "\n\r,\"") A.<?> "normal field"
    quoted = A.char '"' *> many between <* A.char '"' A.<?> "quoted field"
    between = A.takeWhile1 (/= '"') <|> (A.string "\"\"" *> pure "\"")

-- | Parse a block of text into a CSV table.
comma :: T.Text                   -- ^ CSV text
      -> Either String [[T.Text]] -- ^ error | table
comma text
  | T.null text = Right []
  | otherwise = A.parseOnly table text
  where
    table = A.sepBy1 record A.endOfLine A.<?> "table"
    record = A.sepBy1 field (A.char ',') A.<?> "record"
This works well for a variety of inputs, but it does not work when there is a trailing \n at the end of the input.
Current behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"],[""]]
Wanted behaviour:
> comma "hello\nworld"
Right [["hello"],["world"]]
> comma "hello\nworld\n"
Right [["hello"],["world"]]
I have been trying to fix this issue but I ran out of ideas. I am almost certain that it will have to be something with A.endOfInput, as that is the significant anchor and the only "bonus" information we have. Any ideas on how to work that into the code?
One possible idea is to look at the end of the string before running the Attoparsec parser and remove the last character (or two, in the case of \r\n), but that seems like a hacky solution that I would like to avoid in my code.
Full code of the library can be found here: https://github.com/lovasko/comma
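For what it's worth, one untested sketch of working A.endOfInput in (using attoparsec's option, lookAhead and anyChar combinators): make the row separator refuse to match a newline that sits at the very end of the input, then consume the optional trailing newline and require end of input explicitly:

comma :: T.Text -> Either String [[T.Text]]
comma text
  | T.null text = Right []
  | otherwise = A.parseOnly (table <* A.option () A.endOfLine <* A.endOfInput) text
  where
    table = A.sepBy1 record rowSep A.<?> "table"
    rowSep = A.endOfLine <* A.lookAhead A.anyChar -- a newline at end of input is not a separator
    record = A.sepBy1 field (A.char ',') A.<?> "record"

Because attoparsec backtracks on failure inside sepBy1, the final newline is left over for the A.option step instead of producing a trailing empty record.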

Parse a MySQL insert statement with multiple rows [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R (perl=TRUE); Python: the PyPI regex module with (?V1) for Perl behaviour
(the new version of the PyPI regex package already defaults to this: DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
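A minimal sketch of that counter-based scan in Python (first_balanced is a hypothetical helper name):

def first_balanced(s):
    """Return the first substring enclosed in balanced parentheses, or None."""
    depth = 0
    start = None
    for i, ch in enumerate(s):
        if ch == '(':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                return s[start:i + 1]  # include the outer parentheses
    return None  # no balanced group found

first_balanced('START(a(b)c(d(e)))END')  # -> '(a(b)c(d(e)))'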
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions cannot do this.
Regular expressions are based on a computing model known as finite state automata (FSA). As the name indicates, an FSA can remember only the current state; it has no information about the previous states.
Consider an FSA with two states, S1 and S2, where S1 is both the starting and the final state. If we try the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at the second S2, i.e. after parsing the 01 of 0110, the FSA has no information about the previous 0 in 01, as it can only remember the current state and the next input symbol.
In the balanced-parentheses problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written for it.
However, an algorithm can be written to do this task. Such algorithms generally fall under pushdown automata (PDA). A PDA is one level above an FSA: it has an additional stack to store information. PDAs can be used to solve this problem, because we can push the opening parentheses onto the stack and pop them once we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match; otherwise they don't.
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and recursive regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
Note that the '(pip' is correctly managed as a string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (formerly Perl 6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                 Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenDelimiter = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        } else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions - see @dehmann's answer.
If it's just first open to last close, see @Zach's answer.
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
pat = re.compile("""
( .*? )
( \( | \) | \[ | \] | \{ | \} | \< | \> |
\' | \" | BEGIN | END | $ )
( .* )
""", re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
lst = []
while True:
m = pat.match(s)
if m.group(1) != "":
lst.append(m.group(1))
if m.group(2) == term:
return lst, m.group(3)
if m.group(2) in matching:
item, s = matchnested(m.group(3), matching[m.group(2)])
lst.append(m.group(2))
lst.append(item)
lst.append(matching[m.group(2)])
else:
raise ValueError("After <<%s %s>> expected %s not %s" %
(lst, s, term, m.group(2)))
# Unit test.
if __name__ == "__main__":
for s in ("simple string",
""" "double quote" """,
""" 'single quote' """,
"one'two'three'four'five'six'seven",
"one(two(three(four)five)six)seven",
"one(two(three)four)five(six(seven)eight)nine",
"one(two)three[four]five{six}seven<eight>nine",
"one(two[three{four<five>six}seven]eight)nine",
"oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
"ERROR testing ((( mismatched ))] parens"):
print "\ninput", s
try:
lst, s = matchnested(s)
print "output", lst
except ValueError as e:
print str(e)
print "done"
You need the first and last parentheses. Use something like this:
str.indexOf('('); - it will give you the first occurrence
str.lastIndexOf(')'); - the last one
So you need the string between:
String searchedString = str.substring(str.indexOf('(') + 1, str.lastIndexOf(')'));
Because JS regex doesn't support recursive matching, I can't make balanced parentheses matching work. So this is a simple JavaScript for-loop version that turns a "method(arg)" string into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []
  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue
    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        (method ? (method += char) : (method = char))
      } else {
        (arg ? (arg += char) : (arg = char))
      }
    }
    if (char === '(') {
      // nested parenthesis should be a part of arg
      if (!isMethod) arg += char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg += char
      }
    }
  }
  return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
The result looks like this:
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the theory of computation.
A language of the form {a^n b^n | n >= 0} is not regular, and regexes (in the formal sense) can only match things that form part of the regular set of languages.
Read more here.
I didn't use regex, since it is difficult to deal with nested code. So this snippet should allow you to grab sections of code with balanced braces:
def extract_code(data):
    """Return a list of (start, end) index pairs of balanced-brace snippets in data."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            if start_pos is None:  # 'is None', so a brace at index 0 still registers
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here searching for nested-structure regexps:
Parse parameters from a function string (with nested structures) in JavaScript
It matches structures like brackets, square brackets, parentheses, and single and double quotes.
Here you can see the generated regexp in action.
/**
 * get param content of function string.
 * only params string should be provided without parentheses
 * works even if some/all params are not set
 * @return [param1, param2, param3]
 */
exports.getParamsSAFE = (str, nbParams = 3) => {
  const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
  const params = [];
  while (str.length) { // this is to avoid a BIG performance issue in the javascript regexp engine
    str = str.replace(nextParamReg, (full, p1) => {
      params.push(p1);
      return '';
    });
  }
  return params;
};
This might help to match balanced parentheses:
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Trying to understand the number of ParseErrors in html5lib-tests

I was looking at following test case in html5lib-tests:
{"description":"<!DOCTYPE\\u0008",
"input":"<!DOCTYPE\u0008",
"output":["ParseError", "ParseError", "ParseError",
["DOCTYPE", "\u0008", null, null, false]]},
source
State |Input char | Actions
--------------------------------------------------------------------------------------------
Data State | "<" | -> TagOpenState
TagOpenState | "!" | -> MarkupDeclarationOpenState
MarkupDeclarationOpenState | "DOCTYPE" | -> DOCTYPE state
DOCTYPE state | "\u0008" | Parse error; -> before DOCTYPE name state (reconsume)
before DOCTYPE name state | "\u0008" | DOCTYPE(name = "\u0008"); -> DOCTYPE name state
DOCTYPE name state | EOF | Parse error. Set force quirks on. Emit DOCTYPE -> Data state.
Data state | EOF | Emit EOF.
I'm wondering where those three errors come from. I can only track down two, so I assume I'm making an error in my logic somewhere.
The one you're missing is the one from the "Preprocessing the input stream" section:
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
This causes a parse error before the U+0008 character ever reaches the tokenizer. Given that the tokenizer is defined as reading from the input stream, the tokenizer tests assume the input stream has had its normal preprocessing applied to it. So the tally is: one parse error from the input-stream preprocessing (U+0008 is a control character), one from the DOCTYPE state when it sees U+0008, and one from the DOCTYPE name state when it hits EOF.