Antlr How to avoid reportAttemptingFullContext and reportAmbiguity - listener

In my Java program I'm parsing many lines of code, and to avoid ambiguous lines I used:
ParseTreeWalker walker = new ParseTreeWalker ();
if (!(lexerErrorListener.hasError() || parserErrorListener.hasError ()))
walker.walk (listener, tree);
else
line error
with the listener:
@Override
public void reportAmbiguity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, boolean exact,
BitSet ambigAlts, ATNConfigSet configs) {
hasError = true;
}
@Override
public void reportAttemptingFullContext(Parser recognizer, DFA dfa, int startIndex, int stopIndex,
BitSet conflictingAlts, ATNConfigSet configs) {
hasError = true;
}
Input like: |U1,0 = comment generates ambiguity and full-context reports with my grammar.
Is this the wrong approach, or is there a way to handle these errors?
My Lexer:
lexer grammar LexerGrammar;
SINGLE_COMMENT : '|' -> pushMode(COMMENT);
NUMBER : [0-9];
VIRGOLA : ',';
WS : [ \t] -> skip ;
EOL : [\r\n]+;
// ------------ Everything INSIDE a COMMENT ------------
mode COMMENT;
COMMENT_NUMBER : NUMBER -> type(NUMBER);
COMMENT_VIRGOLA : VIRGOLA -> type(VIRGOLA);
TYPE : 'I'| 'U'| 'Q';
EQUAL : '=';
COMMENT_TEXT: ('a'..'z' | 'A'..'Z')+;
WS_1 : [ \t] -> skip ;
COMMENT_EOL : EOL -> type(EOL),popMode;
my parser:
parser grammar Parser;
options {
tokenVocab = LexerGrammar;
}
prog : (line? EOL)+;
line : comment;
comment: SINGLE_COMMENT (defComm | genericComment);
defComm: TYPE arg EQUAL COMMENT_TEXT;
arg : (argument1) (VIRGOLA argument2)?;
argument1 : numbers ;
argument2 : numbers ;
numbers : NUMBER+ ;
genericComment: .*?;

reportAmbiguity and reportAttemptingFullContext are not an indication that there was a syntax error. You can listen in on these events to know if they happen, but ANTLR has a deterministic approach to resolving this ambiguity (it uses the first alternative).
If you do not treat those as errors, you will get a proper parse tree out of your parse.


Why do I always get a "trailing characters" error when trying to parse data with serde_json?

I have a server that returns requests in JSON format. When trying to parse the data I always get a "trailing characters" error. This happens only when the JSON comes from Postman.
let type_of_request = parsed_request[1];
let content_of_msg: Vec<&str> = msg_from_client.split("\r\n\r\n").collect();
println!("{}", content_of_msg[1]);
// Will print "{"username":"user","password":"password","email":"dwadwad"}"
let res: serde_json::Value = serde_json::from_str(content_of_msg[1]).unwrap();
println!("The username is: {}", res["username"]);
when getting the data from postman this happens:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("trailing characters", line: 1, column: 60)', src\libcore\result.rs:997:5
but when having the string inside Rust:
let j = "{\"username\":\"user\",\"password\":\"password\",\"email\":\"dwadwad\"}";
let res: serde_json::Value = serde_json::from_str(j).unwrap();
println!("The username is: {}", res["username"]);
it works like a charm:
The username is: "user"
EDIT: Apparently, when I read the message into a buffer and turned it into a string, the string kept all the NUL characters left in the buffer, which are of course the trailing characters.
Looking at the serde_json code, one finds the following comment above the relevant ErrorCode enum variant:
/// JSON has non-whitespace trailing characters after the value.
TrailingCharacters,
So as the error code implies, you've got some trailing character which is not whitespace. In your snippet, you say:
println!("{}", content_of_msg[1]);
// Will print "{"username":"user","password":"password","email":"dwadwad"}"
If you literally copy and pasted the printed output here, I'd note that I wouldn't expect the output to be wrapped in the leading and trailing quotation marks. Did you include these yourself or were they part of what was printed? If they were printed, I suspect that's the source of your problem.
Edit:
In fact, I can nearly recreate this using a raw string with leading/trailing quotation marks in Rust:
extern crate serde_json;
#[cfg(test)]
mod tests {
#[test]
fn test_serde() {
let s =
r#""{"username":"user","password":"password","email":"dwadwad"}""#;
println!("{}", s);
let _res: serde_json::Value = serde_json::from_str(s).unwrap();
}
}
Running it via cargo test yields:
test tests::test_serde ... FAILED
failures:
---- tests::test_serde stdout ----
"{"username":"user","password":"password","email":"dwadwad"}"
thread 'tests::test_serde' panicked at 'called `Result::unwrap()` on an `Err` value: Error("trailing characters", line: 1, column: 4)', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
failures:
tests::test_serde
Note that my printed output also includes leading/trailing quotation marks and I also get a TrailingCharacters error, albeit at a different column.
Edit 2:
Based on your comment that you've added the wrapping quotations yourself, you've got a known good string (the one you've defined in Rust), and one which you believe should match it but doesn't (the one from Postman).
This is a data problem and so we should examine the data. You can adapt the below code to check the good string against the other:
#[test]
fn test_str_comp() {
// known good string we'll compare against
let good =
r#"{"username":"user","password":"password","email":"dwadwad"}"#;
// lengthened string, additional characters
// also n and a in username are transposed
let bad =
r#"{"useranme":"user","password":"password","email":"dwadwad"}abc"#;
let good_size = good.chars().count();
let bad_size = bad.chars().count();
for (idx, (c1, c2)) in (0..)
.zip(good.chars().zip(bad.chars()))
.filter(|(_, (c1, c2))| c1 != c2)
{
println!(
"Strings differ at index {}: (good: `{}`, bad: `{}`)",
idx, c1, c2
);
}
if good_size < bad_size {
let trailing = bad.chars().skip(good_size);
println!(
"bad string contains extra characters: `{}`",
trailing.collect::<String>()
);
} else if good_size > bad_size {
let trailing = good.chars().skip(bad_size);
println!(
"good string contains extra characters: `{}`",
trailing.collect::<String>()
);
}
assert!(false);
}
For my example, this yields the failure:
test tests::test_str_comp ... FAILED
failures:
---- tests::test_str_comp stdout ----
Strings differ at index 6: (good: `n`, bad: `a`)
Strings differ at index 7: (good: `a`, bad: `n`)
bad string contains extra characters: `abc`
thread 'tests::test_str_comp' panicked at 'assertion failed: false', src/lib.rs:52:9
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
failures:
tests::test_str_comp
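The cause named in the question's edit, NUL padding left over from a fixed-size read buffer, is easy to reproduce. The sketch below uses Python's standard json module rather than serde_json, so the error text differs; the 128-byte buffer size and the helper name parse_padded_json are illustrative assumptions, not part of the original program. The equivalent fix in Rust would be to parse only the bytes actually read, or to trim the trailing NULs before calling serde_json::from_str.

```python
import json


def parse_padded_json(raw: bytes):
    """Parse JSON from a buffer that may be padded with trailing NUL bytes."""
    # Strip only the NUL padding; real JSON content is left untouched.
    text = raw.decode("utf-8").rstrip("\x00")
    return json.loads(text)


# Simulate a read buffer that was only partly filled (size is an assumption).
payload = b'{"username":"user","password":"password","email":"dwadwad"}'
buffer = payload.ljust(128, b"\x00")

# Parsing the whole buffer fails: NUL is not whitespace, so it counts as
# extra data after the JSON value.
try:
    json.loads(buffer.decode("utf-8"))
    padded_ok = True
except json.JSONDecodeError:
    padded_ok = False

result = parse_padded_json(buffer)
print(padded_ok, result["username"])
```

Printing the suspect string with repr-style escaping (for example {:?} in Rust) would also have made the padding visible straight away.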

json string with utf16 char cannot convert from 'const char [566]' to 'std::basic_string<_Elem,_Traits,_Ax>'

I have json that needs to test a string with a utf16 wide char in it but I get the following error message:
\..\test\TestClass.cpp(617): error C2440: 'initializing' : cannot convert from 'const char [566]' to 'std::basic_string<_Elem,_Traits,_Ax>'
with
2> [
2> _Elem=wchar_t,
2> _Traits=std::char_traits<wchar_t>,
2> _Ax=std::allocator<wchar_t>
2> ]
2> No constructor could take the source type, or constructor overload resolution was ambiguous
This is my json:
static std::wstring& BAD_JSON5_missingComma_multipleNewlines_Utf16MixedDosAndUnixLineEndings()
{
static std::wstring j =
"{\n" <=VS squigly says error on this line
"\"header\":{\n"
"\"version\":{\"major\":1,\"minor\":0,\"build\":0}\n"
"},\n"
"\"body\":{\n"
"\"string\":{\"type\":\"OurWideStringClass\",\"value\":\"foo\"},\n\n\n\n"
"\"int\":[\n"
"{\"type\":\"string\",\"value\":\"\\u9CE5\"},\n"
"{\"type\":\"Int\",\"value\":5678}\n"
"],\n"
"\"double\":{\"type\":\"Double\",\"value\":12.34},\n"
"\"path1\":[\n"
"{\n"
"\"string\":{\"type\":\"OurWideStringClass\",\"value\":\"bar\"},\r\n"
"\"int\":[\n"
"{\"type\":\"Int\"\"value\":7},\n"
"{\"type\":\"Int\",\"value\":11}\n"
"]\n"
"},\n"
"{\n"
"\"string\":{\"type\":\"OurWideStringClass\",\"value\":\"top\"},\n"
"\"int\":[\n"
"{\"type\":\"Int\",\"value\":13},\r\n"
"{\"type\":\"Int\",\"value\":41}\n"
"]\n"
"}\n"
"],\n"
"\"path2\":{\n"
"\"int\":{\"type\":\"Int\",\"value\":-1234},\n"
"\"double\":{\"type\":\"Double\",\"value\":-1.234}\r\n"
"}\n"
"}\n"
"}\n"; <=double clicking build error goes to this line
return j;
}
This is how it's used
OurWideStringClass slxStJson5 = BAD_JSON5_missingComma_multipleNewlines_Utf16MixedDosAndUnixLineEndings();
std::wistringstream ssJsonMissingCommaUtf16Newlines(slxStJson5);
I thought I had the wchar_t covered with std::wstring in my json. Any idea what the issue is? You can see my utf16 character in \u9CE5. This is the key to this test.
I looked at this c2440 but don't see what they are referring to in the resolution with regard to UDT.
I was looking at this which puts an L in front of it, but with escaped c string, I'm not sure where to put the L.
std::wstring cannot be initialized from a narrow string literal like "my string", hence the compilation error.
You can initialize a std::wstring from a wide string literal — its syntax includes the L prefix before the opening quote, e.g. L"my string".
As you are using string literal concatenation, you need to prefix all the string literals with L, i.e.:
static std::wstring j =
L"{\n"
L"\"header\":{\n"
L"\"version\":{\"major\":1,\"minor\":0,\"build\":0}\n"
L"},\n"
...

Parse a MySQL insert statement with multiple rows [duplicate]

I need a regular expression to select all the text between two outer brackets.
Example:
START_TEXT(text here(possible text)text(possible text(more text)))END_TXT
^ ^
Result:
(text here(possible text)text(possible text(more text)))
I want to add this answer for quick reference. Feel free to update.
.NET Regex using balancing groups:
\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)
Where c is used as the depth counter.
Demo at Regexstorm.com
Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions
PCRE using a recursive pattern:
\((?:[^)(]+|(?R))*+\)
Demo at regex101; Or without alternation:
\((?:[^)(]*(?R)?)*+\)
Demo at regex101; Or unrolled for performance:
\([^)(]*+(?:(?R)[^)(]*)*+\)
Demo at regex101; The pattern is pasted at (?R) which represents (?0).
Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour.
(the new version of PyPI regex package already defaults to this → DEFAULT_VERSION = VERSION1)
Ruby using subexpression calls:
With Ruby 2.0 \g<0> can be used to call full pattern.
\((?>[^)(]+|\g<0>)*\)
Demo at Rubular; Ruby 1.9 only supports capturing group recursion:
(\((?>[^)(]+|\g<1>)*\))
Demo at Rubular  (atomic grouping since Ruby 1.9.3)
JavaScript  API :: XRegExp.matchRecursive
XRegExp.matchRecursive(str, '\\(', '\\)', 'g');
Java: An interesting idea using forward references by @jaytea.
Without recursion up to 3 levels of nesting:
(JS, Java and other regex flavors)
To prevent runaway if unbalanced, with * on innermost [)(] only.
\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)
Demo at regex101; Or unrolled for better performance (preferred).
\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)
Demo at regex101; Deeper nesting needs to be added as required.
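Because the bounded pattern uses no recursion, it also runs in engines that lack it, such as Python's built-in re module. The following is a minimal sketch that reuses the first (non-unrolled) pattern from above; the helper name first_group and the second test string are assumptions made for illustration.

```python
import re

# Pattern copied from the answer: an outer pair plus up to three nested levels.
BOUNDED = re.compile(
    r"\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)"
)


def first_group(text):
    """Return the first parenthesised group the bounded pattern matches, or None."""
    m = BOUNDED.search(text)
    return m.group(0) if m else None


sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(first_group(sample))

# Five levels is one more than the pattern covers, so the match silently
# starts at the second "(" instead of the outermost one.
print(first_group("(a(b(c(d(e)))))"))
```

The second call shows the main risk of this approach: input nested more deeply than the pattern allows does not raise an error, it just yields a shorter match.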
Reference - What does this regex mean?
RexEgg.com - Recursive Regular Expressions
Regular-Expressions.info - Regular Expression Recursion
Mastering Regular Expressions - Jeffrey E.F. Friedl
Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.
But there is a simple algorithm to do this, which I described in more detail in this answer to a previous question. The gist is to write code which scans through the string keeping a counter of the open parentheses which have not yet been matched by a closing parenthesis. When that counter returns to zero, then you know you've reached the final closing parenthesis.
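The counter-based scan described above can be sketched as follows. This is one possible reading of that description, not the code from the linked answer; the function name outer_parens and its handling of stray closing parentheses (they are ignored) are assumptions.

```python
def outer_parens(text, open_ch="(", close_ch=")"):
    """Return the first balanced open_ch...close_ch span in text, or None."""
    depth = 0      # number of openers not yet matched by a closer
    start = None   # index of the outermost opener
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # Counter is back to zero: this is the matching closer.
                return text[start:i + 1]
    return None    # no opener, or the outermost opener was never closed


sample = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
print(outer_parens(sample))
```

The scan is a single pass with constant extra memory, and it works for any nesting depth.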
You can use regex recursion:
\(([^()]|(?R))*\)
[^\(]*(\(.*\))[^\)]*
[^\(]* matches everything that isn't an opening bracket at the beginning of the string, (\(.*\)) captures the required substring enclosed in brackets, and [^\)]* matches everything that isn't a closing bracket at the end of the string. Note that this expression does not attempt to match brackets; a simple parser (see dehmann's answer) would be more suitable for that.
This answer explains the theoretical limitation of why regular expressions are not the right tool for this task.
Regular expressions can not do this.
Regular expressions are based on a computing model known as Finite State Automata (FSA). As the name indicates, a FSA can remember only the current state, it has no information about the previous states.
In the above diagram, S1 and S2 are two states, where S1 is both the starting and the final state. So if we try with the string 0110, the transitions go as follows:
0 1 1 0
-> S1 -> S2 -> S2 -> S2 ->S1
In the above steps, when we are at second S2 i.e. after parsing 01 of 0110, the FSA has no information about the previous 0 in 01 as it can only remember the current state and the next input symbol.
In the above problem, we need to know the number of opening parentheses; this means it has to be stored somewhere. But since FSAs cannot do that, a regular expression cannot be written.
However, an algorithm can be written to do this task. Such algorithms generally fall under Pushdown Automata (PDA). A PDA is one level above an FSA: it has an additional stack to store extra information. PDAs can be used to solve the above problem, because we can 'push' each opening parenthesis onto the stack and 'pop' it once we encounter a closing parenthesis. If the stack is empty at the end, the opening and closing parentheses match. Otherwise they do not.
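The push/pop idea can be written down in a few lines. The sketch below is an illustration, not a formal PDA; the name is_balanced and the optional pairs argument (which extends the check to other bracket kinds) are assumptions that go beyond the parentheses-only case discussed above.

```python
def is_balanced(text, pairs=None):
    """Return True if every opener in text is closed in the right order."""
    if pairs is None:
        pairs = {"(": ")"}          # default: parentheses only
    closers = set(pairs.values())
    stack = []                      # closers we still expect, innermost last
    for ch in text:
        if ch in pairs:
            stack.append(pairs[ch])         # push on an opener
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False                # closer without matching opener
    return not stack                # leftover openers mean unbalanced


print(is_balanced("(text here(possible text)text)"))
print(is_balanced("abc ( 123 ( foobar ) def ) xyz ) ghij"))
```

With a single bracket kind the stack degenerates into a counter; the stack only becomes necessary once several kinds of brackets may be interleaved.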
(?<=\().*(?=\))
If you want to select text between two matching parentheses, you are out of luck with regular expressions. This is impossible(*).
This regex just returns the text between the first opening and the last closing parentheses in your string.
(*) Unless your regex engine has features like balancing groups or recursion. The number of engines that support such features is slowly growing, but they are still not commonly available.
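To see what "first opening to last closing" means in practice, here is a small sketch that runs the lookaround pattern above with Python's re module. The helper name between_outer is an assumption; the unbalanced test string is the one used elsewhere in this thread.

```python
import re

# Pattern from the answer: everything after the first "(" up to the last ")".
BETWEEN = re.compile(r"(?<=\().*(?=\))")


def between_outer(text):
    """Return the text between the first "(" and the last ")", or None."""
    m = BETWEEN.search(text)
    return m.group(0) if m else None


balanced = "START_TEXT(text here(possible text)text(possible text(more text)))END_TXT"
unbalanced = "abc ( 123 ( foobar ) def ) xyz ) ghij"

print(repr(between_outer(balanced)))
# The stray ")" before "ghij" is treated as the end of the match.
print(repr(between_outer(unbalanced)))
```

On balanced input the result looks right, but on the unbalanced string the match runs past the true closing parenthesis, which is exactly the limitation described above.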
It is actually possible to do it using .NET regular expressions, but it is not trivial, so read carefully.
You can read a nice article here. You also may need to read up on .NET regular expressions. You can start reading here.
Angle brackets <> were used because they do not require escaping.
The regular expression looks like this:
<
[^<>]*
(
(
(?<Open><)
[^<>]*
)+
(
(?<Close-Open>>)
[^<>]*
)+
)*
(?(Open)(?!))
>
I was also stuck in this situation when dealing with nested patterns, and regular expressions are the right tool to solve such problems.
/(\((?>[^()]+|(?1))*\))/
This is the definitive regex:
\(
(?<arguments>
(
([^\(\)']*) |
(\([^\(\)']*\)) |
'(.*?)'
)*
)
\)
Example:
input: ( arg1, arg2, arg3, (arg4), '(pip' )
output: arg1, arg2, arg3, (arg4), '(pip'
note that the '(pip' is correctly managed as string.
(tried in regulator: http://sourceforge.net/projects/regulator/)
I have written a little JavaScript library called balanced to help with this task. You can accomplish this by doing
balanced.matches({
source: source,
open: '(',
close: ')'
});
You can even do replacements:
balanced.replacements({
source: source,
open: '(',
close: ')',
replace: function (source, head, tail) {
return head + source + tail;
}
});
Here's a more complex and interactive example JSFiddle.
Adding to bobble bubble's answer, there are other regex flavors where recursive constructs are supported.
Lua
Use %b() (%b{} / %b[] for curly braces / square brackets):
for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)
Raku (former Perl6):
Non-overlapping multiple balanced parentheses matches:
my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (「(a(b)c)」 「((d)f(g))」)
Overlapping multiple balanced parentheses matches:
say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (「(a(b)c)」 「(b)」 「((d)f(g))」 「(d)」 「(g)」)
See demo.
Python re non-regex solution
See poke's answer for How to get an expression between balanced parentheses.
Java customizable non-regex solution
Here is a customizable solution allowing single character literal delimiters in Java:
public static List<String> getBalancedSubstrings(String s, Character markStart,
Character markEnd, Boolean includeMarkers)
{
List<String> subTreeList = new ArrayList<String>();
int level = 0;
int lastOpenDelimiter = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == markStart) {
level++;
if (level == 1) {
lastOpenDelimiter = (includeMarkers ? i : i + 1);
}
}
else if (c == markEnd) {
if (level == 1) {
subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
}
if (level > 0) level--;
}
}
return subTreeList;
}
}
Sample usage:
String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]
The regular expression using Ruby (version 1.9.3 or above):
/(?<match>\((?:\g<match>|[^()]++)*\))/
Demo on rubular
The answer depends on whether you need to match matching sets of brackets, or merely the first open to the last close in the input text.
If you need to match matching nested brackets, then you need something more than regular expressions. - see @dehmann
If it's just first open to last close see @Zach
Decide what you want to happen with:
abc ( 123 ( foobar ) def ) xyz ) ghij
You need to decide what your code needs to match in this case.
"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.
This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns. This is where the re package greatly
assists in parsing.
"""
import re
# The pattern below recognises a sequence consisting of:
# 1. Any characters not in the set of open/close strings.
# 2. One of the open/close strings.
# 3. The remainder of the string.
#
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included. However quotes are not ignored inside
# quotes. More logic is needed for that....
# Raw string so the backslash escapes reach the regex engine unchanged.
pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
      \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)
# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.
matching = { "(" : ")",
"[" : "]",
"{" : "}",
"<" : ">",
'"' : '"',
"'" : "'",
"BEGIN" : "END" }
# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.
def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)
        if m.group(1) != "":
            lst.append(m.group(1))
        if m.group(2) == term:
            return lst, m.group(3)
        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))

# Unit test.
if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")
You need the first and the last parenthesis. Use something like this:
str.indexOf('('); - gives you the first occurrence
str.lastIndexOf(')'); - gives you the last one
So you need the string between them:
String searchedString = str.substring(str.indexOf('('), str.lastIndexOf(')') + 1); // + 1 keeps the closing parenthesis
Because JavaScript regex doesn't support recursive matching, I couldn't make balanced-parentheses matching work with it.
So here is a simple JavaScript for-loop version that turns a "method(arg)" string into an array:
push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)
const parser = str => {
let ops = []
let method, arg
let isMethod = true
let open = []
for (const char of str) {
// skip whitespace
if (char === ' ') continue
// append method or arg string
if (char !== '(' && char !== ')') {
if (isMethod) {
(method ? (method += char) : (method = char))
} else {
(arg ? (arg += char) : (arg = char))
}
}
if (char === '(') {
// nested parenthesis should be a part of arg
if (!isMethod) arg += char
isMethod = false
open.push(char)
} else if (char === ')') {
open.pop()
// check end of arg
if (open.length < 1) {
isMethod = true
ops.push({ method, arg })
method = arg = undefined
} else {
arg += char
}
}
}
return ops
}
// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)
console.log(test)
the result is like
[ { method: 'push', arg: 'number' },
{ method: 'map', arg: 'test(a(a()))' },
{ method: 'bass', arg: 'wow,abc' } ]
[ { method: '$$', arg: 'groups' },
{ method: 'filter',
arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
{ method: 'pickBy', arg: '_id,type' },
{ method: 'map', arg: 'test()' },
{ method: 'as', arg: 'groups' } ]
While so many answers mention this in some form by saying that regex does not support recursive matching and so on, the primary reason for this lies in the roots of the Theory of Computation.
Language of the form {a^nb^n | n>=0} is not regular. Regex can only match things that form part of the regular set of languages.
Read more here
I didn't use regex since it is difficult to deal with nested code. So this snippet should be able to allow you to grab sections of code with balanced brackets:
def extract_code(data):
    """Return (start, end) index pairs of balanced {...} snippets in a string (data)."""
    start_pos = None
    end_pos = None
    count_open = 0
    count_close = 0
    code_snippets = []
    for i, v in enumerate(data):
        if v == '{':
            count_open += 1
            # Compare against None: index 0 is a valid start position.
            if start_pos is None:
                start_pos = i
        if v == '}':
            count_close += 1
            if count_open == count_close and end_pos is None:
                end_pos = i + 1
        if start_pos is not None and end_pos is not None:
            code_snippets.append((start_pos, end_pos))
            start_pos = None
            end_pos = None
    return code_snippets
I used this to extract code snippets from a text file.
This does not fully address the OP's question, but I thought it may be useful to some coming here to search for a nested-structure regexp:
Parse parameters from a function string (with nested structures) in JavaScript
Match structures like:
matches brackets, square brackets, parentheses, single and double quotes
Here you can see generated regexp in action
/**
* get param content of function string.
* only params string should be provided without parentheses
* WORK even if some/all params are not set
* @return [param1, param2, param3]
*/
exports.getParamsSAFE = (str, nbParams = 3) => {
const nextParamReg = /^\s*((?:(?:['"([{](?:[^'"()[\]{}]*?|['"([{](?:[^'"()[\]{}]*?|['"([{][^'"()[\]{}]*?['")}\]])*?['")}\]])*?['")}\]])|[^,])*?)\s*(?:,|$)/;
const params = [];
while (str.length) { // this is to avoid a BIG performance issue in javascript regexp engine
str = str.replace(nextParamReg, (full, p1) => {
params.push(p1);
return '';
});
}
return params;
};
This might help to match balanced parenthesis.
\s*\w+[(][^+]*[)]\s*
This one also worked
re.findall(r'\(.+\)', s)

Lexer mode inside a string in Antlr4

I have an ANTLR4 grammar for a specific JSON format
(I know, I should be using JSON Schema, but let's ignore that for now)
As part of my JSON object, I would like to match a string like this:
"order" : "somefield ASC, someotherfield DESC"
Here are the relevant grammar parts
Parser:
orderObject : ORDER;
Lexer:
COLON: ':';
QUOT: '"';
FIELDNAME : ALPHA (ALPHA | DIGIT | UNDER)*;
fragment DIGIT : [0-9];
fragment UNDER : '_';
fragment ALPHA : [a-zA-Z];
ORDER : '"order"' -> pushMode(ORDERMODE);
WS : [ \r\n\t]+ -> skip;
mode ORDERMODE;
WS2 : [ \r\n\t]+ -> skip;
PREFIX : COLON QUOT -> skip;
ORDERCLAUSE : (ORDERITEM (COMMA ORDERITEM)*)+;
CLOSE : '"' -> popMode;
ORDERITEM : FIELDNAME ORDERDIRECTION?;
ORDERDIRECTION : 'ASC' | 'DESC';
The output I am getting is
line 1:8 token recognition error at: ': '
What am I doing wrong?
Likely you have not defined a COLON-ish token within the ORDERMODE mode (same for QUOT) -- each mode is a completely separate rule set.
You can minimize this limitation by using fragment rules - they are visible in all modes.
...
COLON : Colon ;
QUOT : Quot ;
mode ORDERMODE;
PREFIX : COLON1 QUOT1 -> skip;
...
COLON1 : Colon ;
QUOT1 : Quot ;
...
fragment Colon : ':' ;
fragment Quot : '"' ;

How do I get an unhandled exception to be reported in SML/NJ?

I have the following SML program in a file named testexc.sml:
structure TestExc : sig
val main : (string * string list -> int)
end =
struct
exception OhNoes;
fun main(prog_name, args) = (
raise OhNoes
)
end
I build it with smlnj-110.74 like this:
ml-build sources.cm TestExc.main testimg
Where sources.cm contains:
Group is
csx.sml
I invoke the program like so (on Mac OS 10.8):
sml @SMLload testimg.x86-darwin
I expect to see something when I invoke the program, but the only thing I get is a return code of 1:
$ sml @SMLload testimg.x86-darwin
$ echo $?
1
What gives? Why would SML fail silently on this unhandled exception? Is this behavior normal? Is there some generic handler I can put on main that will print the error that occurred? I realize I can match exception OhNoes, but what about larger programs with exceptions I might not know about?
The answer is to handle the exception, call it e, and print the data using a couple functions available in the system:
$ sml
Standard ML of New Jersey v110.74 [built: Tue Jan 31 16:23:10 2012]
- exnName;
val it = fn : exn -> string
- exnMessage;
val it = fn : exn -> string
-
Now, we have our modified program, where we have the generic handler tacked on to main():
structure TestExc : sig
val main : (string * string list -> int)
end =
struct
exception OhNoes;
open List;
fun exnToString(e) =
List.foldr (op ^) "" ["[",
exnName e,
" ",
exnMessage e,
"]"]
fun main(prog_name, args) = (
raise OhNoes
)
handle e => (
print("Grasshopper disassemble: " ^ exnToString(e));
42)
end
I used lists for generating the message, so to make this program build, you'll need a reference to the basis library in sources.cm:
Group is
$/basis.cm
sources.cm
And here's what it looks like when we run it:
$ sml @SMLload testimg.x86-darwin
Grasshopper disassemble: [OhNoes OhNoes (more info unavailable: ExnInfoHook not initialized)]
$ echo $?
42
I don't know what ExnInfoHook is, but I see OhNoes, at least. It's too bad the SML compiler didn't add a basic handler for us, so as to print something when there was an unhandled exception in the compiled program. I suspect ml-build would be responsible for that task.