I am currently gettin' my hands dirty on some Perl6. Specifically, I am trying to write a Fortran parser based on grammars (the Fortran::Grammar module).
For testing purposes, I would like to have the possibility to convert a Match object into a JSON-serializable Hash.
Googling / official Perl6 documentation didn't help. My apologies if I overlooked something.
My attempts so far:
I know that one can convert a Match $m to a Hash via $m.hash. But this keeps nested Match objects.
Since this just has to be solvable via recursion, I tried, but gave up in favor of first asking here whether a simpler or existing solution already exists.
Dealing with Match objects' contents is obviously best accomplished via make/made. I would love to have a super simple Actions object to hand to .parse with a default method for all matches that basically just does a make $/.hash or something like that. I just have no idea on how to specify a default method.
Here's an action class method from one of my Perl 6 projects, which does what you describe.
It does almost the same as what Christoph posted, but is written more verbosely (and I've added copious amounts of comments to make it easier to understand):
#| Fallback action method that produces a Hash tree from named captures.
method FALLBACK ($name, $/) {
    # Unless an embedded { } block in the grammar already called make()...
    unless $/.made.defined {
        # If the Match has named captures, produce a hash with one entry
        # per capture:
        if $/.hash -> %captures {
            make hash do for %captures.kv -> $k, $v {
                # The key of the hash entry is the capture's name.
                $k => $v ~~ Array
                    # If the capture was repeated by a quantifier, the
                    # value becomes a list of what each repetition of the
                    # sub-rule produced:
                    ?? $v.map(*.made).cache
                    # If the capture wasn't quantified, the value becomes
                    # what the sub-rule produced:
                    !! $v.made
            }
        }
        # If the Match has no named captures, produce the string it matched:
        else { make ~$/ }
    }
}
Notes:
This totally ignores positional captures (i.e. those made with ( ) inside the grammar) - only named captures (e.g. <foo> or <foo=bar>) are used to build the Hash tree. It could be amended to handle them too, depending on what you want to do with them. Keep in mind that:
$/.hash gives the named captures, as a Map.
$/.list gives the positional captures, as a List.
$/.caps (or $/.pairs) gives both the named and positional captures, as a sequence of name=>submatch and/or index=>submatch pairs.
It allows you to override the AST generation for specific rules, either by adding a { make ... } block inside the rule in the grammar (assuming that you never intentionally want to make an undefined value), or by adding a method with the rule's name to the action class.
I just have no idea on how to specify a default method.
The method name FALLBACK is reserved for this purpose.
Adding something like this
method FALLBACK($name, $/) {
    make $/.pairs.map(-> (:key($k), :value($v)) {
        $k => $v ~~ Match ?? $v.made !! $v>>.made
    }).hash || ~$/;
}
to your actions class should work.
For each named rule without an explicit action method, it will make either a hash containing its subrules (named ones or positional captures) or, if the rule is 'atomic' and has no such subrules, the matching string.
I found two ways to instantiate a class:
one is:
class_name create instance
instance class_method
the other is:
set instance [class_name new]
$instance class_method
Both ways worked well, so is there any difference between the two?
The only difference is that the new method on classes generates a unique name for you, and with the create method you get to specify what the name is. Both are provided because there are use cases for each of them. Use whichever makes sense for you. (Note that class objects themselves are always named because of how they're generally used, and so you can't normally create classes with new; it's hidden on oo::class instances.)
For the sake of completeness only, there's an additional way to make instances, createWithNamespace, which lets you also specify the name of the state namespace of the object. It's not exposed by default (you have to manually export it for general use) and is too specialist for anyone who isn't doing deep shenanigans. You probably don't want to use it.
At some point in the future new may be enhanced so that it also enables garbage collection for the object, whereas create will not (because you know the name out-of-band). Specifically, I've written a TIP for doing this for Tcl 9.0 but I don't have a working implementation yet, so do not hold your breath.
I have tried some cases, with the class defined like this:
oo::class create class_name {
    variable text
    method add {t} { set text $t }
    method print { } { puts $text }
}
First way:
class_name create foo
foo add "abc"
foo print
class_name create foo
It will return:
abc
Error: can't create object "foo": command already exists with that name
Second way:
set foo [class_name new]
$foo add "abc"
$foo print
set foo [class_name new]
$foo add "def"
$foo print
It will return:
abc
def
This is one difference I found, and it seems like the second way is more convenient.
I am still hoping an expert can provide an authoritative answer or point to documentation.
I have been working through the Advent of Code problems in Perl6 this year and was attempting to use a grammar to parse the Day 3 input.
Given input in this form: #1 # 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 # 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
'#' $<id>=(\d+) ' '
'# ' $<x>=(\d+) ',' $<y>=(\d+)
': ' $<w>=(\d+) x $<d>=(\d+)
$
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out the way, your question remains perfectly cromulent as it is because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
token TOP {
'#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
{ $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
}
...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. This can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability you typically should not insert code inline at the end of rules; instead, use Action objects.
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int) or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int
to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int
I feel like I understand MAKE as being a constructor for a datatype. It takes two arguments... the first the target datatype, and the second a "spec".
In the case of objects it's fairly obvious that a block of Rebol data can be used as the "spec" to get back a value of type object!
>> foo: make object! [x: 10 y: 20 z: func [value] [print x + y + value] ]
== make object! [
x: 10
y: 20
]
>> print foo/x
10
>> foo/z 1
31
I know that if you pass an integer when you create a block, it will preallocate enough underlying memory to hold a block of that length, despite being empty:
>> foo: make block! 10
== []
That makes some sense. If you pass a string in, then you get the string parsed into Rebol tokens...
>> foo: make block! "some-set-word: {String in braces} some-word 12-Dec-2012"
== [some-set-word: "String in braces" some-word 12-Dec-2012]
Not all types are accepted, and again I'll say so far... so good.
>> foo: make block! 12-Dec-2012
** Script error: invalid argument: 12-Dec-2012
** Where: make
** Near: make block! 12-Dec-2012
By contrast, the TO operation is defined very similarly, except it is for "conversion" instead of "construction". It also takes a target type as a first parameter, and then a "spec". It acts differently on values:
>> foo: to block! 10
== [10]
>> foo: to block! 12-Dec-2012
== [12-Dec-2012]
That seems reasonable. If it received a non-series value, it wrapped it in a block. If you try an any-block! value with it, I'd imagine it would give you a block! series with the same values inside:
>> foo: to block! quote (a + b)
== [a + b]
So I'd expect a string to be wrapped in a block, but it just does the same thing MAKE does:
>> foo: to block! "some-set-word: {String in braces} some-word 12-Dec-2012"
== [some-set-word: "String in braces" some-word 12-Dec-2012]
Why is TO being so redundant with MAKE, and what is the logic behind their distinction? Passing integers into to block! gets the number inside a block (instead of having the special construction mode), and dates go into to block! making the date in a block instead of an error as with MAKE. So why wouldn't one want a to block! of a string to put that string inside a block?
Also: beyond reading the C sources for the interpreter, where is the comprehensive list of specs accepted by MAKE and TO for each target type?
MAKE is a constructor, TO is a converter. The reason that we have both is that for many types that operation is different. If they weren't different, we could get by with one operation.
MAKE takes a spec that is supposed to be a description of the value you're constructing. This is why you can pass MAKE a block and get values like objects or functions that aren't block-like at all. You can even pass an integer to MAKE and have it be treated like an allocation directive.
TO takes a value that is intended to be more directly converted to the target type (this value being called "spec" is just an unfortunate naming mishap). This is why the values in the input more directly correspond to the values in the output. Whenever there is a sensible default conversion, TO does it. That is why many types don't have TO conversions defined between them: the types are too different conceptually. We have fairly comprehensive conversions for some types where this is appropriate, such as to strings and blocks, but have carefully restricted some other conversions that are more useful to prohibit, such as from none to most types.
In some cases of simple types, there really isn't a complex way to describe the type. For them, it doesn't hurt to have the constructors just take self-describing values as their specs. Coincidentally, this ends up being the same behavior as TO for the same type and values. This doesn't hurt, so it's not useful to trigger an error in this case.
There are no comprehensive docs for the behavior of MAKE and TO because in Rebol 3 their behavior isn't completely finalized. There is still some debate in some cases about what the proper behavior should be. We're trying to make things more balanced, without losing any valuable functionality. We've already done a lot of work improving none and binary conversions, for instance. Once they are more finalized, and once we have a place to put them, we'll have more docs. In the meanwhile most of the Rebol 2 behavior is documented, and most of the changes so far for Rebol 3 are in CureCode.
Also: beyond reading the C sources for the interpreter, where is the comprehensive list of specs accepted by MAKE and TO for each target type?
This may not be that useful, since it's Red-specific:
comparison-matrix
conversion-matrix
But it does at least mention whether the behaviour is similar to or different from Rebol.
On the one hand we have:
>> source object
object: make function! [[
"Defines a unique object."
blk [block!] "Object words and values."
][
make object! append blk none
]]
For context we see:
>> source context
context: make function! [[
"Defines a unique object."
blk [block!] "Object words and values."
][
make object! blk
]]
So, for object the object is constructed from a block to which none has been appended. This doesn't change the length, or, to my knowledge, add anything. With context, on the other hand, the object is constructed with the passed-in block, as is.
Why the difference, and why, for example, couldn't context just be an alias for object?
Backwards compatibility. We had a context function already in Rebol that worked a particular way (not initializing variables), but we needed a function that initialized variables to none, as a convenience function when creating objects as data structures rather than as code containers.
It made sense to call it object since that is the type name, and since "context" is actually kind of a bad name for objects in a language with context-sensitive semantics (for a more appropriate meaning of the word "context"). It really leads to some confusing conversations. Since R3 has modules now, most of the previous uses of the context function are covered better by modules. Keeping context at all is mostly for backwards compatibility.
The current object function is pretty much a placeholder for a better type construction wrapper that we haven't thought up yet. We need something like it, but there may be subtle changes in its behavior needed that we'll discover with more use. For one thing, the fact that it modifies its spec block makes it not very safe for recursion or concurrency. It will likely end up as a native if that improves it, or maybe as a construct option if that turns out to be a better approach.
One thing that did turn out to be a win is the practice of using the type name without the exclamation point as the name of a type construction function. We changed map to be that as well, and we may end up adding similar constructors for other types, though most that need them have them already.
I just saw an interesting way to initialize code blocks in Scala for higher-order functions such as foreach or map:
(1 to 3) map {
  val t = 5
  i => i * 5
}
(1 to 3) foreach {
  val line = Console.readLine
  i => println(line)
}
Is this a documented feature, or should I avoid such constructs? I could imagine that the "initialization" block goes into the constructor and the closure itself becomes an apply() method?
Thanks Pat for the original Question (http://extrabright.com/blog/2010/07/10/scala-question-regarding-readline)
While the features used are not uncommon, I'll admit it is a fairly odd combination of features. The basic trick is that any block in Scala is an expression, with type the same as the last expression in the block. If that last expression is a function, this means that the block has functional type, and thus can be used as an argument to "map" or "foreach". What happens in these cases is that when "map" or "foreach" is called, the block is evaluated. The block evaluates to a function (i => i*5 in the first case), and that function is then mapped over the range.
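As a minimal sketch of that evaluation order, the following counts how often the block body runs compared with how often the resulting function is applied. The object name BlockAsFunction, the counter blockRuns, and the run method are hypothetical names made up for this illustration, and the multiplier uses t rather than the literal 5 so that the closed-over value is visible:

```scala
object BlockAsFunction {
  // Hypothetical counter: records how many times the block body is executed.
  var blockRuns = 0

  def run(): Seq[Int] = {
    blockRuns = 0
    (1 to 3) map {
      blockRuns += 1   // executed once, when the block is evaluated as map's argument
      val t = 5        // initialized once and closed over by the function below
      i => i * t       // the value of the block: a function from Int to Int
    }
  }

  def main(args: Array[String]): Unit = {
    println(run())     // Vector(5, 10, 15)
    println(blockRuns) // 1
  }
}
```

The function is applied three times, but the statements in front of it run only once, which is why the val in the question behaves like one-time initialization.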
One possible use of this construct is for the block to define mutable variables, and the resulting function mutate the variables each time it is called. The variables will be initialized once, closed over by the function, and their values updated every time the function is called.
For example, here's a somewhat surprising way of calculating the first 6 factorial numbers:
(1 to 6) map {
  var total = 1
  i => {total *= i; total}
}
(BTW, sorry for using factorial as an example. It was either that or fibonacci. Functional Programming Guild rules. You gotta problem with that, take it up with the boys down at the hall.)
A less imperative reason to have a block return a function is to define helper functions earlier in the block. For instance, if your second example were instead
(1 to 3) foreach {
  def line = Console.readLine
  i => println(line)
}
The result would be that three lines were read and echoed once each, while your example had the line read once and echoed three times.
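To see the val/def difference without typing anything on the console, here is a rough sketch that swaps Console.readLine for a hypothetical counter-based next() helper (ValVersusDef, next, and run are names invented for this example, not anything from the question):

```scala
object ValVersusDef {
  def run(): (Seq[Int], Seq[Int]) = {
    var calls = 0
    // Hypothetical stand-in for Console.readLine: returns 1, 2, 3, ... on successive calls.
    def next(): Int = { calls += 1; calls }

    val withVal = (1 to 3) map {
      val line = next()  // evaluated once, when the block is evaluated
      i => line
    }
    val withDef = (1 to 3) map {
      def line = next()  // re-evaluated every time the function refers to it
      i => line
    }
    (withVal, withDef)
  }

  def main(args: Array[String]): Unit = {
    val (withVal, withDef) = run()
    println(withVal) // Vector(1, 1, 1)
    println(withDef) // Vector(2, 3, 4)
  }
}
```

The val version calls the helper once and reuses the result, while the def version calls it once per element, which mirrors "read once, echo three times" versus "read three times, echo each once".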
First, a comment on the original blog post, "Scala Question Regarding readLine", mentions:
The “line” is a value and cannot be executed, it is assigned only once from the result of the “Console.readLine” method execution.
It is used less than three times in your closure.
But if you define it as a method, it will be executed three times:
(1 to 3) foreach {
  def line = Console.readLine
  i => println(line)
}
The blog post Scala for Java Refugees Part 6: Getting Over Java has an interesting section on higher-order functions, including:
Scala provides still more flexibility in the syntax for these higher-order function things.
In the iterate invocation, we’re creating an entire anonymous method just to make another call to the println(String) method.
Considering println(String) is itself a method which takes a String and returns Unit, one would think we could compress this down a bit. As it turns out, we can:
iterate(a, println)
By omitting the parentheses and just specifying the method name, we’re telling the Scala compiler that we want to use println as a functional value, passing it to the iterate method.
Thus instead of creating a new method just to handle a single set of calls, we pass in an old method which already does what we want.
This is a pattern commonly seen in C and C++. In fact, the syntax for passing a function as a functional value is precisely the same. Seems that some things never change…
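The quoted passage does not show how iterate is defined, so the following is only a sketch under assumptions: iterate here is a hypothetical helper that calls a String => Unit callback for each array element, and record is a made-up method used in place of println so that the effect can be inspected afterwards.

```scala
import scala.collection.mutable.ListBuffer

object MethodAsFunctionValue {
  // Collects every string handed to record, so the calls can be checked afterwards.
  val seen = ListBuffer.empty[String]

  // Hypothetical stand-in for the blog's iterate: applies callback to each item.
  def iterate(items: Array[String], callback: String => Unit): Unit =
    for (item <- items) callback(item)

  // An ordinary method that takes a String and returns Unit, like println(String).
  def record(s: String): Unit = { seen += s; () }

  def run(): List[String] = {
    seen.clear()
    val a = Array("a", "b", "c")
    iterate(a, s => record(s)) // explicit anonymous function wrapping the method
    iterate(a, record)         // the method name alone, converted to a function value
    seen.toList
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // a, b, c, a, b, c
}
```

Both calls do the same thing; the second relies on the compiler converting the method to a function value because a String => Unit is expected at that position.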