Extract tokens from grammar

I have been working through the Advent of Code problems in Perl 6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 # 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 # 1,3: 4x4');
I am interested in extracting the actual tokens that were matched, i.e. the id, x and y from coordinates, and height and width from dimensions, from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need, e.g.
say $match<id>.hash<digits>.<digit>;
This seems a little messy; is there a better way?

For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
  '#' $<id>=(\d+) ' '
  '# ' $<x>=(\d+) ',' $<y>=(\d+)
  ': ' $<w>=(\d+) x $<d>=(\d+)
  $
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out of the way, your question remains perfectly cromulent as it is, because it's important to know how P6 scales in this regard, from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
This seems a little messy; is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched, i.e. the id, x and y from coordinates, and height and width from dimensions, from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree, which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. They just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
    token TOP {
        '#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
        { $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
    }
    ...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
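For example, this contrived variation (a sketch, not part of the original grammar) prints partial parse state in the middle of a token; the block runs as soon as the first <digits> has matched:

token coordinates {
    <digits> { say "x so far: ", ~$<digits>[0] } ',' <digits>
}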
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. This can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability, you typically should not insert code inline at the end of rules, but should instead use Action objects.
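A minimal sketch of the action-object approach, assuming the x/y-named coordinates token from earlier (ClaimActions is a name made up for this example):

class ClaimActions {
    method TOP ($/) {
        make ~($<id>, $<coordinates>.made, $<dimensions>.made)
    }
    method coordinates ($/) { make ~$<x y> }
    method dimensions ($/)  { make ~$<digits>[0,1] }
}
say Claim.parse('#1 # 1,3: 4x4', actions => ClaimActions.new).made;
# 1 1 3 4 4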
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
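For example (a sketch):

token coordinates { <.digits> ',' <.digits> }

~$match<coordinates> still stringifies to the matched text, but the parse tree below it no longer stores the individual digits sub-matches.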

You can refer to each of your named portions directly. So to get the coordinates you can access:

say $match.<coordinates>.<digits>

This will return the Array of digits matches. If you just want the values, the easiest way is probably:

say $match.<coordinates>.<digits>.map( *.Int )

or

say $match.<coordinates>.<digits>>>.Int

or even

say $match.<coordinates>.<digits>».Int

to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int

Related

Output files with mixed wildcards

I'm stuck on how to do this with Snakemake.
First, say my "all" rule is:
rule all:
    input: "SA.txt", "SA_T1.txt", "SA_T2.txt",
           "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"
Notice that SA has two _T# files while SB has three such files, a crucial element of this.
Now I want to write a rule like this to generate these files:
rule X:
    output: N="S{X}.txt", T="S{X}_T{Y}.txt"
    (etc.)
But Snakemake requires that both output templates have the same wildcards, which these don't. Further, even if Snakemake could handle the multiple wildcards, it would presumably want to find a single filename match for the S{X}_T{Y}.txt template, but I want that to match ALL files where {X} matches the first template's {X}, i.e. I want output.T to be a list, not a single file. So it would seem the way to do it is:
def myExpand(T, wildcards):
    T = T.replace("{X}", wildcards.X)
    T = [T.replace("{Y}", S) for S in theXYs[wildcards.X]]
    return T

rule X:
    output: N="S{X}.txt", T=lambda wildcards: myExpand("S{X}_T{Y}.txt", wildcards)
    (etc.)
But I can't do this, because a lambda function cannot be used in an output section.
How do I do it?
It seems to me this argues for supporting lambda functions on output statements, providing a wildcards dictionary filled with values from already-parsed sections of the output statement.
Additions responding to comments:
The value of wildcard Y is needed because other rules have inputs for those files that have the wildcard Y.
My rule knows the different values for Y (and X) that it needs to work with from data read from a database into python dictionaries.
There are many values for X, and from 2 to 6 values of Y for each value of X. I don't think it makes sense to use separate rules for each value of X. However, I might be wrong, as I recently learned that one can put a rule inside a loop and create multiple rules.
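That rules-in-a-loop idea might look something like this sketch (untested; theXYs stands for the dictionary mentioned above, mapping each X to its list of Y values, and touch is a placeholder command):

theXYs = {"A": ["T1", "T2"], "B": ["T1", "T2", "T3"]}   # made-up sample data

for x, ys in theXYs.items():
    rule:
        output:
            N = "S%s.txt" % x,
            T = ["S%s_T%s.txt" % (x, y) for y in ys]
        shell:
            "touch {output.N} {output.T}"   # placeholder command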
More info about the workflow: I am combining somatic variant VCF files for several tumor samples from one person together into a single VCF file, and doing it such that for each called variant in any one tumor, all tumors not calling that variant are analyzed to determine read depth at the variant, which is included in the merged VCF file.
The full process involves about 14 steps, which could perhaps be as many as 14 rules. I actually didn't want to use 14 rules, but preferred to just do it all in one rule.
However, I now think the solution is to indeed use lots of separate rules. I was avoiding this partly because of the large number of unnecessary intermediate files, but actually, these exist anyway, temporarily, within a single large rule. With multiple rules I can mark them temp() so Snakemake will delete them at the end.
For the sake of fleshing out this discussion, which I believe is a legitimate one, let's assume a simple situation that might arise. Say that for each of a number of persons, you have N (>=2) tumor VCF files, as I do, and that you want to write a rule that will produce N+1 output files, one output file per tumor plus one more file that is associated with the person. Use wildcard X for person ID and wildcard Y for tumor ID within person X. Say that the operation is to put all variants present in ALL tumor VCF files into the person output VCF file, and all OTHER variants into the corresponding tumor output files for the tumors in which they appear. Say a single program generates all N+1 files from the N input files. How do you write the rule?
You want this:
rule:
    output: COMMON="{X}.common.vcf", INDIV="{X}.{Y}.indiv.vcf"
    input: "{X}.{Y}.vcf"
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output.COMMON} --indiv {output.INDIV}
        """
But that violates the rules for output wildcards.
The way I did it, which is less than satisfactory but works, is to use two rules. The first has the output template with more wildcards and the second the template with fewer wildcards; the second rule creates temporary output files that are renamed to their final names by the first rule:
rule A:
    output: "{X}.{Y}.indiv.vcf"
    input: "{X}.common.vcf"
    run:
        for infile in input:
            os.system('mv ' + infile + '.tmp ' + infile)

rule B:
    output: "{X}.common.vcf"
    input: lambda wildcards: \
        expand("{X}.{Y}.vcf", **wildcards, Y=getYfromDB(wildcards["X"]))
    params: OUT=lambda wildcards: \
        expand("{X}.{Y}.indiv.vcf.tmp", **wildcards, Y=getYfromDB(wildcards["X"]))
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output} --indiv {params.OUT}
        """
I do not know how the rest of your workflow looks, and what the best solution is depends on that context.
1
What about splitting up the rule into two, one creating "SA.txt", "SA_T1.txt", "SA_T2.txt" and another "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"?
2
Another possibility is to only have the {X} files in the output-directive, but have the rule create the other files, even though they are not in the output-directive. This does not work if the {Y} files are part of the DAG.
3 (Best solution?)
A third and potentially the best solution might be to have aggregated wildcards in the X rule and the rule that requires the output from X.
Then the solution would be
rule X:
    output: N="S{X_prime}.txt", T="S{Y_prime}.txt"
The rule which requires these files can look like:
rule all:
    input:
        expand("S{X_prime}.txt", X_prime="A_T1 A_T2".split()),
        expand("S{Y_prime}.txt", Y_prime="B_T1 B_T2 B_T3".split())
If this does not meet your requirements we can discuss it further :)
Ps. You might need to use wildcard_constraints to disambiguate the outputs of rule X.
list_of_all_valid_X_prime_values = "A_T1 A_T2".split()
list_of_all_valid_Y_prime_values = "B_T1 B_T2 B_T3".split()

wildcard_constraints:
    X_prime = "({})".format("|".join(list_of_all_valid_X_prime_values)),
    Y_prime = "({})".format("|".join(list_of_all_valid_Y_prime_values))

rule all:
    ...
My understanding is that snakemake works by taking steps that look as follows:
Look at the name of a file it is asked to generate.
Find the rule that has a matching pattern in its output.
Infer the values of the wildcards from the above match.
Use the wildcards to determine what the name of the inputs of the chosen rule should be.
Your rule can generate its output without needing an input, so the problem of inferring the value of wildcard Y is not evident.
How does your rule know how many different values for Y it needs to work with?
If you find a way to determine the values for Y just knowing the value for X and predefined "python-level" functions and variables, then there may be a way to have Y as an internal variable of your rule, and not a wildcard.
In this case, the workflow could be driven only by the S{X}.txt files. The S{X}_T{Y}.txt files would just be a byproduct of the execution of the rule, not part of its explicit output.
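A sketch of that byproduct approach (placeholder commands; getYfromDB is the same kind of lookup as in the earlier answer):

rule X:
    output: "S{X}.txt"
    run:
        # These files are created but deliberately not declared as output:
        for y in getYfromDB(wildcards.X):
            shell("touch S{wildcards.X}_T{y}.txt")   # placeholder command
        shell("touch {output}")                      # placeholder command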

Perl6: Convert Match object to JSON-serializable Hash

I am currently gettin' my hands dirty on some Perl 6. Specifically, I am trying to write a Fortran parser based on grammars (the Fortran::Grammar module).
For testing purposes, I would like to have the possibility to convert a Match object into a JSON-serializable Hash.
Googling and the official Perl 6 documentation didn't help. My apologies if I overlooked something.
My attempts so far:
I know that one can convert a Match $m to a Hash via $m.hash. But this keeps nested Match objects.
Since this just has to be solvable via recursion, I tried that, but gave up in favor of first asking here whether a simpler or existing solution exists.
Dealing with Match objects' contents is obviously best accomplished via make/made. I would love to have a super simple Actions object to hand to .parse, with a default method for all matches that basically just does make $/.hash or something like that. I just have no idea how to specify a default method.
Here's an action class method from one of my Perl 6 projects, which does what you describe.
It does almost the same as what Christoph posted, but is written more verbosely (and I've added copious amounts of comments to make it easier to understand):
#| Fallback action method that produces a Hash tree from named captures.
method FALLBACK ($name, $/) {
    # Unless an embedded { } block in the grammar already called make()...
    unless $/.made.defined {
        # If the Match has named captures, produce a hash with one entry
        # per capture:
        if $/.hash -> %captures {
            make hash do for %captures.kv -> $k, $v {
                # The key of the hash entry is the capture's name.
                $k => $v ~~ Array
                    # If the capture was repeated by a quantifier, the
                    # value becomes a list of what each repetition of the
                    # sub-rule produced:
                    ?? $v.map(*.made).cache
                    # If the capture wasn't quantified, the value becomes
                    # what the sub-rule produced:
                    !! $v.made
            }
        }
        # If the Match has no named captures, produce the string it matched:
        else { make ~$/ }
    }
}
Notes:
This totally ignores positional captures (i.e. those made with ( ) inside the grammar) - only named captures (e.g. <foo> or <foo=bar>) are used to build the Hash tree. It could be amended to handle them too, depending on what you want to do with them. Keep in mind that:
$/.hash gives the named captures, as a Map.
$/.list gives the positional captures, as a List.
$/.caps (or $/.pairs) gives both the named and positional captures, as a sequence of name=>submatch and/or index=>submatch pairs.
It allows you to override the AST generation for specific rules, either by adding a { make ... } block inside the rule in the grammar (assuming that you never intentionally want to make an undefined value), or by adding a method with the rule's name to the action class.
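For example, a method named after a rule takes precedence over FALLBACK, so overriding the AST for one specific rule could look like this (a sketch, assuming a token called digits):

method digits ($/) { make +$/ }   # produce an Int instead of the default hash/string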
I just have no idea on how to specify a default method.
The method name FALLBACK is reserved for this purpose.
Adding something like this
method FALLBACK($name, $/) {
    make $/.pairs.map(-> (:key($k), :value($v)) {
        $k => $v ~~ Match ?? $v.made !! $v>>.made
    }).hash || ~$/;
}
to your actions class should work.
For each named rule without an explicit action method, it will make either a hash containing its subrules (either named ones or positional captures), or if the rule is 'atomic' and has no such subrules the matching string.
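To get actual JSON out at the end, the .made tree can then be fed to a JSON module. A sketch (MyGrammar and MyActions are placeholders for whatever grammar and action class you use):

use JSON::Fast;
my $match = MyGrammar.parse($source, actions => MyActions.new);
say to-json $match.made;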

Can I read the rest of the line after a positive value of IOSTAT?

I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback Method for 41 different groups. Some of the values are non-existing, though, and the table lists them as "X". I saved the table as a .csv and in my code read the file into an array. An excerpt of two lines from the .csv (the second one contains non-existing coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried doing was to read and define an array to hold each IOSTAT value so I can know if an "X" was read (that is, IOSTAT would be positive):
DO I = 1, 41
    READ(25,*,IOSTAT=ReadStatus(I,J)) (JobackCoeff(I,J), J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values on that line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during execution of the input statement, processing of the input list terminates. Further, all variables specified in the input list become undefined. So the short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations, robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i = 1, 13
    read(line(idx:), *, iostat=iostat) x(i)
    if (iostat.gt.0) then
        print '("Column ",I0," has an X")', i
        x(i) = -HUGE(0.)   ! Recall x(i) was left undefined
    end if
    idx = idx + INDEX(line(idx:), ',')
end do
An alternative, long used by many, many Fortran programmers (and programmers in other languages), would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NANs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do), and it will then correctly interpret NAN in the input file as a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.
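One possible sed invocation (a sketch; it assumes X only ever appears as a whole comma-separated field, and it runs the middle-field substitution twice so adjacent X fields are both caught; joback.csv is a made-up name for your input file):

sed -e 's/^X,/NAN,/' -e 's/,X$/,NAN/' \
    -e 's/,X,/,NAN,/g' -e 's/,X,/,NAN,/g' joback.csv > joback_nan.csv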

COBOL code to replace characters by html entities

I want to replace the characters '<' and '>' by '&lt;' and '&gt;' with COBOL. I was wondering about the INSPECT statement, but it looks like that statement can only translate one char into another. My intention is to replace all HTML characters by their HTML entities.
Can anyone figure out some way to do it? Maybe looping over the string and testing each char is the only way?
GnuCOBOL or IBM COBOL examples are welcome.
My best code so far is something like this (http://ideone.com/MKiAc6):
IDENTIFICATION DIVISION.
PROGRAM-ID. HTMLSECURE.
ENVIRONMENT DIVISION.
DATA DIVISION.
WORKING-STORAGE SECTION.
77  INPTXT  PIC X(50).
77  OUTTXT  PIC X(500).
77  I       PIC 9(4) COMP VALUE 1.
77  P       PIC 9(4) COMP VALUE 1.
PROCEDURE DIVISION.
    MOVE 1 TO P
    MOVE '<SCRIPT> TEST TEST </SCRIPT>' TO INPTXT
    PERFORM VARYING I FROM 1 BY 1
            UNTIL I EQUAL LENGTH OF INPTXT
        EVALUATE INPTXT(I:1)
            WHEN '<'
                MOVE "&lt;" TO OUTTXT(P:4)
                ADD 4 TO P
            WHEN '>'
                MOVE "&gt;" TO OUTTXT(P:4)
                ADD 4 TO P
            WHEN OTHER
                MOVE INPTXT(I:1) TO OUTTXT(P:1)
                ADD 1 TO P
        END-EVALUATE
    END-PERFORM
    DISPLAY OUTTXT
    STOP RUN
    .
GnuCOBOL (yes, another name branding change) has an intrinsic function extension, FUNCTION SUBSTITUTE.
move function substitute(inptxt, ">", "&gt;", "<", "&lt;") to where-ever-including-inptxt
Takes a subject string, followed by pairs of patterns and replacements. (These are not regex patterns; it's straight-up text matching.) See http://opencobol.add1tocobol.com/gnucobol/#function-substitute for some more details. The patterns and replacements can all be of different lengths.
As intrinsic functions return anonymous COBOL fields, the result of the function can be used to overwrite the subject field, without worry of sliding overlap or other "change while reading" problems.
COBOL is a language of fixed-length fields. So no, INSPECT is not going to be able to do what you want.
If you need this for an IBM Mainframe, your SORT product (assuming sufficiently up-to-date) can do this using FINDREP.
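An untested sketch of what the DFSORT control statements might look like (check your SORT product's FINDREP documentation for the exact syntax and for how it handles the change in record length):

  OPTION COPY
  OUTREC FINDREP=(INOUT=(C'<',C'&lt;',C'>',C'&gt;'))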
If you look at the XML processing possibilities in Enterprise COBOL, you will see that they do exactly what you want (I'd guess). GnuCOBOL can also readily interface with lots of other things. If you are writing GnuCOBOL for running on a non-Mainframe, I'd suggest you ask on the GnuCOBOL part of SourceForge.
Otherwise, yes, it would come down to looping through the data. Once you clarify what you want a bit more, you may get examples of that if you still need them.

multilevel parsing algorithm

rephrase...
I'd like to know how best to parse functions/conditionals. So if you have something like [if {a} is {12 or 34}][if {b} not {55}] show +c+ [/if][/if] (a conditional inside a conditional), it looks like I can't do this with regex alone.
original question
For now I have a pretty simple way of parsing out some commands through ActionScript.
I'm using regexes to find tags, commands and operands using...
+key_word+                                            // any text surrounded by +
[ifempty +val_1+]+val_2+[/ifempty]                    // simple conditional
[ifisnot={`true,yes`} +ShowTitle+]+val_3+[/ifisnot]   // conditional with operands
My current algorithm matches the opening tag [**] with the first closing tag [/**] even when they don't correspond. This means that I could not do something like [ifempty +val_2+][ifnotempty +val_2+]+val_3+[/ifnotempty]+val_4+[/ifempty], essentially putting one conditional inside another one.
I'm using an inline way of parsing that splits the string into an array of strings based on this regexp \[[^\/](?:[^\]])*\](?:[^\]])*\[\/(?:[^\]])*\]
Can anyone suggest a more robust algorithm, with a more robust parsing convention/standard? Especially for AS3.
Regular expressions define Regular Languages. Regular Languages cannot have regions of constrained, but potentially infinite, recursion.
One way of thinking about it is that all Regular Languages can be represented by a Finite State Machine. You would need a state for every possible number of ifs, but the machine must be 'finite', so you're in a bind. A classic example is:
a{n}b{n}, n >= 0
(meaning n a's, followed by n b's)
As you parse each a, you would need to go to another state (FSMs have no memory beyond the state they're in; that's the only way they could remember n to match it later). To parse any value of n, you would need an infinite number of states.
This is the same situation you're in, a regular expression could express a finite number of ifs (although it would take quite a bit of copy-pasting), but not an infinite number. Note however that some regular expression implementations cheat a bit, giving them more power than their mathematical equivalents.
In any case, your best bet is to use a more powerful parsing method. A recursive descent parser is particularly fun to implement, and could easily do what you need. You could also look into an LR parser, or build a simple parser using a stack. Depending on your language, you might be able to find a parsing library such as pyparsing for Python or Boost Spirit for C++.
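For a flavour of the recursive-descent approach, here is a rough sketch in TypeScript (the structure carries over to AS3; the tag syntax and node shapes are simplified assumptions, with no error recovery):

type AstNode =
    | { kind: "text"; value: string }
    | { kind: "tag"; name: string; args: string; children: AstNode[] };

function parse(src: string): AstNode[] {
    let pos = 0;

    function parseNodes(closing?: string): AstNode[] {
        const nodes: AstNode[] = [];
        while (pos < src.length) {
            if (src.startsWith("[/", pos)) {           // a closing tag
                const end = src.indexOf("]", pos);
                const name = src.slice(pos + 2, end);
                if (name !== closing) throw new Error(`unexpected [/${name}]`);
                pos = end + 1;
                return nodes;                          // matched our own closer
            }
            if (src[pos] === "[") {                    // an opening tag
                const end = src.indexOf("]", pos);
                const head = src.slice(pos + 1, end);
                const name = head.split(" ")[0];
                pos = end + 1;
                nodes.push({
                    kind: "tag", name,
                    args: head.slice(name.length).trim(),
                    children: parseNodes(name),        // recurse until [/name]
                });
            } else {                                   // plain text
                const next = src.indexOf("[", pos);
                const stop = next === -1 ? src.length : next;
                nodes.push({ kind: "text", value: src.slice(pos, stop) });
                pos = stop;
            }
        }
        if (closing) throw new Error(`missing [/${closing}]`);
        return nodes;
    }

    return parseNodes();
}

// e.g. parse("[if {a} is {12 or 34}][if {b} not {55}] show +c+ [/if][/if]")
// yields one "if" tag whose children contain the nested "if" tag.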