I'm stuck on how to do this with Snakemake.
First, say my "all" rule is:
rule all:
    input: "SA.txt", "SA_T1.txt", "SA_T2.txt",
           "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"
Notice that SA has two _T# files while SB has three such files; this asymmetry is a crucial element of the problem.
Now I want to write a rule like this to generate these files:
rule X:
    output: N="S{X}.txt", T="S{X}_T{Y}.txt"
    (etc.)
But Snakemake requires that both output templates have the same wildcards, which these don't. Further, even if Snakemake could handle the differing wildcards, it would presumably want to find a single filename match for the S{X}_T{Y}.txt template, whereas I want it to match ALL files whose {X} matches the first template's {X}; that is, I want output.T to be a list, not a single file. So it would seem the way to do it is:
def myExpand(T, wildcards):
    T = T.replace("{X}", wildcards.X)
    T = [T.replace("{Y}", S) for S in theXYs[wildcards.X]]
    return T

rule X:
    output: N="S{X}.txt", T=lambda wildcards: myExpand("S{X}_T{Y}.txt", wildcards)
    (etc.)
But I can't do this, because a lambda function cannot be used in an output section.
How do I do it?
It seems to me this argues for supporting lambda functions on output statements, providing a wildcards dictionary filled with values from already-parsed sections of the output statement.
Additions responding to comments:
The value of wildcard Y is needed because other rules have inputs for those files that have the wildcard Y.
My rule knows the different values for Y (and X) that it needs to work with from data read from a database into python dictionaries.
There are many values for X, and from 2 to 6 values of Y for each value of X. I don't think it makes sense to use separate rules for each value of X. However, I might be wrong as I recently learned that one can put a rule inside a loop, and create multiple rules.
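For what it's worth, here is a minimal sketch of that rule-in-a-loop idea, assuming Snakemake accepts anonymous rules defined inside a Python loop, and using a hypothetical theXYs dictionary and generateS program (neither is from the original workflow). Each iteration defines one rule with concrete file names, so no conflicting output wildcards are needed:

theXYs = {"A": ["1", "2"], "B": ["1", "2", "3"]}   # hypothetical; in practice filled from the database

for X, Ys in theXYs.items():
    rule:                                          # anonymous rule, one per value of X
        output:
            N = "S{}.txt".format(X),
            T = ["S{}_T{}.txt".format(X, Y) for Y in Ys]
        params:
            sample = X
        shell:
            "generateS --sample {params.sample} --out {output.N} --tumor-outs {output.T}"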
More info about the workflow: I am combining somatic variant VCF files for several tumor samples from one person together into a single VCF file, and doing it such that for each called variant in any one tumor, all tumors not calling that variant are analyzed to determine read depth at the variant, which is included in the merged VCF file.
The full process involves about 14 steps, which could perhaps be as many as 14 rules. I actually didn't want to use 14 rules, but preferred to just do it all in one rule.
However, I now think the solution is to indeed use lots of separate rules. I was avoiding this partly because of the large number of unnecessary intermediate files, but actually, these exist anyway, temporarily, within a single large rule. With multiple rules I can mark them temp() so Snakemake will delete them at the end.
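For example, a minimal sketch (with hypothetical rule, file, and tool names) of marking an intermediate as temp() so Snakemake removes it once every consumer has run:

rule depth_at_variants:
    input: "S{X}_T{Y}.vcf"
    output: temp("S{X}_T{Y}.depths.txt")   # deleted automatically once downstream rules have used it
    shell: "computeDepths {input} > {output}"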
For the sake of fleshing out this discussion, which I believe is a legitimate one, let's assume a simple situation that might arise. Say that for each of a number of persons, you have N (>=2) tumor VCF files, as I do, and that you want to write a rule that will produce N+1 output files, one output file per tumor plus one more file that is associated with the person. Use wildcard X for person ID and wildcard Y for tumor ID within person X. Say that the operation is to put all variants present in ALL tumor VCF files into the person output VCF file, and all OTHER variants into the corresponding tumor output files for the tumors in which they appear. Say a single program generates all N+1 files from the N input files. How do you write the rule?
You want this:
rule:
    output: COMMON="{X}.common.vcf", INDIV="{X}.{Y}.indiv.vcf"
    input: "{X}.{Y}.vcf"
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output.COMMON} --indiv {output.INDIV}
        """
But that violates the rules for output wildcards.
The way I did it, which is less than satisfactory but works, is to use two rules: the first has the output template with more wildcards, and the second the template with fewer wildcards. The second rule creates temporary output files, which the first rule renames to their final names:
rule A:
    output: "{X}.{Y}.indiv.vcf"
    input: "{X}.common.vcf"
    run:
        for outfile in output:
            shell("mv {outfile}.tmp {outfile}")

rule B:
    output: "{X}.common.vcf"
    input: lambda wildcards: expand("{X}.{Y}.vcf",
                                    X=wildcards.X, Y=getYfromDB(wildcards.X))
    params:
        OUT=lambda wildcards: expand("{X}.{Y}.indiv.vcf.tmp",
                                     X=wildcards.X, Y=getYfromDB(wildcards.X))
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output} --indiv {params.OUT}
        """
I do not know how the rest of your workflow looks, and the best solution is context-dependent.
1. What about splitting up the rule into two: one creating "SA.txt", "SA_T1.txt", "SA_T2.txt" and another creating "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"?
2. Another possibility is to have only the {X} files in the output directive, but have the rule create the other files even though they are not declared as output. This does not work if the {Y} files are part of the DAG.
3. (Possibly the best solution) A third option is to use aggregated wildcards in rule X and in the rule that requires the output from X. The solution would then be:
rule X:
    output: N="S{X_prime}.txt", T="S{Y_prime}.txt"
The rule which requires these files can look like:
rule all:
    input:
        expand("S{X_prime}.txt", X_prime="A_T1 A_T2".split()),
        expand("S{Y_prime}.txt", Y_prime="B_T1 B_T2 B_T3".split())
If this does not meet your requirements we can discuss it further :)
P.S. You might need to use wildcard_constraints to disambiguate the outputs of rule X:
list_of_all_valid_X_prime_values = "A_T1 A_T2".split()
list_of_all_valid_Y_prime_values = "B_T1 B_T2 B_T3".split()

wildcard_constraints:
    X_prime = "({})".format("|".join(list_of_all_valid_X_prime_values)),
    Y_prime = "({})".format("|".join(list_of_all_valid_Y_prime_values))

rule all:
    ...
My understanding is that Snakemake works by taking steps that look as follows:
1. Look at the name of a file it is asked to generate.
2. Find the rule that has a matching pattern in its output.
3. Infer the values of the wildcards from the above match.
4. Use the wildcards to determine what the names of the inputs of the chosen rule should be.
Your rule can generate its output without needing an input, so the problem of inferring the value of wildcard Y is not evident.
How does your rule know how many different values of Y it needs to work with?
If you find a way to determine the values for Y just knowing the value for X and predefined "python-level" functions and variables, then there may be a way to have Y as an internal variable of your rule, and not a wildcard.
In this case, the workflow could be driven only by the S{X}.txt files. The S{X}_T{Y}.txt files would just be byproducts of the execution of the rule, not part of its explicit output.
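A minimal sketch of that byproduct-driven approach (the theXYs dictionary and the mergeTumors program are hypothetical, as above): only S{X}.txt is declared, so only it participates in the DAG, while the per-tumor files are written as side effects.

theXYs = {"A": ["1", "2"], "B": ["1", "2", "3"]}   # hypothetical; loaded from the database

rule X:
    output: "S{X}.txt"
    params:
        T = lambda wildcards: ["S{}_T{}.txt".format(wildcards.X, Y) for Y in theXYs[wildcards.X]]
    shell:
        "mergeTumors --out {output} --tumor-outs {params.T}"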
Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are/are not non-empty, i.e. I want to get [3,2].
The number of nested lists which do/do not contain the number 3, i.e. I want to get [1,4].
The number of nested lists for which the sum of the elements is/isn't less than 4, i.e. I want to get [3,2].
That is, basic examples of partitioning nested data.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists
"which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
QED
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.
I have been working through the Advent of Code problems in Perl6 this year and was attempting to use a grammar to parse the Day 3's input.
Given input in this form: #1 # 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 # 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
  '#' $<id>=(\d+) ' '
  '# ' $<x>=(\d+) ',' $<y>=(\d+)
  ': ' $<w>=(\d+) 'x' $<d>=(\d+)
  $
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out the way, your question remains perfectly cromulent as it is because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
    token TOP {
        '#' <id> \s* '#' \s* <coordinates> ':' \s* <dimensions>
        { $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
    }
    ...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. This can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability, you typically should not insert code inline at the end of rules, but should instead use Action objects.
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int)
or
say $match.<coordinates>.<digits>>>.Int
or even
say $match.<coordinates>.<digits>».Int
to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int
I am using ELKI to cluster data from a CSV file. I use
-resulthandler ResultWriter
-out folder/
to save the output data.
But the output contains some strange indexes:
ID=2138 0.1799 0.2761
ID=2137 0.1797 0.2778
ID=2136 0.1796 0.2787
ID=2109 0.1161 0.2072
ID=2007 0.1139 0.2047
The IDs are above 2000, even though I have fewer than 100 training samples.
DBIDs are internal; the documentation clearly says that you shouldn't make too many assumptions about them, because their implementation may change. The only reason they are written to the output at all is that some methods (such as OPTICS) may require cross-referencing objects by this unique ID.
Because they are meant to be unique identifiers, they are usually continuously incremented. The next time you click on "run" in the MiniGUI, you will get the next n IDs... so clearly, you clicked run more than once.
The "Tips & Tricks" in the ELKI DBID documentation probably answer your underlying question - how to use map DBIDs to line numbers of your input file. The best way is to if you want to have object identifiers, assign object identifiers yourself by using an identifier column (and configuring it to be an external identifier).
For further information, see the documentation: https://elki-project.github.io/dev/dbids
I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback Method for 41 different groups. Some of the values are non-existing, though, and the table lists them as "X". I saved the table as a .csv and, in my code, read the file into an array. An excerpt of two lines from the .csv (the second one contains non-existing coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried doing was to read and define an array to hold each IOSTAT value so I can know if an "X" was read (that is, IOSTAT would be positive):
DO I = 1, 41
  READ(25,*,IOSTAT=ReadStatus(I,J)) (JobackCoeff(I,J), J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values on that line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during the input execution then processing of the input list terminates. Further, all variables specified in the input list become undefined. The short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations, robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i = 1, 13
  read(line(idx:), *, iostat=iostat) x(i)
  if (iostat.gt.0) then
    print '("Column ",I0," has an X")', i
    x(i) = -HUGE(0.)   ! Recall x(i) was left undefined
  end if
  idx = idx + INDEX(line(idx:), ',')
end do
An alternative, long used by many Fortran programmers (and programmers in other languages), would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NANs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do), and it will then correctly interpret NAN in the input file as a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.
Summary and basic question
Using MS Access 2010 and VBA (sigh..)
I am attempting to implement a specialized Diff function that is capable of outputting a list of changes in different ways depending on what has changed. I need to be able to generate a concise list of changes to submit for our records.
I would like to use something such as html tags like <span class="references">These are references 1, 6</span> so that I can review the changes with code and customize how the change text is outputted. Or anything else to accomplish my task.
I see this as a way to provide an extensible way to customize the output, and possibly move things into a more robust platform and actually use html/css.
Does anyone know of a similar project that may be able to point me in the right direction?
My task
I have an access database with tables of work operation instructions - typically 200-300 operations, many of which are changing from one revision to another. I have currently implemented a function that iterates through tables, finds instructions that have changed and compares them.
Note that each operation instruction is typically a couple sentences with a couple lines at the end with some document references.
My algorithm is based on "An O(ND) Difference Algorithm and Its Variations" and it works great.
Access supports "Rich" text, which is just glorified simple html, so I can easily generate the full text with formatted additions and deletions, i.e. adding tags like <font color = "red"><strong><i>This text has been removed</i></strong></font>. The main output from the Diff procedure is a full text of the operation that includes non-changed, deleted, and inserted text inline with each other. The diff procedure adds <del> and <ins> tags that are later replaced with the formatting text later (The result is something similar to the view of changes from stack exchange edits).
However, like I said, I need the changes listed in human readable format. This has proven difficult because of the ambiguity many changes create.
For example: if a type of chemical is being changed from "Class A" to "Class C", the change text that is easily generated is "Change 'A' to 'C'", which is not very useful to someone reviewing the changes. More common are document references at the end: adding SOP 3 to the list, as in "SOP 1, 2, 3", generates the text "Add '3'". Clearly not useful either.
What would be most useful is a custom output for text designated as "SOP" text so that the output would be "Add reference to SOP 3".
I started with the following solution:
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare. This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
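Not the author's VBA, but a small Python sketch of that grouping idea (treating an SOP reference list as a single token before diffing); the regex and function names are illustrative assumptions:

import re

# Collapse "SOP 1, 2, 3" style reference lists into single tokens so the diff
# compares whole reference lists instead of individual numbers.
SOP_LIST = re.compile(r'\bSOP\s+\d+(?:\s*,\s*\d+)*')

def tokenize(text):
    tokens = []
    pos = 0
    for m in SOP_LIST.finditer(text):
        tokens.extend(text[pos:m.start()].split())  # ordinary words
        tokens.append(m.group(0))                    # one token for the whole SOP list
        pos = m.end()
    tokens.extend(text[pos:].split())
    return tokens

print(tokenize("Use Class A chemical per SOP 1, 2, 3 before assembly."))
# ['Use', 'Class', 'A', 'chemical', 'per', 'SOP 1, 2, 3', 'before', 'assembly.']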
Where I am now
I am now attempting to add extra HTML tags before running the diff algorithm. For example, I will run the text through a "pre-processor" that wraps "SOP 1, 2" in a class-bearing span tag, like the <span class="references"> example above.
Once the Diff procedure returns the full change text, I scan through it noting the current "class" of text and when there is a <del> or <ins> I capture the text between the tags and use a SELECT CASE block over the class to address each change.
This actually works okay for the most part, but there are many issues that I have to work through, such as Diff deciding that the shortest path is to delete certain opening tags and insert other ones. This creates scenarios in which there are two opening <span> tags but only one closing </span> tag.
The ultimate question
I am looking for advice on whether to continue with the direction I have started or to try something different before investing a lot more time in a sub-optimal solution.
Thanks all in advance.
Also note:
The time for a typical run is approximately 1.5 to 2.5 seconds, even with me attempting fancier things and a bunch of Debug.Print calls, so running an extra pass or two wouldn't be a killer.
It is clear that reporting a difference in terms of the smallest change to the structures you have isn't what you want; you want to report some context.
To do that, you have to identify what context there is to report, so you can decide what part of that is interesting. You sketched an idea where you fused certain elements of your structure together (e.g., 'SOP' '1' '2' into 'SOP 1 2'), but that seems to me to be going the wrong way. What it is doing is changing the size of the smallest structure elements, not reporting better context.
While I'm not sure this is the right approach, I'd try to characterize your structures using a grammar, e.g., a BNF. For instance, some grammar rules you might have would be:
action = 'SOP' sop_item_list ;
sop_item_list = NATURAL ;
sop_item_list = sop_item_list ',' NATURAL ;
Now an actual SOP item can be characterized as an abstract syntax tree (shown as nested children, indexable by constants to get to subtrees):
t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))
You still want to compute a difference using something like the dynamic programming algorithm you have suggested, but now you want a minimal tree delta. Done right (my company makes tools that do this for grammars for conventional computer languages, and you can find publicly available tree diff algorithms), you can get a delta like:
replace t1[2] with sop_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3])
which in essence is what you got by gluing (SOP,1,2) into a single item, but without the external adhocery of you personally deciding to do that.
The real value in this, I think, is that you can use the tree t1 to report the context. In particular, you start at the root of the tree and print summaries of the subtrees (clearly you don't want to print full subtrees, as that would just give you back the full original text).
By printing subtrees down to a depth of one or two levels and eliding anything deeper (e.g., representing a list as "..." and a single subtree as "_"), you could print something like:
replace 'SOP' ... with 'SOP',...,3
which I think is what you are looking for in your specific example.
No, this isn't an algorithm; it is a sketch of an idea. The fact that we have tree-delta algorithms that compute useful deltas, and the summarizing idea (taken from LISP debuggers, frankly), suggests this will probably generalize to something useful, or at least take you in a new direction.
Having your answer in terms of ASTs should also make it relatively easy to produce HTML as you desire. (People work with XML all the time, and XML is basically a tree representation.)
Try thinking in terms of Prolog-style rewrite rules that transform your instructions into a canonical form that will cause the diff algorithm to produce what you need. The problem you specified would be solved by this rule:
SOP i1, i2, ... iN -> SOP j1, SOP j2, ... SOP jN where j = sorted(i)
In other words, "distribute" SOP over a sorted list of the following integers. This will trick the diff algorithm into giving a fully qualified change report "Add SOP 3."
Rules are applied by searching the input for matches of the left hand side and replacing with the corresponding right.
You are probably already doing this, but you will get more commonsense analysis if the input is tokenized: "SOP" should be considered a single "character" for the diff algorithm. Whitespace may be reduced to tokens for space and line break if they are significant, or else ignored.
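A small Python sketch of that specific rewrite (not the answerer's Prolog/TXL approach): distribute "SOP" over a sorted list of the following integers, so the diff sees independent, fully qualified items. The regex and function name are illustrative assumptions.

import re

def distribute_sop(text):
    """Rewrite 'SOP 1, 3, 2' -> 'SOP 1, SOP 2, SOP 3' (sorted, one SOP per item)."""
    def repl(m):
        numbers = sorted(int(n) for n in re.findall(r'\d+', m.group(0)))
        return ', '.join('SOP {}'.format(n) for n in numbers)
    return re.sub(r'\bSOP\s+\d+(?:\s*,\s*\d+)*', repl, text)

print(distribute_sop("Clean per SOP 1, 3, 2 before use."))
# Clean per SOP 1, SOP 2, SOP 3 before use.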
You can do another kind of diff at the character level to test "fuzzy" equality of tokens, to account for typographical errors when matching left-hand sides. "SIP" and "SOP" would be counted as a "match" because their edit distance is only 1 (and I and O are only one key apart on a QWERTY keyboard!).
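For that fuzzy token equality, a minimal Python sketch using edit distance (a hand-rolled Levenshtein, purely for illustration):

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tokens_match(a, b, tolerance=1):
    """Treat tokens as equal if they differ by at most `tolerance` edits."""
    return edit_distance(a, b) <= tolerance

print(tokens_match("SIP", "SOP"))   # True  (edit distance 1)
print(tokens_match("SOP", "PARA"))  # False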
If you consider all the quirks in the output you're getting now and can rectify each one as a rewrite rule that takes the input to a form where the diff algorithm produces what you need, then what's left is to implement the rewrite system. Doing this in a general way that is efficient, so that changing and adding rules does not involve a great deal of ad hoc coding, is a difficult problem, but one that has been studied. It's interesting that @Ira Baxter mentioned Lisp, as it excels as a tool for this sort of thing.
Ira's suggestion of syntax trees falls naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:
SOP 1 section 3, paras 2,1,3
is a hierarchy that should be rewritten as
SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3
The rewrite rules
paras i1, i2, ... iN -> para j1, para j2, ... para jN where j = sorted(i)
section K para i1, ... para iN -> section K para j1, ... section K para jN
SOP section K para i1, ... section K para iN -> SOP section K para j1, ... SOP section K para jN
when applied in three passes will produce a result like "SOP 1 section 3, para 4 was added."
While there are many strategies for implementing the rules and rewrites, including coding each one as a procedure in VB (argh...), there are other ways. Prolog is a grand attempt to do this as generally as possible, and there is a free implementation; there are others. I have used TXL to get useful rewriting done. The only problem with TXL is that it assumes you have a rather strict grammar for the inputs, which doesn't sound like the case in your problem.
If you post more examples of the quirks in your current outputs, I can follow this up with more detail on rewrite rules.
In case you decide to proceed with what you have already achieved (IMO you have gotten pretty far already), you might consider doing the diff in two steps.
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare.
That's a good start; you already managed to make the context clear to the user.
This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
How about doing another diff pass across the tokens found (i.e. compare 'SOP 1, 2' with 'SOP 1, 2, 3'), this time without the word grouping, to generate additional info? That would make the full output something like this:
Change 'SOP 1, 2' to 'SOP 1, 2, 3'
Change details: Add '3'
The text is a bit cryptic, so you may want to do some refining there. I would also suggest to truncate lengthy tokens in the first line ('SOP 1, 2, 3, ...'), since the second line should already provide sufficient detail.
I am not sure about the performance impact of this second pass; in a big text with many changes, you may experience many round trips to the diff functionality. You might optimize by accumulating the changes from the first pass into one 'change document', running the second pass across it, and then stitching the results together.
HTH.