Context sensitive diff implementation - html

Summary and basic question
Using MS Access 2010 and VBA (sigh..)
I am attempting to implement a specialized Diff function that is capable of outputting a list of changes in different ways depending on what has changed. I need to be able to generate a concise list of changes to submit for our records.
I would like to use something such as html tags like <span class="references">These are references 1, 6</span> so that I can review the changes with code and customize how the change text is outputted. Or anything else to accomplish my task.
I see this as a way to provide an extensible way to customize the output, and possibly move things into a more robust platform and actually use html/css.
Does anyone know of a similar project that may be able to point me in the right direction?
My task
I have an access database with tables of work operation instructions - typically 200-300 operations, many of which are changing from one revision to another. I have currently implemented a function that iterates through tables, finds instructions that have changed and compares them.
Note that each operation instruction is typically a couple sentences with a couple lines at the end with some document references.
My algorithm is based on "An O(ND) Difference Algorithm and Its Variations" and it works great.
Access supports "Rich" text, which is just glorified simple html, so I can easily generate the full text with formatted additions and deletions, i.e. adding tags like <font color = "red"><strong><i>This text has been removed</i></strong></font>. The main output from the Diff procedure is a full text of the operation that includes non-changed, deleted, and inserted text inline with each other. The diff procedure adds <del> and <ins> tags that are later replaced with the formatting text (the result is something similar to the view of changes from stack exchange edits).
However, like I said, I need the changes listed in human readable format. This has proven difficult because of the ambiguity many changes create.
For example: if a type of chemical is being changed from "Class A" to "Class C", the change text that is easily generated is "Change 'A' to 'C'", which is not very useful to someone reviewing the changes. More common are document references at the end: adding SOP 3 to a list such as "SOP 1, 2" to get "SOP 1, 2, 3" generates the text "Add '3'". Clearly not useful either.
What would be most useful is a custom output for text designated as "SOP" text so that the output would be "Add reference to SOP 3".
I started with the following solution:
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare. This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
Where I am now
I am now attempting to add extra html tags before running the diff algorithm. For example, I will run the text through a "pre-processor" that will convert "SOP 1, 2" to something like <span class="references">SOP 1, 2</span>
Once the Diff procedure returns the full change text, I scan through it noting the current "class" of text and when there is a <del> or <ins> I capture the text between the tags and use a SELECT CASE block over the class to address each change.
This actually works okay for the most part, but there are many issues that I have to work through, such as Diff deciding that the shortest path is to delete certain opening tags and insert other ones. This creates a scenario where there are two <span> tags but only one </span> tag.
The ultimate question
I am looking for advice on whether to continue in the direction I have started or to try something different before investing a lot more time into a sub-optimal solution.
Thanks all in advance.
Also note:
The time for a typical run is approximately 1.5 to 2.5 seconds with me attempting more fancy things and a bunch of debug.prints. So running through an extra pass or two wouldn't be killer.

It is clear that reporting a difference in terms of the smallest change to the structures you have isn't what you want; you want to report some context.
To do that, you have to identify what context there is to report, so you can decide what part of that is interesting. You sketched an idea where you fused certain elements of your structure together (e.g., 'SOP' '1' '2' into 'SOP 1 2'), but that seems to me to be going the wrong way. What it is doing is changing the size of the smallest structure elements, not reporting better context.
While I'm not sure this is the right approach, I'd try to characterize your structures using a grammar, e.g., a BNF. For instance, some grammar rules you might have would be:
action = 'SOP' sop_item_list ;
sop_item_list = NATURAL ;
sop_item_list = sop_item_list ',' NATURAL ;
Now an actual SOP item can be characterized as an abstract syntax tree (shown with nested children, indexable by constants to get to subtrees):
t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))
You still want to compute a difference using something like the dynamic programming algorithm you have suggested, but now you want a minimal tree delta. Done right (my company makes tools that do this for grammars for conventional computer languages, and you can find publicly available tree diff algorithms), you can get a delta like:
replace t1[2] with sop_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3])
which in essence is what you got by gluing (SOP,1,2) into a single item, but without the external adhocery of you personally deciding to do that.
The real value in this, I think, is that you can use the tree t1 to report the context. In particular, you start at the root of the tree, and print summaries of the subtrees (clearly you don't want to print full subtrees, as that would just give you back the full original text).
By printing subtrees down to a depth of one or two levels and eliding anything deeper (e.g., representing a list as "..." and a single subtree as "_"), you could print something like:
replace 'SOP' ... with 'SOP',...,3
which I think is what you are looking for in your specific example.
No, this isn't an algorithm; it is a sketch of an idea. The fact that we have tree-delta algorithms that compute useful deltas, and the summarizing idea (taken from LISP debuggers, frankly), suggests this will probably generalize to something useful, or at least take you in a new direction.
Having your answer in terms of ASTs should also make it relatively easy to produce HTML as you desire. (People work with XML all the time, and XML is basically a tree representation.)
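The summarizing idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the tree is modeled as nested tuples of the form (label, child, ...), and the names `summarize` and `max_depth` are hypothetical, not taken from any tree-diff tool.

```python
# Hypothetical sketch: trees are nested tuples (label, child1, child2, ...);
# leaves are plain strings or integers.

def summarize(tree, max_depth=1):
    """Render a tree, eliding anything deeper than max_depth.

    An elided subtree with several children prints as "...",
    an elided subtree with a single child prints as "_".
    """
    if not isinstance(tree, tuple):
        return repr(tree)
    label, *children = tree
    if max_depth == 0:
        return "..." if len(children) > 1 else "_"
    parts = [summarize(child, max_depth - 1) for child in children]
    return f"{label}({', '.join(parts)})"


# t1 from the answer: SOP 1, 2
old = ("action", "SOP",
       ("sop_item_list", ("sop_item_list", 1), ",", 2))
# The same tree after adding reference 3
new = ("action", "SOP",
       ("sop_item_list",
        ("sop_item_list", ("sop_item_list", 1), ",", 2), ",", 3))

print(f"replace {summarize(old)} with {summarize(new, max_depth=2)}")
# prints: replace action('SOP', ...) with action('SOP', sop_item_list(..., ',', 3))
```

Printing the changed side one level deeper than the unchanged side is a design choice of this sketch; it keeps the old context short while still exposing the new element.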

Try thinking in terms of Prolog-style rewrite rules that transform your instructions into a canonical form that will cause the diff algorithm to produce what you need. The problem you specified would be solved by this rule:
SOP i1, i2, ... iN -> SOP j1, SOP j2, ... SOP jN where j = sorted(i)
In other words, "distribute" SOP over a sorted list of the following integers. This will trick the diff algorithm into giving a fully qualified change report "Add SOP 3."
Rules are applied by searching the input for matches of the left hand side and replacing with the corresponding right.
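A minimal sketch of that one rule, written in Python rather than VBA and using the standard difflib module in place of your own O(ND) diff. The function names `distribute_sop`, `tokenize` and `describe_changes` are made up for illustration, and the regular expression assumes references always look like "SOP" followed by comma-separated integers.

```python
import difflib
import re

SOP_LIST = re.compile(r"SOP \d+(?:\s*,\s*\d+)*")
TOKEN = re.compile(r"SOP \d+|\w+|[^\w\s]")


def distribute_sop(text):
    """Rewrite 'SOP 3, 1, 2' as 'SOP 1, SOP 2, SOP 3' (numbers sorted)."""
    def expand(match):
        numbers = sorted(int(n) for n in re.findall(r"\d+", match.group(0)))
        return ", ".join(f"SOP {n}" for n in numbers)
    return SOP_LIST.sub(expand, text)


def tokenize(text):
    """Apply the rewrite rule, then split so each 'SOP n' is one token."""
    return TOKEN.findall(distribute_sop(text))


def describe_changes(old, new):
    """Diff the canonical token lists and report whole tokens."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            report += [f"Remove {t}" for t in a[i1:i2] if t != ","]
        if tag in ("insert", "replace"):
            report += [f"Add {t}" for t in b[j1:j2] if t != ","]
    return report


print(describe_changes("See SOP 1, 2", "See SOP 3, 1, 2"))
# prints: ['Add SOP 3']
```

Because the rewrite happens before the diff, the diff itself stays generic; only the rule knows anything about SOPs.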
You are probably already doing this, but you will get more commonsense analysis if the input is tokenized: "SOP" should be considered a single "character" for the diff algorithm. Whitespace may be reduced to tokens for space and line break if they are significant, or else ignored.
You can do another kind of diff at the character level to test "fuzzy" equality of tokens to account for typographical errors when matching left-hand sides. "SIP" and "SOP" would be counted as a "match" because their edit distance is only 1 (and I and O are only one key apart on a QWERTY keyboard!).
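For the fuzzy comparison, a plain Levenshtein distance is probably enough. The sketch below is one standard way to compute it; `edit_distance`, `fuzzy_equal` and the threshold of 1 are illustrative choices, not something prescribed by the rewrite approach.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_equal(token, keyword, max_distance=1):
    """Treat a token as the keyword if it is at most max_distance edits away."""
    return edit_distance(token.upper(), keyword.upper()) <= max_distance


print(edit_distance("SIP", "SOP"))  # prints: 1
print(fuzzy_equal("SIP", "SOP"))    # prints: True
```

A threshold of 1 is deliberately tight for three-letter keywords; with anything looser, unrelated short words start to match.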
If you consider all the quirks in the output you're getting now and can rectify each one as a rewrite rule that takes the input to a form where the diff algorithm produces what you need, then what's left is to implement the rewrite system. Doing this in a general way that is efficient so that changing and adding rules does not involve a great deal of ad hoc coding is a difficult problem, but one that has been studied. It's interesting that Ira Baxter mentioned Lisp, as it excels as a tool for this sort of thing.
Ira's suggestion of syntax trees falls naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:
SOP 1 section 3, paras 2,1,3
is a hierarchy that should be rewritten as
SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3
The rewrite rules
paras i1, i2, ... iN -> para j1, para j2, ... para jN where j = sorted(i)
section K para i1, ... para iN -> section K para j1, ... section K para jN
SOP section K para i1, ... section K para iN -> SOP section K para j1, ... SOP section K para jN
when applied in three passes will produce a result like "SOP 1 section 3, para 4 was added."
While there are many strategies to implementing the rules and rewrites, including coding each one as a procedure in VB (argh...), there are other ways. Prolog is a grand attempt to do this as generally as possible. There is a free implementation. There are others. I have used TXL to get useful rewriting done. The only problem with TXL is that it assumes you have a rather strict grammar for inputs, which it doesn't sound like the case in your problem.
If you post more examples of the quirks in your current outputs, I can follow this up with more detail on rewrite rules.

In case you would decide to proceed with what you already achieved (IMO you got pretty far already), you might consider doing two steps of diff.
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare.
That's a good start; you already managed to make the context clear to the user.
This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
How about doing another diff pass across the tokens found (i.e. compare 'SOP 1, 2' with 'SOP 1, 2, 3'), this time without the word grouping, to generate additional info? That would make the full output something like this:
Change 'SOP 1, 2' to 'SOP 1, 2, 3'
Change details: Add '3'
The text is a bit cryptic, so you may want to do some refining there. I would also suggest truncating lengthy tokens in the first line ('SOP 1, 2, 3, ...'), since the second line should already provide sufficient detail.
I am not sure about the performance impact of this second pass; in a big text with many changes, you may experience many roundtrips to the diff functionality. You might optimize by accumulating changes from the first pass into one 'change document', run the second pass across it, then stitch the results together.
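A rough sketch of the two-pass idea, in Python with the standard difflib module standing in for your own diff routine. The grouping pattern and the name `two_pass_diff` are assumptions for illustration, and the sketch only reports tokens that were replaced.

```python
import difflib
import re

# First-pass tokens: a whole 'SOP 1, 2, 3' list counts as one token.
GROUPED = re.compile(r"SOP \d+(?:, \d+)*|\w+|[^\w\s]")


def two_pass_diff(old, new):
    a, b = GROUPED.findall(old), GROUPED.findall(new)
    lines = []
    first = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in first.get_opcodes():
        if tag != "replace":
            continue  # sketch: pure inserts and deletes are not reported
        before, after = " ".join(a[i1:i2]), " ".join(b[j1:j2])
        lines.append(f"Change '{before}' to '{after}'")
        # Second pass: word-level diff inside the changed tokens only.
        words_a = re.findall(r"\w+", before)
        words_b = re.findall(r"\w+", after)
        second = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
        for t, k1, k2, l1, l2 in second.get_opcodes():
            if t in ("delete", "replace"):
                removed = ", ".join(repr(w) for w in words_a[k1:k2])
                lines.append(f"Change details: Remove {removed}")
            if t in ("insert", "replace"):
                added = ", ".join(repr(w) for w in words_b[l1:l2])
                lines.append(f"Change details: Add {added}")
    return lines


for line in two_pass_diff("Use SOP 1, 2 here", "Use SOP 1, 2, 3 here"):
    print(line)
# prints:
# Change 'SOP 1, 2' to 'SOP 1, 2, 3'
# Change details: Add '3'
```

The second pass only ever sees the short changed tokens, so its cost should be small compared with the first pass over the full text.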
HTH.


Explain the difference between Docstring and Comment with an appropriate example in python? [duplicate]

I'm a bit confused over the difference between docstrings and comments in python.
In my class my teacher introduced something known as a 'design recipe', a set of steps that will supposedly help us students plot and organize our coding better in Python. From what I understand, the below is an example of the steps we follow - this so-called design recipe (the stuff in the quotations):
def term_work_mark(a0_mark, a1_mark, a2_mark, ex_mark, midterm_mark):
    ''' (float, float, float, float, float) -> float
    Takes your marks on a0_mark, a1_mark, a2_mark, ex_mark and midterm_mark,
    calculates their respective weight contributions and sums these
    contributions to deliver your overall term mark out of a maximum of 55 (This
    is because the exam mark is not taken account of in this function)
    >>> term_work_mark(5, 5, 5, 5, 5)
    11.8
    >>> term_work_mark(0, 0, 0, 0, 0)
    0.0
    '''
    a0_component = contribution(a0_mark, a0_max_mark, a0_weight)
    a1_component = contribution(a1_mark, a1_max_mark, a1_weight)
    a2_component = contribution(a2_mark, a2_max_mark, a2_weight)
    ex_component = contribution(ex_mark, exercises_max_mark, exercises_weight)
    mid_component = contribution(midterm_mark, midterm_max_mark, midterm_weight)
    return (a0_component + a1_component + a2_component + ex_component +
            mid_component)
As far as I understand this is basically a docstring, and in our version of a docstring it must include three things: a description, examples of what your function should do if you enter it in the python shell, and a 'type contract', a section that shows you what types you enter and what types the function will return.
Now this is all good and done, but our assignments require us to also have comments which explain the nature of our functions, using the '#' symbol.
So, my question is, haven't I already explained what my function will do in the description section of the docstring? What's the point of adding comments if I'll essentially be telling the reader the exact same thing?
It appears your teacher is a fan of How to Design Programs ;)
I'd tackle this as writing for two different audiences who won't always overlap.
First there are the docstrings; these are for people who are going to be using your code without needing or wanting to know how it works. Docstrings can be turned into actual documentation. Consider the official Python documentation - What's available in each library and how to use it, no implementation details (Unless they directly relate to use)
Secondly there are in-code comments; these are to explain what is going on to people (generally you!) who want to extend the code. These will not normally be turned into documentation as they are really about the code itself rather than usage. Now there are about as many opinions on what makes for good comments (or lack thereof) as there are programmers. My personal rules of thumb for adding comments are to explain:
Parts of the code that are necessarily complex. (Optimisation comes to mind)
Workarounds for code you don't have control over, that may otherwise appear illogical
I'll admit to TODOs as well, though I try to keep that to a minimum
Where I've made a choice of a simpler algorithm where a better performing (but more complex) option can go if performance in that section later becomes critical
Since you're coding in an academic setting, and it sounds like your lecturer is going for verbose, I'd say just roll with it. Use code comments to explain how you are doing what you say you are doing in the design recipe.
I believe it's worth mentioning what PEP 8 says, I mean, the pure concept.
Docstrings
Conventions for writing good documentation strings (a.k.a. "docstrings") are immortalized in PEP 257.
Write docstrings for all public modules, functions, classes, and methods. Docstrings are not necessary for non-public methods, but you should have a comment that describes what the method does. This comment should appear after the def line.
PEP 257 describes good docstring conventions. Note that most importantly, the """ that ends a multiline docstring should be on a line by itself, e.g.:
"""Return a foobang
Optional plotz says to frobnicate the bizbaz first.
"""
For one liner docstrings, please keep the closing """ on the same line.
Comments
Block comments
Generally apply to some (or all) code that follows them, and are indented to the same level as that code. Each line of a block comment starts with a # and a single space (unless it is indented text inside the comment).
Paragraphs inside a block comment are separated by a line containing a single #.
Inline Comments
Use inline comments sparingly.
An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.
Inline comments are unnecessary and in fact distracting if they state the obvious.
Don't do this:
x = x + 1  # Increment x
But sometimes, this is useful:
x = x + 1  # Compensate for border
Reference
https://www.python.org/dev/peps/pep-0008/#documentation-strings
https://www.python.org/dev/peps/pep-0008/#inline-comments
https://www.python.org/dev/peps/pep-0008/#block-comments
https://www.python.org/dev/peps/pep-0257/
First of all, for formatting your posts you can use the help options above the text area you type your post.
And about comments and doc strings, the doc string is there to explain the overall use and basic information of the methods. On the other hand comments are meant to give specific information on blocks or lines, #TODO is used to remind you what you want to do in future, definition of variables and so on. By the way, in IDLE the doc string is shown as a tool tip when you hover over the method's name.
Quoting from this page http://www.pythonforbeginners.com/basics/python-docstrings/
Python documentation strings (or docstrings) provide a convenient way
of associating documentation with Python modules, functions, classes,
and methods.
An object's docstring is defined by including a string constant as the
first statement in the object's definition.
It's specified in source code that is used, like a comment, to
document a specific segment of code.
Unlike conventional source code comments the docstring should describe
what the function does, not how.
All functions should have a docstring
This allows the program to inspect these comments at run time, for
instance as an interactive help system, or as metadata.
Docstrings can be accessed by the __doc__ attribute on objects.
Docstrings can be accessed through a program (__doc__), whereas inline comments cannot be accessed.
Interactive help systems like those in bpython and IPython can use docstrings to display the documentation during development, so that you don't have to open the source every time.
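A small illustration of that difference. The function `area` is made up for this example; the point is only that the docstring survives at run time while the comment does not.

```python
def area(width, height):
    """Return the area of a width-by-height rectangle."""
    # Comment for maintainers: plain multiplication, no rounding applied.
    return width * height


# The docstring is stored on the function object and is what help() shows.
print(area.__doc__)
# prints: Return the area of a width-by-height rectangle.

# The comment is discarded by the interpreter; it is not part of __doc__.
print("maintainers" in area.__doc__)
# prints: False
```

So the docstring addresses whoever calls `area`, and the comment addresses whoever edits it.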

Pascal binary search tree that contains linked lists

I need to search through its contents with a recursive function, so it returns a boolean response depending whether the value I read was found or not. I dunno how to make it work. Here's the type for the tree I defined:
text=string[30];
list=^nodeL;
nodeL=record
  title:text;
  ISBN:text;
  next:list;
end;
tree=^nodeT;
nodeT=record
  cod:text;
  l:list;
  LC:tree;
  RC:tree;
end;
This looks like a "please do my assignment for me" post, which I won't do. I will try and help you do the assignment yourself.
I don't know exactly what your assignment is, so I'm going to have to make some guesses.
I think your assignment is to write a recursive function that will search a tree and return a boolean response depending on whether a value (input to the function) is found or not.
I don't know how the tree gets its content. You say, you defined the tree type, so I'm guessing that means you are not provided with a tree that already has content. So, at least for testing purposes, you are going to have to write code to add content to the tree (so you can search it).
I don't know exactly what kind of tree you are supposed to create. Usually trees have rules about how the items are arranged in the tree. A common type of tree is a binary search tree, where, for each node, the items in the left subtree (if present) are "less than" the node's item and the items in the right subtree (if present) are "greater than" it. You probably need this when adding items (i.e. content) to the tree.
I think you need to change your definition of the tree node, nodeT (I could be wrong). A tree is a kind of linked list, it does not usually contain linked lists. Usually each tree node contains an item of data (not a list of items).
If I were doing in this assignment (and learning to program in Pascal) I would do the following (in this order):
Make sure I understand linked lists (at least singly linked lists). Write at least one program to add data to a linked list, and search it (do not use recursion).
Make sure I understand recursion. Read some tutorials on recursion (that do not use linked lists, or trees). For example "First Textbook Examples of Recursion". Write at least one program that uses recursion (do not use linked lists or trees).
Make sure I understand trees. Read some tutorials on trees. For example, "Binary Search Trees"
Do the assignment.
P.S. You might want to change the name of your text type from "text", because, in Pascal, "text" is the name of a predefined type, for text files.

How to detect redundant piece of code containing array?

The lecture for my Java class has this piece of code:
for (int i = 0; i < arr.length; i = i + 10) {
    if (i % 10 == 0) {
        System.out.println(arr[i]);
    }
}
If you start at 0 and then go to 10, 20, etc., why do you need the if condition? Naturally all of these numbers are divisible by 10.
It's redundant. The only way it could have an effect is when the array length is close to Integer.MAX_VALUE and you're causing overflows by adding 10, but then your code would loop infinitely anyway (or crash when accessing negative array indices).
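If you want to convince yourself, the same loop is easy to check in a few lines. The sketch below uses Python instead of Java purely because only the logic matters here; the array contents and length are arbitrary.

```python
arr = list(range(95))  # arbitrary sample data

# Same stepping as the lecture code, once with the check and once without.
with_check = [arr[i] for i in range(0, len(arr), 10) if i % 10 == 0]
without_check = [arr[i] for i in range(0, len(arr), 10)]

print(with_check == without_check)  # prints: True
print(without_check)                # prints: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Note that Python integers do not overflow, so the edge case near the maximum int value mentioned above does not show up in this sketch.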
To me the if condition might be there for 2 reasons:
It is a way to monitor the progress of the function (although since the increment of the for loop is i=i+10 instead of i++, it is less meaningful in this case). This is very common when we are using a script to execute a task that deals with a lot of data (normally in a single process, and taking some time). By printing out the progress periodically we are able to know (or estimate) how much data has been read/written, or how many times the code in the loop has been executed, in this case.
There might be more code added in the for loop, which might modify i. In this case, i%10 == 0 will be meaningful.
In other words, without any more context it does seem like the if condition is redundant in this case.
To answer the question in the title, here's what we usually do. First, have the code reviewed by someone else before you merge your branch. Having a colleague review your code is good practice, as they can give you a fresh view on correctness and code style. Second, if you find something that is suspicious but you are not sure (for example, the "redundant code" you found here), write unit tests to cover the part of the code that you would like to change, make the changes, rerun the unit tests and see if you still get what is expected.
Personally I haven't heard of any tool that is able to detect "redundant code" like the example here, as "redundant" might not be "redundant" at all under different circumstances.

How to find likelihood in NLTK

http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/comment-page-1/#comment-73511
I am trying to understand NLTK using this link. I cannot understand how the values of feature_probdist and show_most_informative_features are computed.
Especially when the word "best" does not occur, how is the likelihood computed as 0.077? I have been trying to work this out for a long time.
That is because the article explains code from NLTK's source but does not display all of it. The full code is available on NLTK's website (and is also linked from the article you referenced). These are a field and a method (respectively) of the NaiveBayesClassifier class within NLTK. This class implements a Naive Bayes classifier, which applies Bayes' theorem with a strong (naive) assumption that the features are independent of each other given the label.
feature_probdist = "P(fname=fval|label), the probability distribution for feature values, given labels. It is expressed as a dictionary whose keys are (label,fname) pairs and whose values are ProbDistI objects over feature values. I.e., P(fname=fval|label) = feature_probdist[label,fname].prob(fval). If a given (label,fname) is not a key in feature_probdist, then it is assumed that the corresponding P(fname=fval|label) is 0 for all values of fval."
most_informative_features returns "a list of the 'most informative' features used by this classifier. For the purpose of this function, the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:"
max[ P(fname=fval|label1) / P(fname=fval|label2) ]
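To make the numbers concrete, here is a plain-Python sketch that does not call NLTK at all. It rests on assumptions that you should check against the source: that the classifier was trained with NLTK's default estimator (expected likelihood estimation, which adds 0.5 to every count), that there are 5 negative and 5 positive training tweets, that no negative tweet and exactly one positive tweet contains "best", and that the feature has 3 possible values (True, False, None). The helper names `ele_prob` and `informativeness` are made up.

```python
def ele_prob(count, total, bins, gamma=0.5):
    """Expected likelihood estimate: (count + gamma) / (total + bins * gamma)."""
    return (count + gamma) / (total + bins * gamma)


def informativeness(probs_by_label):
    """Highest P(fname=fval|label) divided by the lowest, over all labels."""
    return max(probs_by_label.values()) / min(probs_by_label.values())


# Assumed counts (see the lead-in): contains(best)=True seen 0 times in
# 5 negative tweets and 1 time in 5 positive tweets, with 3 bins.
p_negative = ele_prob(count=0, total=5, bins=3)
p_positive = ele_prob(count=1, total=5, bins=3)

print(round(p_negative, 3))  # prints: 0.077
print(round(informativeness({"positive": p_positive,
                             "negative": p_negative}), 1))  # prints: 3.0
```

Under those assumptions the smoothing alone explains why an unseen word gets a small nonzero likelihood (0.5 / 6.5) instead of 0.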
Check out the source code for the entire class if this is still unclear. The article's intent was not to break down in depth how NLTK works under the hood, but rather just to give a basic idea of how to use it.

What does Backpatching mean?

What does backpatching mean? Please illustrate with a simple example.
Back patching usually refers to the process of resolving forward branches that have been planted in the code, e.g. at 'if' statements, when the value of the target becomes known, e.g. when the closing brace or matching 'else' is encountered.
In the intermediate code generation stage of a compiler we often need to generate "jump" instructions to places in the code that don't exist yet. To deal with this type of case a target label is inserted for that instruction.
A marker nonterminal in the production rule triggers a semantic action that picks up the index of the next instruction at that point, which later serves as a jump target.
Some statements like conditional statements, while, etc. will be represented as a bunch of "if" and "goto" syntax while generating the intermediate code.
The problem is that these "goto" instructions do not have a valid reference at the beginning (when the compiler starts reading the source code line by line, a.k.a. the 1st pass). But after reading the whole source code for the first time, the labels and references these "goto"s are pointing to are determined.
The question is whether we can make the compiler fill in the X in the "goto X" statements in one single pass.
The answer is yes.
If we don't use backpatching, this can be achieved by a 2-pass analysis of the source code. But backpatching lets us create and hold a separate list which is exclusively designed for "goto" statements. Since it is done in only one pass, the first pass will not fill the X in the "goto X" statements because the compiler doesn't know where the X is at first glance. But it does store the X in that exclusive list and, after going through the whole code and finding that X, the X is replaced by that address or reference.
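Since the question asked for a simple example, here is a toy sketch in Python of generating code for an if/else in a single pass. The instruction format (lists such as ["goto", target]) and the names `gen_if_else`, `emit` and `backpatch` are invented for illustration; a real compiler would keep lists of pending jumps per expression or statement rather than two local variables.

```python
def gen_if_else(condition, then_ops, else_ops):
    """Emit code for 'if condition: then_ops else: else_ops' in one pass."""
    code = []

    def emit(*instruction):
        code.append(list(instruction))
        return len(code) - 1          # index of the emitted instruction

    def backpatch(index, target):
        code[index][-1] = target      # fill in the blank jump target

    # Target unknown: we have not generated the else part yet.
    jump_to_else = emit("if_false", condition, None)
    for op in then_ops:
        emit("op", op)
    # Target unknown: we do not know where the whole statement ends.
    jump_to_end = emit("goto", None)

    backpatch(jump_to_else, len(code))   # the else part starts here
    for op in else_ops:
        emit("op", op)
    backpatch(jump_to_end, len(code))    # the first instruction after the if
    return code


for index, instruction in enumerate(gen_if_else("x > 0", ["y = 1"], ["y = 2"])):
    print(index, instruction)
# prints:
# 0 ['if_false', 'x > 0', 3]
# 1 ['op', 'y = 1']
# 2 ['goto', 4]
# 3 ['op', 'y = 2']
```

Both jumps are emitted with an empty target and patched as soon as the position they refer to is reached, so no second pass over the source is needed.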
Backpatching is the process of leaving blank entries for the goto instructions where the target address is unknown in the forward transfer in the first pass, and filling in these unknowns in the second pass.
Backpatching:
The syntax directed definition can be implemented in two or more passes (we have both synthesized attributes and inherited attributes).
Build the tree first.
Walk the tree in the depth-first order.
The main difficulty with code generation in one pass is that we may not know the target of a branch when we generate code for flow of control statements
Backpatching is the technique to get around this problem.
Generate branch instructions with empty targets
When the target is known, fill in the label of the branch instructions (backpatching).
Backpatching is a process in which the operand field of an instruction containing a forward reference is left blank initially. The address of the forward-referenced symbol is put into this field when its definition is encountered in the program.
Backpatching is the activity of filling in the unspecified information of labels by using the appropriate semantic action during the code generation process.
It is used for:
boolean expressions.
flow-of-control statements.