I need a basic concrete example on how to use TDD along with Design by Contract - language-agnostic

I have seen many questions like this and this. Some people see an overlap between TDD and Design by Contract, while others say they are complementary. I lean toward the second view, so I need a very basic, correct, and complete example, in any language or even in pseudocode, of how to use them together.

This is a slightly tricky question, because both "test driven development" (TDD) and "design by contract" (DbC) imply something about your development process (generally, that the tests/contracts are written before the code).
Since you're asking about code examples, though, you are more interested in what it would look like to use tests and contracts together. Here is an example:
from typing import List

def sort_numbers(nums: List[int]) -> List[int]:
    '''
    Tests:
    >>> sort_numbers([4, 1, 2])
    [1, 2, 4]
    >>> sort_numbers([])
    []
    Contracts:
    post: len(__return__) == len(nums)
    post: __return__[0] <= __return__[-1]
    '''
    return sorted(nums)
Tests
We use tests to check how specific inputs affect the output. For example, sorting the numbers [4, 1, 2] produces the list [1, 2, 4]. Furthermore, sorting the empty list produces the empty list.
(these tests are written using doctest and can be checked with python -m doctest <file>)
Contracts
We use contracts to ensure that some properties hold, no matter what the inputs are. In this example, we assert that:
The returned list has the same length as the input list.
The first item returned is always less than or equal to the last item returned.
(these contracts are written in PEP-316 syntax and can be checked with CrossHair)
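To round out the picture, here is a hedged sketch (my own illustration, not part of the doctest/CrossHair setup above) of the same contracts enforced at runtime with plain assertions, so that every TDD test run also exercises the contracts. The empty-list guard is an addition so the postcondition is well-defined for all inputs.

```python
# A minimal sketch of runtime contract checking with plain assertions,
# as an alternative to static checking with CrossHair.
from typing import List

def sort_numbers(nums: List[int]) -> List[int]:
    result = sorted(nums)
    # Postconditions, checked on every call (including during tests):
    assert len(result) == len(nums)
    assert not result or result[0] <= result[-1]
    return result
```

The tests still drive the design with specific examples; the assertions then check the properties on every input the function ever sees.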


How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3 x -OH, 2 x -CH2, and 1 x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?
For most structures, there's no existing option to find the fragments. However, there's a module in rdkit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH

mol = Chem.MolFromSmiles('OCC(O)CO')  # e.g. glycerol
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more such functions available. Some of them will be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement them for your features. But as you already mentioned, the way to do that would be to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes, or you could specify the groups as SMARTS patterns and look for them using GetSubstructMatches, as you proposed yourself.
Dissecting a molecule into specific groups is not as straightforward as it might appear at first glance. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.

jq: groupby and nested json arrays

Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are/are not non-empty, i.e. I want to get: [3,2]
The number of nested lists which do/do not contain the number 3, i.e. I want to get: [1,4]
The number of nested lists for which the sum of the elements is/isn't less than 4, i.e. I want to get: [3,2]
i.e. basic examples of nested data partitioning.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists
"which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
QED
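For readers more comfortable in Python, the same tabulate-then-lookup idea can be sketched with collections.Counter standing in for the jq reduce (an illustrative analogue on my part, not part of the jq answer):

```python
# Count how many nested lists are empty vs non-empty, mirroring
# tabulate(.[] | length==0 | tostring) | [.["true","false"]].
from collections import Counter

data = [[1, 2], [3, 9], [4, 2], [], []]
counts = Counter(str(len(xs) == 0).lower() for xs in data)
result = [counts["true"], counts["false"]]  # [2, 3]
```

As in the jq version, the single counting pass replaces the sort that group_by would perform.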
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.

how to predict topics for a batch of documents with mallet

I am using mallet from a scala project. After training the topic model and getting the inferencer file, I tried to assign topics to new texts. The problem is that I get different results with different calling methods. Here are the things I tried:
Creating a new InstanceList, ingesting just one document, and getting the topic results from the InstanceList:
somecontentList.map(text => getTopics(text, model))

def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
  val testing = new InstanceList(pipe)
  testing.addThruPipe(new Instance(text, null, "test instance", null))
  inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
}
Putting everything in an InstanceList and predicting topics together:
val testing = new InstanceList(pipe)
somecontentList.foreach(text =>
  testing.addThruPipe(new Instance(text, null, "test instance", null))
)
(0 until testing.size).map(i =>
  ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))
These two methods produce very different results except for the first instance. What is the right way of using the inferencer?
Additional information:
I checked the instance data.
0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)
I assume the number in parentheses is the index of the words used in prediction. When I put all the text into the InstanceList, the indices are different because the collection has more text. I am not sure how exactly that information is considered in the model prediction process.
Remember that the new instances must be imported with the pipe from the original training data, as recorded in the Inferencer, in order for the alphabets to match. It's not clear where pipe is coming from in the scala code, but the fact that the first seven words appear to have ids starting from 0 suggests that a new alphabet is being created.
I found a similar issue too, although with the R plugin. We ended up calling the Inferencer for each row/document separately.
However, there will be some differences in the inferences when you call it for the same row, because of the stochasticity in the sampling. That said, I agree that the differences should be small.

Context sensitive diff implementation

Summary and basic question
Using MS Access 2010 and VBA (sigh..)
I am attempting to implement a specialized Diff function that is capable of outputting a list of changes in different ways depending on what has changed. I need to be able to generate a concise list of changes to submit for our records.
I would like to use something such as html tags like <span class="references">These are references 1, 6</span> so that I can review the changes with code and customize how the change text is outputted. Or anything else to accomplish my task.
I see this as a way to provide an extensible way to customize the output, and possibly move things into a more robust platform and actually use html/css.
Does anyone know of a similar project that may be able to point me in the right direction?
My task
I have an access database with tables of work operation instructions - typically 200-300 operations, many of which are changing from one revision to another. I have currently implemented a function that iterates through tables, finds instructions that have changed and compares them.
Note that each operation instruction is typically a couple sentences with a couple lines at the end with some document references.
My algorithm is based on "An O(ND) Difference Algorithm and Its Variations" and it works great.
Access supports "Rich" text, which is just glorified simple html, so I can easily generate the full text with formatted additions and deletions, i.e. adding tags like <font color = "red"><strong><i>This text has been removed</i></strong></font>. The main output from the Diff procedure is the full text of the operation, including non-changed, deleted, and inserted text inline with each other. The diff procedure adds <del> and <ins> tags that are later replaced with the formatting text (the result is similar to the view of changes in stack exchange edits).
However, like I said, I need the changes listed in human readable format. This has proven difficult because of the ambiguity many changes create.
For example: if a type of chemical is being changed from "Class A" to "Class C", the change text that is easily generated is "Change 'A' to 'C'", which is not very useful to someone reviewing the changes. More common are document references at the end: adding SOP 3 to a list such as "SOP 1, 2, 3" generates the text "Add '3'". Clearly not useful either.
What would be most useful is a custom output for text designated as "SOP" text so that the output would be "Add reference to SOP 3".
I started with the following solution:
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare. This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
Where I am now
I am now attempting to add extra html tags before running the diff algorithm. For example, I will run the text through a "pre-processor" that will convert "SOP 1, 2" to something like <span class="sop">SOP 1, 2</span>.
Once the Diff procedure returns the full change text, I scan through it noting the current "class" of text and when there is a <del> or <ins> I capture the text between the tags and use a SELECT CASE block over the class to address each change.
This actually works okay for the most part, but there are many issues that I have to work through, such as Diff deciding that the shortest path is to delete certain opening tags and insert other ones. This creates a scenario where there are two <span> tags but only one </span> tag.
The ultimate question
I am looking for advice on whether to continue in the direction I have started or to try something different before investing a lot more time into a sub-optimal solution.
Thanks all in advance.
Also note:
The time for a typical run is approximately 1.5 to 2.5 seconds, with me attempting more fancy things and a bunch of Debug.Prints. So running through an extra pass or two wouldn't be a killer.
It is clear that reporting a difference in terms of the smallest change to the structures you have isn't what you want; you want to report some context.
To do that, you have to identify what context there is to report, so you can decide what part of that is interesting. You sketched an idea where you fused certain elements of your structure together (e.g., 'SOP' '1' '2' into 'SOP 1 2'), but that seems to me to be going the wrong way. What it is doing is changing the size of the smallest structure elements, not reporting better context.
While I'm not sure this is the right approach, I'd try to characterize your structures using a grammar, e.g., a BNF. For instance, some grammar rules you might have would be:
action = 'SOP' sop_item_list ;
sop_item_list = NATURAL ;
sop_item_list = sop_item_list ',' NATURAL ;
Now an actual SOP item can be characterized as an abstract syntax tree (shown with nested children, indexable by constants to get to subtrees):
t1: action('SOP',sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]))
You still want to compute a difference using something like the dynamic programming algorithm you have suggested, but now you want a minimal tree delta. Done right (my company makes tools that do this for grammars for conventional computer languages, and you can find publicly available tree diff algorithms), you can get a delta like:
replace t1[2] with sop_item_list(sop_item_list(sop_item_list(NATURAL[1]),',',NATURAL[2]),',',NATURAL[3])
which in essence is what you got by gluing (SOP,1,2) into a single item, but without the external adhocery of you personally deciding to do that.
The real value in this, I think, is that you can use the tree t1 to report the context. In particular, you start at the root of the tree and print summaries of the subtrees (clearly you don't want to print full subtrees, as that would just give you back the full original text).
By printing subtrees down to a depth of one or two levels and eliding anything deeper (e.g., representing a list as "..." and a single subtree as "_"), you could print something like:
replace 'SOP' ... with 'SOP',...,3
which I think is what you are looking for in your specific example.
No, this isn't an algorithm; it is a sketch of an idea. The fact that we have tree-delta algorithms that compute useful deltas, and the summarizing idea (taken from LISP debuggers, frankly), suggests this will probably generalize to something useful, or at least take you in a new direction.
Having your answer in terms of ASTs should also make it relatively easy to produce HTML as you desire. (People work with XML all the time, and XML is basically a tree representation.)
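To make the sketch concrete, here is a toy version of the tree-delta idea in Python, with nested tuples standing in for ASTs. The node labels and the path-based delta format are my own illustrative assumptions; a real tree-diff algorithm would also detect insertions, deletions, and moves rather than always reporting whole-subtree replacements.

```python
def tree_diff(a, b, path=()):
    """Return a list of (path, old_subtree, new_subtree) replacements."""
    if a == b:
        return []
    # Same node label and arity: recurse into the children.
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] and len(a) == len(b)):
        deltas = []
        for i, (ca, cb) in enumerate(zip(a[1:], b[1:]), start=1):
            deltas.extend(tree_diff(ca, cb, path + (i,)))
        return deltas
    # Otherwise report a whole-subtree replacement at this path.
    return [(path, a, b)]

t1 = ('action', 'SOP', ('sop_item_list', 1, 2))
t2 = ('action', 'SOP', ('sop_item_list', 1, 2, 3))
delta = tree_diff(t1, t2)
# delta reports one replacement at path (2,), i.e. replace t1[2] with t2[2]
```

The reported delta corresponds to "replace t1[2] with the new sop_item_list", which is what gluing (SOP,1,2) into one token achieved, but derived from the tree structure.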
Try thinking in terms of Prolog-style rewrite rules that transform your instructions into a canonical form that will cause the diff algorithm to produce what you need. The problem you specified would be solved by this rule:
SOP i1, i2, ... iN -> SOP j1, SOP j2, ... SOP jN where j = sorted(i)
In other words, "distribute" SOP over a sorted list of the following integers. This will trick the diff algorithm into giving a fully qualified change report "Add SOP 3."
Rules are applied by searching the input for matches of the left hand side and replacing with the corresponding right.
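As a sketch of how one such rule might be implemented (the regex, function name, and reference format "SOP 1, 2, 3" are my own assumptions; a real rewrite system is far more general than a single regex pass):

```python
# Distribute 'SOP' over the sorted list of following integers, so the
# diff algorithm sees one self-describing token per reference.
import re

def distribute_sop(text):
    def expand(match):
        nums = sorted(int(n) for n in re.findall(r'\d+', match.group(0)))
        return ', '.join(f'SOP {n}' for n in nums)
    # Match 'SOP' followed by a comma-separated list of integers.
    return re.sub(r'SOP\s+\d+(?:\s*,\s*\d+)*', expand, text)
```

After this pass, diffing "SOP 1, SOP 2" against "SOP 1, SOP 2, SOP 3" yields a clean insertion of the token "SOP 3", i.e. the fully qualified report "Add SOP 3".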
You are probably already doing this, but you will get more commonsense analysis if the input is tokenized: "SOP" should be considered a single "character" for the diff algorithm. Whitespace may be reduced to tokens for space and line break if they are significant, or else ignored.
You can do another kind of diff at the character level to test "fuzzy" equality of tokens, to account for typographical errors when matching left-hand sides. "SIP" and "SOP" would be counted as a "match" because their edit distance is only 1 (and I and O are only one key apart on a QWERTY keyboard!).
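A sketch of that fuzzy token equality, using a standard dynamic-programming edit (Levenshtein) distance; the threshold of 1 is an assumption you would tune:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def tokens_match(a, b, threshold=1):
    """Treat tokens as equal if within `threshold` edits of each other."""
    return edit_distance(a, b) <= threshold
```

A weighted variant could also charge less for substitutions of adjacent keyboard keys, as the I/O example suggests.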
If you consider all the quirks in the output you're getting now, and can rectify each one as a rewrite rule that takes the input to a form where the diff algorithm produces what you need, then what's left is to implement the rewrite system. Doing this in a general way that is efficient, so that changing and adding rules does not involve a great deal of ad hoc coding, is a difficult problem, but one that has been studied. It's interesting that @Ira Baxter mentioned Lisp, as it excels as a tool for this sort of thing.
Ira's suggestion of syntax trees falls naturally into the rewrite rule method. For example, suppose SOPs have sections and paragraphs:
SOP 1 section 3, paras 2,1,3
is a hierarchy that should be rewritten as
SOP 1 section 3 para 1, SOP 1 section 3 para 2, SOP 1 section 3 para 3
The rewrite rules
paras i1, i2, ... iN -> para j1, para j2, ... para jN where j = sorted(i)
section K para i1, ... para iN -> section K para i1, ... section K para iN
SOP section K para i1, ... section K para iN -> SOP section K para i1, ... SOP section K para iN
when applied in three passes will produce a result like "SOP 1 section 3, para 4 was added."
While there are many strategies for implementing the rules and rewrites, including coding each one as a procedure in VB (argh...), there are other ways. Prolog is a grand attempt to do this as generally as possible. There is a free implementation. There are others. I have used TXL to get useful rewriting done. The only problem with TXL is that it assumes you have a rather strict grammar for the inputs, which doesn't sound like the case in your problem.
If you post more examples of the quirks in your current outputs, I can follow this up with more detail on rewrite rules.
In case you decide to proceed with what you have already achieved (IMO you got pretty far already), you might consider doing two diff passes.
Group words together, e.g. treat text such as "SOP 1, 2, 3" as one token to compare.
That's a good start; you already managed to make the context clear to the user.
This generates the text "Change 'SOP 1, 2' to 'SOP 1, 2, 3'". This gets cluttered when there is a large list and you are attempting to determine what actually changed.
How about doing another diff pass across the tokens found (i.e. compare 'SOP 1, 2' with 'SOP 1, 2, 3'), this time without the word grouping, to generate additional info? That would make the full output something like this:
Change 'SOP 1, 2' to 'SOP 1, 2, 3'
Change details: Add '3'
The text is a bit cryptic, so you may want to do some refining there. I would also suggest to truncate lengthy tokens in the first line ('SOP 1, 2, 3, ...'), since the second line should already provide sufficient detail.
I am not sure about the performance impact of this second pass; in a big text with many changes, you may experience many roundtrips to the diff functionality. You might optimize by accumulating changes from the first pass into one 'change document', run the second pass across it, then stitch the results together.
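The two-pass idea can be sketched as follows, using Python's standard difflib as a stand-in for the O(ND) diff (the tokenization and output format are my own illustrative assumptions):

```python
# Pass 1 compares grouped tokens (e.g. "SOP 1, 2" is one token);
# pass 2 re-diffs each changed token pair at the word level to
# produce the "Change details" line.
import difflib
import re

def words(s):
    return re.findall(r"\w+|[^\w\s]", s)

def two_pass(old_tokens, new_tokens):
    report = []
    outer = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in outer.get_opcodes():
        if op != 'replace':
            continue
        for old, new in zip(old_tokens[i1:i2], new_tokens[j1:j2]):
            report.append(f"Change '{old}' to '{new}'")
            inner = difflib.SequenceMatcher(a=words(old), b=words(new))
            for op2, a1, a2, b1, b2 in inner.get_opcodes():
                if op2 == 'insert':
                    added = ' '.join(words(new)[b1:b2])
                    report.append(f"  Change details: Add '{added}'")
    return report
```

For example, two_pass(["SOP 1, 2"], ["SOP 1, 2, 3"]) yields the change line followed by the detail line for the added reference.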
HTH.

How to find likelihood in NLTK

http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/comment-page-1/#comment-73511
I am trying to understand NLTK using this link. I cannot understand how the values of feature_probdist and show_most_informative_features are computed.
In particular, when the word "best" does not occur, how is the likelihood computed as 0.077? I have been trying to figure this out for a long time.
That is because the article is explaining code from NLTK's source code without displaying all of it. The full code is available on NLTK's website (and is also linked to in the article you referenced). These are a field and a method (respectively) of the NaiveBayesClassifier class within NLTK. This class is of course using a Naive Bayes classifier, which is essentially an application of Bayes' theorem with a strong (naive) assumption that each feature is independent.
feature_probdist = "P(fname=fval|label), the probability distribution for feature values, given labels. It is expressed as a dictionary whose keys are (label,fname) pairs and whose values are ProbDistIs over feature values. I.e., P(fname=fval|label) = feature_probdist[label,fname].prob(fval). If a given (label,fname) is not a key in feature_probdist, then it is assumed that the corresponding P(fname=fval|label) is 0 for all values of fval."
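As for where a small nonzero likelihood like 0.077 can come from even when a word never occurs with a label: NLTK's NaiveBayesClassifier.train smooths the per-label frequency counts, by default with ELEProbDist (expected likelihood estimation, i.e. add-0.5 smoothing). A sketch of the idea, with made-up counts rather than the tutorial's actual data:

```python
# ELE smoothing adds 0.5 to every bin's count before normalizing,
# so an unseen feature value still gets a small nonzero probability.
def ele_prob(count, total, bins):
    """Smoothed P(value) = (count + 0.5) / (total + 0.5 * bins)."""
    return (count + 0.5) / (total + 0.5 * bins)

# E.g. a word observed 0 times among 6 training examples of a label,
# with 2 possible feature values (True/False):
p = ele_prob(0, 6, 2)   # 0.5 / 7, roughly 0.071
```

The exact value 0.077 in the question depends on the tutorial's particular counts, but it arises from this kind of smoothing rather than from a raw (zero) frequency.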
most_informative_features returns "a list of the 'most informative' features used by this classifier. For the purpose of this function, the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:"
max[ P(fname=fval|label1) / P(fname=fval|label2) ]
Check out the source code for the entire class if this is still unclear; the article's intent was not to break down how NLTK works under the hood in depth, but rather just to give a basic idea of how to use it.