Ideas for automatic refactoring of modules - language-agnostic

I often have the problem that function and type definitions accumulate over time in a single module (we can assume that a module corresponds to a source file). At some point the source file is so big that it is barely maintainable anymore. For example, one realizes that the module contains a lot of functions that are logically related to a handful of different topics, and one feels that they should be in their own modules.
The idea is to have a tool that suggests ways to split up such a module. The actual creation of new source files from the old ones could then be made automatically.
I have the following:
list of functions and data types, along with their direct dependencies on other functions and types in the module.
from this, we can compute all dependencies for every item
we can do a topological sort, to make dependency groups. For example
(a, (b,c,d), e)
an item further left in the outer list does not depend on any item further right
items grouped together, like (b,c,d), depend on each other, directly or transitively
The modules must form an acyclic hierarchy, i.e. it is not possible that module A imports module B when B or one of the modules it imports already imports A. From this it follows that (b, c, d) from the example above must not be split across different modules.
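The dependency groups described above are the strongly connected components of the dependency graph. A minimal sketch of how they could be computed, using Tarjan's algorithm; the function name `dependency_groups` and the sample data are made up for illustration, and the recursive form assumes the dependency chains are shallow enough for Python's recursion limit:

```python
def dependency_groups(deps):
    """Group mutually dependent items, dependencies first (Tarjan's algorithm).

    deps maps each item to the set of items it directly depends on.
    Each returned group only depends on groups that appear earlier.
    """
    index = {}        # discovery order of each visited item
    low = {}          # lowest discovery index reachable from the item
    stack = []
    on_stack = set()
    groups = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in sorted(deps.get(v, ())):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a group: pop the whole group off the stack
            group = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                group.append(w)
                if w == v:
                    break
            groups.append(tuple(sorted(group)))

    for v in sorted(deps):
        if v not in index:
            visit(v)
    return groups


# Hypothetical module: b, c and d depend on each other in a cycle.
example = {
    "a": set(),
    "b": {"a", "c"},
    "c": {"d"},
    "d": {"b"},
    "e": {"a", "d"},
}
print(dependency_groups(example))
# → [('a',), ('b', 'c', 'd'), ('e',)]
```

Each tuple is a unit that cannot be split across modules; any module layout has to be a partition of these groups.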
Now I am somehow stuck and looking for a strategy to make sensible suggestions based on the information found so far.
Of course, one possibility would be to split according to the topologically sorted list of dependency groups. However, let us assume the list starts thus:
(a, b, c, ....)
Where c depends on b and a, b on a and a on nothing. Here we could do the following:
module ABC that defines a, b and c
module A that defines a, module BC that imports A and defines b and c
module C imports A and B and defines c, module B imports A and defines b
If we simply map dependencies between functions onto dependencies between modules, we may end up with a module hierarchy that is too complex and too fine-grained.
Somehow I must take further factors into account. Maybe some desired module size, or number of imports.
Any advice, as well as pointers to existing software doing something like this are most welcome.

It sounds like you are describing efferent and afferent couplings: http://en.wikipedia.org/wiki/Software_package_metrics
This is a language agnostic question but I know of a few tools in Java such as JDepend (http://clarkware.com/software/JDepend.html) that will compute these metrics for you to help guide future refactorings.
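As a rough illustration of how such metrics could score a candidate split, here is a minimal sketch. It counts couplings at module level (how many other modules depend on a module, and how many it depends on) and derives the instability I = Ce / (Ca + Ce). The function name `coupling_metrics` and the input format are assumptions made for this example, not the interface of JDepend or any other tool:

```python
def coupling_metrics(deps, module_of):
    """Return {module: (Ca, Ce, instability)} for a proposed module layout.

    deps maps each item to the set of items it directly depends on.
    module_of maps each item to the name of the module it is placed in.
    """
    modules = set(module_of.values())
    efferent = {m: set() for m in modules}   # modules that m depends on
    afferent = {m: set() for m in modules}   # modules that depend on m
    for item, targets in deps.items():
        src = module_of[item]
        for target in targets:
            dst = module_of[target]
            if dst != src:
                efferent[src].add(dst)
                afferent[dst].add(src)
    result = {}
    for m in modules:
        ca, ce = len(afferent[m]), len(efferent[m])
        instability = ce / (ca + ce) if ca + ce else 0.0
        result[m] = (ca, ce, instability)
    return result


# The a, b, c example from the question, split into modules A and BC.
deps = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
layout = {"a": "A", "b": "BC", "c": "BC"}
print(coupling_metrics(deps, layout))
```

Comparing these numbers across several candidate layouts is one way to rank suggestions, for example by penalizing layouts with many imports per module.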

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3 x -OH, 2 x -CH2 and 1 x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?
For most structures there is no ready-made option to find the fragments. However, the rdkit.Chem.Fragments module can give you the number of occurrences of a fragment, especially when it is a functional group. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
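from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles("OCC(O)CO")  # glycerol, used here as an example input
fr_Al_OH(mol)  # number of aliphatic -OH groups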
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, and some of them will be useful for your task. For groups without a pre-written function, you can always read the source code of these rdkit modules, see how they are implemented, and write your own along the same lines. But as you already mentioned, the way to do that is to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches as you proposed yourself.
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.
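For the glycerol example from the question, counting groups via SMARTS patterns and GetSubstructMatches could look like the following minimal sketch. The three SMARTS strings and the helper name `count_groups` are illustrative assumptions, not a vetted group definition such as the UNIFAC list mentioned above; with a larger pattern set, matches can overlap, which is exactly the difficulty the answer describes.

```python
from rdkit import Chem

# Illustrative patterns only; a real model needs a complete, non-overlapping set.
GROUP_SMARTS = {
    "OH": "[OX2H]",     # hydroxyl oxygen
    "CH2": "[CX4H2]",   # sp3 carbon carrying two hydrogens
    "CH": "[CX4H1]",    # sp3 carbon carrying one hydrogen
}


def count_groups(smiles, group_smarts=GROUP_SMARTS):
    """Count how often each SMARTS pattern matches in the given molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    counts = {}
    for name, smarts in group_smarts.items():
        pattern = Chem.MolFromSmarts(smarts)
        counts[name] = len(mol.GetSubstructMatches(pattern))
    return counts


print(count_groups("OCC(O)CO"))  # glycerol
# → {'OH': 3, 'CH2': 2, 'CH': 1}
```

The resulting dictionary can be turned into a fixed-length input vector for the neural network by always iterating over the groups in the same order.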

Why, when querying an ontology, do we have to load it and also provide its namespace?

I wonder why we have to load an ontology and also provide its namespace when querying it. Why is loading the ontology not enough?
To understand my question better, here is a sample code:
g = rdflib.Graph()
g.parse('ppp.owl', format='turtle')
ppp = rdflib.Namespace('http://purl.org/xxx/ont/ppp/')
g.bind('ppp', ppp)
In line 2, we have opened the ontology (ppp.owl), but in line 3 we also provided its namespace. Does namespace show the program how to handle the ontology?
Cheers,
RF
To identify an element on the semantic web you need its URI (Uniform Resource Identifier), which is composed of the namespace and the local name. For example, consider Person, an RDF class: how would you differentiate the DBpedia class http://dbpedia.org/ontology/Person from a Person class in some other ontology? You need the namespace http://dbpedia.org/ontology/ and the local name Person, which together uniquely identify the class.
Now, coming back to your specific question: when you query the ontology, you may use multiple namespaces, some of which are not the namespace of your own ontology. You need other namespaces to query your own ontology, e.g. rdf, rdfs, and owl. For example, you can rarely write a query without the rdf:type property, which lives in the rdf namespace <http://www.w3.org/1999/02/22-rdf-syntax-ns#>, not in your ontology's namespace. As a consequence, you need to specify the namespace.
Now that you know why namespaces are needed, we can proceed. Why repeat the whole namespace string each time it is needed? A prefix such as rdf: is nothing more than a short name for the namespace string, which is prepended to the local name in the query so that you do not have to write out the full URI. Compare <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> with rdf:type.
Edit
As @AKSW says, in conclusion: there is no need to declare a namespace in order to work with the ontology, but it is more convenient when you often work with resources whose URIs share a particular namespace.
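To make the "just a prefix string" point concrete, here is a minimal sketch in plain Python of what prefix expansion amounts to. The `expand` helper and the `PREFIXES` table are hypothetical and not part of rdflib; the local name Person under the ppp namespace is likewise only an assumed example.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ppp": "http://purl.org/xxx/ont/ppp/",
}


def expand(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI by string concatenation."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {qname!r}")
    return prefixes[prefix] + local


print(expand("rdf:type"))
# → http://www.w3.org/1999/02/22-rdf-syntax-ns#type
print(expand("ppp:Person"))
# → http://purl.org/xxx/ont/ppp/Person
```

Binding a namespace in a library plays the role of adding an entry to such a table: the loaded graph already contains the full URIs, and the prefix only saves typing.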

Common Lisp a Lisp-n?

I'm aware that Common Lisp has different binding environments for functions and variables, but I believe that it also has another binding environment for tagbody labels. Are there even more binding environments than this? If so, then is it fair to categorize Common Lisp as a Lisp-2?
These questions are not meant as pedantry or bike-shedding; I only want to gain a better understanding of Common Lisp and hopefully get some pointers on where to dig deeper into its spec.
I'm aware that Common Lisp has different binding environments for
functions and variables,
That would be namespaces, according to the HyperSpec:
namespace n. 1. bindings whose denotations are restricted to a
particular kind. "The bindings of names to tags is the tag
namespace." 2. any mapping whose domain is a set of names. "A
package defines a namespace."
(Point 1.)
but I believe that it also has another binding environment for tagbody
labels. Are there even more binding environments than this?
Yes, there are more namespaces. I even remember a little snippet exposing most of them, but unfortunately, I can't find it anymore¹. It at least exposed variable, function, tag, and block namespaces, but maybe also types and declarations were included. There is also another SO answer that lists these namespaces.
If so, then is it fair to categorize Common Lisp as a Lisp-2?
In the comments to the above linked answer, Rainer Joswig agrees that the "general debate is about Lisp-1 against Lisp-n".
The "2" might be due to the relative importance of the distinction between value and function slots, or because the objects of the other namespaces aren't first-class objects. For example in the Gabriel/Pitman paper referenced in the other answer:
There is really a larger number of namespaces than just the two that
are discussed here. As we noted earlier, other namespaces include at
least those of blocks and tags; type names and declaration names are
often considered namespaces. Thus, the names Lisp1 and Lisp2, which we
have been using are misleading. The names Lisp5 and Lisp6 might be
more appropriate.
and:
In this paper, there are two namespaces of concern, which we
shall term the "value namespace" and the "function namespace." Other
namespaces include tag names (used by TAGBODY and GO) and block names
(used by BLOCK and RETURN-FROM), but the objects in the location parts
of their bindings are not first-class Lisp objects.
¹) PAIP, p. 837:
(defun f (f)
  (block f
    (tagbody
      f (catch 'f
          (if (typep f 'f)
              (throw 'f (go f)))
          (funcall #'f (get (symbol-value 'f) 'f))))))
In PAIP, Peter Norvig says "Common Lisp has at least seven name spaces" (p. 836).
The seven he lists are:
functions and macros
variables
special variables
data types
label for go statements within a tagbody
a block name for return-from statements within a block
symbols inside a quoted expression
Peter Seibel makes a great point in his comp.lang.lisp post about "compiler" versus "library" namespaces. I think all of Norvig's seven namespaces are "compiler" namespaces.
See for example this old discussion post from comp.lang.lisp:
http://coding.derkeiler.com/Archive/Lisp/comp.lang.lisp/2004-04/0737.html
Yes - http://www.lispworks.com/documentation/lw51/CLHS/Body/t_symbol.htm#symbol specifies a separate value cell and function cell, consonant with a lisp-2.
There is also a property list, but as there is no context in which a symbol "naturally" refers to its property list, it is not usual to describe CL as a lisp-3 (in fact, I am not aware of any language usually so designated).

Equivalent function of R's "%in%" for Stata

Is there an equivalent function of "%in%" from R for Stata?
As already mentioned, it's hard to tell what you need from the question. inlist() might work, or it might not depending on the setting.
I find that Stata's macro list functions are invaluable. Store your list in a macro (local or global) and then a suite of useful commands is available:
local list a b c d d e
local search c
local search_in_list : list search in list
di `search_in_list'
These can be calculated on the fly:
if `: list search in list' {
    * actions if true
}
Stata does not offer the same flexible tool, but inlist will cover the basic operation that you might be looking for, as in count if inlist(country,"FR","US","DE").
Working with lists proper is one way. You could also just treat the right-hand side as a string and the left-hand side as a regular expression, and use regexm().

Is there a programming language with no control structures or operators?

Like Smalltalk or Lisp?
EDIT
Where control structures are like:
Java                  Python
if( condition ) {     if cond:
    doSomething           doSomething
}
Or
Java                  Python
while( true ) {       while True:
    print("Hello");       print "Hello"
}
And operators
Java, Python
1 + 2 // + operator
2 * 5 // * op
In Smalltalk ( if I'm correct ) that would be:
condition ifTrue:[
doSomething
]
True whileTrue:[
"Hello" print
]
1 + 2 // + is a method of 1 and the parameter is 2 like 1.add(2)
2 * 5 // same thing
how come you've never heard of lisp before?
You mean without special syntax for achieving the same?
Lots of languages have control structures and operators that are "really" some form of message passing or function call system that can be redefined. Most "pure" object languages and pure functional languages fit the bill. But they are all still going to have your "+" and some form of code block, including Smalltalk, so your question is a little misleading.
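The message-passing idea can be imitated in Python, as a rough sketch. The classes `TrueObject` and `FalseObject` below are hypothetical stand-ins for Smalltalk's two boolean classes: the branch is chosen by method dispatch, and no if statement is involved. The last lines show that Python's own + is likewise a method call underneath.

```python
class TrueObject:
    """Stand-in for Smalltalk's true: runs the block it is handed."""

    def if_true(self, block):
        return block()

    def if_true_if_false(self, then_block, else_block):
        return then_block()


class FalseObject:
    """Stand-in for Smalltalk's false: ignores the 'true' block."""

    def if_true(self, block):
        return None

    def if_true_if_false(self, then_block, else_block):
        return else_block()


condition = TrueObject()
print(condition.if_true_if_false(lambda: "then branch", lambda: "else branch"))
# → then branch

# The + operator is sugar for a method call on the left operand.
print((1).__add__(2))
# → 3
```

The code blocks (here lambdas) and the method names are still there, which is the point the answer makes: the constructs move into the library, but they do not disappear.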
Assembly
Befunge
Prolog*
*I cannot be held accountable for any frustration and/or headaches caused by trying to get your head around this technology, nor am I liable for any damages caused by you due to aforementioned conditions including, but not limited to, broken keyboard, punched-in screen and/or head-shaped dents in your desk.
Pure lambda calculus? Here's the grammar for the entire language:
e ::= x | e1 e2 | \x . e
All you have are variables, function application, and function creation. It's equivalent in power to a Turing machine. There are well-known codings (typically "Church encodings") for such constructs as
If-then-else
while-do
recursion
and such datatypes as
Booleans
integers
records
lists, trees, and other recursive types
Coding in lambda calculus can be a lot of fun—our students will do it in the undergraduate languages course next spring.
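A few of these Church encodings can be written down with nothing but Python lambdas, as a minimal sketch. The names `TRUE`, `IF`, `SUCC` and so on are just labels chosen here, and the helpers `to_int` and `to_bool` step outside the lambda calculus only to make the results printable. Because Python evaluates arguments eagerly, both branches of `IF` are computed before one is selected.

```python
# Church booleans: a boolean is a function that picks one of two arguments.
TRUE = lambda t: lambda f: t
FALSE = lambda t: lambda f: f
IF = lambda cond: lambda then: lambda other: cond(then)(other)

# Church numerals: the numeral n applies a function n times.
ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
ADD = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
IS_ZERO = lambda n: n(lambda _: FALSE)(TRUE)


def to_int(n):
    """Decode a Church numeral into a Python int (for inspection only)."""
    return n(lambda k: k + 1)(0)


def to_bool(b):
    """Decode a Church boolean into a Python bool (for inspection only)."""
    return b(True)(False)


ONE = SUCC(ZERO)
TWO = SUCC(ONE)
THREE = ADD(ONE)(TWO)

print(to_int(THREE))                 # → 3
print(to_bool(IS_ZERO(ZERO)))        # → True
print(to_bool(IS_ZERO(TWO)))         # → False
print(IF(TRUE)("then")("else"))      # → then
```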
Forth may qualify, depending on exactly what you mean by "no control structures or operators". Forth may appear to have them, but really they are all just symbols, and the "control structures" and "operators" can be defined (or redefined) by the programmer.
What about Logo or more specifically, Turtle Graphics? I'm sure we all remember that, PEN UP, PEN DOWN, FORWARD 10, etc.
The SMITH programming language:
http://esolangs.org/wiki/SMITH
http://catseye.tc/projects/smith/
It has no jumps and is Turing complete. I've also made a Haskell interpreter for this bad boy a few years back.
I'll be first to mention brain**** then.
In Tcl, there's no control structures; there's just commands and they can all be redefined. Every last one. There's also no operators. Well, except for in expressions, but that's really just an imported foreign syntax that isn't part of the language itself. (We can also import full C or Fortran or just about anything else.)
How about FRACTRAN?
FRACTRAN is a Turing-complete esoteric programming language invented by the mathematician John Conway. A FRACTRAN program is an ordered list of positive fractions together with an initial positive integer input n. The program is run by updating the integer (n) as follows:
1. for the first fraction f in the list for which nf is an integer, replace n by nf
2. repeat this rule until no fraction in the list produces an integer when multiplied by n, then halt.
Of course there is an implicit control structure in rule 2.
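The two rules fit in a few lines, which makes the implicit loop easy to see. A minimal interpreter sketch follows; the function name `run_fractran` and the `max_steps` safety limit are additions for this example. The one-fraction program 3/2 is the well-known adder: started on 2^a * 3^b it halts on 3^(a+b).

```python
from fractions import Fraction


def run_fractran(program, n, max_steps=10_000):
    """Run a FRACTRAN program (a list of Fractions) on the integer n."""
    for _ in range(max_steps):
        for f in program:
            product = n * f
            if product.denominator == 1:
                # Rule 1: first fraction that gives an integer wins.
                n = int(product)
                break
        else:
            # Rule 2: no fraction produced an integer, so halt.
            return n
    raise RuntimeError("step limit reached before the program halted")


adder = [Fraction(3, 2)]
print(run_fractran(adder, 2**3 * 3**2))
# → 243
```

The result 243 is 3^5, i.e. 3 + 2 encoded in the exponent of 3.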
D (used in DTrace)?
APT - (Automatic Programmed Tool) used extensively for programming NC machine tools.
The language also has no IO capabilities.
XSLT (or XSL, some say) has control structures like if and for, but you should generally avoid them and deal with everything by writing rules with the correct level of specificity. So the control structures are there, but are implied by the default thing the translation engine does: apply potentially-recursive rules.
For and if (and some others) do exist, but in many many situations you can and should work around them.
How about Whenever?
Programs consist of a "to-do list": a series of statements which are executed in random order. Each statement can contain a prerequisite, which if not fulfilled causes the statement to be deferred until some (random) later time.
I'm not entirely clear on the concept, but I think PostScript meets the criteria, although it calls all of its functions operators (the way LISP calls all of its operators functions).
Makefile syntax doesn't seem to have any operators or control structures. I'd say it's a programming language but it isn't Turing Complete (without extensions to the POSIX standard anyway)
So... you're looking for a super-simple language? How about Batch programming? If you have any version of Windows, then you have access to a Batch interpreter. It's also more useful than you'd think, since you can carry out basic file functions (copy, rename, make directory, delete file, etc.)
http://www.csulb.edu/~murdock/dosindex.html
Example
Open notepad and make a .Bat file on your Windows box.
Open the .Bat file with notepad
In the first line, type "echo off"
In the second line, type "echo hello world"
In the third line, type "pause"
Save and run the file.
If you're looking for a way to learn some very basic programming, this is a good way to start. (Just be careful with the Delete and Format commands. Don't experiment with those.)