I remember seeing somewhere a dictionary of idiomatic word pairs for use in programming.
Like get-set, open-close, allocate-free and so on.
Does anyone remember a URL?
Building on ergosys' answer:
From Code Complete 2, Chapter 11, p. 264:
Common Opposites in Variable Names
begin/end
first/last
locked/unlocked
min/max
next/previous
old/new
opened/closed
visible/invisible
source/target
source/destination
up/down
There are two short lists of such pairs in Code Complete, one for function names, one for variable names. You can search for "Use Opposites Precisely" using Amazon's Look Inside feature if you don't have the book.
I've never seen a list generally aimed at programming; however, PowerShell has such a list: cmdlet verbs. Pairings are highlighted for each verb where they exist.
And while much of PowerShell's drive for consistency on the command line comes from standardizing those verbs, some of the pairings may be appropriate in other contexts as well.
English is not my first language, but aren't those antonyms rather than idiomatic pairs?
On Linux you can use WordNet to search for antonyms:
sudo apt-get install wordnet
wn open -antsv
-ants requests antonyms, and the trailing v restricts the search to verbs. You can use n, a, or r instead to search antonyms for nouns, adjectives, or adverbs.
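The same lookup can be done from Python through NLTK's WordNet interface. A small sketch, assuming NLTK is installed and the WordNet corpus has been fetched with nltk.download('wordnet'):

from nltk.corpus import wordnet as wn

# Verb antonyms of "open", mirroring `wn open -antsv`
for lemma in wn.lemmas("open", pos=wn.VERB):
    for antonym in lemma.antonyms():
        print(lemma.name(), "->", antonym.name())  # e.g. open -> close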
Related
My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?
For most of the structures, there's no existing option to find the fragments. However, there's a module in rdkit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit import Chem                      # import added so the snippet runs standalone
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles('OCC(O)CO')        # glycerol, for example
fr_Al_OH(mol)                               # -> 3 aliphatic -OH groups
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)                               # -> 0 for glycerol (no aromatic -OH)
Similarly, there are 83 more functions available. Some of them will be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement them for your features. But as you already mentioned, the way to do that would be to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures, I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches, as you proposed yourself.
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.
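For illustration, here is a minimal sketch of the SMARTS route using glycerol; the two patterns below are deliberately simplistic stand-ins, and real group definitions such as UNIFAC's are more involved:

from rdkit import Chem

mol = Chem.MolFromSmiles("OCC(O)CO")           # glycerol
hydroxyl = Chem.MolFromSmarts("[OX2H]")        # an -OH group
ch2 = Chem.MolFromSmarts("[CX4H2]")            # an sp3 carbon bearing two hydrogens
print(len(mol.GetSubstructMatches(hydroxyl)))  # 3
print(len(mol.GetSubstructMatches(ch2)))       # 2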
I'm aware that Common Lisp has different binding environments for functions and variables, but I believe that it also has another binding environment for tagbody labels. Are there even more binding environments than this? If so, then is it fair to categorize Common Lisp as a Lisp-2?
These questions are not meant as pedantry or bike-shedding; I only want to gain a better understanding of Common Lisp and hopefully get some pointers on where to dig deeper into its spec.
I'm aware that Common Lisp has different binding environments for
functions and variables,
That would be namespaces, according to the HyperSpec:
namespace n. 1. bindings whose denotations are restricted to a particular kind. "The bindings of names to tags is the tag namespace." 2. any mapping whose domain is a set of names. "A package defines a namespace."
(Point 1.)
but I believe that it also has another binding environment for tagbody
labels. Are there even more binding environments than this?
Yes, there are more namespaces. I even remember a little snippet exposing most of them, but unfortunately, I can't find it anymore¹. It at least exposed variable, function, tag, and block namespaces, but maybe also types and declarations were included. There is also another SO answer that lists these namespaces.
If so, then is it fair to categorize Common Lisp as a Lisp-2?
In the comments to the above linked answer, Rainer Joswig agrees that the "general debate is about Lisp-1 against Lisp-n".
The "2" might be due to the relative importance of the distinction between value and function slots, or because the objects of the other namespaces aren't first-class objects. For example in the Gabriel/Pitman paper referenced in the other answer:
There is really a larger number of namespaces than just the two that
are discussed here. As we noted earlier, other namespaces include at
least those of blocks and tags; type names and declaration names are
often considered namespaces. Thus, the names Lisp1 and Lisp2, which we
have been using are misleading. The names Lisp5 and Lisp6 might be
more appropriate.
and:
In this paper, there are two namespaces of concern, which we
shall term the "value namespace" and the "function namespace." Other
namespaces include tag names (used by TAGBODY and GO) and block names
(used by BLOCK and RETURN-FROM), but the objects in the location parts
of their bindings are not first-class Lisp objects.
¹) PAIP, p. 837:
(defun f (f)                     ; F as a function name and as a variable
  (block f                       ; F as a block name
    (tagbody
       f (catch 'f               ; F as a go tag, and as a catch tag
           (if (typep f 'f)      ; F as a type name
               (throw 'f (go f)))
           (funcall #'f (get (symbol-value 'f) 'f))))))  ; F as a special variable's symbol and a property name
In PAIP, Peter Norvig says "Common Lisp has at least seven name spaces" (p. 836).
The seven he lists are:
functions and macros
variables
special variables
data types
label for go statements within a tagbody
a block name for return-from statements within a block
symbols inside a quoted expression
Peter Seibel makes a great point in his comp.lang.lisp post about "compiler" versus "library" namespaces. I think all of Norvig's seven namespaces are "compiler" namespaces.
See for example this old discussion post from comp.lang.lisp:
http://coding.derkeiler.com/Archive/Lisp/comp.lang.lisp/2004-04/0737.html
Yes - http://www.lispworks.com/documentation/lw51/CLHS/Body/t_symbol.htm#symbol specifies a separate value cell and function cell, consonant with a Lisp-2.
There is also a property list, but as there is no context in which a symbol "naturally" refers to its property list, it is not usual to describe CL as a Lisp-3 (in fact, I am not aware of any language usually so designated).
Like Smalltalk or Lisp?
EDIT
Where control structures are like:
Java:

    if( condition ) {
        doSomething
    }

Python:

    if cond:
        doSomething

Or

Java:

    while( true ) {
        print("Hello");
    }

Python:

    while True:
        print "Hello"
And operators (the same in Java and Python):

    1 + 2   // + operator
    2 * 5   // * op
In Smalltalk (if I'm correct) that would be:
condition ifTrue: [
    doSomething
].

[true] whileTrue: [
    Transcript show: 'Hello'
].
1 + 2    "+ is a message sent to the object 1 with argument 2, like 1.add(2)"
2 * 5    "same thing"
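Incidentally, Python makes the same point: its operators desugar to ordinary method calls, so the 1.add(2) analogy is nearly literal. A small illustration (not part of the original answer):

print((1).__add__(2))  # 3, what 1 + 2 desugars to
print((2).__mul__(5))  # 10, what 2 * 5 desugars to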
How come you've never heard of Lisp before?
You mean without special syntax for achieving the same?
Lots of languages have control structures and operators that are "really" some form of message passing or function-call system that can be redefined. Most "pure" object languages and pure functional languages fit the bill. But they are all still going to have your "+" and some form of code block (including Smalltalk!), so your question is a little misleading.
Assembly
Befunge
Prolog*
*I cannot be held accountable for any frustration and/or headaches caused by trying to get your head around this technology, nor am I liable for any damages caused by you due to aforementioned conditions including, but not limited to, broken keyboard, punched-in screen and/or head-shaped dents in your desk.
Pure lambda calculus? Here's the grammar for the entire language:
e ::= x | e1 e2 | \x . e
All you have are variables, function application, and function creation. It's equivalent in power to a Turing machine. There are well-known codings (typically "Church encodings") for such constructs as
If-then-else
while-do
recursion
and such datatypes as
Booleans
integers
records
lists, trees, and other recursive types
Coding in lambda calculus can be a lot of fun—our students will do it in the undergraduate languages course next spring.
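To make that concrete, here is a hedged sketch of a few Church encodings, written in Python only because it has first-class lambdas; nothing beyond abstraction and application is used. (Python's strict evaluation means IF evaluates both branches, which is fine for plain values.)

# Church booleans: a boolean picks one of two alternatives
TRUE  = lambda t: lambda f: t
FALSE = lambda t: lambda f: f
IF    = lambda b: lambda t: lambda f: b(t)(f)

# Church numerals: the numeral n applies a function n times
ZERO = lambda s: lambda z: z
SUCC = lambda n: lambda s: lambda z: s(n(s)(z))
to_int = lambda n: n(lambda x: x + 1)(0)  # decode, for display only

print(IF(TRUE)("then")("else"))  # then
print(to_int(SUCC(SUCC(ZERO))))  # 2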
Forth may qualify, depending on exactly what you mean by "no control structures or operators". Forth may appear to have them, but really they are all just symbols, and the "control structures" and "operators" can be defined (or redefined) by the programmer.
What about Logo or, more specifically, turtle graphics? I'm sure we all remember that: PEN UP, PEN DOWN, FORWARD 10, etc.
The SMITH programming language:
http://esolangs.org/wiki/SMITH
http://catseye.tc/projects/smith/
It has no jumps and is Turing-complete. I also made a Haskell interpreter for this bad boy a few years back.
I'll be first to mention brain**** then.
In Tcl, there are no control structures; there are just commands, and they can all be redefined. Every last one. There are also no operators. Well, except in expressions, but that's really just an imported foreign syntax that isn't part of the language itself. (We can also import full C or Fortran or just about anything else.)
How about FRACTRAN?
FRACTRAN is a Turing-complete esoteric programming language invented by the mathematician John Conway. A FRACTRAN program is an ordered list of positive fractions together with an initial positive integer input n. The program is run by updating the integer (n) as follows:
1. For the first fraction f in the list for which nf is an integer, replace n by nf.
2. Repeat this rule until no fraction in the list produces an integer when multiplied by n, then halt.
Of course there is an implicit control structure in rule 2.
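An interpreter for this takes only a few lines. A hedged Python sketch; the one-fraction program [3/2] is the classic example that adds the exponents of 2 and 3:

from fractions import Fraction

def fractran(program, n, max_steps=10000):
    """Run a FRACTRAN program on the integer n, following the two rules above."""
    for _ in range(max_steps):
        for f in program:                 # rule 1: the first fraction f...
            if (n * f).denominator == 1:  # ...for which n*f is an integer
                n = (n * f).numerator
                break
        else:
            break                         # rule 2: no fraction fits, halt
    return n

# 2**a * 3**b -> 3**(a+b): addition, FRACTRAN style
print(fractran([Fraction(3, 2)], 2**3 * 3**1))  # 81 == 3**4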
D (used in DTrace)?
APT (Automatically Programmed Tool), used extensively for programming NC machine tools.
The language also has no IO capabilities.
XSLT (or XSL, as some say) has control structures like if and for, but you should generally avoid them and handle everything by writing rules with the correct level of specificity. So the control structures are there, but they are implied by the default thing the translation engine does: apply potentially recursive rules.
For and if (and some others) do exist, but in many, many situations you can and should work around them.
How about Whenever?
Programs consist of a "to-do list": a series of statements which are executed in random order. Each statement can contain a prerequisite, which, if not fulfilled, causes the statement to be deferred until some (random) later time.
I'm not entirely clear on the concept, but I think PostScript meets the criteria, although it calls all of its functions operators (the way LISP calls all of its operators functions).
Makefile syntax doesn't seem to have any operators or control structures. I'd say it's a programming language, but it isn't Turing-complete (without extensions to the POSIX standard, anyway).
So... you're looking for a super-simple language? How about batch programming? If you have any version of Windows, then you have access to a batch interpreter. It's also more useful than you'd think, since you can carry out basic file operations (copy, rename, make directory, delete file, etc.).
http://www.csulb.edu/~murdock/dosindex.html
Example
Open Notepad and make a .bat file on your Windows box.
Open the .bat file with Notepad.
On the first line, type "echo off".
On the second line, type "echo hello world".
On the third line, type "pause".
Save and run the file.
If you're looking for a way to learn some very basic programming, this is a good way to start. (Just be careful with the Delete and Format commands. Don't experiment with those.)
Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.
Some examples:
Examples -> Example (a simple 's' suffix)
Glitches -> Glitch ('es' suffix, as opposed to above)
Countries -> Country ('ies' suffix)
Sheep -> Sheep (no change: a possible fallback for indeterminate cases)
Or, this seems to be a fairly exhaustive list.
Suggestions of libraries in language X are fine, as long as they are open source (i.e. so that someone can examine them to determine how to do it in language Y).
It really depends on what you mean by 'programmatically'. Part of English works on easy-to-understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism there than that school of thought really lends to the pursuit.
A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers which work off of statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give this a shot if you have a project that requires this type of simplification of strings which represent specific terms in English.
There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.
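If clustering related forms is all you need, NLTK ships a Porter implementation. A quick sketch, assuming NLTK is installed; note the stems are not dictionary words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("families"))  # famili
print(stemmer.stem("family"))    # famili -- related forms collapse to one stem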
Going from singular to plural, English plural forms are actually pretty regular compared to some other European languages I have a passing familiarity with. In German, for example, working out the plural form is really complicated (e.g. Land -> Länder). I think there are roughly 20-30 exceptions, and the rest follow a fairly simple ruleset:
-y -> -ies (family -> families)
-us -> -i (cactus -> cacti)
-s -> -ses (loss -> losses)
otherwise add -s
That being said, going from plural to singular is that much harder, because the reverse cases have ambiguities. For example:
pies: is it py or pie?
ski: is it singular or plural for 'skus'?
molasses: is it singular or plural for 'molasse' or 'molass'?
So it can be done, but you're going to have a much larger list of exceptions, and you're going to have to store a lot of false positives (i.e. things that appear plural but aren't).
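A minimal Python sketch of that rule set, reversed; the exception table is illustrative only, and the last rule produces exactly the false positives described above:

def singularize(word):
    """Naive plural -> singular; ambiguous and irregular forms need a lookup table."""
    exceptions = {"sheep": "sheep", "molasses": "molasses", "children": "child"}
    if word in exceptions:
        return exceptions[word]
    if word.endswith("ies"):
        return word[:-3] + "y"  # countries -> country (but pies -> py!)
    if word.endswith(("ches", "shes", "sses", "xes", "zes")):
        return word[:-2]        # glitches -> glitch, losses -> loss
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]        # examples -> example (but bus -> bu!)
    return word                 # sheep, fish, ...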
Is "axes" the plural of "ax" or of "axis"? Even a human cannot tell without context.
You can take a look at Inflector.net - my port of Rails' inflection class.
No - English isn't a language which sticks to many rules.
I think your best bet is either:
use a dictionary of common words and their plurals (or group them by their plural rule, e.g. words where you just add an S, words where you add ES, words where you drop a Y and add IES...)
rethink your application
It is not possible, as nickf has already said. It would be simple for the classes of words you have described, but what about all the words that end with s naturally? My name, Marius, for example, is not the plural of Mariu. Same with bus, I guess. Pluralization of words in English is a one-way function (a hash function), and you usually need the rest of the sentence or paragraph for context.
I have a huge list of persons' full names that I must search for in a huge text.
Only part of the name may appear in the text, and it may be misspelled, mistyped, or abbreviated. The text has no tokens, so I don't know where a person's name starts in the text. And I don't know whether the name will appear in the text at all.
Example:
I have "Barack Hussein Obama" in my list, so I have to check for occurrences of that name in the following texts:
...The candidate Barack Obama was elected the president of the United States... (incomplete)
...The candidate Barack Hussein was elected the president of the United States... (incomplete)
...The candidate Barack H. O. was elected the president of the United States... (abbreviated)
...The candidate Barack ObaNa was elected the president of the United States... (misspelled)
...The candidate Barack OVama was elected the president of the United States... (mistyped, B is next to V)
...The candidate John McCain lost the election... (no occurrence of the Obama name)
Certainly there isn't a deterministic solution for it, but...
What is a good heuristic for this kind of search?
If you had to, how would you do it?
You said it's about 200 pages.
Divide it into 200 one-page PDFs.
Put each page on Mechanical Turk, along with the list of names. Offer a reward of about $5 per page.
Split everything on spaces, removing special characters (commas, periods, etc.). Then use something like Soundex to handle misspellings. Or you could go with something like Lucene if you need to search a lot of documents.
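For the misspelling part, here is a hedged sketch using Python's standard difflib, since Soundex itself is not in the standard library; the threshold and tokenizer are illustrative choices:

import re
from difflib import SequenceMatcher

def fuzzy_find(name, text, threshold=0.8):
    """Report text tokens that approximately match any token of the name."""
    name_tokens = name.lower().split()
    words = re.findall(r"[a-z]+", text.lower())
    return [(i, w, t) for i, w in enumerate(words)
            for t in name_tokens
            if SequenceMatcher(None, w, t).ratio() >= threshold]

print(fuzzy_find("Barack Hussein Obama",
                 "The candidate Barack ObaNa was elected"))
# [(2, 'barack', 'barack'), (3, 'obana', 'obama')]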
What you want is a natural language processing (NLP) library. You are trying to identify a subset of proper nouns. If names are the main source of proper nouns, then it will be easy; if there are a decent number of other proper nouns mixed in, then it will be more difficult. If you are writing in Java, look at OpenNLP; in C#, SharpNLP. After extracting all the proper nouns, you could probably use WordNet to remove most non-name proper nouns. You may be able to use WordNet to identify subparts of names like "John" and then search the neighboring tokens to pick up other parts of the name. You will have problems with something like "John Smith Industries". You will have to look at your underlying data to see if there are features that you can take advantage of to help narrow the problem.
Using an NLP solution is the only really robust technique I have seen applied to similar problems. You may still have issues, since 200 pages is actually fairly small. Ideally you would have more text and be able to use more statistical techniques to help disambiguate between names and non-names.
At first blush I'm going for an indexing server: Lucene, FAST, or Microsoft Indexing Server.
I would use C# and LINQ. I'd tokenize all the words on space and then use LINQ to sort the text (and possibly use the Distinct() function) to isolate all the text that I'm interested in. When manipulating the text I'd keep track of the indexes (which you can do with LINQ) so that I could relocate the text in the original document - if that's a requirement.
The best way I can think of would be to define grammars in Python's NLTK. However, it can get quite complicated for what you want.
I'd personally go for regular expressions, generating a list of name permutations with some programming.
Both SQL Server and Oracle have built-in SOUNDEX functions.
Additionally, SQL Server has a built-in function called DIFFERENCE that can be used.
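Outside the database, classic American Soundex is simple enough to sketch yourself. A hedged Python version; note it maps the thread's mistyped example to the same code as the correct spelling:

def soundex(word):
    """Classic Soundex: first letter plus up to three digit codes, zero-padded."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate duplicate codes
            prev = code
    return (result + "000")[:4]

print(soundex("Obama"), soundex("ObaNa"))  # O150 O150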
Plain old regular expression scripting will do the job.
Use Ruby; it's quite fast. Read lines and match words.
Cheers