Using the nltk to recognise dates as named entities? - nltk

I'm trying to use the NLTK Named Entity Tagger to identify various named entities. In the book Natural Language Processing with Python they provide a list of commonly used named entitities, (Table 7.4, if anyone is curious) which include: DATE June, 2008-06-29 and TIME two fifty a m, 1:30 p.m. So I got the impresssion that this could be done with the NLTK's named entity tagger.
However, when I've run the tagger, it doesn't seem to pick up dates or times at all, as it does people or organizations. Does the NLTK named entity tagger not handle these date/time cases, or does it only pick up a specific date/time format? If it doesn't handle these cases, does anybody know of a system that does? Or is creating my own the only solution?
Thanks!

You should check out the contrib repository of NLTK - contains a module called timex.py or download it here:
https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
From the first line of the module:
# Code for tagging temporal expressions in text

Related

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example if glycerol is the input the answer would be 3 x -OH , 2x -CH2 and 1x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with -rdkit and the morganfingerprint but that is not what i'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there any more convenient/shorter way?
For most of the structures, there's no existing option to find the fragments. However, there's a module in rdkit that can provide you the number of fragments especially when it's a function group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that
from rdkit.Chem.Fragments import fr_Al_OH
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available. Some of them would be useful for your task. For the ones, you don't get the pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement them for your features. But as you already mentioned, the way would be to define a SMARTS string and then fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches as you proposed yourself.
Dissecting a molecule into specific groups is not as straightforward as it might appear in the first place. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.

Bioisosteric replacement using SMARTS (KNIME and RDKit)

I am trying to create a KNIME workflow that would accept a list of compounds and carry out bioisosteric replacements (we will use the following example here: carboxylic acid to tetrazole) automatically.
NOTE: I am using the following workflow as inspiration : RDKit-bioisosteres (myexperiment.org). This uses a text file as SMARTS input. I cannot seem to replicate the SMARTS format used here.
For this, I plan to use the Rdkit One Component Reaction node which uses a set of compounds to carry out the reaction on as input and a SMARTS string that defines the reaction.
My issue is the generation of a working SMARTS string describing the reaction.
I would like to input two SDF files (or another format, not particularly attached to SDF): one with the group to replace (carboxylic acid) and one with the list of possible bioisosteric replacements (tetrazole). I would then combine these two in KNIME and generate a SMARTS string for the reaction to then be used in the Rdkit One Component Reaction node.
NOTE: The input SDF files have the structures written with an
attachment point (*COOH for the carboxylic acid for example) which
defines where the group to replace is attached. I suspect this is the
cause of many of the issues I am experiencing.
So far, I can easily generate the reactions in RXN format using the Reaction Builder node from the Indigo node package. However, converting this reaction into a SMARTS string that is accepted by the Rdkit One Component Reaction node has proven tricky.
What I have tried so far:
Converting RXN to SMARTS (Molecule Type Cast node) : gives the following error code : scanner: BufferScanner::read() error
Converting the Source and Target molecules into SMARTS (Molecule Type Cast node) : gives the following error code : SMILES loader: unrecognised lowercase symbol: y
showing this as a string in KNIME shows that the conversion is not carried out and the string is of SDF format : *filename*.sdf 0 0 0 0 0 0 0 V3000M V30 BEGIN etc.
Converting the Source and Target molecules into RDkit first (RDkit from Molecule node) then from RDkit into SMARTS (RDkit to Molecule node, SMARTS option). This outputs the following SMARTS strings:
Carboxylic acid : [#6](-[#8])=[#8]
Tetrazole : [#6]1:[#7H]:[#7]:[#7]:[#7]:1
This is as close as I've managed to get. I can then join these two smarts strings with >> in between (output: [#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1) to create a SMARTS reaction string but this is not accepted as an input for the Rdkit One Component Reaction node.
Error message in KNIME console :
ERROR RDKit One Component Reaction 0:40 Creation of Reaction from SMARTS value failed: null
WARN RDKit One Component Reaction 0:40 Invalid Reaction SMARTS: missing
Note that the SMARTS strings that this last option (3.) generates are very different than the ones used in the myexperiments.org example ([*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1). I also seem to have lost the attachment point information through these conversions which are likely to cause issues in the rest of the workflow.
Therefore I am looking for a way to generate the SMARTS strings used in the myexperiments.org example on my own sets of substituents. Obviously doing this by hand is not an option. I would also like this workflow to use only the open-source nodes available in KNIME and not proprietary nodes (Schrodinger etc.).
Hopefully, someone can help me out with this. If you need my current workflow I am happy to upload that with the source files if required.
Thanks in advance for your help,
Stay safe and healthy!
-Antoine
What you're describing is template generation, which has been a consistent field of work in reaction prediction and/or retrosynthesis in cheminformatics for a long time.
I'm not particularly familiar with KNIME myself, though I know RDKit extensively: Your last option (3) is closest to what I'd consider a usable workflow. The way I would do this:
Load the substitution pair molecules from SDF into RDKit mol objects.
Export these RDKit mol objects as SMARTS strings rdkit.Chem.MolToSmarts().
Concatenate these strings into the form before_substructure>>after_substructure to generate a reaction SMARTS string.
Load this SMARTS string into a reaction object rxn = rdkit.Chem.AllChem.ReactionFromSmarts()
Use the rxn.RunReactants() method to generate your bioisosterically substituted products.
The error you quote for the RDKit One Component Reaction node input cuts off just before the important information, unfortunately. Running rdkit.Chem.AllChem.ReactionFromSmarts("[#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1") produces no errors for me locally, which leads me to believe this is specific to the KNIME node functionality.
Note, that the difference between [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O is relatively minimal: The former represents a O-C=O substructure, the latter represents a ~COOH group. Within the square brackets of the latter, the :num refers to an optional 'atom map' number, which allows a one-to-one mapping of reactant and product atoms. For example, [C:1][C:3].[C:2][C:4]>>[C:1][C:3][C:4][C:2] allows you to track which carbon is which during a reaction, for situations where it may matter. The token [*:1] means "any atom" and is equivalent to a wavey line in organic chemistry (and it is mapped to #1).
There are only two situations I can think of where [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O might differ:
You have methanoic acid as a potential input for substitution (former will match, latter might not - I can't remember how implicit hydrogens are treated in this situation)
Inputs are over/under protonated. (COO- != COOH)
Converting these reaction SMARTS to RDKit reaction objects and running them on input molecule objects should potentially create a number of substituted products. Note: Typically, in extensive projects, there will be some SMARTS templates that require some degree of manual intervention - indicating attachment points, specifying explicit hydrogens, etc. If you need any help or have any questions don't hesitate to drop a comment and I'll do my best to help with specifics.

Princeton Wordnet database - two different synset identifiers?

I am trying to make sense of the different identifiers in the Princeton Wordnet database. I am using version 3.1. You can read about the structure here but my focus is on the synsets table.
The Synset Table The synsets table is one of the most important tables in the database. It is responsible for housing all the definitions within WordNet. Each row in the synset table has a synsetid, a definition, a pos (parts of speech field) and a lexdomainid (which links to the lexdomain table) There are 117373 synsets in the WordNet Database.
When I search for the word joy in my senses table, I see that there are four different results (2 nouns and 2 vebs). From there, I can identify the sense/meaning that I am looking for, which is the one that corresponds to the meaning:
"the emotion of great happiness"
So I have now found the result that I am looking for. The synset id of this result is 107542591 and I can search this id to find other words with the same sense/meaning.
However, when I use some online versions of Wordnet and I search for words in the synset "the emotion of great happiness", I see a different type of identifier. This identifier is 07527352-n.
For example, you can see it at the top-left corner of this site. On that same site, in the address bar you'll see that identifier is referred to as the synset id: &synset=07527352-n.
I would like to know how to retrieve the second type of identifier for a given synset. I've read through the documentation here and searched through the raw data files, but I cannot figure it out.
Thank you!
There are two things going on.
First, MySQL does not like ids starting with a 0, so they start with 1. (Specifically, nouns get a 1 prefix, verbs 2, adjectives 3, and adverbs get a 4 prefix: see WordNet identifiers section at http://wordnet-rdf.princeton.edu/ )
Second, 07542591 is from WordNet 3.1 (I've checked both the raw WordNet files, and the SQL files, and they both use this).
"07527352" is from an older version of WordNet. In the case of Chinese WordNet I believe they use WordNet 3.0. http://compling.hss.ntu.edu.sg/cow/
Additional: https://stackoverflow.com/a/33348009/841830 has more information. Strangely, I've not been able to track a simple 3.0 to 3.1 conversion table yet... but I'm sure I've seen one.

When could a CSV records *not* have the same number of fields?

I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in it's own field...
speed, incline, snores
therefore...
15mph, 20%, ,
, , 12
16mph, 20%, ,
14mph, 20%, ,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' to be toward the novice end of the spectrum and/or using resource limited systems. CSV is accessible.
The CSV format is being provided non-exclusively as feature not requirement. Although, if said application is providing a CSV file it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage say CSV is read by human not processor. Neither method is more complex to parse using a custom parser, but will Option B void the usefulness of a CSV format with other libraries, frameworks, applications et al. With Option A future changes/versions to the data set of an individual event may break the CSV structure (zombie , , to maintain forwards compatibility); whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Proccessing et al. where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say there is no "CSV Standard" with regard to contents. The real answer depend on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Dont over engineer". i.e. Don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be
tested
documented
supported
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.

List of idiomatic word pairs

I remember seeing somewhere a dictionary of idiomatic word pairs for use in programming.
Like get-set, open-close, allocate-free and so on.
Does anyone remember an URL?
Building on ergosys' answer:
From Code Complete 2, Chapter 11, p. 264:
Common Opposites in Variable Names
Begin/end
first/last
locked/unlocked
min/max
next/previous
old/new
opened/closed
visible/invisible
source/target
source/destination
up/down
There are two short lists of such pairs in Code Complete, one for function names, one for variable names. You can search for "Use Opposites Precisely" using amazon's look inside feature if you don't have the book.
Never seen a list generally aimed at programming, however PowerShell has such a list: Cmdlet verbs. Pairings are highlighted for each verb where they exist.
And while much of PowerShell's strive for consistency on the command line comes from standardizing those verbs some of the pairings may be appropriate in other contexts as well.
English is not my first language but aren't those antonyms rather than idiomatic pairs?
In Linux you can use wordnet to search for antonyms
sudo apt-get install wordnet
wn open -antsv
-ants for antonyms and v for verbs. You can also search for (n | a | r) – antonym for noun | adjective | adverbs.