NLTK (or other) part-of-speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.

I would recommend having a look at spaCy. From what I have seen, it doesn't allow you to return the top-n tags by default, but it supports creating custom pipeline components.
There is also an issue on GitHub where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.
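Whichever tagger you end up with, if it can expose per-token tag probabilities you can turn those into n-best tag sequences with a small beam search. A rough, library-agnostic sketch (token_tag_probs is a hypothetical input: one {tag: probability} dict per token, produced by whatever tagger you train):
import heapq

def n_best_sequences(token_tag_probs, n=5):
    # token_tag_probs: list of {tag: probability} dicts, one per token
    # returns the n highest-scoring (score, tag_sequence) pairs
    beams = [(1.0, [])]
    for probs in token_tag_probs:
        candidates = [(score * p, tags + [tag])
                      for score, tags in beams
                      for tag, p in probs.items()]
        # keep only the n best partial sequences before moving to the next token
        beams = heapq.nlargest(n, candidates, key=lambda c: c[0])
    return beams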

Related

Obtaining METEOR scores for Japanese text

I wish to produce METEOR scores for several Japanese strings. I have imported nltk, wordnet and omw but the results do not convince me it is working correctly.
import nltk
from nltk.corpus import wordnet
from nltk.translate.meteor_score import single_meteor_score
nltk.download('wordnet')
nltk.download('omw')
reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
print(single_meteor_score(reference, hypothesis))
This outputs 0.5 but surely it should be much closer to 1.0 given the reference and hypothesis are identical?
Do I somehow need to specify which wordnet language I want to use in the call to single_meteor_score() for example:
single_meteor_score(reference, hypothesis, wordnet=wordnetJapanese.
Pending review by a qualified linguist, I appear to have found a solution. I found an open-source tokenizer for Japanese. I pre-processed all of my reference and hypothesis strings to insert spaces between Japanese tokens and then ran nltk's single_meteor_score() over the files.
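In case it helps, here is a minimal sketch of that pre-processing step. It assumes Janome as the tokenizer (just one open-source option, not necessarily the one used above) and a recent NLTK where single_meteor_score expects pre-tokenized input:
from janome.tokenizer import Tokenizer
from nltk.translate.meteor_score import single_meteor_score

janome_tokenizer = Tokenizer()

def tokenize_ja(text):
    # wakati=True yields the surface forms as plain strings
    return list(janome_tokenizer.tokenize(text, wakati=True))

reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
# On older NLTK versions, pass ' '.join(...) of the token lists instead.
print(single_meteor_score(tokenize_ja(reference), tokenize_ja(hypothesis)))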

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a python script that can predict the density of a mixture using an artificial neural network. As an input I want to have the structure/fingerprint of my molecules starting from the SMILES structure.
I'm already familiar with rdkit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in rdkit, but then I would have to define all the different subgroups. Is there a more convenient/shorter way?
For most structures, there's no existing option to find the fragments. However, there's a module in rdkit that can give you the number of fragments, especially when it's a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol, for example
fr_Al_OH(mol)  # number of aliphatic -OH groups
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, and some of them will be useful for your task. For the ones where you don't get a pre-written function, you can always go to the source code of these rdkit modules, figure out how they did it, and then implement it for your features. But, as you already mentioned, the way to do that would be to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
If you want to predict densities of pure components before predicting the mixtures, I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by rdkit as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches as you proposed yourself.
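As a rough sketch of that SMARTS route (the patterns below are illustrative, not a complete fragmentation scheme), counting the groups from the glycerol example could look like this:
from rdkit import Chem

mol = Chem.MolFromSmiles('OCC(O)CO')  # glycerol
patterns = {
    'aliphatic -OH': '[CX4][OX2H]',   # hydroxyl on an sp3 carbon
    '-CH2-': '[CX4H2]',
    '-CH<': '[CX4H1]',
}
for name, smarts in patterns.items():
    patt = Chem.MolFromSmarts(smarts)
    print(name, len(mol.GetSubstructMatches(patt)))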
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.

Sequence labeling for sentences and not tokens

I have sentences that belong to a paragraph. Each sentence has a label.
[s1,s2,s3,…], [l1,l2,l3,…]
I understand that I have to encode each sentence using an encoder, and then use sequence labeling. Could you guide me on how I could do that, combining them?
If I understand your question correctly, you are looking for a way to encode your sentences into a numeric representation.
Let's say you have data like:
data = ["Sarah, is that you? Hahahahahaha Todd give you another black eye??"
"Well, being slick comes with the job of being a propagandist, Andi..."
"Sad to lose a young person who was earnestly working for the common good and public safety when so many are in the basement smoking pot and playing computer games."]
labels = [0,1,0]
Now you want to build a classifier. For training, the data should be in numeric format, so we transform the text into a numeric structure using the TF-IDF vectorizer, which creates a matrix for the text data; then we apply an algorithm to it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', LinearSVC(penalty='l2', loss='hinge'))
])
trained_model = vectorizerPipe.fit(data,labels)
Here a pipeline is constructed where the first step is feature-vector extraction (converting the text data into numeric format) and the next step applies the algorithm to it. There are lots of parameters in both steps that you can try.
Later we fit the pipeline with the .fit method, passing the data and labels.
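Once fitted, the same pipeline can be used to label new sentences (the example sentence here is made up):
new_sentences = ["Another sentence taken from some paragraph."]
print(trained_model.predict(new_sentences))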

Neutral label for NLTK

I have a problem similar to the one below:
Why did NLTK NaiveBayes classifier misclassify one record?
In my case, I queried a positive feed and built positive_vocab, then queried a negative feed and built negative_vocab. The data I get from the feeds is clean, and I built the classifier from it. How do I build the neutral_vocab? Is there a way I can instruct the NLTK classifier to return a neutral label when a given word is found in neither negative_vocab nor positive_vocab? How do I do that?
In my current implementation, if I give it a word which is not present in either set, it says positive by default. Instead it should say neutral or not found.
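One way to get that behaviour (a sketch only, assuming positive_vocab and negative_vocab are plain sets of words) is to check both vocabularies yourself before falling back to the classifier:
def label_word(word, positive_vocab, negative_vocab):
    in_pos = word in positive_vocab
    in_neg = word in negative_vocab
    if in_pos and not in_neg:
        return 'positive'
    if in_neg and not in_pos:
        return 'negative'
    # word is in neither (or both) vocabularies
    return 'neutral'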

WordNet 3.0 Curse Words

I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 - 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1 - 3; however, I can't find a method for identifying curse words. How do I filter them out?
I know this would not be a definitive method of filtering out all the curse words, but what happens is that all keywords are first set to a state of "pending" before being "approved" by a moderator. However, if I can get WordNet to filter out most of the curse words, it would make the moderator's job easier.
It's strange, the Unix command line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in the NLTK just returns an empty list:
from nltk.corpus import wordnet
a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print(l.usage_domains())
[]
[]
...
As an alternative you could try to filter words that have "obscene", "coarse" or "slang" in their SynSet's definition. But probably it's much easier to filter against a fixed list as suggested before (like the one at noswearing.com).
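A rough sketch of that definition-based filter (the marker words are only an illustration and will both over- and under-match):
from nltk.corpus import wordnet

def looks_like_curse(word, markers=('obscene', 'vulgar', 'slang', 'coarse')):
    # flag the word if any of its synset definitions contains a marker term
    for synset in wordnet.synsets(word):
        definition = synset.definition().lower()
        if any(marker in definition for marker in markers):
            return True
    return False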
For the 4th requirement it would be better and more effective to collect a list of curse words and remove them in an iterative process.
To achieve this you can check out this blog.
I will summarize it here.
1. Load the swear words text file from here
2. Compare it with the text and remove any word that matches.
curseWords = set()  # the swear words loaded from the text file in step 1
def remove_curse_words():
    text = 'Hey Bro Fuck you'
    # keep only words not in the curse-word list (case-insensitive)
    text = ' '.join(word for word in text.split() if word.lower() not in curseWords)
    return text
The output would be:
Hey Bro you