Named Entity Recognition using NLTK. Relevance of extracted keywords - nltk

I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?

If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list of (word, tag) tuples:
sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]
Then the NE classifier is called like this:
nltk.ne_chunk(tagged_sentence)
It returns a Tree. The classified words appear as subtrees inside the main structure, and each subtree's label gives the entity type, for example PERSON, ORGANIZATION or GPE.
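To read the entity types out of the returned tree, one option is to walk its top-level children and keep those that carry a label. The sketch below is only an illustration: `extract_entities` and `demo` are hypothetical helper names, and the code assumes entity nodes expose `label()` and `leaves()` the way `nltk.Tree` does, while ordinary tokens stay plain (word, tag) tuples.

```python
def extract_entities(tree):
    """Return (entity_text, entity_type) pairs from an ne_chunk-style tree.

    Assumption: entity nodes expose label() and leaves() like nltk.Tree,
    while ordinary tokens are plain (word, tag) tuples.
    """
    entities = []
    for node in tree:
        # Plain (word, tag) tuples have no label() method, so they are skipped.
        if hasattr(node, "label"):
            words = [word for word, tag in node.leaves()]
            entities.append((" ".join(words), node.label()))
    return entities


def demo():
    """Hypothetical usage; needs NLTK and its NE chunker data installed."""
    import nltk

    tagged_sentence = [("The", "DT"), ("U.N.", "NNP")]
    return extract_entities(nltk.ne_chunk(tagged_sentence))
```

Calling `demo()` requires the NE chunker model data to have been fetched with `nltk.download` beforehand; which labels come back depends on that model.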
To find the most relevant terms, you first have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, raw frequency could be enough.
Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once you have it as a Text object, simply call:
relevant_terms_sorted_by_freq = [word for word, count in nltk.FreqDist(corpus).most_common()]
Finally, you could filter out all words in relevant_terms_sorted_by_freq that don't belong to a NE list of words.
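The counting and filtering steps can be sketched together as follows. This is a minimal illustration under some assumptions: `rank_entities_by_frequency` is a hypothetical helper, `collections.Counter` stands in for `nltk.FreqDist` (which offers the same `most_common()` method), and the entity list is supplied by hand rather than taken from the NE classifier.

```python
from collections import Counter


def rank_entities_by_frequency(tokens, entity_words):
    """Return (word, count) pairs for entity words, most frequent first.

    Counter is used here as a stand-in for nltk.FreqDist, which offers
    the same most_common() method.
    """
    wanted = set(entity_words)
    counts = Counter(token for token in tokens if token in wanted)
    return counts.most_common()


# Toy input; in practice tokens would come from your corpus and
# entity_words from the NE classifier output.
tokens = "the U.N. met in Geneva and the U.N. voted".split()
ranking = rank_entities_by_frequency(tokens, ["U.N.", "Geneva"])
print(ranking)
```

Note that this only handles single-token entities; multi-word names would need to be joined into one token (or counted as phrases) first.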
NLTK offers an online version of a complete book, which I find a good place to start.

Related

What is the right way to generate long sequence using PyTorch-Transformers?

I am trying to generate a long sequence of text from a sample text using PyTorch-Transformers, following this tutorial. Because the original article only predicts one word from a given text, I modified the script to generate a long sequence instead. This is the modified part of the code:
# Encode the text input
text = """An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a structure,
like a bridge, to see if it is safe. A doctor may conduct"""
indexed_tokens = tokenizer.encode(text)
# Convert the indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
seq_len = tokens_tensor.shape[1]
tokens_tensor = tokens_tensor.to('cuda')
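with torch.no_grad():
    for i in range(50):
        # Feed only the most recent seq_len tokens (a sliding window)
        outputs = model(tokens_tensor[:, -seq_len:])
        predictions = outputs[0]
        # Greedy choice: the highest-scoring token at the last position
        predicted_index = torch.argmax(predictions[0, -1, :])
        tokens_tensor = torch.cat((tokens_tensor, predicted_index.reshape(1, 1)), 1)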
pred = tokens_tensor.detach().cpu().numpy().tolist()
predicted_text = tokenizer.decode(pred[0])
print(predicted_text)
Output
An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a
structure, like a bridge, to see if it is safe. A doctor may conduct
an examination of a patient's body to see if it is safe.
The doctor may also examine a patient's body to see if it is safe. A
doctor may conduct an examination of a patient's body to see if it is
safe.
As you can see, the generated text is not a new sequence: it repeats the same sentence over and over with minor changes.
How should we create a long sequence using PyTorch-Transformers?
There is usually no such thing as generating a complete sentence or a complete text at once. There have been some research approaches to that, but almost all state-of-the-art models generate text word by word. The word generated at time t-1 is then used as input (together with the other already generated or given words) when generating the next word at time t. So it is normal that the model generates word by word. I do not understand what you mean by this.
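The word-by-word loop described above can be sketched without any model library. Everything here is an assumption for illustration: `generate`, `toy_model` and `TOY_TABLE` are hypothetical names, and the bigram table stands in for a real language model. The sketch also shows one commonly used alternative to always taking the argmax, namely sampling the next token from the few highest-scoring candidates, which tends to reduce the kind of looping seen in the question.

```python
import random


def generate(next_scores, prompt, steps, top_k=None, rng=None):
    """Extend prompt one token at a time.

    next_scores is a hypothetical callback standing in for the language
    model: it maps the tokens so far to {candidate_token: non-negative score}.
    With top_k=None the highest-scoring token is always taken (greedy);
    otherwise the next token is sampled from the top_k candidates.
    """
    rng = rng or random.Random()
    tokens = list(prompt)
    for _ in range(steps):
        ranked = sorted(next_scores(tokens).items(),
                        key=lambda item: item[1], reverse=True)
        if top_k is None:
            next_token = ranked[0][0]
        else:
            candidates = ranked[:top_k]
            next_token = rng.choices(
                [token for token, _ in candidates],
                weights=[score for _, score in candidates],
            )[0]
        # The new token becomes part of the input for the next step.
        tokens.append(next_token)
    return tokens


# Toy bigram "model": scores depend only on the last token.
TOY_TABLE = {
    "a": {"b": 0.6, "c": 0.4},
    "b": {"a": 0.7, "c": 0.3},
    "c": {"a": 0.6, "b": 0.4},
}


def toy_model(tokens):
    return TOY_TABLE[tokens[-1]]


greedy = generate(toy_model, ["a"], 4)
sampled = generate(toy_model, ["a"], 4, top_k=2, rng=random.Random(0))
print(greedy)
print(sampled)
```

With the toy table, the greedy run alternates between the same two tokens forever, which mirrors the repetition in the question; the sampled run can leave that cycle.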
Which model are you using?

How do I get molecular structural information from SMILES

My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a Python script that can predict the density of a mixture using an artificial neural network. As input I want the structure/fingerprint of my molecules, starting from the SMILES structure.
I'm already familiar with RDKit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in RDKit, but then I would have to define all the different subgroups. Is there a more convenient or shorter way?
For most structures, there is no existing option to find the fragments. However, there is a module in RDKit that gives you the number of occurrences of a fragment, especially when it is a functional group. Check it out here. As an example, say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function:
from rdkit.Chem.Fragments import fr_Al_OH
fr_Al_OH(mol)
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, and some of them should be useful for your task. For groups that have no pre-written function, you can always go to the source code of these RDKit modules, figure out how they did it, and then implement the same for your features. But as you already mentioned, the way to do that is to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
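The SMARTS route can be sketched as below. This is a rough illustration, not a complete group-contribution scheme: `GROUP_SMARTS`, `count_groups` and `to_feature_vector` are hypothetical names, the three patterns are simple assumptions chosen for the glycerol example, and no attempt is made to resolve overlapping groups (for instance, `[OX2H]` would also match the -OH of a carboxylic acid).

```python
# Illustrative SMARTS patterns (assumptions, not an established group list).
GROUP_SMARTS = {
    "OH": "[OX2H]",     # hydroxyl oxygen
    "CH2": "[CX4H2]",   # sp3 carbon carrying two hydrogens
    "CH": "[CX4H1]",    # sp3 carbon carrying one hydrogen
}


def count_groups(smiles, group_smarts):
    """Count substructure matches for each SMARTS pattern in one molecule."""
    # Imported here so the rest of the sketch can be read without RDKit.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return {
        name: len(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
        for name, smarts in group_smarts.items()
    }


def to_feature_vector(counts, group_order):
    """Turn a {group: count} dict into a fixed-order list for a model input."""
    return [counts.get(name, 0) for name in group_order]
```

For glycerol (`"OCC(O)CO"`), `count_groups` is intended to reproduce the counts given in the question, and `to_feature_vector` then puts them in a stable order so every molecule yields an input of the same length for the neural network.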
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by RDKit, as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches, as you suggested.
Dissecting a molecule into specific groups is not as straightforward as it might appear at first. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.

neutral label for NLTK

I have a problem similar to the one below:
Why did NLTK NaiveBayes classifier misclassify one record?
In my case, I queried a positive feed and built positive_vocab, then queried a negative feed and built negative_voca. I clean the data from the feeds and build the classifier. How do I build the neutral_vocab? Is there a way to instruct the NLTK classifier to return a neutral label when the given word is found in neither negative_voca nor positive_vocab? How do I do that?
In my current implementation, if I give a word that is not present in either set, it returns positive by default. Instead it should return neutral or notfound.
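One possible way to get that behaviour, sketched here under stated assumptions, is to check the vocabularies before the classifier is consulted at all, instead of training a third class. `classify_with_neutral` and `toy_classify` are hypothetical names; `toy_classify` is only a stand-in for a call into the trained NLTK classifier.

```python
def classify_with_neutral(words, positive_vocab, negative_vocab, classify):
    """Return 'neutral' when no input word is in either vocabulary.

    classify is any callable that maps the known words to a label; in
    practice it would wrap the trained NLTK classifier (hypothetical here).
    """
    known = [w for w in words if w in positive_vocab or w in negative_vocab]
    if not known:
        return "neutral"
    return classify(known)


positive_vocab = {"good", "great"}
negative_vocab = {"bad", "awful"}


def toy_classify(words):
    # Stand-in for a trained classifier: simple majority vote.
    score = sum(1 if w in positive_vocab else -1 for w in words)
    return "positive" if score >= 0 else "negative"


print(classify_with_neutral(["table"], positive_vocab, negative_vocab, toy_classify))
print(classify_with_neutral(["great"], positive_vocab, negative_vocab, toy_classify))
```

The guard runs outside the classifier, so the default label the classifier picks for unseen input no longer matters.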

How to find likelihood in NLTK

http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/comment-page-1/#comment-73511
I am trying to understand NLTK using this link. I cannot understand how the values of feature_probdist and show_most_informative_features are computed.
In particular, when the word "best" does not occur, how is the likelihood computed as 0.077? I have been trying to work this out for a long time.
That is because the article explains code from NLTK's source but does not display all of it. The full code is available on NLTK's website (and is also linked from the article you referenced). These are, respectively, a field within a method and a method of the NaiveBayesClassifier class in NLTK. This class implements a Naive Bayes classifier, which is essentially an application of Bayes' theorem with a strong (naive) assumption that the features are independent of each other given the label.
feature_probdist = "P(fname=fval|label), the probability distribution for feature values, given labels. It is expressed as a dictionary whose keys are (label,fname) pairs and whose values are ProbDistIs over feature values. I.e., P(fname=fval|label) = feature_probdist[label,fname].prob(fval). If a given (label,fname) is not a key in feature_probdist, then it is assumed that the corresponding P(fname=fval|label) is 0 for all values of fval."
most_informative_features returns "a list of the 'most informative' features used by this classifier. For the purpose of this function, the informativeness of a feature (fname,fval) is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:"
max[ P(fname=fval|label1) / P(fname=fval|label2) ]
Check out the source code for the entire class if this is still unclear. The article's intent was not to break down in depth how NLTK works under the hood, but rather to give a basic idea of how to use it.
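To make the two quantities concrete, here is a small sketch with made-up counts. As far as I know, NaiveBayesClassifier.train uses ELEProbDist as its default estimator, which adds 0.5 to every count, so a feature value never seen with a label still gets a small nonzero probability. The helper names `ele_prob` and `informativeness` are hypothetical, and the counts are illustrative rather than taken from the article.

```python
def ele_prob(count, total, bins, gamma=0.5):
    """Lidstone-style estimate: (count + gamma) / (total + bins * gamma).

    gamma=0.5 gives the expected likelihood estimate.
    """
    return (count + gamma) / (total + bins * gamma)


def informativeness(prob_by_label):
    """Highest P(fname=fval|label) divided by the lowest, over all labels."""
    return max(prob_by_label.values()) / min(prob_by_label.values())


# Illustrative counts, not taken from the article: the feature value was
# seen 0 times out of 5 for one label and 3 times out of 5 for the other,
# with 2 possible feature values (True/False).
probs = {
    "negative": ele_prob(0, 5, 2),
    "positive": ele_prob(3, 5, 2),
}
print(probs)
print(informativeness(probs))
```

The same arithmetic, applied to the actual counts and number of feature values in the article's training data, is where a small value for an unseen word would come from.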

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open source OCR (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a (written) master guest list and I want to scan this list in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter, and then I could just cross-reference the results with the 100 names, but this takes too long. If the OCR focused on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word as an individual symbol.
So when you train or calibrate your OCR, train it with images of whole words, just as you would for an individual symbol.
(If this feature is not directly available in your OCR, first map images of whole words to a unique symbol and later transform that symbol into the final word string.)
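If whole-word training is not available, a cheaper fallback is to let the OCR read letters as usual and then snap each result to the closest name in the database. This is a different technique from the whole-word approach above, sketched here with Python's standard `difflib`; `snap_to_guest_list` is a hypothetical helper and the 0.6 cutoff is just the library default, not a tuned value.

```python
import difflib


def snap_to_guest_list(ocr_word, names, cutoff=0.6):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_word, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


names = ["Jack", "Jane", "Maria"]
print(snap_to_guest_list("Jach", names))
```

With only about 100 names the comparison itself is cheap, so the remaining time is dominated by the OCR step rather than the lookup.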