WordNetLemmatizer is not lemmatizing text data - nltk

I am preprocessing text data. After stemming, when I lemmatize, I get exactly the same results as stemming (no change in the text). I can't understand what the issue is.
def stem_list(row):
    my_list = row['no_stopwords']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list

Japan['stemmed_words'] = Japan.apply(stem_list, axis=1)
def lemma_list(row):
    my_list = row['stemmed_words']
    lemmas_list = [lemma.lemmatize(word) for word in my_list]
    return lemmas_list

Japan['lemma_words'] = Japan.apply(lemma_list, axis=1)
Below is the sample output:
secur huawei involv uk critic network suffici mitig longterm hcsec
form mitig perceiv risk aris involv huawei critic nation infrastructur
governmentl board includ offici britain gchq cybersecur agenc well
senior huawei execut repres uk telecommun
My text is news articles.
I am using PorterStemmer for Stemming, and WordNetLemmatizer for Lemmatizing.
Thank you in advance.

Your text is most likely not changing during lemmatization because stemmed words are usually no longer real words, so WordNet has no lemmas for them.
Both processes try to reduce a word to its root, but stemming is strictly an algorithm, while lemmatization uses a vocabulary to find the smallest root of a word. Generally you would use one or the other, depending on the speed you need.
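A quick illustration of the difference (the outputs in the comments are what NLTK's PorterStemmer and WordNetLemmatizer should produce):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))           # 'studi' - an algorithmic cut, not a word
print(lemmatizer.lemmatize('studies'))   # 'study' - a real WordNet entry
print(lemmatizer.lemmatize('studi'))     # 'studi' - unknown to WordNet, returned unchanged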
However, if you just want to see the results of both in series, reverse your process: lemmatize first, then stem. You should get stems that differ from the lemmas you feed into the stemmer.
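For example, a minimal sketch of the reversed order, reusing the Japan DataFrame and no_stopwords column from the question:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def lemma_then_stem(row):
    # Lemmatize first, while the words are still real words WordNet can look up...
    lemmas = [lemmatizer.lemmatize(word) for word in row['no_stopwords']]
    # ...then stem the resulting lemmas.
    return [stemmer.stem(lemma) for lemma in lemmas]

Japan['lemma_words'] = Japan.apply(lemma_then_stem, axis=1)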

Related

How to extract noun phrases from a sentence using pre-trained BERT?

I want to extract noun phrases from a sentence using BERT. There are some available libraries, like TextBlob, that allow us to extract noun phrases like this:
from textblob import TextBlob
line = "Out business could be hurt by increased labor costs or labor shortages"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['labor costs', 'labor shortages'])
The output of this seems pretty good. However, it is not able to capture longer noun phrases. Consider the following example:
from textblob import TextBlob
line = "The Company’s Image Activation program may not positively affect sales at company-owned and participating "
"franchised restaurants or improve our results of operations"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['company ’ s', 'image activation'])
However, the ideal answer here could be a longer phrase, like: company-owned and participating franchised restaurants. Since BERT has proven to be state-of-the-art in many NLP tasks, it should perform at least better than this approach.
However, I could not find any relevant resource to use BERT for this task. Is it possible to solve this task using pre-trained BERT?

What is the right way to generate long sequence using PyTorch-Transformers?

I am trying to generate a long sequence of text using PyTorch-Transformers from a sample text. I am following this tutorial for this purpose. Because the original article only predicts one word from a given text, I modified that script to generate a long sequence instead of a single word. This is the modified part of the code:
# Encode the text input
text = """An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a structure,
like a bridge, to see if it is safe. A doctor may conduct"""
indexed_tokens = tokenizer.encode(text)

# Convert the indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
seq_len = tokens_tensor.shape[1]
tokens_tensor = tokens_tensor.to('cuda')

with torch.no_grad():
    for i in range(50):
        # Predict the next token from the last seq_len tokens, then append it
        outputs = model(tokens_tensor[:, -seq_len:])
        predictions = outputs[0]
        predicted_index = torch.argmax(predictions[0, -1, :])
        tokens_tensor = torch.cat((tokens_tensor, predicted_index.reshape(1, 1)), 1)

pred = tokens_tensor.detach().cpu().numpy().tolist()
predicted_text = tokenizer.decode(pred[0])
print(predicted_text)
Output
An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a
structure, like a bridge, to see if it is safe. A doctor may conduct
an examination of a patient's body to see if it is safe.
The doctor may also examine a patient's body to see if it is safe. A
doctor may conduct an examination of a patient's body to see if it is
safe.
As you can see, the generated text does not contain any unique sequences; it generates the same sentence over and over again with minor changes.
How should we generate long sequences using PyTorch-Transformers?
There is usually no such thing as generating a complete sentence or a complete text at once. There have been some research approaches to that, but almost all state-of-the-art models generate text word by word. The word generated at time t-1 is then used as input (together with the other already generated or given words) when generating the next word at time t. So it is normal that it generates word by word; I do not understand what you mean by this.
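To make that loop concrete, here is a rough sketch, assuming GPT-2 from pytorch-transformers as in the tutorial. It swaps the argmax for top-k sampling, which is one common way to break the kind of repetition shown in the question (my choice for illustration, not something the answer above prescribes):
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = torch.tensor([tokenizer.encode("An examination can be defined as")])

with torch.no_grad():
    for _ in range(50):
        # Feed everything generated so far; take the logits of the last position.
        logits = model(tokens)[0][0, -1, :]
        # Greedy argmax is deterministic and tends to loop; instead, sample
        # from the 50 most likely next words (top_k=50 is an arbitrary value).
        top_logits, top_idx = torch.topk(logits, 50)
        probs = torch.softmax(top_logits, dim=-1)
        next_id = top_idx[torch.multinomial(probs, 1)]
        tokens = torch.cat([tokens, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(tokens[0].tolist()))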
Which model are you using?

NLTK (or other) Part of speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on GitHub where exactly this is discussed, with some suggestions on how to implement it relatively quickly.
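To make "n-best tag sequences" concrete without relying on spaCy internals, here is a self-contained toy sketch of beam-search decoding over a hand-made HMM. All probabilities are invented for illustration; a real tagger would estimate them (and richer features) from your training sentences:
import heapq

TAGS = ['NN', 'NNS', 'VBZ', 'VBP', 'IN', 'DT']

EMIT = {  # P(word | tag), made-up values
    ('time', 'NN'): 0.8, ('flies', 'VBZ'): 0.5, ('flies', 'NNS'): 0.4,
    ('like', 'IN'): 0.6, ('like', 'VBP'): 0.3,
    ('an', 'DT'): 0.9, ('arrow', 'NN'): 0.8,
}

TRANS = {  # P(tag | previous tag), made-up values
    ('<s>', 'NN'): 0.5, ('NN', 'VBZ'): 0.4, ('NN', 'NNS'): 0.3,
    ('VBZ', 'IN'): 0.5, ('NNS', 'VBP'): 0.4,
    ('IN', 'DT'): 0.6, ('VBP', 'DT'): 0.5, ('DT', 'NN'): 0.8,
}

def n_best(words, n=2):
    # Beam search: keep the n most probable partial tag sequences at every step.
    beam = [(1.0, ['<s>'])]
    for word in words:
        step = []
        for prob, seq in beam:
            for tag in TAGS:
                p = prob * TRANS.get((seq[-1], tag), 1e-3) * EMIT.get((word, tag), 1e-6)
                step.append((p, seq + [tag]))
        beam = heapq.nlargest(n, step)
    return [(p, seq[1:]) for p, seq in beam]

# Prints NN VBZ IN DT NN and NN NNS VBP DT NN, most probable first.
for p, seq in n_best('time flies like an arrow'.split()):
    print(p, seq)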

WordNet 3.0 Curse Words

I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 and 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1 - 3; however, I can't find a method for identifying curse words. How do I filter them out?
I know this would not be a definitive method of filtering out all curse words, but all keywords are first set to a "pending" state before being "approved" by a moderator. If I can get WordNet to filter out most of the curse words, it would make the moderator's job much easier.
It's strange: the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in NLTK just returns an empty list:
from nltk.corpus import wordnet

a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():
        print(l.usage_domains())
[]
[]
...
As an alternative, you could try to filter words that have "obscene", "coarse" or "slang" in their synset's definition. But it's probably much easier to filter against a fixed list, as suggested below (like the one at noswearing.com).
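A minimal sketch of that definition-based filter; the marker terms are an illustrative choice, so expect both misses and false positives:
from nltk.corpus import wordnet

MARKERS = ('obscene', 'coarse', 'slang')

def looks_offensive(word):
    # Flag the word if any of its WordNet definitions mentions a marker term.
    return any(marker in synset.definition()
               for synset in wordnet.synsets(word)
               for marker in MARKERS)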
For the 4th requirement, it would be more effective to collect a list of curse words and remove them in an iterative pass.
To achieve this, you can check out this blog.
I will summarize the approach here.
1. Load a swear-words text file from here.
2. Compare it with the text and remove every word that matches.
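Step 1 might look like this, assuming the file contains one swear word per line (the filename is hypothetical):
with open('swear_words.txt') as f:
    curseWords = {line.strip().lower() for line in f}
Step 2 then becomes: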
def remove_curse_words():
    text = 'Hey Bro Fuck you'
    # Lowercase each word for the lookup so 'Fuck' matches 'fuck' in the list.
    text = ' '.join([word for word in text.split() if word.lower() not in curseWords])
    return text
The output would be:
Hey Bro you

Named Entity Recognition using NLTK. Relevance of extracted keywords

I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?
If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list of (token, tag) tuples:
sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]
Then the NE classifier is called like this:
nltk.ne_chunk(tagged_sentence)
It returns a Tree: the classified words appear as subtrees inside the main structure, labelled PERSON, ORGANIZATION or GPE.
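Putting the pieces together, a minimal end-to-end sketch (the sentence is made up, and this assumes the standard NLTK tagger and chunker models are installed):
import nltk

sentence = "Mark met the U.N. delegation in New York"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # the (token, tag) list described above
tree = nltk.ne_chunk(tagged)   # a Tree with named-entity subtrees

for subtree in tree.subtrees():
    if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE'):
        print(subtree.label(), ' '.join(tok for tok, tag in subtree.leaves()))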
To find the most relevant terms, you have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, raw frequency can be enough.
Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once you have it as a Text object (or a list of tokens), simply call:
relevant_terms_sorted_by_freq = nltk.FreqDist(corpus).most_common()  # (word, count) pairs, most frequent first
Finally, you could filter out every entry in relevant_terms_sorted_by_freq whose word does not belong to the list of NE words.
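For instance, a short sketch of that final filtering step, reusing tokens and tree from the sketch above:
from nltk import FreqDist

# Every token that appears inside a named-entity subtree.
ne_words = {tok for subtree in tree.subtrees()
            if subtree.label() in ('PERSON', 'ORGANIZATION', 'GPE')
            for tok, tag in subtree.leaves()}

freq = FreqDist(tokens)
relevant_nes = [(word, count) for word, count in freq.most_common()
                if word in ne_words]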
NLTK offers an online version of a complete book, which I find a good starting point.