How to extract noun phrases from a sentence using pre-trained BERT? - deep-learning

I want to extract noun phrases from a sentence using BERT. There are libraries such as TextBlob that allow us to extract noun phrases like this:
from textblob import TextBlob
line = "Out business could be hurt by increased labor costs or labor shortages"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['labor costs', 'labor shortages'])
The output of this seems pretty good. However, it is not able to capture longer noun phrases. Consider the following example:
from textblob import TextBlob
line = "The Company’s Image Activation program may not positively affect sales at company-owned and participating "
"franchised restaurants or improve our results of operations"
blob = TextBlob(line)
blob.noun_phrases
>>> WordList(['company ’ s', 'image activation'])
However, the ideal answer here would be a longer phrase, such as "company-owned and participating franchised restaurants". Since BERT has achieved state-of-the-art results on many NLP tasks, it should perform at least as well as this approach, if not better.
However, I could not find any relevant resource to use BERT for this task. Is it possible to solve this task using pre-trained BERT?
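For reference, one possible direction is to take the noun phrases from a neural constituency parse rather than from TextBlob's chunker. The sketch below assumes the Berkeley Neural Parser (benepar) with its spaCy plugin; the model name benepar_en3 and the exact output are assumptions, not something confirmed for BERT specifically.
import benepar
import spacy

# One-time setup (assumed): pip install benepar spacy,
# python -m spacy download en_core_web_sm, benepar.download('benepar_en3')
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})  # transformer-based constituency parser

line = ("The Company's Image Activation program may not positively affect sales at company-owned "
        "and participating franchised restaurants or improve our results of operations")

doc = nlp(line)
sent = list(doc.sents)[0]

# Every constituent labelled NP is a candidate noun phrase, including long, nested ones.
noun_phrases = [c.text for c in sent._.constituents if "NP" in c._.labels]
print(noun_phrases)
Because the parse is hierarchical, this returns long spans as well as the shorter noun phrases nested inside them; you can keep only the maximal non-overlapping spans if you want one phrase per region of the sentence.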

Related

Sequence labeling for sentences and not tokens

I have sentences that belong to a paragraph. Each sentence has a label.
[s1,s2,s3,…], [l1,l2,l3,…]
I understand that I have to encode each sentence using an encoder, and then use sequence labeling. Could you guide me on how I could do that, combining them?
If I understand your question correctly, you are looking for a way to encode your sentences into a numeric representation.
Let's say you have data like this:
data = ["Sarah, is that you? Hahahahahaha Todd give you another black eye??"
"Well, being slick comes with the job of being a propagandist, Andi..."
"Sad to lose a young person who was earnestly working for the common good and public safety when so many are in the basement smoking pot and playing computer games."]
labels = [0,1,0]
Now you want to build a classifier. For training, the data must be in numeric form, so we convert the text into a numeric representation using a TF-IDF vectorizer, which builds a term matrix from the text; we then apply any algorithm on top of it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', LinearSVC(penalty='l2', loss='hinge'))
])

trained_model = vectorizerPipe.fit(data, labels)
Here a pipeline is constructed: the first step is feature extraction (converting the text data into numeric format), and the next step applies the algorithm to it. There are lots of parameters in both steps that you can experiment with.
Later we fit the pipeline with the .fit method, passing the data and the labels.
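Once fitted, the same pipeline can be used for prediction on new sentences; a minimal usage sketch (the example sentence is made up):
# The pipeline applies the TF-IDF transform and the classifier in one call.
new_sentences = ["Another comment from the same paragraph."]  # hypothetical input
predicted_labels = trained_model.predict(new_sentences)
print(predicted_labels)  # e.g. array([0])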

WordNetLemmatizer is not lemmatizing my text data

I am preprocessing text data. When I lemmatize after stemming, I get exactly the same results as stemming (no change in the text), and I can't understand what the issue is.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemming = PorterStemmer()
lemma = WordNetLemmatizer()

def stem_list(row):
    my_list = row['no_stopwords']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list

Japan['stemmed_words'] = Japan.apply(stem_list, axis=1)

def lemma_list(row):
    my_list = row['stemmed_words']
    lemmas_list = [lemma.lemmatize(word) for word in my_list]
    return lemmas_list

Japan['lemma_words'] = Japan.apply(lemma_list, axis=1)
Below is the sample output:
secur huawei involv uk critic network suffici mitig longterm hcsec
form mitig perceiv risk aris involv huawei critic nation infrastructur
governmentl board includ offici britain gchq cybersecur agenc well
senior huawei execut repres uk telecommun
My text is news articles.
I am using PorterStemmer for stemming and WordNetLemmatizer for lemmatizing.
Thank you in advance.
The reason your text is not changing during lemmatization is most likely that stemmed words are often not real words, so they have no lemmas at all.
Both processes try to shorten a word to its root, but stemming is strictly an algorithm, while lemmatization uses a vocabulary to find the smallest root of a word. Generally you would use one or the other, depending on the speed you need.
However, if you just want to see the results of both in series, reverse your process: lemmatize first, and you should get stems that differ from the lemmas you feed into the stemmer.
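A minimal sketch of why the order matters; the exact outputs can vary slightly between NLTK versions, so treat the comments as illustrative:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stem first: the stem is no longer a real word, so the lemmatizer leaves it alone.
print(lemmatizer.lemmatize(stemmer.stem("agencies")))  # 'agenc'
# Lemmatize first: the lemmatizer can still map the real word to its lemma.
print(lemmatizer.lemmatize("agencies"))                # 'agency'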

NLTK (or other) Part of speech tagger that returns n-best tag sequences

I need a part of speech tagger that does not just return the optimal tag sequence for a given sentence, but that returns the n-best tag sequences. So for 'time flies like an arrow', it could return both NN VBZ IN DT NN and NN NNS VBP DT NN for example, ordered in terms of their probability. I need to train the tagger using my own tag set and sentence examples, and I would like a tagger that allows different features of the sentence to be engineered. If one of the nltk taggers had this functionality, that would be great, but any tagger that I can interface with my Python code would do. Thanks in advance for any suggestions.
I would recommend having a look at spaCy. From what I have seen, it doesn't by default allow you to return the top-n tags, but it supports creating custom pipeline components.
There is also an issue on Github where exactly this is discussed, and there are some suggestions on how to implement it relatively quickly.
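Whichever tagger you end up with, once it can give you per-token tag probabilities, ranking the n-best sequences only needs a little glue code. Below is a minimal sketch under the simplifying assumption that tags are scored independently per token (no transition model); tag_probs is a hypothetical placeholder for your tagger's output.
import heapq
from itertools import product

# Hypothetical per-token tag probabilities for "time flies like an arrow",
# e.g. taken from your tagger's softmax output.
tag_probs = [
    {"NN": 0.9, "NNS": 0.1},
    {"VBZ": 0.6, "NNS": 0.4},
    {"IN": 0.7, "VBP": 0.3},
    {"DT": 1.0},
    {"NN": 1.0},
]

def n_best_sequences(tag_probs, n=5):
    """Return the n highest-scoring tag sequences, assuming independence between tokens."""
    candidates = []
    for combo in product(*(d.items() for d in tag_probs)):
        tags = tuple(tag for tag, _ in combo)
        score = 1.0
        for _, p in combo:
            score *= p
        candidates.append((score, tags))
    return heapq.nlargest(n, candidates)

for score, tags in n_best_sequences(tag_probs, n=3):
    print(round(score, 3), " ".join(tags))
For longer sentences the exhaustive product blows up, so in practice you would replace it with a beam search over the per-token scores, but the ranking idea is the same.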

Which NLTK technique for extracting terms for a tag cloud

I want to extract relevant terms from text and keep only the most relevant ones. For example:
How to config nltk data -> "how", "to", "config" ignored
config mysql to scan -> "config" NOT ignored
Python NLTK usage -> "usage" ignored
new song by the band usage -> "usage" NOT ignored
NLTK Thinks that -> "thinks" ignored
critical thinking -> "thinking" NOT ignored
I can only think of this crude method:
>>> text = nltk.word_tokenize(input)
>>> nltk.pos_tag(text)
and then keep only the nouns and verbs. But even though "think" and "thinking" are both verbs, I want to retain only "thinking", and "combined" over "combine". I would also like to extract phrases if possible, as well as terms like "free2play", "#pro_blogger", etc.
Please suggest a better scheme or how to actually make my scheme work.
All you need is better POS tagging. This is a well-known problem with NLTK: the core POS tagger is not accurate enough for production use. Maybe you want to try something else. Compare your results for POS tagging here: http://nlp.stanford.edu:8080/parser/ . This is the most accurate POS tagger I have ever found (I know I will be proved wrong soon). Once you parse your data with this tagger, you will immediately see what you actually want.
I suggest you focus on tagging properly.
Check this POS tagging example:
critical/JJ
thinking/NN
Source: I am also struggling with the NLTK POS tagger these days. :)
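If you stay with NLTK for now, here is a rough sketch of the asker's own scheme that keeps nouns and -ing verb forms (VBG), so that "thinking" survives while "think" is dropped; the tag whitelist is an assumption, and the same filter works on any tagger's (word, tag) output.
import nltk

text = "critical thinking and Python NLTK usage"
tagged = nltk.pos_tag(nltk.word_tokenize(text))

# Keep nouns (NN, NNS, NNP, NNPS) and gerunds (VBG); drop bare verb forms like "think".
keep_tags = {"NN", "NNS", "NNP", "NNPS", "VBG"}
terms = [word for word, tag in tagged if tag in keep_tags]
print(terms)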

Named Entity Recognition using NLTK. Relevance of extracted keywords

I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?
If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list of (word, tag) tuples:
sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]
Then the NE classifier is called like this:
nltk.ne_chunk(tagged_sentence)
It returns a Tree; the classified words appear as subtree nodes inside the main structure, and each one is labelled as a PERSON, ORGANIZATION, or GPE.
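A minimal end-to-end sketch using NLTK's built-in tagger and chunker (the example sentence is made up and the exact labels returned are illustrative):
import nltk

sentence = "Barack Obama spoke at the United Nations in New York."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = nltk.ne_chunk(tagged)

entities = []
for node in tree:
    # Named entities come back as nested Tree objects whose label is the entity type.
    if isinstance(node, nltk.Tree):
        name = " ".join(word for word, tag in node.leaves())
        entities.append((name, node.label()))

print(entities)  # e.g. [('Barack Obama', 'PERSON'), ('United Nations', 'ORGANIZATION'), ('New York', 'GPE')]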
To find out the most relevant terms, you have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, raw frequency could be enough.
Computing the frequency of each word within a document is easy with NLTK. First you have to load your corpus; once you have loaded it and have a Text object, simply call:
relevant_terms_sorted_by_freq = [term for term, freq in nltk.probability.FreqDist(corpus).most_common()]
Finally, you could filter out all the words in relevant_terms_sorted_by_freq that do not appear in your list of named-entity words.
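A small sketch of that last filtering step; the entities and relevant_terms_sorted_by_freq values below are hypothetical placeholders standing in for the outputs computed above.
# Hypothetical outputs from the previous steps.
entities = [('Barack Obama', 'PERSON'), ('United Nations', 'ORGANIZATION')]
relevant_terms_sorted_by_freq = ['obama', 'united', 'nations', 'spoke', 'the']

# Keep only frequency-ranked terms that also occur inside a named entity.
ne_words = {word.lower() for name, label in entities for word in name.split()}
relevant_named_terms = [term for term in relevant_terms_sorted_by_freq if term.lower() in ne_words]
print(relevant_named_terms)  # ['obama', 'united', 'nations']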
NLTK offers an online version of a complete book, which I find a good place to start.