S -> NP VP, do these sentences follow this format? - nltk

I am parsing some sentences (from the inaugural speech in the nltk corpus) with the format S -> NP VP, and I want to make sure I parsed them correctly. Do these sentences follow the aforementioned format? Sorry if this question seems trivial; English is not my first language. If anyone has a question about whether a given sentence follows NP VP, ask me and I will give you my reasons for why I picked it and show you its parse tree.
god bless you
our capacity remains undiminished
their memories are short
they are serious
these things are true
the capital was abandoned
they are many
god bless the united states of america
the enemy was advancing
all this we can do
all this we will do
Thanks in advance.

The first 9 are NP VP. In the last two, "all this" is the direct object, which is part of the VP.
god bless you
NP- VP-------
our capacity remains undiminished
NP---------- VP------------------
their memories are short
NP------------ VP-------
they are serious
NP-- VP---------
these things are true
NP---------- VP------
the capital was abandoned
NP--------- VP-----------
they are many
NP-- VP------
god bless the united states of america
NP- VP--------------------------------
the enemy was advancing
NP------- VP-----------
all this we can do
VP------ NP VP----
all this we will do
VP------ NP VP-----
Note that the last two sentences are semantically equivalent to "We can do all this" and "We will do all this", an order which makes the subject / verb-predicate breakdown easier to see.
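For reference, here is a minimal sketch of how such sentences can be checked against S -> NP VP with NLTK's chart parser; the toy grammar below is an assumption for illustration (not the asker's actual grammar) and only covers one of the sentences.
import nltk

# Toy grammar for "the capital was abandoned"; extend the lexicon for the other sentences.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V Adj
Det -> 'the'
N -> 'capital'
V -> 'was'
Adj -> 'abandoned'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the capital was abandoned".split()):
    print(tree)  # (S (NP (Det the) (N capital)) (VP (V was) (Adj abandoned)))
If the parser yields at least one tree rooted in S, the sentence is covered by the S -> NP VP rule under that grammar.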

Related

Looking for dataset for sentiment analysis that consists of sentences with slang words

I am developing a machine learning model to predict the sentiment polarity of customers' comments about some product.
Currently, I use the pretrained twitter-roberta-base-sentiment as the base model.
It works well most of the time, except when the text to predict contains slang words.
For example, it wrongly predicts "The product is idiot proof." as Negative.
So, I want to add some labeled example sentences containing slang words to the training dataset in order to improve the model's performance on sentences with slang.
For example:
[
{"doc":"I am having a blast with this game.", "sentiment": "Postive"},
{"doc":"This game is like pigeon chess", "sentiment": "Negative"},
...
]
I found SlangSD, a sentiment lexicon of slang words. For my project, it has two drawbacks as a training dataset:
it has only words, not sentences in each entry;
it contains not only slang words but also many ordinary words, such as "have","project","dictionary",etc.
I don't know what degree of slang you are targeting, but by intersecting SlangSD with a common English dictionary you might get a list of true slang terms.
Then, scraping a movie/game/forum website and keeping only the comments/posts that contain terms from your new slang list should do the trick, I believe (giving you a set of sentences with slang terms). For the labels, it would be imperfect, but quite viable I believe, to reuse the label of the SlangSD word found in the sentence.
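A rough sketch of that filtering step; it assumes SlangSD is stored as one tab-separated "term<TAB>score" entry per line, and english_words.txt is a hypothetical plain word list standing in for the common dictionary, so adjust the parsing to the files you actually have.
# Hypothetical file names and formats; adapt to your copies of SlangSD and the dictionary.
with open("SlangSD.txt", encoding="utf-8") as f:
    slangsd_terms = {line.split("\t")[0].strip().lower() for line in f if line.strip()}

with open("english_words.txt", encoding="utf-8") as f:
    common_words = {line.strip().lower() for line in f if line.strip()}

# Keep only the entries that do not appear in the ordinary dictionary.
true_slang = slangsd_terms - common_words
print(len(slangsd_terms), len(true_slang))
Scraped comments can then be kept only if they contain at least one term from true_slang and labeled with that term's SlangSD polarity, as described above.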

What's the difference between NLTK's BLEU score and SacreBLEU?

I'm curious if anyone is familiar with the difference between using NLTK's BLEU score calculation and the SacreBLEU library.
In particular, I'm using both libraries' sentence BLEU scores, averaged over the entire dataset. The two give different results:
>>> import sacrebleu
>>> from nltk.translate import bleu_score
>>> from sacrebleu import sentence_bleu
>>> print(len(predictions))
256
>>> print(len(targets))
256
>>> prediction = "this is the first: the world's the world's the world's the \
... world's the world's the world's the world's the world's the world's the world \
... of the world of the world'"
...
>>> target = "al gore: so the alliance for climate change has launched two campaigns."
>>> print(bleu_score.sentence_bleu([target], prediction))
0.05422283394039736
>>> print(sentence_bleu(prediction, [target]).score)
0.0
>>> print(sacrebleu.corpus_bleu(predictions, [targets]).score)
0.678758518214081
>>> print(bleu_score.corpus_bleu([targets], [predictions]))
0
As you can see, there are a lot of confusing inconsistencies going on. There's no way that my BLEU score is 67.8%, but it's also not supposed to be 0% (there are a lot of overlapping n-grams like "the").
I'd appreciate it if anyone could shed some light on this. Thanks.
NLTK and SacreBLEU use different tokenization rules, mostly in how they handle punctuation. NLTK uses its own tokenization, whereas SacreBLEU replicates the original Perl implementation from 2002. The tokenization rules are probably more elaborate in NLTK, but they make the numbers incomparable with the original implementation.
The corpus BLEU that you got from SacreBLEU is not 67.8% but about 0.68% – SacreBLEU's numbers are already multiplied by 100, unlike NLTK's. So I would not say there is a huge difference between the scores.
Sentence-level BLEU can use different smoothing techniques that ensure the score stays reasonable even when the 3-gram or 4-gram precision is zero. However, note that BLEU is very unreliable as a sentence-level metric.
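As a small illustration of the smoothing point (made-up sentences; note that NLTK's sentence_bleu expects token lists rather than raw strings, and that SacreBLEU reports scores on a 0-100 scale):
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import sacrebleu

hyp = "the cat sat quietly on the mat".split()
ref = "the cat is on the mat".split()

# Without smoothing, a zero 4-gram precision drives the NLTK score to 0;
# method1 adds a small epsilon to zero counts so the score stays informative.
smooth = SmoothingFunction().method1
print(sentence_bleu([ref], hyp, smoothing_function=smooth))

# SacreBLEU applies its own tokenization and multiplies by 100.
print(sacrebleu.sentence_bleu(" ".join(hyp), [" ".join(ref)]).score)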

I need to turn the texts into vectors then feed the vectors into a classifier

I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, with 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s to train and 0s to test), and then feed the vectors into a classifier like random forest.
For such a task you'll need to preprocess your data first with different tools. First, lower-case all your sentences. Then remove all stopwords (the, and, or, ...). Tokenize (an introduction here: https://medium.com/#makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming in order to keep only the root of each word; it can be helpful for sentiment classification.
Then you'll assign an index to each word of your vocabulary and replace the words in your sentences with these indices:
Imagine your vocabulary is : ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['None'] = 0 #in case a new word is not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1 2 3].
However, you have to define a maximum length max_len for your sentences, and when a sentence contains fewer words than max_len you pad the vector to size max_len with zeros.
In the previous example, if max_len = 5 then [1 2 3] -> [1 2 3 0 0].
This is a basic approach. Feel free to check the preprocessing tools provided by libraries such as NLTK, Pandas, etc.
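A minimal sketch of that pipeline, assuming a tab-separated file with a label column followed by the text (adjust sep, column names, and max_len to your data); NLTK's tokenizer and stopword list require a one-time nltk.download of the 'punkt' models and the 'stopwords' corpus.
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

df = pd.read_csv("movie_reviews.csv", sep="\t", names=["label", "text"])

stop = set(stopwords.words("english"))

def preprocess(sentence):
    # lower-case, tokenize, drop stopwords
    return [t for t in word_tokenize(sentence.lower()) if t not in stop]

docs = [preprocess(s) for s in df["text"]]

# word -> index mapping, reserving 0 for unknown/padding
vocab = {"None": 0}
for doc in docs:
    for token in doc:
        vocab.setdefault(token, len(vocab))

max_len = 50
def encode(doc):
    ids = [vocab.get(t, 0) for t in doc][:max_len]
    return ids + [0] * (max_len - len(ids))  # pad with zeros up to max_len

X = [encode(doc) for doc in docs]
y = df["label"].tolist()
The rows of X, together with the labels y, can then be passed to a classifier such as scikit-learn's RandomForestClassifier.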

Is there a notation named sensor in nltk?

I am learning Stanford CS224N: Natural Language Processing with Deep Learning.
Chris said
"very fine-grain differences between sensors that are a human being
can barely understand the difference between them and relate to"
in Lecture 1 while illustrating a piece of NLTK code.
Is there a notation named "sensor" in NLTK? If yes, what does it mean?
I think that YouTube's automatic captioning is wrong and that the lecturer actually pronounced the word synset.
And yes, there is a notation for synsets in NLTK; in fact, the notation comes from WordNet.
You can get a synset with:
from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
where dog is the morphological stem of one of the lemmas, n is the part of speech (noun in this case), and 01 is the sense index.
According to the NLTK documentation:
Synset(wordnet_corpus_reader)
Create a Synset from a "lemma.pos.number" string where:
lemma is the word's morphological stem,
pos is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB,
number is the sense number, counting from 0.
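A quick way to see what that notation gives you (the commented output is indicative of WordNet's noun entries):
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.definition())    # a member of the genus Canis ...
print(dog.lemma_names())   # ['dog', 'domestic_dog', 'Canis_familiaris']

# All synsets whose lemmas include 'dog', across parts of speech
print(wn.synsets('dog'))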

wordnet on different text?

I am new to NLTK, and I find the WordNet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. However, it fails to give the similarity between locations like 'Delhi' and 'Hyderabad', obviously because these words are not in the WordNet corpus.
So I would like to know whether I can somehow update the WordNet corpus OR create a WordNet over a different corpus, e.g. a set of pages extracted from Wikipedia related to travel. If we can create a WordNet over a different corpus, what would be the format, the steps to do so, and the limitations?
Please point me to links that address the above concerns. I have searched the internet, googled, and read portions of the NLTK book, but I don't have a single hint for the above question.
Pardon me if the question sounds completely ridiculous.
For flexibility in measuring the semantic similarity of very specific terms like Delhi or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically learned similarity measure from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
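If you want to try it in Python rather than R, here is a sketch of the Normalized Google Distance formula from that paper; the hit counts fx, fy, fxy and the index size n are inputs you would have to obtain from a search API yourself, and the numbers below are purely illustrative.
from math import log

def ngd(fx, fy, fxy, n):
    # Normalized Google Distance: smaller values mean the two terms
    # co-occur more often than chance would suggest.
    lfx, lfy, lfxy = log(fx), log(fy), log(fxy)
    return (max(lfx, lfy) - lfxy) / (log(n) - min(lfx, lfy))

# e.g. hits for "Delhi", hits for "Hyderabad", hits for both together, total index size
print(ngd(fx=1_000_000, fy=800_000, fxy=50_000, n=50_000_000_000))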
The similarity measures in WordNet work as expected because WordNet measures semantic similarity. In that sense, both are cities, so they are very similar. What you are looking for is probably called geographic similarity.
from nltk.corpus import wordnet as wn

delhi = wn.synsets('Delhi', 'n')[0]
print(delhi.definition())
# a city in north central India
hyderabad = wn.synsets('Hyderabad', 'n')[0]
print(hyderabad.definition())
# a city in southern Pakistan on the Indus River
print(delhi.wup_similarity(hyderabad))
# 0.9
melon = wn.synsets('melon', 'n')[0]
print(delhi.wup_similarity(melon))
# 0.3
There is a Wordnet extension, called Geowordnet. I kind of had the same problem as you at one point and tried to unify Wordnet with some of its extensions: wnext. Hope that helps.