NLTK Lesk issue - nltk

I am running a simple sentence disambiguation test, but the synset returned by NLTK's Lesk for the word 'cat' in the sentence "The cat likes milk" is 'kat.n.01' (synset id 3608870):
(n) kat, khat, qat, quat, cat, Arabian tea, African tea (the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant) "in Yemen kat is used daily by 85% of adults"
This is a simple phrase, and yet the disambiguation fails.
The same happens for many words across a set of more than one sentence; for example, in my test sentences I would expect 'dog' to be disambiguated as 'domestic dog', but Lesk gives me 'pawl' (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward).
Is this related to the size of the input, which in my test is only a few sentences?
Here is my test code:
from nltk.tag import PerceptronTagger
from nltk.wsd import lesk

def test_lesk():
    words = get_sample_words()
    print(words)
    tagger = PerceptronTagger()
    tags = tagger.tag(words)
    print(tags[:5])
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if pos is None:
            continue
        print("word=%s,tag=%s,pos=%s" % (word, tag, pos))
        synset = lesk(words, word, pos)
        if synset is None:
            print('No synsetid for word=%s' % word)
        else:
            print('word=%s, synsetname=%s, synsetid=%d' % (word, synset.name(), synset.offset()))
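For reference, the result reproduces with a minimal call (a sketch, assuming the punkt and wordnet corpora are downloaded). NLTK's lesk simply picks the synset whose dictionary gloss shares the most words with the context, so with a context this short an unrelated gloss can win:

from nltk import word_tokenize
from nltk.wsd import lesk

# Minimal reproduction of the reported behavior.
sent = word_tokenize("The cat likes milk")
print(lesk(sent, 'cat', 'n'))  # reportedly Synset('kat.n.01'), not the expected 'cat.n.01'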

Related

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging Face sentiment classifier

I'm trying to get the sentiment for comments with the help of a Hugging Face pretrained sentiment-analysis model. It's returning an error: "Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512)".
Below I'm attaching the code; please take a look.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all the rows to an empty list:
text = []
for index, row in data.iterrows():
    text.append(row['Review'])
I'm trying to get the sentiment for all the rows
sent = []
for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
Some of the sentences in your Review column of the data frame are too long. When these sentences are converted to tokens and sent into the model, they exceed the 512-token sequence-length limit: the embeddings of the model used in the sentiment-analysis task were trained with a maximum of 512 token positions.
To fix this issue, you can filter out the long sentences and keep only shorter ones (with token length < 512), or you can truncate the sentences with truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
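Applied to the whole data frame from the question, that looks like this (a minimal sketch, assuming the classifier pipeline and data defined above; truncation caps each review at the model's limit before it reaches the embedding layer):

# Classify every review, truncating any that exceed the model's 512-token limit.
sent = [classifier(review, truncation=True) for review in data['Review']]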
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
token = AutoTokenizer.from_pretrained("your model")
MAX_TOKENS = 510  # leave room for the sequence-start and sequence-end tokens
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.
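If you then need a plain string to hand to the classifier, one option (a sketch; convert_tokens_to_string is a standard tokenizer method, though whether you re-join tokens or simply re-encode depends on your pipeline) is:

# Rebuild a plain string from the truncated token list, then classify it.
short_text = token.convert_tokens_to_string(tokens)
result = classifier(short_text)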

Python: how to define a function that only accepts an integer input and can be used for multiple inputs

I am trying to make a program that has the user input numbers at several different points, and if the user inputs something other than a number, the program should ask the user again to input the number correctly. I was trying to define one function that I could use for all of them, but every time I run the program, it crashes. Any help would be much appreciated, thank you.
My code:
def error():
    global m1
    global m2
    global w1
    global w2
    while True:
        try:
            int(m1 or m2 or w1 or w2)
        except ValueError:
            try:
                float(m1 or m2 or w1 or w2)
            except ValueError:
                # this assignment is the crash: you cannot assign to an `or`
                # expression, so Python raises a SyntaxError
                m1 or m2 or w1 or w2 = input("please input your response correctly...")
        break

m1 = input("\nWhat was your first marking period percentage?")
error()
w1 = input("\nWhat is the weighting of the first marking period? (in decimal)")
error()
m2 = input("\nWhat was your second marking period percentage?")
error()
w2 = input("\nWhat is the weighting of the second marking period? (in decimal)")
error()
def user_input(msg):
    inp = input(msg)
    try:
        return int(inp) if inp.isnumeric() else float(inp)
    except ValueError as e:
        return user_input("Please enter a numeric value")

m1 = user_input("\nWhat was your first marking period percentage?")
w1 = user_input("\nWhat is the weighting of the first marking period? (in decimal)")
m2 = user_input("\nWhat was your second marking period percentage?")
w2 = user_input("\nWhat is the weighting of the second marking period? (in decimal)")
You should write your function to get one number at a time. If an exception is triggered somewhere, it should be handled. Note how the get_number function shown below keeps asking for a number while showing the prompt specified by its caller. If you are not running Python 3.6 or higher, you will need to comment out the call to print in the main function.
#! /usr/bin/env python3

def main():
    p1 = get_number('What is your 1st marking period percentage? ')
    w1 = get_number('What is the weighting of the 1st marking period? ')
    p2 = get_number('What is your 2nd marking period percentage? ')
    w2 = get_number('What is the weighting of the 2nd marking period? ')
    score = calculate_score((p1, p2), (w1, w2))
    print(f'Your score is {score:.2f}%.')

def get_number(prompt):
    while True:
        try:
            text = input(prompt)
        except EOFError:
            raise SystemExit()
        else:
            try:
                number = float(text)
            except ValueError:
                print('Please enter a number.')
            else:
                break
    return number

def calculate_score(percentages, weights):
    if len(percentages) != len(weights):
        raise ValueError('percentages and weights must have same length')
    return sum(p * w for p, w in zip(percentages, weights)) / sum(weights)

if __name__ == '__main__':
    main()
With the following code you can make a function that only accepts an integer value:
def input_type(a):
    if type(10) == type(a):
        print("integer")
    else:
        print("not integer")

a = int(input())
input_type(a)
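Note that int(input()) will itself raise a ValueError before input_type is ever called if the text is not an integer. A reusable sketch closer to what the question asks for (read_int is a hypothetical helper, not from the answers above) keeps prompting until int() accepts the input:

def read_int(prompt="Please enter an integer: "):
    # Loop until the user types something int() can parse.
    while True:
        try:
            return int(input(prompt))
        except ValueError:
            print("Please input your response correctly...")

m1 = read_int("\nWhat was your first marking period percentage? ")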

Comparing the words with the original file in R

I have the original dataset in JSON format. Let's load it in R.
library("rjson")
setwd("mydir")
getwd()
json_data <- fromJSON(paste(readLines("N1.json"), collapse=""))
uu <- unlist(json_data)
uutext <- uu[names(uu) == "text"]
And I have another dataset, mydata2:
mydata2 = read.csv(path to data/words)
I need to find the words from mydata2 that are present in the messages in the JSON file, and then write those messages into a new document, "xyz.txt". How can I do that?
chalk indirect pick reaction team skip pumpkin surprise bless ignorance
1 time patient road extent decade cemetery staircase monarch bubble abbey
2 service conglomerate banish pan friendly position tight highlight rice disappear
3 write swear break tire jam neutral momentum requirement relationship matrix
4 inspire dose jump promote trace latest absolute adjust joystick habit
5 wrong behave claim dedicate threat sell particle statement teach lamb
6 eye tissue prescription problem secretion revenge barrel beard mechanism platform
7 forest kick face wisecrack uncertainty ratio complain doubt reflection realism
8 total fee debate hall soft smart sip ritual pill category
9 contain headline lump absorption superintendent digital increase key banner second
I mean:
chalk-1 number1; indirect-2 number2; …
The template is:
Word1-1 number1-1; Word1-2 number1-2; …; Word1-10 number1-10
Word2-1 number2-1; Word2-2 number2-2; …; Word2-10 number2-10
Next time, please include real data. A simplified model:
library(data.table)
word = c("test","meh","blah")
jsonF = c("let's do test", "blah is right", "test blah", "test test")
outp <- list()
for (i in 1:length(word)) {
  outp[[i]] = as.data.frame(grep(word[i], jsonF, v=T, fixed=T))  # possibly, ignore.case=T
}
qq = rbindlist(outp)
qq = unique(qq)
print(qq)
1: let's do test
2: test blah
3: test test
4: blah is right
Edit: quick and dirty paste/collapse:
library(data.table)
x = LETTERS[1:10]
y = LETTERS[11:20]
df = rbind(x,y)
L = list()
for (i in 1:nrow(df)) {
  L[i] = paste0(df[i,], "-", seq(1,10), " ", i, "-", seq(1,10), collapse="; ")
}
Fin = cbind(L)
View(Fin)
Gives:
> Fin
L
[1,] "A-1 1-1; B-2 1-2; C-3 1-3; D-4 1-4; E-5 1-5; F-6 1-6; G-7 1-7; H-8 1-8; I-9 1-9; J-10 1-10"
[2,] "K-1 2-1; L-2 2-2; M-3 2-3; N-4 2-4; O-5 2-5; P-6 2-6; Q-7 2-7; R-8 2-8; S-9 2-9; T-10 2-10"

How to obtain a random word using function and dictionary? (Python 3)

The user has to select a category, and from there the program has to generate a random word from the category list. If the user selects an invalid category, the program will prompt the user to choose a category again (loop the askCat function again).
import random
#Make word dictionary
wordDict = {'Animals': ['Giraffe', 'Dog', 'Dolphin', 'Rabbit', 'Butterfly'],
            'Fruits': ['Apple', 'Pineapple', 'Durian', 'Orange', 'Rambutan'],
            'Colours': ['Red', 'Blue', 'Yellow', 'Green', 'Purple'],
            'Shapes': ['Heart', 'Circle', 'Rectangle', 'Square', 'Diamond']}

#Determine word category and random word
def askCat(wordDict):
    category = str(input("To start the game, please choose a category: \n Animals (a), Fruits (f), Colours (c), Shapes (s) "))
    print()
    if category == 'a':
        print("You chose the Animals category.")
        cat = (wordDict['Animals'])
    elif category == 'f':
        print("You chose the Fruits category.")
        cat = (wordDict['Animals'])
    elif category == 'c':
        print("You chose the Colours category.")
        cat = (wordDict['Animals'])
    elif category == 's':
        print("You chose the Shapes category.")
        cat = (wordDict['Animals'])
    else:
        print("You entered an invalid category. Try again!")
        print()
        askCat(wordDict)
    return random.choice(cat)

#Print random word
randWord = askCat(wordDict)
print(randWord)
When the user enters a valid category on the first try, the program works just fine. However, the problem I'm facing is that when the user enters an invalid category the first time and then a valid category the second time, the program doesn't work anymore.
Please do help! Thanks (:
else:
    print("You entered an invalid category. Try again!")
    print()
    askCat(wordDict)
return random.choice(cat)
In the else branch, you are recursively calling the function again (which is okay), but then you discard its return value and return cat instead, which in this call of the function was never set.
Instead, you should return the value from the recursive call:
else:
    print("You entered an invalid category. Try again!")
    print()
    return askCat(wordDict)
return random.choice(cat)
That way, when you call it recursively, the result from that call will be used, and not the one you tried to get from the current cat.
Furthermore, in each of your branches you are doing cat = (wordDict['Animals']); you probably want to change that so you actually get fruits for f, etc.
And finally, while using recursion is okay, it's not the best way to handle this. Recursion always has a maximum depth it can go to, so in the worst case a user could keep answering the wrong thing, growing the recursion stack further until the program errors out. If you want to avoid that, you should use a standard loop instead:
cat = None
while not cat:
    # You don't need to use `str()` here; input always returns a string
    category = input("To start the game, please choose a category: \n Animals (a), Fruits (f), Colours (c), Shapes (s) ")
    print()
    if category == 'a':
        print("You chose the Animals category.")
        cat = wordDict['Animals']  # no need to use parentheses here
    elif category == 'f':
        print("You chose the Fruits category.")
        cat = wordDict['Fruits']
    # ... and so on for 'c' and 's'
    else:
        print("You entered an invalid category. Try again!")
        # the loop will automatically repeat, as `cat` wasn't set
# when we reach here, `cat` has been set
return random.choice(cat)
In your function askCat, if the user first enters a wrong category, you call askCat again. However, you don't return the value returned by that call.
Replace (in the function askCat):
askCat(wordDict)
with:
return askCat(wordDict)
However, I would strongly recommend using a while loop instead.
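For example, a compact loop-based sketch (my own variant, not from the answers above; it collapses the if/elif chain into a hypothetical keymap lookup dict):

import random

# Map the one-letter answer to the dictionary key.
keymap = {'a': 'Animals', 'f': 'Fruits', 'c': 'Colours', 's': 'Shapes'}

def askCat(wordDict):
    while True:
        category = input("To start the game, please choose a category: \n Animals (a), Fruits (f), Colours (c), Shapes (s) ")
        if category in keymap:
            print("You chose the %s category." % keymap[category])
            return random.choice(wordDict[keymap[category]])
        print("You entered an invalid category. Try again!")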

Word frequency count based on two words using Python

There are many resources online that show how to do a word count for a single word,
like this and this and this and others...
But I was not able to find a concrete example for counting the frequency of two-word phrases.
I have a csv file that has some strings in it.
FileList = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
So I want the output to be like :
wordscount = {"I love": 2, "show makes": 2, "makes me" : 2 }
Of course I will have to strip all the commas, question marks, etc.: {!, ,, ", ', ?, ., (, ), [, ], ^, %, #, #, &, *, -, _, ;, /, \, |, }
I will also remove some stop words which I found here just to get more concrete data from the text.
How can I achieve this result using Python?
Thanks!
>>> from collections import Counter
>>> import re
>>>
>>> sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
>>> words = re.findall(r'\w+', sentence)
>>> two_words = [' '.join(ws) for ws in zip(words, words[1:])]
>>> wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
>>> wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
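If you also want the stop-word filtering the question mentions, one approach (a sketch; the stop-word set below is a hypothetical placeholder, since the question's actual list sits behind a link) is to drop bigrams containing a stop word before counting:

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
stop_words = {'i', 'me', 'also'}  # hypothetical subset; substitute your real list

words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
# keep only bigrams where neither word is a stop word
filtered = [bg for bg in two_words if not any(w.lower() in stop_words for w in bg.split())]
wordscount = {w: f for w, f in Counter(filtered).most_common() if f > 1}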