I want to compare the output from the wordnet.synsets function in the NLTK library. For example, when I run:
from nltk.corpus import wordnet as wn
wn.synsets('dog')
I get this output:
[Synset('dog.n.01'),
Synset('frump.n.01'),
Synset('dog.n.03'),
Synset('cad.n.01'),
Synset('frank.n.02'),
Synset('pawl.n.01'),
Synset('andiron.n.01'),
Synset('chase.v.01')]
Now if I try:
from nltk.corpus import wordnet as wn
wn.synsets('dogg')
I get this output:
[]
How can I compare the outputs in the console to build logic around it? The logic could be: "if the word is not in the output, then it is misspelled".
Thank you in advance.
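One way to build that logic (a minimal sketch; the helper name is_possibly_misspelled is just for illustration) is to treat an empty list from wn.synsets as "word not found". Keep in mind that WordNet does not contain every English word (e.g. many function words), so this is only a rough misspelling check:
from nltk.corpus import wordnet as wn

def is_possibly_misspelled(word):
    # wn.synsets returns a list of Synset objects; an empty list means
    # WordNet has no entry for this word
    return len(wn.synsets(word)) == 0

print(is_possibly_misspelled('dog'))   # False
print(is_possibly_misspelled('dogg'))  # True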
I am getting this error while importing a JSON dataset from a website.
JSONDecodeError: Expecting value: line 1 column 2 (char 1)
I am working in Colaboratory and wanted to import the sarcasm dataset, but since I don't know JSON, I am stuck. I have tried different placements of the backslash (\) character and also changing the -o parameter, but nothing works correctly. My code (reprex):
!wget --no-check-certificate \ https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json -o /tmp/sarcasm.json
import json
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#importing the Sarcasm dataset from !wget --no-check-certificate \ https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
#-o /tmp/sarcasm.json
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)
datastore = json.detect_encoding()
print (datastore)
sentences = []
labels = []
urls = []
I think the problem might be that the data is being imported in HTML format, which has to be converted to JSON format (or something compatible with it). Any help would be appreciated! :)
In my case, I was able to resolve this by replacing single quotes with double quotes:
import json
a = "['1','2']"
json.loads(a.replace("'", '"'))
I suspect you are saving the log of the transaction (instead of the document itself) to /tmp/sarcasm.json.
Try --output-document=sarcasm.json instead:
wget --no-check-certificate "https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json" --output-document=sarcasm.json
There is no need to detect the encoding; the json library will take care of it.
Remove the line below and try again:
datastore = json.detect_encoding()
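With that line removed (and assuming the wget call has been fixed so /tmp/sarcasm.json actually contains the JSON document), the loading step is just:
import json

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)
print(datastore)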
Try using -O instead of -o:
!wget --no-check-certificate \
https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json -O /tmp/sarcasm.json
I want to create an LDA topic model and am using spaCy to do so, following a tutorial. The error I receive when I try to use spaCy is one I cannot find on Google, so I'm hoping someone here knows what it's about.
I'm running this code on Anaconda:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
df = pd.DataFrame(data)
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations
data_words = list(sent_to_words(data))
print(data_words[:1])
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out
nlp = spacy.load('en', disable=['parser', 'ner'])
# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
And I receive the following error:
File "C:\Users\maart\AppData\Local\Continuum\anaconda3\lib\site-packages\_regex_core.py", line 1880, in get_firstset
raise _FirstSetError()
_FirstSetError
The error must occur somewhere after the lemmatization, because the other parts work fine.
Thanks a bunch!
I had this same issue and I was able to resolve it by uninstalling regex (I had the wrong version installed) and then running python -m spacy download en again. This will reinstall the correct version of regex.
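As a quick sanity check after reinstalling (a sketch, using the same spaCy 2.x 'en' shortcut as the question), the model should now load and lemmatize without raising _FirstSetError:
import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])
doc = nlp("The striped bats were hanging on their feet")
print([token.lemma_ for token in doc])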
The following example creates an anagram dictionary.
However, it throws a TypeError: 'LazyCorpusLoader' object is not an iterator:
import nltk
from nltk.corpus import words
anagrams = nltk.defaultdict(list)
for word in words:
    key = ''.join(sorted(word))
    anagrams[key].append(word)
print(anagrams['aeilnrt'])
You have to use the .words() method on the words corpus object.
Specifically: change
for word in words:
to
for word in words.words():
and it should work.
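Putting that fix back into the original example, the full version looks like this:
import nltk
from nltk.corpus import words

anagrams = nltk.defaultdict(list)
for word in words.words():          # iterate over the actual word list, not the lazy loader
    key = ''.join(sorted(word))     # anagrams share the same sorted-letter key
    anagrams[key].append(word)
print(anagrams['aeilnrt'])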
I got the same error when importing stopwords from NLTK.
The error occurred with the import below:
from nltk.corpus import stopwords
Replacing the above with the following worked for me:
# from nltk.corpus import stopwords
stopwords = nltk.corpus.stopwords.words('english')
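For completeness, a small usage sketch of the resulting stopword list (the token list here is made up purely for illustration):
import nltk

stopwords = nltk.corpus.stopwords.words('english')
tokens = ['this', 'is', 'a', 'simple', 'example']
print([t for t in tokens if t not in stopwords])  # ['simple', 'example']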
How can I find the maximum depth of the taxonomy for WordNet 3.0 (is-a relationships for synsets)?
I read some papers and found that it is 16 for WordNet 1.7.1.
I'm wondering what the value is for WordNet 3.0.
You can try the WordNet interface in Python NLTK.
Iterate through each synset in WordNet and find the distance to its topmost hypernym:
>>> from nltk.corpus import wordnet as wn
>>> max(max(len(hyp_path) for hyp_path in ss.hypernym_paths()) for ss in wn.all_synsets())
20
To find the possible paths of a synset to its topmost hypernym:
>>> wn.synset('dog.n.1')
Synset('dog.n.01')
>>> wn.synset('dog.n.1').hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
To find the maximum depth for a single synset:
>>> max(len(hyp_path) for hyp_path in wn.synset('dog.n.1').hypernym_paths())
14
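If you also want to know which synset sits at the maximum depth, a follow-up sketch (it scans all synsets again, so it takes a little while):
from nltk.corpus import wordnet as wn

deepest = max(wn.all_synsets(),
              key=lambda ss: max(len(path) for path in ss.hypernym_paths()))
print(deepest, max(len(path) for path in deepest.hypernym_paths()))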
Is there any way to get the list of English words in the Python NLTK library?
I tried to find it, but the only thing I have found is wordnet from nltk.corpus. However, based on the documentation, it does not have what I need (it finds synonyms for a word).
I know how to find the list of this words by myself (this answer covers it in details), so I am interested whether I can do this by only using nltk library.
Yes, from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
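If you need to test many words, membership checks against the plain list are slow; building a set once is much faster (a small sketch):
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())
print('fine' in english_vocab)  # True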
Other than the nltk.corpus.words that @salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print words.readme()
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print words.words()[:10]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print len(wordlist)
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print len(wordlist_lowercased)
49815
Note that brown.words() contains words in both lower and upper case, like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
For more on how to access corpora and process them in NLTK, see http://www.nltk.org/book/ch01.html.