'LazyCorpusLoader' object is not iterable - nltk

The following example creates an anagram dictionary.
However, it throws a TypeError: 'LazyCorpusLoader' object is not an iterator:
import nltk
from nltk.corpus import words

anagrams = nltk.defaultdict(list)
for word in words:
    key = ''.join(sorted(word))
    anagrams[key].append(word)

print(anagrams['aeilnrt'])

You have to use the .words() method on the words corpus object.
Specifically: change
for word in words:
to
for word in words.words():
and it should work.
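Putting that fix into the original snippet, the only change is the words.words() call; everything else is kept exactly as in the question:
import nltk
from nltk.corpus import words

# nltk.defaultdict is kept from the question; collections.defaultdict works the same
anagrams = nltk.defaultdict(list)

# words is a LazyCorpusLoader; calling words.words() returns the actual word list
for word in words.words():
    key = ''.join(sorted(word))
    anagrams[key].append(word)

print(anagrams['aeilnrt'])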

I got the same error when using stopwords from nltk.
The error occurred with the following import:
from nltk.corpus import stopwords
Replacing the above with the following worked for me:
# from nltk.corpus import stopwords
stopwords = nltk.corpus.stopwords.words('english')
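For illustration, a minimal usage sketch with the list loaded this way (it assumes the stopwords data has already been downloaded via nltk.download('stopwords')):
import nltk

# a plain Python list of English stopwords
stopwords = nltk.corpus.stopwords.words('english')

tokens = ['this', 'is', 'a', 'simple', 'example']
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['simple', 'example']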

Unable to print output of JSON code into a .csv file

I'm getting the following errors when trying to decode this data; the second error appeared after I tried to compensate for the Unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the second part (as below), I have no problem running this. I think your first error is a consequence of the second one, i.e. of the open line being incorrect.
I am on Python 3 and can run your code with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
I personally would use requests.
import requests
import csv

data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error: looking at the documentation for the open function, you need to use the correct argument names (encoding rather than encode) and name the mode argument if it is not matched positionally.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:

How to remove unusual characters from JSON dump in Python?

I have been searching around for a good way to remove all unusual characters from a JSON dump of tweets that I am using to compile a dataset for sentiment analysis.
characters I am trying to remove = ンボ チョボ付 最安値
These characters appear in my tweet data and I am trying to remove them using regex but to no avail.
import json
import csv
import pandas as pd
import matplotlib.pyplot as plt

tweets_data_path = 'twitter_data.txt'
tweets_data = []
tweets_text_data = []

tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

for tweet in tweets_data:
    if tweet['text']:
        tweets_text_data.append(tweet['text'])

print(tweets_text_data)

with open('dataset_file', 'w') as dataset_file:
    writer = csv.writer(dataset_file)
    writer.writerow(tweets_text_data)
I tried using re.sub() to take away these characters, but it will not work. How can I make this work?
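No answer is recorded here, but for illustration, one common approach is a sketch like the following; it assumes "unusual" means any non-ASCII character, which may be too broad if accented letters or emoji should be kept:
import re

def strip_non_ascii(text):
    # Remove every character outside the ASCII range; adjust the pattern
    # if some non-ASCII characters should be preserved.
    return re.sub(r'[^\x00-\x7F]+', '', text)

print(strip_non_ascii('price 最安値 today'))  # 'price  today'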

raise_FirstSetError in SpaCy topic modeling

I want to create an LDA topic model and am using SpaCy to do so, following a tutorial. The error I receive when I try to use spacy is one I cannot find on Google, so I'm hoping someone here knows what it's about.
I'm running this code on Anaconda:
import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt

df = pd.DataFrame(data)

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes punctuations
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:1])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])
And I receive the following error:
File "C:\Users\maart\AppData\Local\Continuum\anaconda3\lib\site-packages\_regex_core.py", line 1880, in get_firstset
raise _FirstSetError()
_FirstSetError
The error must occur somewhere after the lemmatization, because the other parts work fine.
Thanks a bunch!
I had this same issue and I was able to resolve it by uninstalling regex (I had the wrong version installed) and then running python -m spacy download en again. This will reinstall the correct version of regex.

Dill/Pickling Error: odd numbers of set items

I have a very strange error that cannot be reproduced anywhere except my production environment. What does this error mean? I get it when I try to run the following piece of code:
serialized_object = dill.dumps(object)
dill.loads(serialized_object)
pickle.UnpicklingError: odd number of items for SET ITEMS
I'd never seen this before, so I looked at the source code. See here: https://github.com/python/cpython/blob/f24143b25e4f83368ff6182bebe14f885073015c/Modules/_pickle.c#L5914. It seems the implication is that you have a corrupted or hostile pickle.
Based on the OP's comments, I think I see the workaround. I'll have to determine the impact of the workaround, and it will have to be integrated into dill, but for now here it is:
>>> import StringIO as io
>>> f = io.StringIO()
>>> import dill
>>> import numpy as np
>>> x = np.array([1])
>>> y = (x,)
>>> p = dill.Pickler(f)
>>> p.dump(x)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb."
>>> p.dump(y)
>>> f.getvalue()
"cnumpy.core.multiarray\n_reconstruct\np0\n(cnumpy\nndarray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I1\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\ntp14\nb.(g5\ntp15\n."
>>> dill.loads(_)
array([1])
>>>
import dill
import numpy as np
x = np.array([1])
y = (x,)
dill.dumps(x)
dill.loads(dill.dumps(y))
This will throw an index-out-of-range exception. The reason is that there is a special function registered to serialize numpy array objects. That special function uses the global Pickler to store the serialized data instead of the Pickler that is passed as an argument. To fix it, I used the Pickler passed as an argument instead. I'm not sure whether it breaks anything else in dill, though.

Is there a corpus of English words in nltk?

Is there any way to get the list of English words in python nltk library?
I tried to find it, but the only thing I have found is wordnet from nltk.corpus. Based on the documentation, however, it does not have what I need (it finds synonyms for a word).
I know how to find the list of these words by myself (this answer covers it in detail), so I am interested in whether I can do this using only the nltk library.
Yes, from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
Other than the nltk.corpus.words that #salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print words.readme()
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print words.words()[:10]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print len(wordlist)
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print len(wordlist_lowercased)
49815
Note that brown.words() contains words in both lower and upper case, like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use the FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
For more on how to access corpora and process them in NLTK, see http://www.nltk.org/book/ch01.html.