NLTK single-word part-of-speech tagging - nltk

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?
For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}
I have not been able to find a POS tagger that tags parts of speech for words outside the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.

Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']

POS models are trained on sentence/document-based data, so the expected input to the pre-trained model is a sentence/document. When there's only a single word, it is treated as a single-word sentence, and hence there will be only one tag in that single-word sentence context.
If you're trying to find all possible POS tags per English word, you would need a corpus with many different uses of the words, then tag the corpus and count/extract the number of tags per word. E.g.
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from itertools import chain
>>> from collections import defaultdict, Counter
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
Alternatively, there's WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.

Convert a string representation of a list to a list

I was wondering what the simplest way is to convert a string representation of a list like the following to a list:
x = '[ "A","B","C" , " D"]'
Even in cases where the user puts spaces in between the commas, and spaces inside of the quotes, I need to handle that as well and convert it to:
x = ["A", "B", "C", "D"]
I know I can strip spaces with strip() and split() and check for non-letter characters. But the code was getting very kludgy. Is there a quick function that I'm not aware of?
>>> import ast
>>> x = '[ "A","B","C" , " D"]'
>>> x = ast.literal_eval(x)
>>> x
['A', 'B', 'C', ' D']
>>> x = [n.strip() for n in x]
>>> x
['A', 'B', 'C', 'D']
ast.literal_eval:
With ast.literal_eval you can safely evaluate an expression node or a string containing a Python literal or container display. The string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, booleans, and None.
The json module is a better solution whenever there is a stringified list of dictionaries. The json.loads(your_data) function can be used to convert it to a list.
>>> import json
>>> x = '[ "A","B","C" , " D"]'
>>> json.loads(x)
['A', 'B', 'C', ' D']
Similarly
>>> x = '[ "A","B","C" , {"D":"E"}]'
>>> json.loads(x)
['A', 'B', 'C', {'D': 'E'}]
eval is dangerous - you shouldn't execute user input.
If you have Python 2.6 or newer, use ast instead of eval:
>>> import ast
>>> ast.literal_eval('["A","B" ,"C" ," D"]')
["A", "B", "C", " D"]
Once you have that, strip the strings.
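For example:
>>> [s.strip() for s in ast.literal_eval('["A","B" ,"C" ," D"]')]
['A', 'B', 'C', 'D']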
If you're on an older version of Python, you can get very close to what you want with a simple regular expression:
>>> x='[ "A", " B", "C","D "]'
>>> re.findall(r'"\s*([^"]*?)\s*"', x)
['A', 'B', 'C', 'D']
This isn't as good as the ast solution, for example it doesn't correctly handle escaped quotes in strings. But it's simple, doesn't involve a dangerous eval, and might be good enough for your purpose if you're on an older Python without ast.
There is a quick solution:
x = eval('[ "A","B","C" , " D"]')
Unwanted whitespaces in the list elements may be removed in this way:
x = [x.strip() for x in eval('[ "A","B","C" , " D"]')]
Inspired by some of the answers above that work with base Python packages, I compared the performance of a few (using Python 3.7.3):
Method 1: ast
import ast
list(map(str.strip, ast.literal_eval(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, ast.literal_eval(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import ast', number=100000)
# 1.292875313000195
Method 2: json
import json
list(map(str.strip, json.loads(u'[ "A","B","C" , " D"]')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, json.loads(u'[ \"A\",\"B\",\"C\" , \" D\"]')))", setup='import json', number=100000)
# 0.27833264000014424
Method 3: no import
list(map(str.strip, u'[ "A","B","C" , " D"]'.strip('][').replace('"', '').split(',')))
# ['A', 'B', 'C', 'D']
import timeit
timeit.timeit(stmt="list(map(str.strip, u'[ \"A\",\"B\",\"C\" , \" D\"]'.strip('][').replace('\"', '').split(',')))", number=100000)
# 0.12935059100027502
I was disappointed to see that what I considered the method with the worst readability was also the method with the best performance... there are trade-offs to consider when going with the most readable option. For the type of workloads I use Python for, I usually value readability over a slightly more performant option, but as usual it depends.
import ast
l = ast.literal_eval('[ "A","B","C" , " D"]')
l = [i.strip() for i in l]
If it's only a one dimensional list, this can be done without importing anything:
>>> x = u'[ "A","B","C" , " D"]'
>>> ls = x.strip('[]').replace('"', '').replace(' ', '').split(',')
>>> ls
['A', 'B', 'C', 'D']
You can also do this with eval:
x = '[ "A","B","C" , " D"]'
print(list(eval(x)))
However, this is not a safe approach (I wasn't aware of the danger of eval when this answer was posted); the accepted answer using ast.literal_eval is the best one.
There isn't any need to import anything or to evaluate. You can do this in one line for most basic use cases, including the one given in the original question.
One liner
l_x = [i.strip() for i in x[1:-1].replace('"',"").split(',')]
Explanation
x = '[ "A","B","C" , " D"]'
# String indexing to eliminate the brackets.
# Replace, as split will otherwise retain the quotes in the returned list
# Split to convert to a list
l_x = x[1:-1].replace('"',"").split(',')
Outputs:
for i in range(0, len(l_x)):
    print(l_x[i])
# vvvv output vvvvv
'''
A
B
C
D
'''
print(type(l_x)) # out: class 'list'
print(len(l_x)) # out: 4
You can parse and clean up this list as needed using list comprehension.
l_x = [i.strip() for i in l_x] # list comprehension to clean up
for i in range(0, len(l_x)):
    print(l_x[i])
# vvvvv output vvvvv
'''
A
B
C
D
'''
Nested lists
If you have nested lists, it does get a bit more annoying. Without using regex (which would simplify the replace), and assuming you want to return a flattened list (and the zen of python says flat is better than nested):
x = '[ "A","B","C" , " D", ["E","F","G"]]'
l_x = x[1:-1].split(',')
l_x = [i
.replace(']', '')
.replace('[', '')
.replace('"', '')
.strip() for i in l_x
]
# returns ['A', 'B', 'C', 'D', 'E', 'F', 'G']
If you need to retain the nested list it gets a bit uglier, but it can still be done just with regular expressions and list comprehension:
import re
x = '[ "A","B","C" , " D", "["E","F","G"]","Z", "Y", "["H","I","J"]", "K", "L"]'
# Clean it up so the regular expression is simpler
x = x.replace('"', '').replace(' ', '')
# Look ahead for the bracketed text that signifies nested list
l_x = re.split(r',(?=\[[A-Za-z0-9\',]+\])|(?<=\]),', x[1:-1])
print(l_x)
# Flatten and split the non nested list items
l_x0 = [item for items in l_x for item in items.split(',') if not '[' in items]
# Convert the nested lists to lists
l_x1 = [
i[1:-1].split(',') for i in l_x if '[' in i
]
# Add the two lists
l_x = l_x0 + l_x1
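Running this on the example string above, l_x ends up as (assuming I've traced the splits correctly):
['A', 'B', 'C', 'D', 'Z', 'Y', 'K', 'L', ['E', 'F', 'G'], ['H', 'I', 'J']]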
This last solution will work on any list stored as a string, nested or not.
Assuming that all your inputs are lists and that the double quotes in the input actually don't matter, this can be done with a simple regexp replace. It is a bit perl-y, but it works like a charm. Note also that the output is now a list of Unicode strings; you didn't specify that you needed that, but it seems to make sense given Unicode input.
import re
x = u'[ "A","B","C" , " D"]'
junkers = re.compile('[[" \]]')
result = junkers.sub('', x).split(',')
print result
---> [u'A', u'B', u'C', u'D']
The junkers variable contains a compiled regexp (for speed) of all characters we don't want, using ] as a character required some backslash trickery.
The re.sub replaces all these characters with nothing, and we split the resulting string at the commas.
Note that this also removes spaces from inside entries u'["oh no"]' ---> [u'ohno']. If this is not what you wanted, the regexp needs to be souped up a bit.
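For example, one possible way to soup it up (an illustrative sketch, not part of the original answer) is to capture the quoted contents instead of deleting characters, which preserves spaces inside entries:
>>> import re
>>> re.findall(r'"(.*?)"', u'[ "oh no", "B" ]')
[u'oh no', u'B']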
If you know that your lists only contain quoted strings, this pyparsing example will give you your list of stripped strings (even preserving the original Unicode-ness).
>>> from pyparsing import *
>>> x =u'[ "A","B","C" , " D"]'
>>> LBR,RBR = map(Suppress,"[]")
>>> qs = quotedString.setParseAction(removeQuotes, lambda t: t[0].strip())
>>> qsList = LBR + delimitedList(qs) + RBR
>>> print qsList.parseString(x).asList()
[u'A', u'B', u'C', u'D']
If your lists can have more datatypes, or even contain lists within lists, then you will need a more complete grammar - like this one in the pyparsing examples directory, which will handle tuples, lists, ints, floats, and quoted strings.
You may run into such a problem while dealing with scraped data stored as a Pandas DataFrame.
This solution works like a charm if the list of values is present as text.
def textToList(hashtags):
    return hashtags.strip('[]').replace('\'', '').replace(' ', '').split(',')
hashtags = "[ 'A','B','C' , ' D']"
hashtags = textToList(hashtags)
Output: ['A', 'B', 'C', 'D']
No external library required.
This usually happens when you load a list stored as a string to CSV.
If you have your list stored in a CSV in the form the OP asked about:
x = '[ "A","B","C" , " D"]'
Here is how you can load it back to list:
import csv
with open('YourCSVFile.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    rows = list(reader)
listItems = rows[0]
listItems is now a list.
To further complete Ryan's answer using JSON, one very convenient function to convert Unicode is in this answer.
Example with double or single quotes:
>>> print byteify(json.loads(u'[ "A","B","C" , " D"]'))
>>> print byteify(json.loads(u"[ 'A','B','C' , ' D']".replace('\'','"')))
['A', 'B', 'C', ' D']
['A', 'B', 'C', ' D']
I would like to provide a more intuitive patterning solution with regex.
The below function takes as input a stringified list containing arbitrary strings.
Stepwise explanation:
You remove all whitespace, brackets and value separators (provided they are not part of the values you want to extract; otherwise make the regex more complex). Then you split the cleaned string on single or double quotes and take the non-empty values (or the odd-indexed values, whichever you prefer).
def parse_strlist(sl):
    import re
    clean = re.sub("[\[\],\s]","",sl)
    splitted = re.split("[\'\"]",clean)
    values_only = [s for s in splitted if s != '']
    return values_only
testsample: "['21',"foo" '6', '0', " A"]"
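For example, applying the function to the test sample:
>>> parse_strlist("""['21',"foo" '6', '0', " A"]""")
['21', 'foo', '6', '0', 'A']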
You can save yourself the .strip() function by just slicing off the first and last characters from the string representation of the list (see the third line below):
>>> mylist=[1,2,3,4,5,'baloney','alfalfa']
>>> strlist=str(mylist)
['1', ' 2', ' 3', ' 4', ' 5', " 'baloney'", " 'alfalfa'"]
>>> mylistfromstring=(strlist[1:-1].split(', '))
>>> mylistfromstring[3]
'4'
>>> for entry in mylistfromstring:
...     print(entry)
...     type(entry)
...
1
<class 'str'>
2
<class 'str'>
3
<class 'str'>
4
<class 'str'>
5
<class 'str'>
'baloney'
<class 'str'>
'alfalfa'
<class 'str'>
And with pure Python - not importing any libraries:
[x for x in x.split('[')[1].split(']')[0].split('"')[1:-1] if x not in [',', ' , ', ', ']]
So, following all the answers I decided to time the most common methods:
from time import time
import re
import json
import ast
my_str = str(list(range(19)))
print(my_str)
reps = 100000
start = time()
for i in range(0, reps):
    re.findall(r"\w+", my_str)
print("Regex method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    json.loads(my_str)
print("JSON method:\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    ast.literal_eval(my_str)
print("AST method:\t\t", (time() - start) / reps)
start = time()
for i in range(0, reps):
    [n.strip() for n in my_str]
print("strip method:\t", (time() - start) / reps)
regex method: 6.391477584838867e-07
json method: 2.535374164581299e-06
ast method: 2.4425282478332518e-05
strip method: 4.983267784118653e-06
So in the end regex wins!
This solution is simpler than some of those in the previous answers, but it requires matching all the features of the list.
x = '[ "A","B","C" , " D"]'
[i.strip() for i in x.split('"') if len(i.strip().strip(',').strip(']').strip('['))>0]
Output:
['A', 'B', 'C', 'D']

categorical_crossentropy expects targets to be binary matrices

First of all, I am not a programmer, but I am teaching myself Deep Learning to undertake a real project with my own dataset. My situation can be broken down as follows:
I am trying to undertake a multiclass text classification project. I have a corpus with 1000 examples, each example with 4 possible labels (A1, A2, B1, B2). They are mutually exclusive. All the examples are in separate folders and separate .txt files.
After a lot of effort and some man tears I managed to put together this code:
import os
import string
import keras
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import re
import numpy as np
import tensorflow as tf
from numpy import array
from sklearn.model_selection import KFold
from numpy.random import seed
seed(1)
tf.random.set_seed(1)
root="D:/bananaCorpus"
train_dir=os.path.join(root,"train")
texts=[]
labels=[]
for label in ["A1","A2","B1","B2"]:
    directory=os.path.join(train_dir,label)
    for fname in os.listdir(directory):
        if fname[-4:]==".txt":
            f = open(os.path.join(directory, fname),encoding="cp1252")
            texts.append(f.read())
            f.close()
            if label == 'A1':
                labels.append(0)
            elif label=="A2":
                labels.append(1)
            elif label=="B1":
                labels.append(2)
            else:
                labels.append(3)
print(texts)
print(labels)
print("Corpus Length", len( root), "\n")
print("The total number of reviews in the train dataset is", len(texts),"\n")
stops = set(stopwords.words("english"))
print("The number of stopwords used in the beginning: ", len(stops),"\n")
print("The words removed from the corpus will be",stops,"\n")
## This adds new words or terms from words_to_add list to the stop_words
words_to_add=[]
[stops.append(w) for w in words_to_add]
##This removes the words or terms from the words_to_remove list,
##so that they are no longer included in stopwords
words_to_remove=["i","having"]
[stops.remove(w) for w in words_to_remove ]
texts=[[w.lower() for w in word_tokenize("".join(str(review))) if w not in stops and w not in string.punctuation and len(w)>2 and w.isalpha()]for review in texts ]
print("costumized stopwords: ", stops,"\n")
print("count of costumized stopwords",len(stops),"\n")
print("**********",texts,"\n")
#vectorization
#tokenizing the raw data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
maxlen = 50
training_samples = 200
validation_samples = 10000
max_words = 10000
#delete?
tokens=keras.preprocessing.text.text_to_word_sequence(str(texts))
print("Sequence of tokens: ",tokens,"\n")
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
print("Tokens:", sequences,"\n")
word_index = tokenizer.word_index
print("Unique tokens:",word_index,"\n")
print(' %s unique tokens in total.' % len(word_index,),"\n")
print("Unique tokens: ", word_index,"\n")
print("Dictionary of words and their count:", tokenizer.word_counts,"\n" )
print(" Number of docs/seqs used to fit the Tokenizer:", tokenizer.document_count,"\n")
print(tokenizer.word_index,"\n")
print("Dictionary of words and how many documents each appeared in:",tokenizer.word_docs,"\n")
data = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded data","\n")
print(data)
#checking the encoding with a new document
text2="I like to study english in the morning and play games in the afternoon"
text2=[w.lower() for w in word_tokenize("".join(str(text2))) if w not in stops and w not in string.punctuation
and len(w)>2 and w.isalpha()]
sequences = tokenizer.texts_to_sequences([text2])
text2 = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded text2","\n")
print(text2)
#cross-validation
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape,"\n")
print('Shape of label tensor:', labels.shape,"\n")
print("labels",labels,"\n")
kf = KFold(n_splits=4, random_state=None, shuffle=True)
kf.get_n_splits(data)
print(kf)
KFold(n_splits=4, random_state=None, shuffle=True)
for train_index, test_index in kf.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
#Pretrained embedding
glove_dir = 'D:\glove'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'),encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print("Found %s words vectors fom GLOVE."% len(embeddings_index))
#Preparing the Glove word-embeddings matrix to pass to the embedding layer(max_words, embedding_dim)
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
# define vocabulary size (largest integer value)
# define model
from keras.models import Sequential
from keras.layers import Embedding,Flatten,Dense
from keras import layers
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))#vocabulary size + the size of glove version +max len of input documents.
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
#Loading pretrained word embeddings and Freezing the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
history=model.fit(X_train, y_train, epochs=6,verbose=2)
# evaluate
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))
However, I am getting this error:
Traceback (most recent call last):
  File "D:/banana.py", line 177, in <module>
    history=model.fit(X_train, y_train, epochs=6,verbose=2)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training.py", line 1154, in fit
    batch_size=batch_size)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training.py", line 642, in _standardize_user_data
    y, self._feed_loss_fns, feed_output_shapes)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\training_utils.py", line 284, in check_loss_and_target_compatibility
    ' while using as loss `categorical_crossentropy`. '
ValueError: You are passing a target array of shape (3, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
```
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
```
Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
I tried everything the error message says, but to no avail. After some research I came to the conclusion that the model is not trying to predict multiple classes, which is why the categorical_crossentropy loss is not being accepted. I then realized that if I change it to binary cross-entropy the error goes away, which really confirms that this is not working as a multiclass classification model.
What can I do to adjust my code to make it work as intended? Am I out of luck and have to start a whole different project?
Any type of guidance will be of immense help for me and my mental health.
You should make two changes. First, the number of neurons in the output layer of your network should match the number of classes, and the output layer should use the softmax activation:
model.add(Dense(4, activation='softmax'))
Then you should use the sparse_categorical_crossentropy loss as you are not one-hot encoding the labels:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Then the model should be able to train without errors.
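Alternatively, if you want to keep categorical_crossentropy, you can one-hot encode the integer labels with to_categorical, as the error message itself suggests. A minimal sketch, assuming the 4-neuron softmax output layer above:
from keras.utils import to_categorical
# one-hot encode the integer labels (0..3) to shape (samples, 4)
y_train_cat = to_categorical(y_train, num_classes=4)
y_test_cat = to_categorical(y_test, num_classes=4)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=6, verbose=2)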

Is there a corpus of English words in nltk?

Is there any way to get the list of English words in python nltk library?
I tried to find it but the only thing I have found is wordnet from nltk.corpus. But based on documentation, it does not have what I need (it finds synonyms for a word).
I know how to find the list of this words by myself (this answer covers it in details), so I am interested whether I can do this by only using nltk library.
Yes, from nltk.corpus import words
And check using:
>>> "fine" in words.words()
True
Reference: Section 4.1 (Wordlist Corpora), chapter 2 of Natural Language Processing with Python.
Other than the nltk.corpus.words that @salvadordali has highlighted:
>>> from nltk.corpus import words
>>> print words.readme()
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
>>> print words.words()[:10]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron']
Do note that nltk.corpus.words is a list of words without frequencies, so it's not exactly a corpus of natural text.
The corpus package contains various corpora, some of which are English corpora; see http://www.nltk.org/nltk_data/. E.g. nltk.corpus.brown:
>>> from nltk.corpus import brown
>>> brown.words()[:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
To get a word list from a natural text corpus:
>>> wordlist = set(brown.words())
>>> print len(wordlist)
56057
>>> wordlist_lowercased = set(i.lower() for i in brown.words())
>>> print len(wordlist_lowercased)
49815
Note that the brown.words() contains words with both lower and upper cases like natural text.
In most cases, a list of words is not very useful without frequencies, so you can use the FreqDist:
>>> from nltk import FreqDist
>>> from nltk.corpus import brown
>>> frequency_list = FreqDist(i.lower() for i in brown.words())
>>> frequency_list.most_common()[:10]
[(u'the', 69971), (u',', 58334), (u'.', 49346), (u'of', 36412), (u'and', 28853), (u'to', 26158), (u'a', 23195), (u'in', 21337), (u'that', 10594), (u'is', 10109)]
For more, see http://www.nltk.org/book/ch01.html on how to access corpora and process them in NLTK

piecewise numpy function with integer arguments

I define the piecewise function
def Li(x):
    return piecewise(x, [x < 0, x >= 0], [lambda t: sin(t), lambda t: cos(t)])
When I evaluate Li(1.0), the answer is correct:
Li(1.0) = array(0.5403023058681398)
But if I write Li(1), the answer is array(0).
I don't understand this behaviour.
This function runs correctly.
def Li(x):
    return piecewise(float(x),
                     [x < 0, x >= 0],
                     [lambda t: sin(t), lambda t: cos(t)])
It seems that piecewise() converts the return values to the same type as the input, so when an integer is input, an integer conversion is performed on the result, which is then returned. Because sine and cosine always return values between -1 and 1, all integer conversions will result in 0, 1 or -1 only - with the vast majority being 0.
>>> x=np.array([0.9])
>>> np.piecewise(x, [True], [float(x)])
array([ 0.9])
>>> x=np.array([1.0])
>>> np.piecewise(x, [True], [float(x)])
array([ 1.])
>>> x=np.array([1])
>>> np.piecewise(x, [True], [float(x)])
array([1])
>>> x=np.array([-1])
>>> np.piecewise(x, [True], [float(x)])
array([-1])
In the above I have explicitly cast the result to float; however, an integer input results in an integer output regardless of the explicit cast. I'd say that this is unexpected, and I don't know why piecewise() should do this.
I don't know if you have something more elaborate in mind, however, you don't need piecewise() for this simple case; an if/else will suffice instead:
from math import sin, cos
def Li(t):
    return sin(t) if t < 0 else cos(t)
>>> Li(0)
1.0
>>> Li(1)
0.5403023058681398
>>> Li(1.0)
0.5403023058681398
>>> Li(-1.0)
-0.8414709848078965
>>> Li(-1)
-0.8414709848078965
You can wrap the return value in a numpy.array if required.
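For example:
import numpy as np
result = np.array(Li(1))  # 0-d array holding cos(1), approximately 0.5403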
I am sorry, but this example is taken and modified from
http://docs.scipy.org/doc/numpy/reference/generated/numpy.piecewise.html
But, in fact, using ipython with numpy 1.9
"""
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Aug 21 2014, 18:22:21)
Type "copyright", "credits" or "license" for more information.
IPython 2.2.0 -- An enhanced Interactive Python.
"""
I have no errors, but a "ValueError: too many boolean indices" error appears if I use Python 2.7.3 with numpy 1.6:
"""
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
"""
I tested this function under Linux and Windows, and the same error occurs.
Obviously, it is very easy to work around this situation, but I think that this behaviour is a mistake in the numpy library.

nltk function to count occurrences of certain words

In the nltk book there is the question
"Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"
I thought I could use a function like state_union('1945-Truman.txt').count('men')
However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this function over and over for each text.
You can use the .words() function in the corpus, which returns a list of strings (i.e. tokens/words):
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
Then use the Counter() object to count the instances, see https://docs.python.org/2/library/collections.html#collections.Counter:
>>> wordcounts = Counter(brown.words())
But do note that the Counter is case-sensitive, see:
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
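To apply this to the original state_union question, here is a minimal sketch (assuming the state_union corpus has been downloaded, e.g. via nltk.download('state_union')):
>>> from collections import Counter
>>> from nltk.corpus import state_union
>>> for fileid in state_union.fileids():
...     counts = Counter(w.lower() for w in state_union.words(fileid))
...     print(fileid, counts['men'], counts['women'], counts['people'])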