Word tokenizing gives different results at home than on Colaboratory - nltk

Local:
$ python
Python 3.8.0 (default, Nov 6 2019, 15:27:39)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('stopwords')
>>> stop_words = set(nltk.corpus.stopwords.words('english'))
>>> text = """Former Kansas Territorial Governor James W. Denver visited his namesake city in 1875 and in 1882."""
>>> def preprocess(document):
...     sentence_list = list()
...     for sentence in nltk.sent_tokenize(document):
...         word_tokens = nltk.word_tokenize(sentence)
...         sentence_list.append([w for w in word_tokens if not w in stop_words and len(w) > 1])
...     sentences = [nltk.pos_tag(sent) for sent in sentence_list]
...     return sentences
...
>>> grammar = r'Chunk: {(<A.*>*|<N.*>*|<VB[DGNP]?>*)+}'
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tagged = preprocess(text)
>>> result = collections.Counter()
>>> for sentence in tagged:
...     my_tree = chunk_parser.parse(sentence)
...     for subtree in my_tree.subtrees():
...         if subtree.label() == 'Chunk':
...             leaves = [x[0] for x in subtree.leaves()]
...             phrase = " ".join(leaves)
...             result[phrase] += 1
...
Output at home is:
>>> print(result.most_common(10))
[('Former Kansas Territorial Governor James W. Denver', 1), ('visited', 1), ('city', 1)]
Same code on Colaboratory, result is:
>>> print(result.most_common(10))
[]
I have run non-NLTK code in both places and gotten identical output. Could it be that the local NLTK libraries are different? Different versions of NLTK?

I was running python 3.8.0 locally. I changed it to 3.6.9 and I now get the same results as on Colaboratory.
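In cases like this, a quick way to pin down which component differs is to print the interpreter and NLTK versions in both environments and compare the raw tokenizer output before any filtering. This is only a diagnostic sketch using standard sys and nltk attributes, not part of the original code:
>>> import sys, nltk
>>> print(sys.version)          # interpreter version
>>> print(nltk.__version__)     # NLTK version
>>> nltk.word_tokenize("Governor James W. Denver visited his namesake city in 1875.")
If the interpreter, the NLTK version, or the downloaded punkt data differs between the two machines, tokenization and tagging can differ, which in turn changes what the chunk grammar matches.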

Related

NameError: name 'status_code' is not defined while parsing access.log

Good afternoon. While testing the code for parsing access.log, the following error occurred:
Traceback (most recent call last):
  File "logsscript_3.py", line 31, in <module>
    dict_ip[ip][status_code] += 1
NameError: name 'status_code' is not defined
I need to output the top 10 requests with code 400 to a JSON file.
The code is like this:
import argparse
import json
import re
from collections import defaultdict
parser = argparse.ArgumentParser(description='log parser')
parser.add_argument('-f', dest='logfile', action='store', default='access.log')
args = parser.parse_args()
regul_ip = (r"^(?P<ips>.*?)")
regul_statuscode = (r"\s(?P<status_code>400)\s")
dict_ip = defaultdict(lambda: {"400": 0})
with open(args.logfile) as file:
    for index, line in enumerate(file.readlines()):
        try:
            ip = re.search(regul_ip, line).group()
            status_code = re.search(regul_statuscode, line).groups()[0]
        except AttributeError:
            pass
        dict_ip[ip][status_code] += 1

print(json.dumps(dict_ip, indent=4))

with open("final_log.json", "w") as jsonfile:
    json.dump(dict_ip, jsonfile, indent=5)
An example of a line from access.log:
213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717
Following up on my comment (included below for completeness), which explains why you see the error, here are some ways to fix the code.
Expanding on @khelwood's point: the example line (and likely many more in your log) is not a 400 line. Your status-code regex only matches 400, so for every non-400 line the search returns None and the status_code = ... line fails with AttributeError: 'NoneType' object has no attribute 'groups'. Ignoring the exception with pass then leads to a NameError on the dict_ip[ip]... line, because status_code was never assigned a value.
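A minimal reproduction of that failure mode, with a throw-away non-400 line just to illustrate the mechanism:
>>> import re
>>> line = '1.2.3.4 - - [date] "GET / HTTP/1.1" 200 4263'   # no 400 in this line
>>> m = re.search(r"\s(?P<status_code>400)\s", line)
>>> print(m)
None
>>> # m.groups() raises AttributeError here; if that exception is swallowed
>>> # with `pass`, status_code is never assigned, and the very next use of
>>> # it raises NameError instead.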
First, you can use one regex to parse the access logs.
>>> import re
>>>
>>> line = '213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717'
>>> p = r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\S+) (\S+) "(\S+)" "(.*?)" (\S+)'
>>> pat = re.compile(p)
>>> m = pat.match(line)
>>> m.groups()
('213.137.244.2', '-', '-', '13/Dec/2015:17:30:13 +0100', 'GET', '/administrator/', 'HTTP/1.1', '200', '4263', '-', 'Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0', '7717')
>>> m.group(1)
'213.137.244.2'
>>> m.group(2)
'-'
...
The snippet above shows how to fetch and access the various fields from your log, since I gathered from your other recent questions that you need this.
You can slightly modify the above as shown below (since you care only about log lines with a 400 and need only the IP address). Note that this is not the only way to write the regex, it's simply one way that can be easily derived from the one above. Note also that for illustration purposes I changed 200 to 400.
>>> line = '213.137.244.2 - - [13/Dec/2015:17:30:13 +0100] "GET /administrator/ HTTP/1.1" 400 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" 7717'
>>> p = r'(\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" 400 \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> m = pat.match(line)
>>> m.group(1)
'213.137.244.2'
So, reading your log, counting 400s per IP address, and saving the 10 IP addresses with the most 400s to a JSON file:
>>> from collections import Counter
>>> import json
>>> import re
>>>
>>> p = r'(\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" 400 \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> dict_ips_400 = Counter()
>>>
>>> with open("input_log.text") as f:
...     for line in f:  # see Note 1
...         m = pat.match(line)
...         if m:  # check if there is a match
...             ip = m.group(1)
...             dict_ips_400[ip] += 1
...
>>> with open("final_log.json", "w") as jsonfile:
... json.dump(dict_ips_400.most_common(10), jsonfile, indent=5)
...
Notes:
1. You may want to check the differences between using f.readlines() and processing the file line by line as above (relevant if you are working with large files).
2. You could modify the above to (a) use named groups (see re's docs), as you attempted to do in your code, and/or (b) capture and store more fields, say, counts of (IP address, status code) pairs; see the named-group sketch just below.
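For instance, a named-group version of the 400-only pattern could look like this (just a sketch derived from the regex above; the group names ip and status are my own choice):
>>> p = r'(?P<ip>\S+) \S+ \S+ \[.*?\] "\S+ \S+ \S+" (?P<status>400) \S+ "\S+" ".*?" \S+'
>>> pat = re.compile(p)
>>> m = pat.match(line)   # `line` is the 400 example line shown earlier
>>> m.group('ip'), m.group('status')
('213.137.244.2', '400')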

How to Merge two json files into one in python3.6

Merge two json files into one in Python 3.6.
I tried data1.update(data2), but it didn't work.
import json
with open("test.json") as fin1:
data1 = json.load(fin1)
with open("test_userz.json") as fin2:
data2 = json.load(fin2)
data1.update(data2)
with open("merged.json", "w") as fout:
json.dump(data1, fout)
You could merge them like so:
>>> data1=json.loads('{"test1":"one"}')
>>> data2=json.loads('{"test2":"two"}')
>>> data3=[]
>>> data3.append(data1)
>>> data3.append(data2)
>>> json.dumps(data3)
'[{"test1": "one"}, {"test2": "two"}]'

NLTK single-word part-of-speech tagging

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?
For example: report -> {Noun, Verb}, kind -> {Adjective, Noun}
I have not been able to find a POS-tokenizer that tags part-of-speech for words outside of the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.
Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:
>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...     for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
POS models are trained on sentence/document-based data, so the expected input to the pre-trained model is a sentence/document. When there's only a single word, it is treated as a one-word sentence, and in that single-word context only one tag comes back.
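For example, feeding the tagger a single token just gives you one tag for that one "sentence" (a sketch; the exact tag depends on the pretrained model):
>>> from nltk import pos_tag
>>> pos_tag(['coaches'])   # one-word "sentence" -> exactly one (word, tag) pair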
If you're trying to find all possible POS tags per English word, you would need a corpus with many different uses of the words, then tag the corpus and count/extract the tags per word. E.g.:
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
... counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
Alternatively, there's WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.
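For example, a closed-class function word like "the" has no synsets at all:
>>> wn.synsets('the')
[]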

piecewise numpy function with integer arguments

I define the piecewise function
from numpy import piecewise, sin, cos   # imports assumed; not shown in the original
def Li(x):
    return piecewise(x, [x < 0, x >= 0], [lambda t: sin(t), lambda t: cos(t)])
When I evaluate Li(1.0), the answer is correct:
Li(1.0) = array(0.5403023058681398)
But if I write Li(1), the answer is array(0).
I don't understand this behaviour.
This version of the function runs correctly:
def Li(x):
    return piecewise(float(x),
                     [x < 0, x >= 0],
                     [lambda t: sin(t), lambda t: cos(t)])
It seems that piecewise() converts the return values to the same type as the input, so when an integer is input, an integer conversion is performed on the result, which is then returned. Because sine and cosine always return values between -1 and 1, all integer conversions result in 0, 1 or -1 only, with the vast majority being 0.
>>> x=np.array([0.9])
>>> np.piecewise(x, [True], [float(x)])
array([ 0.9])
>>> x=np.array([1.0])
>>> np.piecewise(x, [True], [float(x)])
array([ 1.])
>>> x=np.array([1])
>>> np.piecewise(x, [True], [float(x)])
array([1])
>>> x=np.array([-1])
>>> np.piecewise(x, [True], [float(x)])
array([-1])
In the above I have explicitly cast the result to float, however, an integer input results in an integer output regardless of the explicit cast. I'd say that this is unexpected and I don't know why piecewise() should do this.
I don't know if you have something more elaborate in mind, however, you don't need piecewise() for this simple case; an if/else will suffice instead:
from math import sin, cos

def Li(t):
    return sin(t) if t < 0 else cos(t)
>>> Li(0)
1.0
>>> Li(1)
0.5403023058681398
>>> Li(1.0)
0.5403023058681398
>>> Li(-1.0)
-0.8414709848078965
>>> Li(-1)
-0.8414709848078965
You can wrap the return value in a numpy.array if required.
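For instance, a minimal sketch of that wrapping (not from the original answer; note that the result is a float array even for integer input):
>>> import numpy as np
>>> from math import sin, cos
>>> def Li(t):
...     return np.array(sin(t) if t < 0 else cos(t))
...
>>> Li(1).dtype
dtype('float64')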
I am sorry, but this example is taken and modified from
http://docs.scipy.org/doc/numpy/reference/generated/numpy.piecewise.html
But, in fact, using IPython with numpy 1.9,
"""
Python 2.7.8 |Anaconda 2.1.0 (64-bit)| (default, Aug 21 2014, 18:22:21)
Type "copyright", "credits" or "license" for more information.
IPython 2.2.0 -- An enhanced Interactive Python.
"""
I get no errors, but a "ValueError: too many boolean indices" error appears if I use Python 2.7.3 with numpy 1.6:
"""
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
"""
I tested this function under Linux and Windows and the same error occurs.
Obviously, it is very easy to work around this situation, but I think that this behaviour is a mistake in the numpy library.

nltk function to count occurrences of certain words

In the nltk book there is the question
"Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"
I thought I could use a function like state_union('1945-Truman.txt').count('men')
However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of repeating this call over and over for each text.
You can use the corpus reader's .words() function to return a list of strings (i.e. tokens/words):
>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
Then use the Counter() object to count the instances, see https://docs.python.org/2/library/collections.html#collections.Counter:
>>> wordcounts = Counter(brown.words())
But do note that the Counter is case-sensitive, see:
>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
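Applying the same idea to the original exercise, you can loop over the corpus file ids and build one Counter per document. A minimal sketch along those lines (lower-casing so that, e.g., 'Men' and 'men' are counted together):
>>> from nltk.corpus import state_union
>>> from collections import Counter
>>> targets = ('men', 'women', 'people')
>>> for fileid in state_union.fileids():
...     counts = Counter(w.lower() for w in state_union.words(fileid))
...     print(fileid, [(t, counts[t]) for t in targets])
...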