Is it possible to speed up Wordnet Lemmatizer? - nltk

I'm using the WordNet Lemmatizer via NLTK on the Brown Corpus (to determine whether the nouns in it are used more in their singular form or their plural form).
i.e.
from nltk.stem.wordnet import WordNetLemmatizer
l = WordNetLemmatizer()
I've noticed that even the simplest queries, such as the one below, take quite a long time (at least a second or two).
l.lemmatize("cats")
Presumably this is because a web connection must be made to WordNet for each query?
I'm wondering if there is a way to still use the WordNet Lemmatizer but have it perform much faster? For instance, would it help at all for me to download WordNet onto my machine?
Or any other suggestions?
I'm trying to figure out if the WordNet Lemmatizer can be made faster rather than trying a different lemmatizer, because I've found it works better than alternatives like the Porter and Lancaster stemmers.

It doesn't query the internet; NLTK reads WordNet from your local machine. When you run the first query, NLTK loads WordNet from disk into memory:
>>> from time import time
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatize = WordNetLemmatizer().lemmatize
>>> t=time(); lemmatize('dogs'); print time()-t, 'seconds'
u'dog'
3.38199806213 seconds
>>> t=time(); lemmatize('cats'); print time()-t, 'seconds'
u'cat'
0.000236034393311 seconds
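If that one-time load is the part that hurts, you can also trigger it up front, before any timing-sensitive work. A minimal sketch, assuming a recent NLTK where the lazy corpus loader exposes ensure_loaded():
from nltk.corpus import wordnet
wordnet.ensure_loaded()  # forces the lazy loader to read WordNet from disk now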
It is rather slow if you have to lemmatize many thousands of phrases. However, if you are doing a lot of redundant queries, you can get some speedup by caching the results of the function:
from nltk.stem import WordNetLemmatizer
from functools32 import lru_cache  # Python 2 backport; on Python 3, use functools.lru_cache from the standard library
wnl = WordNetLemmatizer()
lemmatize = lru_cache(maxsize=50000)(wnl.lemmatize)
lemmatize('dogs')

I've used the lemmatizer like this:
from nltk.stem.wordnet import WordNetLemmatizer # to download corpora: python -m nltk.downloader all
lmtzr = WordNetLemmatizer() # create a lemmatizer object
lemma = lmtzr.lemmatize('cats')
It is not slow at all on my machine. There is no need to connect to the web to do this.

Related

Obtaining METEOR scores for Japanese text

I wish to produce METEOR scores for several Japanese strings. I have imported nltk, wordnet and omw but the results do not convince me it is working correctly.
import nltk
from nltk.corpus import wordnet
from nltk.translate.meteor_score import single_meteor_score
nltk.download('wordnet')
nltk.download('omw')
reference = "チップは含まれていません。"
hypothesis = "チップは含まれていません。"
print(single_meteor_score(reference, hypothesis))
This outputs 0.5 but surely it should be much closer to 1.0 given the reference and hypothesis are identical?
Do I somehow need to specify which wordnet language I want to use in the call to single_meteor_score() for example:
single_meteor_score(reference, hypothesis, wordnet=wordnetJapanese.
Pending review by a qualified linguist, I appear to have found a solution. I found an open-source tokenizer for Japanese. I pre-processed all of my reference and hypothesis strings to insert spaces between Japanese tokens and then ran nltk's single_meteor_score() over the files.
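As a rough sketch of that preprocessing step (assuming the fugashi/MeCab tokenizer and a UniDic dictionary are installed; any Japanese tokenizer would do), the strings can be tokenized before scoring:
from fugashi import Tagger  # MeCab wrapper; an assumption, not part of the original post
from nltk.translate.meteor_score import single_meteor_score

tagger = Tagger()

def tokenize_ja(text):
    # Split the sentence into tokens so METEOR can align them
    return [word.surface for word in tagger(text)]

reference = tokenize_ja("チップは含まれていません。")
hypothesis = tokenize_ja("チップは含まれていません。")
# Recent NLTK versions expect pre-tokenized input; older ones accept
# space-separated strings instead (" ".join(tokens)).
print(single_meteor_score(reference, hypothesis))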

Calculate meteor_score over entire corpus

I understand that meteor_score from nltk.translate.meteor_score calculates the METEOR score for one hypothesis sentence against a list of reference sentences.
But is there an implementation for calculating the score over an entire corpus as well, or a way to do it, similar to the corpus_bleu implementation?
I couldn't find anything for this case.
I have created something like this for my project:
# >>> nltk.download()
# Download window opens, fetch wordnet
# >>> from nltk.corpus import wordnet as wn
from nltk.translate.meteor_score import meteor_score
import numpy as np

def corpus_meteor(expected, predicted):
    # Average the sentence-level METEOR scores over the whole corpus
    meteor_scores = [meteor_score(expect, predict)
                     for expect, predict in zip(expected, predicted)]
    return np.mean(meteor_scores)
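A quick usage sketch (the sentences are made up; recent NLTK expects pre-tokenized input, with one list of reference token lists per hypothesis):
expected = [[['the', 'cat', 'sat', 'on', 'the', 'mat']],
            [['dogs', 'bark', 'loudly']]]           # per sentence: a list of tokenized references
predicted = [['the', 'cat', 'sat', 'on', 'a', 'mat'],
             ['dogs', 'bark', 'loudly']]            # per sentence: one tokenized hypothesis
print(corpus_meteor(expected, predicted))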

Properly handling Dask multiprocessing in SQLAlchemy

The setting in which I am working can be described as follows:
Database and what I want to extract from it
The data required to run the analysis is stored in a single de-normalized (more than 100 columns) Oracle table. Financial reporting data is published to the table every day, and it is range-partitioned on the reporting date (one partition per day). Here's the structure of the query I intend to run:
SELECT col1,
col2,
col3
FROM table
WHERE date BETWEEN start_date AND end_date
Strategy to load data with Dask
I am using sqlalchemy with the cx_Oracle driver to access the database. The strategy I am following to load data in parallel with Dask is:
from dask import bag as db

def read_rows(from_date, to_date, engine):
    engine.dispose()
    query = """
    -- Query Text --
    """.format(from_date, to_date)
    with engine.connect() as conn:
        ret = conn.execute(query).fetchall()
    return ret

engine = create_engine(...)  # initialise sqlalchemy engine
add_engine_pidguard(engine)  # adding pidguard to engine

date_ranges = [...]  # list of (start_date, end_date)-tuples
data_db = (db.from_sequence(date_ranges)
             .map(lambda x: read_rows(from_date=x[0], to_date=x[1], engine=engine))
             .concat())

# ---- further process data ----
...
add_engine_pidguard is taken from the sqlalchemy documentation: How do I use engines / connections / sessions with Python multiprocessing, or os.fork()?
Questions
Is the current way of running the queries in blocks fine, or is there a cleaner way of achieving this in sqlalchemy?
Since the queries operate in a multiprocessing environment, is the approach of managing the engines fine the way it is implemented?
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?
I would be apt to try code along the lines of
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
...
con = engine.connect()
df = dd.from_delayed([
    delayed(pd.read_sql_query)(QUERY, con, params=params)
    for params in date_ranges
])
In this case I make just one connection; cx_Oracle connections are, as I understand it, able to be used by multiple threads. The data is loaded with dask.dataframe, and nothing is done yet to move off the default threaded scheduler. Database IO and many pandas operations release the GIL, so the threaded scheduler is a good candidate here.
This will let us jump right to having a dataframe, which is nice for many operations on structured data.
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?
This is not especially likely to improve performance, as I understand things.
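For reference, a mapped-table version might look roughly like the sketch below; the table name, column names, and types are hypothetical, and declarative_base lives in sqlalchemy.ext.declarative on older SQLAlchemy releases. It mainly buys type handling and readability, not speed:
from sqlalchemy import Column, Date, Numeric, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Report(Base):                     # hypothetical mapping of the Oracle table
    __tablename__ = 'report_table'      # assumed name
    col1 = Column(String, primary_key=True)
    col2 = Column(Numeric)
    col3 = Column(Numeric)
    report_date = Column(Date)

Session = sessionmaker(bind=engine)     # reuses the engine from the question

def read_rows_orm(from_date, to_date):
    # Same date-range filter as the raw query, expressed through the ORM
    with Session() as session:
        q = (session.query(Report.col1, Report.col2, Report.col3)
                    .filter(Report.report_date.between(from_date, to_date)))
        return q.all()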

Nltk .most_common(), what is the order it is returned in?

I have found the frequency of bigrams in certain sentences using:
import nltk
from nltk import ngrams
mydata = "xxxxx"
mylist = mydata.split()
mybigrams = list(ngrams(mylist, 2))
fd = nltk.FreqDist(mybigrams)
print(fd.most_common())
On printing out the bigrams with the most common frequencies, one occurs 7 times, whereas all 95 other bigrams occur only once. However, when comparing the bigrams to my sentences, I can see no logical order in the way the frequency-1 bigrams are printed out. Does anyone know if there is any logic to the way .most_common() orders the bigrams, or is it random?
Thanks in advance
Short answer, based on the documentation of collections.Counter.most_common:
Elements with equal counts are ordered arbitrarily.
In current versions of NLTK, nltk.FreqDist is based on nltk.compat.Counter. On Python 2.7 and 3.x, collections.Counter will be imported from the standard library. On Python 2.6, NLTK provides its own implementation.
For details, look at the source code:
https://github.com/nltk/nltk/blob/develop/nltk/compat.py
In conclusion, without checking all possible version configurations, you cannot expect words with equal frequency to come back in any particular order.
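A tiny illustration (the exact tie order you see may differ between Python and NLTK versions, which is the point):
import nltk

fd = nltk.FreqDist(['a', 'b', 'b', 'c'])
print(fd.most_common())
# ('b', 2) comes first; 'a' and 'c' are tied at 1, and their relative
# order is an implementation detail you should not rely on.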

New to Python - proficient with Matlab: getting error "IndexError: list index out of range"

As the title says, I'm proficient with Matlab and already have this function written there, where it works great. I wanted to learn a new language and was pointed to Python, so I figured I would write a simple function to get used to the syntax and have something to validate against. I wrote the function Xfcn (non-dimensional mass flow in rocket problems) and it gives me the correct number if I only use one value. Now I'd like to plot the X-function versus Mach and validate it against my Matlab version, which means looping through some Mach vector and then plotting it (plotting comes later).
I'm getting the error mentioned above and I think it's a simple indexing problem, although I can't seem to figure out what it is. I've looked here and in Python's documentation, so hopefully we can resolve this quickly. I've also checked the type of i and printed range(len(Ms)): it gives 0-49 in steps of 1, as I expect, and Ms runs from 0 to 1 in equally spaced increments, also as I expect, so I cannot figure out where my error is. My code is below.
from Xfcn import Xfcn
import pylab as pyl
import numpy as np

Ms = np.linspace(0,1,endpoint=True)
X = []
for i in range(len(Ms)):
    X[i][0] = Xfcn(Ms[i])
print X
print 'Done.'
Thanks for the help!
BL
You created X as a one-dimensional list and are trying to access it as if it were multi-dimensional.
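A minimal sketch of the fix (Xfcn is the asker's own module, so its import is assumed): build the list with a comprehension, or with append, instead of indexing into an empty list:
from Xfcn import Xfcn   # the asker's own function
import numpy as np

Ms = np.linspace(0, 1, endpoint=True)
X = [Xfcn(m) for m in Ms]   # the list grows as it goes; no pre-allocation or indexing needed
print(X)
print('Done.')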