FreqDist.plot not using the most common words - nltk

I am following the code in http://www.nltk.org/book/ch01.html, specifically section 3.2.
from __future__ import division
import nltk
from nltk.book import *
import dateutil
import pyparsing
import numpy
import six
import matplotlib
fdist1 = FreqDist(text1)
fdist1.plot(50,cumulative=True)
is the script that I am running (the __future__ import is left over from some other code). However, the plot I get does not match the one in the book: it includes uncommon words such as "funereal". I am running 32-bit Python 2.7 on Windows. A friend running the same commands on his Mac gets the plot from the book. I am at a complete loss as to what the difference may be. Thanks!

I've recently had the same problem.
NLTK builds its FreqDist on top of the Counter class from Python's collections package. The plot() (and tabulate()) functions extract the samples to use as follows:
samples = list(islice(self, *args))
Source: http://www.nltk.org/_modules/nltk/probability.html
I guess they do it that way to enable plotting of sub-sequences, such as the range between the 10th and the 40th sample, but it assumes that the Counter object (which is basically a dict) is sorted by frequency. That sometimes happens to be the case, but it does not always hold, as the Python documentation explicitly states. The correct replacement for the above line in the NLTK source would be:
samples = [item for item, _ in self.most_common(*args)]
That corrected version can also be found in the most recent NLTK code on GitHub. It was fixed in NLTK 3.0.0, so make sure you aren't using an older version (such as an alpha or beta release of NLTK 3).
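If you want to see the difference for yourself, a quick check (reusing the fdist1 from the question) compares the order islice sees with the order most_common() returns:
from itertools import islice
# Order the old plot() relied on: the dict's iteration order, not sorted by count
print(list(islice(fdist1, 10)))
# Order it should use: samples sorted by decreasing frequency
print([word for word, count in fdist1.most_common(10)])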
If you do not want to change the NLTK source and cannot update it, you can easily adapt the plot() function yourself:
def plot_freqdist(fd, num=0, cumulative=False, title=None):
    import pylab
    # Set up parameters
    if num <= 0:
        num = fd.B
    # Get samples and frequencies
    samples, freq, accu = [], [], 0
    for s, f in fd.most_common(num):
        accu = accu + f if cumulative else f
        samples.append(s)
        freq.append(accu)
    # Create plot
    pylab.grid(True, color='silver')
    if title:
        pylab.title(title)
    pylab.plot(freq, linewidth=2)
    pylab.xticks(range(len(samples)), samples, rotation=90)
    pylab.xlabel('Samples')
    pylab.ylabel('Cumulative Counts' if cumulative else 'Counts')
    pylab.show()
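For example, to reproduce the cumulative plot from the book with the corrected ordering, you can call it on the fdist1 from the question:
plot_freqdist(fdist1, num=50, cumulative=True, title='Cumulative frequency plot of text1')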

Related

DeepSORT with custom detectors

I have created a class in which I am loading my custom weights for YOLO v5, v7, and v8 to get detections; from this class I get xyxy values and the class name. Now I want to deploy DeepSORT on top of it, using this specific class and no extra files. In my detection class I am simply using torch.hub.load() and the basic functions of YOLO v5, v7, and v8. I am looking for a specific and straightforward technique to apply DeepSORT using my detection class. Is it possible? If it is, please tell me how I can do it. If it is not, what is the simplest method to implement DeepSORT?
import torch
import cv2

class YoloDetector:
    def __init__(self, conf_thold, device, weights_path, expected_objs):
        # taking the weights file path, confidence level and GPU or CPU selection
        self._model = torch.hub.load("WongKinYiu/yolov7", "custom", f"{weights_path}", trust_repo=True)
        self._model.conf = conf_thold        # NMS confidence threshold
        self._model.classes = expected_objs  # (optional list) filter by class
        self._model.to(device)               # specifying device type

    def process_image(self, image):
        results = self._model(image)
        predictions = []  # final list to return all detections
        detection = results.pandas().xyxy[0]
        for i in range(len(detection)):
            # getting bbox and class name one by one
            class_name = detection["name"]
            xmin = detection["xmin"][i]
            ymin = detection["ymin"][i]
            xmax = detection["xmax"][i]
            ymax = detection["ymax"][i]
            # appending the values to the list as a dictionary
            predictions.append({'class': class_name[i], 'bbox': [int(xmin), int(ymin), int(xmax), int(ymax)]})
        return predictions
That is the code I am using for getting detections. Now please tell me how I can implement DeepSORT using it.
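One straightforward option - and this is a sketch under assumptions, not something from your code - is to feed the output of YoloDetector into the third-party deep-sort-realtime package (pip install deep-sort-realtime). Its tracker expects ([left, top, width, height], confidence, class) tuples, so you would convert your xyxy boxes and also extend process_image() to return a confidence score; the 'confidence' field, the weights path 'best.pt' and the 'video.mp4' file below are all hypothetical placeholders:
import cv2
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YoloDetector(conf_thold=0.4, device='cuda', weights_path='best.pt', expected_objs=None)
tracker = DeepSort(max_age=30)

cap = cv2.VideoCapture('video.mp4')
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Convert each detection to the ([left, top, w, h], confidence, class) tuple DeepSORT expects
    raw_detections = []
    for det in detector.process_image(frame):
        x1, y1, x2, y2 = det['bbox']
        conf = det.get('confidence', 1.0)  # hypothetical: extend process_image() to return this
        raw_detections.append(([x1, y1, x2 - x1, y2 - y1], conf, det['class']))
    # Update the tracker; it returns Track objects with persistent IDs
    tracks = tracker.update_tracks(raw_detections, frame=frame)
    for track in tracks:
        if not track.is_confirmed():
            continue
        left, top, right, bottom = track.to_ltrb()
        print(track.track_id, (left, top, right, bottom))
cap.release()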

Trashword Identification with Sklearn for NLTK

For an NLTK project I want to mass-identify whether a new word is likely a "trash" word or a meaningful word. It fits the architecture to do this in an early phase, so I can proceed with "true" words later - hence this negative approach of identifying the non-meaningful words first.
For this I have a training set of words labeled word/trash.
I work with single characters for the features and get the error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
The code I use is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
talk_data = pd.read_excel('nn_data_stack.xlsx', sheet_name='label2')
talk_data['word_char_list'] = [[*x] for x in talk_data['word'].astype(str)]
talk_data['word_char'] = [','.join(map(str, l)) for l in talk_data['word_char_list']]
talk_data['word_char'].replace(',',' ', regex=True, inplace=True)
z = talk_data['word_char']
y = talk_data['class']
z_train, z_test,y_train, y_test = train_test_split(z,y,test_size = 0.2)
cv = CountVectorizer()
features = cv.fit_transform(z_train)
Example of the data set I have for training:
word          class
drawing       word
to be         word
龚莹author    trash
ï½°c          trash
Do I need to use an alternative to CountVectorizer?
I think I need to move to character embeddings to get proper input - but how?
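The empty-vocabulary error comes from CountVectorizer's default token_pattern, which only keeps tokens of two or more word characters, so space-separated single characters are all discarded. You don't necessarily need character embeddings: CountVectorizer can tokenize by character itself. A minimal sketch (with a tiny inline DataFrame standing in for your Excel sheet, and LinearSVC chosen as an example classifier) could look like this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

# Tiny stand-in for the labeled training data from the question
talk_data = pd.DataFrame({
    'word':  ['drawing', 'to be', '龚莹author', 'ï½°c'],
    'class': ['word', 'word', 'trash', 'trash'],
})

# Character unigrams and bigrams instead of word tokens; no manual splitting needed
cv = CountVectorizer(analyzer='char', ngram_range=(1, 2))
features = cv.fit_transform(talk_data['word'].astype(str))

clf = svm.LinearSVC()
clf.fit(features, talk_data['class'])

# Classify new, unseen words with the same vectorizer
print(clf.predict(cv.transform(['drawings', 'ï¿£xx1'])))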

GridSearchCV without Cross Validation CV = 1

I have a special dataset, and this dataset can be trained with about a 1% error. I need to do hyperparameter tuning for MLPRegressor without splitting the training set - meaning cv = 1. Is this possible with GridSearchCV?
One of the documented options for the cv parameter is:
"An iterable yielding (train, test) splits as arrays of indices."
So if you have an X input matrix, a y target vector, an mlp estimator, and a params grid, you can pass just a single train-test split:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
But keep in mind that using a single split for a hyper-parameter sweep is bad practice: do not run too many search steps with such a grid search, or you will overfit that one validation set.
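Putting that together for MLPRegressor, a minimal end-to-end sketch might look like the following (the synthetic data from make_regression and the parameter grid are just placeholders for your own X, y and params):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor

# Placeholder data; substitute your own X and y
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# One (train, test) split expressed as index arrays
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

mlp = MLPRegressor(max_iter=2000, random_state=0)
params = {'hidden_layer_sizes': [(50,), (100,)], 'alpha': [1e-4, 1e-3]}

# cv accepts an iterable of (train, test) index pairs, so a single pair means a single split
search = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
search.fit(X, y)
print(search.best_params_, search.best_score_)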

Python3 tkinter label value

I need some help: I am trying to update the etcprice label value after I push the button, and then every 5 seconds. In the terminal it works, but in the tk window it does not. I am stuck here :( please help me.
I tried setting "price" up as a StringVar(), but in that case I got a lot of errors.
Many thanks
import urllib.request
from urllib.request import *
import json
import six
from tkinter import *
import tkinter as tk
import threading

price = '0'

def timer():
    threading.Timer(5.0, timer).start()
    currentPrice()

def currentPrice():
    url = 'https://api.cryptowat.ch/markets/bitfinex/ethusd/price'
    json_obj = urllib.request.urlopen(url)
    data = json.load(json_obj)
    for item, v in six.iteritems(data['result']):
        # print("ETC: $", v)
        price = str(v)
        # print(type(etcar))
        print(price)
    return price

def windows():
    root = Tk()
    root.geometry("500x200")
    kryptoname = Label(root, text="ETC price: ")
    kryptoname.grid(column=0, row=0)
    etcprice = Label(root, textvariable=price)
    etcprice.grid(column=1, row=0)
    updatebtn = Button(root, text="update", command=timer)
    updatebtn.grid(column=0, row=1)
    root.mainloop()

windows()
The solution was: I created a new StringVar called "b" and changed the etcprice label's textvariable to it. After I added b.set(price) in currentPrice(), it is working.
The price variable is a global - if you're trying to change it, you need to do so explicitly:
def currentPrice():
global price
url = 'https://api.cryptowat.ch/markets/bitfinex/ethusd/price'
json_obj = urllib.request.urlopen(url)
data = json.load(json_obj)
for item, v in six.iteritems(data['result']):
# print("ETC: $", v)
price = str(v)
# print(type(etcar))
print(price)
return price
otherwise, Python will shadow it with a local variable inside the function and not modify the global.
It's also not a good idea to keep launching more and more threads each time you click the button - so:
updatebtn = Button(root, text="update", command=currentPrice)
probably makes more sense.
You don't need to use threads here just to call functions 'in the background'. You can use tkinter's own .after method instead to delay calling functions (it takes milliseconds, not float seconds, by the way).
def timer(delay_ms, root, func):
    func()
    root.after(delay_ms, timer, delay_ms, root, func)
might be a helpful kind of function.
Then before you launch your mainloop, or whenever you want the getting to start, call it once:
timer(5000, root, currentPrice)
If you want the currentPrice function to run in a separate thread, and so not block your main GUI thread if there is network lag, for instance, then you can use threads more like this:
threading.Thread(target=currentPrice, daemon=True).start()
which will run it in a daemon-thread - which will automatically get killed if you close the program, or ctrl-c it, or whatever. So you could put that line in a getCurrentPriceBG or similar function.
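Putting the pieces together - the global/StringVar fix and .after instead of threads - a minimal working version might look like this (the URL and the 5-second interval are taken from the question; the service behind the URL may no longer respond):
import urllib.request
import json
import tkinter as tk

def currentPrice():
    url = 'https://api.cryptowat.ch/markets/bitfinex/ethusd/price'
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    for _, v in data['result'].items():
        price_var.set(str(v))   # updating the StringVar updates the label in the GUI

def timer():
    currentPrice()
    root.after(5000, timer)     # re-schedule itself every 5000 ms, no threads needed

root = tk.Tk()
root.geometry("500x200")
price_var = tk.StringVar(value='0')
tk.Label(root, text="ETC price: ").grid(column=0, row=0)
tk.Label(root, textvariable=price_var).grid(column=1, row=0)
# Note: clicking the button more than once starts additional update loops
tk.Button(root, text="update", command=timer).grid(column=0, row=1)
root.mainloop()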

Why does random forest regression return a very bad result?

I'm trying to use RandomForestRegressor() in scikit-learn to model some data. After processing my raw data, the data I passed to RandomForestRegressor() is as follows.
The following is only a small part of my data; in fact, there are around 6000 rows.
Note that the first column is the DatetimeIndex of my DataFrame 'final_data', which contains all the data. In addition, the data in column 4 were strings; I converted them to numbers with a map function.
import pandas as pd
from datetime import datetime
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]
S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]
S_dataset = pd.concat([S_dataset1, S_dataset2])

A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.iloc[:, :].values
y = W_dataset['col9'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
forest = RandomForestRegressor(n_estimators=1000, criterion='mse',
                               random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
Here is my code for predicting col9. I separated final_data into two seasons, which may make the prediction more accurate. However, the result is very bad: the R^2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why I get such a bad result. Could someone tell me where I went wrong and how I can improve my model? Many thanks!!!
I think the problem was that I didn't consider the effect of the datetime on the prediction. After converting the DatetimeIndex values to numbers and adding them as input to my model, I got quite a good result: the R^2 score is around 0.95-0.98.
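For reference, the conversion described above can be done roughly like this (a sketch continuing the question's code; the astype('int64') cast turns the DatetimeIndex into nanoseconds since the epoch):
# Add the DatetimeIndex as a numeric feature before building X
W_data = W_dataset.iloc[:, :8].copy()
W_data['col20'] = W_dataset['col20']
W_data['time_numeric'] = W_dataset.index.astype('int64')  # nanoseconds since epoch
X = W_data.values
y = W_dataset['col9'].values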