Trashword Identification with Sklearn for NLTK

For an NLTK project I want to identify in bulk whether a new word is likely a "trash" word or a meaningful word. It fits the architecture to do this in an early phase, so that I can proceed with only the "true" words later; that is why I prefer this negative approach of identifying the non-meaningful words.
For this I have a training set of words labeled as word/trash.
I work with the single characters of each word as input and get this error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
The code I use is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
talk_data = pd.read_excel('nn_data_stack.xlsx', sheet_name='label2')
talk_data['word_char_list'] = [[*x] for x in talk_data['word'].astype(str)]
talk_data['word_char'] = [','.join(map(str, l)) for l in talk_data['word_char_list']]
talk_data['word_char'].replace(',',' ', regex=True, inplace=True)
z = talk_data['word_char']
y = talk_data['class']
z_train, z_test,y_train, y_test = train_test_split(z,y,test_size = 0.2)
cv = CountVectorizer()
features = cv.fit_transform(z_train)
Example of the data set I have for training:
word        | class
drawing     | word
to be       | word
龚莹author  | trash
ï½°c        | trash
Do I need to use an alternative to CountVectorizer?
I think I need to switch to character embeddings to ensure a proper input, but how?
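A likely cause of the empty vocabulary, sketched here with the standard library only: the default token_pattern of CountVectorizer is `(?u)\b\w\w+\b`, which only keeps tokens of two or more word characters, so a string of space-separated single characters yields no tokens at all. The helper names `default_tokens` and `char_ngrams` below are hypothetical and only illustrate the idea; they are not part of scikit-learn.

```python
import re
from collections import Counter

# Default token_pattern of CountVectorizer: a token needs 2+ word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Tokenize the way the default pattern would (illustration only)."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text)


def char_ngrams(word, n_min=1, n_max=2):
    """Count the character n-grams of a word for n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


spaced = " ".join("drawing")      # "d r a w i n g", as built in the question
print(default_tokens(spaced))     # single characters never match the pattern
print(default_tokens("drawing"))  # the unsplit word does match
print(char_ngrams("to be"))       # character-level features instead
```

If this is indeed the cause, passing analyzer='char' (optionally with an ngram_range such as (1, 3)) to CountVectorizer and feeding it the raw words, rather than the space-separated characters, should avoid the error without needing embeddings.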

HuggingFace Pipeline exceeds 512 tokens of BERT

I tested the following HuggingFace pipeline:
https://huggingface.co/deepset/gelectra-base-germanquad
While testing it, I noticed that the pipeline has no limit on the input size.
I passed inputs of more than approx. 5,400 tokens and it always gave me good results (even for answers located at the end of the input).
I tried to do the same without the pipeline, importing the model directly and splitting the text into several subtexts of at most 512 tokens. However, with this approach I often get wrong results...
What approach does the HuggingFace pipeline follow in order to be able to exceed the 512-token limit?
For example, let's retrieve the content of a Wikipedia page, which is way over 512 tokens. Here doc is the retrieved content...
import requests
url = "https://de.wikipedia.org/wiki/Gesch%C3%A4ftsbericht"
r = requests.get(url)
raw_html = r.text
doc = cleanhtml(raw_html)  # cleanhtml is a helper that is not shown here
question = "Wie teuer ist ein Geschäftsbericht?"
How the pipeline handles the prediction:
from transformers import pipeline
german_qa_model = "deepset/gelectra-base-germanquad"
qa_model = pipeline("question-answering", model=german_qa_model)
qa_model(question=question,context=doc)
>>> {'answer':'über 100.000' ,'score': 0.7242466807365417}
Loading the model itself:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
import torch
tokenizer = AutoTokenizer.from_pretrained(german_qa_model)
model = AutoModelForSequenceClassification.from_pretrained(german_qa_model)
model = Trainer(model=model)
encoding = tokenizer(question, doc, add_special_tokens=True, return_tensors="pt")
output = model(**encoding)
>>> Error: token indices sequence length is longer than the specified maximum sequence length for this model (10402 > 512)
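As far as I understand it, the question-answering pipeline does not feed the whole context to the model at once: it splits the context into overlapping windows (controlled by its max_seq_len and doc_stride arguments), scores candidate answers per window, and returns the best-scoring one. Non-overlapping 512-token chunks can cut an answer in half, which may explain the worse results. Below is a standard-library sketch of the windowing idea only; `sliding_windows`, `best_answer` and the toy scores are hypothetical and not the transformers implementation.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len items that share `overlap` items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


def best_answer(candidates):
    """Pick the candidate with the highest score across all windows."""
    return max(candidates, key=lambda c: c["score"])


tokens = list(range(10))  # stand-in for the token ids of a long context
windows = sliding_windows(tokens, max_len=4, overlap=2)
for start, chunk in windows:
    print(start, chunk)

# Toy per-window predictions; a real model would produce these.
toy = [{"answer": "a", "score": 0.1}, {"answer": "b", "score": 0.7}]
print(best_answer(toy))
```

When loading the model manually, the tokenizer can produce such windows itself, for example with truncation="only_second", max_length, stride and return_overflowing_tokens=True; each window is then passed through the model separately.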

GridSearchCV without Cross Validation CV = 1

I have a special dataset that can be trained to an error of about 1%. I need to do hyperparameter tuning for MLPRegressor without splitting the training set into folds, i.e. with cv = 1. Is this possible with GridSearchCV?
One of the options for the cv parameter is:
An iterable yielding (train, test) splits as arrays of indices.
So, if you have an input matrix X, a target vector y, an estimator mlp, and a parameter grid params, you can do just one train-test split:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
But keep in mind that using a single split for a hyperparameter sweep is bad practice: the selected parameters can overfit that one validation set, so do not run many rounds of such a grid search.

Why does random forest regression return a very bad result?

I'm trying to use RandomForestRegressor() in scikit-learn to model some data. After processing my raw data, the data I passed to RandomForestRegressor() is as follows.
The following is only a little part of my data. In fact, there are around 6000 pieces of data.
Note, the first column is the datetimeindex of my created DataFrame 'final_data' that contains all the data. In addition, the data in column4 were strings. I just converted them to numbers by a map function.
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Split the data into the two seasons by date
S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]
S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]
S_dataset = pd.concat([S_dataset1, S_dataset2])

# Features: the first 8 columns plus col20; target: col9
A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.iloc[:, :].values
y = W_dataset['col9'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
# criterion='squared_error' was called 'mse' before scikit-learn 1.0
forest = RandomForestRegressor(n_estimators=1000, criterion='squared_error',
                               random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
Here is my code for predicting col9. I separated final_data into two seasons, which may make the prediction more accurate. However, the result is very bad: the R2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why the result is so bad. Could someone tell me where I went wrong and how I can improve my model? Many thanks!
I think the problem was that I didn't consider the effect of the datetime on the prediction. After converting the DatetimeIndex values to numerical values and feeding them to my model, I got quite a good result: the R2 score is around 0.95-0.98.
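The answer does not say which numerical values were used, so the following is only a sketch of one way to turn timestamps into numeric features, using the standard library. The feature choice and the name `datetime_features` are assumptions for illustration.

```python
from datetime import datetime


def datetime_features(ts):
    """Turn one timestamp into numeric features (hypothetical feature choice)."""
    return {
        "ordinal": ts.toordinal(),              # days since 0001-01-01, captures trend
        "day_of_year": ts.timetuple().tm_yday,  # captures seasonality
        "weekday": ts.weekday(),                # Monday == 0
        "hour": ts.hour,
    }


index = [datetime(2016, 10, 2, 13, 30), datetime(2017, 4, 3, 8, 0)]
rows = [datetime_features(ts) for ts in index]
for row in rows:
    print(row)
```

With pandas, the same information is available directly on a DatetimeIndex through attributes such as dayofyear, weekday and hour, and the resulting columns can be appended to the feature matrix X.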

Python-Sqlalchemy Binary Column Type HEX() and UNHEX()

I'm attempting to learn Sqlalchemy and utilize an ORM. One of my columns stores file hashes as binary. In SQL, the select would simply be
SELECT type, column FROM table WHERE hash = UNHEX('somehash')
How do I achieve a select like this (ideally with an insert example, too) using my ORM? I've begun reading about column overrides, but I'm confused/not certain that that's really what I'm after.
eg
res = session.query.filter(Model.hash == __something__? )
Thoughts?
For selects and inserts only
Well, for a select you could use:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> q = session.query(Model.id).filter(Model.some == func.HEX('asd'))
>>> print(q.statement.compile(bind=engine))
SELECT model.id
FROM model
WHERE model.some = HEX(?)
For insert:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> m = Model(hash=func.HEX('asd'))
>>> session.add(m)
>>> session.commit()
INSERT INTO model (hash) VALUES (HEX(%s))
A better approach: Custom column that converts data by using sql functions
But I think the best option for you is a custom column type in SQLAlchemy that uses any of process_bind_param, process_result_value, bind_expression and column_expression (see this example).
The code below creates a custom column type that I think fits your needs:
from sqlalchemy.types import VARCHAR
from sqlalchemy import func

class HashColumn(VARCHAR):
    def bind_expression(self, bindvalue):
        # convert the bind's type from String to HEX encoded
        return func.HEX(bindvalue)

    def column_expression(self, col):
        # convert select value from HEX encoded to String
        return func.UNHEX(col)
You could model your table like:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Model(Base):
    __tablename__ = "model"
    id = Column(types.Integer, primary_key=True)
    col = Column(HashColumn(20))

    def __repr__(self):
        return "Model(col=%r)" % self.col
Some usage:
>>> (...)
>>> session = create_session(...)
>>> (...)
>>> model = Model(col='Iuri Diniz')
>>> session.add(model)
>>> session.commit()
this issues this query:
INSERT INTO model (col) VALUES (HEX(?)); -- ('Iuri Diniz',)
More usage:
>>> session.query(Model).first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
LIMIT ? ; -- (1,)
A bit more:
>>> session.query(Model).filter(Model.col == "Iuri Diniz").first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
WHERE model.col = HEX(?)
LIMIT ? ; -- ('Iuri Diniz', 1)
Extra: Custom column that converts data by using python types
Maybe you want to use some beautiful custom type and want to convert it between python and the database.
In the following example I convert UUID's between python and the database (the code is based on this link):
import uuid
from sqlalchemy.types import TypeDecorator, VARCHAR

class UUID4(TypeDecorator):
    """Portable UUID implementation

    >>> str(UUID4())
    'VARCHAR(36)'
    """
    impl = VARCHAR(36)

    def process_bind_param(self, value, dialect):
        if value is None:
            return value
        else:
            if not isinstance(value, uuid.UUID):
                return str(uuid.UUID(value))
            else:
                # hexstring
                return str(value)

    def process_result_value(self, value, dialect):
        if value is None:
            return value
        else:
            return uuid.UUID(value)
I wasn't able to get @iuridiniz's custom column solution to work because of the following error:
sqlalchemy.exc.StatementError: (builtins.TypeError) encoding without a string argument
For an expression like:
m = Model(col='FFFF')
session.add(m)
session.commit()
I solved it by overriding process_bind_param, which processes the parameter
before passing it to bind_expression for interpolation into your query language.
from sqlalchemy.types import VARCHAR
from sqlalchemy import func

class HashColumn(VARCHAR):
    def process_bind_param(self, value, dialect):
        # encode value as a binary
        if value:
            return bytes(value, 'utf-8')

    def bind_expression(self, bindvalue):
        # convert the bind's type from String to HEX encoded
        return func.HEX(bindvalue)

    def column_expression(self, col):
        # convert select value from HEX encoded to String
        return func.UNHEX(col)
And then defining the table is the same:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Model(Base):
    __tablename__ = "model"
    id = Column(types.Integer, primary_key=True)
    col = Column(HashColumn(20))

    def __repr__(self):
        return "Model(col=%r)" % self.col
I really like iuridiniz's approach ("A better approach: Custom column that converts data by using sql functions"), but I had trouble making it work when using BINARY and VARBINARY to store hex strings in MySQL 5.7. I tried different things, but SQLAlchemy kept complaining about the encoding and/or about func.HEX and func.UNHEX being used in contexts where they are not allowed. Using Python 3 and SQLAlchemy 1.2.8, I got it working by extending the base class and replacing its processors, so that SQLAlchemy does not need a database function to bind the data and compute the result; the conversion is done in Python instead, as follows:
import codecs
from sqlalchemy.types import VARBINARY

class VarBinaryHex(VARBINARY):
    """Extend VARBINARY to handle hex strings."""
    impl = VARBINARY

    def bind_processor(self, dialect):
        """Return a processor that decodes hex values."""
        def process(value):
            return codecs.decode(value, 'hex')
        return process

    def result_processor(self, dialect, coltype):
        """Return a processor that encodes hex values."""
        def process(value):
            return codecs.encode(value, 'hex')
        return process

    def adapt(self, impltype):
        """Produce an adapted form of this type, given an impl class."""
        return VarBinaryHex()
The idea is to replace HEX and UNHEX, which require DBMS intervention, with Python functions that do the same job: encoding and decoding a hex string. If you connect to the database directly you can use HEX and UNHEX, but from SQLAlchemy the codecs.encode and codecs.decode functions do the work for you.
I bet that, if anybody were interested, by writing the appropriate processors one could even handle the hex values as integers on the Python side, which would allow storing integers larger than BIGINT.
Some considerations:
BINARY could be used instead of VARBINARY if the length of the hex string is known.
Depending on what you are going to do, it might be worth lower- or upper-casing the string in the constructor of the class that uses this column type, so that you work with a consistent capitalization from the moment the object is initialized, i.e. 'aa' != 'AA' but 0xaa == 0xAA.
As said before, you could consider a processor that converts binary values from the database to Python integers.
When using VARBINARY, be careful because 'aa' != '00aa'.
If you use BINARY, say your column is col = Column(BinaryHex(length=4)), take into account that any value shorter than length bytes will be padded with zeros. That is, if you set obj.col = 'aabb' and commit it, what you get when you later retrieve it from the database is obj.col == 'aabb0000', which is quite different.
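The considerations above can be checked without a database. The following standard-library sketch mimics what the bind and result processors do; `hex_to_bytes` and `bytes_to_hex` are hypothetical helper names, and the zero padding only simulates the BINARY behaviour described above rather than querying MySQL. Note that in Python 3 codecs.encode(..., 'hex') returns bytes, so the sketch decodes the result to str.

```python
import codecs


def hex_to_bytes(hex_string):
    """What a bind processor would send to a binary column."""
    return codecs.decode(hex_string.encode("ascii"), "hex")


def bytes_to_hex(raw):
    """What a result processor could hand back, decoded from bytes to str."""
    return codecs.encode(raw, "hex").decode("ascii")


raw = hex_to_bytes("aabb")
print(raw, bytes_to_hex(raw))

# Simulated BINARY(4): values shorter than 4 bytes are right-padded with zero bytes.
padded = raw.ljust(4, b"\x00")
print(bytes_to_hex(padded))

# Case does not matter once decoded, but leading zeros do.
same_case = hex_to_bytes("AA") == hex_to_bytes("aa")
same_padding = hex_to_bytes("00aa") == hex_to_bytes("aa")
print(same_case, same_padding)

# Treating the stored bytes as an integer larger than a 64-bit BIGINT.
big = int.from_bytes(hex_to_bytes("ffffffffffffffffff"), "big")
print(big > 2**64 - 1)
```

Normalizing the case of the hex string before it reaches the column, as suggested above, keeps comparisons on the Python side consistent with what the database stores.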

FreqDist.plot not using the most common words

I am following the code in http://www.nltk.org/book/ch01.html specifically section 3.2.
from __future__ import division
import nltk
from nltk.book import *
import dateutil
import pyparsing
import numpy
import six
import matplotlib
fdist1 = FreqDist(text1)
fdist1.plot(50,cumulative=True)
is the script that I am running (the future import is leftover from some other stuff). However the plot I get does not match the one in the book. It has uncommon words such as funereal. I am running 32bit python 2.7 on windows. My friend who is running it on his mac runs the same commands and gets the plot from the book. I am at a complete loss as to what the difference may be. Thanks!
I've recently had the same problem.
NLTK builds its FreqDist upon the Counter class from Python's collections package. The plot() function (and the tabulate() function) extract the samples to use as follows:
samples = list(islice(self, *args))
Source: http://www.nltk.org/_modules/nltk/probability.html
I guess they're doing it that way to enable plotting of sub-sequences, such as the range between the 10th and the 40th sample, but it assumes that the Counter object (which is basically a dict) is sorted by frequency. Though that sometimes is the case, it does not always hold, as the Python documentation explicitly states. The correct replacement for the above line in the NLTK source would be:
samples = [item for item, _ in self.most_common(*args)]
That correct version can also be found in the most recent NLTK code on GitHub. It has been fixed in NLTK 3.0.0, so make sure that you aren't using an older version (like an alpha or beta version of NLTK 3).
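The difference between the two lines can be reproduced with a plain Counter. This is a minimal sketch with made-up tokens; on Python 3.7+ a Counter iterates in insertion order, while on older versions (such as the Python 2.7 in the question) the order is arbitrary, which is why the results can differ between machines.

```python
from collections import Counter
from itertools import islice

# Made-up tokens: the rare word happens to come first.
tokens = ["funereal", "the", "whale", "the", "sea", "the", "whale"]
counts = Counter(tokens)

# What the old plot() did: take the first n keys in iteration order.
by_iteration = list(islice(counts, 2))

# What the fixed version does: take the n most frequent samples.
by_frequency = [item for item, _ in counts.most_common(2)]

print(by_iteration)
print(by_frequency)
```

Only the second list is guaranteed to contain the most common words, regardless of how the underlying dict happens to be ordered.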
If you do not want to change the source of NLTK and cannot update it, you could easily adapt the plot() function yourself:
def plot_freqdist(fd, num=0, cumulative=False, title=None):
    import pylab

    # Set up parameters: default to all distinct samples (B is a method)
    if num <= 0:
        num = fd.B()

    # Get samples and frequencies
    samples, freq, accu = [], [], 0
    for s, f in fd.most_common(num):
        accu = accu + f if cumulative else f
        samples.append(s)
        freq.append(accu)

    # Create plot
    pylab.grid(True, color='silver')
    if title:
        pylab.title(title)
    pylab.plot(freq, linewidth=2)
    pylab.xticks(range(len(samples)), samples, rotation=90)
    pylab.xlabel('Samples')
    pylab.ylabel('Cumulative Counts' if cumulative else 'Counts')
    pylab.show()