GridSearchCV without Cross Validation (cv = 1)

I have a special dataset that can be trained to about 1% error. I need to do hyperparameter tuning for MLPRegressor without splitting off part of the training set, meaning cv = 1. Is this possible with GridSearchCV?

One of the accepted options for the cv parameter is:
An iterable yielding (train, test) splits as arrays of indices.
So if you have an X input matrix, a y target vector, an mlp estimator, and a params grid, you can do just one train-test split.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV

# Build one explicit (train, test) split and pass it to GridSearchCV as the only "fold".
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2)
clf = GridSearchCV(mlp, params, cv=[(train_idx, test_idx)])
But keep in mind that tuning hyperparameters against a single split is bad practice: with no averaging over folds, a large grid search will simply overfit that one test set, so don't take many steps with such a search.
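For completeness, here is a minimal end-to-end sketch of the same idea with MLPRegressor; the synthetic data and the tiny parameter grid below are placeholders, not taken from the question.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for the question's X and y.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X.sum(axis=1) + 0.01 * rng.randn(200)

# One explicit split, passed to GridSearchCV as its only "fold".
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

params = {'hidden_layer_sizes': [(16,), (32,)], 'alpha': [1e-4, 1e-3]}  # hypothetical grid
clf = GridSearchCV(MLPRegressor(max_iter=2000, random_state=0), params,
                   cv=[(train_idx, test_idx)])
clf.fit(X, y)
print(clf.best_params_, clf.best_score_)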

Related

Trashword Identification with Sklearn for NLTK

For an NLTK project I want to identify, in bulk, whether a new word is likely a "trash" word or a meaningful word. It fits the architecture to do this in an early phase, so I can proceed with "true" words later; hence this negative approach of identifying the non-meaningful words first.
For this I have a training set of words labeled word/trash.
I work with single characters for the labeling and get the error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
The code I use is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

talk_data = pd.read_excel('nn_data_stack.xlsx', sheet_name='label2')

# Split each word into its characters, then join the characters with spaces.
talk_data['word_char_list'] = [[*x] for x in talk_data['word'].astype(str)]
talk_data['word_char'] = [','.join(map(str, l)) for l in talk_data['word_char_list']]
talk_data['word_char'].replace(',', ' ', regex=True, inplace=True)

z = talk_data['word_char']
y = talk_data['class']
z_train, z_test, y_train, y_test = train_test_split(z, y, test_size=0.2)

cv = CountVectorizer()
features = cv.fit_transform(z_train)  # this line raises the ValueError above
An example of the data set I have for training:
word         class
drawing      word
to be        word
龚莹author    trash
ï½°c         trash
Do I need to use an alternative to CountVectorizer?
I think I need to move to character embeddings to ensure proper input - but how?
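As a side note on the error itself: CountVectorizer's default token_pattern keeps only tokens of two or more word characters, so single-character tokens are all discarded and the vocabulary ends up empty. A minimal sketch of two possible workarounds, using made-up sample strings rather than the question's data:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['d r a w i n g', 't o']  # hypothetical space-joined character strings

# Option 1: relax the token pattern so single characters count as tokens.
cv1 = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(cv1.fit_transform(docs).shape)

# Option 2: skip the manual splitting and tokenize at the character level directly.
cv2 = CountVectorizer(analyzer='char')
print(cv2.fit_transform(['drawing', 'to']).shape)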

Multiplying matrix elements: creating a function and using repetition structures

I have the following exercise:
Make a function that takes two parameters, a NumPy matrix and a constant, uses repetition structures to multiply each element of the matrix by the constant, and returns the multiplied matrix.
I did it just using repetition structures:
import numpy as np

np.random.seed(0)
matriz = np.random.randint(1, 30, (3, 4))
constante = 4
for i in matriz:
    print(i * constante)
Can anyone help me solve it properly? Thanks!
The question (which is a terrible exercise, you'll see why in a minute) is trying to help you get used to for loops. The array you're working with has two dimensions, so you need to loop over row indices i and column indices j to access each of the entries in the matrix. You can get the number of rows and columns of the array with the .shape attribute, so you don't have to know the shape of the array beforehand.
Here's one solution that modifies the matrix in-place, meaning the original matrix gets overwritten.
>>> def scale_matrix(A, c):
...     for i in range(A.shape[0]):       # Loop over row indices.
...         for j in range(A.shape[1]):   # Loop over column indices.
...             A[i,j] = c * A[i,j]       # Multiply the entry and store it in the same spot.
...     return A
BUT
This is a futile exercise: when you multiply a NumPy array by a constant, NumPy already multiplies each entry of the array for you. No looping needed, and the vectorized version is both faster and much more readable.
>>> import numpy as np
>>> A = np.array([[1, 2],
...               [3, 4]])
>>> A * 2
array([[2, 4],
       [6, 8]])
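And if the exercise still requires a function with the stated signature, the vectorized version can be wrapped in one; unlike the loop version above, this returns a new array instead of overwriting A:

def scale_matrix_fast(A, c):
    # Vectorized: NumPy multiplies every entry of A by c in one operation.
    return A * c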

Trying to get CSV ready for keras model with tensorflow dataset

I have a Keras CNN model ready that expects [None, 20, 20, 3] arrays as input (20 is the image size here). On the other side I have a CSV with 1200 (20*20*3) columns ready in my cloud storage.
I want to write an ETL pipeline with TensorFlow that yields a [20, 20, 3] tensor for each row in the CSV.
I have spent days of work on this already and feel confident that this approach can work out in the end. My code so far:
import tensorflow as tf
BATCH_SIZE = 30
tf.enable_eager_execution()
X_csv_path = 'gs://my-bucket/dataX.csv'
X_dataset = tf.data.experimental.make_csv_dataset(X_csv_path, BATCH_SIZE, column_names=range(1200), header=False)
X_dataset = X_dataset.map(lambda x: tf.stack(list(x.values())))
iterator = X_dataset.make_one_shot_iterator()
image = iterator.get_next()
I would expect a [30, 1200] shape, but I still get 1200 tensors of shape [30] instead. My idea is to read every line into a [1200]-shaped tensor and then reshape it to a [20, 20, 3] tensor to feed my model. Thanks for your time!
tf.data.experimental.make_csv_dataset creates an OrderedDict of column tensors, which is why you see one [30] tensor per column. For your task I'd use tf.data.TextLineDataset instead.
def parse(line):
    # Split one CSV line on commas into a 1-D tensor of string fields.
    return tf.strings.split([line], sep=',').values

dataset = tf.data.TextLineDataset('sample.csv').map(parse).batch(BATCH_SIZE)
for i in dataset:
    print(i)
This outputs tensors of shape (BATCH_SIZE, row_length), where row_length is the number of fields in a CSV row. You can apply any additional preprocessing, depending on your task.
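To get from there to the [20, 20, 3] tensors the model expects, one option is to convert the string fields to floats and reshape inside the pipeline. A sketch, assuming every row holds exactly 1200 numeric fields:

def parse_image(line):
    fields = tf.strings.split([line], sep=',').values
    values = tf.strings.to_number(fields, out_type=tf.float32)  # strings -> floats
    return tf.reshape(values, [20, 20, 3])                      # one image per row

dataset = tf.data.TextLineDataset('sample.csv').map(parse_image).batch(BATCH_SIZE)
# Each batch now has shape [BATCH_SIZE, 20, 20, 3], matching the model's input.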

Keras Seq2Seq Introduction

A Keras introduction to the Seq2Seq model was published a few weeks ago and can be found here. I do not really understand one part of this code:
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
Here the decoder_lstm is defined. It is a layer of dimension latent_dim. We use the states of the encoder as the initial_state of the decoder.
What I do not understand is why a dense layer is added after the LSTM layer and why it works.
The decoder is supposed to return the whole sequence because of return_sequences=True, so how is it possible that adding a dense layer after it works?
I guess I am missing something here.
Although the common case feeds 2D data (batch, dim) into dense layers, in newer versions of Keras you can also use 3D data (batch, timesteps, dim).
If you don't flatten this 3D data, your Dense layer behaves as if it were applied independently to each time step, and you get outputs of shape (batch, timesteps, dense_units).
You can check the two little models below and confirm that, independently of the time steps, both Dense layers have the same number of parameters, showing that their weights act only on the last dimension.
from keras.layers import *
from keras.models import Model
import keras.backend as K

# Model with time steps.
inp = Input((7,12))
out = Dense(5)(inp)
model = Model(inp,out)
model.summary()

# Model without time steps.
inp2 = Input((12,))
out2 = Dense(5)(inp2)
model2 = Model(inp2,out2)
model2.summary()
The result will show 65 (12*5 + 5) parameters in both cases.
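To see the per-timestep behaviour directly, you can also push a dummy batch through the first model and inspect the output shape (a quick sanity check, reusing the model defined above):

import numpy as np

x = np.random.rand(3, 7, 12)   # batch of 3 sequences, 7 time steps, 12 features
y = model.predict(x)
print(y.shape)                 # (3, 7, 5): Dense applied at every time step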

Why does random forest regression return a very bad result?

I'm trying to use RandomForestRegressor in scikit-learn to model some data. After processing my raw data, the data I fed to RandomForestRegressor is as follows.
What follows is only a small part of my data; in fact, there are around 6000 rows.
Note that the first column is the DatetimeIndex of my DataFrame final_data, which contains all the data. In addition, the data in column4 were strings; I converted them to numbers with a map function.
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Summer and winter subsets of the full DataFrame.
S_dataset1 = final_data[(final_data.index >= pd.to_datetime('20160403')) &
                        (final_data.index <= pd.to_datetime('20161002'))]
S_dataset2 = final_data[(final_data.index >= pd.to_datetime('20170403')) &
                        (final_data.index <= pd.to_datetime('20170901'))]
W_dataset = final_data[(final_data.index >= pd.to_datetime('20161002')) &
                       (final_data.index <= pd.to_datetime('20170403'))]
S_dataset = pd.concat([S_dataset1, S_dataset2])

A = W_dataset.iloc[:, :8]
B = W_dataset.loc[:, 'col20']
W_data = pd.concat([A, B], axis=1)
X = W_data.iloc[:, :].values
y = W_dataset['col9'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
forest = RandomForestRegressor(n_estimators=1000, criterion='mse',
                               random_state=1, n_jobs=-1)
forest.fit(X_train, y_train)
y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('R^2 train: %.3f, test: %.3f' % (r2_score(y_train, y_train_pred),
                                       r2_score(y_test, y_test_pred)))
Here is my code for predicting col9. I separated final_data into two seasons, which I hoped would make the prediction more accurate. However, the result is very bad: the R^2 score on the training set is around 0.9, but on the test set it is only around 0.25. I really don't know why I get such a bad result. Could someone tell me where I went wrong and how I can improve my model? Many thanks!
I think the problem was that I didn't consider the effect of the datetime on the prediction. After converting the DatetimeIndex values to numbers and feeding them to my model as features, I got quite a good result: the R^2 score is around 0.95-0.98.
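The answer doesn't show the conversion itself; a minimal sketch of one way it could be done (the exact feature choice here is a guess, not the poster's code):

import numpy as np

# Turn the DatetimeIndex into numeric columns the forest can split on.
W_data = W_data.copy()
W_data['timestamp'] = W_dataset.index.astype(np.int64)   # nanoseconds since the epoch
W_data['dayofyear'] = W_dataset.index.dayofyear          # captures seasonality
X = W_data.values                                        # features now include time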