Sales prediction with CatBoost regression (CatBoostRegressor): predictions don't change for unseen seasons

import pandas as pd

d = {'customer': ['A','B','C','A'], 'season': [1,2,3,4],
     'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
     'cat2': ['high','low','high','medium'], 'sale': [10,20,15,50]}
df = pd.DataFrame(data=d)
df
Desired output for season 5:
d = {'customer': ['A','B','C','A'], 'season': [5,5,5,5],
     'cat1': ['BAGS','TSHIRT','DRESS','BELT'],
     'cat2': ['high','low','high','medium'], 'sale': [?,?,?,?]}
df = pd.DataFrame(data=d)
df
I tried:
import numpy as np
from sklearn.model_selection import train_test_split

df = df.groupby(['customer','season','cat1','cat2'])['sale'].sum().sort_values(ascending=False).reset_index()
X = df[['customer','season','cat1','cat2']]
y = df[['sale']]
X.season = X.season.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.85, random_state=42)
categorical_features_indices = np.where(X.dtypes != float)[0]  # np.float was removed in NumPy 1.24; use the builtin float
import catboost
from catboost import MetricVisualizer, Pool, CatBoostRegressor, cv
train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features_indices)
val_pool = Pool(data=X_val, label=y_val, cat_features=categorical_features_indices)
test_pool = Pool(data=X_test, label=y_test, cat_features=categorical_features_indices)
params = {
'iterations':900,
'loss_function': 'RMSE',
'learning_rate': 0.0109,
'depth': 6,
'l2_leaf_reg': 6,
'border_count': 7,
'thread_count': 7,
'bagging_temperature': 2,
'random_strength': 2.23,
'colsample_bylevel': 0.85,
'custom_metric': ['MAPE', 'R2'],
'eval_metric': 'R2',
'random_seed': 41,
'max_ctr_complexity': 2,
'logging_level': 'Silent',
'use_best_model': False
}
reg_model = CatBoostRegressor(**params)
reg_model.fit(train_pool, eval_set=val_pool, plot=True, verbose=100)
X['season'] = 5
X['Predict_sales'] = reg_model.predict(X)
The above code throws no error.
My question is: the predicted values don't change whether I input season 5, 6, 7, or 8, even though season is a continuous value. What am I doing wrong, and how can I predict for seasons 6, 7, 8, and so on?

CatBoost is a tree-based model. Regression trees (as well as decision trees) partition the feature space, and every point within a partition is assigned the same value. Since none of seasons 5, 6, 7, or 8 occurred in the training data, they all land in the same partition and hence yield exactly the same prediction.
You might need to go for another model type (e.g. linear regression). What kind of relationship would you expect between season and sales? Predicting on something you haven't seen in your training data is always hard (except when there is something like a linear relationship).
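A minimal sketch (toy numbers, with season as the only feature) of the difference between the two model families:

import numpy as np
from catboost import CatBoostRegressor
from sklearn.linear_model import LinearRegression

# Seasons 1-4 and their sales, as in the question's toy data.
X = np.array([[1], [2], [3], [4]], dtype=float)
y = np.array([10, 20, 15, 50], dtype=float)

tree = CatBoostRegressor(iterations=100, verbose=False).fit(X, y)
lin = LinearRegression().fit(X, y)

X_future = np.array([[5], [6], [7], [8]], dtype=float)
print(tree.predict(X_future))  # one and the same value for every unseen season
print(lin.predict(X_future))   # extrapolates the fitted linear trend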

Related

Building neural network using k-fold cross validation

I am new to deep learning and am trying to implement a neural network with 4-fold cross-validation for training, testing, and validation. The task is to classify vehicles using an existing dataset.
The accuracy result is 0.7.
(Screenshots of the training accuracy and an example epoch output are omitted here.)
I also don't know whether the code is correct or what to do to increase the accuracy.
Here is the code:
!pip install category_encoders
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
import category_encoders as ce
from category_encoders import OrdinalEncoder
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety','Evaluation']
# Extract the features and labels from the dataset
X = car_data.drop(['Evaluation'], axis=1)
Y = car_data['Evaluation']
encoder = ce.OrdinalEncoder(cols=['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety'])
X = encoder.fit_transform(X)
X = X.to_numpy()
Y_df = pd.DataFrame(Y, columns=['Evaluation'])
encoder = OrdinalEncoder(cols=['Evaluation'])
Y_encoded = encoder.fit_transform(Y_df)
Y = Y_encoded.to_numpy()
input_layer = tf.keras.layers.Input(shape=(X.shape[1],))
# Define the hidden layers
hidden_layer_1 = tf.keras.layers.Dense(units=64, activation='relu', kernel_initializer='glorot_uniform')(input_layer)
hidden_layer_2 = tf.keras.layers.Dense(units=32, activation='relu', kernel_initializer='glorot_uniform')(hidden_layer_1)
# Define the output layer
output_layer = tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_initializer='glorot_uniform')(hidden_layer_2)
# Create the model
model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
# Initialize the 4-fold cross-validation
kfold = KFold(n_splits=4, shuffle=True, random_state=42)
# Initialize a list to store the scores
scores = []
quality_weights= []
# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'],
              sample_weight_mode='temporal')
for train_index, test_index in kfold.split(X, Y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    # Fit the model on the training data (keep the history for plotting)
    history = model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
    # Evaluate the model on the test data
    score = model.evaluate(X_test, Y_test)
    # Append the accuracy to the scores list
    scores.append(score[1])
    plt.plot(history.history['accuracy'])
    plt.title('Model Training Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train'], loc='upper left')
    plt.show()
# Print the mean and standard deviation of the scores
print(f'Mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')
The first thing that caught my attention was here:
model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
Your quality_weights should be a numpy array with one weight per input sample.
Refer here: https://keras.io/api/models/model_training_apis/#fit-method
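For example, a minimal sketch assuming uniform weights (and the default sample_weight_mode rather than 'temporal'):

quality_weights = np.ones(len(X_train))  # one weight per sample; all ones means no reweighting
model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)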
If changing that doesn't seem to help, then maybe your network isn't learning from the data. A few possible reasons:
The network is a bit too shallow. Try adding just one more hidden layer to see if that improves anything (see the sketch below).
From the code I can't see the size of your input data. Does it have enough data points for 4-fold cross-validation? Can you somehow augment the data?
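A sketch of the extra-layer suggestion (the added layer and its width are illustrative, not tuned):

import tensorflow as tf

input_layer = tf.keras.layers.Input(shape=(6,))  # 6 encoded features, as in the question
h1 = tf.keras.layers.Dense(64, activation='relu')(input_layer)
h_extra = tf.keras.layers.Dense(48, activation='relu')(h1)  # the added hidden layer
h2 = tf.keras.layers.Dense(32, activation='relu')(h_extra)
output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(h2)
model = tf.keras.Model(inputs=input_layer, outputs=output_layer)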

Neural network predicts very poorly though it has high accuracy

I am working on an RNN. After training, I got high accuracy on the test data set, but when I make a prediction with some external data it predicts very poorly. I also used the same data set, which has over 300,000 texts and 57 classes, on an artificial neural network, and it still predicted very poorly. When I tried the same data set with a classic machine learning model, it worked fine.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, LSTM, BatchNormalization, Embedding
df = pd.read_excel("data.xlsx", usecols=["X", "y"])
df = df.sample(frac = 1)
X = np.array(df["X"])
y = np.array(df["y"])
le = LabelEncoder()
y = le.fit_transform(y)
y = y.reshape(-1,1)
encoder = OneHotEncoder(sparse=False)
y = encoder.fit_transform(y)
num_words = 100000
token = Tokenizer(num_words=num_words)
token.fit_on_texts(X)
seq = token.texts_to_sequences(X)
X = sequence.pad_sequences(seq, padding = "pre", truncating = "pre")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = Sequential()
model.add(Embedding(num_words, 96, input_length = X.shape[1]))
model.add(LSTM(108, activation='relu', dropout=0.1, recurrent_dropout = 0.2))
model.add(BatchNormalization())
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="rmsprop", metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs=4, batch_size=64, validation_data = (X_test, y_test))
loss, accuracy = model.evaluate(X_test, y_test)
(The training history plots of the model are omitted here.)
After doing some research, I realized that the model was actually working fine. The problem was using the Keras Tokenizer wrongly.
At the end of the code, I used the following:
sentence = ["Example Sentence to Make Prediction."]
token.fit_on_texts(sentence) # <- this line is the problem
seq = token.texts_to_sequences(sentence)
cx = sequence.pad_sequences(seq, maxlen=X.shape[1])
sx = np.argmax(model.predict(cx), axis=1)
The problem occurs because fitting the Tokenizer again on new data changes its word index, so the resulting sequences no longer match what the model was trained on. Removing that line solved the problem for me.
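In other words, the corrected inference block simply reuses the tokenizer as it was fitted on the training texts:

sentence = ["Example Sentence to Make Prediction."]
seq = token.texts_to_sequences(sentence)  # no fit_on_texts here
cx = sequence.pad_sequences(seq, maxlen=X.shape[1])
sx = np.argmax(model.predict(cx), axis=1)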

Linear Regression using sklearn

I have a model fitted with data but having trouble using the predict function.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

d = {'df_Size': [1, 3, 5, 8, 10, 15, 18], 'RAM': [3676, 6532, 9432, 13697, 16633, 23620, 27990]}
df = pd.DataFrame(data=d)
df
X = np.array(df['df_Size']).reshape(-1, 1)
y = np.array(df['RAM']).reshape(-1, 1)
model = LinearRegression()
model.fit(X, y)
print(model.score(X, y))  # the model is named `model`, not `regr`
then when I try to predict on
X_Size = 25
X_Size
prediction = model.predict(X_Size)
I get the following error
ValueError: Expected 2D array, got scalar array instead:
array=25.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I think I am passing the 25 in the wrong format, but I would just like some help getting the predicted RAM for a df_Size of 25.
Thanks,
You need to pass the predictor in the same shape it was fitted on (basically one column):
X.shape
Out[11]: (7, 1)
You can do:
model.predict(np.array(25).reshape(1,1))
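Equivalently, and perhaps more readable, build the 2-D input directly (one row per sample, one column per feature):

prediction = model.predict(np.array([[25]]))
print(prediction)  # predicted RAM for df_Size = 25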

How to use an RNN to predict the next 4 timesteps using 6 timesteps

I have a dataset with 6 data points plus 4 data points as labels, and the task is to predict those 4 timesteps from the 6 given ones.
Can you please advise me which model to use and how? I thought about some kind of RNN, since there is a time associated with each point.
Thanks!
These sorts of problems, where the predictions depend on previous inputs, generally use RNN networks (RNN, GRU, and LSTM), as they retain the previous state information.
For a deeper understanding:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Please also go through the comments I have written in the code.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow.keras import Model
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import RNN, LSTM
"""
creating a toy dataset
lets use this below ```input_sequence``` as the sequence to make data points.
as per the question, we will use 6 points to predict next 4 points
"""
input_sequence = [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10]
X_train = []
y_train = []
# The first 6 points are an input and the next 4 points are its label;
# we then shift by 1 and build such input/label pairs across the sequence.
for i in range(len(input_sequence) - 9):
    X_train.append(input_sequence[i:i+6])
    y_train.append(input_sequence[i+6:i+10])
X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.int32)
# X_test for the predictions (contains 6 points)
X_test = np.array([[8,9,10,1,2,3]], dtype=np.float32)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
# A basic LSTM accepts input shaped [num_inputs, time_steps, data_points], so reshape accordingly.
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
x_points = X_train.shape[-1]
print("one input contains {} points".format(x_points))
model = Sequential()
model.add(LSTM(4, input_shape=(1, x_points)))
model.add(Dense(4))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
model.fit(X_train, y_train, epochs=500, batch_size=5, verbose=2)
output = list(map(np.ceil, model.predict(X_test)))
print(output)
We have used a simple model here; it can be improved further to get better results.

How to reshape my input to feed it into 1D Convolutional layer for sequence classification?

I have a csv file with 339732 rows and 30 columns:
the first 29 being feature values, i.e. X
the last being a binary label value, i.e. Y
import pandas as pd
from sklearn.model_selection import train_test_split

dataframe = pd.read_csv("features.csv", header=None)
dataset = dataframe.values
X = dataset[:, 0:29].astype(float)
Y = dataset[:, 29]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)  # note the return order: X_train, X_test, y_train, y_test
I am trying to train it on a 1D convolutional layer:
model = Sequential()
model.add(Conv1D(64, 3, activation='relu', input_shape=(X_train.shape[0], 29)))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(Conv1D(128, 3, activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=16, epochs=2)
score = model.evaluate(X_test, y_test, batch_size=16)
Since, the Conv1D layer expects a 3-D input, I transformed my input as follows:
X_train = np.reshape(X_train, (1, X_train.shape[0], X_train.shape[1]))
X_test = np.reshape(X_test, (1, X_test.shape[0], X_test.shape[1]))
However, this still throws error:
ValueError: Negative dimension size caused by subtracting 3 from 1 for 'conv1d_1/convolution/Conv2D' (op: 'Conv2D') with input shapes: [?,1,1,29], [1,3,29,64].
Is there any way to feed my input correctly?
As far as I know, a 1D convolution layer accepts inputs of the form batch_size x width x channels. You are reshaping with
X_train = np.reshape(X_train, (1, X_train.shape[0], X_train.shape[1]))
but X_train.shape[0] is your batch size, I guess, so I think the problem is somewhere here. Can you please tell us what the shape of X_train is before the reshape?
You have to think about whether your data has some progression relation between the 339732 entries or between the 29 features, i.e. whether the order matters. If not, I don't think a CNN is suitable for this case.
If the 29 features "indicate the progression of something":
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
If the 29 features are independent, then they are like the channels of an image, but it doesn't make sense to convolve with a width of only 1:
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
If you want to treat the 339732 entries as blocks where the order matters (clip the 339732 or add zero padding so the count is divisible by timesteps):
X_train = X_train.reshape((int(X_train.shape[0]/timesteps), timesteps, X_train.shape[1]))
A runnable sketch of the first option follows.
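A minimal sketch of the first option (synthetic data standing in for features.csv; sizes are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense

# Synthetic stand-in for the real data: n samples, 29 features each.
X_train = np.random.rand(1000, 29)
y_train = np.random.randint(0, 2, size=1000)

# Treat the 29 features as 29 "timesteps" with a single channel.
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))

model = Sequential()
model.add(Conv1D(64, 3, activation='relu', input_shape=(29, 1)))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(GlobalAveragePooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=16, epochs=2)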