Difference in model feature importance and SHAP summary plot - catboost

I have been playing around with a toy dataset to better understand the shap library and its usage. I found that the feature importances from a CatBoost regressor model differ from the feature importances shown by summary_plot in the shap library.
I am comparing the feature importance from model.feature_importances_ (computed on the X_train set) with the summary plot from the SHAP explainer on the X_test set.
Here is my source code:
import catboost
from catboost import *
import shap
shap.initjs()
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X,y = shap.datasets.boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Train Model
model = CatBoostRegressor(iterations=300, learning_rate=0.1, random_seed=123)
model.fit(X_train, y_train, verbose=False, plot=False)
# Compute feature importance dataframe
feat_imp_list = list(zip ( list(model.feature_importances_) , model.feature_names_) )
feature_imp_df = pd.DataFrame(sorted(feat_imp_list, key=lambda x: x[0], reverse=True) , columns = ['feature_value','feature_name'])
feature_imp_df
# Run shap explainer on X_test set
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
Why does DIS show up at rank 3 in the model's feature importance plot but at rank 7 in the summary plot from the SHAP library?

Feature importances are always non-negative, whereas SHAP values are signed contributions that each feature makes to an individual prediction (they can be negative or positive).
Both give you results in descending order:
- In feature importance, the values start from the maximum and go down to the minimum. CatBoost normalizes its default importances so that they sum to 100 (i.e. 100%).
- For SHAP values, the summary plot orders features by their mean absolute SHAP value over the rows you pass in, from highest to lowest. The signed values themselves can sum to any real number.
So the two rankings measure different things (and here they are also computed on different sets, X_train versus X_test), which is why a feature such as DIS can land at different ranks.
P.S. For a linear model, the SHAP value of a feature is closely related to its coefficient, so comparing SHAP values with the coefficients of a logistic regression model can help build intuition.
Cheers!
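To make the ranking behind a SHAP summary plot more concrete, here is a minimal sketch, assuming the plot orders features by their mean absolute SHAP value. The matrix and feature names below are invented for illustration; they are not output from the CatBoost model in the question.

```python
import numpy as np

# Hypothetical SHAP matrix: rows are samples, columns are features.
# These numbers are made up purely for illustration.
feature_names = ["A", "B", "C"]
shap_values = np.array([
    [ 2.0, -0.5, 0.1],
    [-2.0,  0.6, 0.1],
    [ 1.5, -0.4, 0.2],
    [-1.5,  0.5, 0.2],
])

# Signed SHAP values can cancel out when averaged over samples ...
mean_signed = shap_values.mean(axis=0)
# ... so a global ranking uses the mean absolute value per feature instead.
mean_abs = np.abs(shap_values).mean(axis=0)

# Rank features from most to least important.
order = np.argsort(mean_abs)[::-1]
ranking = [feature_names[i] for i in order]

# Rescale to percentages so the numbers are on the same scale as
# importances that sum to 100.
mean_abs_pct = 100 * mean_abs / mean_abs.sum()

print("mean signed SHAP:", mean_signed)
print("ranking by mean |SHAP|:", ranking)
print("as percentages:", mean_abs_pct)
```

With real data, shap_values would be the array returned by explainer.shap_values(X_test), and the same two lines (np.abs(...).mean(axis=0) followed by a sort) give a table that can be put side by side with model.feature_importances_.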

Building neural network using k-fold cross validation

I am new to deep learning and am trying to implement a neural network using 4-fold cross-validation for training, testing, and validation. The task is to classify vehicles using an existing dataset.
The resulting accuracy is 0.7.
[Figure: training accuracy]
[Figure: example output for epochs]
I also don't know whether the code is correct or what to do to increase the accuracy.
Here is the code:
!pip install category_encoders
import tensorflow as tf
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from tensorflow import keras
import category_encoders as ce
from category_encoders import OrdinalEncoder
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety','Evaluation']
# Extract the features and labels from the dataset
X = car_data.drop(['Evaluation'], axis=1)
Y = car_data['Evaluation']
encoder = ce.OrdinalEncoder(cols=['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety'])
X = encoder.fit_transform(X)
X = X.to_numpy()
Y_df = pd.DataFrame(Y, columns=['Evaluation'])
encoder = OrdinalEncoder(cols=['Evaluation'])
Y_encoded = encoder.fit_transform(Y_df)
Y = Y_encoded.to_numpy()
input_layer = tf.keras.layers.Input(shape=(X.shape[1]))
# Define the hidden layers
hidden_layer_1 = tf.keras.layers.Dense(units=64, activation='relu', kernel_initializer='glorot_uniform')(input_layer)
hidden_layer_2 = tf.keras.layers.Dense(units=32, activation='relu', kernel_initializer='glorot_uniform')(hidden_layer_1)
# Define the output layer
output_layer = tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_initializer='glorot_uniform')(hidden_layer_2)
# Create the model
model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
# Initialize the 4-fold cross-validation
kfold = KFold(n_splits=4, shuffle=True, random_state=42)
# Initialize a list to store the scores
scores = []
quality_weights= []
# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'],
              sample_weight_mode='temporal')
for train_index, test_index in kfold.split(X,Y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    # Fit the model on the training data
    model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
    # Evaluate the model on the test data
    score = model.evaluate(X_test, Y_test)
    # Append the score to the scores list
    scores.append(score[1])
plt.plot(history.history['accuracy'])
plt.title('Model Training Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()
# Print the mean and standard deviation of the scores
print(f'Mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')
The first thing that caught my attention was here:
model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
Your quality_weights should be a NumPy array with one weight per training sample (i.e. the same length as X_train), not an empty list.
Refer here: https://keras.io/api/models/model_training_apis/#fit-method
If changing that doesn't seem to help, then your network may simply not be learning from the data. A few possible reasons:
- The network is a bit too shallow. Try adding just one more hidden layer to see if that improves anything.
- From the code I can't see the size of your input data. Does it have enough data points for 4-fold cross-validation? Can you somehow augment the data?
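As a rough illustration of what a per-sample weight array could look like, here is a minimal NumPy sketch. It assumes you want to weight each class inversely to its frequency; that weighting scheme, the label values, and the names class_weight and sample_weights are illustrative choices, not something taken from the question.

```python
import numpy as np

# Hypothetical integer class labels for one training fold (illustrative data).
y_train = np.array([0, 0, 0, 0, 1, 1, 2, 3])

# Weight each class inversely to its frequency so rare classes count more.
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {c: len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}

# One weight per training sample: a 1D array with the same length as y_train.
sample_weights = np.array([class_weight[c] for c in y_train], dtype=np.float32)

print(sample_weights.shape)
print(sample_weights)
```

An array like this would be recomputed inside the cross-validation loop for each fold and passed as model.fit(X_train, Y_train, sample_weight=sample_weights). If you do not actually need weighting, simply dropping the sample_weight argument is the simpler option.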

How to increase Emotion Detection Validation Accuracy on VGG16 model ? [Transfer Learning]

import pandas as pd
import numpy as np
import keras
import tensorflow
from keras.models import Model
from keras.layers import Dense
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image
trdata = ImageDataGenerator()
traindata = trdata.flow_from_directory(directory="path",target_size=(224,224))
tsdata = ImageDataGenerator()
testdata = tsdata.flow_from_directory(directory="path", target_size=(224,224))
from keras.applications.vgg16 import VGG16
vggmodel = VGG16(weights='imagenet', include_top=True)
vggmodel.summary()
for layers in (vggmodel.layers)[:19]:
    print(layers)
    layers.trainable = False
#flatten_out = tensorflow.keras.layers.Flatten()(vggmodel.output)
#fc1 = tensorflow.keras.layers.Dense(units=4096,activation="relu")(flatten_out)
#fc2 = tensorflow.keras.layers.Dense(units=4096,activation="relu")(fc1)
#fc3 = tensorflow.keras.layers.Dense(units=256,activation="relu")(fc2)
#predictions = tensorflow.keras.layers.Dense(units=3, activation="softmax")(fc3)
X= vggmodel.layers[-2].output
predictions = Dense(units=3, activation="softmax")(X)
model_final = Model(vggmodel.input, predictions)
model_final.compile(loss = "categorical_crossentropy", optimizer = optimizers.SGD(lr=0.001, momentum=0.9), metrics=["accuracy"])
model_final.summary()
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping
checkpoint = ModelCheckpoint("vgg16_1.h5", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=40, verbose=1, mode='auto')
model_final.fit_generator(generator= traindata, steps_per_epoch= 95, epochs= 100, validation_data= testdata, validation_steps=7, callbacks=[checkpoint,early])
I am classifying emotion as positive, negative, or neutral.
I am using the VGG16 transfer learning model.
However, I am still not getting better validation accuracy.
Things I've tried:
- increasing the amount of training data
- layers.trainable = False/True
- learning rate: 0.0001, 0.001, 0.01
- activation function: relu/softmax
- batch size: 64
- optimizers: adam/sgd
- loss function: categorical_crossentropy / sparse_categorical_crossentropy
- momentum: 0.09 / 0.9
I also tried converting my dataset to grayscale, and somehow it gave better accuracy than the color images, but it is still not satisfactory.
I also changed my code to add dropout layers, but still no progress.
I tried the FER2013 dataset, and it gave me pretty decent accuracy.
These are the results on the FER dataset:
accuracy: 0.9997 - val_accuracy: 0.7105
But on my own dataset (which is pretty good), validation accuracy does not go above 66%.
What else can I do to increase val_accuracy?
I think your model is more complex than necessary. I would remove the fc1 and fc2 layers, include regularization in the fc3 layer, and add a dropout layer after fc3. In your early stopping callback, change patience to 4. I also recommend using the Keras callback ReduceLROnPlateau. The full recommendations are in the code below:
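# Reuses vggmodel, traindata, testdata, checkpoint and early from the question
from keras import regularizers
from keras.layers import Dense, Dropout, Flatten
from keras.callbacks import ReduceLROnPlateau
flatten_out = Flatten()(vggmodel.output)
# Single regularized dense layer (units=256 is taken from fc3 in the question)
fc3 = Dense(units=256, kernel_regularizer=regularizers.l2(0.016),
            activity_regularizer=regularizers.l1(0.006),
            bias_regularizer=regularizers.l1(0.006), activation='relu')(flatten_out)
# Dropout after fc3
x = Dropout(rate=.4, seed=123)(fc3)
predictions = Dense(units=3, activation="softmax")(x)
# Rebuild and recompile the model so the new head is actually used
model_final = Model(vggmodel.input, predictions)
model_final.compile(loss="categorical_crossentropy", optimizer=optimizers.SGD(lr=0.001, momentum=0.9), metrics=["accuracy"])
# Reduce the learning rate when the validation loss stops improving
rlronp = ReduceLROnPlateau(monitor='val_loss',
                           factor=0.4, patience=2,
                           verbose=0, mode='auto')
callbacks = [rlronp, checkpoint, early]
model_final.fit_generator(generator= traindata, steps_per_epoch= 95, epochs= 100, validation_data= testdata, validation_steps=7, callbacks=callbacks)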
I do not like VGG: it is a very large model and is a bit old and slow. I think you will get better and faster results using EfficientNet models; EfficientNetB3 should work fine.
If you want to try that, get rid of all the VGG code and use:
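import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adamax
lr=.001
img_size=(256,256)
img_shape=(img_size[0], img_size[1], 3)  # 3 color channels (assumes RGB input)
class_count=3  # positive, negative, neutral
base_model=tf.keras.applications.efficientnet.EfficientNetB3(include_top=False,
    weights="imagenet", input_shape=img_shape, pooling='max')
base_model.trainable=True
x=base_model.output
x=BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
x=Dense(256, kernel_regularizer=regularizers.l2(0.016),
    activity_regularizer=regularizers.l1(0.006),
    bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x=Dropout(rate=.4, seed=123)(x)
output=Dense(class_count, activation='softmax')(x)
model=Model(inputs=base_model.input, outputs=output)
model.compile(Adamax(learning_rate=lr), loss='categorical_crossentropy', metrics=['accuracy'])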
NOTE: EfficientNet models expect pixels in the range 0 to 255 so don't scale the pixels. Also note I make the base model trainable. They tell you NOT to do that but in many experiments I find training the base model from the outset leads to faster convergence and net lower validation loss.

How to use RNN to predict the next 4 timesteps using 6 timesteps

I have a dataset with 6 data points plus 4 data points as labels; I was asked to predict those 4 timesteps from the 6 timesteps.
Can you please advise me on which model to use and how to use it? I thought about some kind of RNN, since there is a time for each point.
Thanks!
Problems like this, where the predictions depend on previous inputs, generally use recurrent networks (RNN, GRU, and LSTM), as they retain the previous state information.
For a deeper understanding:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Please also go through the comments I have written in the code.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow.keras import Model
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import RNN, LSTM
"""
creating a toy dataset
lets use this below ```input_sequence``` as the sequence to make data points.
as per the question, we will use 6 points to predict next 4 points
"""
input_sequence = [1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10]
X_train = []
y_train = []
#first 6 points will be our input data points and next 4 points will be data label.
# so on we will shift by 1 and make such data points and label pairs
for i in range(len(input_sequence)-9):
    X_train.append(input_sequence[i:i+6])
    y_train.append(input_sequence[i+6:i+10])
X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.int32)
#X_test for the predictions (contains 6 points)
X_test = np.array([[8,9,10,1,2,3]],dtype=np.float32)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
# We will use a basic LSTM, which accepts input of shape [num_inputs, time_steps, data_points], so we reshape accordingly
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
x_points = X_train.shape[-1]
print("one input contains {} points".format(x_points))
model = Sequential()
model.add(LSTM(4, input_shape=(1, x_points)))
model.add(Dense(4))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
model.fit(X_train, y_train, epochs=500, batch_size=5, verbose=2)
output = list(map(np.ceil, model.predict(X_test)))
print(output)
We have used a simple model here; it can be improved further to get better results.
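One possible variation, which is not part of the code above, is to feed the six points as six timesteps with one feature each instead of one timestep with six features, so that the recurrent layer actually steps through the sequence. The NumPy sketch below only shows the windowing and the two input layouts; the helper name make_windows is hypothetical.

```python
import numpy as np

def make_windows(sequence, n_in=6, n_out=4):
    """Slide over `sequence`, pairing n_in inputs with the next n_out targets."""
    X, y = [], []
    for i in range(len(sequence) - n_in - n_out + 1):
        X.append(sequence[i:i + n_in])
        y.append(sequence[i + n_in:i + n_in + n_out])
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)

input_sequence = list(range(1, 11)) * 3  # same toy sequence as above
X, y = make_windows(input_sequence)

# Layout used in the answer: 1 timestep with 6 features.
X_one_step = X.reshape(X.shape[0], 1, X.shape[1])
# Alternative layout: 6 timesteps with 1 feature each.
X_six_steps = X.reshape(X.shape[0], X.shape[1], 1)

print(X_one_step.shape, X_six_steps.shape, y.shape)
```

With the second layout, the LSTM layer would be declared with input_shape=(6, 1) rather than input_shape=(1, 6). Whether it gives better predictions depends on the data, so treat it as something to try rather than a guaranteed improvement.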

How to get probabilities when using Pytorch's densenet?

I want to do a binary classification and I used the DenseNet from Pytorch.
Here is my predict code:
densenet = torch.load(model_path)
densenet.eval()
output = densenet(input)
print(output)
And here is the output:
Variable containing:
54.4869 -54.3721
[torch.cuda.FloatTensor of size 1x2 (GPU 0)]
I want to get the probabilities of each class. What should I do?
I have noticed that torch.nn.Softmax() could be used when there are many categories, as discussed here.
Add a softmax layer to the classifier layer.
Typical:
import torch.nn as nn
num_ftrs = model_ft.classifier.in_features
model_ft.classifier = nn.Linear(num_ftrs, num_classes)
Updated:
model_ft.classifier = nn.Sequential(nn.Linear(num_ftrs, num_classes),
                                    nn.Softmax(dim=1))
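If the model is already trained and you would rather not change its classifier, another option is to apply softmax to the raw output at prediction time; in PyTorch that would be torch.nn.functional.softmax(output, dim=1). The sketch below reproduces the same arithmetic in plain NumPy on the two scores printed in the question, just to show what the conversion does.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Raw scores for the two classes, as printed in the question.
logits = np.array([[54.4869, -54.3721]])
probs = softmax(logits)

print(probs)              # one probability per class
print(probs.sum(axis=1))  # each row sums to 1
```

Because the two scores are so far apart, the first class gets a probability very close to 1 and the second one very close to 0.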

RNN/LSTM deep learning model?

I am trying to build an RNN/LSTM model for binary classification (0 or 1).
Here is a sample of my dataset (patient number, time in milliseconds, normalized X, Y and Z, kurtosis, skewness, pitch, roll and yaw, label), respectively:
1,15,-0.248010047716,0.00378335508419,-0.0152548459993,-86.3738760481,0.872322164158,-3.51314800063,0
1,31,-0.248010047716,0.00378335508419,-0.0152548459993,-86.3738760481,0.872322164158,-3.51314800063,0
1,46,-0.267422664673,0.0051143782875,-0.0191247001961,-85.7662354031,1.0928406847,-4.08015176908,0
1,62,-0.267422664673,0.0051143782875,-0.0191247001961,-85.7662354031,1.0928406847,-4.08015176908,0
Here is what I have tried:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.preprocessing import sequence
# fix random seed for reproducibility
np.random.seed(7)
train = np.loadtxt("featwithsignalsTRAIN.txt", delimiter=",")
test = np.loadtxt("featwithsignalsTEST.txt", delimiter=",")
x_train = train[:,[2,3,4,5,6,7]]
x_test = test[:,[2,3,4,5,6,7]]
y_train = train[:,8]
y_test = test[:,8]
# create the model
model = Sequential()
model.add(LSTM(20, dropout=0.2, input_dim=6))
model.add(Dense(4, activation = 'sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs = 2)
But it gives me the following error:
Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (1415684, 6)
The LSTM layer takes a 3-dimensional input, corresponding to (batch_size, timesteps, features). In your case you have only a 2-dimensional input, which is (batch_size, features).
The LSTM layer is designed for sequence data (sentences, stock prices, ...). You need to reshape your data so that it can be used this way. More specifically, you need to reshape your data to have one line per patient (or you can choose to have multiple sequences per patient, but let's say we want one line per patient for now), and each line needs to contain multiple arrays, each array corresponding to one observation of your patient.
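As a rough sketch of that reshaping, the NumPy example below groups rows by patient and cuts each patient's observations into fixed-length sequences. The table is synthetic, and the fixed length (timesteps = 4) and the choice to drop leftover rows are assumptions made for illustration; padding shorter sequences would be another option.

```python
import numpy as np

# Hypothetical flat table: column 0 is the patient id, the rest are features.
n_features = 6
rows = np.column_stack([
    np.repeat([1, 2], 8),  # two patients, 8 observations each
    np.arange(16 * n_features).reshape(16, n_features),
]).astype(np.float32)

timesteps = 4  # assumed fixed sequence length; choose it to suit your data

sequences = []
for patient_id in np.unique(rows[:, 0]):
    feats = rows[rows[:, 0] == patient_id, 1:]
    # Drop leftover rows that do not fill a whole sequence.
    usable = (len(feats) // timesteps) * timesteps
    sequences.append(feats[:usable].reshape(-1, timesteps, n_features))

# Final array has shape (num_sequences, timesteps, features).
x = np.concatenate(sequences, axis=0)
print(x.shape)
```

An array shaped like this can be passed to an LSTM layer declared with input_shape=(timesteps, n_features). The labels then need to be reduced to one value per sequence as well, for example the label of the last observation in each sequence.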