How do I interpret my random forest model OOB error and improve the plot of actual vs predicted values? - regression

I ran a random forest regression model to study the impact of climate variables on evapotranspiration.
I split the data into a training set for fitting the model and a testing set for evaluating it.
The OOB error seems low, but the plot of actual vs predicted values does not make sense to me. How can I improve this?
```r
library(randomForest)
library(rpart)
library(ggplot2)

# Split the data into training (80%) and testing (20%) sets
index <- sample(2, nrow(cdata), replace = TRUE, prob = c(0.8, 0.2))
Training <- cdata[index == 1, ]
Testing  <- cdata[index == 2, ]
# View(Testing)

# Build the model with the training data.
# Keep the response inside the data frame: fitting a separate vector such as
# NumericET against "~ ." would leave ETc among the predictors and leak the response.
Training$ETc <- as.numeric(Training$ETc)
Testing$ETc  <- as.numeric(Testing$ETc)
rfm.model <- randomForest(ETc ~ ., data = Training, ntree = 500)
print(rfm.model)  # reports the OOB MSE and % variance explained
plot(rfm.model)   # OOB error as a function of the number of trees

# Test the model with the predict function.
# Note: predict() expects 'newdata'; 'data = Testing' would be silently ignored
# and the OOB predictions for the training rows returned instead.
predictmodel <- predict(rfm.model, newdata = Testing)
predictmodel

# Plot actual vs predicted values for the testing data
actualAndPredictedData <- data.frame(actualValue    = Testing$ETc,
                                     predictedValue = predictmodel)
ggplot(actualAndPredictedData, aes(x = actualValue, y = predictedValue)) +
  geom_point() +
  geom_abline()

# Check the RMSE on the testing data
# install.packages("Metrics")
library(Metrics)
rmse(Testing$ETc, predictmodel)
```

Related

Overcoming Overfitting: How to Improve Video Classification AI Training Accuracy

I am developing an AI for video classification, which classifies a video file into one of three labels: Normal, Violent, or Pornography.
Here is a summary of my efforts so far to improve the accuracy of the model:
1. Dataset: I have collected a training dataset of 50,000 videos, consisting of 5000 original videos and 45,000 augmented videos, evenly split between the three labels.
2. Pre-processing: I have used an InceptionV3 model pre-trained on the ImageNet dataset to extract features from the videos for feeding into my main model.
3. Model Architecture: I have tried many different model architectures, but all of them resulted in overfitting problems after a maximum of 15 epochs.
4. Regularization: I have added L1 and L2 regularization, but they did not help improve the model.
6. Early Stopping: I have implemented early stopping, but it stopped training while the validation metrics were still not good enough to achieve good accuracy (see the callback sketch after this list).
6. Model Complexity: I have tried both complex and less complex models, but both still resulted in overfitting.
7. Batch Normalization: I have added batch normalization, but it did not solve the overfitting problem.
8. Learning Rate Scheduler: I have tried using ReduceLROnPlateau and LearningRateScheduler together and separately, but still no luck.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5,
                                              verbose=0, mode='min', min_delta=0.0001,
                                              cooldown=0, min_lr=0)
lr_schedule = keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.0005 * tf.math.exp(-0.05 * epoch),
    verbose=True)
9. Computing Resources: I am running the training on an AWS SageMaker ml.t3.2xlarge instance with 32 GB of RAM.
10. Dataset Size: I would prefer to avoid increasing the size of the dataset as I am running short on time for the project delivery. However, if this is my only option, I am open to suggestions.
11. Regularizer Tuning: I gradually increased the regularization value in each layer to fine-tune the model.
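Regarding points 5 and 8: the two callbacks are often combined so that the learning rate is reduced before training stops and the best validation weights are kept. A minimal sketch is below; the patience values and the restore_best_weights flag are assumptions, not the settings used in the original runs.

```python
from tensorflow import keras

# Hypothetical callback setup: stop on a stalled validation loss but keep the
# best weights seen so far, and halve the learning rate before giving up.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,                 # assumed value; tune for your data
    restore_best_weights=True,   # roll back to the best epoch instead of the last one
)
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",
    factor=0.5,
    patience=5,                  # shorter than the EarlyStopping patience on purpose
    min_lr=1e-6,
)

# rnn_model, train_ds and val_ds are assumed to be defined as in the post:
# history = rnn_model.fit(
#     train_ds,
#     validation_data=val_ds,
#     epochs=100,
#     callbacks=[early_stop, reduce_lr],
# )
```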
Please note that these are just examples of the models I have tried; I have experimented with many others, with similar results.
# Model variant 1
x = keras.layers.GRU(32, return_sequences=True,
                     kernel_regularizer=keras.regularizers.l2(0.001))(
    frame_features_input, mask=mask_input
)
x = keras.layers.GRU(16, kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.4)(x)
x = keras.layers.Dense(1024, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(128, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
rnn_model = keras.Model([frame_features_input, mask_input], output)

opt = keras.optimizers.experimental.AdamW(
    learning_rate=0.0001,  # 0.001
    weight_decay=0.004,    # .004 best perform
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=None,
    jit_compile=True,
    name="AdamW")
rnn_model.compile(
    loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"]
)

# Model variant 2
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

# Model variant 3
x = keras.layers.GRU(256, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
[Screenshots omitted: "The results" and "The results using learning rate scheduler".]
I tried different model architectures, added regularization, early stopping, and batch normalization, but still faced the overfitting issue. I expected improved accuracy, but the actual results show overfitting.

Why Is accuracy so different when I use evaluate() and predict()?

I have a convolutional neural network that is trying to solve a binary image classification problem (2 classes), using a sigmoid output.
To evaluate the model I use:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

path_dir = '../../dataset/train'
parth_dir_test = '../../dataset/test'

datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2)

test_set = datagen.flow_from_directory(parth_dir_test,
                                       target_size=(150, 150),
                                       batch_size=64,
                                       class_mode='binary')

score = classifier.evaluate(test_set, verbose=0)
print('Test Loss', score[0])
print('Test accuracy', score[1])
And it outputs:
When I try to print the classification report I use:
from sklearn.metrics import classification_report

yhat_classes = classifier.predict_classes(test_set, verbose=0)
yhat_classes = yhat_classes[:, 0]
print(classification_report(test_set.classes, yhat_classes))
But now I get this accuracy:
If I print test_set.classes, it shows the first 344 numbers of the array as 0 and the next 344 as 1. Is this test_set shuffled before being fed into the network?
I think your model is doing just fine in both training and evaluation. Evaluation accuracy is computed on the basis of predictions, so maybe you are making a logical mistake when using model.predict_classes(). Please check that you are using the trained model weights and not a randomly initialized model when evaluating it.
What evaluate() does: the model sets apart this fraction of the data while training, will not train on it, and will evaluate the loss and any other model metrics on this data after each epoch. So model.evaluate() is for evaluating your trained model; its output is accuracy or loss, not predictions for your input data.
What predict() does: it generates output predictions for the input samples. model.predict() actually predicts, and its output is the target value predicted from your input data.
FYI: if your accuracy in a binary classification problem is less than 50%, it is worse than randomly predicting one of the two classes (acc = 50%)!
I needed to add shuffle=False. The code that works is:
test_set = datagen.flow_from_directory(parth_dir_test,
                                       target_size=(150, 150),
                                       batch_size=64,
                                       class_mode='binary',
                                       shuffle=False)
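For completeness, once shuffle=False is set the predictions can be compared against test_set.classes directly. A minimal sketch, assuming the classifier and test_set defined above, and using model.predict() with a 0.5 threshold because predict_classes() has been removed in newer Keras/TensorFlow versions:

```python
import numpy as np
from sklearn.metrics import classification_report

# With shuffle=False the generator yields samples in the same order as
# test_set.classes, so predictions and labels line up index by index.
probs = classifier.predict(test_set, verbose=0)    # sigmoid outputs in [0, 1]
yhat_classes = (probs > 0.5).astype(int).ravel()   # threshold at 0.5

print(classification_report(test_set.classes, yhat_classes))
```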

Transforming a Dataset for Multi-Label Text Classification

I am conducting some experiments on multi-label classification via deep learning models.
But I face a problem with the dataset.
I use Keras, TensorFlow 2.0, NumPy, and pandas.
I have a dataset in the form:
[Screenshot: dataset in the form that I have it]
To apply multi-label classification (6 labels) I need my dataset to be in this form:
[Screenshot: dataset in the form that I need it]
How is it possible to achieve this? Are there any functions making this transformation easier?
Try:
comments_df[['abusive','hateful','offensive','disrespectful','fearful','normal']] = comments_df['sentiment'].str.split('_', -1, expand=True)
This gives me an error:
ValueError: Columns must be same length as key
Regarding the DL model I will use: it is a bi-LSTM, but that does not have anything to do with the question per se.
Try this:
df = pd.get_dummies(data = df, columns = ['sentiment'])
I found this to work (not optimal solution):
"""
Creating a column for each of the target labels with sentiment's column data.
"""
def split_sentiment_outputs(output_label, sentiment_col="sentiment"):
comments_df[output_label] = comments_df[sentiment_col].str.split('_')
"""
Transform column's data to categorical.
"""
def transform_data_for_multilabel(output_label):
row = comments_df[output_label]
for index, row in row.items():
# print("Index:", index)
# print("length:", len(row))
# print("content:", row)
# print("--------------")
z = 0
while z < len(row):
if row[z] == output_label:
comments_df.at[index, output_label] = 1
break
else:
comments_df.at[index, output_label] = 0
z = z + 1
# Applying Data Transformation
output_labels = ["abusive", "hateful", "offensive", "disrespectful", "fearful", "normal"]
for i in range(MAX_OUT):
split_sentiment_outputs(output_labels[i])
for i in range(MAX_OUT):
transform_data_for_multilabel(output_labels[i])
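For reference, pandas can build the same per-label indicator columns in one step with Series.str.get_dummies. A minimal sketch, assuming the sentiment column holds underscore-separated label strings as in the code above (the example frame is made up for illustration):

```python
import pandas as pd

# Hypothetical example frame standing in for comments_df
comments_df = pd.DataFrame({
    "comment": ["text a", "text b"],
    "sentiment": ["abusive_hateful", "normal"],
})

# One indicator column per label found in the underscore-separated string
dummies = comments_df["sentiment"].str.get_dummies(sep="_")
comments_df = comments_df.join(dummies)
print(comments_df)
```

Unlike pd.get_dummies(data=df, columns=['sentiment']), which creates one column per unique label combination, this splits each string into its individual labels first, which is what multi-label classification needs.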

How to feed an LSTM/GRU model multiple independent Time Series?

To explain it simply: I have measurements from 53 oil-producing wells; each well has been measured every day for 6 years, and we recorded multiple variables (pressure, water production, gas production, etc.). Our main variable, the one we want to study and forecast, is the oil production rate. How can I use all the data to train my LSTM/GRU model, knowing that the oil wells are independent and that the measurements were taken at the same times for each one?
The knowledge that "the measurements have been done at the same time for each [well]" is not necessary if you want to assume that the wells are independent. (Why do you think that that knowledge is useful?)
So if the wells are considered independent, treat them as individual samples. Split them into a training set, validation set, and test set, as usual. Train a usual LSTM or GRU on the training set.
By the way, you might want to use the attention mechanism instead of recurrent networks. It is easier to train and usually yields comparable results.
Even convolutional networks might be good enough. See methods like WaveNet if you suspect long-range correlations.
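To make "treat them as individual samples" concrete, here is a minimal sketch of one way to window the data for a Keras GRU. The array shapes, the 30-day window, and names such as wells and make_windows are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from tensorflow import keras

# Assumed layout: 53 wells, ~6 years of daily records, a handful of features,
# with the oil production rate stored as the last feature (the forecast target).
n_wells, n_days, n_features = 53, 2190, 5
wells = np.random.rand(n_wells, n_days, n_features).astype("float32")  # placeholder data

WINDOW = 30  # look back 30 days to predict the next day's oil rate

def make_windows(series):
    """Slice one well's (n_days, n_features) series into (X, y) training windows."""
    X, y = [], []
    for t in range(WINDOW, series.shape[0]):
        X.append(series[t - WINDOW:t])   # past WINDOW days, all features
        y.append(series[t, -1])          # next day's oil rate
    return np.array(X), np.array(y)

# Split by well, not by time step, so the validation wells are truly unseen.
train_wells, val_wells = wells[:43], wells[43:]
X_train, y_train = map(np.concatenate, zip(*[make_windows(w) for w in train_wells]))
X_val, y_val = map(np.concatenate, zip(*[make_windows(w) for w in val_wells]))

model = keras.Sequential([
    keras.Input(shape=(WINDOW, n_features)),
    keras.layers.GRU(32),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5, batch_size=64)
```

Splitting by well rather than by time step keeps the validation wells genuinely unseen, which matches the independence assumption above.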
These well measurements sound like specific and independent events. I work in the finance sector. We always look at different stocks, and at each stock's specific time series using LSTM, but not 10 stocks mashed up together. Here's some code to analyze a specific stock. Modify the code to suit your needs.
from pandas_datareader import data as wb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from sklearn.preprocessing import MinMaxScaler

start = '2019-06-30'
end = '2020-06-30'
tickers = ['GOOG']
thelen = len(tickers)

price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start=start, end=end, data_source='yahoo')[['Open', 'Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Open', 'Adj Close']])
#names = np.reshape(price_data, (len(price_data), 1))

df = pd.concat(price_data)
df.reset_index(inplace=True)
for col in df.columns:
    print(col)

#used for setting the output figure size
rcParams['figure.figsize'] = 20, 10
#to normalize the given input data
scaler = MinMaxScaler(feature_range=(0, 1))
#to read input data set (place the file name inside ' ') as shown below
df['Adj Close'].plot()
plt.legend(loc=2)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

ntrain = 80
df_train = df.head(int(len(df)*(ntrain/100)))
ntest = -80
df_test = df.tail(int(len(df)*(ntest/100)))

#importing the packages
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM

#dataframe creation
seriesdata = df.sort_index(ascending=True, axis=0)
new_seriesdata = pd.DataFrame(index=range(0, len(df)), columns=['Date', 'Adj Close'])
length_of_data = len(seriesdata)
for i in range(0, length_of_data):
    new_seriesdata['Date'][i] = seriesdata['Date'][i]
    new_seriesdata['Adj Close'][i] = seriesdata['Adj Close'][i]
#setting the index again
new_seriesdata.index = new_seriesdata.Date
new_seriesdata.drop('Date', axis=1, inplace=True)

#creating train and test sets; this comprises the entire data present in the dataset
myseriesdataset = new_seriesdata.values
totrain = myseriesdataset[0:255, :]
tovalid = myseriesdataset[255:, :]

#converting dataset into x_train and y_train
scalerdata = MinMaxScaler(feature_range=(0, 1))
scale_data = scalerdata.fit_transform(myseriesdataset)
x_totrain, y_totrain = [], []
length_of_totrain = len(totrain)
for i in range(60, length_of_totrain):
    x_totrain.append(scale_data[i-60:i, 0])
    y_totrain.append(scale_data[i, 0])
x_totrain, y_totrain = np.array(x_totrain), np.array(y_totrain)
x_totrain = np.reshape(x_totrain, (x_totrain.shape[0], x_totrain.shape[1], 1))

#LSTM neural network
lstm_model = Sequential()
lstm_model.add(LSTM(units=50, return_sequences=True, input_shape=(x_totrain.shape[1], 1)))
lstm_model.add(LSTM(units=50))
lstm_model.add(Dense(1))
lstm_model.compile(loss='mean_squared_error', optimizer='adadelta')
lstm_model.fit(x_totrain, y_totrain, epochs=10, batch_size=1, verbose=2)

#predicting next data stock price
myinputs = new_seriesdata[len(new_seriesdata) - (len(tovalid)+1) - 60:].values
myinputs = myinputs.reshape(-1, 1)
myinputs = scalerdata.transform(myinputs)
tostore_test_result = []
for i in range(60, myinputs.shape[0]):
    tostore_test_result.append(myinputs[i-60:i, 0])
tostore_test_result = np.array(tostore_test_result)
tostore_test_result = np.reshape(tostore_test_result, (tostore_test_result.shape[0], tostore_test_result.shape[1], 1))
myclosing_priceresult = lstm_model.predict(tostore_test_result)
myclosing_priceresult = scalerdata.inverse_transform(myclosing_priceresult)

totrain = df_train
tovalid = df_test
#predicting next data stock price
myinputs = new_seriesdata[len(new_seriesdata) - (len(tovalid)+1) - 60:].values
# Printing the next day's predicted stock price.
print(len(tostore_test_result))
print(myclosing_priceresult)
Final result:
1
[[1396.532]]

Pytorch: Overfitting on a small batch: Debugging

I am building a multi-class image classifier.
There is a debugging trick of overfitting on a single batch to check whether there are any deeper bugs in the program.
How can I design the code so that this check can be done in a portable way?
One arduous and not very smart way is to build a holdout train/test folder for a small batch, where the test set consists of two distributions, seen data and unseen data; if the model performs well on the seen data and poorly on the unseen data, we can conclude that the network has no deeper structural bug.
But this does not seem like a smart or portable approach, and I would have to redo it for every problem.
Currently, I have a dataset class where I partition the data into train/dev/test in the way shown below:
import pandas as pd
from sklearn.model_selection import train_test_split


def split_equal_into_val_test(csv_file=None, stratify_colname='y',
                              frac_train=0.6, frac_val=0.15, frac_test=0.25,
                              ):
    """
    Split a Pandas dataframe into three subsets (train, val, and test).

    Following fractional ratios provided by the user, where the val and
    test sets have the same number of each class while the train set has
    the remaining samples.

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val : float
    frac_test : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """
    df = pd.read_csv(csv_file).iloc[:, 1:]

    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df.columns:
        raise ValueError('%s is not a column in the dataframe' %
                         (stratify_colname))

    df_input = df
    no_of_classes = 4
    sfact = int((0.1 * len(df)) / no_of_classes)

    # Shuffling the data frame
    df_input = df_input.sample(frac=1)
    df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
    df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
    df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
    df_temp_4 = df_input[df_input['labels'] == 4][:sfact]

    dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
    dev_test_y = dev_test_df['labels']

    # Split the temp dataframe into val and test dataframes.
    df_val, df_test, dev_Y, test_Y = train_test_split(
        dev_test_df, dev_test_y,
        stratify=dev_test_y,
        test_size=0.5,
    )

    df_train = df[~df['img'].isin(dev_test_df['img'])]

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test


def train_val_to_ids(train, val, test, stratify_columns='labels'):  # noqa
    """
    Convert the stratified dataset into dictionaries: partition['train_set'] and labels.

    To generate the parallel code according to
    https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_columns : The label column

    Returns
    -------
    partition, labels:
        partition dictionary containing train and validation ids and label
        dictionary containing ids and their labels  # noqa
    """
    train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list()  # noqa
    partition = {"train_set": train_list,
                 "val_set": val_list,
                 }
    labels = dict(zip(train.img, train.labels))
    labels.update(dict(zip(val.img, val.labels)))
    return partition, labels
P.S. - I know about PyTorch Lightning and that it has an overfitting-check feature that can be used easily, but I don't want to move to PyTorch Lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modify it from
def __len__(self):
    return len(self.data_list)
to
def __len__(self):
    return 20
it will only output the first 20 elements of the dataset (regardless of shuffling). You only need to change one line of code and the rest should work just fine, so I think it's pretty neat.
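An alternative that leaves the Dataset class untouched is torch.utils.data.Subset. Below is a minimal, self-contained sketch of the same overfit-on-a-tiny-subset check; TinyDataset and the training loop are placeholders for illustration, not code from the post.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, Subset, DataLoader

class TinyDataset(Dataset):
    """Placeholder dataset standing in for the real image dataset."""
    def __init__(self, n=1000):
        self.x = torch.randn(n, 10)
        self.y = torch.randint(0, 4, (n,))
    def __len__(self):
        return len(self.x)
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

full_ds = TinyDataset()
debug_ds = Subset(full_ds, range(20))           # keep only the first 20 samples
debug_loader = DataLoader(debug_ds, batch_size=20, shuffle=True)

model = nn.Linear(10, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# A healthy model/pipeline should drive the loss towards ~0 on these 20 samples.
for epoch in range(200):
    for xb, yb in debug_loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
print("final loss on the tiny subset:", loss.item())
```

Because the wrapper works for any Dataset, the same two Subset/DataLoader lines can be dropped into a new project without editing __len__.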