Model comparison with RMSE - regression

I am a newbie in data science and would like to ask for help with model selection.
I have built 8 models to predict salary from years of experience, position name, and location.
I then compared the 8 models by RMSE, but I am still not sure which one I should select. (In my mind, I prefer model 8: in testing, the random forest results were better than the regression ones, so I used the whole dataset to fit the final version, but it is more difficult to interpret than regression coefficients.)
Can you tell me which model you would prefer, and why?
And in practice, do data scientists follow a process like this, or do they have an automated way to deal with it?
1. RMSElm1: linear regression; train 80% / test 20%; no imputation. RMSE = 22067.58
2. RMSElm2: linear regression; train 80% / test 20%; imputation of some locations that I think imply similar salaries. RMSE = 22115.64
3. RMSElm3: linear regression + stepwise selection; train 80% / test 20%; no imputation. RMSE = 22081.06
4. RMSEdeep1: deep learning (H2O, activation = 'Rectifier', hidden = c(5,5), epochs = 100); train 80% / test 20%; no imputation. RMSE = 16265.13
5. RMSErf1: random forest (ntree = 10); train 80% / test 20%; no imputation. RMSE = 14669.92
6. RMSErf2: random forest (ntree = 500); train 80% / test 20%; no imputation. RMSE = 14669.92
7. RMSErf3: random forest (ntree = 10); 10-fold cross-validation; no imputation. RMSE = 14440.82
8. RMSErf4: random forest (ntree = 10); fitted on the whole dataset; no imputation. RMSE = 13532.74

In regression problems, MSE or RMSE is a way to measure how well your model is doing; lower is better. So go with the model that gives the lowest MSE or RMSE, and evaluate it on held-out test data. Ensemble methods often give the best results, and XGBoost is often used in competitions.
There might be a case of overfitting, where you get a very low RMSE on the training data but a high RMSE on the test data. It is therefore considered good practice to use cross-validation.
You might want to check this: https://stats.stackexchange.com/questions/56302/what-are-good-rmse-values
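As a rough illustration of comparing models by cross-validated RMSE rather than a single train/test split, here is a sketch in Python with scikit-learn; the data below is a placeholder standing in for your encoded salary dataset, not your actual features:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Placeholder data standing in for (years of experience, position, location) -> salary
    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

    models = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    }

    for name, model in models.items():
        # scikit-learn returns negative MSE by convention, so negate and take the root
        neg_mse = cross_val_score(model, X, y, cv=10,
                                  scoring="neg_mean_squared_error")
        rmse = np.sqrt(-neg_mse)
        print(f"{name}: mean RMSE = {rmse.mean():.2f} (std {rmse.std():.2f})")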

Related

Fully connected neural network with constant loss

I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 input columns and 1 target column), all numerical.
I am using a fully connected neural network for the prediction, but the problem is that the loss is not decreasing as it should.
The loss is very large (around 1e+13) and does not decrease as it should; it just fluctuates.
This is the function I am using to train the model:
import torch

def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
    losses = []
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):  # one epoch
        for inputs, outputs in data_loader:  # one iteration
            inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
            logits = model(inputs)
            loss = criterion(torch.squeeze(logits), outputs)  # forward pass
            optimizer.zero_grad()  # zero out the gradients
            loss.backward()  # compute the gradients (backward pass)
            optimizer.step()  # take one step
            losses.append(loss.item())
        # average loss over the iterations of this epoch
        loss = sum(losses[-len(data_loader):]) / len(data_loader)
        print(f'Epoch #{epoch}: Loss={loss:.3e}')
    return losses
The model is a fully connected neural network with 4 hidden layers, each with 7 neurons. The input layer has 7 neurons and the output layer has 1. I am using MSE as the loss function. I tried changing the learning rate but it is still bad.
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in a deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss); it might be the case that your targets are poorly scaled, causing very large loss values. Try to normalize the targets so that they span a more compact range.
Learning Rate:
Try adjusting your learning rate, both increasing and decreasing it by orders of magnitude.
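As a minimal sketch of the first two points, assuming your features and targets are NumPy arrays (the arrays below are hypothetical stand-ins for your player data):

    import numpy as np
    import torch

    # Hypothetical raw data: 7 input features and a large-valued target (player value)
    X = np.random.rand(19000, 7) * 1e6
    y = np.random.rand(19000) * 1e8

    # Standardize inputs and targets to zero mean / unit variance
    X_mean, X_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = y.mean(), y.std()
    X_norm = (X - X_mean) / X_std
    y_norm = (y - y_mean) / y_std

    X_t = torch.tensor(X_norm, dtype=torch.float32)
    y_t = torch.tensor(y_norm, dtype=torch.float32)

    # After training, map predictions back to the original scale:
    # preds_original = preds_normalized * y_std + y_mean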

Deep learning model stuck in local minima or overfit?

I trained an image classification model with 10 classes by fine-tuning EfficientNet-B4 for 100 epochs. I split my training data 70/30. I used stochastic gradient descent with Nesterov momentum of 0.9, a starting learning rate of 0.001, and a batch size of 10. The test accuracy seemed to be stuck at 84% for the last 50 epochs (51st to 100th). I do not know whether the model was stuck in a local minimum or was overfitting. Below is an image of the train and test loss from the 51st to the 100th epoch. I need your help a lot. Thanks. (Image: train/test loss from the 51st to the 100th epoch.)
From the graph you provided, both the validation and training losses are still going down, so your model is still training and there is no overfitting. If your test set is stuck at the same accuracy, the reason is probably that the data you are using for your training/validation dataset does not generalize well enough to your test dataset (in your graph the validation only reached about 50% accuracy while your test set reached 84%).
I looked at your training and validation graph. Yes, your model is training and the losses are going down, but your validation error is near 50%, which essentially means a random guess.
Possible reasons:
1. From your train error (shown in the image between epochs 50 and 100), the error is going down on average, but it is noisy: your error at epoch 100 is pretty much the same as at epoch 70. This could be because your dataset is too simple and you are forcing a huge network like EfficientNet to overfit it.
2. It could also be related to how you are fine-tuning it: for example, which layers you froze and which layers receive gradients during backpropagation (see the sketch below). I am assuming you are using pre-trained weights.
3. Optimizer issue: try using Adam.
It would be great if you could provide the full loss curves (epochs 1 to 100).
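As a rough illustration of point 2, here is how freezing the backbone and fine-tuning only the classifier head might look, sketched with torchvision's EfficientNet-B4 (this assumes a recent torchvision and is not necessarily how your pipeline is set up):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Load a pre-trained EfficientNet-B4
    model = models.efficientnet_b4(weights="IMAGENET1K_V1")

    # Freeze the backbone so only the new head receives gradients
    for param in model.parameters():
        param.requires_grad = False

    # Replace the classifier head for 10 classes; its parameters stay trainable
    num_features = model.classifier[1].in_features
    model.classifier[1] = nn.Linear(num_features, 10)

    # Pass only the trainable parameters to the optimizer (Adam, as suggested above)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )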

CNN not learning correctly

I have a small dataset of 500 plant images, and I have to predict a number for each image in the range [1, 10]. There is an order relation between the numbers (10 > 9 > ... > 1). This problem is similar to age estimation from a single photo.
I tried regression using ResNet18, ResNet34 and VGG16. None of them gave a very good result.
The interesting point is that when I plotted the heatmap for a few images, it showed that the model was picking the wrong spots for its predictions. It is as if, when predicting age from a facial photo, the CNN gave more weight to the background than to the actual face.
I tried other approaches as well, like classification and learning to rank, but the same thing happens in the heatmaps. With these approaches, the best accuracy I get is 30% using classification and 35% using learning to rank.
For the regression and classification approaches I used the fastai implementation with pretrained weights. For the learning-to-rank approach I used this: https://github.com/Raschka-research-group/coral-cnn. I changed it a little to be able to use a pretrained model as well.
Another important point is that the dataset is imbalanced: 80% of the images correspond to classes 6 to 10.
Does anyone have any tips to improve this, or another approach I could try?
EDIT:
My data augmentation looks like this:
transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
You can try augmenting your dataset to obtain more data (e.g. random cropping, rotating, etc), and make sure you normalise your data. For the class imbalance problem, you can try using PyTorch's WeightedRandomSampler:
import torch
from torch.utils.data import WeightedRandomSampler

# Let there be 9 samples in class 0 and 1 sample in class 1, respectively
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0] * 9 + [1]  # corresponding labels of the samples
class_weights = [num_samples / class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
You should be able to apply this to your case with 10 classes easily, hope this solves your problem!
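For completeness, the sampler is then passed to the DataLoader in place of shuffle=True (a small usage sketch; dataset stands for whatever Dataset object you are already using):

    from torch.utils.data import DataLoader

    # shuffle must be left off when a sampler is supplied
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)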

training/validation/test sets in python for regression

I want to split my data into 3 partitions for regression: 70% training, 15% validation and 15% test. Python provides a way to do that only for training and testing, via cross_validation.train_test_split. Any ideas?
Use cross_validation.train_test_split twice.
First split (70, 30) => (training, validation+test), then split the held-out 30% with (50, 50) => (validation, test).
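A minimal sketch of that two-step split (note that in newer scikit-learn versions train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation; the data below is a placeholder):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data; replace with your own features and target
    X = np.random.rand(1000, 5)
    y = np.random.rand(1000)

    # First split: 70% training, 30% held out for validation + test
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)

    # Second split: divide the held-out 30% in half -> 15% validation, 15% test
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)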

Loss function for ordinal target on SoftMax over Logistic Regression

I am using Pylearn2 or Caffe to build a deep network. My target is an ordered nominal variable. I am trying to find a proper loss function but cannot find one in Pylearn2 or Caffe.
I read the paper "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". I get the general idea, but I am not sure I understand what the thresholds would be if my final layer is a softmax over logistic regression (outputting probabilities).
Can someone help me by pointing to any implementation of such a loss function?
Thanks
Regards
For both Pylearn2 and Caffe, your labels will need to be 0-4 instead of 1-5; it is just the way they work. The output layer will be 5 units, each essentially a logistic unit, and the softmax can be thought of as an adaptor that normalizes the final outputs. But "softmax" is commonly used as an output type. When training, the value of any individual unit is rarely ever exactly 0.0 or 1.0; it is always a distribution across your units, on which log-loss can be calculated. This loss is compared against the "perfect" case and the error is back-propagated to update your network weights. Note that a raw output from Pylearn2 or Caffe is not a specific digit 0, 1, 2, 3 or 4; it is 5 numbers, each associated with the likelihood of one of the 5 classes. When classifying, one just takes the class with the highest value as the 'winner'.
I'll try to give an example.
Say I have a 3-class problem and I train a network with a 3-unit softmax.
The first unit represents the first class, the second unit the second class, and the third unit the third class.
Say I feed a test case through and get 0.25, 0.5, 0.25. Since 0.5 is the highest, a classifier would say "2". This is the softmax output; it makes sure the sum of the output units is one.
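A tiny numeric version of that example (a sketch; the logits are made up so that the softmax comes out to roughly those values):

    import numpy as np

    logits = np.array([0.0, np.log(2.0), 0.0])      # made-up raw network outputs
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    print(probs)                                    # [0.25 0.5 0.25], sums to 1
    print(probs.argmax() + 1)                       # class 2 (classes numbered 1..3 here)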
You should have a look at ordinal (logistic) regression. This is the formal solution to the problem setup you describe (do not use plain regression, as its distance measure on the errors is wrong).
https://stats.stackexchange.com/questions/140061/how-to-set-up-neural-network-to-output-ordinal-data
In particular, I recommend looking at the Coral ordinal regression implementation at
https://github.com/ck37/coral-ordinal/issues.
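As a rough sketch of the idea behind such ordinal losses (not the coral-ordinal library's actual API, and in PyTorch rather than Pylearn2/Caffe, just to illustrate the structure): a K-level ordinal target can be encoded as K-1 cumulative binary labels ("is the level greater than k?"), and the network's K-1 outputs trained with binary cross-entropy on each threshold.

    import torch
    import torch.nn as nn

    def ordinal_targets(y, num_classes):
        # Encode integer labels 0..K-1 as K-1 cumulative binary targets (1 where y > k)
        thresholds = torch.arange(num_classes - 1)
        return (y.unsqueeze(1) > thresholds.unsqueeze(0)).float()

    # Hypothetical setup: 5 ordered levels, so 4 threshold outputs
    num_classes = 5
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, num_classes - 1))
    criterion = nn.BCEWithLogitsLoss()

    x = torch.randn(8, 16)                      # dummy batch of features
    y = torch.randint(0, num_classes, (8,))     # dummy ordinal labels 0..4

    logits = model(x)                           # (8, 4) threshold logits
    loss = criterion(logits, ordinal_targets(y, num_classes))
    loss.backward()

    # At prediction time, the level is the number of thresholds passed
    pred = (torch.sigmoid(logits) > 0.5).sum(dim=1)   # integer in 0..4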