XGBoost regression RMSE individual prediction - regression

I have a simple regression problem with two independent variables and one dependent one. I tried linear regression from statsmodels and sk-learn, but I get the best results (R ^ 2 and RMSE) with XGBoost regressor.
On the new data set, RMSE is still in line with earlier results, but individual predictions are very different.
For example, the RMSE is 1000, and individual predictions vary from 20 to 3000. Thus, predictions are either almost perfectly accurate or strongly deviate in a few cases, but i don't know why is that.
My question is what is the cause of such variations in individual predictions?

When testing your model with new data, it's normal to get some of the predictions wrong. Since RMSE is 1000 it means that, on average, the root of the difference between the actual and predicted values is 1000. You can have values that are predicted very well, as well as values that give a very large squared error. The reason for this could be overfitting. It could also be that the new data set contains data that is very different from the data the model was trained on. But since the RMSE is in line with earlier results, I understand that RMSE was around 1000 on the training set as well. Therefore I don't necessarily see a problem with the test set. What I would do is go through the preprocessing steps for the training data and make sure they're done correctly:
standardize the data and remove possible skewness
check for collinearity between independent variables (you only have 2, so it should be easy to do)
check to see if independent variables have an acceptable variance. If your variables don't vary too much for each new data point it could be that they are useless for explaining the dependent variable.
BTW, what is the R2 score for your regression? It should tell you how much of the variability of the target variable is explained by your model. A low R2 score should indicate that the regressors used aren't very useful in explaining your target variable.
You can use the sklearn function StandardScaler() to standaredize the data.

Related

Ways to prevent underfitting and overfitting to when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2D in pytorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
Like Goldilocks, I'd like to find "just right" between the initial overfitting and subsequent underfitting. I'm wondering other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (ie two straight epochs with divergence between the test and training loss) shuffling the data. I can also play with changing the model, although it can only be so big because of constraints with the hardware and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

Evaluating the performance of variational autoencoder on unlabeled data

I've designed a variational autoencoder (VAE) that clusters sequential time series data.
To evaluate the performance of VAE on labeled data, First, I run KMeans on the raw data and compare the generated labels with the true labels using Adjusted Mutual Info Score (AMI). Then, after the model is trained, I pass validation data to it, run KMeans on latent vectors, and compare the generated labels with the true labels of validation data using AMI. Finally, I compare the two AMI scores with each other to see if KMeans has better performance on the latent vectors than the raw data.
My question is this: How can we evaluate the performance of VAE when the data is unlabeled?
I know we can run KMeans on the raw data and generate labels for it, but in this case, since we consider the generated labels as true labels, how can we compare the performance of KMeans on the raw data with KMeans on the latent vectors?
Note: The model is totally unsupervised. Labels (if exist) are not used in the training process. They're used only for evaluation.
In unsupervised learning you evaluate the performance of a model by either using labelled data or visual analysis. In your case you do not have labelled data, so you would need to do analysis. One way to do this is by looking at the predictions. If you know how the raw data should be labelled, you can qualitatively evaluate the accuracy. Another method is, since you are using KMeans, is to visualize the clusters. If the clusters are spread apart in groups, that is usually a good sign. However, if they are closer together and overlapping, the labelling of vectors in the respective areas may be less accurate. Alternatively, there may be some sort of a metric that you can use to evaluate the clusters or come up with your own.

Saving Random Forest Classifiers (sklearn) with picke/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is, that the saved files get huge. This seems to be correlated to the amount of data that I use for training (which is several 10-millions of samples per forest, leading to dumped files in the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress-option set to 3).
X_train, y_train = ... some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
pickle.dump(classifier, open(path+'classifier.pickle', 'wb'))
or
joblib.dump(classifier, path+'classifier.joblib', compress=True)
Since the saved files got quite big (5GB to nearly 20GB, compressed aprox. 1/3 of this - and I will need >50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the train set, I found different sizes for the saved classifier, making me believe that information about the training is pickled/joblibed as well. This seems unintuitive to me, as for predictions, I only need the information of all the trained weak predictors (decision trees) which should be steady and since the number of trees and the max depth is not too high, they should also not take up that much space. And certainly not more due to a larger training set.
All in all, I suspect that the structure is containing more than I need. Yet, I couldn't find a good answer on how to exclude these parts from it and save only the necessary information for my future predictions.
I ran into a similar issue and I also thought in the beginning that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out in fact that decision trees are indeed memory hungry structures that consists of multiple arrays of length given by the total number of nodes. Nodes in general grow with the size of data (and parameters like max_depth cannot effectively used to limit growth since the reasonable values still have room to generate huge number of nodes). See details in this answer but the gist is:
a single decision tree can easy grow to a few MBs (example above has a 5MB decision tree for 100K data and a 50MB decision tree for 1M data)
a random forest commonly contains at least 100 such decision tree and for the example above you would have models in the range of 0.5/5GB
compression is usually not enough to reduce to reasonable sizes (1/2, 1/3 are usual ranges)
Other notes:
using a different algorithm models might remain of a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by decision trees if you only plan it to reuse it for prediction. In particular I imagine the array of impurity and possible those on n_samples might not be needed but I have not checked.
with respect to you hypothesis that the random forest is saving the data on which it is trained: not it is not and the data itself would likely be one or more order of magnitude smaller than the final model
so in principle another strategy if you have a reproducible training pipeline could be to save the data instead of the model and retrain on purpose, but this is only possible if you can spare the time to retrain (for example if in a use case where you have a long running service which has the model in memory and you serialize the model in order to have a backup for when the model goes down)
there are probably also other options to limit growth of random forest, the best one I have found until now is in this answer, where the suggestion is to work with min_samples_leaf to set it as a percentage of data

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in pytorch. My target values can be either between 0 to 100 or 0 to 1 (they represent % or % divided by 100).
The data is unbalanced, I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn - the validation loss doesn't improve, and the loss on the 25% large targets is very big, much bigger than the std in this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and if using the ranges 0-1 is "cheating", that will be great.
Also - should I scale the targets? (either if I use the larger or the smaller range).
Some additional info - I'm trying to fine tune bert for a specific task. I use MSEloss.
Thanks!
I think your observation relates to batch normalization. There is a paper written on the subject, an numerous medium/towardsdatascience posts, which i will not list here. Idea is that if you have a no non-linearities in your model and loss function, it doesn't matter. But even in MSE you do have non-linearity, which makes it sensitive to scaling of both target and source data. You can experiment with inserting Batch Normalization Layers into your models, after dense or convolutional layers. In my experience it often improves accuracy.

Computation consideration with different Caffe's network topology (difference in number of output)

I would like to use one of Caffe's reference model i.e. bvlc_reference_caffenet. I found that my target class i.e. person is one of the classes included in the ILSVRC dataset that has been trained for the model. As my goal is to classify whether a test image contains a person or not, I may achieve this by the following:
Use inference directly with 1000 number of output. This doesn't
require training/learning.
Change the network topology a little bit with the final FC layer's number of output (num_output) is set to 2 (instead of 1000). Retrain it as a binary classification problem.
My concern is about computational effort at deployment/prediction phase (testing). The latter looks more expensive computationally than the former. This is because during prediction phase it needs to compute those 1000 output possibilities to find the one with the highest score. What I'm not sure is that, it could be the case that there's a heuristic (which I'm not aware of) that simplifies the computation.
Can somebody please help cross check my understanding on this.