Computational considerations with different Caffe network topologies (difference in number of outputs) - deep-learning

I would like to use one of Caffe's reference models, i.e. bvlc_reference_caffenet. I found that my target class, i.e. person, is one of the classes included in the ILSVRC dataset the model was trained on. As my goal is to classify whether a test image contains a person or not, I could achieve this in one of two ways:
Use inference directly with the 1000 outputs. This doesn't require any training/learning.
Change the network topology slightly so that the final FC layer's number of outputs (num_output) is set to 2 instead of 1000, and retrain it as a binary classification problem.
My concern is about the computational effort at the deployment/prediction phase (testing). The former looks more expensive computationally than the latter, because during the prediction phase it needs to compute all 1000 output scores in order to find the one with the highest value. What I'm not sure about is whether there is a heuristic (which I'm not aware of) that simplifies the computation.
Can somebody please help cross-check my understanding of this?
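To put rough numbers on the concern (these are ballpark AlexNet/CaffeNet figures and assumptions, not values taken from the question): the final FC layer maps a 4096-dimensional fc7 output to num_output scores, which is tiny compared to the cost of the rest of the forward pass.

FC7_DIM = 4096           # assumed fc7 width for an AlexNet-style network
TOTAL_MACS = 700e6       # assumed ballpark multiply-adds for one forward pass

for num_output in (1000, 2):
    fc8_macs = FC7_DIM * num_output
    print(f"num_output={num_output:4d}: fc8 multiply-adds = {fc8_macs:,} "
          f"({fc8_macs / TOTAL_MACS:.3%} of the assumed total)")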

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, they are completely independent of the binary dependent variable, but may still be correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world—you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors and you're targeting a risk level of α = 0.05, you can expect about 50 non-predictive factors to leak through as falsely significant. To avoid this, you need to adjust your testing threshold using something along the lines of a Bonferroni correction. Recall that statistical discriminating power depends on the standard error, which is inversely proportional to the square root of the sample size. Bonferroni says that 1000 simultaneous tests need their individual test thresholds tightened by a factor of 1000, which by this rough argument means the sample size needs to be on the order of a million times larger than when performing a single test for significance.
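As a plain restatement of those numbers in code (nothing here beyond the 1000-test, α = 0.05 scenario above):

n_tests = 1000
alpha = 0.05
expected_false_positives = n_tests * alpha   # about 50 non-predictive factors flagged
bonferroni_threshold = alpha / n_tests       # each individual test now needs p < 5e-05
print(expected_false_positives, bonferroni_threshold)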
So in summary I'd say that you shouldn't attempt to ensure a lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice there will be non-predictors that leak through unless you can obtain enough data, so I'd suggest that your testing should address the rate at which this occurs as a function of the number of candidate factors and the sample size.
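If it helps, here is a minimal numpy sketch of the kind of data the question describes; the sample size, frequencies, effect sizes and the logistic link are assumptions to be replaced with your own choices:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000            # assumed sample size
n_noise_bin = 900     # unrelated binary variables
n_noise_cont = 100    # unrelated continuous variables
n_signal = 10         # binary variables tied to the outcome

# Independent "noise" predictors; spurious correlations with y will still occur.
X_noise_bin = rng.binomial(1, 0.3, size=(n, n_noise_bin))
X_noise_cont = rng.normal(size=(n, n_noise_cont))

# Signal predictors with chosen frequencies and assumed log-odds effects.
freqs = np.full(n_signal, 0.2)
betas = np.full(n_signal, 0.8)
X_signal = rng.binomial(1, freqs, size=(n, n_signal))

# Binary outcome generated from the signal variables only.
p = 1 / (1 + np.exp(-(-1.0 + X_signal @ betas)))
y = rng.binomial(1, p)

X = np.hstack([X_signal, X_noise_bin, X_noise_cont])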

Saving Random Forest Classifiers (sklearn) with pickle/joblib creates huge files

I am trying to save a bunch of trained random forest classifiers in order to reuse them later. For this, I am trying to use pickle or joblib. The problem I encounter is that the saved files get huge. This seems to be correlated with the amount of data that I use for training (which is several tens of millions of samples per forest, leading to dumped files on the order of up to 20GB!).
Is the RF classifier itself saving the training data in its structure? If so, how could I take the structure apart and only save the necessary parameters for later predictions? Sadly, I could not find anything on the subject of size yet.
Thanks for your help!
Baradrist
Here's what I did in a nutshell:
I trained the (fairly standard) RF on a large dataset and saved the trained forest afterwards, trying both pickle and joblib (also with the compress option set to 3).
import pickle
import joblib
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = ...  # some data
classifier = RandomForestClassifier(n_estimators=24, max_depth=10)
classifier.fit(X_train, y_train)
pickle.dump(classifier, open(path + 'classifier.pickle', 'wb'))
or
joblib.dump(classifier, path + 'classifier.joblib', compress=True)
Since the saved files got quite big (5GB to nearly 20GB, compressed to approximately 1/3 of that, and I will need more than 50 such forests!) and the training takes a while, I experimented with different subsets of the training data. Depending on the size of the training set, I found different sizes for the saved classifier, making me believe that information about the training data is pickled/joblibed as well. This seems unintuitive to me, since for predictions I only need the parameters of the trained weak predictors (decision trees), which should be fixed; and since the number of trees and the max depth are not too high, they should not take up that much space, and certainly not more because of a larger training set.
All in all, I suspect that the saved structure contains more than I need. Yet I couldn't find a good answer on how to exclude those parts and save only the information necessary for my future predictions.
I ran into a similar issue and also thought at first that the model was saving unnecessary information or that the serialization was introducing some redundancy. It turns out that decision trees are indeed memory-hungry structures that consist of multiple arrays whose length is given by the total number of nodes. The number of nodes generally grows with the size of the data (and parameters like max_depth cannot effectively be used to limit that growth, since reasonable values still leave room for a huge number of nodes). See details in this answer, but the gist is:
a single decision tree can easily grow to a few MBs (the example above has a 5MB decision tree for 100K data points and a 50MB decision tree for 1M data points)
a random forest commonly contains at least 100 such decision trees, so for the example above you would have models in the range of 0.5 to 5 GB
compression is usually not enough to reduce them to reasonable sizes (1/2 to 1/3 of the original is the usual range)
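A self-contained sketch of how to check this (synthetic data and arbitrary sizes are assumptions; with real training sets of tens of millions of rows the node counts grow dramatically):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(10_000, 20)
y = (X[:, 0] > 0.5).astype(int)
forest = RandomForestClassifier(n_estimators=24, max_depth=10).fit(X, y)

# Each node contributes an entry to several parallel arrays, so memory is
# roughly proportional to the total number of nodes.
tree = forest.estimators_[0].tree_
per_node_arrays = [tree.children_left, tree.children_right, tree.feature,
                   tree.threshold, tree.impurity, tree.n_node_samples, tree.value]
print("nodes in first tree:", tree.node_count)
print("bytes in its per-node arrays:", sum(a.nbytes for a in per_node_arrays))
print("average nodes per tree:",
      np.mean([e.tree_.node_count for e in forest.estimators_]))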
Other notes:
with a different algorithm the models might remain a more manageable size (e.g. with xgboost I saw much smaller serialized models)
it is probably possible to "prune" some of the data used by the decision trees if you only plan to reuse them for prediction. In particular, I imagine the impurity array and possibly those related to n_samples might not be needed, but I have not checked.
with respect to your hypothesis that the random forest is saving the data on which it is trained: no, it is not, and the data itself would likely be one or more orders of magnitude smaller than the final model
so in principle another strategy, if you have a reproducible training pipeline, could be to save the data instead of the model and retrain on purpose; but this is only possible if you can spare the time to retrain (for example in a use case where you have a long-running service that keeps the model in memory, and you serialize the model only to have a backup for when it goes down)
there are probably other options to limit the growth of a random forest; the best one I have found so far is in this answer, where the suggestion is to set min_samples_leaf as a percentage of the data (a minimal sketch of this follows)
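A minimal sketch of that last suggestion (the 1% value is only an assumption, chosen to show the fractional form of min_samples_leaf):

from sklearn.ensemble import RandomForestClassifier

# A float min_samples_leaf is read as a fraction of the training set, which
# caps how fine-grained, and therefore how large, each tree can become.
small_forest = RandomForestClassifier(
    n_estimators=24,
    max_depth=10,
    min_samples_leaf=0.01,   # assumed: every leaf must hold at least 1% of the samples
)
# small_forest.fit(X_train, y_train) would then produce far fewer nodes per tree.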

XGBoost regression RMSE individual prediction

I have a simple regression problem with two independent variables and one dependent variable. I tried linear regression from statsmodels and scikit-learn, but I get the best results (R^2 and RMSE) with the XGBoost regressor.
On the new data set, RMSE is still in line with earlier results, but individual predictions are very different.
For example, the RMSE is 1000, but individual prediction errors vary from 20 to 3000. Thus, predictions are either almost perfectly accurate or deviate strongly in a few cases, and I don't know why that is.
My question is: what causes such variation in the individual predictions?
When testing your model with new data, it's normal to get some of the predictions wrong. An RMSE of 1000 means that the square root of the average squared difference between the actual and predicted values is 1000; individual errors can be much smaller or much larger than that. You can have values that are predicted very well, as well as values that give a very large squared error. The reason for this could be overfitting. It could also be that the new data set contains data that is very different from the data the model was trained on. But since the RMSE is in line with earlier results, I understand that the RMSE was around 1000 on the training set as well, so I don't necessarily see a problem with the test set. What I would do is go through the preprocessing steps for the training data and make sure they're done correctly:
standardize the data and remove possible skewness
check for collinearity between independent variables (you only have 2, so it should be easy to do)
check whether the independent variables have acceptable variance. If a variable barely varies from one data point to the next, it may be useless for explaining the dependent variable.
BTW, what is the R^2 score for your regression? It tells you how much of the variability of the target variable is explained by your model. A low R^2 score would indicate that the regressors used aren't very useful in explaining your target variable.
You can use sklearn's StandardScaler() to standardize the data.
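A minimal sketch of those checks (the data here is synthetic and stands in for your two regressors and your model's predictions):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # placeholder: two regressors
y_train = 3 * X_train[:, 0] + rng.normal(size=200)
X_test = rng.normal(size=(50, 2))
y_test = 3 * X_test[:, 0] + rng.normal(size=50)
y_pred = 3 * X_test[:, 0]                      # stand-in for model predictions

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)          # reuse the training statistics

print("corr(x1, x2):", np.corrcoef(X_train[:, 0], X_train[:, 1])[0, 1])
print("per-feature variance:", X_train.var(axis=0))

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse, "R^2:", r2_score(y_test, y_pred))
print("largest absolute errors:", np.sort(np.abs(y_test - y_pred))[-3:])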

Is BatchNorm turned off when inferencing?

I have read several sources that implicitly suggest that batchnorm is turned off for inference, but I have found no definite answer to this.
The most common approach is to use a moving average of the mean and std for your batch normalization, as Keras does for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you simply turned it off, the network would perform worse on the same data, due to changes in how the images are processed.
This is done by storing the moving average of the mean and the std over all the batches used during training. At inference time this moving average is used for normalization.
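A minimal numpy sketch of that behaviour (the momentum value and the 1-D setup are assumptions; real layers also learn a scale and shift):

import numpy as np

class SimpleBatchNorm:
    """Toy 1-D batch norm: batch statistics in training, moving averages at inference."""
    def __init__(self, momentum=0.9, eps=1e-5):    # assumed momentum
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0

    def __call__(self, x, training):
        if training:
            mean, var = x.mean(), x.var()
            # update the moving averages that inference will rely on
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = SimpleBatchNorm()
batch = np.random.randn(32) * 3 + 5
train_out = bn(batch, training=True)     # normalized with the batch's own mean/std
test_out = bn(batch, training=False)     # normalized with the stored moving averages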

The number of output units and the loss function in a binary classification network

Say I have a binary classification task, and I build a neural network to do this.
There are two different frameworks to choose from: in the first, the network has one output unit indicating the probability of belonging to one of the classes, so I can use binary cross-entropy to compute the loss; in the second, the network has two output units indicating the probabilities of belonging to the two classes separately, and I can use softmax cross-entropy to compute the loss.
Some suggest using the first option. My confusion is about what the pros and cons of the two options are, and what the most severe problem is if I choose the second framework. Can anyone explain this to me in detail? Thanks in advance.
If you use one output unit, then you should understand that you are choosing strictly between two classes: if the probability is high enough, your network chooses class A, otherwise it chooses class B. If you have two output units, your network may produce a rather low probability for both units, so you could end up with neither A nor B. You should choose between the two approaches depending on the real system you're trying to model with your network.
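For concreteness, here is a minimal Keras-style sketch of the two setups described in the question (the framework, hidden-layer size and input dimension are all assumptions):

from tensorflow import keras

# Option 1: one output unit with a sigmoid, trained with binary cross-entropy.
one_unit = keras.Sequential([
    keras.Input(shape=(10,)),                  # assumed 10 input features
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
one_unit.compile(optimizer="adam", loss="binary_crossentropy")

# Option 2: two output units with a softmax, trained with (sparse) categorical cross-entropy.
two_units = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])
two_units.compile(optimizer="adam", loss="sparse_categorical_crossentropy")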