Why does CatBoostClassifier yield different results between CPUs? - catboost

I know that CatBoost yields different results when trained on GPU versus CPU, due to things like the sub-sampling method.
Now I've realized it also isn't reproducible across different CPUs with the same training data and parameter settings. I used one dataset with 28 float fields and two categorical fields to train CatBoost with default parameters, except random_state=42. First I tried on a computer with an Intel i7, then on another with an Intel Xeon (also a Colab notebook), with no train/test split. Then I used the .predict_proba() method to predict probabilities on the training dataset (the whole one mentioned above), and the two computers gave me different results. With different CPUs I would expect differences around the 14th-16th decimal place, but I got a difference at the 1st decimal place.
If I train the model on only one computer, then save the model and use it on another, the results will be the same.
I have studied the parameters of CatBoostClassifier but still don't know which parameter to change to make CatBoost reproducible on my data.
Has anyone faced this problem? Which CatBoost parameter do I have to change?
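For context, a minimal sketch of the setup described above (the file path, column names, and categorical field names are placeholders, not the actual data):

```python
import pandas as pd
from catboost import CatBoostClassifier

df = pd.read_csv("train.csv")                    # 28 float fields + 2 categorical fields
X, y = df.drop(columns=["target"]), df["target"]
cat_features = ["cat_field_1", "cat_field_2"]    # the two categorical fields

model = CatBoostClassifier(random_state=42, verbose=0)   # otherwise default parameters
model.fit(X, y, cat_features=cat_features)

proba = model.predict_proba(X)   # differs between the two machines

# Saving on one machine and loading on another gives identical predictions,
# as noted above.
model.save_model("model.cbm")
restored = CatBoostClassifier()
restored.load_model("model.cbm")
```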

Related

XGBoost regression RMSE individual prediction

I have a simple regression problem with two independent variables and one dependent one. I tried linear regression from statsmodels and scikit-learn, but I get the best results (R² and RMSE) with the XGBoost regressor.
On the new data set, the RMSE is still in line with earlier results, but individual predictions are very different.
For example, the RMSE is 1000, and individual prediction errors vary from 20 to 3000. Thus, predictions are either almost perfectly accurate or strongly deviate in a few cases, but I don't know why that is.
My question is: what is the cause of such variation in individual predictions?
When testing your model on new data, it's normal to get some of the predictions wrong. An RMSE of 1000 means that, in the root-mean-squared sense, your predictions deviate from the actual values by about 1000 on average. You can have values that are predicted very well, as well as values that give a very large squared error. The reason for this could be overfitting. It could also be that the new data set contains data that is very different from the data the model was trained on. But since the RMSE is in line with earlier results, I understand that the RMSE was around 1000 on the training set as well. Therefore I don't necessarily see a problem with the test set. What I would do is go through the preprocessing steps for the training data and make sure they're done correctly:
standardize the data and remove possible skewness
check for collinearity between independent variables (you only have 2, so it should be easy to do)
check whether the independent variables have acceptable variance. If a variable barely varies from one data point to the next, it may be useless for explaining the dependent variable.
BTW, what is the R² score for your regression? It should tell you how much of the variability of the target variable is explained by your model. A low R² score would indicate that the regressors aren't very useful in explaining your target variable.
You can use sklearn's StandardScaler to standardize the data.
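For reference, a minimal scikit-learn example (the array below is toy data, just to show the fit/transform pattern):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two independent variables, three observations.
X_train = np.array([[1.0, 200.0],
                    [2.0, 180.0],
                    [3.0, 220.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only

# For new/test data, reuse the fitted scaler:
# X_new_scaled = scaler.transform(X_new)
```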

the number of the output unit and the loss function in binary classification network

Say I have a binary classification task, and I build a neural network to do this.
There are two different setups to choose from. In the first, the network has one output unit indicating the probability of belonging to one of the classes, so I can use binary cross-entropy to compute the loss. In the second, the network has two output units indicating the probabilities of belonging to the two classes separately, and I can use softmax cross-entropy to compute the loss.
Some suggest using the first option. My confusion is: what are the pros and cons of the two options, and what is the most serious problem if I choose the second setup? Can anyone explain this in detail? Thanks in advance.
If you use one output unit then you should understand that you are choosing strictly between two classes. If the probability is high enough then your network chooses class A, otherwise it chooses class B. If you have two output units, your network may produce rather low probabilities for both units, so you could end up with neither A nor B. You should choose between the two approaches depending on what real system you're trying to model with your network.
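For concreteness, a minimal Keras sketch of the two setups (the framework and layer sizes are assumptions; the question doesn't specify either):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 20   # placeholder input dimension

# Setup 1: one output unit with a sigmoid, trained with binary cross-entropy.
model_one = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(1, activation="sigmoid"),
])
model_one.compile(optimizer="adam", loss="binary_crossentropy")

# Setup 2: two output units with a softmax, trained with (sparse) categorical
# cross-entropy on integer labels 0/1.
model_two = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(n_features,)),
    layers.Dense(2, activation="softmax"),
])
model_two.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```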

Counting the number of multiply-add operations (MAC) in Caffe CNN's architecture

Lately I've been benchmarking some CNNs regarding time, # of multiply-add operations (MAC), # of parameters, and model size. I have seen some similar SO questions (here and here), and in the latter they suggest using Netscope CNN Analyzer. This tool allows me to calculate most of the things I need just by inputting my Caffe network definition.
However, the number of multiply-add operations of some architectures I've seen in papers and around the internet doesn't match what Netscope outputs, whereas other architectures match. I'm always comparing either FLOPs or MAC with the MACC column in Netscope, but there's a ~10x factor that I'm missing somewhere (check the table below for more detail).
Architecture | MAC (paper/internet) | macc column in Netscope
VGG 16       | ~15.5G               | ~157G
GoogLeNet    | ~1.55G               | ~16G
Reference about GoogLeNet macc number and VGG16 macc number in Netscope.
Could anybody who has used that tool point out what mistake I'm making when reading the Netscope output?
I've found what was causing the discrepancy between Netscope and the numbers I'd found in papers. Most preset architectures in Netscope use a batch size of 10 (this is the case for VGG and GoogLeNet, for example), hence the x10 factor multiplying the number of multiply-add operations.
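As a sanity check, a small sketch of how the per-layer multiply-add count scales with batch size (the layer shape below is just an illustration: a 3x3 conv with 64 filters on a 3-channel 224x224 input):

```python
def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c, batch=1):
    # One multiply-add per kernel weight, per output element, per image in the batch.
    return batch * out_h * out_w * out_c * k_h * k_w * in_c

per_image = conv_macs(224, 224, 64, 3, 3, 3)            # single image
per_batch = conv_macs(224, 224, 64, 3, 3, 3, batch=10)  # Netscope's preset batch of 10
print(per_batch / per_image)                            # -> 10.0, the factor in the table above
```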

Computation consideration with different Caffe's network topology (difference in number of output)

I would like to use one of Caffe's reference models, i.e. bvlc_reference_caffenet. I found that my target class, i.e. person, is one of the classes included in the ILSVRC dataset the model has been trained on. As my goal is to classify whether a test image contains a person or not, I could achieve this in one of the following ways:
Use the network for inference directly, with its 1000 outputs. This doesn't require any training/learning.
Change the network topology slightly so that the final FC layer's number of outputs (num_output) is set to 2 instead of 1000, and retrain it as a binary classification problem.
My concern is about the computational effort at the deployment/prediction phase (testing). The first option looks more computationally expensive than the second, because during prediction it has to compute all 1000 output scores to find the one with the highest score. What I'm not sure about is whether there is some heuristic (which I'm not aware of) that simplifies the computation.
Can somebody please help cross-check my understanding of this?
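A rough back-of-the-envelope comparison of the final FC layer's cost under the two options (the 4096-dimensional fc7 output is from bvlc_reference_caffenet; the ~720M total forward-pass MACs is an approximate figure for AlexNet-style networks, used only for scale):

```python
fc_in = 4096            # fc7 output dimension feeding the final FC layer
total_macs = 720e6      # rough total MACs for one forward pass (approximation)

macs_1000_outputs = fc_in * 1000   # option 1: keep the original 1000-way layer
macs_2_outputs = fc_in * 2         # option 2: retrain with num_output = 2

print(f"final layer, 1000 outputs: {macs_1000_outputs / 1e6:.2f}M MACs")
print(f"final layer, 2 outputs:    {macs_2_outputs / 1e6:.3f}M MACs")
print(f"saving as a share of the whole forward pass: "
      f"{(macs_1000_outputs - macs_2_outputs) / total_macs:.2%}")
```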

Keras pass data through layers explicitly

I am trying to implement a pairwise learning-to-rank model with keras, where the features are computed by a deep neural network.
In the pairwise L2R model, while training, I give the query, one positive result and one negative result, and the model is trained on a classification loss over the difference of their feature vectors.
I am able to compile and fit the model successfully, but the problem is actually using this model on test data.
As in a pairwise L2R model, at test time I only have (query, sample) pairs (no separate positives and negatives), and I can use the value calculated before the softmax to rank samples.
Is there any way I can use keras to pass data manually through particular trained layers at test time? (In short, I have 3 sets of inputs at training time and 2 at test time.)
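One common pattern that fits this (a sketch, not the asker's exact architecture; the input dimensions, layer sizes, and the concatenation of query and sample features are all assumptions) is to define the scoring branch as its own Model, reuse it with shared weights inside the three-input training model, and then call the branch directly on (query, sample) pairs at test time:

```python
from tensorflow import keras
from tensorflow.keras import layers

q_dim, d_dim = 64, 128   # placeholder feature dimensions

# Shared scoring branch: maps a (query, sample) pair to a single relevance score.
q_in = keras.Input(shape=(q_dim,))
d_in = keras.Input(shape=(d_dim,))
h = layers.Dense(64, activation="relu")(layers.Concatenate()([q_in, d_in]))
score = layers.Dense(1)(h)
scorer = keras.Model([q_in, d_in], score, name="scorer")

# Training model: three inputs (query, positive, negative); both calls to
# `scorer` share the same weights.
query = keras.Input(shape=(q_dim,), name="query")
pos = keras.Input(shape=(d_dim,), name="positive")
neg = keras.Input(shape=(d_dim,), name="negative")
diff = layers.Subtract()([scorer([query, pos]), scorer([query, neg])])
prob = layers.Activation("sigmoid")(diff)
train_model = keras.Model([query, pos, neg], prob)
train_model.compile(optimizer="adam", loss="binary_crossentropy")

# Test time: skip the pairwise wrapper and score (query, sample) pairs directly,
# ranking samples by the raw score.
# scores = scorer.predict([query_features, sample_features])
```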