StableBaselines3 - Why does calling "model.learn(50,000)" twice not give same result as calling "model.learn(100,000)" once? - reinforcement-learning

I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
Does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because in both cases I tested the model on a test set in a for-loop, and the performance was different. Given that I set deterministic=True in that loop and didn't change the seed, the different performance must mean the networks are different, which means the training process was different.
I was under the impression that if I run model.learn() on an existing model, it would just pick up the training where it was previously left off, but I guess that's incorrect.
Can someone help me understand why those two situations deliver different results?
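For reference, here is a minimal sketch of the two setups being compared; env is assumed to be the same environment as in the question, and everything else is illustrative rather than an exact reproduction. Note that in Stable-Baselines3, learn() takes a reset_num_timesteps argument (True by default) that controls whether the timestep counter and related internal state are reset at the start of each call; passing reset_num_timesteps=False on the second call makes the split run behave more like a continuation, although it does not guarantee bit-for-bit identical training.
from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy

common_kwargs = dict(verbose=1, learning_rate=0.0003, gamma=0.975,
                     seed=10, batch_size=256, clip_range=0.2)

# Setup A: one uninterrupted run of 100k timesteps
model_a = MaskablePPO(MaskableActorCriticPolicy, env, **common_kwargs)
model_a.learn(100_000)

# Setup B: two consecutive 50k runs; by default learn() resets the
# timestep counter between calls, so the second call does not simply
# resume where the first one stopped.
model_b = MaskablePPO(MaskableActorCriticPolicy, env, **common_kwargs)
model_b.learn(50_000)
model_b.learn(50_000, reset_num_timesteps=False)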

Related

Save only best weights with huggingface transformers

Currently, I'm building a new transformer-based model with huggingface-transformers, where the attention layer is different from the original one. I used run_glue.py to check the performance of my model on the GLUE benchmark. However, I found that the Trainer class of huggingface-transformers saves all the checkpoints I configure, where I can set the maximum number of checkpoints to keep. However, I want to save only the weights (or other things like the optimizer state) with the best performance on the validation dataset, and the current Trainer class doesn't seem to provide such a thing. (If we set the maximum number of checkpoints, it removes the older checkpoints, not the ones with worse performance.) Someone already asked the same question on GitHub, but I can't figure out how to modify the script to do what I want. Currently, I'm thinking about making a custom Trainer class that inherits from the original one and changes the train() method, but it would be great if there's an easier and simpler way to do this. Thanks in advance.
You may try the following parameters for the Trainer in huggingface:
training_args = TrainingArguments(
    output_dir='/content/drive/results',   # output directory
    do_predict=True,
    num_train_epochs=3,                    # total number of training epochs
    per_device_train_batch_size=4,         # batch size per device during training
    per_device_eval_batch_size=2,          # batch size for evaluation
    warmup_steps=1000,                     # number of warmup steps for the learning rate
    save_steps=1000,
    save_total_limit=10,
    load_best_model_at_end=True,
    weight_decay=0.01,                     # strength of weight decay
    logging_dir='./logs',                  # directory for storing logs
    logging_steps=0,
    evaluate_during_training=True)
There may be better ways to avoid keeping too many checkpoints while selecting the best model.
So far you cannot save only the best model, but you can check whether the latest evaluation yields better results than the previous one.
I have not seen any parameter for that. However, there is a workaround.
Use the following combination:
evaluation_strategy='steps',
eval_steps=10,               # evaluation and save happen every 10 steps
save_total_limit=5,          # only the last 5 checkpoints are kept; older ones are deleted
load_best_model_at_end=True,
When I tried this combination, at any time the 5 most recent checkpoints were kept in the output directory, but if the best model was not among them, it was kept as well, so there were 1 + 5 models. You can set save_total_limit=1 so that it serves your purpose.
This answer could be useful
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    length_column_name='input_length',
    per_device_train_batch_size=24,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=20,
    fp16=True,
    save_steps=1000,
    save_strategy='steps',          # cannot be "no", otherwise the trainer cannot pick the best checkpoint
    eval_steps=1000,
    logging_steps=1000,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,    # loads the best checkpoint when training ends
)
This should be helpful if you want to compare the current validation accuracy with the best one so far and keep only the best model.
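If you really want only the single best checkpoint on disk, another option is a small custom callback; the sketch below is illustrative (the class name SaveBestOnlyCallback and the metric key eval_accuracy are placeholders), relies only on the TrainerCallback/TrainerControl machinery from transformers, and triggers a save whenever the monitored metric improves.
from transformers import TrainerCallback

class SaveBestOnlyCallback(TrainerCallback):
    """Trigger a checkpoint only when the monitored eval metric improves."""
    def __init__(self, metric_name="eval_accuracy"):
        self.metric_name = metric_name
        self.best = None

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if not metrics or self.metric_name not in metrics:
            return control
        current = metrics[self.metric_name]
        if self.best is None or current > self.best:
            self.best = current
            control.should_save = True    # checkpoint this improvement
        else:
            control.should_save = False   # skip saving a non-best model
        return control

# trainer = Trainer(model=model, args=training_args, ..., callbacks=[SaveBestOnlyCallback()])
Combined with a small save_total_limit, this keeps the number of checkpoints on disk low while still retaining the best weights.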

Training and test diverge while running catboost

When I run the catboost regressor my training and test plots diverge with weird kinks at ~1000 iterations. The plot is appended below and my regressor setup is as follows:
cat_model=CatBoostRegressor(iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE', thread_count=-1, use_best_model = True, random_seed=12, random_strength=10, rsm=0.5)
I tried different values of leaf_estimation_iterations and bagging_temperature but did not have any success. Any suggestions on what I should try to get better results?
Model Fit Plot
The divergence is normal: you will always perform better on the training set, as the model overfits it, and your objective is to control that with the validation set.
First, I would recommend reading up on the bias vs. variance tradeoff for general intuition on how to tackle this issue.
Specifically for CatBoost, you want to regularize the training procedure so that it generalizes better.
You can start by adding more data and setting a higher l2_leaf_reg parameter.
The official documentation has many more good suggestions on model tuning:
https://catboost.ai/docs/concepts/parameter-tuning.html
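As a rough illustration of that direction (the splits X_train/y_train and X_val/y_val are placeholders and the values are not tuned), you could try something like:
from catboost import CatBoostRegressor

cat_model = CatBoostRegressor(
    iterations=2500,
    depth=4,
    learning_rate=0.01,
    loss_function='RMSE',
    use_best_model=True,        # keep the iteration with the best validation score
    random_seed=12,
    random_strength=10,
    rsm=0.5,
    l2_leaf_reg=10,             # stronger L2 penalty on leaf values
)
cat_model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),
    early_stopping_rounds=200,  # stop once the validation RMSE stops improving
    verbose=500)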

Does loss_weight=N and diff*N do the same thing in pycaffe?

I tried these two methods with pycaffe:
loss_weight=100 in prototxt;
net.blobs['fc'].diff[...] = A_loss + 100*B_loss.
I thought these would do the same thing according to backpropagation theory; however, the model loss shows the opposite result.
I want to know what the difference is between these two methods. How should I deal with loss weights when there are multiple losses?

How to get the accuracy of classifier on test data in DeepLearning

I am trying to use DL4J for deep learning and have provided the training data with the labels. I am then trying to send test data by assigning a dummy label; without providing a dummy label, it gives a runtime error. I don't understand why we need to assign a label to test data.
Additionally, I want to know what is the accuracy of the prediction made. From what I saw in the dl4j docs, there is something known as a confusion matrix which is generated. I understand that this just gives us an idea of how well the training data has trained the system. Is there a way to get the accuracy of prediction on test data? Since we are giving a dummy label for the test data, I feel that the confusion matrix is also not generated correctly.
First, how can you test whether the network outputs the correct labels if you don't know what the correct labels are? You should always have labels when training and testing, because that is how you can check whether the output is correct.
For the second question, I found this on the DL4J webpage:
Evaluation eval = new Evaluation(3);                     // 3 output classes
INDArray output = model.output(testData.getFeatures());  // predictions on the test features
eval.eval(testData.getLabels(), output);                 // compare predictions with the true labels
log.info(eval.stats());                                  // print the confusion matrix and the metrics
It is stated there that the .stats() method displays the confusion matrix entries (one per line), accuracy, precision, recall, and F1 score. Additionally, the Evaluation class can also calculate and return the following values:
Confusion Matrix
False Positive/Negative Rate
True Positive/Negative
Class Counts
F-beta, G-measure, Matthews Correlation Coefficient and more
I hope this helps you.
You may find people who can respond to your question in the DL4J dev community here: https://gitter.im/deeplearning4j/deeplearning4j/tuninghelp

h2o deep learning different results per run

I am using H2O deep learning in Python on data with 2 balanced classes, "0" and "1", and adjusted the parameters as follows:
prostate_dl = H2ODeepLearningEstimator(
    activation="Tanh",
    hidden=[50, 50, 50],
    distribution="multinomial",
    score_interval=10,
    epochs=1000,
    input_dropout_ratio=0.2,
    adaptive_rate=True,
    rho=0.998,
    epsilon=1e-8)
prostate_dl.train(
    x=x,
    y=y,
    training_frame=train,
    validation_frame=test)
Each time the program runs it gives a different confusion matrix and different accuracy results; can anyone explain that? How can the results be reliable?
Also, all of the runs give the majority prediction as class "1", not "0"; is there any suggestion?
This question has already been answered here, but you need to set reproducible=True when you initialize the H2ODeepLearningEstimator in Python (or reproducible=TRUE in h2o.deeplearning() in R).
Even after setting reproducible=True, the H2O Deep Learning results are only reproducible when using a single core; in other words, when h2o.init(nthreads=1). The reasons behind this are outlined here.
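Putting that together, a reproducible version of the setup from the question might look like the sketch below (x, y, train and test are the same objects as in the question; the seed value is arbitrary):
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init(nthreads=1)                # single core, required for reproducibility

prostate_dl = H2ODeepLearningEstimator(
    activation="Tanh",
    hidden=[50, 50, 50],
    distribution="multinomial",
    epochs=1000,
    input_dropout_ratio=0.2,
    adaptive_rate=True,
    rho=0.998,
    epsilon=1e-8,
    seed=1234,                      # fix the seed as well
    reproducible=True)              # forces single-threaded, deterministic training
prostate_dl.train(x=x, y=y, training_frame=train, validation_frame=test)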
Also, per the H2O Deep Learning user guide:
Does each Mapper task work on a separate neural-net model that is combined during reduction, or is each Mapper manipulating a shared object that’s persistent across nodes?
Neither; there's one model per compute node, so multiple Mappers/threads share one model, which is why H2O is not reproducible unless a small dataset is used and force_load_balance=F or reproducible=T, which effectively rebalances to a single chunk and leads to only one thread to launch a map(). The current behavior is simple model averaging; between-node model averaging via "Elastic Averaging" is currently in progress.