Can anyone share their MoCo training log?
At epoch 11 of pretraining, the log showed "Acc#1 84.38 ( 83.33) Acc#5 87.50 ( 92.75)".
That looked like a reasonably good fit to me, so I transferred the model to the downstream task, but got bad results.
Is it possible that the training time was not enough?
I hope someone can help.
I am trying to re-pretrain ResNet-50 with MoCo v2. I have trained for 11 epochs, and the training log shows "Acc#1 84.38 ( 83.33) Acc#5 87.50 ( 92.75)".
When I take the pre-trained model to the downstream task, it performs badly.
Could this be caused by too little training time?
I wrote a custom gym environment and trained it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by TensorBoard is as follows:
[Figure: the ep_rew_mean curve over 100 million total steps; each episode has 50 steps]
As shown in the figure, the reward is around 15.5 after training, and the model has converged. However, when I run evaluate_policy() on the trained model, the reward is much smaller than the ep_rew_mean value. The first value is the mean reward and the second is the standard deviation of the reward:
4.349947246664763 1.1806464511030819
This is how I call evaluate_policy():
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
As far as I understand, reset() places the initial state randomly within an area, so overfitting should not be the problem.
I have also tried different learning rates and other parameters, but the problem persists.
I have checked my environment, and I believe it contains no errors.
I have searched the internet and read the stable-baselines3 docs and the issues on GitHub, but did not find a solution.
evaluate_policy sets deterministic to True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).
If you sample from the action distribution during training, it may help to evaluate the policy without having it select actions with an argmax (by passing in deterministic=False).
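To illustrate why the two evaluation modes can give very different returns, here is a toy sketch in plain Python. It is not stable-baselines3 code: the environment and the names `run_episode` and `evaluate` are made up for illustration. A memoryless policy over two actions is rewarded only if both actions occur within an episode, so always taking the argmax scores zero while sampling does not.

```python
import random


def run_episode(probs, deterministic, rng, n_steps=2):
    """Return 1.0 if every action was taken at least once in the episode, else 0.0."""
    actions = range(len(probs))
    seen = set()
    for _ in range(n_steps):
        if deterministic:
            # Argmax: always pick the most probable action.
            action = max(actions, key=probs.__getitem__)
        else:
            # Sample from the action distribution, as during training.
            action = rng.choices(actions, weights=probs)[0]
        seen.add(action)
    return 1.0 if len(seen) == len(probs) else 0.0


def evaluate(probs, deterministic, n_episodes=10_000, seed=0):
    """Mean episode reward of the toy policy in the chosen evaluation mode."""
    rng = random.Random(seed)
    rewards = [run_episode(probs, deterministic, rng) for _ in range(n_episodes)]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    policy = [0.55, 0.45]  # hypothetical action probabilities
    print("deterministic:", evaluate(policy, deterministic=True))
    print("stochastic:   ", evaluate(policy, deterministic=False))
```

Whether this effect explains the gap in the question depends on the environment; comparing both modes of evaluate_policy on the trained model is the direct way to check.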
Assume that I am training a neural network model and saving its tensors to a .pth file every 15 epochs.
I need to run 1000 epochs in total. Suppose I stop my program during the 501st epoch. I then have the following files:
15.pth, 30.pth, 45.pth, 60.pth, 75.pth,.... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
My question is:
Is it possible to take the last stored model, 495.pth, and continue training as it would have proceeded without any interruption? In short, I am asking for something like "resuming" the training phase with a few modifications to the existing code. I only want to know whether this is possible.
I am asking about general practice, not about any particular code. If such a method exists, I will be free to stop any running program and resume it later. Currently, I cannot use the resources for shorter programs while longer programs are running, which is why I am asking.
In order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:
Current weights of the model.
State of the optimizer: most optimizers keep track of different statistics of the updates, e.g., momentum, variance etc.
State of the learning rate scheduler.
Additional "state" variables unique to your code.
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
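The list above can be sketched as a pair of helpers. This is a minimal sketch under assumptions: the function names, the file layout, and the `extra` dictionary are hypothetical, and it only requires that the model, optimizer, and scheduler expose `state_dict()` and `load_state_dict()`, as PyTorch's do. It uses `pickle` to stay dependency-free; with PyTorch you would typically call `torch.save` and `torch.load` instead.

```python
import pickle


def save_checkpoint(path, epoch, model, optimizer, scheduler, extra=None):
    """Write everything needed to resume training into a single file."""
    state = {
        "epoch": epoch,                       # last completed epoch
        "model": model.state_dict(),          # current weights
        "optimizer": optimizer.state_dict(),  # momentum, variance estimates, ...
        "scheduler": scheduler.state_dict(),  # learning-rate schedule position
        "extra": extra or {},                 # any state unique to your code
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path, model, optimizer, scheduler):
    """Restore the saved state in place and return (next_epoch, extra)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"] + 1, state["extra"]
```

With the files from the question, loading the checkpoint written after epoch 495 would return 496 as the epoch to continue from, provided the optimizer and scheduler state were saved alongside the weights.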
What I do is the following:
After each epoch I save my model's weights to a .pt file, and every time I run my program I check whether the resume argument is set to True. If so, I initialize the model with the weights from the .pt file and just continue training; if not, I initialize random weights as usual. This could look like this:
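import torch

# Model, Loss, Optimizer and dataloader are placeholders for your own classes and data.
def train(resume: bool = False):
    model = Model()
    if resume:
        # Continue from the weights saved at the end of the last completed epoch.
        model.load_state_dict(torch.load("weights.pt"))
    criterion = Loss()
    optimizer = Optimizer(model.parameters())
    for epoch in range(100):
        model.train()
        for data, targets in dataloader:
            optimizer.zero_grad()
            predictions = model(data)
            loss = criterion(predictions, targets)
            loss.backward()
            optimizer.step()
        # Save once per epoch so an interrupted run can pick up from here.
        torch.save(model.state_dict(), "weights.pt")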
So if I interrupt the training, I can still continue from the last epoch that I saved.
Normally you log more than just the weights, for example the state of the learning-rate scheduler or simply the loss and accuracy history. For that you could save the training history to a JSON file and read it back in if resume is True.
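The JSON history idea could look roughly like this. It is a sketch only: the function names, the file path argument, and the "loss"/"accuracy" keys are assumptions chosen for illustration.

```python
import json
import os


def load_history(path, resume):
    """Return the saved history when resuming, otherwise start a fresh one."""
    if resume and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"loss": [], "accuracy": []}


def append_epoch(history, path, loss, accuracy):
    """Record one finished epoch and rewrite the JSON file on disk."""
    history["loss"].append(loss)
    history["accuracy"].append(accuracy)
    with open(path, "w") as f:
        json.dump(history, f)
```

A side benefit of this layout is that `len(history["loss"])` gives the number of completed epochs, which can serve as the starting epoch after a resume.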
Below is the code that configures TrainingArguments from the HuggingFace transformers library to fine-tune the GPT2 language model.
training_args = TrainingArguments(
    output_dir="./gpt2-language-model",  # the output directory
    num_train_epochs=100,                # number of training epochs
    per_device_train_batch_size=8,       # batch size for training #32, 10
    per_device_eval_batch_size=8,        # batch size for evaluation #64, 10
    save_steps=100,                      # the model is saved every save_steps steps
    warmup_steps=500,                    # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    learning_rate=0.00004,               # learning rate
)
early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)
trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)
The number of epochs is set to 100 and learning_rate to 0.00004, and early stopping is configured with a patience of 3.
The model ran for 5/100 epochs, and I noticed that the difference in the loss value is negligible. The latest checkpoint is saved as checkpoint-latest.
Can I now change the learning_rate from 0.00004 to, say, 0.01 and resume training from the latest saved checkpoint, checkpoint-latest? Would that be efficient?
Or, to train with the new learning_rate value, should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model takes in the direction determined by your loss function.
You can also think of it as transfer learning, where the model has some experience (no matter how little or irrelevant) and the weights are most likely in a better state than randomly initialised ones.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning, and you should change it only if you have a very good reason to do so.
You will probably want to write down when you did it (and why, to what value, etc.) in case you or someone else wants to reproduce the result of your model.
PyTorch provides several methods to adjust the learning rate in torch.optim.lr_scheduler.
Check the docs for usage: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
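Besides the schedulers, a one-off manual change in plain PyTorch can be made through the optimizer's `param_groups`, a list of dictionaries that each carry an "lr" entry. A minimal sketch follows; the helper name `set_learning_rate` is made up, and it works on any object with that `param_groups` layout. If a learning-rate scheduler is attached, it may overwrite the value on its next step.

```python
def set_learning_rate(optimizer, new_lr):
    """Set the learning rate of every parameter group; return the previous values."""
    old_lrs = [group["lr"] for group in optimizer.param_groups]
    for group in optimizer.param_groups:
        group["lr"] = new_lr
    return old_lrs
```

Returning the old values makes it easy to log what was changed and when, which helps with the reproducibility point above.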
I'm currently building a new transformer-based model with huggingface-transformers, in which the attention layer differs from the original one. I used run_glue.py to check my model's performance on the GLUE benchmark. However, I found that the Trainer class of huggingface-transformers saves a checkpoint at every interval I set, and the only control I have is the maximum number of checkpoints to keep. What I want is to save only the weights (and other state, such as the optimizer) with the best performance on the validation dataset, and the current Trainer class doesn't seem to provide this. (If we set the maximum number of checkpoints, it removes the older checkpoints, not the ones with worse performance.) Someone already asked the same question on GitHub, but I can't figure out how to modify the script to do what I want. I'm currently thinking about writing a custom Trainer class that inherits from the original one and overrides the train() method, but it would be great if there were an easier way to do this. Thanks in advance.
You may try the following parameters of the HuggingFace Trainer:
training_args = TrainingArguments(
    output_dir='/content/drive/results',  # output directory
    do_predict=True,
    num_train_epochs=3,                   # total number of training epochs
    per_device_train_batch_size=4,        # batch size per device during training
    per_device_eval_batch_size=2,         # batch size for evaluation
    warmup_steps=1000,                    # number of warmup steps for learning rate
    save_steps=1000,
    save_total_limit=10,
    load_best_model_at_end=True,
    weight_decay=0.01,                    # strength of weight decay
    logging_dir='./logs',                 # directory for storing logs
    logging_steps=0,
    evaluate_during_training=True,
)
There may be better ways to avoid keeping too many checkpoints and to select the best model.
So far, you cannot save only the best model, but you can check when an evaluation yields better results than the previous one.
I have not seen any parameter for that. However, there is a workaround.
Use the following combination:
evaluation_strategy = "steps",
eval_steps = 10, # Evaluation and save happen every 10 steps
save_total_limit = 5, # Only the last 5 models are saved. Older ones are deleted.
load_best_model_at_end = True,
When I tried the above combination, the 5 most recent models were kept in the output directory at any time, and if the best model was not among them, it was kept as well, giving 1 + 5 models. You can set save_total_limit = 1 so that it serves your purpose.
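The retention behaviour described above can be modelled with a few lines of plain Python. This is a toy model of what the answer observed, not the Trainer's actual implementation; the function name and the use of eval loss (lower is better) are assumptions.

```python
def surviving_checkpoints(eval_losses, save_total_limit):
    """Simulate checkpoint rotation that never deletes the best checkpoint.

    eval_losses[i] is the evaluation loss of checkpoint i (lower is better).
    Returns the indices of the checkpoints still on disk after the last save.
    """
    kept = []
    best = None
    for step, loss in enumerate(eval_losses):
        if best is None or loss < eval_losses[best]:
            best = step
        kept.append(step)
        # Keep the most recent checkpoints up to the limit, plus the best one.
        recent = kept[-save_total_limit:]
        kept = sorted(set(recent) | {best})
    return kept


if __name__ == "__main__":
    losses = [0.9, 0.5, 0.7, 0.8, 0.85, 0.9]  # hypothetical eval losses
    print(surviving_checkpoints(losses, save_total_limit=2))
```

In this model, a limit of 1 leaves the latest checkpoint plus the best one whenever the two differ.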
This answer could be useful
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    length_column_name='input_length',
    per_device_train_batch_size=24,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=20,
    fp16=True,
    save_steps=1000,
    save_strategy='steps',  # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
    eval_steps=1000,
    logging_steps=1000,
    learning_rate=5e-5,
    warmup_steps=500,
    save_total_limit=3,
    load_best_model_at_end=True,  # this will let the model save the best checkpoint
)
This should help in cases where you compare the current validation accuracy with the best one so far and then save the best model.
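For a hand-written training loop, the comparison itself can be sketched as a small tracker. The class name and its `greater_is_better` flag are hypothetical names chosen for this illustration; the actual saving is left to the caller.

```python
class BestCheckpointTracker:
    """Remember the best validation metric seen so far."""

    def __init__(self, greater_is_better=True):
        self.greater_is_better = greater_is_better
        self.best_value = None
        self.best_step = None

    def update(self, step, value):
        """Return True if `value` beats the best so far, i.e. a save is due."""
        if self.best_value is None:
            improved = True
        elif self.greater_is_better:
            improved = value > self.best_value
        else:
            improved = value < self.best_value
        if improved:
            self.best_value = value
            self.best_step = step
        return improved
```

In a loop this would be used as `if tracker.update(step, val_accuracy): ...save the model...`, always overwriting the same file so that only the best model is kept.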
When I run the catboost regressor my training and test plots diverge with weird kinks at ~1000 iterations. The plot is appended below and my regressor setup is as follows:
cat_model=CatBoostRegressor(iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE', thread_count=-1, use_best_model = True, random_seed=12, random_strength=10, rsm=0.5)
I tried different values of leaf_estimation_iterations and bagging_temperature but had no success. Any suggestions on what I should try to get better results?
Model Fit Plot
The divergence is normal. You will always perform better on the training set, since the model overfits it, and your objective is to keep that in check using the validation set.
First, I would recommend reading about the bias-variance tradeoff for a general intuition on how to tackle this issue.
Specifically for CatBoost, you want to regularize the training procedure so that it generalizes better.
You can start by adding more data and setting a higher l2_leaf_reg parameter.
The official documentation has many more good suggestions on model tuning:
https://catboost.ai/docs/concepts/parameter-tuning.html
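As a sketch of the l2_leaf_reg suggestion, the setup from the question can be kept as a parameter dictionary with stronger L2 regularization added. The value 10 is an arbitrary starting point chosen for illustration (CatBoost's default is 3) and would need tuning against the validation set; `build_regressor` is a hypothetical helper.

```python
# Parameters from the question, plus a higher l2_leaf_reg (assumed value, not tuned).
regularized_params = {
    "iterations": 2500,
    "depth": 4,
    "learning_rate": 0.01,
    "loss_function": "RMSE",
    "thread_count": -1,
    "use_best_model": True,
    "random_seed": 12,
    "random_strength": 10,
    "rsm": 0.5,
    "l2_leaf_reg": 10,
}


def build_regressor(params):
    """Create the regressor; catboost is imported here so the dict is usable without it."""
    from catboost import CatBoostRegressor

    return CatBoostRegressor(**params)
```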