I am using PyTorch Lightening trainer for pre-training a large model. I know I can resume training from old weights but that does not contain old hyper-parameters (lr, last_epoch, etc.). Is there any automatic way to resume training? OR Do I need to overload Checkpoint Callback or CSVLogger to search for old cvs logs and get last epoch number?
Instead of load from checkpoint, use resume_from_checkpoint in trainer.
# resume from a specific checkpoint
trainer = Trainer(resume_from_checkpoint="some/path/to/my_checkpoint.ckpt")
See more details here https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#resume-from-checkpoint .
Trainer(resume_from_checkpoint="some/path/to/my_checkpoint.ckpt") has been depreciated. New method is: self.trainer.fit(self.model, self.data_module, ckpt_path = self.ckpt_path). It requires to save best/last model after certain number of epochs with pre-defined path+name and then retrieve from path for resuming the training.
I am wondering if this method would be effective for transfer learning i.e. changing only few fc layer for some other task.
Related
I have been looking for certain features in the HuggingFace transformer Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and if so, how to implement them, or whether I would have to write my own training loop to enable them.
I am looking to apply Curriculum Learning to my training strategy, as well as evaluating the model at regular intervals, and therefore would like to enable the following
choose in which order the model sees training samples at each epoch (it seems that the data passed onto the train_dataset argument are automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs, as I may want to start training the model from easy samples for a few epochs, and then pass a random shuffle of all data for later epochs)
run custom evaluation at integer multiples of a fix number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed* and the user can decide how to generate the metrics given these. However I'd like a finer level of control, for example changing the maximum sequence length for the tokenizer when doing the evaluation, as opposed to when doing training, which would require me including some explicit evaluation code inside compute_metrics which needs to access the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this and I'm not sure it would work with the Trainer if it doesn't have this configuration
def compute_metrics(eval_pred):
predictions, labels = eval_pred
...
You can pass custom functions to compute metrics in the training arguments
Assume that I am training a neural network model. I am storing the tensor file of the neural network model for every 15 epochs in .pth format.
I need to run 1000 epochs in total. Suppose I stopped my program during the 501st epoch, then I have the following files
15.pth, 30.pth, 45.pth, 60.pth, 75.pth,.... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
Then my doubt is
Is it possible to use the last stored model 495.pth and continue execution as it generally happens if done without any interruption? In short, I am asking for something similar to the "resumption" of the training phase with a few modifications to the existing code. I am just asking for such a possibility.
I am asking for general practice and not particular to any code. If such a method exists, I will be free to stop any program under execution and can resume later. Currently, I cannot use resources for shorter programs if longer programs are in execution and hence I am asking this question.
I order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:
Current weights of the model.
State of the optimizer: most optimizers keep track of different statistics of the updates, e.g., momentum, variance etc.
State of the learning rate scheduler.
Additional "state" variables unique to your code.
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
So what I do is the following:
After each epoch I save my models weights into a .pt file and each time I run my program in gerneral I check if the resume argument is set to True. If so, I initialize the model using the weights in the .pt file as just continue training, if not I initialize random weights as normal. This could look like this:
def train(resume: bool=False):
model = Model()
if resume:
model.load_state_dict(torch.load("weights.pt"))
criterion = Loss()
optimizer = Optimizer()
for epoch in range(100):
for data, targets in dataloader:
optimizer.zero_grad()
predictions = model.train()(data)
loss = criterion(predicitions, targets)
loss.backward()
optimizer.step()
torch.save(model.state_dict(), "weights.pt")
So if I interrupt the training, I can still continue after my last epoch that I saved.
Normally you are logging more stuff than only the weights, for example the learning-rate scheduler or simply the loss and accuracy history. For that you could save the training history into a json file and read it out if resume is True.
Below is the code to configure TrainingArguments consumed from the HuggingFace transformers library to finetune the GPT2 language model.
training_args = TrainingArguments(
output_dir="./gpt2-language-model", #The output directory
num_train_epochs=100, # number of training epochs
per_device_train_batch_size=8, # batch size for training #32, 10
per_device_eval_batch_size=8, # batch size for evaluation #64, 10
save_steps=100, # after # steps model is saved
warmup_steps=500,# number of warmup steps for learning rate scheduler
prediction_loss_only=True,
metric_for_best_model = "eval_loss",
load_best_model_at_end = True,
evaluation_strategy="epoch",
learning_rate=0.00004, # learning rate
)
early_stop_callback = EarlyStoppingCallback(early_stopping_patience = 3)
trainer = Trainer(
model=gpt2_model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=test_dataset,
callbacks = [early_stop_callback],
)
The number of epochs as 100 and learning_rate as 0.00004 and also the early_stopping is configured with the patience value as 3.
The model ran for 5/100 epochs and noticed that the difference in loss_value is negligible. The latest checkpoint is saved as checkpoint-latest.
Now Can I modify the learning_rate may be to 0.01 from 0.00004 and resume the training from the latest saved checkpoint - checkpoint-latest? Doing that will be efficient?
Or to train with the new learning_rate value should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model take in the direction determined by your loss function.
You can also think of it as transfer learning where the model has some experience (no matter how little or irrelevant) and the weights are in a state most likely better than a randomly initialised one.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning and you should change it if you have a very very good reason to do it.
You would probably want to write down when (why, what, etc) you did it if you or someone else wants to "reproduce" the result of your model.
Pytorch provides several methods to adjust the learning_rate: torch.optim.lr_scheduler.
Check the docs for usage https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Currently, I'm building a new transformer-based model with huggingface-transformers, where attention layer is different from the original one. I used run_glue.py to check performance of my model on GLUE benchmark. However, I found that Trainer class of huggingface-transformers saves all the checkpoints that I set, where I can set the maximum number of checkpoints to save. However, I want to save only the weight (or other stuff like optimizers) with best performance on validation dataset, and current Trainer class doesn't seem to provide such thing. (If we set the maximum number of checkpoints, then it removes older checkpoints, not ones with worse performances). Someone already asked about same question on Github, but I can't figure out how to modify the script and do what I want. Currently, I'm thinking about making a custom Trainer class that inherits original one and change the train() method, and it would be great if there's an easy and simple way to do this. Thanks in advance.
You may try the following parameters from trainer in the huggingface
training_args = TrainingArguments(
output_dir='/content/drive/results', # output directory
do_predict= True,
num_train_epochs=3, # total number of training epochs
**per_device_train_batch_size=4, # batch size per device during training
per_device_eval_batch_size=2**, # batch size for evaluation
warmup_steps=1000, # number of warmup steps for learning rate
save_steps=1000,
save_total_limit=10,
load_best_model_at_end= True,
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=0, evaluate_during_training=True)
There may be better ways to avoid too many checkpoints and selecting the best model.
So far you can not save only the best model, but you check when the evaluation yields better results than the previous one.
I have not seen any parameter for that. However, there is a workaround.
Use following combinations
evaluation_strategy =‘steps’,
eval_steps = 10, # Evaluation and Save happens every 10 steps
save_total_limit = 5, # Only last 5 models are saved. Older ones are deleted.
load_best_model_at_end=True,
When I tried with the above combination, at any time 5 previous models will be saved in output directory, but if the best model is not one among them, it will keep the best model as well. So it will be 1 + 5 models. You can change save_total_limit = 1 so it will serve your purpose
This answer could be useful
training_args = TrainingArguments(
output_dir=repo_name,
group_by_length=True,
length_column_name='input_length',
per_device_train_batch_size=24,
gradient_accumulation_steps=2,
evaluation_strategy="steps",
num_train_epochs=20,
fp16=True,
save_steps=1000,
save_strategy='steps', # we cannot set it to "no". Otherwise, the model cannot guess the best checkpoint.
eval_steps=1000,
logging_steps=1000,
learning_rate=5e-5,
warmup_steps=500,
save_total_limit=3,
load_best_model_at_end = True # this will let the model save the best checkpoint
)
This should be helpful where you compare the current validation accuracy with the best one and then save the best model.
I would like to modify the ImageNet caffe model as described bellow:
As the input channel number for temporal nets is different from that
of spatial nets (20 vs. 3), we average the ImageNet model filters of
first layer across the channel, and then copy the average results 20
times as the initialization of temporal nets.
My question is how can I achive the above results? How can I open the caffe model to be able to do those changes to it?
I read the net surgery tutorial but it doesn't cover the procedure needed.
Thank you for your assistance!
AMayer
The Net Surgery tutorial should give you the basics you need to cover this. But let me explain the steps you need to do in more detail:
Prepare the .prototxt network architectures: You need two files: the existing ImageNet .prototxt file, and your new temporal network architecture. You should make all layers except the first convolutional layers identical in both networks, including the names of the layers. That way, you can use the ImageNet .caffemodel file to initialize the weights automatically.
As the first conv layer has a different size, you have to give it a different name in your .prototxt file than it has in the ImageNet file. Otherwise, Caffe will try to initialize this layer with the existing weights too, which will fail as they have different shapes. (This is what happens in the edit to your question.) Just name it e.g. conv1b and change all references to that layer accordingly.
Load the ImageNet network for testing, so you can extract the parameters from the model file:
net = caffe.Net('imagenet.prototxt', 'imagenet.caffemodel', caffe.TEST)
Extract the weights from this loaded model.
conv_1_weights = old_net.params['conv1'][0].data
conv_1_biases = old_net.params['conv1'][1].data
Average the weights across the channels:
conv_av_weights = np.mean(conv_1_weights, axis=1, keepdims=True)
Load your new network together with the old .caffemodel file, as all layers except for the first layer directly use the weights from ImageNet:
new_net = caffe.Net('new_network.prototxt', 'imagenet.caffemodel', caffe.TEST)
Assign your calculated average weights to the new network
new_net.params['conv1b'][0].data[...] = conv_av_weights
new_net.params['conv1b'][1].data[...] = conv_1_biases
Save your weights to a new .caffemodel file:
new_net.save('new_weights.caffemodel')