Stopping CatBoost RandomizedSearch early and finding out the best model's hyperparameters - catboost

I started running CatBoost's randomized search implementation, and as far as I can see from the log (produced with verbose=1, shown below), one model has remained the best for a very long time. I cannot wait for the randomized search to run to the end, so I would like to know whether there is any way to access that model's hyperparameters after shutting my Python script down.
/catboost_log.txt
...
595: loss: 7.3805087 best: 6.8218305 (130) total: 9h 7m 51s remaining: 4h 16m 27s
596: loss: 7.3949953 best: 6.8218305 (130) total: 9h 10m 11s remaining: 4h 16m 12s
...

If you use a Jupyter notebook and set plot=True, you will see the parameter values on the plot.
You can also look into the file catboost_info/catboost_training.json; it contains the metric and parameter values for every iteration of randomized_search.
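For reference, a minimal sketch of inspecting that file after the script has been shut down. The exact keys inside the JSON can differ between CatBoost versions, so the field names below are assumptions to adapt to whatever you actually find in the file:

import json

# File written by CatBoost into the training directory
with open("catboost_info/catboost_training.json") as f:
    info = json.load(f)

# Inspect the top-level structure first; key names may vary by CatBoost version
print(list(info.keys()))

# Assumed layout: a list of per-iteration records holding metrics/parameters.
# Replace "iterations" with whatever key your file actually uses.
for record in info.get("iterations", [])[:5]:
    print(record)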

Related

Explain how this pipeline works and where it is run from. Why is my tok2vec loss increasing?

I am building a custom spaCy NER model, but I don't understand where this pipeline is run from. In the config.cfg file the epoch setting defaults to 0, so how are the epoch values increasing? Can anybody explain this from the attached snippet?
I trained on 40,000 samples and tested on 3,000.
I am expecting answers to the following questions:
Why is the loss value increasing?
Why is the epoch count increasing?
Which epoch value is correct?
Where is this calculated from?
Why does the same epoch value repeat many times?

Is it possible to resume execution from the point where neural network training was interrupted?

Assume that I am training a neural network model and storing the model's tensors in a .pth file every 15 epochs.
I need to run 1000 epochs in total. Suppose I stopped my program during the 501st epoch; then I have the following files:
15.pth, 30.pth, 45.pth, 60.pth, 75.pth, ... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
My question is:
Is it possible to use the last stored model, 495.pth, and continue execution as it would have proceeded without any interruption? In short, I am asking for something like a "resumption" of the training phase with a few modifications to the existing code. I am just asking whether such a possibility exists.
I am asking about general practice, not about any particular code. If such a method exists, I would be free to stop any program under execution and resume it later. Currently, I cannot use resources for shorter programs while longer programs are executing, which is why I am asking this question.
In order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:
Current weights of the model.
State of the optimizer: most optimizers keep track of different statistics of the updates, e.g., momentum, variance etc.
State of the learning rate scheduler.
Additional "state" variables unique to your code.
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
So what I do is the following:
After each epoch I save my model's weights to a .pt file, and each time I run my program I check whether the resume argument is set to True. If so, I initialize the model from the weights in the .pt file and just continue training; if not, I initialize random weights as usual. This could look like this:
import torch

def train(resume: bool = False):
    model = Model()                      # placeholder for your model class
    if resume:
        # restore the weights saved after the last completed epoch
        model.load_state_dict(torch.load("weights.pt"))
    criterion = Loss()                   # placeholder for your loss function
    optimizer = Optimizer()              # placeholder for your optimizer

    model.train()
    for epoch in range(100):
        for data, targets in dataloader:
            optimizer.zero_grad()
            predictions = model(data)
            loss = criterion(predictions, targets)
            loss.backward()
            optimizer.step()
        # save after every epoch so training can resume from here
        torch.save(model.state_dict(), "weights.pt")
So if I interrupt the training, I can still continue after the last epoch that I saved.
Normally you log more than just the weights, for example the learning-rate scheduler state or simply the loss and accuracy history. For that you could save the training history to a JSON file and read it back in if resume is True.
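A minimal sketch of that fuller checkpoint idea (the file path, the history object and the model/optimizer/scheduler arguments are placeholders for illustration):

import torch

def save_checkpoint(path, epoch, model, optimizer, scheduler, history):
    # Bundle everything needed to resume training into a single file.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
        "history": history,              # e.g. lists of loss/accuracy values
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    scheduler.load_state_dict(checkpoint["scheduler_state"])
    # Resume from the epoch after the last one that was completed.
    return checkpoint["epoch"] + 1, checkpoint["history"]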

Modifying the Learning Rate in the middle of the Model Training in Deep Learning

Below is the code that configures TrainingArguments from the Hugging Face transformers library to fine-tune the GPT-2 language model.
training_args = TrainingArguments(
    output_dir="./gpt2-language-model",   # the output directory
    num_train_epochs=100,                 # number of training epochs
    per_device_train_batch_size=8,        # batch size for training  #32, 10
    per_device_eval_batch_size=8,         # batch size for evaluation  #64, 10
    save_steps=100,                       # model is saved after this many steps
    warmup_steps=500,                     # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    learning_rate=0.00004,                # learning rate
)

early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)
The number of epochs is 100, the learning_rate is 0.00004, and early stopping is configured with a patience of 3.
The model ran for 5 of the 100 epochs, and I noticed that the change in the loss value is negligible. The latest checkpoint is saved as checkpoint-latest.
Can I now change the learning_rate, say from 0.00004 to 0.01, and resume training from the latest saved checkpoint, checkpoint-latest? Would doing that be effective?
Or, to train with the new learning_rate value, should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model takes in the direction determined by your loss function.
You can also think of it as transfer learning, where the model has some experience (no matter how little or irrelevant) and the weights are in a state that is most likely better than a randomly initialised one.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning, and you should only do it if you have a very good reason to.
You will probably want to write down when (and why, what, etc.) you did it in case you or someone else wants to "reproduce" the result of your model.
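As a rough sketch of what resuming with a new learning rate could look like with the Trainer API used above (the checkpoint path is an assumption based on the checkpoint-latest name from the question; note also that resuming restores the saved optimizer and scheduler state, which may partly override a learning rate changed only in TrainingArguments):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-language-model",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="epoch",
    learning_rate=0.01,                   # the new learning rate
)

trainer = Trainer(
    model=gpt2_model,                     # same model/data objects as above
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# resume_from_checkpoint reloads the model weights plus optimizer/scheduler state
trainer.train(resume_from_checkpoint="./gpt2-language-model/checkpoint-latest")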
PyTorch provides several methods to adjust the learning rate through torch.optim.lr_scheduler.
Check the docs for usage https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
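For example, a minimal sketch using one of those schedulers (StepLR here; the linear model and SGD optimizer are only placeholders for illustration):

import torch

model = torch.nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.00004)     # placeholder optimizer

# Multiply the learning rate by gamma every step_size epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(30):
    # ... run one training epoch here ...
    optimizer.step()                    # placeholder optimizer step
    scheduler.step()                    # advance the schedule once per epoch
    print(epoch, scheduler.get_last_lr())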

The last step of each epoch takes too long

I'm using Keras. When I run model.fit_generator(...), it takes about 1.5 seconds per step, but the last step takes a few minutes.
Epoch 1/50
30/31 [============================>.] - ETA: 0s - loss: 2.0676 - acc: 0.2010
Why?
This happens because you are giving validation data to Keras, through a parameter in model.fit or model.fit_generator.
After each epoch, Keras takes the validation data and evaluates the model on it, which means one forward pass for every validation data point. This can take a lot of time and make it seem that Keras is stuck, but it is a necessary part of training with validation data.
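As a hedged sketch of where that extra time comes from and how to limit it (the generator names are placeholders; in recent tf.keras, fit accepts generators directly, and validation_steps/validation_freq control how much validation work is done per epoch):

# train_gen and val_gen are placeholder data generators
model.fit(
    train_gen,
    epochs=50,
    validation_data=val_gen,    # this evaluation runs at the end of every epoch
    validation_steps=50,        # cap the number of validation batches evaluated
    validation_freq=2,          # optionally validate only every 2nd epoch
)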
I faced this issue while training a CNN and found that decreasing the image dimensions speeds up the training. The processing time is reduced because of the smaller input dimension during both the forward pass and backpropagation (while updating weights). If, for example, you are using a CNN for image classification, images of size 64x64 are processed much faster than images of size 256x256, though obviously at the cost of losing information due to the lower resolution.

caffe loss does not seem to decrease

Some of my parameters
base_lr: 0.04
max_iter: 170000
lr_policy: "poly"
batch_size = 8
iter_size =16
This is how the training process looks so far:
The loss here seems stagnant; is there a problem, or is this normal?
The solution for me was to lower the base learning rate by a factor of 10 before resuming training from a solverstate snapshot.
To achieve the same effect automatically, you can set lr_policy to "step" together with the "gamma" and "stepsize" parameters in your solver.prototxt (gamma and stepsize only take effect with the "step" policy, not with "poly"):
base_lr: 0.04
lr_policy: "step"
stepsize: 10000
gamma: 0.1
max_iter: 170000
batch_size = 8
iter_size = 16
This will reduce your base_lr by a factor of 10 every 10,000 iterations.
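A quick sketch of the resulting schedule under the "step" policy, where the effective learning rate is base_lr * gamma^floor(iter / stepsize):

base_lr, gamma, stepsize = 0.04, 0.1, 10000

def step_lr(iteration):
    # Caffe "step" policy: base_lr * gamma ^ floor(iteration / stepsize)
    return base_lr * gamma ** (iteration // stepsize)

for it in (0, 10000, 20000, 30000):
    print(it, step_lr(it))    # 0.04, 0.004, 0.0004, 4e-05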
Please note that it is normal for the loss to fluctuate, and even to hover around a constant value before making a dip. This could be the cause of your issue; I would suggest training well beyond 1800 iterations before falling back on the above change. Look at graphs of Caffe training-loss logs for comparison.
Additionally, please direct all future questions to the caffe mailing group. This serves as a central location for all caffe questions and solutions.
I struggled with this myself and didn't find solutions anywhere before I figured it out. Hope what worked for me will work for you!