Understanding torch.optim.SGD documentation and momentum calculations - deep-learning

So I was looking at the SGD documentation for PyTorch's SGD optimizer, specifically the part of the documentation covering momentum. However, I don't understand what $b_t$ means.
Could it maybe be the bias term, or is $b_t$ something else that I need to calculate?
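For reference, a minimal sketch of plain SGD with momentum, assuming (as the docs' pseudo-code suggests) that $b_t$ denotes the momentum buffer rather than a bias term; dampening and Nesterov are omitted, and the names here are purely illustrative:

def sgd_momentum_step(param, grad, buffer, lr=0.01, mu=0.9):
    # buffer plays the role of b_t: a running accumulation of past gradients
    buffer = mu * buffer + grad      # b_t = mu * b_{t-1} + g_t
    param = param - lr * buffer      # theta_t = theta_{t-1} - lr * b_t
    return param, buffer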

Related

rank-ordered logit + IIA

Can anyone point me to a source for the use of IIA in rank-ordered logit, and why this results in independence? It seems to me that the conventional formulation of IIA in logit models won't work here because, e.g., Prob[first choice]/Prob[second choice] under the logit formulation will have different sums in the denominators, and hence won't cancel, as they do in "ordinary" logit.
Thanks!
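For orientation, a sketch of the usual rank-ordered (exploded) logit formulation, which is where IIA enters: the probability of observing the full ranking $r_1 \succ r_2 \succ \cdots \succ r_J$ is modelled as $\prod_{j=1}^{J-1} \exp(V_{r_j}) / \sum_{k=j}^{J} \exp(V_{r_k})$, i.e. a product of standard logit choices over successively shrinking choice sets. Within each stage the denominators do cancel: conditional on the alternatives still available at stage $j$, the odds of picking $i$ over $l$ are $\exp(V_i)/\exp(V_l)$. So IIA is usually stated per stage, conditional on the remaining set, rather than as a ratio of the unconditional first- and second-choice probabilities.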

Is GradScaler necessary for mixed precision training with PyTorch?

So, going through the AMP: Automatic Mixed Precision Training tutorial for normal networks, I found out that there are two components, autocast and GradScaler. I just want to know whether it's advisable/necessary to use GradScaler during training, because the documentation says:
Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()   # backprop on the scaled loss
        scaler.step(opt)                # unscales gradients, skips the step if they contain inf/NaN
        scaler.update()                 # adjusts the scale factor for the next iteration
        opt.zero_grad()
Also, looking at the NVIDIA Apex documentation for PyTorch, they use it as follows:
from apex import amp

model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
I think this is what GradScaler does too, so I suspect it is a must. Can someone help me with this question?
Short answer: yes, your model may fail to converge without GradScaler().
There are three basic problems with using FP16:
Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
Vanishing gradients: with half precision, anything below roughly $2^{-14}$ (about 6e-5) rounds toward zero, whereas single precision can represent values down to roughly $2^{-126}$ (about 1e-38). GradScaler() takes care of this one.
Exploding loss: similarly, overflow is much more likely with half precision. This is also managed by the autocast() context.
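A quick way to see both effects on the CPU (illustrative values only, a minimal sketch rather than anything from the tutorial):

import torch

half = torch.float16

# Weight updates: fp16 has only ~3 decimal digits, so small updates vanish.
w = torch.tensor(1.0, dtype=half)
print(w + torch.tensor(1e-4, dtype=half))   # prints 1.0 -- the update is lost

# Vanishing gradients: values below fp16's tiny range flush to zero...
g = torch.tensor(1e-8)
print(g.to(half))                           # prints 0.0

# ...unless the loss (and therefore the gradients) is scaled up first,
# which is what GradScaler automates (it divides the scale factor back
# out before the optimizer step).
scale = 2.0 ** 16
print((g * scale).to(half))                 # roughly 6.55e-4, no longer zero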

Numerical Methods - Checking Domain of Phi(x)

I have just started an MSc in Scientific Computing, and, being an engineer, my knowledge of real analysis is somewhat limited.
When rewriting f(x) = 0 as a fixed-point formulation Phi(x) = x, it is stressed that we must check that, for x in the interval [a,b], Phi(x) maps back into the same interval.
Is there a general real-analysis method of checking this, using the Mean Value Theorem for example, or do I need to use the simpler calculus method of checking the minimum and maximum values of Phi(x)? The simpler calculus method doesn't seem satisfactory or formal enough in a real-analysis sense.
Any assistance would be appreciated.
Kind regards
John
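For what it's worth, a short sketch of how the calculus check is usually made rigorous (standard real-analysis facts, not specific to any one course): if Phi is continuous on [a, b], the extreme value theorem guarantees it attains a minimum m and a maximum M there, and since the continuous image of an interval is an interval, Phi([a, b]) = [m, M]. Verifying a <= m and M <= b (e.g. by finding the critical points of Phi and checking the endpoints) is therefore a fully formal argument, not just a calculus shortcut. The Mean Value Theorem usually enters at the next step: if |Phi'(x)| <= L < 1 on [a, b], then |Phi(x) - Phi(y)| <= L |x - y|, which together with the mapping condition gives convergence of the fixed-point iteration (Banach fixed-point theorem), but it does not by itself replace the check that Phi maps [a, b] into itself.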

Ray Tune: How do schedulers and search algorithms interact?

It seems to me that the natural way to integrate Hyperband with a Bayesian optimization search is to have the search algorithm determine each bracket and have the Hyperband scheduler run the bracket. That is to say, the Bayesian optimization search runs only once per bracket. Looking at Tune's source code for this, it's not clear to me whether the Tune library applies this strategy or not.
In particular, I want to know how the Tune library handles the hand-off between the search algorithm and the trial scheduler. For instance, how does this work if I use SkOptSearch and AsyncHyperBandScheduler (or HyperBandScheduler) together as follows:
sk_search = SkOptSearch(
    optimizer,
    ['group', 'dimensions', 'normalize', 'sampling_weights', 'batch_size', 'lr_adam', 'loss_weight'],
    max_concurrent=4,
    reward_attr="neg_loss",
    points_to_evaluate=current_params)

hyperband = AsyncHyperBandScheduler(
    time_attr="training_iteration",
    reward_attr="neg_loss",
    max_t=50,
    grace_period=5,
    reduction_factor=2,
    brackets=5)

run(Trainable_Dense,
    name='hp_search_0',
    stop={"training_iteration": 9999,
          "neg_loss": -0.2},
    num_samples=75,
    resources_per_trial={'cpu': 4, 'gpu': 1},
    local_dir='./tune_save',
    checkpoint_freq=5,
    search_alg=sk_search,
    scheduler=hyperband,
    verbose=2,
    resume=False,
    reuse_actors=True)
Based on the source code linked above and the source code here, it seems to me that sk_search would return groups of up to 4 trials at a time, but hyperband should be querying the sk_search algorithm for N_sizeofbracket trials at a time.
There is now a Bayesian Optimization HyperBand implementation in Tune - https://ray.readthedocs.io/en/latest/tune-searchalg.html#bohb.
For standard search algorithms and schedulers, the search algorithm currently only sees the result of a trial if it is completed.
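A rough sketch of wiring up that BOHB combination (not from the thread; it assumes the ConfigSpace package and a Ray version that ships TuneBOHB and HyperBandForBOHB, and argument names may differ between releases):

import ConfigSpace as CS
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

# Hypothetical search space with a single hyperparameter for illustration.
config_space = CS.ConfigurationSpace()
config_space.add_hyperparameter(
    CS.UniformFloatHyperparameter("lr_adam", lower=1e-5, upper=1e-2))

algo = TuneBOHB(config_space, max_concurrent=4, metric="neg_loss", mode="max")
bohb = HyperBandForBOHB(
    time_attr="training_iteration", metric="neg_loss", mode="max", max_t=50)

tune.run(Trainable_Dense, search_alg=algo, scheduler=bohb, num_samples=75)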

Is it possible to implement a loss function that prioritizes the correct answer being in the top k probabilities?

I am working on a multi-class image recognition problem. The task is to have the correct answer appear in the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes having the correct answer in the top K and doesn't penalize much within that top K.
This can be achieved by a class-weighted cross-entropy loss, which essentially assigns a weight to the errors associated with each class. This kind of loss is used in research; see, e.g., the paper "Multi-task Learning and Weighted Cross-entropy for DNN-based Keyword Spotting" by S. Panchapagesan et al. Before computing the cross-entropy, you can check whether the predicted distribution satisfies your condition (e.g., the ground-truth class is among the top-k predicted classes) and, if it does, assign a zero (or near-zero) weight accordingly.
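A minimal PyTorch sketch of that idea (the function name and the choice of down-weighting factor are just for illustration):

import torch
import torch.nn.functional as F

def topk_weighted_cross_entropy(logits, target, k=3, in_topk_weight=0.0):
    # Per-sample cross-entropy, then down-weight samples whose true class
    # is already among the top-k predictions.
    ce = F.cross_entropy(logits, target, reduction="none")      # shape (N,)
    topk = logits.topk(k, dim=1).indices                        # shape (N, k)
    in_topk = (topk == target.unsqueeze(1)).any(dim=1)          # shape (N,)
    weights = torch.where(in_topk,
                          torch.full_like(ce, in_topk_weight),
                          torch.ones_like(ce))
    return (weights * ce).mean()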
There are open questions though: when the correct class is in the top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in the top 3 -- is this prediction good enough or not? Should you care what exactly the k-1 incorrect classes are?
All these questions arise because this kind of loss doesn't have a probabilistic interpretation, unlike standard cross-entropy. That's why I wouldn't recommend using it in practice, but would reformulate the goal instead.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of one-hot targets (note that this fully agrees with the probabilistic interpretation). This will force the network to learn features associated with all three good classes and generally lead to the same goal, but with better control over the target classes.
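For example, cross-entropy against soft targets can be computed directly from log-probabilities (a small sketch with made-up class indices):

import torch
import torch.nn.functional as F

num_classes = 10
batch_logits = torch.randn(4, num_classes)

# Hypothetical batch where classes 2, 5 and 7 are all equally acceptable.
soft_targets = torch.zeros(4, num_classes)
soft_targets[:, [2, 5, 7]] = 1.0 / 3

log_probs = F.log_softmax(batch_logits, dim=1)
loss = -(soft_targets * log_probs).sum(dim=1).mean()   # cross-entropy with soft labels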