Do loss_weight=N and diff*N do the same thing in pycaffe? - deep-learning

I tried these two methods with pycaffe:
loss_weight=100 in the prototxt;
net.blobs['fc'].diff[...] = A_loss + 100*B_loss.
I thought they would do the same thing according to backpropagation theory, but the model loss shows the opposite.
What is the difference between these two methods? And how should I handle loss weights when there are multiple losses?
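For comparison, a minimal sketch of the two approaches (the file name, the layer/blob name 'fc', and the grad_* arrays are hypothetical). Note that a blob's diff holds per-element gradients, not scalar loss values, so overwriting it with a sum of losses is not equivalent to setting loss_weight:
import caffe

net = caffe.Net('model.prototxt', caffe.TRAIN)  # hypothetical path
net.forward()

# Method 1: declared in the prototxt and applied inside net.backward():
#   layer { ... loss_weight: 100 }
net.backward()

# Method 2: seed the diff by hand, then backpropagate from that layer.
# The diff must contain the gradients dLoss/d(fc), e.g. grad_A and grad_B,
# not the scalar loss values themselves.
net.blobs['fc'].diff[...] = grad_A + 100 * grad_B
net.backward(start='fc')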

Related

StableBaselines3 - Why does calling "model.learn(50,000)" twice not give same result as calling "model.learn(100,000)" once?

I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
Does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because in both cases I evaluated the model on a test set in a for-loop, and the performance was different. Given that I set deterministic=True in the loop and didn't change the seed, the different performance must mean the networks are different, which means the training process was different.
I was under the impression that if I run model.learn() on an existing model, it would just pick up the training where it was previously left off, but I guess that's incorrect.
Can someone help me understand why those two situations deliver different results?
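One plausible explanation (an assumption on my part, not confirmed in this thread): by default, each learn() call in StableBaselines3 resets the internal timestep counter (reset_num_timesteps=True), which restarts anything driven by it, such as learning-rate or clip-range schedules. To make the second call continue where the first left off, pass reset_num_timesteps=False:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000, reset_num_timesteps=False)  # continue counters and schedules
Even with this flag, bit-exact equality with a single 100,000-step run is not guaranteed, since RNG and environment state at the boundary between the two calls can still differ.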

Trade off between losses?

I have been working on a super-resolution task, and I have a question about choosing the loss function. For this task I went with SSIM as the loss to train my model, and I got a good set of results. Recently I came across the perceptual loss, where we compare how a pretrained model "sees" the ground-truth (GT) image versus the super-resolved (SR) image generated by the model. My question: I am thinking of using the combined loss (1 - SSIM(SR, GT)) + PerceptualLoss(SR, GT) for backpropagation. Should I use a trade-off parameter between these two losses? If so, how can I set it? Or should I just add the losses with equal weights?
PS: the perceptual loss is calculated by taking SSIM between the feature maps of the GT and SR images from the pre-trained model.
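A common pattern (a sketch assuming PyTorch tensors and hypothetical ssim/perceptual helpers) is to introduce a scalar trade-off weight and tune it on a validation set, for instance so that both terms have comparable magnitude early in training:
lambda_p = 0.1  # trade-off weight; tune on a validation set
loss = (1 - ssim(sr, gt)) + lambda_p * perceptual(sr, gt)
loss.backward()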

How to train two pytorch networks with different inputs together?

I'm totally new to pytorch, so it might be a very basic question. I have two networks that should be trained together.
First one takes data as input and returns its embedding as output.
Second one takes pairs of embedded datapoints and returns their 'similarity' as output.
Partial loss is then computed for every datapoint, and then all the losses are combined.
This final loss should be backpropagated through both networks.
What should the code for that look like? I'm thinking something like this:
def train_models(inputs, targets):
    network1.train()
    network2.train()
    embeddings = network1(inputs)
    paired_embeddings = pair_embeddings(embeddings)
    similarities = network2(paired_embeddings)
    """
    I don't know how the loss should be calculated here.
    I have a loss formula for every embedded datapoint,
    but not for every similarity.
    But if I only calculate loss for every embedding (using similarities),
    won't backpropagate() only modify network1,
    since embeddings are network1's outputs
    and have not been modified in network2?
    """
    optimizer1.step()
    optimizer2.step()
    scheduler1.step()
    scheduler2.step()
    network1.eval()
    network2.eval()
I hope this is specific enough. I'll gladly share more details if necessary. I'm just so inexperienced with pytorch and deep learning in general that I'm not even sure how to ask this question.
You can use a single optimizer for this purpose, and even pass a different learning rate for each network.
import torch.optim as optim

optimizer = optim.Adam([
    {'params': network1.parameters()},              # uses the default lr (1e-4)
    {'params': network2.parameters(), 'lr': 1e-3},  # per-group override
], lr=1e-4)
# ...
optimizer.zero_grad()  # clear gradients from the previous step
loss = loss1 + loss2   # the combined loss reaches both networks
loss.backward()
optimizer.step()
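To address the worry in the question's docstring: because the similarities are computed from the embeddings, autograd's graph connects network2's output back through network1's parameters, so a single backward() on the final loss produces gradients for both networks. A minimal sketch of one training step (criterion stands for whatever differentiable loss you choose; pair_embeddings is the asker's hypothetical helper):
embeddings = network1(inputs)
paired_embeddings = pair_embeddings(embeddings)
similarities = network2(paired_embeddings)
loss = criterion(similarities, targets)  # any differentiable loss works
optimizer.zero_grad()
loss.backward()   # gradients flow through network2 back into network1
optimizer.step()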

How to perform multi labeling classification (for CNN)?

I am currently looking into multi-label classification and I have some questions (I couldn't find clear answers).
For the sake of clarity, let's take an example: I want to classify images of vehicles by type (car, bus, truck, ...) and by make (Audi, Volkswagen, Ferrari, ...).
So I thought about training two independent CNNs (one for the "type" classification and one for the "make" classification), but I thought it might be possible to train only one CNN on all the classes.
I read that people tend to use a sigmoid function instead of softmax to do that. I understand that sigmoid outputs do not sum to 1 like softmax outputs do, but I don't understand how that enables multi-label classification.
My second question is: is it possible to take into account that some classes are completely independent?
Thirdly, in terms of performance (accuracy and time to classify a new image), isn't training two independent networks better?
Thank you to anyone who can give me some answers or ideas :)
Softmax is a special output function; it forces the output vector to have a single large value. Now, training neural networks works by calculating an output vector, comparing that to a target vector, and back-propagating the error. There's no reason to restrict your target vector to a single large value, and for multi-labeling you'd use a 1.0 target for every label that applies. But in that case, using a softmax for the output layer will cause unintended differences between output and target, differences that are then back-propagated.
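As a concrete illustration (a PyTorch sketch; the shapes and label layout are made up for the example), a multi-label head uses an independent sigmoid per class with a binary cross-entropy loss, so several outputs can be 1.0 at once:
import torch
import torch.nn as nn

num_labels = 6  # e.g. 3 vehicle types + 3 makes flattened into one target vector
logits = torch.randn(2, num_labels, requires_grad=True)  # stand-in for CNN outputs
targets = torch.tensor([[1., 0., 0., 0., 1., 0.],
                        [0., 1., 0., 1., 0., 0.]])  # 1.0 for every label that applies
loss = nn.BCEWithLogitsLoss()(logits, targets)  # per-label sigmoid + binary CE
loss.backward()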
For the second part: you define the target vectors; you can encode any sort of dependency you like there.
Finally, no - a combined network performs better than the two halves would do independently. You'd only run two networks in parallel when there's a difference in network layout, e.g. a regular NN and CNN in parallel might be viable.

Best technique to approximate a 32-bit function using machine learning?

I was wondering which machine learning technique is best for approximating a function that takes a 32-bit number and returns another 32-bit number, from a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth a look, though you'll need to scale the inputs to floating-point numbers between 0 and 1, and then map the outputs back to the original range.
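A sketch of that scaling step (assuming unsigned 32-bit values; the network itself is omitted):
U32_MAX = 2**32 - 1

def encode(x):
    # 32-bit integer -> float in [0, 1] for the network input
    return x / U32_MAX

def decode(y):
    # network output in [0, 1] -> 32-bit integer, clamped to the valid range
    return int(round(min(max(y, 0.0), 1.0) * U32_MAX))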
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with least-squares method
In this case, you approximate the hypothesis y = ax + b with the least-squares method. It is really easy to implement, but sometimes a linear model is not good enough to fit your data. Still, I would give this one a try first.
The good thing is that there is a closed form, so you can calculate the parameters a and b directly from your data.
See Least Squares
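For example, numpy gives you the closed-form fit directly (x and y here stand for your observation arrays):
import numpy as np

a, b = np.polyfit(x, y, deg=1)  # least-squares fit of y = a*x + b
y_hat = a * x + b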
2.) Fitting a non-linear model
Once you see that a linear model does not describe your function very well, you can try to fit higher-degree polynomial models to your data.
Your hypothesis might then look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can also use the least-squares method to fit your data, or iterative optimization techniques (gradient descent, simulated annealing, ...). See also this thread: Fitting polynomials to data
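The same numpy routine handles higher-degree fits (again, x and y are your observation arrays):
import numpy as np

coeffs = np.polyfit(x, y, deg=3)  # [a, b, c, d] for y = ax^3 + bx^2 + cx + d
y_hat = np.polyval(coeffs, x)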
Or, as in the other answer, try fitting a neural network. The good thing is that it will learn the hypothesis automatically, but it is not so easy to explain the relation between input and output. In the end, though, a neural network is also a linear combination of nonlinear functions (like sigmoid or tanh).