Does scaling the values to [0,1] affect the CNN learning procedure? - caffe

I am working on semantic segmentation using CNNs.
I have normalized the values of images to range [0,1]. I have trained my network many times, learning curve seems is learning well, however, the output is always black image. My question is does really scaling affects the learning or should the range of pixel values remains in 0-255 range?
Thanks a lot.

In general, it doesn't affect the training. In many situations, [0 1] is actually better than [0 255] for training a DNN since the latter may cause some numerical issues in learning the first layer weights if you don't use lower learning rates in the first layer since w' = y' * x, where ' denotes derivative.

Related

Bayesian models for CNN using Pyro & Pytorch

I am applying a Bayesian model for a CNN that has many layers (more than 3), using Stochastic Variational Inference in Pyro Package.
However after defining the NN, Model and Guide functions and running the training loop I found that the loss stops decreasing on loss ~8000 (which is extremely high). I tried different learning rates and different optimization functions but non of them reaches a loss lower than 8000.
At last I changed the autoguide function (I tried AutoNormal, AutoGaussian, AutoBeta) they all stopped dicreasing the loss at the same point.
enter image description here
The last thing I did is trying the AutoMultivatiateNormal and for this it reached negative values (it reached -1) but when I looked at the weight matrices I found that they are all turned into scalers!!
The following graph represent the loss pattern for the AutoMultivatiateNormal
enter image description here
Do anyone knows how to solve such problem?? and why is this happening??

Most wierd loss function shape (because of weight decay parameter)

I am training a large neural network model (1 module Hourglass) for a facial landmark recognition task. Database used for training is WFLW.
Loss function used is MSELoss() between the predicted output heatmaps, and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I have launched a basic experiment with the parameters shown above. I previously had executed a model with the same exact parameters, but with weight-decay=0. The model converged successfully. Thus, the problem is with the weight-decay new value.
I was expecting to observe a smooth loss function that slowly decreased. As it can be observed in the image below, the loss function has a very very wierd shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate if someone could provide a more in-depth explanation into the strange shape of this loss function, and its relation with the weight-decay parameter.
In addition, to explain why this premature convergence into a very specific value of 0.000415 with a very narrow standard deviation? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. It does not matter if you use weight decay or not, there is either a bug in your code (e.g. worth checking what happens with normal gradient descent, not Adam, as there are ways in which one can wrongly implement weight decay with Adam), or your learning rate is too large.

Is image rescaling between 0-255 needed for transfer learning

I am working on a classification task using transfer learning. I am using ResNet50 and weights from ImageNet.
My_model = (ResNet50( include_top=False, weights='imagenet', input_tensor=None,
input_shape=(img_height, img_width, 3),pooling=None))
I didn't rescale my input images between 0-255 but my result is quite good (acc: 93.25%). So my question is do I need to rescale images between 0-255? Do you think my result is wrong without rescaling between 0-255?
Thank you.
No basically your result is not wrong. to give a clue on that, we standardize the pixels values to a range between (0 and 1) just to avoid resulting big values during the calculus in the forward propagation z = w*x + b and then the backward propagation.
Why we do that ?
I develop, the optimization algorithm is definitely dependent on the result of the backward prop, so when we start updating our optimization algo with big values of weights/bias, then we need then a lot of epochs to reach the global minimum.

PyTorch find keypoints: output nodes to be in a range and negative loss

I am beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes to be in range [-1, 1] (range of normalized 2D points)?
Another problem is when I train for more than 1 epoch the loss gets negative values
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here u can find my repo
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting key-points in an image is more of a regression problem than a classification problem (especially if you're making your model outputs + targets fall within a continuous interval). Therefore, I suggest you use the L2 Loss.
In fact, it could be a good exercise for you to determine which loss function that is appropriate for regression problems provides the lowest expected generalization error using cross-validation. There's several such functions available in PyTorch.
One way I can think of is to use torch.nn.Sigmoid which produces outputs in [0,1] range and scale outputs to [-1,1] using 2*x-1 transformation.

Determining the values of the filter matrices in a CNN

I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNN's) needs to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3 is used.
But, how are the values for these 32x3x3 matrices are determined?
It's not the gradients that are adjusted, the gradient calculated with the backpropagation algorithm is just the group of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductive guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. Same goes for weights in a fully connected layer, or any other layer with weights.
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, ie. you'd like to minimize C as much as possible. Imagine you want to find the minimum value for simple function like f(x)=x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (ie, the derivative) and move down that direction, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (ie. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (ie. you place yourself on a random position on the plot for the cost function C), then compute the gradients, then "move downhill", ie. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states ,"The weights in the convolution layer in your example will be initialized to random values". So if they are random and say I want 10 filters. Every execution algorithm could find different filter. Also say I have Mnist data set. Numbers are formed of edges and curves. Is it guaranteed that there will be a edge filter or curve filter in 10?
I mean is first 10 filters most meaningful most distinctive filters we can find.
best