Backpropagation on Two-Layer Networks - deep-learning

I have been following Stanford's CS231n lectures and trying to complete the assignments on my own, sharing these solutions both on GitHub and my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is the model below:

[Image: two-layer neural network]
Let's assume that our loss function here is the softmax loss. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I follow the graph backwards, say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What is the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?

The softmax function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev, and dL/db (or simply db) is dsoftmax/dx * 1. Note that dsoftmax/dx here is the dSoft we defined earlier.
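To make the chain rule concrete, here is a minimal NumPy sketch of the backward pass through the last affine layer. The shapes and the random dSoft are illustrative stand-ins, not taken from the question:

import numpy as np

# Toy shapes: N samples, H hidden units, C classes.
N, H, C = 4, 5, 3
X2 = np.random.randn(N, H)    # outputs of the ReLU gate
W2 = np.random.randn(H, C)
b2 = np.random.randn(C)

scores = X2 @ W2 + b2         # forward pass of the last affine layer

# dSoft = dL/dY, as a modular softmax_loss() would return it;
# here just a random stand-in.
dSoft = np.random.randn(N, C)

# Chain rule through the affine layer:
dW2 = X2.T @ dSoft            # dL/dW2 = X2^T * dSoft
db2 = dSoft.sum(axis=0)       # dL/db2 = dSoft * 1, summed over the batch
dX2 = dSoft @ W2.T            # dL/dX2, handed to the ReLU backward pass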

Related

How to calculate backpropagation offline

I want to learn how to backpropagate a neural net offline, so I came up with the following example:

[Image: example network, where the constant input 1 stands for the bias; the activation function in the last layer is linear]
My intuitive problem
I want to calculate backpropagation offline, but I'm not sure how it should be done. I understand the intuition behind online backprop, where we calculate the gradient observation by observation, but I have no idea how it should work offline, i.e. with all observations calculated at once. Could you please give me a hint on which direction I should follow?
So you are saying that you know how to do backpropagation on a single sample, but not on many samples at the same time, right? Then let's just assume that you have some loss function, e.g. the mean squared error. For a single sample (x, label), your loss would be (y - label)^2. Since you said you know how to do backpropagation on a single sample, you know that we now take the gradient of the loss with respect to our last node, y. This gradient is 2 * (y - label). From there on, we propagate the gradient through the whole network.
If we now have a batch of samples instead, we just use the loss function for the whole batch. Let's say our batch has two samples. Then the mean squared error is just the sum of the individual losses divided by the number of samples, so the loss is 1/2 * ((y_1 - label_1)^2 + (y_2 - label_2)^2). And now we can do the same as before: we take the gradient of the loss with respect to our last node, y. It is important to realize that both y_1 and y_2 in our loss are actually the output of the final node y, just for different samples. This means that the gradient is 1/2 * (2*(y_1 - label_1) + 2*(y_2 - label_2)). From here you have a single gradient value for the whole batch (if you plug in y_1, y_2, label_1, and label_2) and can propagate the gradient through the network as before.
As a summary: instead of calculating the loss for a single sample, we now use a loss function that includes the whole batch (e.g. by summing over all samples). This produces a single gradient, and we can proceed as before.
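As a minimal NumPy sketch of this idea (the numbers are made up):

import numpy as np

# Toy batch of two samples: network outputs y and true labels.
y = np.array([0.9, 0.3])
labels = np.array([1.0, 0.0])

# Batch loss: mean squared error over the whole batch.
loss = np.mean((y - labels) ** 2)   # 1/2 * ((y_1 - l_1)^2 + (y_2 - l_2)^2)

# Gradient of the batch loss with respect to each output y_i.
dL_dy = 2 * (y - labels) / len(y)

# dL_dy is then propagated backwards exactly as in the single-sample
# case, one row per sample; the per-weight gradients simply add up.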

PyTorch find keypoints: output nodes to be in a range and negative loss

I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes be in the range [-1, 1] (the range of normalized 2D points)?
Another problem is that when I train for more than 1 epoch, the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo:
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting keypoints in an image is more of a regression problem than a classification problem (especially if your model's outputs and targets fall within a continuous interval), so I suggest you use the L2 loss. This also explains the negative loss values: MultiLabelSoftMarginLoss is a classification criterion that expects targets in [0, 1], so feeding it regression targets in [-1, 1] can push the loss below zero.
In fact, it could be a good exercise for you to determine, using cross-validation, which of the loss functions appropriate for regression problems gives the lowest expected generalization error. There are several such functions available in PyTorch.
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and scale the outputs to [-1, 1] using the transformation 2*x - 1.
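Putting the suggestions together, here is a minimal, hedged PyTorch sketch: a Tanh on the output layer and an L2 (MSE) criterion instead of MultiLabelSoftMarginLoss. The head, the feature size, and the number of keypoints are hypothetical:

import torch
import torch.nn as nn

# Hypothetical head: maps backbone features to 21 keypoints * 2 coordinates.
class KeypointHead(nn.Module):
    def __init__(self, in_features=512, n_keypoints=21):
        super().__init__()
        self.fc = nn.Linear(in_features, n_keypoints * 2)
        self.act = nn.Tanh()   # squashes outputs into [-1, 1]

    def forward(self, x):
        return self.act(self.fc(x))

head = KeypointHead()
criterion = nn.MSELoss()       # L2 loss, suited to regression targets

features = torch.randn(8, 512)        # stand-in for backbone features
targets = torch.rand(8, 42) * 2 - 1   # normalized keypoints in [-1, 1]

loss = criterion(head(features), targets)   # never negative
loss.backward()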

In DQN, how to perform gradient descent when each record in the experience buffer corresponds to only one action?

The DQN algorithm is shown below:

[Image: DQN algorithm pseudocode] (Source)
At the gradient descent line, there's something I don't quite understand.
For example, if I have 8 actions, then the output Q is a vector of 8 components, right?
But for each record in D, the return y_i is only a scalar with respect to a given action. How can I perform gradient descent on (y_i - Q)^2? I don't think it's guaranteed that a minibatch contains the returns of all actions for a state.
You need to calculate the loss only on the Q-value whose action was selected. In your example, assume that for a given row in your mini-batch the selected action is 3. Then you obtain the corresponding target, y_3, and the loss is (Q(s,3) - y_3)^2; you basically set the loss value of the other actions to zero. You can implement this by using gather_nd in TensorFlow, or by obtaining the one-hot encoding of the actions and multiplying that one-hot vector with the Q-value vector. Using a one-hot vector you can write:
action_input = tf.placeholder("float",[None,action_len])
QValue_batch = tf.reduce_sum(tf.multiply(T_Q_value,action_input), reduction_indices = 1)
in which action_input = np.eye(nb_classes)[your_action] (e.g. your_action = 3). The same procedure can be followed with gather_nd:
https://www.tensorflow.org/api_docs/python/tf/gather_nd
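The same selection can also be written in PyTorch with gather; a minimal sketch with illustrative shapes and names:

import torch

# Minibatch of 4 transitions, 8 possible actions.
q_values = torch.randn(4, 8, requires_grad=True)  # Q(s, .) from the network
actions = torch.tensor([3, 0, 7, 2])              # action taken in each transition
targets = torch.randn(4)                          # the scalar targets y_i

# Pick Q(s, a) for the action actually taken (same effect as the one-hot trick).
q_selected = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

# Gradients flow only through the selected entries; the other actions
# contribute nothing, i.e. their loss is effectively zero.
loss = torch.mean((q_selected - targets) ** 2)
loss.backward()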
I hope this resolves your confusion.

Determining the values of the filter matrices in a CNN

I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) need to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3 in use.
But how are the values of these 32 3x3 matrices determined?
It's not the gradients that are adjusted. The gradient calculated with the backpropagation algorithm is just the group of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method) and then tweaked during training, using the gradient at each iteration to adjust each individual weight. The same goes for the weights in a fully connected layer, or any other layer with weights.
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, i.e. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x) = x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (i.e. the derivative), and move down that direction until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (i.e. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
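As a toy run of that update rule on the f(x) = x^2 example from above (plain Python, values made up):

# Gradient descent on f(x) = x^2, whose derivative is 2x.
x = 3.0                        # start at some random-ish point
learning_rate = 0.1

for step in range(50):
    grad = 2 * x               # slope of f at the current point
    x -= learning_rate * grad  # move a bit down the slope

print(x)   # ends up very close to 0, the minimum of f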
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (i.e. you place yourself at a random position on the plot of the cost function C), then compute the gradients, then "move downhill", i.e. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values." So if they are random and, say, I want 10 filters, every run of the algorithm could find different filters. Also, say I have the MNIST dataset; the digits are formed of edges and curves. Is it guaranteed that an edge filter or a curve filter will be among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
Best

How to find a function that fits a given set of data points in Julia?

So, I have a vector that corresponds to a given feature (same dimensionality). Is there a package in Julia that would provide a mathematical function that fits these data points, in relation to the original feature? In other words, I have x and y (both vectors) and need to find a decent mapping between the two, even if it's a highly complex one. The output of this process should be a symbolic formula that connects x and y, e.g. (:x)^3 + log(:x) - 4.2454. It's fine if it's just a polynomial approximation.
I imagine this is a walk in the park if you employ Genetic Programming, but I'd rather opt for a simpler (and faster) approach, if it's available. Thanks
It turns out the Polynomials.jl package includes the function polyfit, which does Lagrange interpolation. A usage example would go:
using Polynomials # install with Pkg.add("Polynomials")
x = [1,2,3] # demo x
y = [10,12,4] # demo y
polyfit(x,y)
The last line returns:
Poly(-2.0 + 17.0x - 5.0x^2)
which evaluates to the correct values.
The polyfit function accepts a maximal degree for the output polynomial, but defaults to the length of the input vectors x and y minus 1. This is exactly the degree of the polynomial given by the Lagrange formula, and since a polynomial of degree n is uniquely determined by its values at n+1 points (a basic theorem), the result is the same Lagrange polynomial, and in fact the only polynomial of that degree passing through the given points.
Thanks to the developers of Polynomials.jl for leaving me to just google my way to an answer.
Take a look at MARS regression: Multivariate Adaptive Regression Splines.