In DQN, how to perform gradient descent when each record in the experience buffer corresponds to only one action? - reinforcement-learning

The DQN algorithm below:
[Figure: the DQN algorithm (image with "Source" link)]
At the gradient descent line, there's something I don't quite understand.
For example, if I have 8 actions, then the output Q is a vector of 8 components, right?
But for each record in D, the return y_i is only a scalar with respect to a given action. How can I perform gradient descent on (y_i - Q)^2? I don't think it's guaranteed that within a minibatch I have all actions' returns for a state.

You need to calculate the loss only on the Q-value whose action was selected. In your example, assume that for a given row in your mini-batch the selected action is 3. Then you obtain the corresponding target, y_3, and the loss is (Q(s,3) - y_3)^2; the loss contribution of all other actions is effectively set to zero. You can implement this with gather_nd in TensorFlow, or by building a one-hot encoding of the actions and multiplying that one-hot vector element-wise with the Q-value vector. Using a one-hot encoded vector you can write:
# one-hot encoded actions, shape [batch_size, action_len]
action_input = tf.placeholder(tf.float32, [None, action_len])
# mask out all but the selected action's Q-value, then sum over the action axis
QValue_batch = tf.reduce_sum(tf.multiply(T_Q_value, action_input), axis=1)
in which action_input is the one-hot encoding of the chosen action, e.g. np.eye(action_len)[action] for action = 3. The same result can be obtained with gather_nd:
https://www.tensorflow.org/api_docs/python/tf/gather_nd
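As a side note, here is a minimal PyTorch sketch of the same masking idea (the tiny network, batch shapes, and dummy targets below are illustrative assumptions, not code from the question):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_actions, batch_size = 8, 32
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))

# dummy minibatch: states, chosen actions, and precomputed targets y_i
states = torch.randn(batch_size, 4)
actions = torch.randint(0, n_actions, (batch_size,))
targets = torch.randn(batch_size)  # stand-in for r + gamma * max_a' Q_target(s', a')

q_all = q_net(states)                                        # [batch_size, n_actions]
q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # only the selected action's Q-value
loss = F.mse_loss(q_taken, targets)                          # mean of (y_i - Q(s_i, a_i))^2
loss.backward()                                              # gradients flow only through q_taken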
I hope this resolves your confusion.

Related

Using Softmax Activation function after calculating loss from BCEWithLogitsLoss (Binary Cross Entropy + Sigmoid activation)

I am going through a binary classification tutorial using PyTorch, and here the last layer of the network is torch.nn.Linear() with just one neuron (makes sense), which gives us a single output, as in pred = network(input_batch).
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than using the softmax first and then calculating the loss), which will apply the Softmax function to the output of the last layer to give us a probability. So after that, it'll calculate the binary cross entropy to minimize the loss.
loss=loss_fn(pred,true)
My concern is that after all this, the author used torch.round(torch.sigmoid(pred))
Why would that be? I mean, I know it'll get the prediction probabilities in the range [0, 1] and then round off the values with the default threshold of 0.5.
Isn't it better to use the sigmoid once after the last layer within the network rather than using a softmax and a sigmoid in two different places, given that it's a binary classification?
Wouldn't it be better to just
out = self.linear(batch_tensor)
return self.sigmoid(out)
and then calculate the BCE loss and use argmax() for checking accuracy?
I am just curious: would that be a valid strategy?
You seem to be thinking of the binary classification as a multi-class classification with two classes, but that is not quite correct when using the binary cross-entropy approach. Let's start by clarifying the goal of the binary classification before looking at any implementation details.
Technically, there are two classes, 0 and 1, but instead of considering them as two separate classes, you can see them as opposites of each other. For example, you want to classify whether a StackOverflow answer was helpful or not. The two classes would be "helpful" and "not helpful". Naturally, you would simply ask "Was the answer helpful?", the negative aspect is left off, and if that wasn't the case, you could deduce that it was "not helpful". (Remember, it's a binary case, there is no middle ground).
Therefore, your model only needs to predict a single class, but to avoid confusion with the actual two classes, that can be expressed as: The model predicts the probability that the positive case occurs. In context of the previous example: What is the probability that the StackOverflow answer was helpful?
Sigmoid gives you values in the range [0, 1], which are the probabilities. Now you need to decide when the model is confident enough for it to be positive by defining a threshold. To make it balanced, the threshold is 0.5, therefore as long as the probability is greater than 0.5 it is positive (class 1: "helpful") otherwise it's negative (class 0: "not helpful"), which is achieved by rounding (i.e. torch.round(torch.sigmoid(pred))).
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than using the softmax first and then calculating the loss), which will apply the Softmax function to the output of the last layer to give us a probability.
Isn't it better to use the sigmoid once after the last layer within the network rather than using a softmax and a sigmoid in two different places, given that it's a binary classification?
BCEWithLogitsLoss applies Sigmoid, not Softmax; there is no Softmax involved at all. From the nn.BCEWithLogitsLoss documentation:
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
By not applying Sigmoid in the model you get a more numerically stable version of the binary cross-entropy, but that means you have to apply the Sigmoid manually if you want to make an actual prediction outside of training.
[...] and use the argmax() for checking accuracy??
Again, you're thinking of the multi-class scenario. You only have a single output class, i.e. the output has size [batch_size, 1]. Taking the argmax of that will always give you 0, because that is the only available class.
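To make the whole flow concrete, here is a minimal sketch (made-up model, shapes, and data, not the tutorial's actual code) of training with BCEWithLogitsLoss and then applying sigmoid + round only at prediction time:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # single output neuron, no sigmoid inside the model
loss_fn = nn.BCEWithLogitsLoss()          # applies sigmoid internally (numerically stable)

x = torch.randn(16, 10)                   # dummy batch
y = torch.randint(0, 2, (16, 1)).float()  # binary targets 0/1

logits = model(x)                         # raw scores, shape [16, 1]
loss = loss_fn(logits, y)
loss.backward()

# prediction: apply sigmoid manually, then threshold at 0.5 by rounding
with torch.no_grad():
    preds = torch.round(torch.sigmoid(model(x)))   # 0.0 or 1.0 per sample
    accuracy = (preds == y).float().mean()         # no argmax needed for a single output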

What should be the loss function for classification problem in pytorch if sigmoid is used in the output layer

I am trying to implement a model for a binary classification problem. Up to now, I have been using the softmax function (at the output layer) together with the torch.NLLLoss function to calculate the loss. However, now I want to use the sigmoid function (instead of softmax) at the output layer. If I do that, should I also change the loss function (to BCELoss or binary_cross_entropy), or may I still use the torch.NLLLoss function?
If you use the sigmoid function, then you can only do binary classification; it's not possible to do multi-class classification that way. The reason is that the sigmoid function always returns a value in the range between 0 and 1, so, for instance, one can threshold the value at 0.5 and separate (or classify) inputs into two classes based on the obtained values.
Regarding the objective function, NLLLoss stands for Negative Log Likelihood Loss. It just learns the data distribution, so it's not a problem as long as that is what you're trying to achieve during training.
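For illustration, here is a rough sketch of the sigmoid + BCELoss option mentioned in the question (hypothetical model and dummy data, assuming nn.BCELoss rather than NLLLoss):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())  # sigmoid in the output layer
loss_fn = nn.BCELoss()                                  # expects probabilities in [0, 1]

x = torch.randn(8, 10)                                  # dummy inputs
y = torch.randint(0, 2, (8, 1)).float()                 # binary targets

probs = model(x)                    # probabilities, shape [8, 1]
loss = loss_fn(probs, y)            # binary cross-entropy on the sigmoid output
loss.backward()

# classify by thresholding the probability at 0.5
preds = (probs.detach() > 0.5).long()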

Backpropagation on Two Layered Networks

I have been following Stanford's cs231n lectures and trying to complete the assignments on my own, sharing my solutions both on GitHub and on my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is when I have the model below: [Figure: a two-layered neural network]
Let's assume that our loss function here is the softmax loss. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I'm going backwards, let's say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax (loss) function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to its input x. Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
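As a rough sketch of those two gradients in code (hypothetical shapes; dSoft is assumed to already contain dL/dscores for a batch):

import numpy as np

batch_size, hidden_dim, num_classes = 4, 5, 3
x_prev = np.random.randn(batch_size, hidden_dim)   # ReLU outputs feeding the last layer
dSoft = np.random.randn(batch_size, num_classes)   # dL/dscores from the softmax loss

# scores = x_prev @ W2 + b2, so by the chain rule:
dW2 = x_prev.T @ dSoft        # dL/dW2 = x_prev^T * dSoft, shape [hidden_dim, num_classes]
db2 = dSoft.sum(axis=0)       # dL/db2 = dSoft * 1, summed over the batch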

Making sense of soundMixer.computeSpectrum

All examples that I can find on the Internet just visualize the result array of the function computeSpectrum, but I am tasked with something else.
I generate a musical note, and by analyzing the result array I need to be able to say which note is playing. I figured out that I need to set the second parameter of the function call, FFTMode, to true, so that it returns sound frequencies. I thought it would then return only one non-zero value, which I could use to determine what note I generated with Math.sin, but that is not the case.
Can somebody suggest a way to accomplish this task? Using SoundMixer.computeSpectrum is a requirement because I am going to analyze more complex sounds later.
The FFT transforms your signal window into a set of sine-wave components at fixed bin frequencies, so unless 440 Hz is exactly one of them you will obtain more than just one non-zero value! For a single sine wave you would also obtain 2 frequencies due to aliasing. Here is an example: [Figure: FFT response of a sine wave at an exact bin frequency vs. nearby frequencies]
As you can see, for an exact bin frequency the FFT response is a single peak, but for nearby frequencies there are more peaks.
Due to the shape of the signal window you can obtain a continuous spectrum with peaks instead of discrete values.
The frequency of the i-th FFT sample is f(i) = i*samplerate/N, where i = {0, 1, 2, ..., (N/2)-1} is the sample index (the first one, i = 0, is the DC offset, so it is not a frequency) and N is the number of samples passed to the FFT.
So if you want to detect harmonics (multiples of a single fundamental frequency), set the samplerate and N so that samplerate/N equals that fundamental frequency or a divisor of it. That way you obtain just one peak per harmonic sine wave, which eases the computations.
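As a rough numeric illustration of that formula (plain NumPy, not the ActionScript computeSpectrum API; the 440 Hz tone and buffer size are made-up values):

import numpy as np

samplerate, N = 44100, 1024
t = np.arange(N) / samplerate
signal = np.sin(2 * np.pi * 440.0 * t)           # a pure 440 Hz tone

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum, bins 0 .. N/2
peak_bin = int(np.argmax(spectrum[1:])) + 1      # skip bin 0 (DC offset)
peak_freq = peak_bin * samplerate / N            # f(i) = i * samplerate / N
print(peak_freq)                                 # ~430.7 Hz, the bin closest to 440 Hz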

Why Caffe's Accuracy layer's bottoms consist of InnerProduct and Label?

I'm new to Caffe. In the MNIST example, I thought that the label should be compared with the softmax layer's output, but that is not the situation in lenet.prototxt.
I wonder why the InnerProduct result and the label are used to get the accuracy; it seems unreasonable. Is it because I missed something in the layers?
The output dimension of the last inner product layer is 10, which corresponds to the number of classes (digits 0~9) of your problem.
The loss layer takes two blobs: the first one being the prediction (ip2) and the second one being the label provided by the data layer.
loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2,:label])
It does not produce any outputs; all it does is compute the loss function value, report it when backpropagation starts, and initiate the gradient with respect to ip2. This is where all the magic starts.
After training (in the TEST phase), the desired results in the last layer come from multiplying the weights by ip1 (which was computed in the previous layer), and the class (one of the 10 neurons) with the maximum value is chosen.
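Put differently, the accuracy layer only needs the argmax of the raw ip2 scores, and softmax never changes which class has the maximum score. A quick NumPy illustration (made-up scores, not Caffe code):

import numpy as np

ip2 = np.array([1.2, -0.3, 4.1, 0.0, 2.2, -1.0, 0.7, 3.3, -2.1, 0.5])  # raw scores for 10 digits
exps = np.exp(ip2 - ip2.max())
softmax = exps / exps.sum()              # monotonic transform of the scores

# softmax preserves the ordering, so both argmaxes pick the same class
assert np.argmax(ip2) == np.argmax(softmax)
predicted_digit = int(np.argmax(ip2))    # accuracy can be computed directly from ip2 vs. label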