Mask specific elements in a final layer in PyTorch - deep-learning

I am now reproducing the following model, which outputs an action and uses a filter to eliminate inappropriate candidates.
https://arxiv.org/abs/1702.03274
In this model, the output is filtered after the last softmax layer. Let's assume action_size == 3, so the output after the dense & softmax layers looks like this:
output: [0.1, 0.7, 0.2]
filter: [0, 1, 1]
output*filter: [0, 0.7, 0.2]
But in PyTorch, LogSoftmax is preferred with NLLLoss, so my output looks like the following, and multiplying by the filter no longer makes sense:
output: [-5.4, -0.2, -4.9]
filter: [0, 1, 1]
output*filter: [0, -0.2, -4.9]
So PyTorch doesn't recommend vanilla Softmax. How should I apply the mask to eliminate specific actions?
Or is there a categorical cross-entropy loss function that works with vanilla Softmax?
This module doesn't work directly with NLLLoss, which expects the log to be computed between the Softmax and itself. Use LogSoftmax instead (it's faster and has better numerical properties).
http://pytorch.org/docs/master/nn.html#torch.nn.Softmax

The output of LogSoftmax is simply the log of the output of Softmax. That means you can just call torch.exp(output_from_logsoftmax) to get the same values as from Softmax.
So, if I'm reading your question correctly, you would calculate LogSoftmax, and then feed that into NLLLoss and also exponentiate that to use in your filtering.
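Putting the two together, a minimal sketch of that flow (the logits, mask, and target tensors below are made-up placeholders, not values from the paper):
import torch
import torch.nn as nn

logits = torch.randn(1, 3)                 # raw scores from the dense layer
log_probs = nn.LogSoftmax(dim=1)(logits)   # feed this into NLLLoss for training
loss = nn.NLLLoss()(log_probs, torch.tensor([1]))

mask = torch.tensor([[0., 1., 1.]])        # 0 marks a forbidden action
probs = torch.exp(log_probs) * mask        # same values Softmax would give, masked
action = probs.argmax(dim=1)               # pick among the allowed actions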

Related

PyTorch - Neural Network - Output single scalar value

Let's say we have the following neural network in PyTorch
import torch
import torch.nn as nn

seq_model = nn.Sequential(
    nn.Linear(1, 13),
    nn.Tanh(),
    nn.Linear(13, 1))
With the following input tensor
input = torch.tensor([1.0, 1.0, 5.0], dtype=torch.float32).unsqueeze(1)
I can run forward through the net and get
seq_model(input)
tensor([[-0.0165],
        [-0.0165],
        [-0.2289]], grad_fn=<TanhBackward0>)
I can probably also get a single scalar value as an output, but I'm not sure how.
Thank you. I'm trying to use such a network for reinforcement learning, as a value function approximator for game board state evaluation.
The first dimension of input represents the number of observations in your minibatch (3); the second dimension represents the number of features (1).
If you want to forward a single 3-dimensional input, the network must be modified (nn.Linear(1, 13) becomes nn.Linear(3, 13)) and you must remove the unsqueeze(1) on input. Otherwise, you can merge the three outputs by using a loss that computes a single scalar from them.
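A minimal sketch of that first option (the shapes and values here are assumptions for illustration): the three values become the three features of a single observation, and the network returns one scalar.
import torch
import torch.nn as nn

seq_model = nn.Sequential(
    nn.Linear(3, 13),   # three input features instead of one
    nn.Tanh(),
    nn.Linear(13, 1))   # single output neuron

state = torch.tensor([1.0, 1.0, 5.0])   # one board state with three features
value = seq_model(state)                # tensor of shape [1]
print(value.item())                     # a single Python float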

Using Softmax Activation function after calculating loss from BCEWithLogitsLoss (Binary Cross Entropy + Sigmoid activation)

I am going through a binary classification tutorial using PyTorch, and here the last layer of the network is torch.nn.Linear() with just one neuron (makes sense), which gives us a single output as pred = network(input_batch).
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than applying the softmax first and then calculating the loss), which will apply the Softmax function to the output of the last layer to give us a probability, and then calculate the binary cross-entropy to minimize the loss.
loss=loss_fn(pred,true)
My concern is that after all this, the author used torch.round(torch.sigmoid(pred)).
Why would that be? I mean, I know it gets the prediction probabilities in the range [0, 1] and then rounds the values with the default threshold of 0.5.
Isn't it better to use the sigmoid once after the last layer within the network, rather than using a softmax and a sigmoid in two different places, given it's a binary classification?
Wouldn't it be better to just
out = self.linear(batch_tensor)
return self.sigmoid(out)
and then calculate the BCE loss and use the argmax() for checking accuracy??
I am just curious whether that would be a valid strategy.
You seem to be thinking of the binary classification as a multi-class classification with two classes, but that is not quite correct when using the binary cross-entropy approach. Let's start by clarifying the goal of the binary classification before looking at any implementation details.
Technically, there are two classes, 0 and 1, but instead of considering them as two separate classes, you can see them as opposites of each other. For example, you want to classify whether a StackOverflow answer was helpful or not. The two classes would be "helpful" and "not helpful". Naturally, you would simply ask "Was the answer helpful?", the negative aspect is left off, and if that wasn't the case, you could deduce that it was "not helpful". (Remember, it's a binary case, there is no middle ground).
Therefore, your model only needs to predict a single class, but to avoid confusion with the actual two classes, that can be expressed as: The model predicts the probability that the positive case occurs. In context of the previous example: What is the probability that the StackOverflow answer was helpful?
Sigmoid gives you values in the range [0, 1], which are the probabilities. Now you need to decide when the model is confident enough for it to be positive by defining a threshold. To make it balanced, the threshold is 0.5, therefore as long as the probability is greater than 0.5 it is positive (class 1: "helpful") otherwise it's negative (class 0: "not helpful"), which is achieved by rounding (i.e. torch.round(torch.sigmoid(pred))).
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than applying the softmax first and then calculating the loss), which will apply the Softmax function to the output of the last layer to give us a probability.
Isn't it better to use the sigmoid once after the last layer within the network, rather than using a softmax and a sigmoid in two different places, given it's a binary classification?
BCEWithLogitsLoss applies Sigmoid, not Softmax; there is no Softmax involved at all. From the nn.BCEWithLogitsLoss documentation:
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
By not applying Sigmoid in the model you get a more numerically stable version of the binary cross-entropy, but that means you have to apply the Sigmoid manually if you want to make an actual prediction outside of training.
[...] and use the argmax() for checking accuracy??
Again, you're thinking of the multi-class scenario. You only have a single output class, i.e. the output has size [batch_size, 1]. Taking argmax of that will always give you 0, because that is the only available class.
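To make the recommended pattern concrete, here is a minimal sketch (the layer size, batch, and labels are made up for illustration): the model returns raw logits, BCEWithLogitsLoss handles the sigmoid during training, and sigmoid + round is applied only when making predictions.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)            # last layer: a single output neuron
loss_fn = nn.BCEWithLogitsLoss()

batch = torch.randn(4, 10)
true = torch.tensor([[1.], [0.], [1.], [0.]])

logits = model(batch)               # no sigmoid inside the model
loss = loss_fn(logits, true)        # numerically stable binary cross-entropy

preds = torch.round(torch.sigmoid(logits))   # threshold at 0.5 -> class 0 or 1
accuracy = (preds == true).float().mean()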

What exactly is a Softmax output layer?

I'm trying to make a simple conv net in C#, and I want to make a Softmax output layer, but I don't really know what it is. Is it a fully connected layer with Softmax activation, or just a layer which outputs the softmax of the data?
Softmax is just a function that takes a vector and outputs a vector of the same size with values in the range [0, 1]. The values also follow the fundamental rule of probability, i.e. they sum to 1:
softmax(x)_i = exp(x_i) / (sum_{j=1}^K exp(x_j))    for each i = 1, ..., K
But sometimes people use the term Softmax classifier, which refers to an MLP with an input and one output layer (which makes it a linear classifier, like a linear SVM) where the softmax function is applied to the outputs of the output layer. This setup gives the probability of the input belonging to each of the output classes.
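A small sketch of the formula above in plain NumPy (the example vector is arbitrary):
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # approx. [0.659, 0.242, 0.099]
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0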

Why the cost function and the last activation function are bound in MXNet?

When we define a deep learning model, we do the following steps:
Specify how the output should be calculated based on the input and the model's parameters.
Specify a cost (loss) function.
Search for the model's parameters by minimizing the cost function.
It looks to me like the first two steps are bound in MXNet. For example, this is how I define a linear transformation:
import mxnet as mx
import numpy as np

# declare a symbolic variable for the model's input
inp = mx.sym.Variable(name = 'inp')
# define how output should be determined by the input
out = mx.sym.FullyConnected(inp, name = 'out', num_hidden = 2)
# specify input and model's parameters
x = mx.nd.array(np.ones(shape = (5,3)))
w = mx.nd.array(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
b = mx.nd.array(np.array([7.0, 8.0]))
# calculate output based on the input and parameters
p = out.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b})
print(p.forward()[0].asnumpy())
Now, if I want to add a SoftMax transformation on top of it, I need to do the following:
# define the cost function
target = mx.sym.Variable(name = 'target')
cost = mx.symbol.SoftmaxOutput(out, target, name='softmax')
y = mx.nd.array(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
c = cost.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b, 'target':y})
print(c.forward()[0].asnumpy())
What I do not understand, is why do we need to create the symbolic variable target. We would need it only if we want to calculate costs, but so far, we just calculate output based on the input (by doing a linear transformation and SoftMax).
Moreover, we need to provide a numerical value for the target to get the output calculated. So, it looks like it is required but it is not used (the provided value of the target does not change the value of the output).
Finally, we can use the cost object to define a model which we can fit as soon as we have data. But what about the cost function? It has to be specified, but it is not. Basically, it looks like I am forced to use a specific cost function just because I use SoftMax. But why?
ADDED
For a more statistical / mathematical point of view, check here. The current question is more pragmatic / programmatic in nature: how to decouple the output nonlinearity and the cost function in MXNet. For example, I might want to do a linear transformation and then find the model's parameters by minimizing the absolute deviation instead of the squared one.
You can use mx.sym.softmax() if you only want the softmax. mx.sym.SoftmaxOutput() contains efficient code for calculating the gradient of cross-entropy (negative log loss), which is the most common loss used with softmax. If you want to use your own loss, just use softmax and add a loss on top during training. I should note that you can also replace the SoftmaxOutput layer with a plain softmax during inference if you really want to.
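A rough sketch of that decoupling, assuming the same Symbol API as in the question (MakeLoss with a mean absolute deviation loss is just one possible choice, not the only way to do it):
import mxnet as mx

inp = mx.sym.Variable(name = 'inp')
target = mx.sym.Variable(name = 'target')
out = mx.sym.FullyConnected(inp, name = 'out', num_hidden = 2)
prob = mx.sym.softmax(out, name = 'prob')   # just the nonlinearity, no loss attached
# attach any loss you like on top, e.g. mean absolute deviation
loss = mx.sym.MakeLoss(mx.sym.mean(mx.sym.abs(prob - target)))
At inference time you can bind and forward prob alone, exactly like out in the question's first snippet.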

How to get the probability of each vector belonging to each cluster?

I use the following code to create the clusters. I would like to get the probability of each vector belonging to each cluster. How can I do this?
import numpy as np
from nltk import cluster
from nltk.cluster import euclidean_distance
vectors = [np.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
clusterer = cluster.KMeansClusterer(2, euclidean_distance)
clusters = clusterer.cluster(vectors, assign_clusters=True, trace=False)
from sklearn import mixture
model = mixture.GMM(n_components=4)
model.fit(dataset)
model.score_samples(dataset)
According to the docs, this returns:
Posterior probabilities of each mixture component for each observation.
But of course this won't help if the clustering doesn't converge for your data.
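For reference, a small sketch of the same idea with the current scikit-learn API (mixture.GMM was later replaced by GaussianMixture; two components are used here to match the two clusters in the question):
import numpy as np
from sklearn.mixture import GaussianMixture

vectors = np.array([[3, 3], [1, 2], [4, 2], [4, 0]])
model = GaussianMixture(n_components=2).fit(vectors)
print(model.predict_proba(vectors))   # shape (4, 2), each row sums to 1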
Are you talking about:
1. the assignments k-means made to the vectors in your vectors variable, or
2. the assignment of a new vector to an existing cluster?
1. The K-means assignments
Simply print the clusters variable. If you see [0, 0, 1, 1], it means [3, 3] and [1, 2] (the first two) were assigned to cluster 0, and [4, 2] and [4, 0] (the last two) to cluster 1. There's no probability here.
2. Assigning a new vector to an existing cluster
Since you're using KMeans, you first need to know the centroid of each cluster. The nltk API treats this as private information: the interesting variable (_means) is prefixed with an underscore. It could change in the future, but you can still access the value if you want to.
The NLTK algorithm is randomized, so you will get different centroids each time. As I said before, you can see the assignments with print(clusters) and the centroids with print(clusterer._means). Let's say you got the assignment [0, 0, 1, 1] with centroids [2, 2.5] and [4, 1]. A new vector (say [1, 2]) would be assigned to the cluster with the closest centroid. Again, it makes little sense to talk about probability here, but if you really wanted to, you could compute the distance to every centroid and turn the negated distances into pseudo-probabilities with a softmax, as sketched below.
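A rough sketch of that last idea (these are not real probabilities, just normalized scores; _means is a private attribute and may change):
import numpy as np
from nltk import cluster
from nltk.cluster import euclidean_distance

vectors = [np.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
clusterer = cluster.KMeansClusterer(2, euclidean_distance)
clusterer.cluster(vectors, assign_clusters=True)

new_vec = np.array([1, 2])
centroids = np.array(clusterer._means)               # private attribute
dists = np.linalg.norm(centroids - new_vec, axis=1)  # distance to each centroid
scores = np.exp(-dists)                              # closer centroid -> larger score
probs = scores / scores.sum()                        # softmax over negative distances
print(probs)                                         # pseudo-probabilities, sum to 1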