I am trying to look at the possible activation functions for a deep network for speaker recognition. I will have an input and a label (0s and 1s) as the output. I was wondering whether a rectified linear unit (ReLU) can be used with any type of output, or just a specific one? Thank you.
Activation functions are normally only used in the hidden layers. Output units almost always have a linear activation function (i.e. the identity, or no activation function). Rectified units are used in the hidden layers because their gradient is much simpler than that of their sigmoidal counterparts, which allows for better training with many layers.
You mentioned your output has labels that are either 0s or 1s. Is this an output vector with N outputs that are each either 0 or 1? Or do you mean that you have only two classes (0 or 1)? If you want to do classification (getting the network to output either class 0 or class 1), you would use a softmax activation on the output layer. Softmax scales the outputs into probabilities over the network's predicted classes.
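For instance, here is a minimal sketch (assuming PyTorch; the layer sizes and the 40 input features are placeholders) of ReLU hidden layers feeding a two-class output layer, where the output layer itself stays linear and the loss applies the softmax:

import torch.nn as nn

# Hypothetical layer sizes; 40 stands in for the number of acoustic features per example.
model = nn.Sequential(
    nn.Linear(40, 128),
    nn.ReLU(),              # ReLU only in the hidden layers
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2),      # two classes; raw logits, no activation here
)
loss_fn = nn.CrossEntropyLoss()  # applies the (log-)softmax internally during training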
Let me know more information and I will see if I can help more.
Can anyone help me understand: when I use Conv1d and then a Linear layer, what will the inputs of the linear layer be? How do I calculate how many input features I should pass it in PyTorch?
In PyTorch, Linear layers operate on only the last dimension of the input tensor: [*, features_in] -> [*, features_out].
However, Conv1d layers consider the last two dimensions of the input tensor: [batches, channels_in, length_in] -> [batches, channels_out, length_out].
Therefore, if no pre-processing is used, a Linear layer will only operate on the per-channel signals, i.e., [batches, channels_in, features_in] -> [batches, channels_in, features_out]. This behavior is rarely desired, so people usually flatten the tensor before passing it to a Linear layer, for example with x.view(n_batches, -1).
The behavior you need depends on the details of your application. Good luck.
Sources:
https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html (Conv1d)
https://pytorch.org/docs/stable/generated/torch.nn.Linear.html (Linear)
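As a concrete sketch (with hypothetical shapes), the number of input features of the Linear layer is channels_out * length_out of the preceding Conv1d:

import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3)  # length shrinks by kernel_size - 1
x = torch.randn(16, 4, 100)        # [batches, channels_in, length_in]
h = conv(x)                        # [16, 8, 98] = [batches, channels_out, length_out]

flat = h.view(h.size(0), -1)       # flatten everything except the batch dimension
linear = nn.Linear(8 * 98, 10)     # in_features = channels_out * length_out
out = linear(flat)                 # [16, 10]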
I think the answer would be yes, but I'm unable to reason out a good explanation for this.
The mathematical argument lies in the ability to represent linearity; we can use the following three lemmas to show that:
Lemma 1
With affine transformations (linear layers) we can map the input hypercube [0,1]^d into an arbitrarily small box [a,b]^k. The proof is quite simple: we can just set all the biases equal to a and multiply the weights by (b-a).
Lemma 2
At a sufficiently small scale, many non-linearities are approximately linear. This is essentially the definition of the derivative, or a first-order Taylor expansion. In particular, take relu(x): for x > 0 it is, in fact, exactly linear. What about the sigmoid? If we look at a tiny region [-eps, eps], you can see that it approaches a linear function as eps -> 0.
Lemma 3
The composition of affine functions is affine. In other words, if I were to build a neural network with multiple consecutive linear layers, it would be equivalent to having just one. This comes from the rules of matrix composition:
W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2), where W2W1 are the new weights and W2b1 + b2 is the new bias.
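A quick numerical check of this identity (the shapes here are chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))   # True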
Combining the above
Composing the three lemmas above, we see that with a non-linear layer there always exists an arbitrarily good approximation of a linear function! We simply use the first layer to map the entire input space into the tiny part of the pre-activation space where your non-linearity is approximately linear, and then we "map it back" in the following layer.
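Here is a minimal numerical sketch of that construction, approximating the linear target 3x - 1 on [0, 1] with a single sigmoid layer squeezed into a tiny region around 0 (it uses the fact that sigmoid(z) is approximately 0.5 + z/4 for small z):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 1e-3                        # scale of the tiny region around 0 (Lemma 1)
x = np.linspace(0.0, 1.0, 100)

# Layer 1 (affine): squeeze the input into [-eps, eps], where the sigmoid is ~linear (Lemma 2).
pre = eps * (2.0 * x - 1.0)
h = sigmoid(pre)

# Layer 2 (affine): "map back" by inverting the linearization and applying the target map.
x_rec = ((h - 0.5) * 4.0 / eps + 1.0) / 2.0
y_hat = 3.0 * x_rec - 1.0

print(np.max(np.abs(y_hat - (3.0 * x - 1.0))))   # ~1e-7, and it shrinks further as eps -> 0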
General case
This is a very simple proof. More generally, you can use the Universal Approximation Theorem to show that a sufficiently large non-linear neural network (sigmoid, ReLU, many others) can approximate any continuous target function, which includes linear ones. That proof (originally given by Cybenko) is, however, much more complex, and relies on showing that specific classes of functions are dense in the space of continuous functions.
Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the activation function outputs values in the range 0.0-1.0; then you can round up or down to get a binary 0/1. Just to be clear, rounding up or down isn't a linear activation, but for this specific question the purpose of the network was classification, where some kind of rounding has to be applied anyway.
The reason you shouldn't is the same reason you shouldn't attach an industrial heater to a fan and call it a hair-dryer: it's unnecessarily powerful, and it could waste resources and time.
I hope this answer helped, have a good day!
I am going through a binary classification tutorial using PyTorch, and here the last layer of the network is torch.nn.Linear() with just one neuron (makes sense), which gives us a single output value, as in pred = network(input_batch).
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than applying the softmax first and then calculating the loss), which will apply the softmax function to the output of the last layer to give us a probability. So after that, it calculates the binary cross-entropy to minimize the loss.
loss=loss_fn(pred,true)
My concern is that after all this, the author used torch.round(torch.sigmoid(pred))
Why would that be? I mean, I know it'll get the prediction probabilities in the range [0, 1] and then round off the values with the default threshold of 0.5.
Isn't it better to use the sigmoid once after the last layer within the network, rather than using a softmax and a sigmoid in two different places, given that it's binary classification?
Wouldn't it be better to just
out = self.linear(batch_tensor)
return self.sigmoid(out)
and then calculate the BCE loss and use the argmax() for checking accuracy?
I am just curious whether that can be a valid strategy.
You seem to be thinking of the binary classification as a multi-class classification with two classes, but that is not quite correct when using the binary cross-entropy approach. Let's start by clarifying the goal of the binary classification before looking at any implementation details.
Technically, there are two classes, 0 and 1, but instead of considering them as two separate classes, you can see them as opposites of each other. For example, say you want to classify whether a StackOverflow answer was helpful or not. The two classes would be "helpful" and "not helpful". Naturally, you would simply ask "Was the answer helpful?"; the negative aspect is left off, and if that wasn't the case, you could deduce that it was "not helpful". (Remember, it's a binary case; there is no middle ground.)
Therefore, your model only needs to predict a single class, but to avoid confusion with the actual two classes, that can be expressed as: The model predicts the probability that the positive case occurs. In context of the previous example: What is the probability that the StackOverflow answer was helpful?
Sigmoid gives you values in the range [0, 1], which are the probabilities. Now you need to decide when the model is confident enough for the prediction to be positive, by defining a threshold. To keep it balanced, the threshold is 0.5; as long as the probability is greater than 0.5 it is positive (class 1: "helpful"), otherwise it's negative (class 0: "not helpful"), which is achieved by rounding (i.e. torch.round(torch.sigmoid(pred))).
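With a few made-up logits you can also see that a 0.5 threshold on the probability is the same as a 0 threshold on the raw output:

import torch

logits = torch.tensor([-2.0, -0.1, 0.3, 4.0])
probs = torch.sigmoid(logits)
print(torch.round(probs))      # tensor([0., 0., 1., 1.])
print((logits > 0).float())    # same result: sigmoid(x) > 0.5 exactly when x > 0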
After that, the choice of loss function is loss_fn = BCEWithLogitsLoss() (which is more numerically stable than applying the softmax first and then calculating the loss), which will apply the softmax function to the output of the last layer to give us a probability.
Isn't it better to use the sigmoid once after the last layer within the network, rather than using a softmax and a sigmoid in two different places, given that it's binary classification?
BCEWithLogitsLoss applies a sigmoid, not a softmax; there is no softmax involved at all. From the nn.BCEWithLogitsLoss documentation:
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
By not applying Sigmoid in the model you get a more numerically stable version of the binary cross-entropy, but that means you have to apply the Sigmoid manually if you want to make an actual prediction outside of training.
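Putting that together, a minimal sketch (with hypothetical sizes and random data) of the training-versus-prediction flow looks like this:

import torch
import torch.nn as nn

model = nn.Linear(16, 1)                 # single output neuron, no sigmoid inside the model
loss_fn = nn.BCEWithLogitsLoss()         # applies the sigmoid internally, numerically stable

x = torch.randn(8, 16)                   # a batch of 8 examples with 16 features (made up)
y = torch.randint(0, 2, (8, 1)).float()  # binary targets, shape [batch_size, 1]

logits = model(x)                        # raw scores, shape [8, 1]
loss = loss_fn(logits, y)                # used for training; no manual sigmoid here

probs = torch.sigmoid(logits)            # only for making actual predictions
preds = torch.round(probs)               # 1 if probability > 0.5, else 0
accuracy = (preds == y).float().mean()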
[...] and use the argmax() for checking accuracy?
Again, you're thinking of the multi-class scenario. You only have a single output class, i.e. the output has size [batch_size, 1]. Taking the argmax of that will always give you 0, because that is the only available class.
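You can see this with a few made-up logits:

import torch

logits = torch.tensor([[2.3], [-1.7], [0.4]])   # shape [batch_size, 1]
print(logits.argmax(dim=1))                     # tensor([0, 0, 0]): always "class 0"
print(torch.round(torch.sigmoid(logits)))       # tensor([[1.], [0.], [1.]]): the actual predictions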
I am working on predicting Semantic Textual Similarity (SemEval 2017 Task-1) between a pair of texts. The similarity score (output) is a continuous value between [0,5]. The neural network model (link below), therefore, has 6 units in the final layer for prediction between values [0,5]. The objective function used is the Pearson correlation coefficient and softmax activation is used. Now, in order to train the model, how can I give the target output values to the model? Since there are 6 output classes, I should probably send one-hot-encoded vectors of the output. In that case, how can we convert the output (which might be a float value such as 2.33) to a one-hot vector of length 6? Or is there any other way of specifying the target output and training the model?
Paper: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval016.pdf
If the value you're trying to predict is continuous, you might be better off configuring this as a regression architecture. This will be simpler to train and interpret, and will give you non-integer predictions (which you can then bucket or threshold however you please).
In order to do this, replace your softmax layer with a layer containing a single neuron with a linear activation function. Then you can simply train this network using your real-valued similarity scores as the targets. For the loss function, you can use MSE / L2 unless you have a reason to do otherwise.
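A minimal sketch of that regression head, assuming Keras and a placeholder encoder in front of it (the layer sizes and input shape here are made up):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(300,)),  # placeholder for your sentence-pair features
    layers.Dense(1, activation='linear'),                     # single neuron, linear output
])
model.compile(optimizer='adam', loss='mse')                   # MSE / L2 loss

# model.fit(X_train, y_train), with y_train holding the real-valued scores in [0, 5]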
I am training a binary classifier using a sigmoid activation function with binary cross-entropy, which gives good accuracy of around 98%.
The same model trained using softmax with categorical_crossentropy gives very low accuracy (< 40%).
I am passing the targets for binary_crossentropy as a list of 0s and 1s, e.g. [0, 1, 1, 1, 0].
Any idea why this is happening?
This is the model I am using for the second classifier:
Right now, your second model always answers "Class 0", as it can choose from only one class (the number of outputs of your last layer).
As you have two classes, you need to compute the softmax + categorical_crossentropy on two outputs to pick the most probable one.
Hence, your last layer should be:
model.add(Dense(2, activation='softmax'))
model.compile(...)
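Note that with two softmax outputs, categorical_crossentropy expects one-hot targets, so the [0, 1, 1, 1, 0] labels would need converting (a small sketch, assuming Keras):

from tensorflow.keras.utils import to_categorical

y = [0, 1, 1, 1, 0]
y_onehot = to_categorical(y, num_classes=2)   # [[1,0], [0,1], [0,1], [0,1], [1,0]]
# then train with model.fit(X, y_onehot)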
Your sigmoid + binary_crossentropy model, which computes the probability of "Class 1" being true by analyzing just a single output number, is already correct.
EDIT: Here is a small explanation about the Sigmoid function
Sigmoid can be viewed as a mapping between the space of real numbers and a probability space.
Notice that:
Sigmoid(-infinity) = 0
Sigmoid(0) = 0.5
Sigmoid(+infinity) = 1
So if the real number output by your network is very low, the sigmoid will decide the probability of "Class 1" is close to 0, and it will decide "Class 0".
On the contrary, if the output of your network is very high, the sigmoid will decide the probability of "Class 1" is close to 1, and it will decide "Class 1".
Its decision is similar to deciding the class by looking only at the sign of your output. However, this would not allow your model to learn! Indeed, the gradient of this binary loss is null nearly everywhere, making it impossible for your model to learn from its errors, as they are not quantified properly.
That's why sigmoid and "binary_crossentropy" are used:
They are a surrogate for the binary loss; they have nice smoothness properties and enable learning.
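For instance, here is a small plain-NumPy sketch of why the smooth surrogate provides a learning signal while the hard decision does not:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The gradient of binary cross-entropy w.r.t. the raw output x, for a target y, is sigmoid(x) - y.
x, y = 0.3, 1.0
print(sigmoid(x) - y)   # ~ -0.43: a useful, non-zero learning signal

# By contrast, the "hard" decision round(sigmoid(x)) is a step function of x,
# so its derivative is 0 everywhere except at the threshold: no learning signal at all.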
Also, please find more info about Softmax Function and Cross Entropy