Predicting continuous valued output - deep-learning

I am working on predicting Semantic Textual Similarity (SemEval 2017 Task-1) between a pair of texts. The similarity score (output) is a continuous value between [0,5]. The neural network model (link below), therefore, has 6 units in the final layer for prediction between values [0,5]. The objective function used is the Pearson correlation coefficient and softmax activation is used. Now, in order to train the model, how can I give the target output values to the model? Since there are 6 output classes, I should probably send one-hot-encoded vectors of the output. In that case, how can we convert the output (which might be a float value such as 2.33) to a one-hot vector of length 6? Or is there any other way of specifying the target output and training the model?
Paper: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval016.pdf

If the value you're trying to predict is continuously-defined, you might be better off configuring this as a regression architecture. This will be simpler to train and interpret and will give you non-integer predictions (which you can then bucket or threshold however you please).
To do this, replace your softmax layer with a layer containing a single neuron with a linear activation function. Then you can simply train this network using your real-valued similarity scores as the targets. For the loss function, you can use MSE / L2 unless you have a reason to do otherwise.
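As a rough illustration of the regression setup described above, here is a minimal pure-Python sketch (not the paper's model): a single linear output unit trained with MSE by plain gradient descent. The one-feature toy data, the function names, and the learning-rate settings are all assumptions made for the example. The only point is that a float target such as 2.33 is used directly, with no one-hot encoding.

```python
def predict(features, weights, bias):
    """Single linear output neuron: no softmax, no squashing."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mse(preds, targets):
    """Mean squared error (L2 loss) over a batch."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def train(samples, targets, lr=0.1, epochs=1000):
    """Fit weights and bias by full-batch gradient descent on MSE."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, t in zip(samples, targets):
            err = predict(x, weights, bias) - t
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def clip_score(score, low=0.0, high=5.0):
    """Optionally clamp a prediction to the valid score range."""
    return max(low, min(high, score))


# Hypothetical toy data: one feature per text pair, real-valued targets.
samples = [[0.0], [0.5], [1.0]]
targets = [0.0, 2.33, 5.0]

weights, bias = train(samples, targets)
preds = [predict(x, weights, bias) for x in samples]
final_mse = mse(preds, targets)
```

A linear unit can output values outside [0, 5], which is why the sketch includes the optional clip_score helper for post-processing.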

Related

Using Softmax Activation function after calculating loss from BCEWithLogitsLoss (Binary Cross Entropy + Sigmoid activation)

I am going through a binary classification tutorial using PyTorch, and the last layer of the network is torch.nn.Linear() with just one neuron (which makes sense), so the network gives a single output value per sample: pred=network(input_batch)
After that, the choice of loss function is loss_fn=BCEWithLogitsLoss() (which is more numerically stable than applying the softmax first and then calculating the loss), which will apply the Softmax function to the output of the last layer to give us a probability. After that, it will calculate the binary cross entropy to minimize the loss.
loss=loss_fn(pred,true)
My concern is that after all this, the author used torch.round(torch.sigmoid(pred))
Why would that be? I mean, I know it will get the prediction probabilities in the range [0,1] and then round off the values with a default threshold of 0.5.
Isn't it better to use the sigmoid once after the last layer within the network, rather than using a softmax and a sigmoid at 2 different places, given it's a binary classification?
Wouldn't it be better to just
out = self.linear(batch_tensor)
return self.sigmoid(out)
and then calculate the BCE loss and use argmax() for checking accuracy?
I am just curious whether that can be a valid strategy.
You seem to be thinking of the binary classification as a multi-class classification with two classes, but that is not quite correct when using the binary cross-entropy approach. Let's start by clarifying the goal of the binary classification before looking at any implementation details.
Technically, there are two classes, 0 and 1, but instead of considering them as two separate classes, you can see them as opposites of each other. For example, you want to classify whether a StackOverflow answer was helpful or not. The two classes would be "helpful" and "not helpful". Naturally, you would simply ask "Was the answer helpful?", the negative aspect is left off, and if that wasn't the case, you could deduce that it was "not helpful". (Remember, it's a binary case, there is no middle ground).
Therefore, your model only needs to predict a single class, but to avoid confusion with the actual two classes, that can be expressed as: The model predicts the probability that the positive case occurs. In context of the previous example: What is the probability that the StackOverflow answer was helpful?
Sigmoid gives you values in the range [0, 1], which are the probabilities. Now you need to decide when the model is confident enough for it to be positive by defining a threshold. To make it balanced, the threshold is 0.5, therefore as long as the probability is greater than 0.5 it is positive (class 1: "helpful") otherwise it's negative (class 0: "not helpful"), which is achieved by rounding (i.e. torch.round(torch.sigmoid(pred))).
"After that the choice of Loss function is loss_fn=BCEWithLogitsLoss() (which is numerically stable than using the softmax first and then calculating loss) which will apply Softmax function to the output of last layer to give us a probability."
"Isn't it better to use the sigmoid once after the last layer within the network rather using a softmax and a sigmoid at 2 different places given it's a binary classification??"
BCEWithLogitsLoss applies Sigmoid, not Softmax; there is no Softmax involved at all. From the nn.BCEWithLogitsLoss documentation:
"This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability."
By not applying Sigmoid in the model you get a more numerically stable version of the binary cross-entropy, but that means you have to apply the Sigmoid manually if you want to make an actual prediction outside of training.
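To make the stability argument concrete, here is a small pure-Python sketch comparing the two routes on a single logit. It uses the standard algebraic reformulation of binary cross-entropy on logits and is not PyTorch's source code; the function names are made up for the example.

```python
import math


def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_from_probability(p, y):
    """Naive route: the model already applied sigmoid, the loss takes logs."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(z, y):
    """Fused route: works on the raw logit z and never forms log(0)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def predict_label(z, threshold=0.5):
    """Outside of training, apply the sigmoid yourself and threshold it."""
    return 1 if sigmoid(z) > threshold else 0


# For moderate logits the two routes agree.
moderate_gap = abs(
    bce_with_logits(0.3, 1.0) - bce_from_probability(sigmoid(0.3), 1.0)
)

# For a confidently wrong logit, sigmoid(40.0) rounds to exactly 1.0 in
# double precision, so the naive route ends up taking log(0.0).
try:
    bce_from_probability(sigmoid(40.0), 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_loss = bce_with_logits(40.0, 0.0)  # close to 40
```

The predict_label helper mirrors the torch.round(torch.sigmoid(pred)) step: the model itself returns logits, and the sigmoid is applied only when an actual prediction is needed.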
"[...] and use the argmax() for checking accuracy??"
Again, you're thinking of the multi-class scenario. You only have a single output class, i.e. the output has size [batch_size, 1]. Taking the argmax of that along the class dimension will always give you 0, because that is the only available class.
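A tiny pure-Python sketch of that last point, using made-up logits in place of real model output:

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical raw model outputs (logits) with shape [batch_size, 1].
logits = [[2.1], [-0.7], [0.4]]

# Each row has one entry, so index 0 is the only possible answer.
argmax_preds = [argmax(row) for row in logits]

# sigmoid(z) > 0.5 exactly when z > 0, so thresholding the logit at 0
# gives the same labels as rounding the sigmoid output, except for logits
# so close to zero that the sigmoid rounds to exactly 0.5.
threshold_preds = [1 if row[0] > 0.0 else 0 for row in logits]
```

Here argmax_preds carries no information about the input, while threshold_preds is the binary prediction you actually want.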

How can we define an RNN - LSTM neural network with multiple output for the input at time "t"?

I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match would influence the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layers to process "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle a mix of continuous and discrete outputs in the loss function?
(Though the output from LSTM "a" has multiple features and carries the information to the next time-slot, we need multiple features at output for training based on the ground-truth)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
At every time step, the output of the LSTM unit is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1: sigmoid
2: I'd mess around with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?) or relu4
3, 4: linear (squaring it is also an option, or relu)
For training purposes, your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you would calculate them separately and sum.
You can still concatenate them afterwards to have an output vector.
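The recipe above can be sketched for a single time step in plain Python. This is an illustration under assumptions, not a full model: the LSTM output is a hand-written vector, the head weights are hand-picked, and using ReLU plus squared error for the wickets head is just one of the options mentioned. The sigmoid for the "played" head is folded into the cross-entropy, so that head returns a raw logit.

```python
import math


def linear(hidden, weights, bias):
    """One output head: a linear layer on the LSTM output for a time step."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias


def bce_with_logits(z, y):
    """Binary cross-entropy computed from the raw logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def multi_head_loss(hidden, heads, target):
    """Sum of four per-head losses for a single time step."""
    played_logit = linear(hidden, *heads["played"])        # sigmoid head
    wickets = max(0.0, linear(hidden, *heads["wickets"]))  # ReLU head
    runs = linear(hidden, *heads["runs"])                  # linear head
    balls = linear(hidden, *heads["balls"])                # linear head

    return (
        bce_with_logits(played_logit, target["played"])
        + (wickets - target["wickets"]) ** 2
        + (runs - target["runs"]) ** 2
        + (balls - target["balls"]) ** 2
    )


# Hypothetical LSTM output at one time step and hand-picked head weights.
hidden = [0.5, -1.0]
heads = {
    "played": ([0.0, 0.0], 0.0),
    "wickets": ([2.0, 0.0], 1.0),
    "runs": ([0.0, -30.0], 0.0),
    "balls": ([0.0, 0.0], 24.0),
}
target = {"played": 1.0, "wickets": 2.0, "runs": 30.0, "balls": 24.0}

loss = multi_head_loss(hidden, heads, target)
```

Because runs and balls live on a much larger scale than the cross-entropy term, a common adjustment is to weight or rescale the individual terms before summing; that choice is left out here.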

Obtaining multiple output in regression using deep learning

Given a dataset of RGB images of a hand together with the 3D positions of the hand's keypoints, I want to treat this as a regression problem in DL. In this case the input will be the RGB image, and the output should be the estimated 3D positions of the keypoints.
I have seen some info about regression, but most of it is about estimating one single value. Is it possible to estimate multiple values (or outputs) all at once?
For now I have referred to this code. This guy is trying to estimate the age of a person in the image.
The output vector from a neural net can represent anything as long as you define the loss function well. Say you want to detect the (x,y,z) coordinates of 10 keypoints; then just have a 30-element output vector, say (x1,y1,z1,x2,y2,z2,...,x10,y10,z10), where xi,yi,zi denote the coordinates of the ith keypoint. Basically, you can use any order you find convenient. Just be careful with your loss function. Say you want to calculate RMSE loss: you would have to extract the triples correctly and then calculate the RMSE loss for each keypoint, or, if you are familiar with linear algebra, just reshape it into a 3x10 matrix correctly, have your ground truth also as a 3x10 matrix, and then just use
loss = tf.sqrt(tf.reduce_mean(tf.math.squared_difference(Y1, Y2)))
But once you have chosen an output ordering for your net, you will have to stick to it.
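Here is a minimal pure-Python sketch of that bookkeeping, assuming the (x1,y1,z1,x2,y2,z2,...) ordering from above. The helper names and the toy vectors are made up for the example; in a real network the same arithmetic would be done on tensors.

```python
import math


def to_keypoints(flat, dims=3):
    """Split (x1, y1, z1, x2, y2, z2, ...) into one triple per keypoint."""
    if len(flat) % dims != 0:
        raise ValueError("output length must be a multiple of dims")
    return [tuple(flat[i:i + dims]) for i in range(0, len(flat), dims)]


def rmse(pred_flat, true_flat):
    """Root mean squared error over every coordinate."""
    squared = [(p - t) ** 2 for p, t in zip(pred_flat, true_flat)]
    return math.sqrt(sum(squared) / len(squared))


def per_keypoint_distance(pred_flat, true_flat):
    """Euclidean distance between each predicted and true keypoint."""
    pairs = zip(to_keypoints(pred_flat), to_keypoints(true_flat))
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, t)))
        for p, t in pairs
    ]


# Hypothetical 30-element network output and ground truth (10 keypoints).
pred_flat = [0.0] * 30
true_flat = [0.0] * 30
true_flat[0], true_flat[1] = 3.0, 4.0  # only the first keypoint differs

overall = rmse(pred_flat, true_flat)
distances = per_keypoint_distance(pred_flat, true_flat)
```

The overall RMSE is a single scalar suitable as a training loss, while the per-keypoint distances are handy for checking that the ordering of the output vector matches the ordering of the labels.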

What's the difference between Softmax and SoftmaxWithLoss layer in caffe?

While defining a prototxt in Caffe, I found that sometimes we use Softmax as the last layer type and sometimes we use SoftmaxWithLoss. I know the Softmax layer will return the probability that the input data belongs to each class, but it seems that SoftmaxWithLoss will also return the class probability, so what's the difference between them? Or did I misunderstand the usage of the two layer types?
While Softmax returns the probability of each target class given the model predictions, SoftmaxWithLoss not only applies the softmax operation to the predictions, but also computes the multinomial logistic loss, returned as output. This is fundamental for the training phase (without a loss there will be no gradient that can be used to update the network parameters).
See SoftmaxWithLossLayer and Caffe Loss for more info.
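The distinction can be illustrated with a short pure-Python sketch. This is not Caffe code; the two functions only mimic what each layer type computes, and the scores are made up.

```python
import math


def softmax(scores):
    """Softmax: turns raw class scores into probabilities (inference)."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_with_loss(scores, label):
    """Softmax followed by multinomial logistic loss: one scalar (training)."""
    shift = max(scores)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - scores[label]  # equals -log(softmax(scores)[label])


scores = [1.0, 3.0, 0.5]   # hypothetical raw scores for 3 classes
probs = softmax(scores)    # a vector of class probabilities
loss = softmax_with_loss(scores, label=1)  # a single number to minimize
```

The first function needs no label and returns a distribution; the second needs the true label and returns the scalar that gradients are computed from.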

Loss function for ordinal target on SoftMax over Logistic Regression

I am using Pylearn2 OR Caffe to build a deep network. My target is ordered nominal. I am trying to find a proper loss function but cannot find any in Pylearn2 or Caffe.
I read a paper "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels" . I get the general idea - but I am not sure I understand what will the thresholds be, if my final layer is a SoftMax over Logistic Regression (outputting probabilities).
Can someone help me by pointing to any implementation of such a loss function?
Thanks
Regards
For both pylearn2 and caffe, your labels will need to be 0-4 instead of 1-5...it's just the way they work. The output layer will be 5 units, each of which is essentially a logistic unit...and the softmax can be thought of as an adaptor that normalizes the final outputs. But "softmax" is commonly used as an output type. When training, the value of any individual unit is rarely ever exactly 0.0 or 1.0...it's always a distribution across your units, on which the log-loss can be calculated. This loss is used to compare against the "perfect" case, and the error is back-propped to update your network weights. Note that a raw output from PL2 or Caffe is not a specific digit 0, 1, 2, 3, or 4...it's 5 numbers, each associated with the likelihood of one of the 5 classes. When classifying, one just takes the class with the highest value as the 'winner'.
I'll try to give an example...
say I have a 3 class problem, I train a network with a 3 unit softmax.
the first unit represents the first class, second the second and third, third.
Say I feed a test case through and get...
0.25, 0.5, 0.25 ...0.5 is the highest, so a classifier would say "2". this is the softmax output...it makes sure the sum of the output units is one.
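The same 3-class example in a few lines of plain Python, as a sketch with made-up helper names. It also shows the label shift from 1..K to 0..K-1 and the log-loss for one test case.

```python
import math


def to_zero_based(labels):
    """Shift labels 1..K down to 0..K-1."""
    return [label - 1 for label in labels]


def winner(probs):
    """Index of the most likely class."""
    return max(range(len(probs)), key=probs.__getitem__)


def log_loss(probs, label):
    """Negative log-probability assigned to the true (zero-based) class."""
    return -math.log(probs[label])


probs = [0.25, 0.5, 0.25]              # softmax output from the example above
predicted_index = winner(probs)        # zero-based index of the winner
predicted_class = predicted_index + 1  # "2" when counting classes from 1
loss_if_true_class_is_2 = log_loss(probs, to_zero_based([2])[0])
```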
You should have a look at ordinal (logistic) regression. This is the formal solution to the problem setup you describe (do not use plain regression, as its distance measures of errors are wrong).
https://stats.stackexchange.com/questions/140061/how-to-set-up-neural-network-to-output-ordinal-data
In particular I recommend looking at Coral ordinal regression implementation at
https://github.com/ck37/coral-ordinal/issues.
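For intuition, here is a pure-Python sketch of the extended binary encoding that ordinal methods such as CORAL build on: a label among K ordered classes becomes K-1 yes/no targets of the form "is the label greater than j?". This is not the API of the linked package, and it leaves out CORAL's shared-weight constraint; all names below are made up for the example.

```python
import math


def bce_with_logits(z, y):
    """Binary cross-entropy computed from the raw logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def ordinal_encode(label, num_classes):
    """Label k (0-based) -> K-1 binary targets answering 'is label > j?'."""
    return [1.0 if label > j else 0.0 for j in range(num_classes - 1)]


def ordinal_decode(logits):
    """Predicted label = number of thresholds the model says are exceeded."""
    return sum(1 for z in logits if z > 0.0)


def ordinal_loss(logits, label):
    """Sum of the K-1 binary cross-entropies."""
    targets = ordinal_encode(label, len(logits) + 1)
    return sum(bce_with_logits(z, y) for z, y in zip(logits, targets))


def confident_logits(label, num_classes, scale=4.0):
    """Hypothetical helper: logits of a model that confidently predicts `label`."""
    return [scale if label > j else -scale for j in range(num_classes - 1)]


true_label = 2
near_miss = ordinal_loss(confident_logits(3, 5), true_label)  # off by one
far_miss = ordinal_loss(confident_logits(0, 5), true_label)   # off by two
```

Unlike a plain softmax with log-loss, which treats every wrong class the same, this loss grows with the number of thresholds the prediction gets wrong, so near_miss comes out smaller than far_miss.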