How to deal with ordinal labels in keras? - deep-learning

I have data with integer target class in the range 1-5 where one is the lowest and five the highest. In this case, should I consider it as regression problem and have one node in the output layer?
My way of handling it is:
1- first I convert the labels to binary class matrix
labels = to_categorical(np.asarray(labels))
2- in the output layer, I have five nodes
main_output = Dense(5, activation='sigmoid', name='main_output')(x)
3- I use 'categorical_crossentropy with mean_squared_error when compiling
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['mean_squared_error'],loss_weights=[0.2])
Also, can anyone tells me: what is the difference between using categorical_accuracy and 'mean_squared_error in this case?

Regression and classification are vastly different things. If you reimagine this as a regression task than the difference of predicting 2 when the ground truth is 4 will be rated more than if you predict 3 instead of 4. If you have class like car, animal, person you do not care for the ranking between those classes. Predicting car is just as wrong as animal, iff the image shows a person.
Metrics do not impact your learning at all. It is just something that is computed additionally to the loss to show the performance of the model. Here the accuracy makes sense, because this is mostly the metric that we care about. Mean squared error does not tell you how well your model performs. If you get something like 0.0015 mean squared error it sounds good, but it is hard to visualize just how well this performs. In contrast using accuracy and achieving 95% accuracy for example is meaningful.
One last thing you should use softmax instead of sigmoid as your final output to get a probability distribution in your final layer. Softmax will output percentages for every class that sum up to 1. Then crossentropy calculates the difference of the probability distribution of your network output and the ground truth.

Related

Would this be a valid Implementation of an ordinal CrossEntropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is say "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require distance_weight of [3,2,1,2,3], the sum of which is 11.
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1,max_place]. This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.)
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

Using Softmax Activation function after calculating loss from BCEWithLogitLoss (Binary Cross Entropy + Sigmoid activation)

I am going through a Binary Classification tutorial using PyTorch and here, the last layer of the network is torch.Linear() with just one neuron. (Makes Sense) which will give us a single neuron. as pred=network(input_batch)
After that the choice of Loss function is loss_fn=BCEWithLogitsLoss() (which is numerically stable than using the softmax first and then calculating loss) which will apply Softmax function to the output of last layer to give us a probability. so after that, it'll calculate the binary cross entropy to minimize the loss.
loss=loss_fn(pred,true)
My concern is that after all this, the author used torch.round(torch.sigmoid(pred))
Why would that be? I mean I know it'll get the prediction probabilities in the range [0,1] and then round of the values with default threshold of 0.5.
Isn't it better to use the sigmoid once after the last layer within the network rather using a softmax and a sigmoid at 2 different places given it's a binary classification??
Wouldn't it be better to just
out = self.linear(batch_tensor)
return self.sigmoid(out)
and then calculate the BCE loss and use the argmax() for checking accuracy??
I am just curious that can it be a valid strategy?
You seem to be thinking of the binary classification as a multi-class classification with two classes, but that is not quite correct when using the binary cross-entropy approach. Let's start by clarifying the goal of the binary classification before looking at any implementation details.
Technically, there are two classes, 0 and 1, but instead of considering them as two separate classes, you can see them as opposites of each other. For example, you want to classify whether a StackOverflow answer was helpful or not. The two classes would be "helpful" and "not helpful". Naturally, you would simply ask "Was the answer helpful?", the negative aspect is left off, and if that wasn't the case, you could deduce that it was "not helpful". (Remember, it's a binary case, there is no middle ground).
Therefore, your model only needs to predict a single class, but to avoid confusion with the actual two classes, that can be expressed as: The model predicts the probability that the positive case occurs. In context of the previous example: What is the probability that the StackOverflow answer was helpful?
Sigmoid gives you values in the range [0, 1], which are the probabilities. Now you need to decide when the model is confident enough for it to be positive by defining a threshold. To make it balanced, the threshold is 0.5, therefore as long as the probability is greater than 0.5 it is positive (class 1: "helpful") otherwise it's negative (class 0: "not helpful"), which is achieved by rounding (i.e. torch.round(torch.sigmoid(pred))).
After that the choice of Loss function is loss_fn=BCEWithLogitsLoss() (which is numerically stable than using the softmax first and then calculating loss) which will apply Softmax function to the output of last layer to give us a probability.
Isn't it better to use the sigmoid once after the last layer within the network rather using a softmax and a sigmoid at 2 different places given it's a binary classification??
BCEWithLogitsLoss applies Sigmoid not Softmax, there is no Softmax involved at all. From the nn.BCEWithLogitsLoss documentation:
This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.
By not applying Sigmoid in the model you get a more numerically stable version of the binary cross-entropy, but that means you have to apply the Sigmoid manually if you want to make an actual prediction outside of training.
[...] and use the argmax() for checking accuracy??
Again, you're thinking of the multi-class scenario. You only have a single output class, i.e. output has size [batch_size, 1]. Taking argmax of that, will always give you 0, because that is the only available class.

How to perform multi labeling classification (for CNN)?

I am currently looking into multi-labeling classification and I have some questions (and I couldn't find clear answers).
For the sake of clarity let's take an example : I want to classify images of vehicles (car, bus, truck, ...) and their make (Audi, Volkswagen, Ferrari, ...).
So I thought about training two independant CNN (one for the "type" classification and one fore the "make" classifiaction) but I thought it might be possible to train only one CNN on all the classes.
I read that people tend to use sigmoid function instead of softmax to do that. I understand that sigmoid does not sum up to 1 like softmax does but I dont understand in what doing that enables to do multi-labeling classification ?
My second question is : Is it possible to take into account that some classes are completly independant ?
Thridly, in term of performances (accuracy and time to give the classification for a new image), isn't training two independant better ?
Thank you for those who could give my some answers or some ideas :)
Softmax is a special output function; it forces the output vector to have a single large value. Now, training neural networks works by calculating an output vector, comparing that to a target vector, and back-propagating the error. There's no reason to restrict your target vector to a single large value, and for multi-labeling you'd use a 1.0 target for every label that applies. But in that case, using a softmax for the output layer will cause unintended differences between output and target, differences that are then back-propagated.
For the second part: you define the target vectors; you can encode any sort of dependency you like there.
Finally, no - a combined network performs better than the two halves would do independently. You'd only run two networks in parallel when there's a difference in network layout, e.g. a regular NN and CNN in parallel might be viable.

Layer probability always 1

I'm using caffe/example/mnist network to classify numbers. When I give the network a picture of number it seems working ok. But when I give the network a picture not a number, the mnist trained network softmax layer gives the probabilities, which always has one probability 1 and others 0, like:
[0,0,0,...,1,0,0,0].
I think it should be something like:
[0,0.1,0.2,...,0.4,0.1,0.2],
in which case I can say that this should not be a number. What is the problem?
It is hard to know what to expect since it hasn't been trained on non-numbers and it has to give a result that sums to 1. By using Softmax, you are telling the network that there is a number, while showing it a non-number. You cannot look at its output to then determine whether it is a number or not.
Furthermore, the training data for MNIST is very stereotypical and not good for generalization. The foreground numbers are always 255 and the background is always 0. The average value will be much closer to 0, since there are more background pixels. Simply presenting an image with a mean pixel value of 100 could bias the prediction to numbers that typically have more pixels (like 8, perhaps). You can only expect to the network to generalize to similar types of stimulus. For your task, you should do quite a lot of data augmentation.
You want to allow the probability to be zeros for all numbers, which you can do by using the cross-entropy loss. This would also allow the probability to also sum to more than 1 (maximum 10). You could also try adding another class for "non-number" with the Softmax, however, then you should present non-number stimuli and number stimuli that is more similar to natural stimuli (so that it is not trivially separable).

Loss function for ordinal target on SoftMax over Logistic Regression

I am using Pylearn2 OR Caffe to build a deep network. My target is ordered nominal. I am trying to find a proper loss function but cannot find any in Pylearn2 or Caffe.
I read a paper "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels" . I get the general idea - but I am not sure I understand what will the thresholds be, if my final layer is a SoftMax over Logistic Regression (outputting probabilities).
Can some help me by pointing to any implementation of such a loss function ?
Thanks
Regards
For both pylearn2 and caffe, your labels will need to be 0-4 instead of 1-5...it's just the way they work. The output layer will be 5 units, each is a essentially a logistic unit...and the softmax can be thought of as an adaptor that normalizes the final outputs. But "softmax" is commonly used as an output type. When training, the value of any individual unit is rarely ever exactly 0.0 or 1.0...it's always a distribution across your units - which log-loss can be calculated on. This loss is used to compare against the "perfect" case and the error is back-propped to update your network weights. Note that a raw output from PL2 or Caffe is not a specific digit 0,1,2,3, or 5...it's 5 number, each associated to the likelihood of each of the 5 classes. When classifying, one just takes the class with the highest value as the 'winner'.
I'll try to give an example...
say I have a 3 class problem, I train a network with a 3 unit softmax.
the first unit represents the first class, second the second and third, third.
Say I feed a test case through and get...
0.25, 0.5, 0.25 ...0.5 is the highest, so a classifier would say "2". this is the softmax output...it makes sure the sum of the output units is one.
You should have a look at ordinal (logistic) regression. This is the formal solution to the problem setup you describe ( do not use plain regression as the distance measures of errors are wrong).
https://stats.stackexchange.com/questions/140061/how-to-set-up-neural-network-to-output-ordinal-data
In particular I recommend looking at Coral ordinal regression implementation at
https://github.com/ck37/coral-ordinal/issues.