Is it possible to subset a classification task in mlr keeping the positive/negative class ratio unchanged? - mlr

In order to make small tests on a large machine learning classification task in mlr, I would like to create small tasks first that maintain the positive/negative ratio of the original task.
Currently I am doing this manually using the function subsetTask setting the argument subset to a fixed index vector that preserves the class ratio.
Is there any way to do this internally? Something like "Take 75% of this task, preserving the class ratio". Maybe using a resampling instance?
Thanks!

The function downsample(my_task, perc=0.05, stratify=TRUE) should be what you're looking for:
https://mlr.mlr-org.com/reference/downsample.html
Setting the argument stratify to TRUE (it defaults to FALSE) keeps the class ratios of the original data.
Does that help?


Would this be a valid implementation of an ordinal cross-entropy?

Would this be a valid implementation of a cross-entropy loss that takes the ordinal structure of the ground-truth (GT) labels y into consideration? y_hat is the prediction from a neural network.
import torch
import torch.nn.functional as F
ce_loss = F.cross_entropy(y_hat, y, reduction="none")    # per-example CE; y_hat: (N, C) logits, y: (N,) integer labels
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1     # ordinal distance between predicted and true class
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)  # distance-weighted mean loss
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth, etc.) and I would like to predict the class of each data example from among this set. I would like to define an entropy-based loss function for this problem. I would like this loss function to weight the loss between a predicted class torch.argmax(y_hat) and the true class y according to the ordinal distance between the two classes. Does the given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples near the edges of the class range will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is, say, "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1, 2, 3, 4, 5. A true label of 1 gives a distance_weight of [1,2,3,4,5], which sums to 15; a true label of 3 gives [3,2,1,2,3], which sums to 11.)
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with a somewhat hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car] and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information, which is outside the scope of this answer...)
Long answer: I'd highly suggest moving to a representation where the ordinal value you care about is expressed as a continuous quantity. (In the first/second/third example, you might for instance output a continuous value over the range [1, max_place].) This allows you to benefit from loss functions that already capture the notion that predictions closer in an ordered space are better than predictions farther away (e.g. MSE, Smooth-L1, etc.)
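As a rough sketch of that suggestion (all names and sizes below are invented for illustration), a PyTorch setup could regress a single continuous "place" value and train it with an ordered-space loss:
import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # toy stand-in for your network
criterion = nn.SmoothL1Loss()                    # already rewards predictions that are close in the ordered space
x = torch.randn(8, 16)                           # dummy batch of 8 examples
y_place = torch.randint(1, 6, (8, 1)).float()    # true places 1..5, treated as continuous targets
loss = criterion(model(x), y_place)
loss.backward()
# At inference time, round (or rank) the continuous outputs to recover discrete places.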
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

Training out false positives in object detection

This is my first foray into the world of object recognition. I have successfully trained a YOLO model with images that I found on Google and annotated myself in CVAT.
My questions are as follows.
a) How do I train the model to ignore some special variant that I am specifically NOT interested in detecting? Say I am getting false positives because something looks similar to one of my objects, and I want to train so that these are not detected. Does it work to simply include images that contain the unwanted object in the training set, without annotating it?
b) If so, am I right in assuming that if I train on annotated images that have somehow missed occasional instances of desired objects, I am effectively telling the training engine that I'm not interested in those objects? In other words, is it therefore BAD if images don't have every single instance of the desired objects annotated?
c) If I happen to include an image in my training set with an empty annotation file, and there are desired objects in that image, does that effectively disincentivize the training engine from finding those in future?
Thanks for any thoughts.
a) This is true. During training, the model treats the space inside a bounding box as a positive example of that class, and the space outside the boxes as negative for that class.
b) See a, this is indeed the case.
c) Empty annotation files will be used during training, but the model will train on that image as a 'background' class, so these are negatives too.
So, in short, annotate every instance of the classes you care about, and consider adding 'background images' as negative examples to discourage those false positives.
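To make that concrete, assuming the common Darknet/Ultralytics YOLO label format (one .txt file per image, each line being "class x_center y_center width height" with coordinates normalized to [0, 1]; the file names below are made up):
labels/street_scene.txt (two annotated objects of class 0):
0 0.512 0.430 0.210 0.180
0 0.130 0.655 0.095 0.120
labels/lookalike_only.txt (left completely empty: the image is used purely as a background/negative example)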

How to perform multi-label classification (for a CNN)?

I am currently looking into multi-label classification and I have some questions (I couldn't find clear answers).
For the sake of clarity let's take an example: I want to classify images of vehicles (car, bus, truck, ...) and their make (Audi, Volkswagen, Ferrari, ...).
So I thought about training two independent CNNs (one for the "type" classification and one for the "make" classification), but I thought it might be possible to train only one CNN on all the classes.
I read that people tend to use a sigmoid function instead of softmax to do that. I understand that sigmoid does not sum to 1 like softmax does, but I don't understand how that enables multi-label classification.
My second question is: is it possible to take into account that some classes are completely independent?
Thirdly, in terms of performance (accuracy and time to classify a new image), isn't training two independent networks better?
Thank you to anyone who can give me some answers or ideas :)
Softmax is a special output function; it forces the output vector to have a single large value. Now, training neural networks works by calculating an output vector, comparing that to a target vector, and back-propagating the error. There's no reason to restrict your target vector to a single large value, and for multi-labeling you'd use a 1.0 target for every label that applies. But in that case, using a softmax for the output layer will cause unintended differences between output and target, differences that are then back-propagated.
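As a small sketch of the multi-hot target idea above (all sizes and indices here are made up for illustration), a per-class sigmoid loss in PyTorch could look like this:
import torch
import torch.nn as nn
n_classes = 8                                   # e.g. all "type" and "make" labels pooled together
logits = torch.randn(4, n_classes)              # raw network outputs for a batch of 4 images
target = torch.zeros(4, n_classes)
target[0, [1, 5]] = 1.0                         # image 0 is both e.g. "car" and "Audi": two 1.0 entries
loss = nn.BCEWithLogitsLoss()(logits, target)   # per-class sigmoid + binary cross-entropy
# At inference, threshold torch.sigmoid(logits) > 0.5 independently for each class.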
For the second part: you define the target vectors; you can encode any sort of dependency you like there.
Finally, no - a combined network performs better than the two halves would do independently. You'd only run two networks in parallel when there's a difference in network layout, e.g. a regular NN and CNN in parallel might be viable.

What does Caffe do with the mean binary file?

In the Caffe input layer one can define a mean image that holds the mean values of all the images used. From the ImageNet example: "The model requires us to subtract the image mean from each image, so we have to compute the mean".
My question is: what is the implementation of this subtraction? Is it simply:
used_image = original_image - mean_image
or
used_image = mean_image - original_image
or
used_image = |original_image - mean_image|^2
If it is one of the first two, then how are negative pixels handled? Since the pictures are usually stored as uint8, the value would simply wrap around, e.g.
200 - 255 = 201
Why do I need to know this? I ran tests and found that the second or the third version would work better.
It's the first one, a trivial normalization step. Using the second instead wouldn't really matter: the weights would invert.
There are no "negative pixels", per se: the values are simply numeric input to the matrix operations. You are welcome to interpret this as a visual alteration of some sort, but the arithmetic doesn't care.
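For illustration, a small numpy sketch of the first form (toy values); as far as I know the data is converted to a float type before the mean is subtracted, so uint8 wrap-around never enters into it:
import numpy as np
original_image = np.array([[200, 10]], dtype=np.uint8)   # toy 1x2 "image"
mean_image     = np.array([[255, 50]], dtype=np.uint8)
used_image = original_image.astype(np.float32) - mean_image.astype(np.float32)
print(used_image)   # [[-55. -40.]] -- negative values are perfectly fine once the data is float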

How can a Fourier transform ever be used for extrapolation?

Please excuse a clueless newbie question.
Since a discrete Fourier Transform on a fixed interval is treated as repeating indefinitely, how can it ever be used to extrapolate a time series? What follows the end of the interval will always be identical to the beginning.
Even a simple least square fit would at least give a trend.
How can all that cycle information in an FT be useless for extrapolation?
How? By changing your initial assumption.
One does not need to assume that the input to a DFT repeats indefinitely, exactly periodic in the aperture width. Assuming that the input is a rectangular window upon a longer stationary sequence, which may or may not be periodic within the DFT aperture, is also a valid assumption, and one commonly used to interpolate/estimate "between bin" spectra.
(e.g. if the DFT result looks exactly like offset samples of a Sinc function corresponding to the window width, one could assume that this is the result of a rectangular window upon a single low-degree-of-freedom oscillator, or of extreme luck, or of an alien intelligence that just happens to order all N bins in just such an interesting pattern. Occam's razor may or may not suggest that the former is the better assumption, depending on your model of the input.)
Extending interpolated "between bin" or estimated non-periodic-in-aperture spectra (e.g. after deconvolving the assumed Sinc distortion caused by the rectangular window) beyond the end of the DFT aperture/window may allow extrapolating data not identical with the beginning of the DFT aperture/window.
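For illustration, a rough numpy sketch of that idea under the simplest possible assumptions (a single stationary sinusoid seen through a rectangular window; the parabolic peak interpolation is only an approximation, and all the numbers are made up):
import numpy as np
fs, n = 100.0, 128                               # sample rate and aperture length
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 7.3 * t + 0.4)            # 7.3 Hz: deliberately NOT periodic in the aperture
spec = np.fft.rfft(x)
k = int(np.argmax(np.abs(spec)))                 # coarse peak bin
a, b, c = np.log(np.abs(spec[k - 1:k + 2]))      # parabolic interpolation of the "between bin" peak
delta = 0.5 * (a - c) / (a - 2 * b + c)
f_est = (k + delta) * fs / n
# least-squares fit of amplitude/phase at the estimated frequency, then extend past the window
basis = np.column_stack([np.cos(2 * np.pi * f_est * t), np.sin(2 * np.pi * f_est * t)])
coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
t_future = np.arange(n, 2 * n) / fs              # samples beyond the end of the DFT aperture
x_future = coef[0] * np.cos(2 * np.pi * f_est * t_future) + coef[1] * np.sin(2 * np.pi * f_est * t_future)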