Can a neural network having non-linear activation function (say ReLU) be used for linear classification task? - deep-learning

I think the answer would be yes, but I'm unable to reason out a good explanation on this.

The mathematical argument lies in a power to represent linearity, we can use following three lemmas to show that:
Lemma 1
With affine transformations (linear layer) we can map the input hypercube [0,1]^d into arbitrary small box [a,b]^k. Proof is quite simple, we can just make all the biases to be equal to a, and make weights multiply by (b-a).
Lemma 2
For sufficiently small scale, many non-linearities are approximately linear. This is actually very much a definition of a derivative, or, taylor expansion. In particular let us take relu(x), for x>0 it is in fact, linear! What about sigmoid? Well if we look at a tiny tiny region [-eps, eps] you can see that it approaches a linear function as eps->0!
Lemma 3
Composition of affine functions is affine. In other words, if I were to make a neural network with multiple linear layers, it is equivalent of having just one. This comes from the matrix composition rules:
W2(W1x + b1) + b2 = W2W1x + W2b1 + b2 = (W2W1)x + (W2b1 + b2)
------ -----------
New weights New bias
Combining the above
Composing the three lemmas above we see that with a non-linear layer, there always exists an arbitrarily good approximation of the linear function! We simply use the first layer to map entire input space into the tiny part of the pre-activation spacve where your linearity is approximately linear, and then we "map it back" in the following layer.
General case
This is a very simple proof, now in general you can use Universal Approximation Theorem to show that a non-linear neural network (Sigmoid, Relu, many others) that is sufficiently large, can approximate any smooth target function, which includes linear ones. This proof (originally given by Cybenko) is however much more complex and relies on showing that specific classes of functions are dense in the space of continuous functions.

Technically, yes.
The reason you could use a non-linear activation function for this task is that you can manually alter the results. Let's say the range the activation function outputs is between 0.0-1.0, then you can round up or down to get a binary 0/1. Just to be clear, rounding up or down isn't linear activation, but for this specific question the purpose of the network was for classification, where some kind of rounding has to be applied.
The reason you shouldn't is the same reason that you shouldn't attach an industrial heater to a fan and call it a hair-drier, it's unnecessarily powerful and it could potentially waste resources and time.
I hope this answer helped, have a good day!

Related

Bayesian Gamma regression, what is the correct link function?

I'm trying to do a bayesian gamma regression with stan.
I know the correct link function is the inverse canonical link,
but if i dont use a log link parameters can be negative, and enter in a gamma distribution with a negative value, that obviously can't be possible.
how can i deal with it?
parameters {
vector[K] betas; //the regression parameters
real beta0;
real<lower=0.001, upper=100 > shape; //the variance parameter
}
transformed parameters {
vector<lower=0>[N] eta; //the expected values (linear predictor)
vector[N] alpha; //shape parameter for the gamma distribution
vector[N] beta; //rate parameter for the gamma distribution
eta <- beta0 + X*betas; //using the log link
}
model {
beta0 ~ normal( 0 , 2^2 );
for(i in 2:K)
betas[i] ~ normal( 0 , 2^2 );
y ~ gamma(shape,shape * eta);
}
I was struggling with this a couple weeks ago, and while I don't consider this a definitive answer, hopefully it is still helpful. For what it's worth, McCullagh and Nelder directly acknowledge this inappropriate support of the canonical link function. They advise that one must constrain the betas to properly match the support. Here's the relevant passage
The canonical link function yields sufficient statistics which are linear functions of the data and it is given by η = 1/μ. Unlike the canonical links for the Poisson and binomial distributions, the reciprocal transformation, which is often interpretable as the rate of a process, does not map the range of μ onto the whole real line. Thus the requirement that η > 0 implies restrictions on the βs in any linear model. Suitable precautions must be taken in computing β_hat so that negative values of μ_hat are avoided.
-- McCullagh and Nelder (1989). Generalized Linear Models. p. 291
It depends on your X values, but as far as I can tell (please correct me someone!) in an MCMC-based Bayesian case, you can achieve this by either using a truncated prior on the betas or a strong enough prior on your intercept to make the inappropriate regions numerically impossible to reach.
In my case, I ultimately used an identity link with a strong positive prior intercept and that was sufficient and yielded reasonable results.
Also, the choice of link really depends on your X. As the passage above implies, the use of the canonical link assumes that your linear model is in rate space. Using log or identity link functions appear to be also very common, and ultimately it's about providing a space that offers a sufficient span for the linear function to capture the response.

Is it possible to implement a loss function that prioritizes the correct answer being in the top k probabilities?

I am working on an multi-class image recognition problem. The task is to have the correct answer being in the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes the correct answer being in the top K and doesn't penalize much in between these top K.
This can be achieved by class-weighted cross-entropy loss, which essentially assigns the weight to the errors associated with each class. This loss is used in research, e.g. see the paper "Multi-task learning and Weighted Cross-entropy for DNN-based Keyword" by S. Panchapagesan at al. Before computing the cross-entropy, you can check if the predicted distribution satisfies your condition (e.g., ground truth class is in top-k of the predicted classes) and assign the zero (or near zero) weights accordingly, if it does.
There are open questions though: when the correct class is in top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in top-3 -- is this prediction good enough or not? Should you care what exactly k-1 incorrect classes are?
All these question arise because this kind of loss doesn't have probabilistic interpretation, unlike standard cross-entropy. That's why I wouldn't recommend using it in practice, but reformulate the goal instead.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of one-hot (note that this totally agrees with probabilistic interpretation). It will force the network to learn features associated with all three good classes, and generally lead to the same goal, but with better control over target classes.

autoencoder only provides linear output

Hello Guys,
I am working right now on an Autoencoder reducing some simple 2D Data to 1D. The architecture is 2 - 10 - 1 - 10 - 2 Neurons/Layer. As Activation Function I use sigmoid in every layer but the output-layer, where I use the identity.
I am using the Accord.NET Framework to build that.
I am Pre-Training the Autoencoder with RBMs and CD-Algorithm, where I can change the initial weights, the learning rate, the momentum and the weight decay.
The Fine-Tuning is accomplished by backpropagation where I can configure the learning rate and the momentum.
The data is some artificially created shape and is marked green in the picture:
data + reconstruction
The reconstruction of the autoencoder is the yellow line. Which leads to my problem. Somehow the encoder is not able to create a non-linear shape as output.
Although I tested arround a lot and changed values a dozen times, I am not getting better results. Maybe someone here has an idea how I could find the problem.
Thanks!
Look in general any neural network is based on a linear representation for your feature against the output so what the net is actually doing (consider two features) is [w1*x1 + w2*x2 = output].
What you need to do to achieve a non-linear representation is to use extra feature(s) which is a non-linear representation of the old feature(s). Let's say for example use x1^2 as an extra feature or x2^2 or both of them. Hence, the net will give this global equation [w1*x1 + w2*x2 + w3*x1^2 = output] which in nature is a non-linear equation and then you can have a nonlinear representation.
The extra feature equation depends mainly on your data. I have used a quadratic equation in my example but it is not always the correct thing to do. Referring to
Your data I think you need to use a cos(x) or sin(x) representation.

How to perform multi labeling classification (for CNN)?

I am currently looking into multi-labeling classification and I have some questions (and I couldn't find clear answers).
For the sake of clarity let's take an example : I want to classify images of vehicles (car, bus, truck, ...) and their make (Audi, Volkswagen, Ferrari, ...).
So I thought about training two independant CNN (one for the "type" classification and one fore the "make" classifiaction) but I thought it might be possible to train only one CNN on all the classes.
I read that people tend to use sigmoid function instead of softmax to do that. I understand that sigmoid does not sum up to 1 like softmax does but I dont understand in what doing that enables to do multi-labeling classification ?
My second question is : Is it possible to take into account that some classes are completly independant ?
Thridly, in term of performances (accuracy and time to give the classification for a new image), isn't training two independant better ?
Thank you for those who could give my some answers or some ideas :)
Softmax is a special output function; it forces the output vector to have a single large value. Now, training neural networks works by calculating an output vector, comparing that to a target vector, and back-propagating the error. There's no reason to restrict your target vector to a single large value, and for multi-labeling you'd use a 1.0 target for every label that applies. But in that case, using a softmax for the output layer will cause unintended differences between output and target, differences that are then back-propagated.
For the second part: you define the target vectors; you can encode any sort of dependency you like there.
Finally, no - a combined network performs better than the two halves would do independently. You'd only run two networks in parallel when there's a difference in network layout, e.g. a regular NN and CNN in parallel might be viable.

Best techinique to approximate a 32-bit function using machine learning?

I was wondering which is the best machine learning technique to approximate a function that takes a 32-bit number and returns another 32-bit number, from a set of observations.
Thanks!
Multilayer perceptron neural networks would be worth taking a look at. Though you'll need to process the inputs to a floating point number between 0 and 1, and then map the outputs back to the original range.
There are several possible solutions to your problem:
1.) Fitting a linear hypothesis with least-squares method
In that case, you are approximating a hypothesis y = ax + b with the least squares method. This one is really easy to implement, but sometimes, a linear model is not good enough to fit your data. But - I would give this one a try first.
Good thing is that there is a closed form, so you can directly calculate parameters a and b from your data.
See Least Squares
2.) Fitting a non-linear model
Once seen that your linear model does not describe your function very well, you can try to fit higher polynomial models to your data.
Your hypothesis then might look like
y = ax² + bx + c
y = ax³ + bx² + cx + d
etc.
You can also use least squares method to fit your data, and techniques from the gradient descent types (simmulated annealing, ...). See also this thread: Fitting polynomials to data
Or, as in the other answer, try fitting a Neural Network - the good thing is that it will automatically learn the hypothesis, but it is not so easy to explain what the relation between input and output is. But in the end, a neural network is also a linear combination of nonlinear functions (like sigmoid or tanh functions).