Why don't people use different activation functions at each layer of an artificial neural network? - deep-learning

Why don't people use different activation functions at each layer of an artificial neural network? For example, an ANN with sigmoid as the activation function for its first layer, rectified linear for its second layer, and softmax for its third. Or even just a combination of two of them, alternating every other layer, or the first 10 layers being sigmoid and the last 10 being rectified linear. Is it simply that the ANN would not work as well as with a single activation function? Can someone explain why this is not standard practice when building ANNs?
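Just to make it concrete, this is the kind of model I mean, as a rough PyTorch sketch (the layer sizes are arbitrary, purely for illustration):

```python
import torch.nn as nn

# Rough sketch of the kind of network I mean (layer sizes are arbitrary):
# a different activation after each linear layer.
model = nn.Sequential(
    nn.Linear(20, 64), nn.Sigmoid(),       # first layer: sigmoid
    nn.Linear(64, 64), nn.ReLU(),          # second layer: rectified linear
    nn.Linear(64, 10), nn.Softmax(dim=1),  # third layer: softmax over 10 classes
)
```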

Related

What do activation layers learn?

I am trying to figure out what my CNN learns after each activation layer, so I have written some code to visualize a few activation layers in my model. I used LeakyReLU as my activation layer. This is the figure: [LeakyReLU after Conv2d + BatchNorm]
As can be seen from the figure, there are quite a few purple frames that show nothing. So my question is: what does this mean? Does my model learn anything?
Generally speaking, activation layers (AL) don't learn.
The purpose of an AL is to add non-linearity to the model, hence they usually apply a certain fixed function regardless of the data, without adapting to it. For example:
Max pool: take the highest number in the region
Sigmoid/Tanh: put all the numbers through a fixed computation
ReLU: take the max between each number and 0
I tried to simplify the math, so pardon my inaccuracies.
To close, your purple frames are probably filters that haven't learned anything just yet. Train the model to convergence, and unless your model is highly bloated (too big for your data) you will see 'structures' in your filters.
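If it helps, here is a minimal PyTorch check (layer sizes are arbitrary) showing that typical activation layers carry no learnable parameters, unlike the convolution and batch-norm layers around them:

```python
import torch.nn as nn

# Layers around an activation do learn (they have weights), while the
# activation itself applies a fixed function and has no parameters at all.
for layer in [nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.LeakyReLU(0.1), nn.Sigmoid()]:
    n_params = sum(p.numel() for p in layer.parameters())
    print(type(layer).__name__, "learnable parameters:", n_params)
# Conv2d: 448, BatchNorm2d: 32, LeakyReLU: 0, Sigmoid: 0
```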

Deep reinforcement learning for similar observations that need totally different actions: how to solve it?

For DRL using neural networks, like DQN, if a task requires totally different actions for similar observations, will the NN show a weakness here? Will two nearby inputs to the NN generate similar outputs? If so, can it not produce the distinction the task needs?
For instance:
the agent can choose a discrete action from [A,B,C,D,E], and the observation is the state of a set of plugs given as a binary list [0,0,0,0,0,0,0].
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. Those two observations are very close in distance, so the DQN may not easily learn the proper action? How can this be solved?
One more thing:
One-hot encoding is a way to increase the distance between observations. It is also a common and useful technique in many supervised learning tasks. But one-hot encoding will also increase the dimensionality heavily.
Will two nearby inputs to the NN generate similar outputs?
Artificial neural networks are, by nature, non-linear function approximators, meaning that for two similar inputs the outputs can be very different.
You might get an intuition for this from the example of two very similar pictures (one of them just has some light noise added) producing very different predictions from a model.
For observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] they are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. Those two observations are very close in distance, so the DQN may not easily learn the proper action?
I see no problem with this example; a properly trained NN should be able to map the desired action for both inputs. Furthermore, in your example the input vectors contain binary values, and a single difference in these vectors (i.e. a Hamming distance of 1) is big enough for the neural net to classify them properly.
Also, the non-linearity in neural networks comes from the activation functions. Hope this helps!
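To make this concrete, here is a minimal PyTorch sketch (architecture and hyper-parameters are arbitrary choices) of a tiny network that learns different actions for your two observations, even though they differ in only one bit:

```python
import torch
import torch.nn as nn

# Two observations with Hamming distance 1, mapped to different actions
# (A -> class 0, D -> class 3 out of the five actions A..E).
X = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 0]], dtype=torch.float32)
y = torch.tensor([0, 3])

net = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 5))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                 # tiny training loop, enough for this toy case
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

print(net(X).argmax(dim=1))          # typically tensor([0, 3]): different actions for near-identical inputs
```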

What are the disadvantages of Leaky-ReLU?

We use ReLU instead of the sigmoid activation function since it avoids the vanishing and exploding gradient problems that affect sigmoid-like activation functions.
Leaky-ReLU is one of ReLU's improvements. Everyone talks about the advantages of Leaky-ReLU, but what are the disadvantages of Leaky-ReLU?
ReLU replaced sigmoid in the hidden layers since it yields better results for general-purpose applications, but it really depends on your case, and another activation function might work better. Leaky ReLU helps with the vanishing gradient problem.
I think the main disadvantage of Leaky ReLU is that you have another parameter to tune: the slope. But again, which function works better really depends on your problem.
The advantage:
Leaky ReLU is "immortal".
If you play around enough with your ReLU neural network, some neurons are going to die (especially with L1 or L2 regularization). Detecting dead neurons is hard; fixing them is even harder.
The disadvantage:
You add computational work on every epoch (it is harder to multiply by a small slope than to assign a zero).
Depending on the job, you may need a few more epochs to reach convergence.
The slope at negative z is another parameter, but not a very critical one.
Once you reach small learning rates, a dead neuron tends to remain dead.
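A minimal PyTorch sketch of the slope point (the slope value 0.01 is just an example): for a negative pre-activation, ReLU passes back a zero gradient, while Leaky ReLU passes back its slope, which is why a dead ReLU unit tends to stay dead.

```python
import torch
import torch.nn as nn

z = torch.tensor([-2.0], requires_grad=True)
nn.ReLU()(z).sum().backward()
print(z.grad)    # tensor([0.])     -> no gradient flows back, the unit cannot recover

z = torch.tensor([-2.0], requires_grad=True)
nn.LeakyReLU(negative_slope=0.01)(z).sum().backward()
print(z.grad)    # tensor([0.0100]) -> a small gradient still flows; the slope is the knob to tune
```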

Predicting rare events and their strength with LSTM autoencoder

I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first training an LSTM autoencoder to extract features, and then using the embeddings in a second LSTM that makes the actual prediction. According to the authors, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've created the model, but instead of adding one LSTM from the embeddings to the predictions, I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong the event will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
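In rough PyTorch terms, the setup looks like this (all layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class RareEventModel(nn.Module):
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)   # autoencoder part
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.reconstruct = nn.Linear(hidden, n_features)
        self.event_lstm = nn.LSTM(hidden, hidden, batch_first=True)    # binary head
        self.event_out = nn.Linear(hidden, 1)
        self.strength_lstm = nn.LSTM(hidden, hidden, batch_first=True) # strength head
        self.strength_out = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, n_features)
        emb, _ = self.encoder(x)                 # embeddings
        dec, _ = self.decoder(emb)
        recon = self.reconstruct(dec)            # reconstruction -> MSE loss
        e, _ = self.event_lstm(emb)
        p_event = torch.sigmoid(self.event_out(e[:, -1]))   # event probability -> binary cross-entropy loss
        s, _ = self.strength_lstm(emb)
        strength = self.strength_out(s[:, -1])               # strength -> MSE loss
        return recon, p_event, strength

model = RareEventModel()
recon, p_event, strength = model(torch.randn(8, 30, 1))      # dummy batch of 8 series, length 30
```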
The thing is that I'm not sure it is learning anything… the binary loss stays at 0.5, and even the reconstruction loss is not very good. And of course, the bad thing is that the time series is mostly zeros, with some values from 1 to 10, so MSE is definitely not a good metric.
What do you think about this approach?
Is this a good architecture for predicting rare events, or would another one be better?
Should I add some CNN or FC layers on top of the embeddings, before the two prediction LSTMs, to extract 1D patterns from the embedding, or should the embeddings go directly into the prediction LSTMs?
Should there be just one prediction LSTM, using only the MSE loss?
Would it be a good idea to multiply the two predictions so that, in both cases, the days predicted to have no event coincide?
Thanks,

Does ResNet have fully connected layers?

In my understanding, a fully connected layer (FC for short) is used for prediction.
For example, VGGNet uses two FC layers, both of dimension 4096. The last layer, which feeds the softmax, has the same dimension as the number of classes: 1000.
But ResNet uses global average pooling and takes the pooled result of the last convolutional layer as the input.
Yet it still has an FC layer! Is this layer really an FC layer, or is it only there to turn the input into a feature vector whose length equals the number of classes? Does this layer contribute to the prediction?
In a word, how many FC layers do ResNet and VGGNet have? Do VGGNet's 1st, 2nd, and 3rd FC layers have different functions?
VGG has three FC layers: two with 4096 neurons and one with 1000 neurons, which outputs the class probabilities.
ResNet has only one FC layer, with 1000 neurons, which again outputs the class probabilities. In an NN classifier, the standard choice is to end with a softmax; some authors make this explicit in the diagram while others do not.
In essence, the people at Microsoft (ResNet) favor more convolutional layers over fully connected ones and therefore omit the large fully connected layers. Global average pooling also decreases the feature size dramatically and therefore reduces the number of parameters going from the convolutional part to the fully connected part.
I would argue that the performance difference is quite slim, but one of the main accomplishments of introducing ResNets is the dramatic reduction in parameters, and those two choices helped them accomplish that.
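If you want to check this yourself, here is a quick sketch (assuming a recent torchvision) that prints the classifier heads of both networks:

```python
import torchvision.models as models

resnet = models.resnet18(weights=None)   # architecture only, no pretrained weights
vgg = models.vgg16(weights=None)

print(resnet.fc)          # a single Linear(512, 1000) sitting after global average pooling
print(vgg.classifier)     # three Linear layers: 25088 -> 4096 -> 4096 -> 1000 (plus ReLU/Dropout)
```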