What activation layers learn?

What activation layers learn? - deep-learning

I am trying to figure out what CNN architecture after every activation layers. Therefore, I have written a code to visualize some activation layers in my model. I used LeakyReLU as my activation layer. This is the figureLeakyRelu after Conv2d + BatchNorm
As can be seen from the figure, there are quite purple frames, which shows nothing. So my question is what does it mean. Does my model learn anything?

Generally speaking, activation layers (AL) don't learn.
The purpose of AL is to add non-linearity into the model, hence they usually apply a certain, fixed, function regardless of the data, without adapting with the data. As an example:
Max Pool: take the highest number in the region
Sigmoid/Tanh: put the all the numbers through a fixed computation
ReLU: takes the max between the numbers and 0
I tried to simplify the math, so pardon my inaccuracies.
As a closure, your purple frames are probably filters that didn't learn just yet, train the model to convergence and unless your model is highly bloated (too big for your data) your will see 'structures' in your filters.

Related

Ways to prevent underfitting and overfitting to when using data augmentation to train a transposed CNN

I'm training a CNN (one using a series of ConvTranspose2D in pytorch) that uses input data from JSON to constitute an image. Unlike natural language, the input data can be in any order, as it contains info about various sprites in a scene.
In my first attempts to train the model, I didn't change the order of the input data (meaning, on each epoch, each sprite was represented in the same place in the input data). The model learned for about 10 epochs, but then there started to be divergence between the training loss (which continued to go down) and the test loss. So classic overfitting.
I tried to solve this by doing a form of data augmentation where the output data (in this case an image) stayed the same but I shuffled the order of the input data. As I have around 400 sprites, the maximum shuffling is 400!, so theoretically this can vastly expand the amount of training data. For example, instead of 100k JSON documents corresponding to 100K images, by shuffling the order of sprites in the input data, you have 400!*100000 training data points. In practice of course this amount of data is impractical, so I went with around 2m data points for an initial test. The issue I ran into here was that the model was not learning at all - after getting to a certain loss very quickly (after the first few mini-batches), it didn't learn at all for around 4 epochs. So classic underfitting.
Like Goldilocks, I'd like to find "just right" between the initial overfitting and subsequent underfitting. I'm wondering other strategies I could try out. One idea I had was letting the model train on a predetermined order of sprites (the overfitting case) and then, once overfitting starts (ie two straight epochs with divergence between the test and training loss) shuffling the data. I can also play with changing the model, although it can only be so big because of constraints with the hardware and the fact that inference needs to happen in under 20ms.
Are there any papers or techniques that are recommended in this scenario where data augmentation can lead to vastly more data points but results in a model ceasing to learn? Thanks in advance for any tips!

Is normalization of word embeddings important?

I am doing actor-critic reinforcement learning for an environment that is best represented as a "bag-of-words". For this reason, I have opted to use a single body, multi-head approach for the net architecture. I use a linear pre-processing layer to generate n word embeddings of dimension d. Then I run the (batch,n,d) words through a stack of (2) nn.TransformerEncoder layers for the body and each head is another encoder layer followed by a linear logit layer.
Since this is RL and I have limited compute, it is also difficult to evaluate training as its happening. I decided to try looking a the mean cosine similarity of the latent words after the encoder body. My intuition tells me if the net is learning a proper latent representation of the environment then dis-similar words should have low cosine similarity.
However even though the net is clearly improving somewhat the mean cosine sim. remains very high, > .99
Thinking about it more, I don't think theres any reason to believe my first intuition, especially since I am not even normalizing the words after encoder body stack. But even if I did normalize, I'm not sure that would encourage lower cosine sim. as I am using 256 dimensions per word. All normalizing does is reduce the dimension of the output space by 1, which should hardly matter here.
Does this make sense? Also any general advice about my net is welcome

Regression problem getting much better results when dividing values by 100

I'm working on a regression problem in pytorch. My target values can be either between 0 to 100 or 0 to 1 (they represent % or % divided by 100).
The data is unbalanced, I have much more data with lower targets.
I've noticed that when I run the model with targets in the range 0-100, it doesn't learn - the validation loss doesn't improve, and the loss on the 25% large targets is very big, much bigger than the std in this group.
However, when I run the model with targets in the range 0-1, it does learn and I get good results.
If anyone can explain why this happens, and if using the ranges 0-1 is "cheating", that will be great.
Also - should I scale the targets? (either if I use the larger or the smaller range).
Some additional info - I'm trying to fine tune bert for a specific task. I use MSEloss.
Thanks!

I think your observation relates to batch normalization. There is a paper written on the subject, an numerous medium/towardsdatascience posts, which i will not list here. Idea is that if you have a no non-linearities in your model and loss function, it doesn't matter. But even in MSE you do have non-linearity, which makes it sensitive to scaling of both target and source data. You can experiment with inserting Batch Normalization Layers into your models, after dense or convolutional layers. In my experience it often improves accuracy.

Predicting rare events and their strength with LSTM autoencoder

I’m currently creating and LSTM to predict rare events. I’ve seen this paper which suggest: first an autoencoder LSTM for extracting features and second to use the embeddings for a second LSTM that will make the actual prediction. According to them, the autoencoder extract features (this is usually true) which are then useful for the prediction layers to predict.
In my case, I need to predict if it would be or not an extreme event (this is the most important thing) and then how strong is gonna be. Following their advice, I’ve created the model, but instead of adding one LSTM from embeddings to predictions I add two. One for binary prediction (It is, or it is not), ending with a sigmoid layer, and the second one for predicting how strong will be. Then I have three losses. The reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (Binary Entropy).
The thing is that I’m not sure that is learning anything… the binary loss keeps in 0.5, and even the reconstruction loss is not really good. And of course, the bad thing is that the time series is plenty of 0, and some numbers from 1 to 10, so definitely MSE is not a good metric.
What do you think about this approach?
This is the better architecture for predicting rare events? Which one would be better?
Should I add some CNN or FC from the embeddings before the other to LSTM, for extracting 1D patterns from the embedding, or directly to make the prediction?
Should the LSTM that predicts be just one? And only use MSE loss?
Would be a good idea to multiply the two predictions to force in both cases the predicted days without the event coincide?
Thanks,

Why (MNIST trained) model is not good at digits which not in the center of the picture

About the problem
My CNN model has accuracy up to 99.4% on the MNIST dataset. So I try some irregular input. And the predicted result is not correct.
The following are some of the irregular input I use
As we know, CNN convolution will scan the whole image, also don't care about the key features in which areas of the image.
Why CNN could not deal with irregular input

As we know, CNN convolution will scan the whole image, also don't care about the key features in which areas of the image.
This is simply false. CNN do not "scan" image, a single filter can be seen as scanning, but the whole network does not. CNN is composed of many layers, which will eventually reduce amount of information, and at some point also use location-specific feature (in final fully connected layers, in some global averaging and so on). Consequently, while CNNs are robust to small perturbations (translations or noise, but not rotations!), they are not invariant to these transformations. In other words - moving an image 3 pixels to the left is fine, but trying to classify a number in completely different scale/position will fail because there is nothing forcing your model to be invariant to that. Some models that indeed learn these kind of invariances are Spatial Transformers Networks, but CNNs simply don't.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008