In a deep neural network, loss decreases significantly once I increase the number of fully connected layers - deep-learning

I am training a ViT-based classification model and trying to study the behaviour of the model as I increase the number of fully connected (fc) layers. One thing I noticed is that if I increase the number of fc layers beyond 2, the loss drops sharply after only a few iterations, while with one or two fc layers the loss curve is smoother and decreases slowly. I am adding the loss curves for reference (left: 3 layers, right: 2 layers).
I have read that increasing the number of layers can improve training accuracy but may also cause overfitting. Looking at the loss curve, though, it does not appear to be overfitting (number of neurons per fc layer: 1000).
Can someone explain this behaviour? Thanks in advance.
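For concreteness, here is a minimal sketch of the kind of head variation being compared. The model name, hidden width, and class count are assumptions for illustration, not details from the post:

```python
import torch.nn as nn
import timm  # one possible ViT implementation; the post does not say which was used

# Sketch: replace the ViT's single-layer classification head with a stack of
# fully connected layers (here 3); the 1- or 2-layer variants just drop
# Linear/ReLU pairs. Sizes are illustrative.
model = timm.create_model("vit_base_patch16_224", pretrained=False)
in_dim, hidden, num_classes = model.head.in_features, 1000, 10

model.head = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, num_classes),
)
```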

Related

Data augmentation stops overfitting by preventing learning entirely?

I am training a network to classify psychosis (binary classification as either healthy or psychosis) given an MRI scan of a subject. My dataset has 500 items, of which I use 350 for training and 150 for validation. Around 44% of the dataset is healthy, and ~56% has psychosis.
When I train the network without data augmentation, the training loss begins decreasing immediately while validation loss never changes. The red line in the accuracy graph below is the dominant class percentage (56%).
When I re-train using data augmentation 80% of the time (random affine, blur, noise, flip), overfitting is prevented, but now nothing is learned at all.
So I suppose my question is: what are some ideas for getting the validation accuracy to increase, i.e. getting the network to learn something without overfitting?
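For reference, a minimal sketch of the "augment ~80% of the time" setup described above. It assumes 2D images (for 3D MRI volumes a library such as torchio would be the usual choice), and the parameter values are illustrative, not taken from the post:

```python
import torch
from torchvision import transforms

# Wrap the augmentation pipeline in RandomApply so it fires on roughly 80%
# of training samples; the remaining 20% pass through unchanged.
augment = transforms.RandomApply(
    [
        transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),
        transforms.GaussianBlur(kernel_size=3),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # additive noise
    ],
    p=0.8,
)
train_transform = transforms.Compose([transforms.ToTensor(), augment])
```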

How does the dataset size affect the iteration speed when training a neural network model?

I am training a CNN model in PyTorch with two datasets: a small one consisting of ~100,000 annotated images and a big one with ~3,500,000 (35 times larger). Training becomes slower when I train the model on the large dataset: it drops from 60 it/s to 30 it/s. I use the same batch size, number of workers, and all other parameters. I thought that training speed should not depend on dataset size.
What could be the reasons for this behaviour?
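One practical way to narrow this down is to time the input pipeline on its own, without the model, and compare the two datasets. A rough sketch follows; FakeData and the sizes below are placeholders, not the actual setup:

```python
import time
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Iterate the DataLoader alone and measure it/s. Swap FakeData for the real
# small and large datasets; if the numbers differ, the slowdown is in the
# data pipeline (disk cache, file layout, decoding), not the model.
dataset = datasets.FakeData(size=10_000, image_size=(3, 224, 224),
                            transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, num_workers=4, shuffle=True)

n_batches = 100
start = time.time()
for i, (images, targets) in enumerate(loader):
    if i + 1 == n_batches:
        break
print(f"{n_batches / (time.time() - start):.1f} it/s from the data pipeline alone")
```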

Predicting rare events and their strength with LSTM autoencoder

I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first training an LSTM autoencoder to extract features and then feeding its embeddings to a second LSTM that makes the actual prediction. According to the authors, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether there will be an extreme event (this is the most important thing) and then how strong it is going to be. Following their advice, I've built the model, but instead of adding one LSTM from the embeddings to the predictions I add two: one for the binary prediction (it is, or it is not), ending with a sigmoid layer, and a second one for predicting how strong the event will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
The thing is, I'm not sure it is learning anything... the binary loss stays at 0.5, and even the reconstruction loss is not very good. And of course, the bad part is that the time series is mostly zeros with some values from 1 to 10, so MSE is definitely not a good metric.
What do you think about this approach?
Is this a good architecture for predicting rare events? Which one would be better?
Should I add some CNN or fc layers after the embeddings, before the other two LSTMs, to extract 1D patterns from the embedding, or should the embeddings go directly into the prediction?
Should there be just one prediction LSTM, using only an MSE loss?
Would it be a good idea to multiply the two predictions, so that in both cases the predicted days without the event coincide?
Thanks,
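For reference, a rough PyTorch sketch of the architecture as described. Layer sizes, sequence length, and the use of logits with BCEWithLogits (instead of an explicit sigmoid layer) are assumptions for illustration, not details from the post:

```python
import torch
import torch.nn as nn

# LSTM encoder/decoder for reconstruction, plus two heads on the encoder
# embeddings: one binary event/no-event head and one event-strength head.
class RareEventModel(nn.Module):
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.reconstruct = nn.Linear(hidden, n_features)
        self.event_head = nn.LSTM(hidden, hidden, batch_first=True)
        self.event_out = nn.Linear(hidden, 1)        # logits -> BCE loss
        self.strength_head = nn.LSTM(hidden, hidden, batch_first=True)
        self.strength_out = nn.Linear(hidden, 1)     # regression -> MSE loss

    def forward(self, x):                            # x: (batch, time, features)
        z, _ = self.encoder(x)                       # per-step embeddings
        dec, _ = self.decoder(z)
        recon = self.reconstruct(dec)
        ev, _ = self.event_head(z)
        event_logit = self.event_out(ev[:, -1])      # last step -> binary pred
        st, _ = self.strength_head(z)
        strength = self.strength_out(st[:, -1])
        return recon, event_logit, strength

model = RareEventModel()
x = torch.randn(8, 30, 1)                            # dummy batch, 30 time steps
recon, event_logit, strength = model(x)
loss = (nn.functional.mse_loss(recon, x)
        + nn.functional.binary_cross_entropy_with_logits(event_logit, torch.zeros(8, 1))
        + nn.functional.mse_loss(strength, torch.zeros(8, 1)))
```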

Does resnet have fully connected layers?

In my understanding, a fully connected layer (fc for short) is used for prediction.
For example, VGGNet uses two fc layers, both of dimension 4096. The last layer, which feeds the softmax, has a dimension equal to the number of classes: 1000.
But ResNet uses global average pooling and takes the pooled result of the last convolutional layer as the input.
Yet it still has an fc layer! Is this really an fc layer, or is it only there to turn the input into a feature vector whose length equals the number of classes? Does this layer produce the prediction result?
In a word, how many fc layers do ResNet and VGGNet have? Do VGGNet's 1st, 2nd, and 3rd fc layers have different functions?
VGG has three FC layers: two with 4096 neurons and one with 1000 neurons, which outputs the class probabilities.
ResNet has only one FC layer, with 1000 neurons, which again outputs the class probabilities. In a neural network classifier, softmax is almost always the best choice for the final activation; some authors make this explicit in the diagram while others do not.
In essence, the authors of ResNet (Microsoft) favor more convolutional layers over fully connected ones and therefore omit the extra fully connected layers that VGG uses. Global average pooling also decreases the feature size dramatically and therefore reduces the number of parameters going from the convolutional part to the fully connected part.
I would argue that the performance difference from this is quite slim, but one of the main accomplishments of introducing ResNets is the dramatic reduction in parameters, and those two choices (fewer fully connected layers and global average pooling) are what made it possible.
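A quick way to see this concretely, assuming a reasonably recent torchvision build:

```python
import torchvision.models as models

# Instantiate both networks without weights and print the classifier portion.
vgg = models.vgg16(weights=None)
resnet = models.resnet50(weights=None)

print(vgg.classifier)   # Linear(25088->4096), ReLU, Dropout, Linear(4096->4096),
                        # ReLU, Dropout, Linear(4096->1000): three FC layers
print(resnet.fc)        # Linear(2048->1000): a single FC layer
```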

Estimating the training time in convolutional neural network

I want to know whether it is possible to estimate the training time of a convolutional neural network, given parameters like depth, filters, input size, etc.
For instance, I am working on a 3D convolutional neural network whose structure is:
a (20x20x20) convolutional layer with stride of 1 and 8 filters
a (20x20x20) max-pooling layer with stride of 20
a fully connected layer mapping to 8 nodes
a fully connected layer mapping to 1 output
I am running 100 epochs and printing the loss (mean squared error) every 10 epochs. It has now run for 24 hours with no loss printed (I suppose it has not completed 10 epochs yet). By the way, I am not using a GPU.
Is it possible to estimate the training time with a formula or something like that? Is it related to time complexity or to my hardware? I also found the following paper; will it give me some information?
https://ai2-s2-pdfs.s3.amazonaws.com/140f/467566e799f32831db6913d84ccdbdcac0b2.pdf
Thanks in advance.
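One practical alternative to a closed-form formula is to time a single forward/backward pass on the target hardware and extrapolate. A rough sketch follows, with PyTorch assumed (the post does not name a framework) and input size, batch size, and batches per epoch as made-up placeholders:

```python
import time
import torch
import torch.nn as nn

# Approximate the architecture described above, time one optimization step,
# then multiply by batches per epoch and number of epochs.
model = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=20, stride=1),
    nn.MaxPool3d(kernel_size=20, stride=20),
    nn.Flatten(),
    nn.Linear(64, 8),   # 64 = 8 filters * 2*2*2 after pooling a 64^3 input
    nn.Linear(8, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(4, 1, 64, 64, 64)   # one batch of 3D volumes (illustrative size)
y = torch.randn(4, 1)

start = time.time()
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
seconds_per_batch = time.time() - start

batches_per_epoch = 1000            # replace with len(dataset) / batch_size
epochs = 100
est_hours = seconds_per_batch * batches_per_epoch * epochs / 3600
print(f"estimated total training time: {est_hours:.1f} h")
```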