In my understanding, a fully connected layer (FC for short) is used for prediction.
For example, VGGNet uses two FC layers, both of dimension 4096, and a final layer for the softmax whose dimension equals the number of classes: 1000.
ResNet, on the other hand, uses global average pooling and feeds the pooled output of the last convolutional layer into the classifier.
But it still has an FC layer! Is this layer really an FC layer, or does it only turn the input into a vector whose length equals the number of classes? Does this layer contribute to the prediction?
In short, how many FC layers do ResNet and VGGNet have? Do VGGNet's first, second, and third FC layers serve different functions?
VGG has three FC layers: two with 4096 neurons and one with 1000 neurons which, after the softmax, outputs the class probabilities.
ResNet has only one FC layer, with 1000 neurons, which again outputs the class probabilities. In an NN classifier the standard choice for the output is a softmax; some authors make this explicit in the diagram while others do not.
In essence, the authors of ResNet (Microsoft) favor more convolutional layers over fully connected ones and therefore omit the large FC layers. Global average pooling also shrinks the feature size dramatically, which reduces the number of parameters going from the convolutional part to the fully connected part.
I would argue that the performance difference from this choice alone is quite slim, but one of ResNet's main accomplishments is its dramatic reduction in parameters, and those two points helped accomplish that.
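If you want to check this yourself, here is a minimal sketch using torchvision's reference implementations (this assumes torchvision is installed; the comments summarise what gets printed):

```python
# Inspect the classifier heads of VGG16 and ResNet50 in torchvision.
import torchvision.models as models

vgg = models.vgg16()
print(vgg.classifier)
# Sequential: Linear(25088 -> 4096), ReLU, Dropout,
#             Linear(4096  -> 4096), ReLU, Dropout,
#             Linear(4096  -> 1000)          -> three FC layers

resnet = models.resnet50()
print(resnet.avgpool, resnet.fc)
# AdaptiveAvgPool2d(output_size=(1, 1)) followed by
# Linear(2048 -> 1000)                        -> a single FC layer
```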
I'm currently creating an LSTM to predict rare events. I've seen this paper, which suggests first training an autoencoder LSTM to extract features and then feeding the resulting embeddings into a second LSTM that makes the actual prediction. According to the authors, the autoencoder extracts features (this is usually true) which are then useful for the prediction layers.
In my case, I need to predict whether or not there will be an extreme event (this is the most important thing) and then how strong it will be. Following their advice, I built the model, but instead of adding one LSTM from the embeddings to the predictions, I added two: one for the binary prediction (it happens or it does not), ending in a sigmoid layer, and one for predicting how strong the event will be. So I have three losses: the reconstruction loss (MSE), the prediction loss (MSE), and the binary loss (binary cross-entropy).
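Roughly, the model looks like the sketch below (a minimal Keras version; the layer sizes, sequence length, and output shapes are placeholders rather than my real settings):

```python
# Minimal Keras sketch: LSTM autoencoder plus two prediction heads.
from tensorflow.keras import layers, models

T, F = 30, 1                                   # timesteps, features (placeholders)
inp = layers.Input(shape=(T, F))

# Encoder LSTM -> fixed-size embedding
emb = layers.LSTM(32)(inp)

# Decoder branch: reconstruct the input window (reconstruction loss, MSE)
dec = layers.LSTM(32, return_sequences=True)(layers.RepeatVector(T)(emb))
recon = layers.TimeDistributed(layers.Dense(F), name="recon")(dec)

# Prediction branch 1: will there be an event? (binary cross-entropy)
b = layers.LSTM(16)(layers.RepeatVector(T)(emb))
is_event = layers.Dense(1, activation="sigmoid", name="is_event")(b)

# Prediction branch 2: how strong will it be? (MSE)
m = layers.LSTM(16)(layers.RepeatVector(T)(emb))
magnitude = layers.Dense(1, name="magnitude")(m)

model = models.Model(inp, [recon, is_event, magnitude])
model.compile(optimizer="adam",
              loss={"recon": "mse",
                    "is_event": "binary_crossentropy",
                    "magnitude": "mse"})
```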
The problem is that I'm not sure it is learning anything: the binary loss stays at 0.5, and even the reconstruction loss is not very good. And of course the time series is mostly zeros, with some values between 1 and 10, so MSE is probably not a good metric anyway.
What do you think about this approach?
Is this a good architecture for predicting rare events? If not, which one would be better?
Should I add a CNN or FC layer between the embeddings and the two prediction LSTMs, to extract 1D patterns from the embedding, or feed the embeddings straight into the prediction LSTMs?
Should there be just one prediction LSTM, trained only with the MSE loss?
Would it be a good idea to multiply the two predictions, to force both heads to agree on the days without an event?
Thanks,
I have two different pretrained word embedding models that I want to combine, so that a word missing from one model can be supplied by the other (in case the other model has the word that is missing in the first). But the vectors have different dimensions in the two models: the first model's vectors have 300 dimensions and the second model's have 1000.
Can I simply retain the first 300 dimensions of the second model, discard the remaining 700, and build one combined model of 300 dimensions?
Since the two models were trained at separate times they will not "semantically align", even if they had the same dimensionality. Because there are random aspects in the initialisation of training, one can't directly compare two independent vector sets. The topological aspects, i.e. the relations between the vectors in high-dimensional space, will most likely be similar, but two vectors corresponding to the same word in two independent vector sets will not lie in the same position.
There are dimensionality reduction algorithms (SVD, PCA, SOMs, autoencoders) that can bring the vectors from 1000 down to 300 dimensions, but as mentioned above this won't solve your problem.
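For completeness, matching the dimensionalities is easy; a minimal sketch with scikit-learn's PCA (the vector matrix here is a random placeholder):

```python
# Project the 1000-dimensional vectors down to 300 with PCA.
# Note: this only matches dimensionalities; it does NOT align the two spaces.
import numpy as np
from sklearn.decomposition import PCA

vectors_1000 = np.random.randn(50000, 1000)    # placeholder for the 1000-d model
pca = PCA(n_components=300)
vectors_300 = pca.fit_transform(vectors_1000)  # shape: (50000, 300)
```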
I would suggest retraining a model on corpora containing the full vocabulary, if possible. Even if there were some fancy way of combining two independent models, I would expect the result to suffer in quality.
In the implementation of the DenseNet model as in the CheXNet paper, section 3.1 mentions:
Before inputting the images into the network, we downscale the images to 224x224 and normalize based on the mean and standard deviation of images in the ImageNet training set.
Why would we want to normalize a new set of images with the mean and std of a different dataset?
How do we get the mean and std of the ImageNet dataset? Is it provided somewhere?
Subtracting the mean centers the input at 0, and dividing by the standard deviation expresses each scaled feature value as a number of standard deviations away from the mean.
Consider how a neural network learns its weights: (C)NNs learn by repeatedly adding gradient error vectors (multiplied by the learning rate), computed via backpropagation, to the various weight matrices throughout the network as training examples are passed through.
The thing to notice here is "multiplied by the learning rate".
If we didn't scale our input training vectors, the ranges of the feature value distributions would likely differ from feature to feature, and so the learning rate would cause corrections in each dimension that differ (proportionally speaking) from one another. We might be overcompensating a correction in one weight dimension while undercompensating in another.
This is not ideal, as we might end up in an oscillating state (unable to settle into a better minimum of the cost surface) or in a slow-moving state (travelling too slowly to reach a better minimum).
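As a concrete illustration, per-channel standardization of a batch of images looks like the following numpy sketch (the batch here is a random placeholder):

```python
# Per-channel standardization of a batch of images (placeholder data).
import numpy as np

images = np.random.rand(16, 224, 224, 3)   # N x H x W x C, values in [0, 1]
mean = images.mean(axis=(0, 1, 2))         # one mean per channel
std = images.std(axis=(0, 1, 2))           # one standard deviation per channel
normalized = (images - mean) / std         # zero mean, unit variance per channel
```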
Original Post: https://stats.stackexchange.com/questions/185853/why-do-we-need-to-normalize-the-images-before-we-put-them-into-cnn
They used mean and std dev of the ImageNet training set because the weights of their model were pretrained on ImageNet (see Model Architecture and Training section of the paper).
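As for where to get the numbers: the per-channel ImageNet statistics that most pretrained models ship with are the widely used values below (e.g. the ones quoted in torchvision's documentation); a minimal preprocessing sketch:

```python
# Common ImageNet preprocessing; the mean/std are the widely used values.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                             # scales pixels to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet channel means
                         std=[0.229, 0.224, 0.225]),   # ImageNet channel std devs
])
```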
About the problem
My CNN model reaches up to 99.4% accuracy on the MNIST dataset, so I tried some irregular inputs, and the predictions were not correct.
The following are some of the irregular inputs I used:
As we know, a CNN convolution scans the whole image and does not care about where in the image the key features are.
Why can the CNN not deal with irregular input?
As we know, a CNN convolution scans the whole image and does not care about where in the image the key features are.
This is simply false. A CNN does not "scan" the image; a single filter can be seen as scanning, but the network as a whole does not. A CNN is composed of many layers which progressively reduce the amount of information, and at some point it also uses location-specific features (in the final fully connected layers, in some forms of global averaging, and so on). Consequently, while CNNs are robust to small perturbations (translations or noise, but not rotations!), they are not invariant to these transformations. In other words, moving an image 3 pixels to the left is fine, but trying to classify a digit at a completely different scale or position will fail, because nothing forces your model to be invariant to that. Some models do learn these kinds of invariances, e.g. Spatial Transformer Networks, but plain CNNs simply don't.
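You can see this quickly with a large shift. Here is a sketch, assuming `model` is your trained Keras MNIST classifier and `x` is a single 28x28 test image (both names are placeholders):

```python
# Compare predictions on an original and a heavily shifted MNIST digit.
# `model` and `x` are placeholders for your trained classifier and one test image.
import numpy as np

def predict_digit(model, img):
    return int(np.argmax(model.predict(img.reshape(1, 28, 28, 1), verbose=0)))

x_shifted = np.roll(x, shift=10, axis=1)   # push the digit far to the right
print(predict_digit(model, x), predict_digit(model, x_shifted))
# A shift of a couple of pixels is usually fine; a large one like this often
# changes the prediction.
```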
I want to know whether it is possible to estimate the training time of a convolutional neural network, given parameters like the depth, filter sizes, input size, etc.
For instance, I am working on a 3D convolutional neural network whose structure is:
a (20x20x20) convolutional layer with stride of 1 and 8 filters
a (20x20x20) max-pooling layer with stride of 20
a fully connected layer mapping to 8 nodes
a fully connected layer mapping to 1 output
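My reading of this in Keras, as a rough sketch (the input volume size is a placeholder, since I have not listed it here, and I am reading the 20x20x20 as the filter and pooling sizes):

```python
# Rough Keras sketch of the network above; the 100x100x100 input is a placeholder.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 100, 100, 1)),
    layers.Conv3D(8, kernel_size=(20, 20, 20), strides=1, activation="relu"),
    layers.MaxPooling3D(pool_size=(20, 20, 20), strides=20),
    layers.Flatten(),
    layers.Dense(8, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```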
I am running 100 epochs and printing the loss (mean squared error) every 10 epochs. It has now been running for 24 hours and no loss has been printed (I suppose it has not completed 10 epochs yet). By the way, I am not using a GPU.
Is it possible to estimate the training time with a formula or something like that? Is it related to the time complexity, or to my hardware? I also found the following paper; will it give me some information?
https://ai2-s2-pdfs.s3.amazonaws.com/140f/467566e799f32831db6913d84ccdbdcac0b2.pdf
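The only idea I have so far is to time a few batches and extrapolate, something like the rough sketch below (assuming a Keras-style API; `model`, `x_train`, and `y_train` are placeholders for my model and data):

```python
# Time a small number of batches, then extrapolate to 100 epochs.
# `model`, `x_train`, `y_train` are placeholders for my model and data.
import time

batch_size = 8
n_batches = 5

start = time.time()
model.fit(x_train[:n_batches * batch_size], y_train[:n_batches * batch_size],
          batch_size=batch_size, epochs=1, verbose=0)
seconds_per_batch = (time.time() - start) / n_batches

batches_per_epoch = len(x_train) / batch_size
estimated_hours = seconds_per_batch * batches_per_epoch * 100 / 3600
print("rough estimate for 100 epochs: %.1f hours" % estimated_hours)
```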
Thanks in advance.