Most well-known CNN architectures have a stem that does not use the same building block as the rest of the network. Instead, most architectures use plain Conv2d or pooling layers in the stem, without special modules such as a shortcut (residual), an inverted residual, a ghost convolution, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?
Examples of stems:
classic ResNet: Conv2d + MaxPool
Bag of Tricks ResNet-C: 3×Conv2d + MaxPool
Even though two Conv2d layers could form exactly the same structure as a classic residual block (as shown in [figure 2]), there is no shortcut in the stem.
Many other architectures show the same pattern, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.
cite:
https://arxiv.org/abs/1812.01187
https://arxiv.org/abs/1512.03385
As far as I know, this is done to quickly downsample the input image with strided convolutions of fairly large kernel size (5x5 or 7x7), so that the later layers can do their work effectively at much lower computational cost.
This is because these specialized modules can do no more than plain convolutions; the difference lies in the trainability of the resulting architecture. For example, the skip connections in ResNet are meant to bypass some layers while those layers are still so poorly trained that they do not propagate useful information from the input to the output. However, once fully trained, the skip connections could in theory be removed (or folded in), since the information can then propagate through the layers that would otherwise be skipped. But when you are using a backbone that you don't intend to train yourself, it does not make sense to include architectural features aimed at trainability. Instead, you can "compress" the backbone down to relatively fundamental operations and freeze all the weights. This saves computational cost both when training the head and in the final deployment.
Stem layers work as a compression mechanism over the initial image.
This leads to a fast reduction in the spatial size of the activations, reducing memory and computational costs.
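To make the downsampling point concrete, here is a minimal PyTorch sketch (my own illustration, assuming a 3-channel 224x224 input) of the two stems named in the question. Both reduce the spatial size by 4x before the first residual stage, and neither contains a shortcut:

```python
import torch
import torch.nn as nn

# Classic ResNet stem: one strided 7x7 conv followed by a strided max-pool.
# 224x224 input -> 112x112 after the conv -> 56x56 after the pool, so the
# expensive residual stages only ever see 56x56 feature maps.
classic_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# ResNet-C stem (Bag of Tricks): the 7x7 conv is replaced by three 3x3 convs.
# Note there is still no shortcut around these convolutions.
resnet_c_stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)
print(classic_stem(x).shape)   # torch.Size([1, 64, 56, 56])
print(resnet_c_stem(x).shape)  # torch.Size([1, 64, 56, 56])
```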
Here's the link to the paper regarding MobileNet V3: MobileNet V3
According to the paper, h-swish and the Squeeze-and-Excitation (SE) module are used in MobileNet V3, but they aim to enhance accuracy and don't help boost speed.
h-swish is faster than swish and helps enhance the accuracy, but is much slower than ReLU if I'm not mistaken.
SE also helps enhance the accuracy, but it increases the number of parameters of the network.
Am I missing something? I still have no idea how MobileNet V3 can be faster than V2 with what's said above implemented in V3.
I didn't mention the fact that they also modify the last part of the network, because I plan to use MobileNet V3 as the backbone and combine it with SSD layers for detection, so the last part of the network won't be used.
The following table, which can be found in the paper mentioned above, shows that V3 is still faster than V2.
[Table: object detection results for comparison]
MobileNetV3 is faster and more accurate than MobileNetV2 on the classification task, but this is not necessarily true on different tasks, such as object detection.
As you mention yourself, the optimizations they did at the deepest end of the network are mostly relevant to the classification variant, and as can be seen in the table you referenced, the mAP is no better.
A few things to consider, though:
It's true that SE and h-swish both slow down the network a bit. SE adds some FLOPs and parameters, and h-swish adds complexity, and both cause some latency. However, both are added such that the accuracy-latency trade-off is better, meaning either the added latency is worth the accuracy gain, or you can maintain the same accuracy while trimming other things, thus reducing overall latency.
Specifically regarding h-swish, note that they mostly use it in the deeper layers, where the tensors are smaller: they are thicker, but due to the quadratic drop in resolution (height x width) they are smaller overall, so h-swish adds less latency there (see the sketch after this list).
The architecture itself (without h-swish, and even without considering SE) is searched, meaning it is better suited to the task than the "vanilla" MobileNetV2: the architecture is less hand-engineered and actually optimized for the task. You can see, for example, that as in MnasNet, some of the kernels grew to 5x5 (rather than 3x3), not all expansion rates are x6, etc.
One change they made to the deepest end of the network is also relevant to object detection. Oddly, when using SSDLite-MobileNetV2, the original authors chose to keep the last 1x1 convolution, which expands from a depth of 320 to 1280. While this number of features makes sense for 1000-class classification, it's probably redundant for 80-class detection, as the authors of MobileNetV3 say themselves in the middle of page 7 (bottom of the first column, top of the second).
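For reference, here is a minimal sketch of swish vs. the hard-swish used in MobileNetV3 (PyTorch assumed purely for illustration). h-swish replaces the sigmoid with a ReLU6-based piecewise-linear term, so it avoids the per-element exponential and is friendlier to quantization:

```python
import torch
import torch.nn.functional as F

def swish(x):
    # swish(x) = x * sigmoid(x): smooth, but needs an exponential per element.
    return x * torch.sigmoid(x)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6: a piecewise-linear approximation of
    # swish built only from additions, multiplications, and clipping.
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-5, 5, steps=11)
print(swish(x))
print(h_swish(x))  # close to swish everywhere, but cheaper to compute
```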
I have a dataset in which there is high correlation between the (500+) columns. From what I understand (and correct me if I am wrong), one of the reasons you normalize to zero mean and unit standard deviation is so that an optimizer with a given learning rate can cope across many problems, rather than having to adapt the learning rate to the scale of X.
Similarly, is there a reason why I should 'whiten' my dataset? It seems to be a common step in image processing. Would it somehow make things easier for the optimizer if the columns were independent?
I understand that classically people used to decorrelate the matrices so that the weights became more statistically significant, and also to make the matrix inversion more stable. The matrix inversion part at least seems to be irrelevant in DL, since we use variations of Stochastic Gradient Descent (SGD) these days instead.
It's not really essential now. Read this note from Andrej Karpathy. Normally we don't use PCA in deep learning architectures, because we don't need to reduce the features: deep architectures can extract hierarchical features on their own. It's still good to zero-center the data, i.e., normalize it to reduce the variance within a batch. In any case, in CNNs we usually use batch normalization layers, which really help the network converge without covariate shift. Also, modern optimization techniques like Adam and RMSProp make the data pre-processing part less important.
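Here is a minimal NumPy sketch (synthetic data, my own illustration) of the difference: standardization fixes the scale of each column but leaves the columns correlated, whereas PCA whitening also decorrelates them:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]    # make two columns highly correlated

# Zero-center and scale to unit variance (the usual, cheap preprocessing).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA whitening: rotate onto the principal axes and rescale each axis,
# so the resulting columns are decorrelated and have unit variance.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eps = 1e-5                                  # avoids blowing up near-zero components
X_white = X_centered @ eigvecs / np.sqrt(eigvals + eps)

print(np.corrcoef(X_std[:, 0], X_std[:, 1])[0, 1])      # still ~0.99
print(np.corrcoef(X_white[:, 0], X_white[:, 1])[0, 1])  # ~0
```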
My understanding is that filters in convolutional neural networks extract features from the raw data (or from previous layers), so designing them by supervised learning through backpropagation makes complete sense. But I have seen some papers in which the filters are found by unsupervised clustering of input data samples. It seems strange to me that cluster centers can be regarded as good filters for feature extraction. Does anybody have a good explanation for that?
Certain popular clustering algorithms such as k-means are vector quantization methods.
They try to find a good least-squares quantization of the data, such that every data point can be represented by a similar vector with least-squares difference.
So from a least-squares approximation point of view, the cluster centers are good approximations (we can't afford to find the optimal centers, but we have a good chance at finding reasonably good ones). Whether or not least squares is appropriate depends a lot on the data; for example, all attributes should be of the same kind. For a typical image-processing task, where each pixel is represented the same way, this will be a good starting point for later supervised optimization. But I believe soft factorizations, which do not assume every patch is of exactly one kind, will usually work better.
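As a rough illustration of the idea (my own sketch, loosely in the spirit of k-means feature learning, not code from any particular paper; scikit-learn and SciPy assumed): sample patches, cluster them, and treat each centroid as a convolution kernel, i.e. a template detector that responds most strongly to patches resembling that centroid.

```python
import numpy as np
from scipy.signal import correlate2d
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = rng.random((100, 32, 32))          # stand-in for real grayscale images

# 1. Sample random patches the size of the desired filters.
patch = 5
patches = []
for img in images:
    for _ in range(20):
        y, x = rng.integers(0, 32 - patch, size=2)
        patches.append(img[y:y + patch, x:x + patch].ravel())
patches = np.array(patches)

# 2. Normalize the patches and cluster them; each centroid is one "filter".
patches -= patches.mean(axis=1, keepdims=True)
patches /= patches.std(axis=1, keepdims=True) + 1e-8
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(patches)
filters = kmeans.cluster_centers_.reshape(16, patch, patch)

# 3. Use a centroid as a convolution kernel: the response map is largest
#    wherever the image locally looks like that centroid.
response = correlate2d(images[0], filters[0], mode="valid")
print(response.shape)                       # (28, 28)
```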
I have a small dataset collected from ImageNet (7 classes, each with 1000 training images). I am trying to train an AlexNet model on it, but somehow the accuracy just can't go any higher (about 68% maximum). I removed the conv4 and conv5 layers to prevent the model from overfitting and also decreased the number of neurons in each layer (conv and fc). Here is my setup.
Did I do anything wrong, so that the accuracy is so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a pseudo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
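For what it's worth, here is a minimal sketch of the kind of first-layer kernel visualization I mean (assuming a PyTorch/torchvision AlexNet purely for illustration; the same idea applies to a Caffe model):

```python
import matplotlib.pyplot as plt
from torchvision.models import alexnet

# Grab the first conv layer's weights: shape (64, 3, 11, 11) for stock AlexNet.
model = alexnet(weights=None)   # untrained here; load your own checkpoint in practice
w = model.features[0].weight.detach()

# Rescale each kernel to [0, 1] so it can be shown as an RGB image.
w_min = w.amin(dim=(1, 2, 3), keepdim=True)
w_max = w.amax(dim=(1, 2, 3), keepdim=True)
w = (w - w_min) / (w_max - w_min + 1e-8)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, kernel in zip(axes.flat, w):
    ax.imshow(kernel.permute(1, 2, 0))      # CHW -> HWC for matplotlib
    ax.axis("off")
plt.show()
```

Run this at intervals during training (e.g. every 10% of expected convergence) and the layer's progress toward Gabor-like edge and blob detectors, or the lack of it, is usually visible at a glance.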
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.
I am trying to build an 11-class image classifier with 13000 training images and 3000 validation images. I am using a deep neural network trained with MXNet. Training accuracy is increasing and has reached above 80%, but validation accuracy stays in the range of 54-57% and is not increasing.
What can be the issue here? Should I increase the number of images?
The issue here is that your network stops learning useful general features at some point and starts adapting to the peculiarities of your training set (overfitting it as a result). You want to 'force' your network to keep learning useful features, and you have a few options here:
Use weight regularization. It tries to keep the weights low, which very often leads to better generalization. Experiment with different regularization coefficients; try 0.1, 0.01, 0.001 and see what impact they have on accuracy (see the sketch after this list).
Corrupt your input (e.g., randomly substitute some pixels with black or white). This way you remove information from the input and 'force' the network to pick up on important general features. Experiment with the noising coefficient, which determines how much of your input should be corrupted. Research shows that anything in the range of 15% - 45% works well.
Expand your training set. Since you're dealing with images, you can expand your set by rotating / scaling etc. your existing images (as suggested). You could also experiment with pre-processing your images (e.g., mapping them to black and white or grayscale), but the effectiveness of this will depend on your exact images and classes.
Pre-train your layers with a denoising criterion. Here you pre-train each layer of your network individually before fine-tuning the entire network. Pre-training 'forces' the layers to pick up on important general features that are useful for reconstructing the input signal. Look into auto-encoders, for example (they've been applied to image classification in the past).
Experiment with the network architecture. Your network might not have sufficient learning capacity. Experiment with different neuron types, numbers of layers, and numbers of hidden neurons. Make sure to try compressing architectures (fewer neurons than inputs) and sparse architectures (more neurons than inputs).
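Here is a minimal sketch of the first two suggestions, weight regularization and input corruption (PyTorch assumed purely for illustration; the model and data are hypothetical placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model; substitute your own network and data loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU(), nn.Linear(256, 11))

# Weight regularization: most frameworks expose the L2 penalty as "weight decay"
# on the optimizer. Try e.g. 0.1, 0.01, 0.001 and compare validation accuracy.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
criterion = nn.CrossEntropyLoss()

def corrupt(x, p=0.3):
    # Input corruption: randomly zero out a fraction p of the pixels, forcing
    # the network to rely on redundant, more general features.
    mask = (torch.rand_like(x) > p).float()
    return x * mask

# One illustrative training step (images: N x 3 x 32 x 32, labels: N).
images, labels = torch.randn(16, 3, 32, 32), torch.randint(0, 11, (16,))
optimizer.zero_grad()
loss = criterion(model(corrupt(images)), labels)
loss.backward()
optimizer.step()
```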
Unfortunately the process of training network that generalizes well involves a lot of experimentation and almost brute force exploration of parameter space with a bit of human supervision (you'll see many research works employing this approach). It's good to try 3-5 values for each parameter and see if it leads you somewhere.
When you experiment plot accuracy / cost / f1 as a function of number of iterations and see how it behaves. Often you'll notice a peak in accuracy for your test set, and after that a continuous drop. So apart from good architecture, regularization, corruption etc. you're also looking for a good number of iterations that yields best results.
One more hint: make sure each training epoch randomizes the order of the images.
This clearly looks like a case where the model is overfitting the training set, as the validation accuracy was improving step by step until it got stuck at a particular value. If the learning rate were a bit higher, you would have ended up seeing validation accuracy decrease while training accuracy kept increasing.
Increasing the size of the training set is the best solution to this problem. You could also try applying different transformations (flipping, cropping random portions from a slightly bigger image) to the existing images and see if the model learns better, as in the sketch below.
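As a minimal sketch of the kind of transformations meant above (torchvision assumed purely for illustration; MXNet offers equivalent augmenters):

```python
from torchvision import transforms

# Random flips and random crops from a slightly bigger image act as "free"
# extra training data: each epoch the network sees slightly different views.
train_transforms = transforms.Compose([
    transforms.Resize(256),                  # start from a slightly bigger image
    transforms.RandomCrop(224),              # crop a random 224x224 portion
    transforms.RandomHorizontalFlip(p=0.5),  # flip half of the samples
    transforms.ToTensor(),
])

# Typically plugged into the dataset, e.g.:
# dataset = torchvision.datasets.ImageFolder("train/", transform=train_transforms)
```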