How is MobileNet V3 faster than V2? - deep-learning

Here's the link to the paper regarding MobileNet V3.
MobileNet V3
According to the paper, h-swish and the squeeze-and-excitation (SE) module are used in MobileNet V3, but they are meant to improve accuracy and do not help speed.
h-swish is faster than swish and helps accuracy, but it is much slower than ReLU, if I'm not mistaken.
SE also helps accuracy, but it increases the number of parameters in the network.
Am I missing something? I still don't see how MobileNet V3 can be faster than V2 with the above implemented in V3.
I didn't mention that they also modify the last part of the network, because I plan to use MobileNet V3 as the backbone and combine it with SSD layers for detection, so the last part of the network won't be used.
The following table, from the paper mentioned above, shows that V3 is still faster than V2.
Object detection results for comparison

MobileNetV3 is faster and more accurate than MobileNetV2 on the classification task, but this is not necessarily true on a different task, such as object detection.
As you mention yourself, the optimizations they made at the deepest end of the network are mostly relevant to the classification variant, and as can be seen in the table you referenced, the mAP is no better.
A few things to consider, though:
It's true that SE and h-swish both slow the network down a bit. SE adds some FLOPs and parameters, h-swish adds complexity, and both cause some latency. However, both are added such that the accuracy-latency trade-off is better, meaning either the added latency is worth the accuracy gain, or you can maintain the same accuracy while reducing other things, thus reducing overall latency.
Specifically regarding h-swish, note that they mostly use it in deeper layers, where the tensors are smaller. They are thicker (more channels), but due to the quadratic drop in resolution (height x width), they are smaller overall, hence h-swish causes less latency.
The architecture itself (without h-swish, and even without considering SE) is searched, meaning it is better suited to the task than "vanilla" MobileNetV2, since the architecture is "less hand-engineered" and actually optimized for the task. You can see, for example, that as in MnasNet, some of the kernels grew to 5x5 (rather than 3x3), not all expansion rates are x6, etc.
One change they made to the deepest end of the network is also relevant to object detection. Oddly, when using SSDLite-MobileNetV2, the original authors chose to keep the last 1x1 convolution, which expands from a depth of 320 to 1280. While this number of features makes sense for 1000-class classification, for 80-class detection it's probably redundant, as the authors of MNv3 say themselves in the middle of page 7 (bottom of first column, top of second).
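To make the h-swish points above concrete, here is a minimal NumPy sketch. The definition h-swish(x) = x * ReLU6(x + 3) / 6 is the one given in the MobileNetV3 paper; the two feature-map shapes are illustrative assumptions (not taken from the paper's tables), chosen only to show how the element count shrinks as resolution drops even though the channel count grows.

```python
import numpy as np


def relu6(x):
    """ReLU capped at 6."""
    return np.clip(x, 0.0, 6.0)


def h_swish(x):
    """Hard swish: piecewise-linear approximation of swish, no exponential."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """Swish (x * sigmoid(x)), shown only for comparison."""
    return x / (1.0 + np.exp(-x))


# Illustrative (assumed) activation shapes: height, width, channels.
early_shape = (112, 112, 16)   # high resolution, few channels
deep_shape = (14, 14, 80)      # low resolution, more channels

early_elems = int(np.prod(early_shape))
deep_elems = int(np.prod(deep_shape))

# The activation is applied element-wise, so its cost scales with these counts.
print(early_elems, deep_elems, early_elems / deep_elems)
```

Since the nonlinearity is applied per element, placing the more expensive activation on the smaller (deeper) tensors keeps its share of the total latency low.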

Related

Why do CNNs usually have a stem?

Most cutting-edge/famous CNN architectures have a stem that does not use the same kind of block as the rest of the network. Instead, most architectures use plain Conv2d or pooling in the stem, without special modules/layers such as a shortcut (residual), an inverted residual, a ghost conv, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?
examples of stems:
classic ResNet: Conv2d+MaxPool:
bag of tricks ResNet-C: 3*Conv2d+MaxPool,
even though 2 Conv2d can form the exact same structure as a classic residual block as shown below in [figure 2], there is no shortcut in stem:
there are many other examples that have similar observations, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.
cite:
https://arxiv.org/abs/1812.01187
https://arxiv.org/abs/1512.03385
As far as I know, this is done in order to quickly downsample an input image with strided convolutions of quite large kernel size (5x5 or 7x7) so that further layers can effectively do their work with much less computational complexity.
This is because these specialized modules can do no more than plain convolutions. The difference is in the trainability of the resulting architecture. For example, the skip connections in ResNet are meant to bypass some layers when these are still so badly trained that they do not propagate useful information from the input to the output. However, when fully trained, the skip connections could in theory be completely removed (or integrated), since the information can still propagate through the layers that would otherwise be skipped. When you are using a backbone that you don't intend to train yourself, it does not make sense to include architectural features that are aimed at trainability. Instead, you can "compress" the backbone, leaving only relatively fundamental operations, and freeze all weights. This saves computational cost both when training the head and in the final deployment.
Stem layers work as a compression mechanism over the initial image.
This leads to a fast reduction in the spatial size of the activations, reducing memory and computational costs.
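As a rough illustration of how quickly a stem shrinks the activations, the sketch below applies the usual output-size formula for convolution and pooling layers. The layer settings are those of the classic ResNet stem (7x7 stride-2 convolution with padding 3, then 3x3 stride-2 max-pool with padding 1); the 224-pixel input and the helper names are assumptions made for the example.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a conv/pool layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def stem_output(n, layers):
    """Apply a sequence of (kernel, stride, padding) layers to an input size."""
    for kernel, stride, padding in layers:
        n = out_size(n, kernel, stride, padding)
    return n


# Classic ResNet stem: 7x7/2 conv (pad 3) followed by 3x3/2 max-pool (pad 1).
RESNET_STEM = [(7, 2, 3), (3, 2, 1)]

side = stem_output(224, RESNET_STEM)          # 224 -> 112 -> 56
reduction = (224 * 224) / (side * side)       # 16x fewer spatial positions
print(side, reduction)
```

Every later block therefore runs on 16 times fewer spatial positions than the input image, which is where most of the memory and compute saving comes from.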

In deep learning, can I change the weight of loss dynamically?

Call for experts in deep learning.
Hey, I have recently been working on training a model on images using TensorFlow in Python for tone mapping. To get better results, I focused on using the perceptual loss introduced in this paper by Justin Johnson.
In my implementation, I used all 3 parts of the loss: a feature loss extracted from VGG16; an L2 pixel-level loss between the transferred image and the ground-truth image; and the total variation loss. I summed them up as the loss for backpropagation.
From the function
$$\hat{y} = \arg\min_{y} \; \lambda_c \, \mathrm{loss}_{content}(y, y_c) + \lambda_s \, \mathrm{loss}_{style}(y, y_s) + \lambda_{TV} \, \mathrm{loss}_{TV}(y)$$
in the paper, we can see that there are 3 loss weights, the λ's, to balance them. The values of the three λ's are probably fixed throughout training.
My question is: does it make sense to dynamically change the λ's every epoch (or every few epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges drastically in the first several epochs, yet the pixel-level L2 loss converges fairly slowly. So maybe the weight λ should be higher for the content loss, let's say 0.9, and lower for the others. As time passes, the pixel-level loss becomes increasingly important for smoothing the image and minimizing artifacts, so it might be better to raise its weight a bit, just like changing the learning rate across epochs.
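A minimal sketch of what such a schedule could look like, in plain Python so it is independent of the training framework. The function names, the linear interpolation, and the start and end weights are all hypothetical choices made for illustration; this only shows the mechanics of the idea and says nothing about whether it helps.

```python
def loss_weights(epoch, total_epochs,
                 start=(0.9, 0.05, 0.05), end=(0.5, 0.45, 0.05)):
    """Linearly interpolate (feature, pixel, tv) weights over training.

    start/end are assumed example values, not taken from any paper.
    """
    if total_epochs <= 1:
        return end
    t = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return tuple((1.0 - t) * s + t * e for s, e in zip(start, end))


def total_loss(feature_loss, pixel_loss, tv_loss, weights):
    """Weighted sum of the three loss terms for one training step."""
    w_feature, w_pixel, w_tv = weights
    return w_feature * feature_loss + w_pixel * pixel_loss + w_tv * tv_loss


# Example: weights at the first, middle and last of 11 epochs.
for epoch in (0, 5, 10):
    print(epoch, loss_weights(epoch, 11))
```

Note that with changing weights the total loss value is no longer comparable across epochs, so the individual terms would have to be tracked separately to judge progress.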
The postdoc who supervises me strongly opposes my idea. He thinks it amounts to dynamically changing the model being trained and could make the training inconsistent.
So, pro and cons, I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, dynamically changing the loss weights should not have much effect and may even have the opposite effect.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following to see whether there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (it's way too time-consuming), but in essence, it's best to keep the weights constant and also avoid angering your supervisor :D
If you are using a different ML or DL library, there are hyperparameter optimizers for each; just Google them. It may be best to run these on a cluster overnight, but they usually give you a well-enough optimized version of your model.
Hope that helps and good luck!

caffe - how to properly train alexnet with only 7 classes

I have a small dataset collected from ImageNet (7 classes, each with 1000 training images). I am trying to train an AlexNet model on it, but somehow the accuracy just can't go any higher (about 68% maximum). I removed the conv4 and conv5 layers to prevent overfitting and also decreased the number of neurons in each layer (conv and fc). Here is my setup.
Did I do anything wrong that makes the accuracy so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a pseudo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.

Which is best for object localization among R-CNN, fast R-CNN, faster R-CNN and YOLO

what is the difference between R-CNN, fast R-CNN, faster R-CNN and YOLO in terms of the following:
(1) Precision on same image set
(2) Given SAME IMAGE SIZE, the run time
(3) Support for android porting
Considering these three criteria which is the best object localization technique?
R-CNN is the daddy-algorithm of all the mentioned algos; it really paved the way for researchers to build more complex and better algorithms on top of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consists of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. Fast R-CNN is faster and better by virtue of the following points:
Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively, it makes a lot of sense to replace 2000 separate CNN passes with a single convolutional pass and to make the boxes on top of that.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm, so Faster R-CNN introduced something called the Region Proposal Network (RPN).
Here is how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being on “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
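The "k different boxes centered around each location" idea can be sketched as follows. The default scales (128, 256, 512) and aspect ratios (1:2, 1:1, 2:1), giving k = 9, are the ones described in the Faster R-CNN paper; the function name and the corner-coordinate box format are assumptions made for this example, and the real RPN additionally predicts an objectness score and coordinate offsets for every anchor.

```python
import numpy as np


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor boxes centered at (cx, cy).

    Each box is (x_min, y_min, x_max, y_max). A ratio is height / width,
    and every box of a given scale s keeps the same area s * s.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)


# Anchors for one sliding-window location (image coordinates assumed).
anchors = anchors_at(300, 200)
print(anchors.shape)  # (9, 4)
```

Repeating this for every position of the sliding window yields the full set of candidate boxes that the RPN scores and refines.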
YOLO:
YOLO uses a single CNN network for both classification and localising the object using bounding boxes. This is the architecture of YOLO :
In the end you will have an output vector of length 1470, i.e. 7*7*30, and the structure of the CNN output will be:
The 1470-element output is divided into three parts, giving the class probabilities, confidences and box coordinates. Each of these three parts is further divided into 49 small regions, corresponding to the predictions for the 49 cells that make up the original image.
In the postprocessing steps, we take this 1470-element output from the network and generate the boxes with a probability higher than a certain threshold.
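A rough sketch of that split and thresholding step, assuming a 7x7 grid, 2 boxes per cell and 20 classes (so 7*7*(20 + 2*5) = 1470). The ordering (all class probabilities, then all confidences, then all box coordinates) follows the description above; actual implementations may lay the vector out differently, and the helper names and the 0.2 threshold are hypothetical.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (assumed)


def split_yolo_output(vec):
    """Split the flat output into class probabilities, confidences and boxes."""
    vec = np.asarray(vec, dtype=float)
    assert vec.size == S * S * (C + B * 5)
    n_prob = S * S * C
    n_conf = S * S * B
    probs = vec[:n_prob].reshape(S, S, C)
    conf = vec[n_prob:n_prob + n_conf].reshape(S, S, B)
    boxes = vec[n_prob + n_conf:].reshape(S, S, B, 4)
    return probs, conf, boxes


def confident_boxes(vec, threshold=0.2):
    """Keep boxes whose best class score (class prob * confidence) > threshold."""
    probs, conf, boxes = split_yolo_output(vec)
    scores = probs[:, :, None, :] * conf[:, :, :, None]   # shape (S, S, B, C)
    best_class = scores.argmax(axis=-1)
    best_score = scores.max(axis=-1)
    keep = best_score > threshold
    return boxes[keep], best_class[keep], best_score[keep]
```

A full pipeline would follow this with non-maximum suppression to remove overlapping boxes for the same object.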
I hope this gives you an understanding of these networks. To answer your question on how their performance differs:
On the same dataset: 'You can be sure that the performance of these networks is in the order they are mentioned, with YOLO being the best and R-CNN being the worst'
Given SAME IMAGE SIZE, the run time: Faster R-CNN achieved much better speeds and state-of-the-art accuracy. It is worth noting that although later models did a lot to increase detection speed, few managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However, researchers have used YOLO for video segmentation, and it is by far the best and fastest when it comes to video segmentation.
Support for android porting: As far as I know, TensorFlow has some Android APIs for porting, but I am not sure how these networks will perform, or even whether you will be able to port them. That again depends on the hardware and data size. Can you please provide the hardware and the size so that I can answer more clearly?
The youtube video tagged by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu Medium blog.
If you are interested in these algorithms, you should take a look at this lesson, which goes through the algorithms you named: https://www.youtube.com/watch?v=GxZrEKZfW2o.
PS: There is also a Fast YOLO if I remember well haha !
I have been working with YOLO and FRCNN a lot. To me the YOLO has the best accuracy and speed but if you want to do research on image processing, I will suggest FRCNN as many previous works are done with it, and to do research you really want to be consistent.
For object detection, I am trying SSD + MobileNet. It balances accuracy and speed, so it can be ported to Android devices easily with good fps.
It is less accurate than Faster R-CNN but faster than the other algorithms.
It also has good support for Android porting.

How to increase validation accuracy with deep neural net?

I am trying to build an 11-class image classifier with 13000 training images and 3000 validation images. I am using a deep neural network trained with MXNet. Training accuracy keeps increasing and has reached above 80%, but validation accuracy stays in the range of 54-57% and is not increasing.
What can be the issue here? Should I increase the number of images?
The issue here is that your network stops learning useful general features at some point and starts adapting to peculiarities of your training set (overfitting it as a result). You want to 'force' your network to keep learning useful features, and you have a few options here:
Use weight regularization. It tries to keep weights low which very often leads to better generalization. Experiment with different regularization coefficients. Try 0.1, 0.01, 0.001 and see what impact they have on accuracy.
Corrupt your input (e.g., randomly substitute some pixels with black or white). This way you remove information from your input and 'force' the network to pick up on important general features. Experiment with noising coefficients which determines how much of your input should be corrupted. Research shows that anything in the range of 15% - 45% works well.
Expand your training set. Since you're dealing with images you can expand your set by rotating / scaling etc. your existing images (as suggested). You could also experiment with pre-processing your images (e.g., mapping them to black and white, grayscale etc. but the effectiveness of this technique will depend on your exact images and classes)
Pre-train your layers with denoising criteria. Here you pre-train each layer of your network individually before fine-tuning the entire network. Pre-training 'forces' layers to pick up on important general features that are useful for reconstructing the input signal. Look into auto-encoders, for example (they've been applied to image classification in the past).
Experiment with network architecture. Your network might not have sufficient learning capacity. Experiment with different neuron types, number of layers, and number of hidden neurons. Make sure to try compressing architectures (fewer neurons than inputs) and sparse architectures (more neurons than inputs).
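As an illustration of the input-corruption option in the list above, here is a minimal NumPy sketch that replaces a random fraction of pixels with black or white. It assumes 8-bit images stored as height x width (grayscale) or height x width x channels arrays; the function name and the default fraction are hypothetical.

```python
import numpy as np


def corrupt_pixels(image, fraction=0.15, rng=None):
    """Return a copy of a uint8 image with ~fraction of pixels set to 0 or 255."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    h, w = out.shape[:2]
    # Choose which pixels to corrupt, and a black/white value for each pixel.
    mask = rng.random((h, w)) < fraction
    values = rng.choice(np.array([0, 255], dtype=out.dtype), size=(h, w))
    if out.ndim == 3:
        out[mask] = values[mask][:, None]   # same value on every channel
    else:
        out[mask] = values[mask]
    return out


# Example on a synthetic mid-gray image; apply only to training inputs.
image = np.full((32, 32, 3), 128, dtype=np.uint8)
noisy = corrupt_pixels(image, fraction=0.3, rng=np.random.default_rng(0))
```

The corruption would be re-sampled each time an image is drawn during training, while validation images stay untouched.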
Unfortunately, the process of training a network that generalizes well involves a lot of experimentation and almost brute-force exploration of the parameter space with a bit of human supervision (you'll see many research works employing this approach). It's good to try 3-5 values for each parameter and see if it leads you somewhere.
When you experiment, plot accuracy / cost / F1 as a function of the number of iterations and see how it behaves. Often you'll notice a peak in accuracy on your test set, and after that a continuous drop. So apart from a good architecture, regularization, corruption, etc., you're also looking for the number of iterations that yields the best results.
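One simple way to pick that number of iterations is to record the validation accuracy after each evaluation and stop once it has not improved for a while. The sketch below works on a plain list of recorded accuracies; the function name and the `patience` parameter are hypothetical, and in practice the model weights would be checkpointed at the best point.

```python
def best_iteration(val_accuracy, patience=3):
    """Return (index, accuracy) of the peak, stopping after `patience`
    consecutive evaluations without improvement."""
    best_idx, best_acc = 0, float("-inf")
    for i, acc in enumerate(val_accuracy):
        if acc > best_acc:
            best_idx, best_acc = i, acc
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations
    return best_idx, best_acc


# Example with made-up validation accuracies: a peak, then a slow drop.
history = [0.50, 0.54, 0.57, 0.56, 0.55, 0.54, 0.53]
print(best_iteration(history))
```

A small patience stops early but can miss a later improvement, so it is worth checking against the plotted curve.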
One more hint: make sure each training epoch randomizes the order of the images.
This clearly looks like a case where the model is overfitting the training set, as the validation accuracy improved step by step until it got stuck at a particular value. If the learning rate were a bit higher, you would have ended up seeing validation accuracy decrease while training accuracy kept increasing.
Increasing the size of the training set is the best solution to this problem. You could also try applying different transformations (flipping, cropping random portions from a slightly bigger image) to the existing image set and see if the model learns better.
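The two transformations mentioned above (flipping and random cropping from a slightly bigger image) can be sketched with NumPy alone. The function name, the 256-to-224 sizes and the 50% flip probability are assumptions for illustration; most frameworks ship equivalent augmentation utilities.

```python
import numpy as np


def random_flip_and_crop(image, crop_size, rng=None):
    """Randomly crop a (height, width) patch and flip it horizontally half the time."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    ch, cw = crop_size
    if ch > h or cw > w:
        raise ValueError("crop_size must not exceed the image size")
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    patch = image[top:top + ch, left:left + cw]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]   # horizontal flip
    return patch.copy()


# Example: take a random 224x224 view of a 256x256 RGB image.
image = np.zeros((256, 256, 3), dtype=np.uint8)
view = random_flip_and_crop(image, (224, 224), rng=np.random.default_rng(0))
print(view.shape)  # (224, 224, 3)
```

Because a new crop and flip are drawn every time, the network rarely sees exactly the same pixels twice, which makes it harder to memorize the training set.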