Computer vision detection on small objects gives bad results, why? - deep-learning

Currently, for detection (localisation + recognition) tasks in computer vision we mainly use deep learning algorithms. Two types of detectors exist:
one stage: SSD, YOLO, RetinaNet, ...
two stage: R-CNN, Fast R-CNN and Faster R-CNN, for example
Using these detectors on very small objects (10 pixels, for example) is a very challenging task, and it seems that one-stage algorithms are worse than two-stage algorithms. But I do not really understand why it works better with Faster R-CNN, for example. In fact, both one-stage and two-stage detectors use the anchor concept, and most of them use the same backbone, like VGG16 or ResNet-50/ResNet-101, which means the receptive field is the same. For example, I tried to detect very small objects with RetinaNet and with Faster R-CNN: with RetinaNet, small objects are not detected, contrary to Faster R-CNN. I do not understand why. What is the theoretical explanation (same backbone: ResNet-50)?

I think networks like RetinaNet are, in general, trying to bridge the gap you mention. In one-stage networks we usually place anchor boxes of varying scales on the feature maps produced by the backbone. These feature maps are produced by heavily downsampling the input image, so a lot of information about small objects can be lost in the process. Two-stage detectors start from the same feature maps, but because of the flexibility of the RPN, they may still propose regions that are small, and this can help them perform slightly better than their one-stage counterparts.
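To make the downsampling point concrete, here is a minimal sketch (my own illustration, not from the thread) that compares a 10-pixel object with the feature-map strides and base anchor sizes commonly used by RetinaNet-style FPN heads; the specific numbers are the usual defaults and are assumptions for illustration only.

```python
# A minimal sketch (my own illustration, not from the thread): compare a 10 px object
# against the feature-map strides and base anchor sizes commonly used by
# RetinaNet-style FPN heads. The numbers below are the usual defaults and are
# assumptions for illustration only.

object_size = 10  # object side length in pixels, as in the question

# FPN levels P3..P7 with (stride, base anchor size) as in the common RetinaNet setup
fpn_levels = {"P3": (8, 32), "P4": (16, 64), "P5": (32, 128),
              "P6": (64, 256), "P7": (128, 512)}

for level, (stride, anchor) in fpn_levels.items():
    cells = object_size / stride    # how many feature-map cells the object spans
    ratio = object_size / anchor    # object size relative to the base anchor at this level
    print(f"{level}: stride {stride:>3} -> object spans {cells:.2f} cells, "
          f"{ratio:.2f}x the base anchor size")
```

Even on the finest level the object covers barely one cell and is only about a third of the smallest base anchor, so its IoU with every anchor stays low and the one-stage head rarely assigns it a positive anchor, whereas the RPN proposals plus RoI pooling in a two-stage detector can still recover such regions.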
I don't think you should be surprised that both of these can use the same backbone: after the convolutional features are extracted, the two networks use different methods to perform the detection.
Hope this helps. Let me know if I wasn't clear enough or if you have questions.

Related

What is an efficient way for a non-ML expert to get compressed versions of one ML model in multiple sizes?

I'm not an ML expert and have only a little background in the field.
I know that there are techniques to reduce the size of neural networks, like distillation and pruning, but I don't know how to perform those techniques efficiently.
Now I need to solve a quite practical problem. I'd like to ship FaceNet, the face recognition model, to mobile devices. There may be trade-offs between recognition accuracy and performance + size. I don't know which model size would best fit my needs. I think I need to test models of many sizes and figure out empirically which one is best. To do this, I should obtain models of many sizes by compressing the pre-trained FaceNet from its website. For example, a 30 MB version of FaceNet, a 40 MB version of FaceNet, etc.
However, model compression is not free and costs money. I'm worried that I'd do something very stupid and expensive. What is the recommended way to do this? Is there a common solution to this problem for non-ML experts?
Network "compression" is not like a slider in which you can move anywhere from slow/accurate to fast/not-accurate.
From your starting model you can apply some techniques which may or may not reduce the size of the network and may or may not reduce also the accuracy. For example converting weights from float32 to float16 will for sure reduce the memory requirement of the network by half, but can also introduce a small reduction in accuracy and it's not supported on all devices.
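As a concrete illustration of the float32 -> float16 point, here is a minimal sketch (assumed example, not from the answer) using PyTorch and a small torchvision model as a stand-in for FaceNet; whether half-precision inference is actually supported still depends on the target device.

```python
# Minimal sketch (assumed example): casting weights to float16 halves the parameter
# memory; a small torchvision model stands in for the real model here.
import torchvision

model = torchvision.models.mobilenet_v3_small(weights=None)  # stand-in for FaceNet

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model.half()  # cast all parameters to float16 in place
fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(f"float32 parameters: {fp32_bytes / 1e6:.1f} MB")
print(f"float16 parameters: {fp16_bytes / 1e6:.1f} MB")
```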
My suggestion is: start with the base model and perform some tests with it. Understand how far you are from the target FPS and size for your application, and then decide which approach makes more sense to reach the objective you have in mind.
Since you said you're not an ML expert, I think this kind of task (at least to my knowledge) requires some amount of study of the subject and some experimentation; I would start with an open-source solution such as https://tvm.apache.org/.

One box object detection

I am using a Faster R-CNN model to predict one object in an image. There can only be one object in each image.
Is it possible to force Faster R-CNN to train and predict as if it should only find one object per image?
Yes, it mostly depends on the data you train on.
But I don't think Faster R-CNN is the best solution for this case: it's a "brute force" solution if there is only one object. If the data is really complex and such a big object detection model is worth it, also try modern convolution-based object detection architectures like YOLO or SSD.
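As a small illustration of one practical route (my own sketch, not from the answer), you can keep the detector unchanged and simply retain only the highest-scoring detection per image as post-processing:

```python
# Minimal sketch: enforce "one object per image" by keeping only the top-scoring box.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # placeholder; use a real image tensor scaled to [0, 1]

with torch.no_grad():
    out = model([image])[0]      # dict with 'boxes', 'labels', 'scores'

if len(out["scores"]) > 0:
    best = out["scores"].argmax()
    print("single detection:", out["labels"][best].item(), out["boxes"][best].tolist())
else:
    print("nothing detected above the model's internal score threshold")
```

During training you would still provide exactly one ground-truth box per image, which is what the "it depends on the data" part of the answer refers to.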

Object detection from synthetic to real life data with Yolov5

Currently trying yolov5 with custom synthetic data. The dataset we've created consists of 8 different objects. Each object has a minimum of 1500 pictures/labels, where the pictures are split 500/500/500 between normal / fog / distractors around the object. Sample images from the dataset are in the first imgur link. The model is not trained from scratch, but from the standard yolov5 .pt weights.
So far we've tried:
Adding more data (from 300 images per object to 4500)
Creating more complex data (distractors on/around objects)
Running multiple runs of training
Trained with network size small, medium, large, xlarge
Different batch size between 4-32 (depending on model size)
Everything so far has resulted in good/great detection on synthetic data, but the results are completely off when the model is used on real-life data.
Examples: it thinks that whole pictures of unrelated objects are a paper box, that walls are pallets, etc. Quick sample images are in the last imgur link.
Anyone got clues on how to improve the training or the data to be better suited for real-life detection? Or how to better interpret the results? I don't understand how the model draws the conclusion that a whole picture, with unrelated objects, is a box/pallet.
Results from training uploaded to imgur:
https://imgur.com/a/P0TQeBl
Example on real life data:
https://imgur.com/a/SGY7w8w
There are a couple of things you can do to improve the results.
After training your model with synthetic data, fine-tune it on real training data with a smaller learning rate (1/10th, maybe); a sketch of this follows below. This will reduce the gap between synthetic and real-life images. In some cases, rather than fine-tuning, training the model on mixed (synthetic + real) data produces better results.
Generate images structurally similar to real-life examples. For example, put humans inside forklifts, or pallets or barrels on the forks, etc. Models learn from it.
Randomize the texture of the items you want to detect. Models tend to focus on textures for detection. By randomizing textures, with lots of variability including non-natural appearances, you force the model to learn to identify objects by more than their textures. Although the texture of an object is sometimes a good identifier, synthetic data suffers from not replicating that feature well enough, hence the domain gap, so you reduce its impact on the model's decision.
I am not sure whether the screenshot accurately represents your data-generation distribution; if so, you have to randomize the angles, sizes and occlusion amounts of the objects more.
As distractors, use objects that you don't want to detect but that will appear in the images you will run inference on, rather than simple shapes like spheres.
Randomize the lighting more: intensity, color, angles, etc.
Increase background and ground randomization. Use HDRIs; there are lots of free ones available.
Balance your dataset.
https://imgur.com/a/LdCa8aO
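As a sketch of the first suggestion above (continue from the synthetic-data weights on real data with roughly a tenth of the learning rate), here is a self-contained illustration; it uses a torchvision Faster R-CNN and random tensors as stand-ins, since a real YOLOv5 run would do this through its own training script and weights file, and the checkpoint path and learning rates below are hypothetical.

```python
# Minimal sketch (my own illustration, not from the answer): continue training from the
# synthetic-data weights on real data, with the learning rate reduced ~10x.
import torch
import torchvision

# 8 object classes + background; stand-in detector for the YOLOv5 model in the question
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=9)
# detector.load_state_dict(torch.load("synthetic_run_weights.pth"))  # hypothetical checkpoint

pretrain_lr = 1e-2               # learning rate assumed for the synthetic pre-training
finetune_lr = pretrain_lr / 10   # ~1/10th of it for the real-data stage
optimizer = torch.optim.SGD(detector.parameters(), lr=finetune_lr, momentum=0.9)

# placeholder "real" batch: 2 images with one labelled box each
images = [torch.rand(3, 480, 640) for _ in range(2)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])} for _ in range(2)]

detector.train()
loss_dict = detector(images, targets)   # torchvision detectors return a dict of losses in train mode
loss = sum(loss_dict.values())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```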
Checking your results, the answer is that your synthetic data is way too dissimilar from the real-life data you want it to work on. Try to generate synthetic scenes that are closer to their real-life counterparts and train again; that would clearly improve your results. That includes more realistic backgrounds and scene compositions. I don't know whether your training set resembles the validation images you shared here, but in case it does, try to have more objects per image, closer to the camera, and add variation to their relative positions. Having just one random 3D object in the middle of an image is not going to give good results. By the way, you are already overfitting your models, so more training images wouldn't help at this point.

Why do CNNs usually have a stem?

Most cutting-edge/famous CNN architectures have a stem that does not use the same kind of block as the rest of the network; instead, most architectures use plain Conv2d or pooling layers in the stem, without special modules/layers such as a shortcut (residual), an inverted residual, a ghost conv, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?
Examples of stems:
classic ResNet: Conv2d + MaxPool
Bag of Tricks ResNet-C: 3x Conv2d + MaxPool
Even though 2 Conv2d layers can form exactly the same structure as a classic residual block, as shown in [figure 2], there is no shortcut in the stem.
Many other architectures show the same pattern, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.
References:
https://arxiv.org/abs/1812.01187
https://arxiv.org/abs/1512.03385
As far as I know, this is done in order to quickly downsample the input image with strided convolutions of a fairly large kernel size (5x5 or 7x7), so that the subsequent layers can do their work effectively at a much lower computational cost.
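For reference, the classic ResNet stem mentioned in the question is just a strided 7x7 convolution plus a max-pool, with no shortcut; here is a minimal PyTorch sketch of it (my own, not from the thread):

```python
# Classic ResNet stem: 7x7 stride-2 conv + 3x3 stride-2 max-pool, no residual shortcut.
# A 224x224 input is reduced to a 56x56 feature map before the first residual block.
import torch
from torch import nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 224 -> 112
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 112 -> 56
)

x = torch.rand(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56])
```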
This is because these specialized modules can do no more than plain convolutions; the difference lies in the trainability of the resulting architecture. For example, the skip connections in ResNet are meant to bypass some layers while those layers are still so poorly trained that they do not propagate useful information from the input to the output. However, once fully trained, the skip connections could in theory be removed (or folded in), since the information can then propagate through the layers that would otherwise be skipped. But when you are using a backbone that you don't intend to train yourself, it does not make sense to include architectural features that are aimed at trainability. Instead, you can "compress" the backbone, leaving only relatively fundamental operations, and freeze all the weights. This saves computational cost both when training the head and in the final deployment.
Stem layers work as a compression mechanism over the initial image.
This leads to a fast reduction in the spatial size of the activations, reducing memory and computational costs.

Which is best for object localization among R-CNN, Fast R-CNN, Faster R-CNN and YOLO

What is the difference between R-CNN, Fast R-CNN, Faster R-CNN and YOLO in terms of the following:
(1) Precision on same image set
(2) Given SAME IMAGE SIZE, the run time
(3) Support for android porting
Considering these three criteria, which is the best object localization technique?
R-CNN is the daddy algorithm of all the algorithms mentioned; it really paved the way for researchers to build more complex and better algorithms on top of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consists of 3 simple steps (a short sketch of the pipeline follows the list):
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
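Here is a minimal sketch of that per-region structure (my own illustration, not from the answer); `random_proposals` is only a placeholder for Selective Search, just to show the "one full CNN forward pass per proposal" pattern that makes R-CNN slow.

```python
# Minimal sketch of the R-CNN recipe above: crop every region proposal out of the
# image and run a CNN on each crop independently. `random_proposals` stands in for
# Selective Search (not reimplemented here).
import torch
import torchvision

cnn = torchvision.models.resnet18(weights=None)  # per-region feature extractor (stand-in)
cnn.eval()

image = torch.rand(3, 480, 640)

def random_proposals(n, w=640, h=480):
    """Placeholder for Selective Search: n random (x1, y1, x2, y2) boxes."""
    x1 = torch.randint(0, w - 32, (n,)); y1 = torch.randint(0, h - 32, (n,))
    x2 = x1 + torch.randint(16, 32, (n,)); y2 = y1 + torch.randint(16, 32, (n,))
    return torch.stack([x1, y1, x2, y2], dim=1)

with torch.no_grad():
    for box in random_proposals(8):                       # only 8 here; R-CNN uses ~2000
        x1, y1, x2, y2 = box.tolist()
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)        # crop the proposal
        crop = torch.nn.functional.interpolate(crop, size=(224, 224))  # warp to a fixed size
        features = cnn(crop)                              # one full CNN forward pass per region
# In the real pipeline these per-region features then go to an SVM (classification)
# and a linear regressor (box refinement).
```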
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. It is faster and better by virtue of the following points:
Performing feature extraction over the whole image before proposing regions, thus running only one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively it makes a lot of sense to drop the 2000 per-region forward passes and instead run the convolution once and build the boxes on top of that.
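Here is a minimal sketch of that idea (my own illustration, not from the answer): one backbone pass over the full image, then RoI pooling cuts per-proposal features out of the shared feature map instead of re-running the CNN per proposal.

```python
# Minimal sketch of the Fast R-CNN idea: one backbone pass, then RoI pooling.
import torch
import torchvision
from torchvision.ops import roi_pool

# resnet18 without its avgpool/fc layers, used as a stand-in backbone (stride 32)
backbone = torch.nn.Sequential(*list(torchvision.models.resnet18(weights=None).children())[:-2])
backbone.eval()

image = torch.rand(1, 3, 480, 640)
with torch.no_grad():
    feature_map = backbone(image)     # single forward pass -> (1, 512, 15, 20)

# proposals in image coordinates, format (batch_index, x1, y1, x2, y2)
proposals = torch.tensor([[0, 40.0, 60.0, 200.0, 220.0],
                          [0, 300.0, 100.0, 420.0, 260.0]])

# 7x7 pooled features per proposal, taken from the shared feature map
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 32)
print(pooled.shape)   # torch.Size([2, 512, 7, 7]) -> fed to the classifier + box regressor heads
```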
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow Selective Search algorithm, and Faster R-CNN introduced something called the Region Proposal Network (RPN).
Here is how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are.
The 2k scores represent the softmax probability of each of the k bounding boxes containing an "object". Notice that although the RPN outputs bounding-box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
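A minimal sketch of that RPN head (my own illustration, not from the answer; the channel sizes follow the "256-d" example above):

```python
# Minimal sketch of the RPN head: a 3x3 conv slides over the backbone feature map,
# then two 1x1 convs produce, for k anchors per location, 2k objectness scores
# and 4k box coordinates.
import torch
from torch import nn

k = 9                      # anchors per location (3 scales x 3 aspect ratios is the usual choice)
in_channels, mid_channels = 512, 256

rpn_conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)  # 3x3 sliding window
objectness = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)                 # 2k scores (object / not object)
box_deltas = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)                 # 4k box coordinates

feature_map = torch.rand(1, in_channels, 15, 20)   # stand-in backbone output
h = torch.relu(rpn_conv(feature_map))
scores = objectness(h)     # (1, 2k, 15, 20)
coords = box_deltas(h)     # (1, 4k, 15, 20)
print(scores.shape, coords.shape)
# Anchors whose objectness clears a threshold are then passed to the Fast R-CNN head.
```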
YOLO:
YOLO uses a single CNN for both classifying and localising objects with bounding boxes.
At the end you get a tensor of shape 1470, i.e. 7*7*30, and the structure of the CNN output is as follows.
The 1470-dimensional output vector is divided into three parts, giving the class probabilities, the confidences and the box coordinates. Each of these three parts is further divided into 49 small regions, corresponding to the predictions at the 49 cells that tile the original image.
In a post-processing step, we take this 1470-dimensional output from the network and generate the boxes whose probability is higher than a certain threshold.
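To make the 1470 = 7*7*30 layout concrete, here is a minimal sketch (my own illustration, not from the answer) that splits such a vector into the three parts described above, assuming the original YOLOv1 PASCAL VOC configuration (S=7 grid, B=2 boxes, C=20 classes); the 0.2 threshold is just an example value.

```python
# Minimal sketch of the YOLOv1 output layout: S*S*(C + B*5) = 7*7*(20 + 2*5) = 1470,
# laid out as class probabilities, then box confidences, then box coordinates.
import torch

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes (YOLOv1 defaults)
output = torch.rand(S * S * (C + B * 5))    # stand-in for the network output, shape (1470,)

class_probs = output[: S * S * C].reshape(S, S, C)                   # 7x7x20 class probabilities
confidences = output[S * S * C : S * S * (C + B)].reshape(S, S, B)   # 7x7x2 box confidences
boxes = output[S * S * (C + B) :].reshape(S, S, B, 4)                # 7x7x2x4 box coordinates

# keep boxes whose class-specific score clears a threshold (simplified, no NMS here)
threshold = 0.2
scores = confidences.unsqueeze(-1) * class_probs.unsqueeze(2)        # (7, 7, 2, 20) per-class scores
keep = scores.max(dim=-1).values > threshold
print("boxes above threshold:", int(keep.sum()))
```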
I hope this gives you an understanding of these networks. To answer your question on how their performance differs:
On the same dataset: 'You can be sure that the performance of these networks is in the order they were mentioned, with YOLO being the best and R-CNN being the worst.'
Given the SAME IMAGE SIZE, the run time: Faster R-CNN achieved much better speeds and state-of-the-art accuracy. It is worth noting that although later models did a lot to increase detection speed, few managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However, researchers have used YOLO for video segmentation, and it is by far the best and fastest when it comes to video segmentation.
Support for Android porting: As far as my knowledge goes, TensorFlow has some Android APIs for porting to Android, but I am not sure how these networks will perform, or even whether you will be able to port them at all. That again depends on the hardware and the data size. Could you provide the hardware and the size, so that I can answer more clearly?
The YouTube video tagged by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu's Medium blog.
If you are interested in these algorithms, you should take a look at this lesson, which goes through the algorithms you named: https://www.youtube.com/watch?v=GxZrEKZfW2o.
PS: There is also a Fast YOLO, if I remember well, haha!
I have been working with YOLO and FRCNN a lot. To me, YOLO has the best accuracy and speed, but if you want to do research on image processing, I would suggest FRCNN, since a lot of previous work has been done with it, and for research you really want to be consistent.
For object detection, I am trying SSD + MobileNet. It has a good balance of accuracy and speed, so it can also be ported to Android devices easily with good FPS.
It has less accuracy than Faster R-CNN but more speed than the other algorithms.
It also has good support for Android porting.
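As a small pointer (my own sketch, not from the answer), torchvision ships an SSDLite320 + MobileNetV3 detector that can stand in for the SSD + MobileNet combination mentioned above; actual Android deployment would additionally require exporting it (e.g. to TorchScript or a mobile runtime), and the 0.5 score threshold below is just an example.

```python
# Minimal sketch: load a lightweight SSD-style MobileNet detector and run inference.
import torch
import torchvision

model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")
model.eval()

image = torch.rand(3, 320, 320)          # placeholder input in [0, 1]
with torch.no_grad():
    detections = model([image])[0]       # dict with 'boxes', 'labels', 'scores'

keep = detections["scores"] > 0.5
print(detections["boxes"][keep], detections["labels"][keep])
```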