Image classification under heavy occlusion and background camouflage - deep learning

I am doing a project on image classification, classifying various species of bamboo.
The problems on Kaggle usually come with well-labeled pictures of single, clearly framed subjects.
But the issue with bamboo is that in most images it appears in clusters, sometimes with more than one species per image, and there is a lot of heavy occlusion and background camouflage.
On top of that, there is not much training data available for this problem.
So I have been building my own dataset by collecting images from the internet and also taking photos with my DSLR.
My first approach was to use a weighted Mask R-CNN for instance segmentation and then classify the instances using VGGNet and GoogLeNet.
My next approach is to test Attention U-Net, YOLOv3 and BCNet, a recent paper from ICLR 2021,
and then classify with ResNeXt, GoogLeNet and SENet and compare the results.
Any tips or better approaches are much appreciated.

Related

Retraining or fine-tuning a network in Caffe with new images of existing categories

I'm quite new to Caffe, so this might be a nonsensical question.
I have trained my network from scratch. It trains well and gets reasonable accuracy in tests. My question is about retraining or fine-tuning this network. Suppose you have new sample images from the same original categories and you want to teach the net with these new images (because, for example, the net fails to predict correctly on these particular images).
As far as I know, it is possible to resume training from a snapshot and solverstate, or to fine-tune using only the weights of the trained model. Which is the best option in this case? Or is it better to retrain the net with the original images and the new ones together?
Think of a possible "incremental training" scheme, because not all the cases for a particular category are available in the initial training. Is it possible to retrain the net with only the new samples? Should I change the learning rate, or keep certain parameters fixed, in order to maintain the original prediction accuracy? The net should behave the same on the original image set after fine-tuning.
Thanks in advance.

Which is best for object localization among R-CNN, Fast R-CNN, Faster R-CNN and YOLO?

What is the difference between R-CNN, Fast R-CNN, Faster R-CNN and YOLO in terms of the following:
(1) Precision on the same image set
(2) Run time, given the same image size
(3) Support for Android porting
Considering these three criteria, which is the best object localization technique?
R-CNN is the daddy of all the algorithms mentioned; it really paved the way for researchers to build more complex and better algorithms on top of it.
R-CNN, or Region-based Convolutional Neural Network, consists of 3 simple steps (a rough sketch of the pipeline follows the list):
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
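To make the three steps concrete, here is a rough sketch of the pipeline, assuming opencv-contrib-python (for Selective Search), torchvision and scikit-learn are available; the input file name and the untrained SVM are placeholders, not the paper's original code:

```python
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

# 1) Selective Search generates ~2000 region proposals
image = cv2.imread("input.jpg")                    # hypothetical input file
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
proposals = ss.process()[:2000]                    # (x, y, w, h) boxes

# 2) A CNN turns each cropped proposal into a feature vector
backbone = models.alexnet(weights="DEFAULT")
backbone.classifier = backbone.classifier[:-1]     # drop the final class layer
backbone.eval()
preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

features = []
with torch.no_grad():
    for (x, y, w, h) in proposals:
        crop = cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
        features.append(backbone(preprocess(crop).unsqueeze(0)).squeeze(0).numpy())

# 3a) An SVM scores each region from its features
#     (the real R-CNN trains one binary SVM per class)
svm = LinearSVC()   # would be fit on labelled region features
# 3b) A separate linear regressor would then tighten the box coordinates
```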
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. It is faster and better by virtue of the following points:
Performing feature extraction over the entire image before proposing regions, thus running only one CNN over the whole image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively it makes a lot of sense to drop the 2000 per-region convolutions, run the convolution once over the whole image, and build boxes on top of that single feature map (see the sketch below).
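A minimal sketch of that idea in PyTorch, using torchvision's `roi_pool` operator; the backbone, input size and proposal coordinates are illustrative assumptions:

```python
import torch
import torchvision

# One conv pass over the whole image instead of ~2000 per-region passes
backbone = torchvision.models.vgg16(weights="DEFAULT").features
image = torch.randn(1, 3, 600, 600)                # dummy input image
feature_map = backbone(image)                      # (1, 512, 18, 18)

# Proposals in (x1, y1, x2, y2) image coordinates, e.g. from Selective Search
proposals = [torch.tensor([[0.0, 0.0, 300.0, 300.0],
                           [100.0, 100.0, 500.0, 400.0]])]

# RoI pooling maps every proposal to a fixed-size feature tensor;
# spatial_scale converts image coordinates to feature-map coordinates
# (torchvision's full VGG16 feature stack downsamples by 32)
pooled = torchvision.ops.roi_pool(feature_map, proposals,
                                  output_size=(7, 7), spatial_scale=1 / 32)
print(pooled.shape)   # (2, 512, 7, 7) -> fed to FC layers + softmax heads
```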
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow Selective Search algorithm, so Faster R-CNN introduced something called the Region Proposal Network (RPN).
Here is how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates of that box are.
The 2k scores represent the softmax probability of each of the k bounding boxes being an "object". Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is proposing object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates get passed forward as a region proposal.
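Here is an illustrative PyTorch version of the RPN head just described: a 3x3 conv maps the backbone features to 256-d, then two 1x1 convs emit the 2k objectness scores and 4k box coordinates per location. The channel counts and k=9 are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, mid_channels=256, k=9):
        super().__init__()
        # 3x3 sliding window over the feature map, mapped to 256-d
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # 2 scores per anchor: object / not-object
        self.objectness = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        # 4 box coordinates per anchor
        self.box_deltas = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.objectness(x), self.box_deltas(x)

head = RPNHead()
scores, boxes = head(torch.randn(1, 512, 38, 50))   # dummy backbone features
print(scores.shape, boxes.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```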
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN: a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
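If you just want to use Faster R-CNN rather than rebuild it, torchvision ships the whole RPN + Fast R-CNN pipeline as a single pretrained model, for example:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 600, 800)])  # dummy RGB image in [0, 1]
print(predictions[0].keys())   # dict with 'boxes', 'labels', 'scores'
```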
YOLO:
YOLO uses a single CNN for both classification and localising objects with bounding boxes.
In the end you get an output vector of length 1470, i.e. 7×7×30.
This 1470-value output is divided into three parts, giving the class probabilities, confidences and box coordinates. Each of these three parts is further divided into 49 small regions, corresponding to the predictions at the 49 grid cells covering the original image.
In post-processing, we take this 1470-value output from the network and keep the boxes whose probability is above a certain threshold (see the sketch below).
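As a hedged sketch of that post-processing for the original YOLO (v1) on PASCAL VOC: S=7 grid, B=2 boxes per cell, C=20 classes, hence 7×7×(2×5+20) = 1470 values. The split order below follows common YOLO v1 reimplementations, so check your network's actual layout:

```python
import numpy as np

S, B, C = 7, 2, 20
output = np.random.rand(1470)                  # dummy network output

class_probs = output[:S * S * C].reshape(S, S, C)               # 980 values
confidence = output[S * S * C:S * S * C + S * S * B].reshape(S, S, B)  # 98 values
boxes = output[S * S * C + S * S * B:].reshape(S, S, B, 4)      # 392 values

# Per-box, per-class score = class probability * box confidence
scores = confidence[..., None] * class_probs[:, :, None, :]     # (7, 7, 2, 20)

threshold = 0.2
keep = scores > threshold    # boxes[i, j, b] with any True entry survive
# ...followed by non-maximum suppression on the surviving boxes
```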
I hope this gives you an understanding of these networks. To answer your question on how their performance differs:
On the same dataset: you can be fairly confident the performance of these networks is in the order they are mentioned above, with YOLO the best and R-CNN the worst.
Given the same image size, the run time: Faster R-CNN achieved much better speeds and state-of-the-art accuracy. It is worth noting that although later models did a lot to increase detection speed, few managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However, researchers have used YOLO for video segmentation, where it is by far the best and fastest.
Support for Android porting: as far as my knowledge goes, TensorFlow has some Android APIs for porting models, but I am not sure how these networks will perform, or even whether you will be able to port them at all. That again depends on the hardware and data size. Could you provide the hardware and data size so that I can answer more clearly?
The YouTube video linked by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu's Medium blog post.
If you are interested in these algorithms, you should take a look at this lesson, which goes through the algorithms you named: https://www.youtube.com/watch?v=GxZrEKZfW2o.
P.S. There is also a Fast YOLO, if I remember correctly, haha!
I have been working with YOLO and Faster R-CNN (FRCNN) a lot. To me, YOLO has the best accuracy and speed, but if you want to do research on image processing, I would suggest FRCNN, since a lot of previous work was done with it, and for research you really want to be consistent.
For object detection, I am trying SSD + MobileNet. It strikes a balance between accuracy and speed, so it can also be ported to Android devices easily with good FPS.
It is less accurate than Faster R-CNN but faster than the other algorithms.
It also has good support for Android porting.

I am trying out a yes/no classification of an image using a CNN.

Is it possible to determine the features of the image from the hidden layers that will lead to "yes"?
Say, for example, I train the CNN with 1000 images; I would then like to know which features in the intermediate hidden layers actually lead to an image finally being tagged with a "yes".
Is it possible?
And also, roughly how many training examples are required for a binary classification CNN to converge?
Is it possible to determine the features of the image from the hidden layers that will lead to "yes"?
Yes, it is. Have a look at
Zeiler, M.D. and Fergus, R., 2014, September. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer International Publishing.
Summary
There are three main ideas:
Training data argmax method: pump your data through the network and, for the neuron you are interested in, record which inputs caused the highest activation.
Occlusion sensitivity analysis: cover a part of the image and push the occluded image through the network. How did the score change? If it stayed about the same, the important features are likely not in that part of the image (see the sketch after this list).
Gradient methods: train a "reconstruction network" which reconstructs the activations. Then set the neuron you are interested in to maximum activation and the rest to no activation, and reconstruct what could cause this behaviour.
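As an example of the second idea, here is a minimal occlusion-sensitivity sketch, assuming a PyTorch classifier `model` and an already-preprocessed image tensor; the patch size, stride and gray fill value are arbitrary choices:

```python
import torch

def occlusion_map(model, image, target_class, patch=16, stride=8):
    """Slide a gray square over the image; record how the class score drops."""
    model.eval()
    _, h, w = image.shape                    # image: (C, H, W), normalized
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        base = model(image.unsqueeze(0))[0, target_class].item()
        for i in range(rows):
            for j in range(cols):
                y, x = i * stride, j * stride
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = 0.5   # gray patch
                score = model(occluded.unsqueeze(0))[0, target_class].item()
                heatmap[i, j] = base - score  # big drop => important region
    return heatmap
```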

Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park, and I already have 10,000 pictures taken at amusement parks. I would like to train the model with those pictures so that it can score other pictures by the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations (embeddings) from an input image. Here you can use siamese training to explicitly teach the network similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). Given such embeddings, you can use well-established methods for outlier detection, for instance a one-class SVM, or any other classifier (in the latter case you again need negative examples).
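A hedged sketch of that embedding route, using a pretrained ResNet-50 as the feature extractor and scikit-learn's one-class SVM; the siamese fine-tuning step is omitted and the input tensors are stand-ins for your real, normalized images:

```python
import torch
import torchvision.models as models
from sklearn.svm import OneClassSVM

backbone = models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()        # output 2048-d embeddings
backbone.eval()

def embed(batch):                        # batch: (N, 3, 224, 224), normalized
    with torch.no_grad():
        return backbone(batch).numpy()

park_images = torch.randn(100, 3, 224, 224)   # stand-in for your 10k photos
clf = OneClassSVM(gamma="scale", nu=0.1).fit(embed(park_images))

query = torch.randn(1, 3, 224, 224)
print(clf.decision_function(embed(query)))    # higher => more park-like
```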
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
In general, your success in this task strongly depends on the task statement (are you restricted to amusement parks only, or to any kind of place?) and on proper data.

Building an Image search engine using Convolutional Neural Networks

I am trying to implement an image search engine using AlexNet (https://github.com/akrizhevsky/cuda-convnet2).
The idea is to implement an image search engine by training a neural net to classify images and then using the code from the net's last hidden layer as a similarity measure.
I am trying to figure out how to train the CNN on a new set of images to classify them. Does anyone know how to get started with this?
Thanks
You basically have two approaches to your problem:
Either you have plenty of good training data (>1M images) and dozens of GPUs, and you retrain the network from scratch using SGD with the classes you need for your queries.
Or you don't, and then you simply truncate a pretrained AlexNet (exactly where you truncate it is for you to choose) and feed it your images (possibly resized to fit the network's input: 227x227x3 if I am not mistaken).
Then from each image you get a feature vector (sometimes called a descriptor), and you use those feature vectors to train a linear SVM for your images and your specific task.
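A sketch of that second approach, using torchvision's pretrained AlexNet instead of the cuda-convnet2 code linked in the question (the idea is the same); note that torchvision's AlexNet takes 224x224 inputs rather than the original 227, and the SVM here is left unfitted:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import LinearSVC

net = models.alexnet(weights="DEFAULT")
net.classifier = net.classifier[:-1]    # truncate: stop at the 4096-d fc7 layer
net.eval()

preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def descriptor(pil_image):
    """4096-d feature vector (descriptor) for one PIL image."""
    with torch.no_grad():
        return net(preprocess(pil_image).unsqueeze(0)).squeeze(0).numpy()

# Train a linear SVM on the descriptors of your labelled images...
svm = LinearSVC()   # svm.fit(train_descriptors, train_labels)
# ...or rank search results by cosine similarity between descriptors.
```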