Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it would be able to give a score for other pictures of the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!

There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations (embeddings) from an input image. Here you can use siamese training to explicitly train the network to learn similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). Given such embeddings, you can use well-established outlier detection methods, for instance a one-class SVM, or any other classifier. In this case you also need negative examples.
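To make the "score from embeddings" step concrete, here is a minimal sketch. It deliberately swaps in something much simpler than a one-class SVM: cosine similarity to the mean embedding of the known amusement-park images. The embeddings are random vectors standing in for the output of a trained network, and `park_score`, the 64-dimensional size, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def park_score(query_emb, reference_embs):
    """Cosine similarity between a query embedding and the centroid of the
    reference embeddings, mapped from [-1, 1] to [0, 1]."""
    centroid = l2_normalize(reference_embs.mean(axis=0))
    cos = float(l2_normalize(query_emb) @ centroid)
    return 0.5 * (cos + 1.0)

rng = np.random.default_rng(0)

# Hypothetical embeddings: park images cluster around one direction.
direction = rng.normal(size=64)
reference = direction + 0.3 * rng.normal(size=(1000, 64))

similar = direction + 0.3 * rng.normal(size=64)   # looks like a park image
unrelated = rng.normal(size=64)                   # looks like something else

score_similar = park_score(similar, reference)
score_unrelated = park_score(unrelated, reference)
print(score_similar, score_unrelated)
```

The score is a similarity in [0, 1], not a calibrated probability; turning it into one would need held-out data with negative examples.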
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
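As a rough sketch of crop-based augmentation (NumPy only; the function name `random_crops` and the image and crop sizes are assumptions, not anything prescribed above):

```python
import numpy as np

def random_crops(image, crop_h, crop_w, n_crops, rng):
    """Return n_crops random sub-windows of an H x W x C image array."""
    h, w = image.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed image size")
    crops = []
    for _ in range(n_crops):
        # rng.integers excludes the upper bound, hence the +1.
        top = rng.integers(0, h - crop_h + 1)
        left = rng.integers(0, w - crop_w + 1)
        crops.append(image[top:top + crop_h, left:left + crop_w])
    return np.stack(crops)

rng = np.random.default_rng(0)
image = rng.random((256, 320, 3))        # stand-in for a real photo
crops = random_crops(image, 224, 224, n_crops=5, rng=rng)
print(crops.shape)
```

Each original picture then yields several training samples that all keep the "amusement park" label.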
In general, your success in this task strongly depends on the task statement (are you restricted to amusement parks only, or to any kind of place?) and on having the proper data.

Related

Object detection from synthetic to real life data with Yolov5

I am currently trying YOLOv5 with custom synthetic data. The dataset we've created consists of 8 different objects. Each object has a minimum of 1500 pictures/labels, split 500/500/500 between normal, fog, and distractors around the object. Sample images from the dataset are in the first imgur link. The model is not trained from scratch, but from the standard YOLOv5 .pt checkpoint.
So far we've tried:
Adding more data (from 300 images per object, to 4500)
Creating more complex data (distractors on/around objects)
Running multiple runs of training
Trained with network size small, medium, large, xlarge
Different batch size between 4-32 (depending on model size)
Everything so far has resulted in good to great detection on synthetic data, but detection is completely off on real-life data.
Examples: the model thinks that whole pictures of unrelated objects are a paperbox, that walls are pallets, etc. Quick sample images are in the last imgur link.
Does anyone have clues on how to improve the training or the data so that it is better suited for real-life detection? Or on how to better interpret the results? I don't understand how the model draws the conclusion that a whole picture, with unrelated objects, is a box/pallet.
Results from training uploaded to imgur:
https://imgur.com/a/P0TQeBl
Example on real life data:
https://imgur.com/a/SGY7w8w

There are a couple of things you can do to improve results.
After training your model on synthetic data, fine-tune it on real training data with a smaller learning rate (maybe 1/10th). This will reduce the gap between synthetic and real-life images. In some cases, training the model on a mix of synthetic and real data produces better results than fine-tuning.
Generate images structurally similar to real-life examples. For example, put humans inside forklifts, or pallets or barrels on forks, etc. Models learn from it.
Randomize the texture on items that you want to detect. Models tend to focus on textures for detection. By randomizing textures with lots of variability, including non-natural occurrences, you force the model to learn to identify objects not based on their textures. Although the texture of an object is sometimes a good identifier, synthetic data does not replicate that feature well enough, hence the domain gap, so you reduce its impact on the model's decision.
I am not sure whether the screenshots accurately represent your data generation distribution. If they do, you have to randomize the angles, sizes, and occlusion amounts of the objects more.
As distractors, use objects that you don’t want to detect but that will appear in the images you run inference on, rather than simple shapes like spheres.
Randomize lighting more: intensity, color, angles, etc.
Increase background and ground randomization. Use HDRIs; there are lots of free ones available.
Balance your dataset.
https://imgur.com/a/LdCa8aO
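For the last point, one quick way to check balance is to count annotations per class and derive per-sample weights that equalize the classes. This is only a sketch: the label list below is made up, and `balancing_weights` is a hypothetical helper, not part of YOLOv5.

```python
from collections import Counter

def balancing_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    scaled so every class contributes the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return [total / (n_classes * counts[y]) for y in labels]

# Hypothetical, deliberately unbalanced annotation list.
labels = ["pallet"] * 6 + ["paperbox"] * 3 + ["barrel"] * 1
weights = balancing_weights(labels)

per_class_total = Counter()
for label, weight in zip(labels, weights):
    per_class_total[label] += weight

print(Counter(labels))
print(dict(per_class_total))
```

Weights like these could drive a weighted sampler, or simply tell you which classes need more generated images.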

Checking your results, the answer is that your synthetic data is way too dissimilar to the real-life data you want it to work on. Generating synthetic scenes that are closer to their real-life counterparts and training again should clearly improve your results. That includes more realistic backgrounds and scene compositions. I don't know if your training set resembles the validation images you shared here, but in case it does, try to have more objects per image, closer to the camera, and add variation to their relative positions. Having just one random 3D object in the middle of an image is not going to provide good results. By the way, you are already overfitting your models, so more training images wouldn't help at this point.

How to choose which pre-trained weights to use for my model?

I am a beginner, and I am very confused about how to choose a pre-trained model that will improve my model.
I am trying to create a cat breed classifier using the pre-trained weights of a model, let's say VGG16 trained on a digits dataset. Will that improve the performance of the model? Or will training my model on just my own dataset, without any other weights, be better? Or will both be the same, since the pre-trained weights are just a starting point?
Also, if I use the weights of a VGG16 trained on cat vs dog data as a starting point for my cat breed classification model, will that help me improve the model?

Since you've mentioned that you are a beginner I'll try to be a bit more verbose than normal so please bear with me.
How neural models recognise images
The layers in a pre-trained model store multiple aspects of the images they were trained on, such as patterns (lines, curves) and colours within the image, which the model uses to decide whether an image belongs to a specific class.
With each layer, the complexity of what it can store increases. Initially it captures lines, dots, or simple curves, but with each layer the representation power increases and it starts capturing features like cat ears, dog faces, curves in a number, etc.
The image below from the Keras blog shows how the initial layers learn to represent simple things like dots and lines, and how deeper layers learn to represent more complex patterns.
Read more about convnet filters at the Keras blog here.
How does using a pretrained model give better results?
When we train a model from scratch, we spend a lot of compute and time initially creating these representations, and to get to them we also need quite a lot of data; otherwise we might not capture all relevant features and our model might not be as accurate.
So when we say we want to use a pre-trained model, we mean we want to reuse these representations. If we use a model trained on ImageNet, which has lots of cat pictures, we can be fairly confident that the model already has representations for the important features required to identify a cat, and it will converge to a better point than if we started from random weights.
How to use pre-trained weights
So when we say to use pre-trained weights, we mean: keep the layers which hold the representations needed to identify cats, but discard the last layers (dense and output) and add fresh dense and output layers with random weights instead. That way our predictions can make use of the representations already learned.
In practice, we freeze the pre-trained weights during the initial training, because we do not want the gradients from the randomly initialized new layers to ruin the learned representations. We only unfreeze the pre-trained layers at the end, after we have reached a good classification accuracy, to fine-tune them, and even then with a very small learning rate.
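The first phase (frozen representations, fresh trainable head) can be illustrated without any deep-learning framework. The sketch below is a toy stand-in, not a real pre-trained network: `W_frozen` plays the role of the pre-trained layers and is never updated, while a new logistic-regression head is trained on top of it. All names, sizes, and the toy task are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained layers: weights that are never updated.
W_frozen = rng.normal(size=(20, 8))
W_frozen_before = W_frozen.copy()

def features(x):
    """Frozen feature extractor (plays the role of the pre-trained layers)."""
    return np.tanh(x @ W_frozen)

def log_loss(p, y, eps=1e-12):
    """Mean binary cross-entropy."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Toy binary task standing in for the new classification problem.
x = rng.normal(size=(200, 20))
y = (x[:, 0] > 0).astype(float)

# Fresh head: the only parameters that gradient descent touches.
w_head = np.zeros(8)
b_head = 0.0
lr = 0.1

h = features(x)  # can be computed once, because the extractor is frozen
losses = []
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(h @ w_head + b_head)))
    losses.append(log_loss(p, y))
    grad = p - y                      # gradient of the loss w.r.t. the logits
    w_head -= lr * (h.T @ grad) / len(y)
    b_head -= lr * grad.mean()

print(losses[0], losses[-1])
```

The second phase described above would then also update `W_frozen`, with a much smaller learning rate than the one used for the head.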
Which kind of pre-trained model to use
Always choose the pre-trained weights that you know contain the most representations that can help you identify the classes you are interested in.
So will weights trained on MNIST digits give relatively bad results compared with weights trained on ImageNet?
Yes, but given that the initial layers have already learned simple patterns like lines and curves from the digits, using these weights will in most cases still put you at an advantage compared with starting from scratch.

Sane weight initialization
Which pre-trained weights to choose depends on the type of classes you wish to classify. Since you wish to classify cat breeds, use pre-trained weights from a classifier that was trained on a similar task. As mentioned in the other answers, the initial layers learn things like edges, horizontal or vertical lines, blobs, etc. As you go deeper, the model starts learning problem-specific features. So for generic tasks you can start from, say, ImageNet weights and then fine-tune them for the problem at hand.
However, having a pre-trained model whose training data closely resembles yours helps immensely. A while ago, I participated in a Scene Classification Challenge where we initialized our model with ResNet50 weights trained on the Places365 dataset. Since the classes in that challenge were all present in the Places365 dataset, we used the weights available here and fine-tuned our model. This gave us a great boost in accuracy and we ended up in the top positions on the leaderboard.
You can find some more details about it in this blog.
Also, understand that one of the advantages of transfer learning is saving computation. Using a model with randomly initialized weights means training a neural net from scratch. If you use VGG16 weights trained on a digits dataset, the model might have already learned something useful, so it will likely save some training time. If you train a model from scratch, it will eventually learn all the patterns that the pre-trained digits classifier weights would have provided.
On the other hand, using weights from a Dog-vs-Cat classifier should give you better performance, as it has already learned features to detect, say, paws, ears, noses, or whiskers.

Could you provide more information, what do you want to classify exactly? I see you wish to classify images, which type of images (containing what?) and in which classes?
As a general remark: if you use a trained model, it must of course fit your need. Keep in mind that a model trained on a given dataset has learned only the information contained in that dataset, and can classify/identify only information analogous to what was in the training dataset.
1. If you want to classify an image containing an animal with a yes/no (binary) classifier (cat or not cat), you should use a model trained on different animals, cats among them.
2. If you want to classify an image of a cat into classes corresponding to cat breeds, you should use a model trained only on cat images.
I would suggest using a pipeline consisting of step 1 followed by step 2.

It really depends on the size of the dataset you have at hand and on how closely the task and data the model was pre-trained on are related to your task and data. Read more about Transfer Learning at http://cs231n.github.io/transfer-learning/ or, if your task is the same, about Domain Adaptation.
"I am trying to create a cat breed classifier using pre-trained weights of a model, let's say VGG16 trained on digits dataset, will that improve the performance of the model?"
Some general characteristics learned from digits, like edge detection, could still be useful for your target task, so the answer here is maybe. Here you can try training just the top layers, which is common in computer vision applications.
"Also if I use weights of the VGG16 trained for cat vs dog data as a starting point of my cat breed classification model will that help me in improving the model?"
Your chances should be better if the task and data are more closely related and similar.

How to combine the probability (soft) output of different networks and get the hard output?

I have trained three different models separately in Caffe, and for each I can get the probability of belonging to each class for semantic segmentation. I want to get an output based on the 3 probabilities that I am getting (for example, the argmax of the three probabilities). This can be done by running inference with the net model and deploy.prototxt files. Then, based on the final soft output, the hard output gives the final segmentation.
My questions are:
1. How to get the ensemble output of these networks?
2. How to do end-to-end training of an ensemble of three networks? Are there any resources that could help?
3. How to get the final segmentation based on the final probability (e.g., the argmax of the three probabilities), which is a soft output?
My question may sound very basic, and my apologies for that. I am still trying to learn step by step. I really appreciate your help.

There are two ways (at least that I know of) to solve (1):
One is to use the pycaffe interface: instantiate the three networks, forward an input image through each of them, fetch the outputs, and perform any operation you desire to combine all three probabilities. This is especially useful if you intend to combine them using more complex logic.
The alternative (way less elegant) is to use caffe test and process all your inputs separately through each network, saving the probabilities into files. Then combine the probabilities from the files later.
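Whichever way the probabilities are obtained, the combination step itself is small. The sketch below assumes each network yields an array of shape (num_classes, H, W) and uses plain averaging followed by a per-pixel argmax; the three tiny arrays are made-up stand-ins for real network outputs, and averaging is only one possible combination rule.

```python
import numpy as np

def ensemble_segmentation(prob_maps):
    """Average per-pixel class probabilities from several networks,
    then take the per-pixel argmax.

    prob_maps: list of arrays, each shaped (num_classes, H, W).
    Returns (soft, hard): averaged probabilities and class-index map.
    """
    stacked = np.stack(prob_maps)       # (num_models, num_classes, H, W)
    soft = stacked.mean(axis=0)         # (num_classes, H, W)
    hard = soft.argmax(axis=0)          # (H, W)
    return soft, hard

# Three hypothetical models, 2 classes, a 1x2 "image".
p1 = np.array([[[0.9, 0.4]], [[0.1, 0.6]]])
p2 = np.array([[[0.6, 0.3]], [[0.4, 0.7]]])
p3 = np.array([[[0.2, 0.6]], [[0.8, 0.4]]])

soft, hard = ensemble_segmentation([p1, p2, p3])
print(hard)  # [[0 1]]
```

Here the third model disagrees with the other two on both pixels, and the averaged probabilities side with the majority.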
Regarding your second question, I have never trained more than two weight-sharing CNNs (siamese networks). From what I understood, your networks don't share weights, only the architecture. If you want to train all three end-to-end please take a look at this tutorial made for siamese networks. The authors define in their prototxt both paths/branches, connect each branch's layers to the input Data layer and, at the end, with a loss layer.
In your case you would define the three branches (one for each of your networks), connect with input data layers (check if each branch processes the same input or different inputs, for example, the same image pre-processed differently) and unite them with a loss, similarly to the tutorial.
Now, for the last question, it seems Caffe has an ArgMax layer that may be what you are looking for. If you are familiar with Python, you could also use a Python layer, which allows you to define with great flexibility how to combine the output probabilities.

Is it possible to forward the output of a deep-learning network to another network with caffe / pycaffe?

I am using Caffe, or more precisely pycaffe, to create and train my network. I have a dataset with 5 labels. My idea was to create one network per label that simply outputs the score for one class. After training the 5 networks, I want to compare their outputs and see which one has the highest score.
Sadly, I only know how to create one network, not how to let several interact, nor how to apply something like a max function at the end. I have added a picture to describe what I want to do.
Moreover, I do not know if this would give a better outcome than just a normal deep neural network.

I don't see what you expect to have as the input to this "max" function. Even if you use some sort of is / is not boundary training, your approach appears to be an inferior version of the softmax layer available in all popular frameworks.
Yes, you can build a multi-channel model, train each channel with a different data set, and then accept the most confident prediction -- but the result will take longer and be less accurate than a cooperative training pass. Your five channels wind up negotiating their boundaries after they've made other parametric assumptions.
Feed a single model all the information available from the outset; you'll get faster convergence and more accurate classification.
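For reference, this is what a softmax output over five classes computes. It is a NumPy sketch with made-up scores, not Caffe code:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores of one model for the 5 labels.
scores = np.array([1.0, 3.0, 0.5, 2.0, -1.0])
probs = softmax(scores)
predicted = int(np.argmax(probs))
print(probs, predicted)
```

Because all five scores come from one jointly trained model, the probabilities are normalized against each other, which is exactly the comparison the separate networks would have to be forced into afterwards.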

Any visualizations of neural network decision process when recognizing images?

I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressive than I once thought.
This led me to wonder if it is possible to visualize a NN's decision process, at least in simple cases.
For example, if my input is 20x20 greyscale image (i.e. total 400 features) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this may be implemented as visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how neural network worked in a particular case.
Is there some neural network demo that does just that?

This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activations of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparent: the last layer is a radial basis function layer (with Gaussian functions), i.e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
CNNs are still artificial neural networks, but their layers are not fully connected and some neurons share the same weights.
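To see why a first-layer activation can be displayed as a 2D image, here is a small sketch of a single 2D filter applied to an image. The kernel is set by hand (a Sobel-like vertical-edge detector) purely for illustration; in a CNN these nine coefficients would be learned.

```python
import numpy as np

def filter2d(image, kernel):
    """Valid 2D cross-correlation, the operation CNN "convolution" layers compute."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-set vertical-edge kernel.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

response = filter2d(image, kernel)
print(response[0])  # [0. 4. 4. 0.]
```

The response map is itself a (smaller) image that lights up only where the dark-to-bright edge is, which is the kind of picture shown for first-layer activations.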

Maybe it doesn't answer the question directly, but I found this interesting piece in this paper by Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen and Greg Corrado (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take the neuron that performs best at recognizing faces and
select images from the dataset that cause it to output the highest confidence;
mathematically find an image (not in the dataset) that would get the highest confidence.
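Both ideas can be tried on a toy neuron. The sketch below is not the paper's setup: the "neuron" is a single linear unit over a flattened 20x20 input with random weights, the dataset is random, and the unit-norm constraint on the optimized input is an assumption added so that the maximization is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical neuron: a linear unit over a flattened 20x20 input.
w = rng.normal(size=400)

def activation(x):
    """Response of the neuron to one input (or to each row of a batch)."""
    return x @ w

# Method 1: the most responsive stimuli within a dataset.
dataset = rng.normal(size=(500, 400))
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)
top_k = np.argsort(activation(dataset))[::-1][:5]

# Method 2: numerical optimization of the input itself
# (projected gradient ascent under a unit-norm constraint).
x = rng.normal(size=400)
x /= np.linalg.norm(x)
for _ in range(200):
    x = x + 0.1 * w            # the gradient of x @ w with respect to x is w
    x /= np.linalg.norm(x)     # project back onto the unit sphere

cosine_to_w = float(x @ w / np.linalg.norm(w))
print(top_k, cosine_to_w)
```

For a linear unit the optimized input simply aligns with the weight vector; with a deep network the same loop would use backpropagated gradients and can get stuck in local optima, as the quoted passage notes.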
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, here are generated “optimal input” images for cat heads and human bodies:
