How to perform polynomial landmark detection with deep learning - deep-learning

I am trying to build a system to segment vehicles using a deep convolutional neural network. I am familiar with predicting a set amount of points (i.e. ending a neural architecture with a Dense layer with 4 neurons to predict 2 points(x,y) coords for both). However, vehicles come in many different shapes and sizes and one vehicle may require more segmentation points than another. How can I create a neural network that can have different amounts of output values? I imagine I could use a RNN of some sort but would like a little guidance. Thank you
For example, in the following image the two vehicles have a different number of labeled keypoints.

Related

How necessary are activation functions after dense layer in neural networks?

I'm currently training multiple recurrent convolutional neural networks with deep q-learning for the first time.
Input is a 11x11x1 matrix, each network consists of 4 convolutional layer with dimensions 3x3x16, 3x3x32, 3x3x64, 3x3x64. I use stride=1 and padding=1. Each convLayer is followed by ReLU activation. The output is fed into a feedforward fully-connected dense layer with 128 units and after that into an LSTM layer, also containing 128 units. Two following dense layer produce separate advantage and value steams.
So training is running for a couple of days now and now I've realized (after I've read some related paper), I didn't add an activation function after the first dense layer (as in most of the papers). I wonder if adding one would significantly improve my network? Since I'm training the networks for university, I don't have unlimited time for training, because of a deadline for my work. However, I don't have enough experience in training neural networks, to decide on what to do...
What do you suggest? I'm thankful for every answer!
If I have to talk in general using an activation function helps you to include some non-linear property in your network.
The purpose of an activation function is to add some kind of non-linear property to the function, which is a neural network. Without the activation functions, the neural network could perform only linear mappings from inputs x to the outputs y. Why is this so?
Without the activation functions, the only mathematical operation during the forward propagation would be dot-products between an input vector and a weight matrix. Since a single dot product is a linear operation, successive dot products would be nothing more than multiple linear operations repeated one after the other. And successive linear operations can be considered as a one single learn operation.
A neural network without any activation function would not be able to realize such complex mappings mathematically and would not be able to solve tasks we want the network to solve.

How to choose which pre-trained weights to use for my model?

I am a beginner, and I am very confused about how we can choose a pre-trained model that will improve my model.
I am trying to create a cat breed classifier using pre-trained weights of a model, lets say VGG16 trained on digits dataset, will that improve the performance of the model? or if I train my model just on the database without using any other weights will be better, or will both be the same as those pre-trained weights will be just a starting point.
Also if I use weights of the VGG16 trained for cat vs dog data as a starting point of my cat breed classification model will that help me in improving the model?
Since you've mentioned that you are a beginner I'll try to be a bit more verbose than normal so please bear with me.
How neural models recognise images
The layers in a pre-trained model store multiple aspects of the images they were trained on like patterns(lines, curves), colours within the image which it uses to decide if an image is of a specific class or not
With each layer the complexity of what it can store increases initially it captures lines or dots or simple curves but with each layer, the representation power increases and it starts capturing features like cat ears, dog face, curves in a number etc.
The image below from Keras blog shows how initial layers learn to represent simple things like dots and lines and as we go deeper they start to learn to represent more complex patterns.
Read more about Conv net Filters at keras's blog here
How does using a pretrained model give better results ?
When we train a model we waste a lot of compute and time initially creating these representations and in order to get to those representations we need quite a lot of data too else we might not be able to capture all relevant features and our model might not be as accurate.
So when we say we want to use a pre-trained model we want to use these representations so if we use a model trained on imagenet which has lots of cat pics we can be sure that the model already has representations to identify important features required to identify a cat and will converge to a better point than if we used random weights.
How to use pre-trained weights
So when we say to use pre-trained weights we mean use the layers which hold the representations to identify cats but discard the last layer (dense and output) and instead add fresh dense and output layers with random weights. So our predictions can make use of the representations already learned.
In real life we freeze our pretrained weights during the initial training as we do not want our random weights at the bottom to ruin the learned representations. we only unfreeze the representations in the end after we have a good classification accuracy to fine-tune them, and that too with a very small learning rate.
Which kind of pre-trained model to use
Always choose those pretrained weights that you know has the most amount of representations which can help you in identifying the class you are interested in.
So will using a mnist digits trained weights give relatively bad results when compared with one trained on image net?
Yes, but given that the initial layers have already learned simple patterns like lines and curves for digits using these weights will still put you at an advantage when compared to starting from scratch in most of the cases.
Sane weight initialization
The pre-trained weights to choose depends upon the type of classes you wish to classify. Since, you wish to classify Cat Breeds, use pre-trained weights from a classifier that is trained on similar task. As mentioned by the above answers the initial layers learn things like edges, horizontal or vertical lines, blobs, etc. As you go deeper, the model starts learning problem specific features. So for generic tasks you can use say imagenet & then fine-tune it for the problem at hand.
However, having a pre-trained model which closely resembles your training data helps immensely. A while ago, I had participated in Scene Classification Challenge where we initialized our model with the ResNet50 weights trained on Places365 dataset. Since, the classes in the above challenge were all present in the Places365 dataset, we used the weights available here and fine-tuned our model. This gave us a great boost in our accuracy & we ended up at top positions on the leaderboard.
You can find some more details about it in this blog
Also, understand that the one of the advantages of transfer learning is saving computations. Using a model with randomly initialized weights is like training a neural net from scratch. If you use VGG16 weights trained on digits dataset, then it might have already learned something, so it will definitely save some training time. If you train a model from scratch then it will eventually learn all the patterns which using a pre-trained digits classifier weights would have learnt.
On the other hand using weights from a Dog-vs-Cat classifier should give you better performance as it already has learned features to detect say paws, ears, nose or whiskers.
Could you provide more information, what do you want to classify exactly? I see you wish to classify images, which type of images (containing what?) and in which classes?
As a general remark : If you use a trained model, it must fit your need, of course. Keep in mind that a model which was trained on a given dataset, learned only the information contained in that dataset and can classify / indentify information analogous to the one in the training dataset.
If you want to classify an image containing an animal with a Y/N (binary) classifier, (cat or not cat) you should use a model trained on different animals, cats among them.
If you want to classify an image of a cat into classes corresponding to cat races, let's say, you should use a model trained only on cats images.
I should say you should use a pipeline, containing steps 1. followed by 2.
it really depends on the size of the dataset you have at hand and how related the task and data that the model was pretrained on to your task and data. Read more about Transfer Learning http://cs231n.github.io/transfer-learning/ or Domain Adaptation if your task is the same.
I am trying to create a cat breed classifier using pre-trained weights of a model, lets say VGG16 trained on digits dataset, will that improve the performance of the model?
There are general characteristics that are still learned from digits like edge detection that could be useful for your target task, so the answer here is maybe. You can here try just training the top layers which is common in computer vision applications.
Also if I use weights of the VGG16 trained for cat vs dog data as a starting point of my cat breed classification model will that help me in improving the model?
Your chances should be better if the task and data are more related and similar

What is the role of fully connected layer in deep learning?

What is the role of fully connected layer (FC) in deep learning? I've seen some networks have 1 FC and some have 2 FC and some have 3 FC. Can anyone explain to me?
Thanks a lot
The fully connected layers are able to very effectively learn non-linear combinations of input features. Let's take a convolutional neural network for example.
The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-linear) function in that space.

Which is best for object localization among R-CNN, fast R-CNN, faster R-CNN and YOLO

what is the difference between R-CNN, fast R-CNN, faster R-CNN and YOLO in terms of the following:
(1) Precision on same image set
(2) Given SAME IMAGE SIZE, the run time
(3) Support for android porting
Considering these three criteria which is the best object localization technique?
R-CNN is the daddy-algorithm for all the mentioned algos, it really provided the path for researchers to build more complex and better algorithm on top of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consist of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN was immediately followed R-CNN. Fast R-CNN is faster and better by the virtue of following points:
Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively it makes a lot of sense to remove 2000 conv layers and instead take once Convolution and make boxes on top of that.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm and Faster R-CNN introduced something called Region Proposal network(RPN).
Here’s is the working of the RPN:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being on “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
YOLO:
YOLO uses a single CNN network for both classification and localising the object using bounding boxes. This is the architecture of YOLO :
In the end you will have a tensor of shape 1470 i.e 7*7*30 and the structure of the CNN output will be:
The 1470 vector output is divided into three parts, giving the probability, confidence and box coordinates. Each of these three parts is also further divided into 49 small regions, corresponding to the predictions at the 49 cells that form the original image.
In postprocessing steps, we take this 1470 vector output from the network to generate the boxes that with a probability higher than a certain threshold.
I hope you get the understanding of these networks, to answer your question on how the performance of these network differs:
On the same dataset: 'You can be sure that the performance of these networks are in the order they are mentioned, with YOLO being the best and R-CNN being the worst'
Given SAME IMAGE SIZE, the run time: Faster R-CNN achieved much better speeds and a state-of-the-art accuracy. It is worth noting that although future models did a lot to increase detection speeds, few models managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However researchers have used YOLO for video segmentation and by far its the best and fastest when it comes to video segmentation.
Support for android porting: As far as my knowledge goes, Tensorflow has some android APIs to port to android but I am not sure how these network will perform or even will you be able to port it or not. That again is subjected to hardware and data_size. Can you please provide the hardware and the size so that I will be able to answer it clearly.
The youtube video tagged by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu Medium blog.
If your are interested in these algorithms you should take a look into this lesson which go through the algoritmhs you named : https://www.youtube.com/watch?v=GxZrEKZfW2o.
PS: There is also a Fast YOLO if I remember well haha !
I have been working with YOLO and FRCNN a lot. To me the YOLO has the best accuracy and speed but if you want to do research on image processing, I will suggest FRCNN as many previous works are done with it, and to do research you really want to be consistent.
For Object detection, I am trying SSD+ Mobilenet. It has a balance of accuracy and speed So it can also be ported to android devices easily with good fps.
It has less accuracy compared to faster rcnn but more speed than other algorithms.
It also has good support for android porting.

Is it possible to forward the output of a deep-learning network to another network with caffe / pycaffe?

I am using caffe, or more likely pycaffe to train and create my network. I am having a dataset with 5 labels at the end. I had the idea to create one network for each label that can just simply say the score for one class. After having then trained 5 networks I want to compare the outputs of the networks and which one has the highest score.
Sadly I do only know how to create one network , but not how to let them interact and moreover how to do something like a max function at the end. I add a picture to describe what I want to do.
Moreover, I do not know if this would have a better outcome than just a normal deep neuronal network.
I don't see what you expect to have as the input to this "max" function. Even if you use some sort of is / is not boundary training, your approach appears to be an inferior version of the softmax layer available in all popular frameworks.
Yes, you can build a multi-channel model, train each channel with a different data set, and then accept the most confident prediction -- but the result will take longer and be less accurate than a cooperative training pass. Your five channels wind up negotiating their boundaries after they've made other parametric assumptions.
Feed a single model all the information available from the outset; you'll get faster convergence and more accurate classification.