Mask RCNN Completely Overlapping IOU Instance Segmentation - deep-learning

I am training a model with Mask RCNN.
I have images with two masks where one mask lies completely inside the other. Does this affect the model in any negative way? I have read in the literature that Mask R-CNN handles overlapping (high-IoU) instances well.

Related

Embedded feature vector components: position-dependent?

I have a question about the position of vector components of embedded features.
After the embedding layer, we usually flatten the embedded features to generate a 1D-shaped vector.
Assume we do not use CNN or RNN layers to process the embedded features, but go directly to the fully-connected layers to do classification:
I wonder if the position of every vector component would affect the meaning of the whole vector to the machine-learning algorithm (e.g. in Keras)?
Background of my question: In NLP, sentences have different lengths, and the positions of words with similar meaning do not always align. After vectorizing n-grams of words with embeddings, I wonder if it is possible to feed the flattened embedded vector directly into the dense layer for classification without a CNN or RNN layer, which would increase the size of my network and potentially cause overfitting to the training set.
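To make the setup concrete, here is a minimal Keras sketch of that architecture: Embedding, Flatten, then dense layers only. The vocabulary size, sequence length, embedding dimension, and class count are made-up placeholders. Note that in this architecture every slot of the flattened vector connects to the dense layer through its own weights, so the position of each component does carry meaning and the inputs need consistent padding/alignment.

    # Minimal sketch: embed token ids, flatten, classify with dense layers only.
    # vocab_size, seq_len, embed_dim, num_classes are placeholder values.
    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size, seq_len, embed_dim, num_classes = 10000, 50, 64, 5

    model = keras.Sequential([
        keras.Input(shape=(seq_len,)),            # padded/truncated token ids
        layers.Embedding(vocab_size, embed_dim),  # (seq_len,) -> (seq_len, embed_dim)
        layers.Flatten(),                         # -> (seq_len * embed_dim,)
        layers.Dense(128, activation="relu"),     # each flattened slot has its own weights
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()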

Determining position of anchor boxes in original image using downsampled feature map

From what I have read, I understand that methods used in Faster R-CNN and SSD involve generating a set of anchor boxes. We first downsample the training image using a CNN, and for every pixel in the downsampled feature map (which will form the center for our anchor boxes) we project it back onto the training image. We then draw the anchor boxes centered around that pixel using our pre-determined scales and ratios. What I don't understand is why we don't directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN to only output the classification and regression values. What are we gaining by using the CNN to determine the centers of our anchor boxes, which are ultimately going to be distributed evenly on the training image?
To state it more clearly:
Where will the centers of our anchor boxes be on the training image before our first prediction of the offset values and how do we decide those?
I think the confusion comes from this:
What are we gaining by using the CNN to determine the centers of our anchor boxes which are ultimately going to be distributed evenly on the training image
The network usually doesn't predict centers but corrections to a prior belief. The initial anchor centers are distributed evenly across the image and, as such, don't fit the objects in the scene tightly enough. Those anchors just constitute a prior in the probabilistic sense. What exactly your network outputs is implementation dependent, but it will likely just be updates, i.e. corrections to those initial priors. This means that the centers predicted by your network are some delta_x, delta_y that adjust the bounding boxes.
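For illustration, the evenly spaced prior centers are not predicted at all; they follow directly from the feature-map size and the backbone's downsampling stride. A rough NumPy sketch, with an assumed stride and feature-map size:

    import numpy as np

    stride = 16                # assumed downsampling factor of the backbone
    feat_h, feat_w = 14, 14    # assumed feature-map size (e.g. for a 224x224 input)

    # Project every feature-map cell back to image coordinates: these are the
    # anchor centers, a regular grid fixed before any prediction is made.
    ys = (np.arange(feat_h) + 0.5) * stride
    xs = (np.arange(feat_w) + 0.5) * stride
    centers = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

    print(centers.shape)       # (196, 2): one (x, y) center per feature-map cell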
Regarding this part:
why we don't directly assume the centers of our anchor boxes on the training image with a suitable stride and use the CNN to only output the classification and regression values
The regression values should still contain sufficient information to determine a bounding box in a unique way. Predicting width, height, and center offsets (corrections) is a straightforward way to do it, but it's certainly not the only way. For example, you could modify the network to predict, for each pixel, the distance vector to its nearest object center, or you could use parametric curves. However, leaving the anchor centers crude and fixed is not a good idea, since that will also cause problems in classification, as you use the boxes to pool features that are representative of the object.
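As a rough sketch of the common delta parameterization (NumPy, with made-up numbers, following the usual Faster R-CNN-style encoding; the exact encoding is implementation dependent):

    import numpy as np

    # One anchor as (cx, cy, w, h): an evenly spaced prior, not a prediction.
    anchor_cx, anchor_cy, anchor_w, anchor_h = 128.0, 96.0, 64.0, 64.0

    # Hypothetical network outputs: center shifts scaled by anchor size,
    # width/height corrections as log-scale factors.
    dx, dy, dw, dh = 0.1, -0.05, 0.2, -0.1

    cx = anchor_cx + dx * anchor_w   # shift center by a fraction of anchor width
    cy = anchor_cy + dy * anchor_h   # shift center by a fraction of anchor height
    w = anchor_w * np.exp(dw)        # rescale width
    h = anchor_h * np.exp(dh)        # rescale height

    print(cx, cy, w, h)              # the refined box decoded from the regression values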

Initializing convolutional layers in CNN

Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
All my input images are centered, so pixels further away from the center of an image matter less than pixels closer to the center.
Please see the GIFs here for a demonstration of convolutions:
https://github.com/vdumoulin/conv_arithmetic#convolution-animations
As you can see, convolutions operate the same regardless of the position in the image, so weight initialization cannot change the focus of the image.
It is also not advisable to rush into assumptions about what the net will and won't need in order to learn your task. There is sometimes a surprising amount of signal outside what you as a human might focus on. I would suggest training the net and seeing how it performs, and then (as others have suggested) thinking about cropping.
Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
This is not possible because initialization is only there to kick-start the learning process.
The model itself, however, is what can contain functions that achieve this kind of attention.
You also don't need to initialize convolutional layers by hand, because in PyTorch this is already done automatically.
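If you do want the network to emphasize the image center, that behaviour belongs in the model (or in the data pipeline, e.g. via cropping), not in the initializer. Below is a minimal PyTorch sketch that multiplies the input by a fixed Gaussian mask peaked at the center; the mask and its width are arbitrary illustrative choices, not a standard layer:

    import torch
    import torch.nn as nn

    class CenterWeighted(nn.Module):
        """Scales input pixels by a fixed Gaussian mask peaked at the image center."""
        def __init__(self, height, width, sigma=0.5):
            super().__init__()
            ys = torch.linspace(-1, 1, height).view(-1, 1)
            xs = torch.linspace(-1, 1, width).view(1, -1)
            mask = torch.exp(-(xs**2 + ys**2) / (2 * sigma**2))
            self.register_buffer("mask", mask)   # fixed, not a learned parameter

        def forward(self, x):                    # x: (N, C, H, W)
            return x * self.mask

    model = nn.Sequential(
        CenterWeighted(32, 32),                  # down-weight pixels far from the center
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )
    out = model(torch.randn(4, 3, 32, 32))
    print(out.shape)                             # torch.Size([4, 16, 32, 32])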

Are the middle layers in ResNet even learning?

The skip connections allow the gradient to flow all the way from the 152nd layer back to the first or second layers of the CNN. But what about the middle layers? If backpropagation through these middle layers is totally irrelevant, are we even learning anything in ResNet?
Backpropagation through these middle layers isn't totally irrelevant. The basic argument for the relevance of the middle layers is that ResNet keeps improving its error rate when adding new layers (from 5.71 top-5 error with 34 layers to 4.49 top-5 error with 152). Images have a lot of singularities and complexity, and the folks at Microsoft found that when you take care of the vanishing gradient problem (with the skip connections), you can gain more knowledge throughout the network with more layers.
The idea of adding residual blocks is to prevent the vanishing gradient problem when you stack many layers. But the middle layers are also updated on each training step, and they are also learning (usually high-level features).
Convolutional neural networks with lots of layers tend to overfit if the problem isn't complex enough, since 152 layers have the capacity to learn a lot of different patterns.
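For reference, here is a minimal, simplified sketch of a residual block in PyTorch (not the exact ResNet implementation). The identity skip means the gradient reaches earlier layers both through the block's weights and through the addition, so the middle layers still receive a useful gradient and keep learning:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Simplified residual block: output = ReLU(F(x) + x)."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = torch.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return torch.relu(out + x)   # identity skip: gradient flows through both paths

    block = ResidualBlock(64)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)                       # torch.Size([1, 64, 56, 56])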

How many neurons does the CNN input layer have?

In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with the same number of neurons as the number of pixels in an image (considering a grayscale image). So, my question is: how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong): it shows 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNNs, shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused by the fact that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, there is usually a neuron for each input. So, as you mention in your question, for an image the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel). Note, though, that each input is received by an input neuron only once, unlike in convolutional layers.
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (that would be called fully connected) but only to some section of the input (which must be spatially local). These neurons provide abstractions of small sections of the input data that, taken together over the whole input, form what we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
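As a quick shape check (the numbers are arbitrary), a convolutional layer with several filters produces one 2D feature map per filter, and stacking those maps gives the 3D output volume:

    import torch
    import torch.nn as nn

    # 8 filters over a 3-channel image: each filter's weights are shared across
    # all spatial positions and produce one feature map.
    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

    x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
    y = conv(x)
    print(y.shape)                  # torch.Size([1, 8, 32, 32]): 8 stacked feature maps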
I hope that helped!
EDIT-
Using the example the OP gave to clear up loose ends:
OP's Question:
So imagine we have a (27 x 27) image. And let's say there are 3 filters, each of size (3 x 3). So there are in total 3 x 3 x 3 = 27 parameters (W's). So my question is: how are these neurons connected? Each of the filters has to iterate over the 27 x 27 pixels (neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) for the receptive fields to overlap. With a stride of 2, for example, if the top-left neuron (neuron A) has a 3x3 receptive field, then the neuron to its right (neuron B) also has a 3x3 receptive field, and B's leftmost column of 3 connections takes the same inputs as A's rightmost column.
That being said, since you would like to visualise this, I will stick to your example where there is no overlap and assume that we do not want any padding around the image. If there is an image of resolution 27x27 and we want 3 filters (this is our choice), then each filter will have 81 neurons (a 9x9 2D grid of neurons). Each of these neurons would have 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we would have 243 neurons.
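To check those numbers concretely, here is a small PyTorch sketch of that same setup: 1 input channel, 3 filters of size 3x3, stride 3, no padding, and no bias so the parameter count matches the 27 weights discussed above:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3, stride=3, bias=False)

    x = torch.randn(1, 1, 27, 27)   # a single 27x27 grayscale image
    y = conv(x)

    print(y.shape)                                      # torch.Size([1, 3, 9, 9]): 3 filters, 81 neurons each
    print(sum(p.numel() for p in conv.parameters()))    # 27 = 3 filters x 3 x 3 weights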
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image is showing the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are one-dimensional. They take input from one end and give output from the other. So how do we process 2D images with 1D neural layers? This is where Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
I think there is some confusion about the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped to deal with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have a (27 x 27) image. And let's say there are 3 filters, each of size (3 x 3). So there are in total 3 x 3 x 3 = 27 parameters (W's). So my question is: how are these neurons connected? Each of the filters has to iterate over the 27 x 27 pixels (neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with their connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it) or, in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value, which specifies how far the window shifts to the right and how far down. At each position, the filter weights and the image pixels under the window produce a single new value in the output volume. So to answer your question, at any one position a total of 3x3 = 9 pixels are connected, through the filter's 9 weights, to one output neuron. The same goes for the other 2 filters.
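A small NumPy sketch of that sliding window, using the same made-up sizes (a 27x27 grayscale image, one 3x3 filter, stride 3, no padding):

    import numpy as np

    image = np.random.rand(27, 27)   # grayscale image
    kernel = np.random.rand(3, 3)    # one 3x3 filter (its 9 weights)
    stride = 3

    out = np.zeros((9, 9))           # output size: (27 - 3) / 3 + 1 = 9 per side
    for i in range(0, 27 - 3 + 1, stride):
        for j in range(0, 27 - 3 + 1, stride):
            window = image[i:i + 3, j:j + 3]                         # the 9 pixels under the window
            out[i // stride, j // stride] = np.sum(window * kernel)  # one output value

    print(out.shape)                 # (9, 9)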
Your approach of understanding CNNs through visualisation is correct, but you still need to brush up on your basic understanding of the terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)