How do CNNs process RGB images - deep-learning

In a Convolutional Neural Network, convolution is the operation applied over and over.
It is known that if you take a 5x5 greyscale image (1 channel) and convolve it with a 3x3 filter (containing certain weights), you get a 3x3 feature map as a result, as demonstrated by the usual convolution picture.
But what happens once you extend this notion of convolving to RGB images, where you now have 3 channels (R, G, B) to convolve over? Well, you simply give your filter as many channels as your original image has, right? Let's say we did; convolving an RGB image would then look like the following: a 6x6x3 RGB image convolved with a 3x3x3 filter. This apparently results in a 4x4x1 feature map rather than the 4x4x3 one would expect.
My question is why is this so?
If you surf the internet for visualizations of feature maps, you find images showing some form of colorful low- and high-level features. Are those visualizations of the kernels themselves or of the feature maps? Either way, they all have color, which means they must have more than 1 channel, no?

Look at PyTorch's Conv2d and you'll notice that the size of the kernel is determined not only by its spatial width and height (3x3 in your question), but also by the number of input channels and output channels.
So, if you have an input RGB image (= 3 input channels) and a filter of size 3x3x3 (=a single output channel, for 3 input channels and spatial width/height = 3), then your output would indeed be 4x4x1.
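Here is a minimal sketch of this, assuming a recent PyTorch installation, that reproduces the shapes from your question (a 6x6x3 input and one 3x3x3 filter give a 4x4x1 output):
import torch
import torch.nn as nn

# One filter (out_channels=1) spanning all 3 input channels, spatial size 3x3.
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)
print(conv.weight.shape)     # torch.Size([1, 3, 3, 3]) -> the 3x3x3 filter

x = torch.randn(1, 3, 6, 6)  # batch of one 6x6x3 RGB image (NCHW layout)
y = conv(x)
print(y.shape)               # torch.Size([1, 1, 4, 4]) -> a single 4x4 feature map

# With 16 filters you would get 16 feature maps (a 4x4x16 output volume):
conv16 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=False)
print(conv16(x).shape)       # torch.Size([1, 16, 4, 4])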
You can visualize this filter since you can interpret it as a tiny 3x3 RGB image.
Visualizing features/filters that are deeper in the network is not at all trivial, and the images you see are usually the result of optimization processes designed to "uncover" the filters. This page gives an overview of some intricate methods for feature visualization.

Well, color images have 3 channels by definition, and you can view a color picture as a stack of 3 matrices of values, so two of the channels (say red and blue) could be set to zero. You should also read about the sparsity of a network...

Related

Using padding in cnn layer distorts edges of feature maps

I am trying to implement a CNN for image denoising. As a training set I use noisy image fragments (size 32x32) together with the same-sized clean images. When I ran the trained network on a noisy image, I noticed that the denoised image contains artifacts, like a 32x32-pixel grid. I visualised the feature maps produced by the convolution layers and noticed that layers with zero-padding give distorted edges. I found a topic in which, as a solution to the same problem, a convolution is described whose last step (e.g. for a 3x3 kernel) is to divide the result by 9 or 4 when using zero-padding.
In all articles about convolution operations that I read, this is not mentioned. Does anyone know where I can read more about this?
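For what it's worth, my reading of that suggestion is that each output pixel is divided by the number of real (non-padded) input pixels under the kernel, which evens out the zero-padded border. A minimal sketch of that idea in PyTorch (my own interpretation, not taken from a specific reference):
import torch
import torch.nn.functional as F

def normalized_conv(image, kernel):
    # image: (1, 1, H, W), kernel: (1, 1, k, k)
    pad = kernel.shape[-1] // 2
    out = F.conv2d(image, kernel, padding=pad)
    # Count how many real pixels the kernel covers at each location by
    # convolving a mask of ones with a kernel of ones.
    ones = torch.ones_like(image)
    count = F.conv2d(ones, torch.ones_like(kernel), padding=pad)
    # 9 in the interior, 6 at edges, 4 at corners for a 3x3 kernel.
    return out / count

image = torch.rand(1, 1, 32, 32)
kernel = torch.ones(1, 1, 3, 3)              # e.g. a simple averaging kernel
print(normalized_conv(image, kernel).shape)  # torch.Size([1, 1, 32, 32])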

what is class score map?

I was going through the VGGNet paper and I came across the description of the testing phase.
During the testing phase, the test image goes through the VGGNet and a class score map is obtained. This class score map is spatially averaged to produce a fixed-size vector.
I have googled "class score map", but I couldn't find any relevant results. I wish to know what the role of the class score map is.
Any hint would be greatly helpful. Thanks
When you train an image recognition model, you train it for a specific image size (and resolution), let's say n_dims = [256, 256]. In the prediction phase, however, you have images of different sizes (in terms of pixels), e.g. [1024, 1024]. You extract patches (you can resize the image first by lowering the resolution) and slide your model over the image patches; for each patch you obtain a prediction over all classes (more than one of the objects might be present in a patch), which you then have to average somehow for the whole image at the end.
See OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.
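In other words, the class score map is just per-location class scores obtained by running a fully convolutional network densely over a larger image. A minimal sketch of the idea, using a hypothetical tiny network in PyTorch (not VGG itself):
import torch
import torch.nn as nn

num_classes = 10
# The "FC" part is expressed as a 1x1 convolution, so the network can run on
# images of any size and produces one score vector per spatial location.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, num_classes, kernel_size=1),
)

x = torch.randn(1, 3, 384, 512)       # a test image larger than the training crop
score_map = net(x)                    # class score map: (1, num_classes, 384, 512)
scores = score_map.mean(dim=(2, 3))   # spatial average -> fixed-size (1, num_classes)
print(score_map.shape, scores.shape)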

How many neurons does the CNN input layer have?

In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with the number of neurons equal to the number of pixels in an image (considering a grayscale image). So, my question is: how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong). It says 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNNs, shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, it is usually the case that there is a neuron for each input. So, as you mention in your question, for an image the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel). Note, however, that each input is received by an input neuron only once, unlike in convolutional layers.
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (that would be called fully connected) but only to some section of the input (which must be spatially local). These neurons provide abstractions of small sections of the input data; taken together over the whole input, they form what we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
I hope that helped!
EDIT-
Using the example the op gave to clear up loose ends:
OP's Question:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) that the receptive fields overlap. For example, with 3x3 receptive fields and a stride of 2, the receptive field of the top-left neuron (neuron A) overlaps with the receptive field of the neuron to its right (neuron B): the leftmost 3 connections of neuron B take the same inputs as the rightmost 3 connections of neuron A.
That being said, it seems that you would like to visualise this, so I will stick to your example where there is no overlap and assume that we do not want any padding around the image. Suppose there is an image of resolution 27x27 and we want 3 filters (this is our choice). Then each filter will have 81 neurons (a 9x9 2D grid of neurons). Each of these neurons would have 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we would have 243 neurons.
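If it helps, the counting can be checked with a few lines of Python using the standard output-size formula for a convolution with no padding:
# Output size of a convolution with no padding: (input - kernel) // stride + 1
image_size, kernel_size, stride, num_filters = 27, 3, 3, 3   # stride 3 = no overlap

out = (image_size - kernel_size) // stride + 1    # 9
neurons_per_filter = out * out                    # 81 neurons in the 9x9 grid
total_neurons = neurons_per_filter * num_filters  # 243 neurons
weights_per_filter = kernel_size * kernel_size    # 9 shared weights per filter
print(out, neurons_per_filter, total_neurons, weights_per_filter)  # 9 81 243 9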
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image is showing the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are one dimensional. They take input at one end and give output at the other. So how do we process 2D images in 1D neural layers? This is where Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
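As a concrete illustration of the flattening described above (a small sketch using NumPy):
import numpy as np

image = np.zeros((224, 224, 3))              # (height, width, channels)
# Concatenate successive rows within each channel, then successive channels.
flat = image.transpose(2, 0, 1).reshape(-1)
print(flat.shape)                            # (150528,) = 224 * 224 * 3 input values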
I think you are confused about the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped to deal with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with the connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it) or, in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value which specifies how much the window shifts to the right and how much down. At each position, the filter weights and the image pixels inside the window give a single new value in the new volume that is created. So to answer your question, at any one position a total of 3x3 = 9 pixels are connected, through the filter's 9 weights, to one neuron of that filter. The same holds for the other 2 filters.
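A minimal sketch of this sliding-window picture in plain NumPy (single channel, one filter, no padding), just to show which 9 pixels feed each output value:
import numpy as np

def slide_filter(image, kernel, stride=1):
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The 3x3 window of pixels currently "connected" to this output neuron.
            window = image[i * stride:i * stride + k, j * stride:j * stride + k]
            feature_map[i, j] = np.sum(window * kernel)  # one value per position
    return feature_map

image = np.random.rand(27, 27)
kernel = np.random.rand(3, 3)             # the 9 shared weights of one filter
print(slide_filter(image, kernel).shape)  # (25, 25) with stride 1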
Your approach to understanding CNNs by visualisation is correct. But you still need to brush up on your basic understanding of the terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)

Image from concept of a convolutional neural network

Suppose I have a CNN which is trained to classify images of different animals. The output of such a model is a point (the output point) in an n-dimensional space, where n is the number of animal classes the model is trained on; that output is then transformed into a one-hot vector of n parameters, giving the correct label for the image from the point of view of the CNN. But let's stick with the n-dimensional point, which is the network's concept of the input image.
Suppose then that I want to take that point and transform it so that the final output is an image of fixed width and height (the dimensions should be the same for different input images) which produces the same point as the original input image does. How do I do that?
I'm basically asking for the methods used (mostly for training) for this kind of task, where an image must be reconstructed based on the output point of the CNN. I know the image will never be identical; I'm looking for images that generate the same (or at least not very different) output point as the input image when they are fed to the CNN. Keep in mind that the input of the model I'm asking about is n-dimensional and the output is a two (or three, if it's not in grayscale) dimensional tensor. I noticed that DeepDream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" into Google, an online generator is almost always shown, not the actual techniques; so if there are some answers to this I'd love to hear about them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.
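A minimal sketch of that gradient-ascent idea in PyTorch; the tiny model here is a placeholder standing in for your trained classifier, and you could just as well maximize the L2 norm of an intermediate layer's activations instead of a class score:
import torch
import torch.nn as nn

# Placeholder network; in practice, load your trained CNN here.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
)
model.eval()

target_class = 3
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # start from noise (or a real image)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    score = model(image)[0, target_class]  # the node output we want to maximize
    (-score).backward()                    # gradient ascent on the score
    optimizer.step()
    image.data.clamp_(0, 1)                # keep pixel values in a valid range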

how to format the image data for training/prediction when images are different in size?

I am trying to train my model which classifies images.
The problem I have is that they have different sizes. How should I format my images, or what model architecture should I use?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights, and that's not possible.
If that is your problem, here are some things you can do:
Don't worry about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective really matter for the content?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the bigger work for you.
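For example, a small sketch of that helper (resize_image_with_crop_or_pad is the TF 1.x name; newer TensorFlow versions expose the same operation as tf.image.resize_with_crop_or_pad, used below):
import tensorflow as tf

image = tf.random.uniform([300, 500, 3])  # stand-in for a decoded HxWx3 image
# Center-crops if the image is larger than the target, pads with zeros if smaller.
fixed = tf.image.resize_with_crop_or_pad(image, target_height=224, target_width=224)
print(fixed.shape)  # (224, 224, 3)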
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary size by processing them in a very special way.
Try making a spatial pyramid pooling layer and put it after your last convolution layer so that the FC layers always get constant-dimensional vectors as input. During training, train on the images from the entire dataset using a particular image size for one epoch. Then, for the next epoch, switch to a different image size and continue training.
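A minimal sketch of such a layer in PyTorch, using adaptive pooling; the pyramid levels here are just an example, not the exact configuration from the paper:
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    # Pools the final convolutional feature map at several fixed grid sizes and
    # concatenates the results, so the output length is independent of input size.
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(n) for n in levels)

    def forward(self, x):  # x: (batch, channels, H, W), any H and W
        return torch.cat([p(x).flatten(1) for p in self.pools], dim=1)

spp = SpatialPyramidPooling()
for size in (32, 48, 96):                  # different input sizes...
    feats = torch.randn(2, 64, size, size)
    print(spp(feats).shape)                # ...always (2, 64 * (1 + 4 + 16)) = (2, 1344)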