I'd like to upsample a layer of size (w, h, channels) to size (w', h', channels), but the UpSampling2D layer seems to upsample only to double the size.
Could anybody tell me how to upsample to an arbitrary size?
The Keras UpSampling2D layer can upsample by different factors, not just double. The Keras docs describe the layer as follows:
keras.layers.UpSampling2D(size=(2, 2), data_format=None)
Upsampling layer for 2D inputs.
Repeats the rows and columns of the data by size[0] and size[1] respectively.
The default size is indeed (2, 2), so in that case the upsampling doubles each dimension. By specifying the size argument you can upsample by other factors: for a factor of 3, for example, use size=(3, 3). Note that size takes integers, so this layer can only scale each dimension by a whole-number factor.
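To make "repeats the rows and columns" concrete, here is a minimal pure-Python sketch of that repetition. The helper name upsample_repeat is made up for illustration and is not part of Keras; it only mimics the behaviour described in the docs on a plain nested list.

```python
def upsample_repeat(grid, size=(2, 2)):
    """Repeat each row size[0] times and each column size[1] times.

    Hypothetical helper that imitates nearest-neighbour repetition
    on a 2D list; it is not the Keras implementation.
    """
    row_factor, col_factor = size
    out = []
    for row in grid:
        # Repeat every value along the column axis.
        stretched = [value for value in row for _ in range(col_factor)]
        # Repeat the stretched row along the row axis.
        for _ in range(row_factor):
            out.append(list(stretched))
    return out


small = [[1, 2],
         [3, 4]]
big = upsample_repeat(small, size=(3, 2))  # 2x2 input becomes 6x4
```

Because each factor is an integer, a 2x2 input can become 4x4, 6x4, 6x6 and so on, but never 5x5; for that you would need interpolation-based resizing, cropping or padding.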
Alternatively, you can define your own custom layer if you want something really specific to your case. For example, here is a GitHub issue about creating a custom pooling function (the opposite of an upsampling layer, so easily comparable), which could help if you need such a custom layer.
I was going through the VGGNet paper and came across its testing phase.
During the testing phase, a test image goes through the network and a class score map is obtained. This class score map is spatially averaged to produce a fixed-size vector.
I have googled "class score map" but couldn't find any relevant results. I would like to know what the role of the class score map is.
Any hint would be greatly helpful. Thanks
When you train an image recognition model, you train it for a specific image size (and resolution), let's say n_dims = [256, 256]. In the prediction phase, however, you have images of different sizes (in pixels), e.g. [1024, 1024]. You extract patches (you can resize the image first by lowering the resolution) and slide your model over the image patches. For each patch you obtain a prediction for all classes (more than one object might be present in a patch), and at the end you have to average these predictions somehow over the whole image.
See OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
Instead, we explore the entire image by densely running the network at each location and at multiple
scales. While the sliding window approach may be computationally prohibitive for certain types
of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields
significantly more views for voting, which increases robustness while remaining efficient. The result
of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at
each scale.
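As a rough illustration of the "spatially averaged" step, the sketch below treats the class score map as a nested list of shape [rows][cols][classes] and averages over all positions. The function name and the toy numbers are made up for the example; this is not code from either paper.

```python
def spatial_average(score_map):
    """Average per-position class scores into one vector per image.

    score_map is a nested list indexed as [row][col][class].
    The result has one entry per class, whatever the map size.
    """
    num_classes = len(score_map[0][0])
    totals = [0.0] * num_classes
    count = 0
    for row in score_map:
        for scores in row:
            for c, s in enumerate(scores):
                totals[c] += s
            count += 1
    return [t / count for t in totals]


# Toy 2x2 score map with 3 classes (made-up values).
score_map = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]],
    [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]],
]
avg = spatial_average(score_map)
```

A larger test image would give a larger score map, but the averaged vector still has exactly one entry per class, which is what makes the output fixed-size.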
I am trying to understand dilated convolution. I am already familiar with increasing the size of the kernel by filling the gaps with zeros. It is useful for covering a bigger area and getting a better understanding of larger objects.
But could someone please explain how dilated convolutional layers can keep the original resolution? Dilated convolution is used in the DeepLabV3+ structure with an atrous rate from 2 to 16. How is it possible to use dilated convolution, with an obviously bigger kernel, without zero padding and still get a consistent output size?
DeepLabV3+ structure:
I'm confused because when I look at the explanation here:
the output size (3x3) of the dilated convolution layer is smaller?
Thank you so much for your help!
Lukas
Maybe there is some confusion between strided convolution and dilated convolution here. Strided convolution is the ordinary convolution operation that acts like a sliding window, but instead of moving by a single pixel each time, it uses a stride that lets it jump more than one pixel between computing the result for the current position and the next one. Dilated convolution "looks" at a bigger window: instead of taking neighboring pixels, it takes pixels separated by "holes". The dilation factor defines the size of those "holes".
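The difference is easy to see in one dimension. Below is a small sketch (plain Python, no padding, a hypothetical conv1d helper written just for this post) where stride changes how far the window jumps, while dilation changes which inputs inside the window are read.

```python
def conv1d(signal, kernel, stride=1, dilation=1):
    """Valid (unpadded) 1D convolution with stride and dilation.

    Illustrative helper only. As in most deep learning libraries,
    the kernel is not flipped (this is cross-correlation).
    """
    # Distance covered by the kernel once the holes are included.
    span = dilation * (len(kernel) - 1) + 1
    out = []
    for start in range(0, len(signal) - span + 1, stride):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


signal = list(range(10))   # 0, 1, ..., 9
kernel = [1, 1, 1]

plain = conv1d(signal, kernel)                # window reads s, s+1, s+2
strided = conv1d(signal, kernel, stride=2)    # window jumps 2 positions
dilated = conv1d(signal, kernel, dilation=2)  # window reads s, s+2, s+4
```

Here the strided version has roughly half as many outputs as the plain one, while the dilated version only loses a few outputs at the border.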
Well, without padding the output would become smaller than the input. The effect is comparable to the reduction effect of a normal convolution.
Imagine you have a 1D tensor with 1000 elements and a dilated 1x3 convolution kernel with a dilation factor of 3. This corresponds to a total kernel length of 1 + 2 free + 1 + 2 free + 1 = 7. With a stride of 1, the output would be a 1D tensor with 1000 + 1 - 7 = 994 elements. For a normal convolution with a 1x3 kernel and a stride of 1, the output would have 1000 + 1 - 3 = 998 elements. As you can see, the effect can be calculated just as for a normal convolution :)
In both situations the output becomes smaller without padding. But, as you can see, the dilation factor does not scale the output size the way the stride factor does.
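The arithmetic above can be written as a small function. This is a sketch of the usual output-length formula; the function and parameter names are my own, and padding means the number of elements added on each side.

```python
def conv_output_length(input_length, kernel_size, dilation=1, stride=1, padding=0):
    """Output length of a 1D convolution.

    padding is the number of elements added on EACH side.
    """
    # Kernel length once the holes are counted.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (input_length + 2 * padding - effective_kernel) // stride + 1


dilated_len = conv_output_length(1000, 3, dilation=3)              # 994
normal_len = conv_output_length(1000, 3)                           # 998
strided_len = conv_output_length(1000, 3, stride=2)                # 499
padded_len = conv_output_length(1000, 3, dilation=3, padding=3)    # 1000
```

The last line shows how the size can stay constant: for a kernel of size 3, padding each side by the dilation factor exactly compensates for the larger effective kernel.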
Why do you think no padding is done within the deeplab framework? I think in the official tensorflow implementation padding is used.
Best Frank
My understanding is that the authors are saying that one does not need to downsample the image (or any intermediate feature map) before applying, let's say, a 3x3 convolution, as is typical in DCNNs (e.g., VGG16 or ResNet) for feature extraction, followed by upsampling for semantic segmentation. In a typical encoder-decoder network (e.g. UNet or SegNet), one first downsamples the feature map by half, then applies a convolution, and then upsamples the feature map again by 2x.
All of these effects (downsampling, feature extraction and upsampling) can be captured in a single atrous convolution (with stride=1, of course). Moreover, the output of an atrous convolution is a dense feature map, compared to the same "downsampling, feature extraction and upsampling" sequence, which results in a sparse feature map. See the following figure for more details. It is from the DeepLabV1 paper. Therefore, you can control the size of a feature map by replacing any normal convolution with an atrous convolution in an intermediate layer.
That's also why there is a constant "output_stride (input resolution / feature map resolution)" of 16 in all the atrous convolutions in the picture (cascaded model) you posted above.
In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with the same number of neurons as there are pixels in the image (considering a grayscale image). So my question is: how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong). It says there are 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNNs, shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, there is usually a neuron for each input. So, as you mention in your question, for an image the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel). Unlike in convolutional layers, though, each input is received by an input neuron only once.
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (this would be called fully connected) but just some section of the input (that must be spatially local). These input neurons provide abstractions of small sections of the input data that when taken together over the whole input we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
I hope that helped!
EDIT-
Using the example the op gave to clear up loose ends:
OP's Question:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) that the receptive fields overlap. For example, with a stride of 2, the top-left neuron (neuron A) has a 3x3 receptive field, and the neuron to its right (neuron B) also has a 3x3 receptive field, whose leftmost 3 connections take the same inputs as the rightmost 3 connections of neuron A.
That being said, I think you would like to visualise this, so I will stick to your example where there is no overlap (that is, a stride of 3 for a 3x3 receptive field) and assume that we do not want any padding around the image. If the image has a resolution of 27x27 and we want 3 filters (this is our choice), then each filter will have 81 neurons (a 9x9 2D grid of neurons, since 27 / 3 = 9). Each of these neurons has 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we have 243 neurons in total.
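If it helps, the counting for this example can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming a square image, a square receptive field, no padding and the same stride in both directions; the function name is made up.

```python
def neurons_per_filter(image_size, field_size, stride):
    """Number of neurons in one filter's 2D grid (no padding)."""
    positions = (image_size - field_size) // stride + 1
    return positions * positions


num_filters = 3
per_filter = neurons_per_filter(27, 3, stride=3)  # no overlap: 9 * 9
total_neurons = num_filters * per_filter
connections_per_neuron = 3 * 3
# Weights are shared inside a filter, so parameters do not grow with
# the number of neurons (biases ignored here).
shared_weights = num_filters * connections_per_neuron

# With overlapping receptive fields (stride 1) the grid is much larger.
overlapping = neurons_per_filter(27, 3, stride=1)  # 25 * 25
```

Note how the number of neurons (243) and the number of learned weights (27) are different things: the first depends on the image size and stride, the second only on the filter size and the number of filters.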
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image shows the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are single dimensional. They take input from one end, and give output from the other. So how do we process 2D images in 1D neural layers? Here the Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
I think you have confusion on the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped for dealing with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with the connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it) or, in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value which specifies how far the window shifts to the right and how far down. At each position, the filter weights and the image pixels in the window give a single new value in the new volume created. So, to answer your question, at any instant a total of 3x3 = 9 pixels are connected with the 9 weights of one filter. The same holds for the other 2 filters.
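Here is the moving-window picture as a short pure-Python sketch. The slide_filter helper is written just for this answer (single channel, square kernel, no padding, no bias) and the kernel values are arbitrary.

```python
def slide_filter(image, kernel, stride=1):
    """Slide a square kernel over a 2D image and return the feature map."""
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # One output value: k*k pixels in the window times k*k weights.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]  # adds up the window's main diagonal

feature_map = slide_filter(image, kernel)  # 2x2 output for a 4x4 image
```

With 3 filters you would simply call this three times with three different kernels and stack the three feature maps.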
Your approach to understanding CNNs by visualisation is correct. But you still need to brush up on your basic understanding of the terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)
I am trying to train a model that classifies images.
The problem I have is that the images have different sizes. How should I format my images and/or model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; does scale and perspective mean anything to the content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
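For the crop and pad options, the bookkeeping is just a bit of integer arithmetic. Below is a rough sketch with two hypothetical helpers: one returns the box of a center crop, the other the border widths needed to pad an image to a square. Both only compute coordinates; applying them to actual pixels is left to whatever image library you use.

```python
def center_crop_box(width, height, crop_width, crop_height):
    """Return (left, top, right, bottom) of a centered crop box."""
    left = (width - crop_width) // 2
    top = (height - crop_height) // 2
    return left, top, left + crop_width, top + crop_height


def square_padding(width, height):
    """Return (left, top, right, bottom) border widths that make the image square."""
    side = max(width, height)
    pad_w = side - width
    pad_h = side - height
    left = pad_w // 2
    top = pad_h // 2
    # Any odd leftover pixel goes to the right / bottom border.
    return left, top, pad_w - left, pad_h - top


crop_box = center_crop_box(640, 480, 224, 224)
landscape_pad = square_padding(640, 480)
odd_pad = square_padding(300, 501)
```

For multiple crops, you could shift the box by a few offsets instead of always centering it, which gives the N different images mentioned above.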
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are functions like resize_image_with_crop_or_pad that take away most of the work.
As for simply not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there actually is a paper here called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer. Then put it after your last convolution layer so that the FC layers always get constant-dimensional vectors as input. During training, train on the images from the entire dataset using a particular image size for one epoch. Then, for the next epoch, switch to a different image size and continue training.
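To show why such a layer gives constant-dimensional vectors, here is a minimal single-channel sketch of the pooling idea in pure Python. It is a simplification under my own assumptions (max pooling, bins computed with floor and ceiling so they cover the whole map, pyramid levels 1x1 and 2x2), not the exact layer from the paper.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2D feature map into a fixed-length vector.

    For each level n the map is split into an n x n grid of bins and the
    maximum of each bin is kept, so the output length is the sum of n*n
    over all levels, independent of the input size.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * rows) // n               # floor of the bin start
            r1 = -((-(i + 1) * rows) // n)     # ceiling of the bin end
            for j in range(n):
                c0 = (j * cols) // n
                c1 = -((-(j + 1) * cols) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


square_map = [[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]]
wide_map = [[1, 2, 3, 4],
            [5, 6, 7, 8]]

square_vec = spatial_pyramid_pool(square_map)
wide_vec = spatial_pyramid_pool(wide_map)
```

Both inputs produce a vector of length 1 + 4 = 5, so a fully connected layer placed after this step always sees the same input size.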
I want to build a convolutional autoencoder where the size of the input is not constant. I'm doing that by stacking up conv-pool layers until I reach an encoding layer, and then doing the reverse with upsample-conv layers. The problem is that no matter what settings I use, I can't get exactly the same size in the output layer as in the input layer. The reason is that the UpSampling layer (given, say, a (2, 2) size) doubles the size of its input, so I can't get odd dimensions, for instance. Is there a way to tie the output dimension of a given layer to the input dimension of a previous layer for individual samples (as I said, the input size for the max-pool layer is variable)?
Yes, there is.
You can use three methods
Padding
Resizing
Crop or Pad
Padding only works for increasing the dimensions; it cannot reduce the size.
Resizing is more costly, but it is the most general solution because it works in either direction (up- or downsampling). It keeps the values in their original range and simply resamples them to the given dimensions.
Crop or Pad serves the same purpose as resizing and is more compute-efficient, as there is no interpolation in this method. However, if you resize to a smaller dimension, it will crop from the edges.
By using these three, you can arrange your layers' dimensions.
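As a rough sketch of the Crop or Pad idea, the pure-Python helper below centers a 2D grid inside a target size, cropping where the grid is too large and zero-filling where it is too small. The names are made up for the example; in a real model you would use the equivalent operation from your framework on the decoder output, with the input's height and width as the target.

```python
def _fit(seq, target, make_filler):
    """Center-crop or center-pad a list to the target length."""
    extra = len(seq) - target
    if extra >= 0:
        start = extra // 2
        return list(seq[start:start + target])
    before = (-extra) // 2
    after = (-extra) - before
    return ([make_filler() for _ in range(before)]
            + list(seq)
            + [make_filler() for _ in range(after)])


def crop_or_pad(grid, target_rows, target_cols, fill=0):
    """Crop and/or pad a 2D list to exactly target_rows x target_cols."""
    # Fix the width of every row first, then the number of rows.
    rows = [_fit(row, target_cols, lambda: fill) for row in grid]
    return _fit(rows, target_rows, lambda: [fill] * target_cols)


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

wider = crop_or_pad(grid, 2, 5)   # crop rows, pad columns
taller = crop_or_pad(grid, 5, 1)  # pad rows, crop columns
```

For the autoencoder case: an input dimension of 5 becomes 2 after 2x2 pooling and 4 after 2x upsampling, so padding the output back to 5 (or cropping an oversized output) restores the original size without any interpolation.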