I noticed that in MobileNet (Caffe), each Batch Normalization layer is followed by a Scale layer; the BN layer and the Scale layer seem to come as a pair.
And Convolution layer + BN layer + Scale layer + ReLU layer works well.
So what does the Scale layer do?
It seems Caffe can't learn the scale/shift parameters inside the BN layer itself, so the Scale layer is needed, but why?
In the TensorFlow docs (https://www.tensorflow.org/api_docs/python/tf/contrib/layers/batch_norm) it says:
"When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer."
This makes me even more confused.
Please help me, thanks!
Batch Normalization does two things: first it normalizes with the mean and standard deviation of the activations in a batch, and then it applies a scale and a bias to restore an appropriate range for the activations.
Caffe implements this with two layers: the BatchNorm layer does only the normalization part, without the scale and bias. Those can be supplied by the Scale layer, or might not even be needed if the next layer can do the scaling itself (which is what the TF docs mention).
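In PyTorch terms (just an illustration of the split, not the Caffe code itself), BatchNorm2d(affine=False) plays the role of Caffe's BatchNorm layer, and a learnable per-channel scale and bias play the role of the Scale layer:

```python
import torch
from torch import nn

x = torch.randn(8, 16, 32, 32)

norm_only = nn.BatchNorm2d(16, affine=False)        # normalization only (Caffe "BatchNorm")
gamma = torch.ones(1, 16, 1, 1)                     # learnable scale (Caffe "Scale"), init 1
beta = torch.zeros(1, 16, 1, 1)                     # learnable bias, init 0
y_split = norm_only(x) * gamma + beta

fused = nn.BatchNorm2d(16, affine=True)             # normalization + scale/bias in one layer
y_fused = fused(x)

print(torch.allclose(y_split, y_fused, atol=1e-5))  # True with the default init
```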
Hope this helps.
I am segmenting multiple targets in medical images (CT) with DeepLabV3+, but with 3D volumes, so I can't load a pretrained backbone (ResNet, etc.) into the net.
The details are:
patch size: 16 x 256 x 256 (cannot be changed)
batch size: 2 (the GPU cannot handle a larger one)
optimizer: SGD
loss: Dice + CrossEntropy (following the nnU-Net setting)
dataset: only about 20 cases
The original code is for the 2D case, and I swapped each layer from 2D to 3D (e.g. nn.Conv2d to nn.Conv3d, and so on).
But in the end my validation DSC only reached around 0.6, and I have no idea what is wrong in my code. Could anyone give me a hand (an idea), please? Thanks a lot!
I want to improve the performance of the model, because right now I have no idea why my network is doing so badly. Thanks a lot.
You can apply a few 3x3 convolutional layers to the 3D volume while keeping the spatial dimensions (H and W) of the features constant, and then convert that tensor into a 3-channel tensor using a 1x1 convolutional layer. You will then have a tensor with the same height and width as the image but with 3 channels, and you can use the pretrained models.
For reference, check here:
https://segmentation-models.readthedocs.io/en/latest/tutorial.html#training-with-non-rgb-data
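A minimal sketch of one way to read this suggestion (the layer widths and shapes below are my own assumptions, matched to the 16 x 256 x 256 patches from the question): treat the 16 slices as input channels, keep H and W fixed with padded 3x3 convolutions, then map down to 3 channels with a 1x1 convolution before a pretrained 2D backbone:

```python
import torch
from torch import nn

# Hypothetical adapter: 16-slice CT patch -> 3-channel "RGB-like" tensor.
# The padded 3x3 convolutions keep H and W unchanged; the final 1x1
# convolution reduces the channel count to 3 for a pretrained 2D encoder.
to_rgb = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 3, kernel_size=1),
)

patch = torch.randn(2, 16, 256, 256)   # batch of 2, as in the question
pseudo_rgb = to_rgb(patch)
print(pseudo_rgb.shape)                # torch.Size([2, 3, 256, 256])
# pseudo_rgb can now be fed to a pretrained 2D backbone (e.g. a ResNet encoder).
```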
According to the documentation on pre-trained computer vision models for transfer learning (e.g., here), input images should come in "mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224".
However, when running transfer learning experiments on 3-channel images whose height and width are smaller than expected (e.g., smaller than 224), the networks generally run smoothly and often achieve decent performance.
Hence, it seems to me that the "minimum height and width" is somehow a convention and not a critical parameter. Am I missing something here?
There is a lower limit on your input size, which corresponds to the receptive field of the last convolution layer of the network. Intuitively, the spatial dimensions shrink as you progress through the network, at least for feature-extractor CNNs that aim to produce a feature embedding of the input image. That is, most pre-trained models, such as vanilla VGG and ResNet networks, do not retain the spatial dimensionality. If the input to a convolutional layer is smaller than the kernel size (even after padding), you simply cannot perform the operation.
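A quick illustration of that last point (a made-up example, not code from the thread): a convolution simply cannot run once the (padded) input is smaller than its kernel:

```python
import torch
from torch import nn

conv = nn.Conv2d(3, 8, kernel_size=7)       # no padding
out = conv(torch.randn(1, 3, 7, 7))         # fine: 7x7 input, 7x7 kernel -> 1x1 output
print(out.shape)                            # torch.Size([1, 8, 1, 1])

try:
    conv(torch.randn(1, 3, 5, 5))           # 5x5 input is smaller than the 7x7 kernel
except RuntimeError as err:
    print(err)                              # "Kernel size can't be greater than actual input size"
```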
TLDR: adaptive pooling layer
For example, the standard resnet50 model accepts inputs only in the range 193-225, and this is due to its architecture and downscaling layers (see below).
The only reason the default PyTorch model works is that it uses an adaptive pooling layer, which does not restrict the input size. So it will work, but be prepared for performance decay and other fun things :)
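A small sketch of what the adaptive pooling buys you (the input sizes below are examples I picked): torchvision's ResNet-50 ends in nn.AdaptiveAvgPool2d((1, 1)), so whatever spatial size survives the /32 downscaling is pooled to 1x1 and the fully connected head still fits:

```python
import torch
from torch import nn
from torchvision import models

model = models.resnet50()                   # pretrained weights optional here
model.eval()

for size in (160, 224, 320):                # all large enough to survive the /32 downscaling
    with torch.no_grad():
        out = model(torch.randn(1, 3, size, size))
    print(size, out.shape)                  # torch.Size([1, 1000]) every time

# The adaptive pooling layer itself maps any spatial size to a fixed 1x1:
pool = nn.AdaptiveAvgPool2d((1, 1))
print(pool(torch.randn(1, 2048, 5, 5)).shape)   # torch.Size([1, 2048, 1, 1])
```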
Hope you will find it useful:
https://discuss.pytorch.org/t/how-can-torchvison-models-deal-with-image-whose-size-is-not-224-224/51077/3
What is Adaptive average pooling and How does it work?
https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L118
https://github.com/pytorch/vision/blob/c187c2b12d86c3909e59a40dbe49555d85b98703/torchvision/models/resnet.py#L151
https://developpaper.com/pytorch-implementation-examples-of-resnet50-resnet101-and-resnet152/
This might be very basic, but I'm confused about why VGG nets use multiple convolutional layers with 3x3 filters stacked one after another. What specifically happens when we convolve the same image twice or more?
Nothing, if you don't have a non-linear transformation in between. In that case you can always collapse the stack into a single convolutional layer that computes the same thing.
But VGG uses ReLU activation functions. This makes it possible to learn non-linear transformations of the data.
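Here is a small numerical check of the first point (a sketch I put together for a single channel, no bias, no padding): two stacked 3x3 convolutions with no non-linearity in between compute exactly the same thing as one 5x5 convolution whose kernel is the full convolution of the two kernels:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)                 # single-channel input
k1 = torch.randn(1, 1, 3, 3)
k2 = torch.randn(1, 1, 3, 3)

# Two linear 3x3 convolutions stacked, no activation in between.
stacked = F.conv2d(F.conv2d(x, k1), k2)     # output: 1x1x4x4

# Equivalent single 5x5 kernel: the full convolution of k1 and k2
# (k2 is flipped because F.conv2d is actually a cross-correlation).
k_combined = F.conv2d(k1, torch.flip(k2, dims=(2, 3)), padding=2)   # 1x1x5x5
single = F.conv2d(x, k_combined)            # output: 1x1x4x4

print(torch.allclose(stacked, single, atol=1e-5))   # True
```

With a ReLU between the two layers this collapse is no longer possible, which is exactly why stacking them adds expressive power.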
I am using a pre-trained model to which I want to add an Eltwise layer that multiplies the outputs of two layers: one is the output of a convolution layer of shape 1x1x256x256, and the other is the output of a convolution layer of shape 1x32x256x256. My question is: if we add an elementwise layer that multiplies these two layers and sends the result to the next layer, do we have to train from scratch because the architecture has been modified, or is it still possible to use the pretrained model?
Thanks
Indeed, making architectural changes puts the learned features at odds with the new architecture.
However, there is no reason not to use the learned weights for the layers below the change -- these layers are not affected by the modification, so they can benefit from the pretrained initialization.
As for the rest of the layers, I suppose initializing from the trained weights should not be worse than random, so why not?
Don't forget to initialize any new layers with random weights (the default in Caffe is zero, and this can cause trouble for learning).
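For completeness, a minimal pycaffe sketch of this kind of "net surgery" (the file names are placeholders): Caffe copies weights from the .caffemodel into the new prototxt layer by layer, matching on layer names, so unchanged layers get the pretrained initialization and anything new falls back to the fillers declared in the prototxt:

```python
import caffe

# Placeholder file names; the modified prototxt contains the new Eltwise layer.
net = caffe.Net('modified_deploy.prototxt',     # new architecture
                'pretrained.caffemodel',        # original pretrained weights
                caffe.TEST)

# Layers whose names match the pretrained model are copied over; any layer that
# is new (or renamed) keeps the initialization from its weight_filler, so make
# sure new learnable layers declare e.g. a "gaussian" or "xavier" filler rather
# than relying on the zero default.
```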
What exactly is a fully convolutional layer? I mean, why is it "fully"? The wording in [Long] is quite confusing to me.
Is it because they never use fully connected layers? Or is it because the convolution layers obtained by the "convolutionalization" described in Figure 2 have kernels that cover their entire input regions?
Do you see the last part in this image, labelled "fully connected"? In a fully convolutional network we remove that part. But then how can we do classification, since we are left with many channels of large activation maps?
In the example you mentioned, they do up-sampling, and their cost function measures the error between the reconstructed (up-sampled) image and the ground truth.
So it is called fully convolutional because there is only convolution there: spatial feature extraction.
The phrase comes from blending "fully connected layer" with "convolutional layer". You can think of it as a fully connected layer that acts on a sub-region of the image. Then, instead of getting a single output feature vector for the whole image, you get a set of vectors, one per corresponding image part. These vectors are arranged to form a map, which is reminiscent of convolutional feature maps.
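To make the "convolutionalization" from Figure 2 concrete, here is a small sketch (the layer sizes are my own toy choice): a fully connected layer over a 512x7x7 feature map is exactly a 7x7 convolution with 512 input channels, and the convolutional version produces a spatial map of scores when given a larger input:

```python
import torch
from torch import nn

torch.manual_seed(0)

fc = nn.Linear(512 * 7 * 7, 10)                  # "fully connected" classifier head
conv = nn.Conv2d(512, 10, kernel_size=7)         # the same head, "convolutionalized"
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 512, 7, 7))
    conv.bias.copy_(fc.bias)

feat = torch.randn(1, 512, 7, 7)
same = torch.allclose(fc(feat.flatten(1)), conv(feat).flatten(1), atol=1e-4)
print(same)                                      # True: identical scores on a 7x7 feature map

bigger = torch.randn(1, 512, 14, 14)
print(conv(bigger).shape)                        # torch.Size([1, 10, 8, 8]): a coarse score map
```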