Keras LSTM: dropout vs recurrent_dropout - deep-learning

I realize this post is asking a similar question to this.
But I just wanted some clarification, preferably a link to some kind of Keras documentation that says the difference.
In my mind, dropout works between neurons. And recurrent_dropout works on each neuron between timesteps. But I have no grounding for this whatsoever.
The documentation on the Keras website is not helpful at all.

The Keras LSTM documentation contains a high-level explanation:
dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
recurrent_dropout: Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
But this corresponds exactly to the answer you refer to:
Regular dropout is applied on the inputs and/or the outputs, meaning the vertical arrows from x_t and to h_t. ...
Recurrent dropout masks (or "drops") the connections between the recurrent units; that would be the horizontal arrows in your picture.
If you're interested in details at the formula level, the best way is to inspect the source code: keras/layers/recurrent.py. Look for rec_dp_mask (recurrent dropout mask) and dp_mask; one affects h_tm1 (the previous hidden state), the other affects the inputs.
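For reference, a minimal Keras sketch showing where the two arguments go (the values are arbitrary placeholders):

from tensorflow.keras.layers import LSTM

# dropout masks the input connections (the vertical arrows from x_t),
# recurrent_dropout masks the recurrent connections (the horizontal arrows from h_{t-1})
layer = LSTM(
    units=64,
    dropout=0.2,
    recurrent_dropout=0.2,
    return_sequences=True,
)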

Related

What does the output of a hidden layer in an ANN mean?

Say we have a binary classification problem; we expect our output to be a single value, but while building the ANN we write:
nn.Linear(4, 6)  # assuming we are predicting based on the 4 input features
nn.Linear(6, 3)
nn.Linear(3, 1)
What do the output features of 6 in the first layer or 3 in the second layer mean?
I am unable to visualise what's happening in these hidden layers.
They are linear combinations of the input.
If you are not using a non-linearity, then these hidden layers are effectively pointless, since the entire network collapses into a single linear regression: you will only learn a linear relationship between the input and the output classes.
On the other hand, if you are using non-linearities between layers, then these hidden layers will represent new non-linear combinations of the input features.
These features allow the model to learn different things about the input and then make a final decision about which class the input should belong to.
Initially, we map the 4 inputs into a higher dimension (6), so that we can combine the different parts of the input and learn relationships between them. Then further combining these features allows for learning even more complex non-linear relationships in the next layer. Choosing the size of these hidden layers is its own problem and usually involves some amount of guessing or brute-forcing.
Here is a good answer that talks more about the need for hidden layers.
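As a concrete sketch of the sizes discussed in the question (a hypothetical PyTorch model, with ReLU non-linearities added between the linear layers and a sigmoid for the binary output):

import torch
import torch.nn as nn

# 4 input features -> 6 hidden features -> 3 hidden features -> 1 output logit
model = nn.Sequential(
    nn.Linear(4, 6),
    nn.ReLU(),        # non-linearity, so the layers do not collapse into one linear map
    nn.Linear(6, 3),
    nn.ReLU(),
    nn.Linear(3, 1),  # single logit for binary classification
)

x = torch.randn(8, 4)          # batch of 8 samples with 4 features each
logits = model(x)              # shape: (8, 1)
probs = torch.sigmoid(logits)  # probability of the positive class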

U-net: how to understand the cropped output

I'm looking for a U-net implementation for a landmark detection task, where the architecture is intended to be similar to the figure above. For reference, please see this: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms
From the figure, we can see the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly understand the cropped output? From what I know, we ideally expect the output size to be the same as the input size (572x572) so we can apply the mask to the original image to carry out segmentation. However, in some tutorials (like this one), the author recreates the model from scratch and then uses "same" padding to get around my question, but I would prefer not to use same padding to achieve the same output size.
I can't use same padding because I chose a pretrained ResNet34 as my encoder backbone, and the PyTorch pretrained ResNet34 implementation doesn't use same padding on the encoder part, which means the result is exactly like what you see in the figure above (intermediate feature maps are cropped before being copied). If I were to continue building the decoder this way, the output would be smaller than the input image.
The question being: if I want to use the output segmentation maps, should I pad the outside until the dimensions match the input, or should I just resize the map? I'm worried the first will lose information about the boundary of the image and the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from the experience gained on ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose image size because of convolution; its implementation does in fact use same padding. An illustration:
Input (3,512,512) -> Layer1 (64,128,128) -> Layer2 (128,64,64) -> Layer3 (256,32,32) -> Layer4 (512,16,16)
so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the spatial dimension back to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
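A rough sketch of that decoder path, assuming the ResNet34 feature shapes listed above (the layer count and channel sizes here are illustrative, not taken from the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical decoder: bring a (512, 16, 16) ResNet34 feature map back to 128x128
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2),  # 16 -> 32
    nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),  # 32 -> 64
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),   # 64 -> 128
    nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),                        # 1 heatmap channel
)

features = torch.randn(1, 512, 16, 16)
heatmap = decoder(features)   # (1, 1, 128, 128)
# final x4 upsampling to match the 512x512 input resolution
heatmap = F.interpolate(heatmap, scale_factor=4, mode="bilinear", align_corners=False)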

What is the input to the hidden layers of a multilayer RNN

This question makes most of it pretty clear. There's just one part I don't know the answer to yet... In fig. 1 of this paper, is the input to the deeper layers the same input (i.e. x[t]), or is it the output from the previous layer?
A really simple way to phrase the question: in fig. 1 of the paper, is the red line going over each layer, or does each layer receive the output from the previous one?
I think the input to all layers at time t is x[t], because if it were the output of the previous layer and x[t] weren't the same dimension as h[t], then the hidden GRU cells would need to accept different input dimensions (that is, the first layer would accept the hidden state and the input, while all subsequent layers would accept the corresponding hidden state from t-1 as well as the hidden state from the previous layer).
But then again, in one of my classes the TA had a solution that assumed x[t] and h[t] were the same dimension, so for subsequent layers he passed the preceding layer's input... This just doesn't seem like it would generally be the case.
Probably the TensorFlow and PyTorch source code would provide a definitive answer?
Figured it out for PyTorch. It's definitely the input at time t because the layers all have the same dimension for the input.
I haven't found specific docs for TensorFlow that would imply this, but I'm assuming it would be the same.
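A neutral way to check this is to inspect the input-to-hidden weight matrices of a stacked RNN in PyTorch; their second dimension is the input size each layer expects (a sketch with arbitrary sizes):

import torch.nn as nn

# stacked GRU: input_size=10, hidden_size=20, 3 layers
rnn = nn.GRU(input_size=10, hidden_size=20, num_layers=3)

# weight_ih_l{k} is the input-to-hidden weight matrix of layer k
for k in range(3):
    w = getattr(rnn, f"weight_ih_l{k}")
    print(f"layer {k} expects input of size {w.shape[1]}")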

Why doesn't a dilated convolution layer reduce the resolution of the receptive field?

I'm trying to understand dilated convolution. I'm already familiar with increasing the size of the kernel by filling the gaps with zeros. It's useful to cover a bigger area and get a better understanding of larger objects.
But can someone please explain to me how it is possible that dilated convolutional layers keep the original resolution of the receptive field? It is used in the DeepLabV3+ structure with an atrous rate from 2 to 16. How is it possible to use dilated convolution with an obviously bigger kernel, without zero padding, and still get a consistent output size?
DeepLabV3+ structure:
I'm confused because when I have a look at this explanation here:
The output size (3x3) of the dilated convolution layer is smaller?
Thank you so much for your help!
Lukas
Maybe there is a small confusion between strided convolution and dilated convolution here. Strided convolution is the general convolution operation that acts like a sliding window, but instead of jumping by a single pixel each time, it uses a stride to jump more than one pixel when moving from one output position to the next. Dilated convolution "looks" at a bigger window - instead of taking neighboring pixels, it takes them with "holes". The dilation factor defines the size of those "holes".
Well, without padding the output would become smaller than the input. The effect is comparable to the reduction effect of a normal convolution.
Imagine you have a 1d-tensor with 1000 elements and a dilated 1x3 convolution kernel with a dilation factor of 3. This corresponds to a "total kernel length" of 1 + 2 holes + 1 + 2 holes + 1 = 7. With a stride of 1, the output would be a 1d-tensor with 1000 - 7 + 1 = 994 elements. In the case of a normal convolution with a 1x3 kernel and a stride of 1, the output would have 1000 - 3 + 1 = 998 elements. As you can see, the effect can be calculated similarly to a normal convolution :)
In both situations the output becomes smaller without padding. But, as you can see, the dilation factor has no scaling effect on the output's size, unlike the stride factor.
Why do you think no padding is done within the DeepLab framework? I think in the official TensorFlow implementation padding is used.
Best Frank
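To double-check the arithmetic above, here is a minimal PyTorch sketch (using PyTorch's convention that dilation=3 places two holes between kernel taps):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 1000)  # 1d-tensor with 1000 elements

dilated = nn.Conv1d(1, 1, kernel_size=3, dilation=3, stride=1, padding=0)
normal = nn.Conv1d(1, 1, kernel_size=3, dilation=1, stride=1, padding=0)

print(dilated(x).shape)  # torch.Size([1, 1, 994]) -> effective kernel length 7
print(normal(x).shape)   # torch.Size([1, 1, 998]) -> kernel length 3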
My understanding is that the authors are saying that one does not need to downsample the image (or any intermediate feature map) before applying, say, a 3x3 convolution, which is typical in DCNNs (e.g., VGG16 or ResNet) for feature extraction and is followed by upsampling for semantic segmentation. In a typical encoder-decoder network (e.g. UNet or SegNet), one first downsamples the feature map by half, applies the convolution, and then upsamples the feature map again by 2x.
All of these effects (downsampling, feature extraction and upsampling) can be captured in a single atrous convolution (of course with stride=1). Moreover, the output of an atrous convolution is a dense feature map, compared to the same "downsampling, feature extraction and upsampling", which results in a sparse feature map. See the figure from the DeepLabV1 paper for more details. Therefore, you can control the size of a feature map by replacing any normal convolution with an atrous convolution in an intermediate layer.
That's also why there is a constant "output_stride (input resolution / feature map resolution)" of 16 in all the atrous convolutions in the picture (cascaded model) you posted above.
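As a small illustration of the resolution-preserving behaviour both answers describe: with stride 1 and padding equal to the dilation rate, a 3x3 atrous convolution keeps the spatial size of the feature map (a PyTorch sketch, not taken from the DeepLab code):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)  # some intermediate feature map

# for a 3x3 kernel with stride 1, padding = dilation keeps the spatial size constant
for rate in (2, 4, 8, 16):
    conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, dilation=rate, padding=rate)
    print(rate, conv(x).shape)  # always (1, 64, 65, 65)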

CNN attention/activation maps

What are common techniques for finding which parts of images contribute most to image classification via convolutional neural nets?
In general, suppose we have 2d matrices with float values between 0 and 1 as entries. Each matrix is associated with a label (single-label, multi-class) and the goal is to perform classification via (Keras) 2D CNNs.
I'm trying to find methods to extract relevant subsequences of rows/columns that contribute most to classification.
Two examples:
https://github.com/jacobgil/keras-cam
https://github.com/tdeboissiere/VGG16CAM-keras
Other examples/resources with an eye toward Keras would be much appreciated.
Note that my datasets are not actual images, so methods relying on ImageDataGenerator might not directly apply in this case.
There are many visualization methods. Each of these methods has its strengths and weaknesses.
However, you have to keep in mind that the methods partly visualize different things. Here is a short overview based on this paper.
You can distinguish between three main visualization groups:
Functions (gradients, saliency map): These methods visualize how a change in input space affects the prediction.
Signal (deconvolution, Guided BackProp, PatternNet): The signal (the reason for a neuron's activation) is visualized, i.e. what pattern caused the activation of a particular neuron.
Attribution (LRP, Deep Taylor Decomposition, PatternAttribution): These methods visualize how much a single pixel contributed to the prediction. As a result, you get a heatmap highlighting which pixels of the input image most strongly contributed to the classification.
Since you are asking how much a pixel has contributed to the classification, you should use attribution methods. Nevertheless, the other methods also have their place.
One nice toolbox for visualizing heatmaps is iNNvestigate.
This toolbox contains the following methods:
SmoothGrad
DeConvNet
Guided BackProp
PatternNet
PatternAttribution
Occlusion
Input times Gradient
Integrated Gradients
Deep Taylor
LRP
DeepLift
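If you want something quick without an extra library, here is a minimal occlusion-sensitivity sketch (one of the attribution-style methods listed above) for a Keras classifier working on 2d matrices; the function name, patch size and model are placeholders, not part of any toolbox mentioned above:

import numpy as np

def occlusion_map(model, x, target_class, patch=8, baseline=0.0):
    # slide a patch of baseline values over the input and record how much
    # the predicted probability of target_class drops at each position
    h, w = x.shape[0], x.shape[1]
    base_prob = model.predict(x[None])[0, target_class]
    heatmap = np.zeros((h, w))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = x.copy()
            occluded[i:i + patch, j:j + patch] = baseline
            prob = model.predict(occluded[None])[0, target_class]
            heatmap[i:i + patch, j:j + patch] = base_prob - prob
    return heatmap  # larger values = region contributed more to the prediction

# hypothetical usage: heatmap = occlusion_map(cnn, sample[..., None], target_class=2)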