I have been learning GAN (Generative Adversarial Networks) lately and having a hard time understanding the output size for transpose convolution. Let's say I am using a Tensor of [1, 64, 1, 1] as an input noise. How do I calculate the output of each layer until I construct a 28x28 image (let's say an MNIST digit)? What should be the kernel size, stride, and padding and assuming I use 3 or 4 layers to reconstruct the 28x28 image?
Note: A handwritten example will be enough as well.
Related
While working on a problem related to question-answering(MRC), I have implemented two different architectures that independently give two tensors (probability distribution over the tokens). Both the tensors are of dimension (batch_size,512). I wish to obtain the final output of the form (batch_size,512). How can I combine the two tensors using trainable weights and then train the model on the final prediction?
Edit (Additional Information):
So in the forward function of my NN model, I have used BERT model to encode the 512 tokens. These encodings are 768 dimensional. These are then passed to a Linear layer nn.Linear(768,1) to output a tensor of shape (batch_size,512,1). Apart from this I have another model built on top of the BERT encodings that also yields a tensor of shape (batch_size, 512, 1). I wish to combine these two tensors to finally get a tensor of shape (batch_size, 512, 1) which can be trained against the output logits of the same shape using CrossEntropyLoss.
Please share the PyTorch code snippet if possible.
Assume your two vectors are V1 and V2. You need to combine them (ensembling) to get a new vector. You can use a weighted sum like this:
alpha = sigmoid(alpha)
V_final = alpha * V1 + (1 - alpha) * V2
where alpha is a learnable scaler. The sigmoid is to bound alpha between 0 and 1,
and you can initialise alpha = 0 so that sigmoid(alpha) is half, meaning you are adding V1 and V2 with equal weights.
This is a linear combination, and there can be non-linear versions as well.
You can have a nonlinear layer that accepts (V1;V2) (the concatenation) and outputs a softmaxed output as well e.g. softmax(W * (V1;V2) + b).
I created a fully connected network in Pytorch with an input layer of shape (1,784) and a first hidden layer of shape (1,256).
To be short: nn.Linear(in_features=784, out_features=256, bias=True)
Method 1 : model.fc1.weight.data.shape gives me torch.Size([128, 256]), while
Method 2 : list(model.parameters())[0].shape gives me torch.Size([256, 784])
In fact, between an input layer of size 784 and a hidden layer of size 256, I was expecting a matrix of shape (784,256).
So, in the first case, I see the shape of the next hidden layer (128), which does not make sense for the weights between the input and first hidden layer, and, in the second case, it looks like Pytorch took the transform of the weight matrix.
I don't really understand how Pytorch shapes the different weight matrices, and how can I access individual weights after the training. Should I use method 1 or 2? When I display the corresponding tensors, the displays look totally similar, while the shapes are different.
In Pytorch, the weights of model parameters are transposed before applying the matmul operation on the input matrix. That's why the weight matrix dimensions are flipped, and is different from what you expect; i.e., instead of being [784, 256], you observe that it is [256, 784].
You can see the Pytorch source documentation for nn.Linear, where we have:
...
self.weight = Parameter(torch.Tensor(out_features, in_features))
...
def forward(self, input):
return F.linear(input, self.weight, self.bias)
When looking at the implementation of F.linear, we see the corresponding line that multiplies the input matrix with the transpose of the weight matrix:
output = input.matmul(weight.t())
I created Convolutional Autoencoder using Pytorch and I'm trying to improve it.
For the encoding layer I use first 4 layers of pre-trained ResNet 18 model from torchvision.models.resnet.
I have mid-layer with just one Convolutional layer with input and output channel sizes of 512. For the decoding layer I use Convolutional layers following with BatchNorm and ReLU activation function.
The decoding layer reduces the channel each layer: 512 -> 256 -> 128 -> 64 -> 32 -> 16 -> 3 and increases the resolution of the image with interpolation to match the dimension of the corresponding layer in the encoding part. For the last layer I use sigmoid instead of ReLu.
All Convolutional layers are:
self.up = nn.Sequential(
nn.Conv2d(input_channels, output_channels,
kernel_size=5, stride=1,
padding=2, bias=False),
nn.BatchNorm2d(output_channels),
nn.ReLU()
)
The input images are scaled to [0, 1] range and have shapes 224x224x3. Sample outputs are (First is from training set, the second from the test set):
First image
First image output
Second image
Second image output
Any ideas why output is blurry? The provided model has been trained around 160 epochs with ~16000 images using Adam optimizer with lr=0.00005. I'm thinking about adding one more Convolutional layer in self.up given above. This will increase complexity of the model, but I'm not sure if it is the right way to improve the model.
In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size(7x7x8). What I don't understand is:
why this is Keras's default behavior?
how do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
_
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D
model = Sequential([
InputLayer((101, 101, 1)),
Conv2D(8, (11, 11)),
Conv2D(16, (7, 7))
])
model.weights
Q1:and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
No, the kernel weights is not the size(7x7x16).
from cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Be careful the 'every'.
In your model, 7x7 is your single filter size, and it will connect to previous conv layer, so the parameters on a single filter is 7x7x8, and you have 16, so the total parameters is 7x7x8x16
Q2:why this is Keras's default behavior?
See Q1.
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."
I have the necessity to keep the model as small as possible to deploy an image classifier that can run efficiently on an app (the accuracy is not really relevant for me)
I recently approached deep learning and I haven't great experience, hence I'm currently playing with the cifar-10 example.
I tried to replace the first two 5x5 convolutional layers with two 3x3 convolution each, as described in the inception paper.
Unluckily, when I'm going to classify the test set, I got around 0.1 correct classification (random choice)
This is the modified code of the first layer (the second is similar):
with tf.variable_scope('conv1') as scope:
kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
stddev=1e-4, wd=0.0)
kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 1],
stddev=1e-4, wd=0.0)
biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
conv_l1 = tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1], padding='SAME')
conv_l2 = tf.nn.depthwise_conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
bias = tf.nn.bias_add(conv_l2, biases)
conv1 = tf.nn.relu(bias, name=scope.name)
_activation_summary(conv1)
Is it correct?
It seems you're attempting to compute 64 features (for each 3x3 patch) in the first convolutional layer and feed this directly into the second convolutional layer, with no intermediate pooling layer. Convolutional neural networks typically have a structure of stacked convolutional layers, followed by contrast normalization and max pooling.
To reduce processing overheads researchers have experimented in moving from fully connected to sparsely connected architectures, and hence the creation of inception architecture. However, whilst these yield good results for high dimensional inputs, you may be expecting too much from the 32x32 pixels of Cifar10 in TensorFlow.
Therefore, I think the issue is less around patch size of and more to do with overall architecture. This code is a known good starting point. Get this working and start reducing parameters until it breaks.