The caffe documentation on the softmax_loss_layer.hpp file seems to be targeted towards classification tasks and not semantic segmentation. However, I have seen this layer being used for the latter.
What would be the dimensions of the input blobs and output blob in the case where you're classifying each pixel (semantic segmentation)?
More importantly, how are the equations for calculating the loss applied to these blobs? That is, how are the matrices/blobs arranged, and what is the equation for the eventual "loss value" that is output?
Thank you.
Edit:
I have referenced this page to understand the concepts behind the loss equation; I just don't know how it is applied to the blobs, along which axis, etc.: http://cs231n.github.io/linear-classify/
Here is the documentation from caffe:
Firstly, the input blobs should be of the form data NxKxHxW and label Nx1xHxW, where each value in the label blob is an integer in [0, K-1]. I think there's an error in the caffe documentation in that it doesn't consider the semantic segmentation case, and I'm not sure what K = CHW means. The output blob has shape 1x1x1x1, which is the loss.
Secondly, the loss function is as follows, from softmax_loss_layer.cpp:
loss -= log(std::max(prob_data[i * dim + label_value * inner_num_ + j], Dtype(FLT_MIN)));
Breaking that line down (for semantic segmentation):
std::max is just there to clamp the probability to at least FLT_MIN, so you never take the log of zero or an otherwise invalid value
prob_data is the output of the softmax; as explained in the caffe tutorials, the softmax loss layer can be decomposed into a softmax layer followed by a multinomial logistic loss
i * dim moves to the i-th image in your batch, where the blob shape is NxKxHxW (K is the number of classes) and dim = K*H*W
label_value * inner_num_ moves to the channel of the ground-truth class, because at this stage each one of your classes has its own HxW "image" of probabilities, so to speak (inner_num_ = H*W)
Finally, j is the index of the pixel within that HxW map
Basically, you want prob_data[i * dim + label_value * inner_num_ + j] to be as close to 1 as possible for each pixel. This means the negative log of it will be close to 0 (the log is base e). You then run stochastic gradient descent on that loss.
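Here is a minimal numpy sketch (not caffe code; the shapes are made up) of how that indexing plays out for segmentation:

import numpy as np

N, K, H, W = 2, 3, 4, 5                      # batch, classes, height, width
prob = np.random.rand(N, K, H, W)
prob /= prob.sum(axis=1, keepdims=True)      # stand-in for the softmax output
label = np.random.randint(0, K, size=(N, 1, H, W))

prob_flat = prob.reshape(N, -1)              # dim = K * H * W values per image
label_flat = label.reshape(N, -1)
inner_num = H * W                            # inner_num_: one entry per pixel

loss = 0.0
for i in range(N):                           # i * dim: pick the i-th image
    for j in range(inner_num):               # j: pixel index within H*W
        label_value = label_flat[i, j]       # ground-truth class at this pixel
        p = prob_flat[i, label_value * inner_num + j]
        loss -= np.log(max(p, np.finfo(np.float32).tiny))   # std::max(..., FLT_MIN)
loss /= N * inner_num                        # caffe divides by a normalizer (here: all pixels)
print(loss)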
I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes be in the range [-1, 1] (the range of normalized 2D points)?
Another problem is that when I train for more than 1 epoch, the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo.
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting keypoints in an image is more of a regression problem than a classification problem (especially if your model outputs and targets fall within a continuous interval). Therefore, I suggest you use the L2 loss.
In fact, it could be a good exercise to determine, using cross-validation, which of the loss functions suited to regression gives the lowest expected generalization error. There are several such functions available in PyTorch.
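For concreteness, here is a minimal sketch of that setup; the layer sizes, the 42-dimensional output (21 keypoints x 2 coordinates), and the dummy data are invented, and only the Tanh output and the MSE loss are the point. Note that nn.MSELoss can never go negative, which also avoids the confusing negative values you are seeing:

import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical regression head: 21 keypoints * 2 coords = 42 outputs in [-1, 1]
model = nn.Sequential(
    nn.Linear(128, 64),        # replace with your actual feature extractor / sizes
    nn.ReLU(),
    nn.Linear(64, 42),
    nn.Tanh(),                 # squashes outputs into [-1, 1]
)

criterion = nn.MSELoss()       # L2 loss for regression; always >= 0
optimizer = optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 128)              # dummy batch of features
target = torch.rand(8, 42) * 2 - 1   # dummy keypoints normalized to [-1, 1]

optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
optimizer.step()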
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and then scale them to [-1, 1] with the transformation 2*x - 1.
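A sketch of that variant (scale_outputs is just an illustrative helper, not a PyTorch function):

import torch

def scale_outputs(logits: torch.Tensor) -> torch.Tensor:
    # sigmoid maps to [0, 1]; 2*x - 1 rescales that to [-1, 1]
    return 2 * torch.sigmoid(logits) - 1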
I read a paper about machine translation, and it uses a projection layer. Its encoder has 6 bidirectional LSTM layers. If the input embedding dimension is 512, what will the dimension of the encoder output be? 512 * 2**5?
The paper's link: https://www.aclweb.org/anthology/P18-1008.pdf
Not quite. Unfortunately, Figure 1 in the mentioned paper is a bit misleading. It is not that the six encoding layers are in parallel, as it might be understood from the figure, but rather that these layers are successive, meaning that the hidden state/output from the previous layer is used in the subsequent layer as an input.
This, together with the fact that the input (embedding) dimension is NOT the output dimension of a bidirectional LSTM layer (which is in fact 2 * hidden_size), changes your output dimension to exactly that: 2 * hidden_size, before it goes into the final projection layer, which in turn changes the dimension according to your specifications.
It is not quite clear to me what the description of add does in the layer, but if you look at a reference implementation it seems to be irrelevant to the answer. Specifically, observe how the encoding function is basically
def encode(...):
    encode_inputs = self.embed(...)
    for l in range(num_layers):
        prev_input = encode_inputs
        encode_inputs = self.nth_layer(...)
        # ...
Obviously, there is a bit more happening here, but this illustrates the basic functional block of the network.
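If you want to convince yourself of the 2 * hidden_size output size, here is a quick PyTorch sketch (the sizes are assumptions, not the paper's exact configuration):

import torch
import torch.nn as nn

embedding_dim, hidden_size, num_layers = 512, 512, 6
encoder = nn.LSTM(input_size=embedding_dim,
                  hidden_size=hidden_size,
                  num_layers=num_layers,       # stacked successively, not in parallel
                  bidirectional=True,
                  batch_first=True)

x = torch.randn(4, 20, embedding_dim)          # (batch, seq_len, embedding_dim)
out, _ = encoder(x)
print(out.shape)                               # torch.Size([4, 20, 1024]) == 2 * hidden_size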
In my neural network model, I represent an 8-word sentence with an 8x256-dimensional embedding matrix. I want to give it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the pytorch documentation, the input should be of shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to pytorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells pytorch to add a singleton dimension there.
Another way is to use view:
x = x.view(8, 1, 256)
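You could also use unsqueeze, which inserts a singleton dimension at the given position:
x = x.unsqueeze(1)   # (8, 256) -> (8, 1, 256)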
I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is that you encode each condition as an input parameter and let the network learn the dependency between the condition and the feature-label mapping. On a new dataset, instead of adapting the entire network, you just tune these weights using backprop. For example, say my network looks like this:
X ---->|----|
       |DNN |----> Y
Z ---->|----|
X: features, Y: labels, Z: condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True while all other layers are set to trainable=False? Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the keras backend I was able to compute the gradient of my loss w.r.t. Z directly and use it to drive the update.
Code below:
import keras.backend as K
import numpy as np

model.summary()  # pretrained model

# loss between the target node Y and the model output Y_out
loss = K.categorical_crossentropy(Y, Y_out)

# gradient of the loss w.r.t. the condition-code input Z, normalized
# (same trick as in fchollet's post)
grads = K.gradients(loss, Z)[0]
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)

iterate = K.function([X, Z], [loss, grads])

step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
    loss_val, grads_val = iterate([X_in, Z_adapt])
    Z_adapt -= grads_val * step
    print("iter:", i, np.mean(loss_val))

print("Before:")
print(model.evaluate([X_in, Z_in], Y_out))
print("After:")
print(model.evaluate([X_in, Z_adapt], Y_out))
X, Y, Z are nodes in the model graph. Z_in is an initial value for Z'; I set it to an average value from the train set. Z_adapt is the result after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. Then you can first define an input layer of size m*n x 1; the input will be an m*n x 1 vector of ones. Define a dense layer containing m*n neurons and set trainable = True for it. The response of this layer gives you a flattened version of Z. Reshape it appropriately and give it as input to the rest of the network, which can be appended after this.
Keep in mind that if the size of Z is too large, the network may not be able to learn a dense layer with that many neurons. In that case, you may need to put additional constraints on it or look into convolutional layers. However, convolutional layers will put some constraints on Z.
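A rough sketch of that idea using the Keras functional API (all names, sizes, and the dummy head standing in for your pretrained DNN are made up for illustration):

import numpy as np
from keras.layers import Input, Dense, Reshape, Flatten, Concatenate
from keras.models import Model

m, n = 4, 8                                   # assumed size of Z
feat_dim = 100                                # assumed size of the feature input X

x_in = Input(shape=(feat_dim,), name="X")
ones_in = Input(shape=(m * n,), name="ones")  # always fed a constant vector of ones

# The weights of this Dense layer play the role of Z: with an all-ones input
# and no bias, its output is simply a learned m*n vector.
z_flat = Dense(m * n, use_bias=False, name="Z_weights")(ones_in)
z = Reshape((m, n))(z_flat)

# Dummy stand-in for the pretrained network; append your frozen DNN here instead.
h = Concatenate()([x_in, Flatten()(z)])
y_out = Dense(10, activation="softmax")(h)

model = Model(inputs=[x_in, ones_in], outputs=y_out)

# During adaptation, only the Z_weights layer stays trainable.
for layer in model.layers:
    layer.trainable = (layer.name == "Z_weights")
model.compile(optimizer="sgd", loss="categorical_crossentropy")

X_batch = np.random.rand(32, feat_dim)
ones_batch = np.ones((32, m * n))
Y_batch = np.eye(10)[np.random.randint(0, 10, size=32)]
model.fit([X_batch, ones_batch], Y_batch, epochs=1, verbose=0)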
In caffe, the convolution layer takes one bottom blob, and convolves it with learned filters (which are initialized using the weight type - "Xavier", "MSRA" etc.). However, my question is whether we can simply convolve two bottom blobs and produce a top blob. What would be the most elegant way of doing this? The purpose of this is: one of the bottom blob will be data and the other one will be a dynamic filter (changing depending on the data) produced by previous layers (I am trying to implement dynamic convolution).
My attempt:
One way that came to my mind was to modify filler.hpp and assign a bottom blob as the filler matrix itself (instead of "Xavier", "MSRA", etc.). I thought the convolution layer would then pick it up from there. We can set lr = 0 to indicate that the weight initialized by our custom filler should not be changed. However, after looking at the source code, I still don't know how to do it. On the other hand, I don't want to break caffe's workflow; I still want the conv layers to function normally when I want them to.
Obviously, a more tedious way is to use a combination of Slice, Tile and/or Scale layers to literally implement the convolution. I think it would work, but it would turn out to be messy. Any other thoughts?
Edit 1:
I wrote a new layer by modifying caffe's convolution layer. In particular, in src/caffe/layers/conv_layer.cpp, on line 27, it takes the weight defined by the filler and convolves it with the bottom blob. So instead of populating that blob from the filler, I modified the layer so that it now takes two bottoms, one of which is assigned directly in place of the filler weights. I then had to make some other changes, such as the following:
The weight blob normally has the same value for all the samples, whereas here it will have a different value for each sample. So I changed line 32 from:
this->forward_cpu_gemm(
    bottom_data + n * this->bottom_dim_,
    weight,
    top_data + n * this->top_dim_);
to:
this->forward_cpu_gemm(
    bottom_data + n * bottom[1]->count(1),
    bottom[0]->cpu_data() + n * bottom[0]->count(1),
    top_data + n * this->top_dim_);
To make things easier, I assumed that there is no bias term involved, the stride is always 1, the padding is always 0, the group is always 1, etc. However, when I tested the forward pass, it gave me a weird answer (with a simple convolution kernel of np.ones((1,1,3,3))). The learning rates were set to zero for this kernel so that it doesn't change. However, I can't get the right answer. Any suggestions will be appreciated.
Please do not propose solutions using existing layers such as Slice, Eltwise, Crop. I have already implemented that; it works, but it is unbelievably complex and memory inefficient.
I think you are on the right track as a whole.
As for the "weird" convolution results, I guess the bug is most likely this:
Consider a 2D convolution and suppose bottom[1]'s shape is (num, channels, height, width).
Convolution in caffe is performed as a multiplication of two matrices: weight (representing the convolution kernels) and col_buffer (the data to be convolved, reorganized by im2col). weight has num_out rows and channels / this->group_ * kernel_h * kernel_w columns, while col_buffer has channels / this->group_ * kernel_h * kernel_w rows and height_out * width_out columns. So, as the weight blob of the dynamic convolution layer, bottom[0] should preferably have shape (num, num_out, channels / group, kernel_h, kernel_w) in order to satisfy
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
where num_out is the number of output feature maps of the dynamic convolution layer.
That means, to make the convolution function
this->forward_cpu_gemm(bottom_data + n * bottom[1]->count(1),
                       bottom[0]->cpu_data() + n * bottom[0]->count(1),
                       top_data + n * this->top_dim_);
work properly, you must make sure that
bottom[0]->shape(0) == bottom[1]->shape(0) == num
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
So most probably the simple 4-dimensional kernel np.ones((1,1,3,3)) you used does not satisfy the above condition and therefore gives wrong convolution results.
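As a quick illustration of that shape check (pure Python with made-up dimensions, assuming a 3-channel input):

import numpy as np

num, channels, height, width = 1, 3, 8, 8     # bottom[1]: the data to be convolved
num_out, group, kernel_h, kernel_w = 1, 1, 3, 3

# per-sample size bottom[0] must have for forward_cpu_gemm to read enough data
required = num_out * (channels // group) * kernel_h * kernel_w   # 1 * 3 * 3 * 3 = 27

kernel = np.ones((1, 1, 3, 3))
per_sample = int(np.prod(kernel.shape[1:]))   # equivalent of bottom[0]->count(1) = 9

print(required, per_sample, required == per_sample)   # 27 9 False
# bottom[0] would need shape (num, num_out, channels // group, kernel_h, kernel_w),
# i.e. (1, 1, 3, 3, 3) here.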
Hope it's clear and will help you.
########## Update 1, Oct 10th, 2016, Beijing time ##########
I have added a dynamic convolution layer here, but with no unit test yet. This layer doesn't break caffe's workflow; it only changes some private members of the BaseConvolution class to protected.
The files involved are:
include/caffe/layers/dyn_conv_layer.hpp, base_conv_layer.hpp
src/caffe/layers/dyn_conv_layer.cpp(cu)
It follows the convolution layer in caffe almost exactly, and the main differences are:
It overrides LayerSetUp() to initialize this->kernel_dim_, this->weight_offset_, etc. properly for convolution, and skips initializing this->blobs_, which the Convolution layer routinely uses to hold the weight and bias;
It overrides Reshape() to check that bottom[1], which serves as the kernel container, has a proper shape for convolution.
Because I have had no time to test it, there may be bugs; I will be very glad to see your feedback.
########## Update 2, Oct 12th, 2016, Beijing time ##########
I updated the test case for dynamic convolution just now. The file involved is src/caffe/test/test_dyn_convolution_layer.cpp. It seems to work fine, but it probably needs more thorough testing.
You can build this caffe by running cd $CAFFE_ROOT/build && ccmake .., then cmake -DBUILD_only_tests="dyn_convolution_layer" .., and finally make runtest to check it.