In Caffe, the convolution layer takes one bottom blob and convolves it with learned filters (which are initialized using the weight filler type: "Xavier", "MSRA", etc.). However, my question is whether we can simply convolve two bottom blobs to produce a top blob. What would be the most elegant way of doing this? The purpose is this: one of the bottom blobs will be data, and the other will be a dynamic filter (changing depending on the data) produced by previous layers (I am trying to implement dynamic convolution).
My attempt:
One way which came to my mind was to modify the filler.hpp and assign a bottom blob as a filler matrix itself (instead of "Xavier", "MSRA" etc.). Then I thought the convolution layer would pick up from there. We can set lr = 0 to indicate that the weight initialized by our custom filler should not be changed. However, after I looked at the source code, I still don't know how to do it. On the other hand, I don't want to break the workflow of caffe. I still want conv layers to function normally, if I want them to.
Obviously, a more tedious way is to use a combination of Slice, Tile and/or Scale layers to literally implement convolution. I think it would work, but it would turn out to be messy. Any other thoughts?
Edit 1:
I wrote a new layer by modifying the convolution layer of Caffe. In particular, in src/caffe/layers/conv_layer.cpp, on line 27, it takes the weight defined by the filler and convolves it with the bottom blob. So instead of populating that blob from the filler, I modified the layer so that it now takes two bottoms, one of which is used directly as the weight. I then had to make some other changes, such as:
In the original layer, the weight blob has the same value for all samples. Here it will have a different value for each sample. So I changed line 32 from:
this->forward_cpu_gemm(
bottom_data + n * this->bottom_dim_,
weight,
top_data + n * this->top_dim_);
to:
this->forward_cpu_gemm(
bottom_data + n * bottom[1]->count(1),
bottom[0]->cpu_data() + n * bottom[0]->count(1),
top_data + n * this->top_dim_);
To make things easier, I assumed that there is no bias term, stride is always 1, padding is always 0, group is always 1, etc. However, when I tested the forward pass with a simple convolution kernel (np.ones((1,1,3,3))), it gave me a weird answer. The learning rates were set to zero for this kernel so that it doesn't change, but I still can't get the right answer. Any suggestions will be appreciated.
Please do not propose solutions using existing layers such as Slice, Eltwise, Crop. I have already implemented - it works - but it is unbelievably complex and memory inefficient.
I think you are on the right track overall.
For the "weird" convolution results, my guess at the most likely bug is this:
Consider 2D convolution
and suppose bottom[1]'s shape is (num, channels, height, width).
Convolution in Caffe is performed as a multiplication of two matrices: weight (the convolution kernels) and col_buffer (the data to be convolved, rearranged by im2col). weight has num_out rows and channels / this->group_ * kernel_h * kernel_w columns; col_buffer has channels / this->group_ * kernel_h * kernel_w rows and height_out * width_out columns. So, as the weight blob of a dynamic convolution layer, bottom[0] should have the shape (num, num_out, channels/group, kernel_h, kernel_w) to satisfy
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
where num_out is the number of output feature maps of the dynamic convolution layer.
That means, to make the convolution function
this->forward_cpu_gemm(bottom_data + n * bottom[1]->count(1)
, bottom[0]->cpu_data() + n * bottom[0]->count(1)
, top_data + n * this->top_dim_);
work properly, you must make sure that
bottom[0]->shape(0) == bottom[1]->shape(0) == num
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
So, most likely, the simple 4-dimensional kernel np.ones((1,1,3,3)) you used does not satisfy the conditions above, which leads to the wrong convolution results.
Hope it's clear and will help you.
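To make the shape requirement concrete, here is a minimal sketch in plain Python (not Caffe code) of a per-sample dynamic convolution. The function name dynamic_conv2d and the nested-list layout are assumptions made for illustration only; like the question, it assumes stride 1, no padding, group 1 and no bias. The point it shows is that the kernel input carries a leading num axis, so sample n is convolved with its own kernels[n].

```python
def dynamic_conv2d(data, kernels):
    """Per-sample 2-D convolution (cross-correlation, as Caffe computes it).

    data:    nested lists shaped [num][channels][height][width]
    kernels: nested lists shaped [num][num_out][channels][kernel_h][kernel_w]
    Assumes stride 1, no padding, group 1, no bias (illustrative only).
    """
    num = len(data)
    # The kernel input must provide one set of kernels per sample.
    if len(kernels) != num:
        raise ValueError("kernels must have the same leading size as data")
    out = []
    for n in range(num):
        channels = len(data[n])
        height, width = len(data[n][0]), len(data[n][0][0])
        sample_out = []
        for k in kernels[n]:  # one output feature map per filter
            if len(k) != channels:
                raise ValueError("kernel channels must match data channels")
            kh, kw = len(k[0]), len(k[0][0])
            fmap = [[sum(data[n][c][y + dy][x + dx] * k[c][dy][dx]
                         for c in range(channels)
                         for dy in range(kh)
                         for dx in range(kw))
                     for x in range(width - kw + 1)]
                    for y in range(height - kh + 1)]
            sample_out.append(fmap)
        out.append(sample_out)
    return out


# Two samples, one channel, 4x4 input; each sample gets its own 3x3 kernel.
data = [[[[1.0] * 4 for _ in range(4)]], [[[2.0] * 4 for _ in range(4)]]]
kernels = [[[[[1.0] * 3 for _ in range(3)]]], [[[[2.0] * 3 for _ in range(3)]]]]
result = dynamic_conv2d(data, kernels)
```

A kernel shaped like np.ones((1,1,3,3)) has no per-sample axis, so with a batch of more than one sample the offset n * bottom[0]->count(1) would read past the single kernel.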
########## Update 1, Oct 10th,2016,Beijing time ##########
I added a dynamic convolution layer here, but with no unit test yet. This layer doesn't break the workflow of Caffe; it only changes some private members of the BaseConvolution class to protected.
The files involved are:
include/caffe/layers/dyn_conv_layer.hpp,base_conv_layer.hpp
src/caffe/layers/dyn_conv_layer.cpp(cu)
It is almost the same as the convolution layer in Caffe. The main differences are:
Override the function LayerSetUp() to initialize this->kernel_dim_, this->weight_offset_, etc. properly for convolution, and skip initializing this->blobs_, which the Convolution layer normally uses to hold the weight and bias;
Override the function Reshape() to check that bottom[1], as the kernel container, has the proper shape for convolution.
Because I have had no time to test it, there may be bugs, and I will be very glad to hear your feedback.
########## Update 2, Oct 12th,2016,Beijing time ##########
I have just added a test case for dynamic convolution. The file involved is src/caffe/test/test_dyn_convolution_layer.cpp. It seems to work fine, but it may need more thorough testing.
To check it, you can build this Caffe with cd $CAFFE_ROOT/build && ccmake .., then cmake -DBUILD_only_tests="dyn_convolution_layer" .. and make runtest.
I want to learn how to do backpropagation for a neural net offline, so I came up with the following example:
and 1 stands for the bias. The activation function in the last layer is linear.
where
My intuitive problem
I want to calculate backpropagation offline, but I'm not sure how it should be done. I understand the intuition behind online backprop, where we calculate the gradient observation by observation. But I have no idea how it should work offline, i.e. how to calculate with all observations at once. Could you please give me a hint about which direction I should follow?
So you say that you know how to do backpropagation on a single sample, but not on many samples at the same time, right? Then let's just assume that you have some loss function, e.g. the mean squared error. Then, for a single sample (x, label) your loss would be (y - label)^2. Since you said you know how to do backpropagation on a single sample, you should know that we now take the gradient of our loss with respect to our last node, y. This gradient should be 2 * (y - label). From there on, we propagate the gradient through the whole network.
If we now have a batch of samples instead, we just use the loss function for the whole batch. Let's say our batch has two samples. Then our mean squared error would just be the sum of the individual losses divided by the number of samples. Thus, the loss would be 1/2 * ((y_1 - label_1)^2 + (y_2 - label_2)^2). And now we can just do the same as before: we take the gradient of our loss function with respect to our last node, y. It is important to realize that both y_1 and y_2 in our loss are actually the output of our final node y, just for the different samples. This means that the gradient is 1/2 * (2*(y_1 - label_1) + 2*(y_2 - label_2)). From here you have a single gradient value for the whole batch (if you plug in y_1, y_2, label_1, and label_2) and can propagate the gradient through the network as before.
As a summary: Instead of calculating the loss for a single sample, we now use a loss function that includes our whole batch (e.g. by just summing over all samples). This produces a single gradient, and we can proceed as before.
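As a small illustration, here is a sketch for a hypothetical one-weight model y = w * x + b with the mean squared error (the model and the function names are assumptions, not taken from the question). One detail it makes explicit: when propagating back to the parameters, each sample's term 2 * (y_i - label_i) / n is multiplied by that sample's own input before the terms are summed, so the batch gradient equals the average of the per-sample gradients.

```python
def forward(w, b, x):
    """Hypothetical model with a linear output: y = w * x + b."""
    return w * x + b


def batch_gradients(w, b, xs, labels):
    """Gradients of L = (1/n) * sum_i (y_i - label_i)^2 w.r.t. w and b."""
    n = len(xs)
    ys = [forward(w, b, x) for x in xs]
    # dL/dy_i for every sample in the batch.
    dys = [2.0 * (y - t) / n for y, t in zip(ys, labels)]
    # Chain rule: dy_i/dw = x_i and dy_i/db = 1; the contributions add up.
    dw = sum(dy * x for dy, x in zip(dys, xs))
    db = sum(dys)
    return dw, db


def single_sample_gradients(w, b, x, label):
    """Online (one observation) version, for comparison."""
    dy = 2.0 * (forward(w, b, x) - label)
    return dy * x, dy


xs, labels = [1.0, 2.0], [0.0, 0.0]
dw, db = batch_gradients(1.0, 0.0, xs, labels)
per_sample = [single_sample_gradients(1.0, 0.0, x, t) for x, t in zip(xs, labels)]
mean_dw = sum(g[0] for g in per_sample) / len(per_sample)
mean_db = sum(g[1] for g in per_sample) / len(per_sample)
```

With these numbers the batch gradient and the mean of the two online gradients agree, which is the sense in which offline backpropagation is the online procedure averaged over the batch.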
I have been following the cs231n lectures from Stanford, trying to complete the assignments on my own and sharing the solutions both on GitHub and on my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is the following. Say I have the model below: Two Layered Neural Network
Let's assume that our loss function here is a softmax loss function. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I follow the graph backwards, say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax loss function outputs a single number (the loss) given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW, or simply dW, is dsoftmax/dx * x_prev, and dL/db, or simply db, is dsoftmax/dx * 1. Note that dsoftmax/dx here is the dSoft we defined earlier.
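A minimal single-sample sketch of this chain rule follows. The function names and the W[j][k] layout are assumptions for illustration, not the cs231n assignment code. dL/dL is indeed 1, but dSoft is dL/dscores, which is 1 multiplied by the local derivative of the loss, i.e. the softmax probabilities minus the one-hot label.

```python
import math


def softmax_loss(scores, label):
    """Cross-entropy loss of softmax(scores) and its gradient dL/dscores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    # dL/dscores = probs - one_hot(label); this is what the text calls dSoft.
    d_scores = probs[:]
    d_scores[label] -= 1.0
    return loss, d_scores


def last_layer_backward(x_prev, d_scores):
    """Backward pass of scores[k] = sum_j W[j][k] * x_prev[j] + b[k]."""
    # dL/dW[j][k] = x_prev[j] * dSoft[k]  (d scores / dW is x_prev)
    dW = [[xj * dk for dk in d_scores] for xj in x_prev]
    # dL/db[k] = dSoft[k] * 1
    db = d_scores[:]
    return dW, db


loss, d_soft = softmax_loss([0.0, 0.0], label=0)
dW, db = last_layer_backward([2.0, 3.0], d_soft)
```

For a batch, the same products are computed per sample and summed (or averaged) over the batch.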
The caffe documentation on the softmax_loss_layer.hpp file seems to be targeted towards classification tasks and not semantic segmentation. However, I have seen this layer being used for the latter.
What would be the dimensions of the input blobs and output blob in the case where you're classifying each pixel (semantic segmentation)?
More importantly, how are the equations for calculating the loss applied to these blobs? Like, in what form are the matrices/blobs arranged and the eventual "loss value" that's output, what is the equation for that?
Thank you.
edits:
I have referenced this page for understanding concepts of loss equation, just don't know how it's applied to the blobs, which axis, etc.: http://cs231n.github.io/linear-classify/
Here is the documentation from caffe:
Firstly, the input blobs should be of the form data NxKxHxW and label Nx1xHxW, where each value in the label blob is an integer in [0, K-1]. I think there's an error in the Caffe documentation where it doesn't consider the case of semantic segmentation, and I'm not sure what K = CHW means. The output blob has the shape 1x1x1x1, which is the loss.
Secondly, the loss function is as follows, from softmax_loss_layer.cpp:
loss -= log(std::max(prob_data[i * dim + label_value * inner_num_ + j], Dtype(FLT_MIN)));
Breaking that line down (for semantic segmentation):
std::max clamps the probability to at least FLT_MIN, so that log never receives 0 (which would give -inf)
prob_data is the output from the softmax, as explained in the caffe tutorials, softmax loss layer can be decomposed into a softmax layer followed by multinomial logistic loss
i * dim specifies the Nth image in your batch where the batch shape is like so NxKxHxW where K is the number of classes
label_value * inner_num_ specifies the Kth image, because at this stage each one of your classes has its own "image" of probabilities, so to speak
Finally, j is the index for each pixel
Basically, you want prob_data[i * dim + label_value * inner_num_ + j] for each pixel to be as close to 1 as possible. This means that the negative log of that will be close to 0. Here the log is to base e. And then you do the stochastic gradient descent for that loss.
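The indexing can be sketched in plain Python over a flat, row-major NxKxHxW list. This is an illustration of the quoted line only: the function name is made up, and it leaves out things the real layer also handles, such as ignore_label and normalization of the summed loss.

```python
import math

FLT_MIN = 1.17549435e-38  # smallest positive normalized 32-bit float


def softmax_loss_sum(prob_data, labels, num, classes, height, width):
    """Sum of -log(p[true class]) over every pixel of every image.

    prob_data: flat list in N x K x H x W order (softmax output)
    labels:    flat list in N x 1 x H x W order, integers in [0, K-1]
    """
    inner_num = height * width  # pixels per class map
    dim = classes * inner_num   # values per image
    loss = 0.0
    for i in range(num):            # image in the batch
        for j in range(inner_num):  # pixel within the image
            label_value = labels[i * inner_num + j]
            p = prob_data[i * dim + label_value * inner_num + j]
            loss -= math.log(max(p, FLT_MIN))
    return loss


# One image, two classes, 1x2 pixels.
# Class 0 map: [0.25, 0.5], class 1 map: [0.75, 0.5].
prob_data = [0.25, 0.5, 0.75, 0.5]
labels = [1, 0]  # pixel 0 belongs to class 1, pixel 1 to class 0
loss = softmax_loss_sum(prob_data, labels, num=1, classes=2, height=1, width=2)
```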
I have a compute shader which simulates a fluid as particles. Particles are read from a buffer, and each particle is handled in one thread. During the execution of the thread, the particle moves its uv position and adds to a pixel of a UAV named Water. Therefore each thread leaves a trail of its movement on the Water texture.
_watTx[texID] += watAddition * cellArea.x;
The problem is there are lots of particles moving around and most often multiples are present at the same texID. It seems there is a race condition since every time I run the simulation the results are slightly different. Is there a way to enforce mutual exclusion so the writes do not happen at the same time and the results become predictable?
I found a way to resolve this issue. InterlockedAdd adds to the pixel in an atomic fashion, but it only works on int and uint UAVs.
In my case the values are floating point, but the range is quite limited (like 0 to 10). So the solution is to use an int UAV. We multiply the calculation result by a large number (like 10000), convert it to an integer, and then add it to the UAV:
InterlockedAdd(_watTx[texID], (int)(watAddition * cellArea.x * 10000));
The results will have a 0.0001 precision which is perfectly fine in my case. After this in another pixel or compute shader we can multiply values from the int UAV by 0.0001 and write to the desired floating point render target.
This process eliminates the concurrent write problem and the results are identical in each run.
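The reason this gives identical results can be shown with a CPU-side sketch in Python (not shader code; the helper names are made up). Integer addition is order-independent, so the total no longer depends on which thread writes first. Note two assumptions: the sketch rounds to the nearest integer, whereas an HLSL (int) cast truncates, and on the GPU the accumulated scaled value has to stay within the 32-bit integer range.

```python
import random

SCALE = 10000  # fixed-point scale, giving 0.0001 precision


def to_fixed(value):
    """Convert a float contribution to a scaled integer (rounded)."""
    return int(round(value * SCALE))


def accumulate_fixed(contributions):
    """Integer accumulation, standing in for repeated InterlockedAdd calls."""
    total = 0
    for c in contributions:
        total += to_fixed(c)
    return total


def from_fixed(total):
    """Convert the integer total back to a float, as the later pass does."""
    return total / SCALE


# Simulate particles hitting the same pixel in two different orders.
rng = random.Random(0)
contributions = [rng.uniform(0.0, 0.01) for _ in range(1000)]
shuffled = contributions[:]
rng.shuffle(shuffled)

total_a = accumulate_fixed(contributions)
total_b = accumulate_fixed(shuffled)
```

Here total_a and total_b are the same integer regardless of the order, while a plain floating-point sum of the same values may differ in its last bits between orders.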
I have recently been learning DM-Script for TEM image processing.
I needed a Gaussian blur and found a script named 'Gaussian Blur' at http://www.dmscripting.com/recent_updates.html
This code implements the Gaussian blur by multiplying the fast Fourier transform (FFT) of the source image by the FFT of a Gaussian-kernel image, and finally taking the inverse Fourier transform of the product.
Here is the part of the code,
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) // FFT of Gaussian-kernel image
compleximage FFTSource:=realfft(warpimg) // FFT of source image
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
The point I want to ask is this
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of the Gaussian kernel need '.modulus().sqrt()' for the convolution?
Is it related to the fact that the Fourier transform of a Gaussian function is another Gaussian function?
Or is it related to some limitation of the discrete Fourier transform?
Please answer me
Thanks
This is related to the general precision limitation of any floating-point numeric computing (see e.g. here, or more in depth here).
A rotationally symmetric (real-valued) Gaussian of standard deviation sigma should be transformed into a 100% real-valued, rotationally symmetric Gaussian of standard deviation 1/sigma (up to a constant factor that depends on the FFT convention). However, doing this numerically will show you deviations. Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before back-transforming), you introduce even more errors. By using the modulus, one ensures that the forward-transformed kernel is purely real and hence a better "damping" curve.
A better implementation of a FFT filtering code would actually create the FFT(Gaussian) directly with a std.dev of 1/sigma, as this is the analytically correct result. Doing a FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
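The idea can be checked with a small 1-D sketch in plain Python (not DM-script; the function names and the sizes are assumptions for illustration). It compares a brute-force DFT of a sampled Gaussian with the analytic result sigma * sqrt(2*pi) * exp(-2 * (pi * sigma * k / n)^2), which is a Gaussian whose width in frequency bins is n / (2 * pi * sigma), i.e. proportional to 1/sigma.

```python
import cmath
import math


def dft(signal):
    """Plain O(n^2) discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def wrapped_gaussian(n, sigma):
    """Gaussian centred on index 0 with wrap-around, as an FFT kernel needs."""
    return [math.exp(-min(t, n - t) ** 2 / (2.0 * sigma ** 2))
            for t in range(n)]


def analytic_spectrum(n, sigma):
    """Analytic Fourier transform of the same Gaussian, sampled at the bins."""
    return [sigma * math.sqrt(2.0 * math.pi)
            * math.exp(-2.0 * (math.pi * sigma * min(k, n - k) / n) ** 2)
            for k in range(n)]


n, sigma = 64, 4.0
numeric = dft(wrapped_gaussian(n, sigma))
analytic = analytic_spectrum(n, sigma)

# Largest imaginary residue left by floating-point arithmetic.
max_imag = max(abs(z.imag) for z in numeric)
# Largest difference between the numeric real part and the analytic curve.
max_diff = max(abs(z.real - a) for z, a in zip(numeric, analytic))
```

The numeric transform carries small imaginary residues from floating-point arithmetic, while the analytic spectrum is real by construction, so multiplying the source FFT by it avoids both the extra transform and the need for modulus().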
In general: when implementing any "maths" in program code, it can pay hugely to think it through with the limits of numerical computation in the back of your head. Reduce the actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute-force numerical computation) and try to "reshape" equations when possible, e.g. avoid large sums over many small numbers, be careful about checks against exact numeric values, try to avoid expressions which are very sensitive to small numerical errors, etc.