Unexpected shape for deconv layer - caffe

I'm working on a deconv layer that upscales 64 channels: 64x48x48 => 64x96x96.
layer {
bottom: "layer41_conv"
top: "layertest_upsample"
name: "layertest_upsample"
type: "Deconvolution"
convolution_param {
num_output: 64
group: 64
kernel_size: 2
pad: 0
stride: 2
}
}
When I print the shape of the parameters, I get:
(64, 1, 2, 2)
I expected something like:
(64, 64, 2, 2), because there are 64 input channels and 64 output channels.
Can anyone explain what's going on?

You defined group: 64.
What group does (according to the manual):
group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a subset of the input. Specifically, the input and output channels are separated into g groups, and the k-th output group channels will be only connected to the k-th input group channels.
In your case you grouped all 64 channels into 64 groups - this means that the k-th input channel is mapped (in isolation) by a 2x2 kernel to the k-th output channel. Overall you have 64 such 2x2 mappings, and this is why your weight blob is 64x1x2x2 and not 64x64x2x2.
If you remove the group: 64 you'll have the full weight matrix you expect.
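As a sanity check, here is a minimal Python sketch of the rule Caffe uses to size convolution/deconvolution weight blobs, namely (num_output, channels_in / group, kernel_h, kernel_w). Note that weight_shape is a hypothetical helper for illustration, not a Caffe API:
# Caffe sizes conv/deconv weight blobs as
# (num_output, channels_in / group, kernel_h, kernel_w).
def weight_shape(num_output, channels_in, group, kernel_h, kernel_w):
    assert channels_in % group == 0, "channels_in must divide evenly into groups"
    return (num_output, channels_in // group, kernel_h, kernel_w)

print(weight_shape(64, 64, 64, 2, 2))  # (64, 1, 2, 2)  -> the shape you observed
print(weight_shape(64, 64, 1, 2, 2))   # (64, 64, 2, 2) -> without group: 64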

What is the result inside an FFT bin at different indexes?

For example, if I have 10 samples in the FFT of a system sampled at 100 Hz, how are the results represented in the FFT bin output?
The frequency resolution is how many Hz each DFT bin represents. This is, as you have noted, given by fs/N. In this case:
resolution = 100/10 = 10 Hz
The definition of the FFT states that the frequency range [0, fs) is represented by N points.
But in the result:
X[0] is the constant term (also referred to as the DC bias)
X[1] is the 10 Hz bin
X[2] is the 20 Hz bin
X[3] is the 30 Hz bin
X[4] is the 40 Hz bin
X[5] is the 50 Hz bin
X[6] to X[9] are the complex conjugates of X[4] to X[1], respectively.
So my question is: is X[6] the 60 Hz signal, whose value happens to be the complex conjugate of the 40 Hz signal? Or does X[6] contain no 60 Hz spectral content and is it only the complex conjugate of X[4]?
Please suggest.
The DFT bins are indexed $n = 0$ to $N-1$, where the frequency of bin $n$ is given as $n F_s/N$.
Due to the sampling process, for any signal (real or complex) the upper half of this frequency spectrum, which extends from $F_s/2$ to 1 sample less than $F_s$, is equivalent to the negative half spectrum (the spectrum that extends from $-F_s/2$ to 1 sample less than 0). This applies to the DFT and to all digital signals.
This is demonstrated in the graphics below, showing how an analog spectrum (top line) convolves with the sampling spectrum (second line) of a system sampled at 20 Hz to produce the digital spectrum of a sampled signal. Since the spectrum is unique over a 20 Hz range (and repeats everywhere else), we only need to show the frequencies over any 20 Hz range to represent the signal. This can be shown identically from -10 Hz to +10 Hz, or from 0 to 20 Hz.
[Figure: sampled spectrum of a real signal sampled at 20 Hz, shown over -10 Hz to +10 Hz]
[Figure: the same real signal's spectrum, shown over 0 to 20 Hz]
[Figure: sampled spectrum of a complex signal sampled at 20 Hz, shown over -10 Hz to +10 Hz]
[Figure: the same complex signal's spectrum, shown over 0 to 20 Hz]
Since the spectrum for the DFT is discrete, the samples go from $n = 0$ to $N-1$, where each frequency is given as $n F_s/N$ as described above. There are also only $N$ samples in the DFT of course, but as you rotate the samples in the DFT you effectively move through the expanded spectrum above. It is helpful to view the spectrum on the surface of a cylinder whose circumference goes from 0 to $F_s$, where you see that $F_s$ is equivalent to 0, and going backwards from 0 is equivalent to going into the negative half spectrum.
So in your example specifically:
X[0] = DC
X[1] = 10 Hz
X[2] = 20 Hz
X[3] = 30 Hz
X[4] = 40 Hz
X[5] = 50 Hz
X[6] = 60 Hz
X[7] = 70 Hz
X[8] = 80 Hz
X[9] = 90 Hz
This can also be represented as:
X[0] = DC
X[1] = 10 Hz
X[2] = 20 Hz
X[3] = 30 Hz
X[4] = 40 Hz
X[5] = -50 Hz
X[6] = -40 Hz
X[7] = -30 Hz
X[8] = -20 Hz
X[9] = -10 Hz
Note that the command fftshift in MATLAB shifts the DFT vector accordingly to produce the following order, representing the range from $-F_s/2$ to $+F_s/2$:
fftshift([X[0], X[1], X[2], X[3], X[4], X[5], X[6], X[7], X[8], X[9]]) =
[X[5], X[6], X[7], X[8], X[9], X[0], X[1], X[2], X[3], X[4]]
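As a quick numerical check, here is a small NumPy sketch (assuming numpy.fft) that prints the bin frequencies for N = 10 and fs = 100 Hz before and after the shift:
import numpy as np

fs, N = 100, 10
freqs = np.fft.fftfreq(N, d=1/fs)   # bin frequencies in natural DFT order
print(freqs)
# [  0.  10.  20.  30.  40. -50. -40. -30. -20. -10.]
print(np.fft.fftshift(freqs))       # reordered from -Fs/2 up to just below +Fs/2
# [-50. -40. -30. -20. -10.   0.  10.  20.  30.  40.]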
Assuming your input signal is purely real, then X[6] is just the complex conjugate of X[4]. The top half of the spectrum is essentially redundant for real signals. See this question and answer for further details.
Note that the input signal should be band-limited to below $F_s/2$, otherwise aliasing will occur, so for a baseband signal there should not be any components at or above $F_s/2$.
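To make the conjugate symmetry concrete, here is a small NumPy sketch using a real 40 Hz tone sampled at 100 Hz, matching the numbers in the question:
import numpy as np

fs, N = 100, 10
t = np.arange(N) / fs
x = np.cos(2 * np.pi * 40 * t)           # a purely real 40 Hz tone
X = np.fft.fft(x)

print(np.round(np.abs(X), 3))            # energy only in bins 4 and 6
print(np.allclose(X[6], np.conj(X[4])))  # True: X[6] is the conjugate of X[4]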

32-bit int struct bits don't seem to match up (nodejs)

I have a file that defines a set of tiles (used in an online game). The format for each tile is as follows:
x: 12 bits
y: 12 bits
tile: 8 bits
32 bits in total, so each tile can be expressed as a 32-bit integer.
More info about the file format can be found here:
http://wiki.minegoboom.com/index.php/LVL_Format
http://www.rarefied.org/subspace/lvlformat.html
The 4-byte structures are not broken along byte boundaries. As you can see, x and y are both defined as 12 bits, i.e. x is stored in 1.5 bytes, y is stored in 1.5 bytes, and tile is stored in 1 byte.
Even though x and y use 12 bits, their max value is 1023, so they could be expressed in 10 bits. This was a choice by the creator of the format; I guess they were just padding things out so they could use a 32-bit integer for each tile. Either way, for x and y we can ignore the top 2 bits of each field.
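To illustrate the layout just described (a sketch in Python rather than JavaScript; low 12 bits x, next 12 bits y, top 8 bits tile, stored little-endian):
x, y, tile = 1023, 1023, 1
n = (tile << 24) | (y << 12) | x        # pack the three fields into one 32-bit word
print(hex(n))                           # 0x13ff3ff
print(n.to_bytes(4, "little").hex())    # fff33f01, i.e. <Buffer ff f3 3f 01>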
I'm using a nodejs Buffer to read the file and I'm using the following code to read the values.
var n = tileBuffer.readUInt32LE(0);
var x = n & 0x03FF;
var y = (n >> 12) & 0x03FF;
var tile = (n >> 24) & 0x00ff;
This code works fine but when I read the bits themselves, in an attempt to understand binary better, I see something that confuses me.
Take, for example, an int that expresses the following:
x: 1023
y: 1023
tile: 1
Creating the tiles in a map editor and reading the resulting file into a buffer returns <Buffer ff f3 3f 01>
When I convert each byte into a string of bits I get the following:
ff = 11111111
f3 = 11110011
3f = 00111111
01 = 00000001
11111111 11110011 00111111 00000001
I assume I should just take the first 12 bits as x but chop off the last 2 bits. Use the next 12 bits as y, chopping off 2 bits again, and the remaining 8 bits would be the tile.
x: 1111111111
y: 0011001111
tile: 00000001
The x is correct (1111111111 = 1023), the y is wrong (0011001111 = 207, not 1023), and tile is correct (00000001 = 1)
I'm confused and obviously missing something.
It makes more sense to look at the bytes in this order (this is the binary representation of n):
00000001 00111111 11110011 11111111
In that order, you can easily do the masking and shifting visually.
The problem with what you did is that, for example in 11111111 11110011, the bits of the second byte that belong to the first field are on the right (the low part of that byte), so read in that direction the field's bits are discontinuous.
Also, masking with 0x03FF makes those first two fields only 10 bits wide, with two bits simply disappearing. You can make them 12 bits by masking with 0x0FFF instead. As it is now, you effectively have two padding bits.
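For completeness, here is the same unpacking with full 12-bit masks (sketched in Python; the shifts mirror the JavaScript above):
n = int.from_bytes(bytes([0xFF, 0xF3, 0x3F, 0x01]), "little")
x = n & 0xFFF            # full 12-bit field (mask with 0x3FF to keep only 10 bits)
y = (n >> 12) & 0xFFF
tile = (n >> 24) & 0xFF
print(x, y, tile)        # 1023 1023 1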

Octave leasqr only doing one iteration

As I'm trying to fit a function to some experimental data, I've written a function with three inputs, three parameters and one output:
qrfunc = @(x, p) exp(-1*p(1)*x(:,1) - p(2)*x(:,2)) + p(3)*x(:,3) + 20;
When I generate some input and output values:
pS = [0.5; 0.3; 0.3];
x1 = [1 1 1; 1 1.1 1; 1 1.1 1.1; 2 1.2 2];
y1 = qrfunc(x1, pS);
And call the leasqr function:
pin =[1; 1; 1];
[f1, p1, kvg1, iter1, corp1, covp1, covr1, stdresid1, Z1, r21] = leasqr(x1, y1, pin, qrfunc, 0.0001);
This works correctly: the function performs 7 iterations and provides the right outputs.
But when I load x1 from my experimental data (a text file with three columns, about 1500 lines) as well as my y1 (a text file with the same number of lines) and run the same function, it performs only one iteration and does not change the parameters.
It even shows that the error margins are very high:
sqrt(diag(covp1))
ans =
3.0281e+004
3.7614e+005
1.9477e-002
What am I doing wrong? There are no error messages, no 'Convergence not achieved' or anything like that...
Edit:
The data is loaded with the command:
load "input.txt"
load "output.txt"
Proof of loading:
size(input)
ans =
1540 3
The first few lines from my input file:
10 0.4 5
20 0.4 5
30 0.4 5
40 0.4 5
50 0.4 5
The second and third columns take different values further down the file.

Negative Padding in Convolution (Caffe)

I have a vector which consists of a concatenation of the features of a video sequence across 7 frames.
I would like to apply a 1D convolution to this vector such that only part of a frame is processed at a time.
Say the feature vector for one frame has length 10;
my input feature vector would then have total length 7 x 10 = 70.
Now I want two convolutions to work on different parts of that vector:
conv1 should treat features 1:5
conv2 should treat features 6:10
The stride for both would be 10,
so the convolution filters apply to the same features, only in different frames.
Basically I would need to specify an offset for the second conv filter. Is that possible?
On the Caffe website they speak only about zero padding, but for an offset I would need negative padding.
Is something like this possible?
layer {
name: "conv2"
type: "Convolution"
bottom: "data"
top: "conv2"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 90
kernel_h: 1
kernel_w: 5
pad_h: 0
pad_w: -5
stride: 10
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
I think you can achieve this by using a Slice layer instead of negative padding.
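Something along these lines (a sketch, assuming the 70 features lie along the width axis of an N x C x 1 x 70 blob; the layer and blob names are placeholders):
layer {
  name: "slice_offset"
  type: "Slice"
  bottom: "data"
  top: "data_head"      # features 1:5 (feed to a Silence layer if unused)
  top: "data_shifted"   # features 6:70
  slice_param {
    axis: 3             # assumption: the features lie along the width axis
    slice_point: 5
  }
}
conv1 can keep reading from data directly, while conv2 takes data_shifted as its bottom; with kernel_w: 5 and stride: 10 its first window then covers features 6:10, the next 16:20, and so on.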

How is 2D Shared Memory arranged in CUDA

I've always worked with linear shared memory (load, store, access neighbours), but I made a simple 2D test to study bank conflicts, and the results have confused me.
The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {
// shared memory space
__shared__ int shData[16][16];
// map from threadIdx/BlockIdx to data position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// calculate the global id into the one dimensional array
int gid = x + y * w;
// load shared memory
shData[threadIdx.x][threadIdx.y] = gIn[gid];
// synchronize threads not really needed but keep it for convenience
__syncthreads();
// write data back to global memory
gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The Visual Profiler reported conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior has confused me, because in Programming Massively Parallel Processors: A Hands-on Approach we can read:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
Is this related to the shared memory arrangement, or to the thread indexes? Am I missing something?
The kernel configuration is as follow:
// kernel configuration
dim3 dimBlock = dim3 ( 16, 16, 1 );
dim3 dimGrid = dim3 ( 64, 64 );
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
Thanks in advance.
Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row-wise, something like this:
bank0 .... bank15
row 0 [ 0 .... 15 ]
1 [ 16 .... 31 ]
2 [ 32 .... 47 ]
3 [ 48 .... 63 ]
4 [ 64 .... 79 ]
5 [ 80 .... 95 ]
6 [ 96 .... 111 ]
7 [ 112 .... 127 ]
8 [ 128 .... 143 ]
9 [ 144 .... 159 ]
10 [ 160 .... 175 ]
11 [ 176 .... 191 ]
12 [ 192 .... 207 ]
13 [ 208 .... 223 ]
14 [ 224 .... 239 ]
15 [ 240 .... 255 ]
col 0 .... col 15
Because there are 16 32-bit shared memory banks on pre-Fermi hardware, every integer entry in a given column maps onto the same shared memory bank. So how does that interact with your choice of indexing scheme?
The thing to keep in mind is that threads within a block are numbered in the equivalent of column-major order (technically, the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:
shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:
shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, so no conflicts occur.
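You can check this with a small sketch (plain Python standing in for the hardware's bank arithmetic, assuming 16 banks of 4-byte words):
# Bank hit by element [row][col] of a row-major int[16][16] array,
# with 16 banks of 4-byte words (pre-Fermi layout).
def bank(row, col, width=16, banks=16):
    return (row * width + col) % banks

# One half-warp: threadIdx.x runs 0..15 while threadIdx.y is fixed (say 0).
print([bank(tx, 0) for tx in range(16)])   # shData[threadIdx.x][threadIdx.y]
# [0, 0, ..., 0]  -> all 16 threads hit bank 0: a 16-way conflict
print([bank(0, tx) for tx in range(16)])   # shData[threadIdx.y][threadIdx.x]
# [0, 1, ..., 15] -> one thread per bank: conflict-free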