caffe - understanding SoftmaxLayer::Backward_cpu function - caffe

I'm new to Caffe and am trying to understand the implementation of the softmax layer's backward function:
template <typename Dtype>
void SoftmaxLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
const vector<bool>& propagate_down,
const vector<Blob<Dtype>*>& bottom) {
const Dtype* top_diff = top[0]->cpu_diff();
const Dtype* top_data = top[0]->cpu_data();
Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
Dtype* scale_data = scale_.mutable_cpu_data();
int channels = top[0]->shape(softmax_axis_);
int dim = top[0]->count() / outer_num_;
caffe_copy(top[0]->count(), top_diff, bottom_diff);
for (int i = 0; i < outer_num_; ++i) {
// compute dot(top_diff, top_data) and subtract them from the bottom diff
for (int k = 0; k < inner_num_; ++k) {
scale_data[k] = caffe_cpu_strided_dot<Dtype>(channels,
bottom_diff + i * dim + k, inner_num_,
top_data + i * dim + k, inner_num_);
}
// subtraction
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels, inner_num_, 1,
-1., sum_multiplier_.cpu_data(), scale_data, 1., bottom_diff + i * dim);
}
// elementwise multiplication
caffe_mul(top[0]->count(), bottom_diff, top_data, bottom_diff);
}
(the full file can be found here: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_layer.cpp)
I have a few questions:
What is the data in bottom_diff?
If I understand correctly, top is where the output goes so what is the data in top_data when calling caffe_cpu_strided_dot?
What is subtracted in caffe_cpu_gemm?
Thanks!

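After a while, I found the answers to all of the questions above.
top_diff - the gradient of the loss with respect to the layer's output, passed in from the layer above
top_data - the layer's output from the forward pass (the softmax probabilities)
bottom_diff - the gradient with respect to the layer's input; it holds the result when the function returns
sum_multiplier_ is a vector of ones with one entry per channel.
Therefore caffe_cpu_gemm multiplies it (channels x 1) by scale_data (1 x inner_num_), which subtracts the dot product computed for each position from every channel of bottom_diff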
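The three steps in Backward_cpu (dot product, subtraction, elementwise multiplication) can be sketched for a single vector as below. This is a minimal C++ illustration, not Caffe code: the function name `softmax_backward` is hypothetical, and it assumes one softmax over a plain vector (outer_num_ = inner_num_ = 1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Softmax backward for one vector: dx[i] = y[i] * (dy[i] - sum_j dy[j] * y[j]).
// y  : softmax output from the forward pass (top_data in the Caffe code)
// dy : gradient of the loss w.r.t. y        (top_diff)
// Returns the gradient w.r.t. the input     (bottom_diff).
std::vector<double> softmax_backward(const std::vector<double>& y,
                                     const std::vector<double>& dy) {
    assert(y.size() == dy.size());

    // Step 1: dot(top_diff, top_data), the caffe_cpu_strided_dot call.
    double dot = 0.0;
    for (std::size_t j = 0; j < y.size(); ++j) {
        dot += dy[j] * y[j];
    }

    std::vector<double> dx(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        // Step 2: subtract the dot product from every channel (the gemm call).
        // Step 3: multiply elementwise by the softmax output (caffe_mul).
        dx[i] = (dy[i] - dot) * y[i];
    }
    return dx;
}
```

For y = {0.5, 0.5} and dy = {1, 0} the dot product is 0.5, so the result is {0.25, -0.25}. The entries sum to zero, as expected for a softmax gradient.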


Efficiently XOR two images in Flash compile target

I need to XOR two BitmapData objects together.
I'm writing in Haxe, using the flash.* libraries and the AS3 compile target.
I've investigated HxSL and PixelBender, and neither one seems to have a bitwise XOR operator, nor do they have any other bitwise operators that could be used to create XOR (but am I missing something obvious? I'd accept any answer which gives a way to do a bitwise XOR using only the integer/float operators and functions available in HxSL or PixelBender).
None of the predefined filters or shaders in Flash that I can find seem to be able to do a XOR of two images (but again, am I missing something obvious? Can XOR be done with a combination of other filters).
I can find nothing like a XOR drawmode for drawing things onto other things (but that doesn't mean it doesn't exist! That would work too, if it exists!)
The only way I can find at the moment is a pixel-by-pixel loop over the image, but this takes a couple of seconds per image even on a fast machine, as opposed to filters, which I use for my other image processing operations, which are about a hundred times faster.
Is there any faster method?
Edit:
Playing around with this a bit more I found that removing the conditional and extra Vector access in the loop speeds it up by about 100ms on my machine.
Here's the previous XOR loop:
// Original Vector XOR code:
for (var i: int = 0; i < len; i++) {
// XOR.
result[i] = vec1[i] ^ vec2[i];
if (ignoreAlpha) {
// Force alpha of FF so we can see the result.
result[i] |= 0xFF000000;
}
}
Here is the updated XOR loop for the Vector solution:
var alphaMask: uint = 0;
if (ignoreAlpha) {
// Force alpha of FF so we can see the result.
alphaMask = 0xFF000000;
}
// Fewer Vector accessors makes it quicker:
for (var i: int = 0; i < len; i++) {
// XOR.
result[i] = alphaMask | (vec1[i] ^ vec2[i]);
}
Answer:
Here are the solutions that I've tested to XOR two images in Flash.
I found that the PixelBender solution is about 6-10 times slower than doing it in straight ActionScript.
I don't know if it's because I have a slow algorithm or it's just the limits of trying to fake bitwise operations in PixelBender.
Results:
PixelBender: ~6500ms
BitmapData.getVector(): ~480-500ms
BitmapData.getPixel32(): ~1200ms
BitmapData.getPixels(): ~1200ms
The clear winner is to use BitmapData.getVector() and then XOR the two streams of pixel data.
1. PixelBender solution
This is how I implemented the bitwise XOR in PixelBender, based on the formula given on Wikipedia: http://en.wikipedia.org/wiki/Bitwise_operation#Mathematical_equivalents
Here is a Gist of the final PBK: https://gist.github.com/Coridyn/67a0ff75afaa0163f673
On my machine running an XOR on two 3200x1400 images this takes about 6500-6700ms.
I first converted the formula to JavaScript to check that it was correct:
// Do it for each RGBA channel.
// Each channel is assumed to be 8bits.
function XOR(x, y){
var result = 0;
var bitCount = 8; // log2(x) + 1
for (var n = 0; n < bitCount; n++) {
var pow2 = pow(2, n);
var x1 = mod(floor(x / pow2), 2);
var y1 = mod(floor(y / pow2), 2);
var z1 = mod(x1 + y1, 2);
result += pow2 * z1;
}
console.log('XOR(%s, %s) = %s', x, y, result);
console.log('%s ^ %s = %s', x, y, (x ^ y));
return result;
}
// Split out these functions so it's
// easier to convert to PixelBender.
function mod(x, y){
return x % y;
}
function pow(x, y){
return Math.pow(x, y);
}
function floor(x){
return Math.floor(x);
}
Confirm that it's correct:
// Test the manual XOR is correct.
XOR(255, 85); // 170
XOR(170, 85); // 255
XOR(170, 170); // 0
Then I converted the JavaScript to PixelBender by unrolling the loop using a series of macros:
// Bitwise algorithm was adapted from the "mathematical equivalents" formula on Wikipedia:
// http://en.wikipedia.org/wiki/Bitwise_operation#Mathematical_equivalents
// Macro for 2^n (it needs to be done a lot).
#define POW2(n) pow(2.0, n)
// Slight optimisation for the zeroth case - 2^0 = 1 is redundant so remove it.
#define XOR_i_0(x, y) ( mod( mod(floor(x), 2.0) + mod(floor(y), 2.0), 2.0 ) )
// Calculations for a given "iteration".
#define XOR_i(x, y, i) ( POW2(i) * ( mod( mod(floor(x / POW2(i)), 2.0) + mod(floor(y / POW2(i)), 2.0), 2.0 ) ) )
// Flash doesn't support loops.
// Unroll the loop by defining macros that call the next macro in the sequence.
// Adapted from: http://www.simppa.fi/blog/category/pixelbender/
// http://www.simppa.fi/source/LoopMacros2.pbk
#define XOR_0(x, y) XOR_i_0(x, y)
#define XOR_1(x, y) XOR_i(x, y, 1.0) + XOR_0(x, y)
#define XOR_2(x, y) XOR_i(x, y, 2.0) + XOR_1(x, y)
#define XOR_3(x, y) XOR_i(x, y, 3.0) + XOR_2(x, y)
#define XOR_4(x, y) XOR_i(x, y, 4.0) + XOR_3(x, y)
#define XOR_5(x, y) XOR_i(x, y, 5.0) + XOR_4(x, y)
#define XOR_6(x, y) XOR_i(x, y, 6.0) + XOR_5(x, y)
#define XOR_7(x, y) XOR_i(x, y, 7.0) + XOR_6(x, y)
// Entry point for XOR function.
// This will calculate the XOR of the current pixels.
#define XOR(x, y) XOR_7(x, y)
// PixelBender uses floats from 0.0 to 1.0 to represent 0 to 255
// but the bitwise operations above work on ints.
// These macros convert between float and int values.
#define FLOAT_TO_INT(x) float(x) * 255.0
#define INT_TO_FLOAT(x) float(x) / 255.0
XOR for each channel of the current pixel in the evaluatePixel function:
void evaluatePixel()
{
// Acquire the pixel values from both images at the current location.
float4 frontPixel = sampleNearest(inputImage, outCoord());
float4 backPixel = sampleNearest(diffImage, outCoord());
// Set up the output variable - RGBA.
pixel4 result = pixel4(0.0, 0.0, 0.0, 1.0);
// XOR each channel.
result.r = INT_TO_FLOAT ( XOR(FLOAT_TO_INT(frontPixel.r), FLOAT_TO_INT(backPixel.r)) );
result.g = INT_TO_FLOAT ( XOR(FLOAT_TO_INT(frontPixel.g), FLOAT_TO_INT(backPixel.g)) );
result.b = INT_TO_FLOAT ( XOR(FLOAT_TO_INT(frontPixel.b), FLOAT_TO_INT(backPixel.b)) );
// Return the result for this pixel.
dst = result;
}
ActionScript Solutions
2. BitmapData.getVector()
I found the fastest solution is to extract a Vector of pixels from the two images and perform the XOR in ActionScript.
For the same two 3200x1400 images this takes about 480-500ms.
package diff
{
import flash.display.Bitmap;
import flash.display.DisplayObject;
import flash.display.IBitmapDrawable;
import flash.display.BitmapData;
import flash.geom.Rectangle;
import flash.utils.ByteArray;
/**
* @author Coridyn
*/
public class BitDiff
{
/**
* Perform a binary diff between two images.
*
* Return the result as a Vector of uints (as used by BitmapData).
*
* @param image1
* @param image2
* @param ignoreAlpha
* @return
*/
public static function diffImages(image1: DisplayObject,
image2: DisplayObject,
ignoreAlpha: Boolean = true): Vector.<uint> {
// For simplicity get the smallest common width and height of the two images
// to perform the XOR.
var w: Number = Math.min(image1.width, image2.width);
var h: Number = Math.min(image1.height, image2.height);
var rect: Rectangle = new Rectangle(0, 0, w, h);
var vec1: Vector.<uint> = BitDiff.getVector(image1, rect);
var vec2: Vector.<uint> = BitDiff.getVector(image2, rect);
var resultVec: Vector.<uint> = BitDiff.diffVectors(vec1, vec2, ignoreAlpha);
return resultVec;
}
/**
* Extract a portion of an image as a Vector of uints.
*
* @param drawable
* @param rect
* @return
*/
public static function getVector(drawable: DisplayObject, rect: Rectangle): Vector.<uint> {
var data: BitmapData = BitDiff.getBitmapData(drawable);
var vec: Vector.<uint> = data.getVector(rect);
data.dispose();
return vec;
}
/**
* Perform a binary diff between two streams of pixel data.
*
* If `ignoreAlpha` is false then will not normalise the
* alpha to make sure the pixels are opaque.
*
* @param vec1
* @param vec2
* @param ignoreAlpha
* @return
*/
public static function diffVectors(vec1: Vector.<uint>,
vec2: Vector.<uint>,
ignoreAlpha: Boolean): Vector.<uint> {
var larger: Vector.<uint> = vec1;
if (vec1.length < vec2.length) {
larger = vec2;
}
var len: Number = Math.min(vec1.length, vec2.length),
result: Vector.<uint> = new Vector.<uint>(len, true);
var alphaMask: uint = 0;
if (ignoreAlpha) {
// Force alpha of FF so we can see the result.
alphaMask = 0xFF000000;
}
// Assume same length.
for (var i: int = 0; i < len; i++) {
// XOR.
result[i] = alphaMask | (vec1[i] ^ vec2[i]);
}
if (vec1.length != vec2.length) {
// Splice the remaining items.
result = result.concat(larger.slice(len));
}
return result;
}
}
}
3. BitmapData.getPixel32()
Your current approach of looping over the BitmapData with BitmapData.getPixel32() gave a similar speed of about 1200ms:
for (var y: int = 0; y < h; y++) {
for (var x: int = 0; x < w; x++) {
sourcePixel = bd1.getPixel32(x, y);
resultPixel = sourcePixel ^ bd2.getPixel32(x, y);
result.setPixel32(x, y, resultPixel);
}
}
4. BitmapData.getPixels()
My final test was to try iterating over two ByteArrays of pixel data (very similar to the Vector solution above). This implementation also took about 1200ms:
/**
* Extract a portion of an image as a ByteArray.
*
* @param drawable
* @param rect
* @return
*/
public static function getByteArray(drawable: DisplayObject, rect: Rectangle): ByteArray {
var data: BitmapData = BitDiff.getBitmapData(drawable);
var pixels: ByteArray = data.getPixels(rect);
data.dispose();
return pixels;
}
/**
* Perform a binary diff between two streams of pixel data.
*
* If `ignoreAlpha` is false then will not normalise the
* alpha to make sure the pixels are opaque.
*
* @param ba1
* @param ba2
* @param ignoreAlpha
* @return
*/
public static function diffByteArrays(ba1: ByteArray,
ba2: ByteArray,
ignoreAlpha: Boolean): ByteArray {
// Reset position to start of array.
ba1.position = 0;
ba2.position = 0;
var larger: ByteArray = ba1;
if (ba1.bytesAvailable < ba2.bytesAvailable) {
larger = ba2;
}
var len: Number = Math.min(ba1.length / 4, ba2.length / 4),
result: ByteArray = new ByteArray();
// Assume same length.
var resultPixel:uint;
for (var i: uint = 0; i < len; i++) {
// XOR.
resultPixel = ba1.readUnsignedInt() ^ ba2.readUnsignedInt();
if (ignoreAlpha) {
// Force alpha of FF so we can see the result.
resultPixel |= 0xFF000000;
}
result.writeUnsignedInt(resultPixel);
}
// Seek back to the start.
result.position = 0;
return result;
}
There are a few possible options depending on what you want to achieve (e.g. is the XOR per channel or is it just any pixel that is non-black?).
There is the BitmapData.compare() method which can give you a lot of information about the two bitmaps. You could BitmapData.threshold() the input data before comparing.
Another option would be to use the draw method with the BlendMode.DIFFERENCE blend mode to draw your two images into the same BitmapData instance. That will show you the difference between the two images (equivalent to the Difference blending mode in Photoshop).
If you need to check if any pixel is non-black then you can try running a BitmapData.threshold first and then draw the result with the difference blend mode as above for the two images.
Are you doing this for image processing or something else like per-pixel hit detection?
To start with I'd have a look at BitmapData and see what is available to play with.
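BlendMode.DIFFERENCE is not a bitwise XOR: both give zero for identical channel values, but the non-zero results differ. The sketch below, written in C++ for illustration, compares the two per 8-bit channel. The function names are hypothetical, and it assumes the difference blend subtracts the darker channel value from the lighter one.

```cpp
#include <cstdint>

// One 8-bit channel of a "difference" blend: lighter value minus darker value.
std::uint8_t difference_channel(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a > b ? a - b : b - a);
}

// One 8-bit channel of a bitwise XOR.
std::uint8_t xor_channel(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a ^ b);
}
```

For the channel values 0xF0 and 0x0F the difference blend gives 0xE1 while XOR gives 0xFF. The blend mode is therefore enough to detect where two images differ, but it does not reproduce an exact XOR image.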

Sub-Matrix computations

I want to calculate the pairwise distance between two sub-matrices of a matrix. For example, I have a matrix A (MxN) and two blocks of that matrix, B1 (mxn) and B2 (kxt). More specifically, I want to calculate the distance of the B1(1,1) element from all the elements of B2, and to repeat this for all the elements of B1. To be clear, B1 and B2 may not be contiguous parts of the matrix; all I know are the coordinates of the elements of B1 and B2 in the matrix A. Here is an example.
for(int i = 0; i < nRowsCoordsB1 ; i++ ){//nRows of B1
for(int j = 0; j < nRowsCoordsB2 ; j++ ){//nRows of B2
//CoordsofB1 is a nRowsB1x2 matrix that contains the element coordinates of the B1 sub matrix
a_x = CoordsofB1[ i ]; //take the x coord of the corresponding row i
a_y = CoordsofB1[ i + nRowsCoordsB1 ]; //take the y coord of the corresponding row
b_x = CoordsofB2[ j ];
b_y = CoordsofB2[ j + nRowsCoordsB2 ];
int element1 = A[ a_x + a_y*nRowsofA ];
int element2 = A[ b_x + b_y*nRowsofA ] ;
sum +=abs( element1 - element2 ) ;
}
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);
Now I want to speed up the computations with CUDA :) Because I am new to CUDA, I find it a little complicated. I think I have understood the logic of allocating thread blocks at the matrix level, but here I have two different parts of the matrix with different sizes, CoordsofB1 and CoordsofB2, and it confuses me how I can access them, take the coordinates, and use them in the whole matrix. I thought that we should work in A using constraints, but I did not come up with a clear idea.
Also, the fact that at the end of the for loops the sum is divided by a quantity confuses me about how this would be combined in the CUDA version of the code.
Any suggestions-snippets-examples-references would be great.
PS: the reason I use column-major ordering is because the code is evaluated in matlab.
UPDATE: Can we allocate a thread block of size equal to the size of the biggest sub-matrix, B1 or B2, and work with them using the correct conditions? I commented out the last line because I was not sure what to do with it. Any comments?
int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){
a_x = CoordsofB1[ r ];
a_y = CoordsofB1[ r + nRowsCoordsB1 ];
if( r < nRowsCoordsB2 ){
b_x = CoordsofB2[ r ];
b_y = CoordsofB2[ r + nRowsCoordsB2 ];
int element1 = A[ a_x + a_y*nRowsofA ];
int element2 = A[ b_x + b_y*nRowsofA ] ;
sum +=abs( element1 - element2 ) ;
}
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);
Here a sketch
I have the coordinates of each element inside the B1 and B2 and I want to calculate the differences between the values in
[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,:)) ] +
[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,:)) ] +
[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,:)) ].
If I understand it correctly, what you are trying to do can be written in the following matlab code.
rep_B1 = repmat(B1(:), 1, length(B2(:)) );
rep_B2 = repmat(B2(:)', length(B1(:)), 1 );
absdiff_B1B2 = abs(rep_B1 - rep_B2);
Result = mean( absdiff_B1B2(:) );
You will notice that before the reduction, there is a matrix absdiff_B1B2 of size length(B1(:)) x length(B2(:)), i.e. m*n x k*t (this matrix is never stored to global memory if you implement the above code in one CUDA kernel). You could partition this matrix into 16x16 sub-matrices and use one 256-thread block per sub-matrix to distribute the workload over the GPU.
On the other hand, you could use thrust to make your life easier.
Update
Since B1 and B2 are sub-matrices of A, you could first use cudaMemcpy2D() to copy them to linear space, then use a kernel to construct and then reduce the matrix absdiff_B1B2.
For the final normalization operation (last line of your code), you could do it on CPU.
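The point that the division can be left to the CPU can be sketched as follows. The sum is split into independent partial sums, which are added up and divided once at the end. In this C++ illustration each chunk of B1 stands in for what one CUDA thread block would produce; the function name and the `chunk` parameter are hypothetical, and `b1` and `b2` are assumed to already hold the values gathered from A.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Mean absolute pairwise difference between the values of B1 and B2.
// B1 is processed in chunks; every chunk yields one partial sum, and the
// single division happens only after all partial sums have been combined.
double mean_abs_diff_chunked(const std::vector<int>& b1,
                             const std::vector<int>& b2,
                             std::size_t chunk) {
    assert(chunk > 0 && !b1.empty() && !b2.empty());

    std::vector<long long> partial;  // one entry per chunk ("per block")
    for (std::size_t start = 0; start < b1.size(); start += chunk) {
        const std::size_t end = std::min(start + chunk, b1.size());
        long long s = 0;
        for (std::size_t i = start; i < end; ++i) {
            for (std::size_t j = 0; j < b2.size(); ++j) {
                s += std::llabs(static_cast<long long>(b1[i]) - b2[j]);
            }
        }
        partial.push_back(s);
    }

    // Host-side combination: add the partial sums, then normalise once.
    long long total = 0;
    for (long long s : partial) total += s;
    return static_cast<double>(total) /
           (static_cast<double>(b1.size()) * static_cast<double>(b2.size()));
}
```

Because the partial sums are integers, the result does not depend on the chunk size: for b1 = {1, 2, 3} and b2 = {2, 4} the total is 8 and the mean is 8/6 for any chunking.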
Here's the code using thrust to show how to construct and reduce the matrix absdiff_B1B2 in a single kernel. However, you will find that the construction procedure uses no shared memory and is not optimized. Further optimization using shared memory will improve the performance.
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
template<typename T>
struct abs_diff
{
inline __host__ __device__ T operator()(const T& x, const T& y)
{
return abs(x - y);
}
};
int main()
{
using namespace thrust::placeholders;
const int m = 98;
const int n = 87;
int k = 76;
int t = 65;
double result;
thrust::device_vector<double> B1(m * n, 1.0);
thrust::device_vector<double> B2(k * t, 2.0);
result = thrust::inner_product(
thrust::make_permutation_iterator(
B1.begin(),
thrust::make_transform_iterator(
thrust::make_counting_iterator(0),
_1 % (m * n))),
thrust::make_permutation_iterator(
B1.begin(),
thrust::make_transform_iterator(
thrust::make_counting_iterator(0),
_1 % (m * n))) + (m * n * k * t),
thrust::make_permutation_iterator(
B2.begin(),
thrust::make_transform_iterator(
thrust::make_counting_iterator(0),
_1 / (m * n))),
0.0,
thrust::plus<double>(),
abs_diff<double>());
result /= m * n * k * t;
std::cout << result << std::endl;
return 0;
}
Perhaps the solution below, which uses a 2D thread grid, could be an alternative to Eric's use of thrust and give some more insight into the problem.
The code snippet below only illustrates the concept. It is untested code.
2D grid
Define a partial_distances matrix of size nRowsCoordsB1 X nRowsCoordsB2 that will contain all the involved absolute value differences between the elements of B1 and B2. In the main file you will have
__global__ void distance_calculator(int* partial_distances, const int* A, int nRowsofA,
const int* CoordsofB1, const int* CoordsofB2, int nRowsCoordsB1, int nRowsCoordsB2) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
// The grid is rounded up to whole blocks, so skip threads that fall outside the matrix.
if (i >= nRowsCoordsB1 || j >= nRowsCoordsB2) return;
int a_x = CoordsofB1[i];
int a_y = CoordsofB1[i+nRowsCoordsB1];
int b_x = CoordsofB2[j];
int b_y = CoordsofB2[j+nRowsCoordsB2];
partial_distances[j*nRowsCoordsB1+i] = abs(A[a_x+a_y*nRowsofA]-A[b_x+b_y*nRowsofA]);
}
int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }
#define BLOCKSIZE 32
int main() {
// A, CoordsofB1 and CoordsofB2 are assumed to be device pointers that are already allocated and filled.
int* partial_distances; cudaMalloc((void**)&partial_distances,nRowsCoordsB1*nRowsCoordsB2*sizeof(int));
dim3 BlockSize(BLOCKSIZE,BLOCKSIZE);
dim3 GridSize;
GridSize.x = iDivUp(nRowsCoordsB1,BLOCKSIZE);
GridSize.y = iDivUp(nRowsCoordsB2,BLOCKSIZE);
distance_calculator<<<GridSize,BlockSize>>>(partial_distances,A,nRowsofA,CoordsofB1,CoordsofB2,nRowsCoordsB1,nRowsCoordsB2);
REDUCTION_STEP
}
The REDUCTION_STEP could be implemented as the iterative call to a 1D reduction kernel to sum up all the elements corresponding to a particular element of B1.
An alternative would be to use dynamic parallelism to call the reduction routine directly within the kernel, but this is an option not suitable to the card you are using.

CUDA "convolution" as slow as OpenMP version

I'm trying to "convolve" a featWidth * featHeight * 31 cube with another modelWidth * modelHeight * 31 cube. The problem is that this kernel is quite slow (well, I manage to be quicker than sequential CPU code, but only as fast as an OpenMP version). I'm using a Quadro FX 1800 (yeah, 64 CUDA cores...).
__constant__ float d_model[31*22*22];
#define IMUL(a,b) ( __mul24((a), (b)) )
#define IMAD(a,b,c) ( __mul24((a), (b)) + (c) )
__global__ void dMatch(float *score, const int featWidth, const int featHeight, const int modelWidth, const int modelHeight, const int scoreWidth, const int scoreHeight)
{
const int x = IMAD(blockIdx.x, blockDim.x, threadIdx.x);
const int y = IMAD(blockIdx.y, blockDim.y, threadIdx.y);
if(x < scoreWidth && y < scoreHeight)
{
const int scoreIdx = IMAD(x, scoreHeight, y);
score[scoreIdx] = 0.f;
const int baseFeatIdx = IMUL(x,scoreHeight) + IMAD(modelHeight-1, x, y);
for(int z = 0; z < 31; ++z)
{
// Index positioning
int featIdx = IMAD(z, IMUL(featWidth,featHeight), baseFeatIdx);
int modelIdx = IMUL(z, IMUL(modelWidth,modelHeight));
float value = 0.f;
// filter
for(int xx=0; xx<modelWidth; xx++)
{
const int xxmodelIdx = IMAD(xx, modelHeight, modelIdx);
const int xxfeatIdx = IMAD(xx, featHeight, featIdx);
for(int yy=0; yy<modelHeight; yy++)
{
value += d_model[xxmodelIdx+yy] * tex1Dfetch(texFeatures,xxfeatIdx+yy);
}
}
score[scoreIdx] += value;
}
}
}
Anyway, I launch this kernel with 8*8 threads in block and with a grid size of (scoreWidth/8)*(scoreHeight/8) (scoreWidth and scoreHeight are the resulting matrix sizes) .
I'd like to know if you have any clue of what's wrong or what is rather slow in my code.
Edit:
A much faster version (150 ms drop for a 480 ms process!) thanks to tera:
__global__ void dMatch(float *score, const int featWidth, const int featHeight, const int modelWidth, const int modelHeight, const int scoreWidth, const int scoreHeight)
{
const int y = IMUL(4,IMAD(blockIdx.x, blockDim.x, threadIdx.x));
const int x = IMAD(blockIdx.y, blockDim.y, threadIdx.y);
if(x < scoreWidth && y < scoreHeight)
{
const int scoreIdx = IMAD(x, scoreHeight, y);
const int baseFeatIdx = IMUL(x,scoreHeight) + IMAD(modelHeight-1, x, y);
float value=0.f, value1 = 0.f, value2 = 0.f, value3 = 0.f;
float feat,feat1,feat2,feat3;
// Index positioning
int featIdx = 0;
int modelIdx = 0;
int xxmodelIdx;
int xxfeatIdx;
float val;
for(int z = 0; z < 31; ++z)
{
featIdx = IMAD(z,IMUL(featWidth,featHeight),baseFeatIdx);
modelIdx = IMUL(z,IMUL(modelWidth,modelHeight));
// filter
for(int xx=0; xx<modelWidth; xx++)
{
xxmodelIdx = IMAD(xx, modelHeight, modelIdx);
xxfeatIdx = IMAD(xx, featHeight, featIdx);
feat=tex1Dfetch(texFeatures,xxfeatIdx+0);
feat1=tex1Dfetch(texFeatures,xxfeatIdx+1);
feat2=tex1Dfetch(texFeatures,xxfeatIdx+2);
feat3=tex1Dfetch(texFeatures,xxfeatIdx+3);
for(int yy=0; yy<modelHeight; yy++)
{
val = d_model[xxmodelIdx+yy];
value += val * feat;
value1 += val * feat1;
value2 += val * feat2;
value3 += val * feat3;
feat = feat1;
feat1 = feat2;
feat2 = feat3;
feat3 = tex1Dfetch(texFeatures,xxfeatIdx+yy+4);
}
}
}
score[scoreIdx] = value;
if(y+1 < scoreHeight)
score[scoreIdx+1] = value1;
if(y+2 < scoreHeight)
score[scoreIdx+2] = value2;
if(y+3 < scoreHeight)
score[scoreIdx+3] = value3;
}
}
Launched with this dim3 threads(16,16); dim3 grid(divup(scoreHeight,64), divup(scoreWidth,16));.
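The edited kernel computes four adjacent outputs per thread and keeps the four feature values it needs in local variables, so each inner iteration loads only one new feature value. That is also why the grid uses divup(scoreHeight,64): 16 threads times 4 outputs each. Below is a 1D C++ sketch of that sliding-window idea. The function name `correlate4` is hypothetical, and the zero returned for out-of-range reads is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

// 1D analogue of the edited kernel: one call computes four adjacent outputs
//   out[j] = sum over yy of model[yy] * feat[y + yy + j],  j = 0..3,
// reusing previously loaded feature values instead of re-reading them.
void correlate4(const std::vector<float>& feat, const std::vector<float>& model,
                std::size_t y, float out[4]) {
    // Assumption of this sketch: reads past the end of feat yield 0.
    auto at = [&feat](std::size_t i) { return i < feat.size() ? feat[i] : 0.0f; };

    // Preload the window for yy = 0.
    float f0 = at(y), f1 = at(y + 1), f2 = at(y + 2), f3 = at(y + 3);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;

    for (std::size_t yy = 0; yy < model.size(); ++yy) {
        const float v = model[yy];  // one model coefficient, used four times
        a0 += v * f0;
        a1 += v * f1;
        a2 += v * f2;
        a3 += v * f3;
        // Slide the window by one and load a single new feature value.
        f0 = f1;
        f1 = f2;
        f2 = f3;
        f3 = at(y + yy + 4);
    }
    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
}
```

With feat = {1, 2, 3, 4, 5, 6}, model = {1, 10} and y = 0, each output is feat[j] + 10 * feat[j + 1], giving {21, 32, 43, 54}. The naive version would read eight feature values for this; the sliding window reads six.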
What does the profiler say? NVIDIA Nsight (a plugin for Visual Studio on Windows and for Eclipse on Linux) allows you to see where the stalls are and provides various hints for optimizing performance.
My guess (without looking at the profiler) is that your blocks are too small. A warp, the basic scheduling unit, has 32 threads. An NVIDIA GPU is fast because it hides latency by running other threads while the current one waits for its previous instruction. Although an SM can hold 8 blocks (on Tesla and Fermi) or 16 (on Kepler), with your block size that still gives only 16-32 warps at peak, which can be quite few (I may be wrong, but launching a block has a certain latency). I would consider using much larger blocks.
The texture fetch is sub-optimal if I understand the code correctly: the threads in a warp differ by modelHeight - 1 in baseFeatIdx, and therefore in featIdx and xxfeatIdx. The texture fetch is therefore effectively random and does not exploit data locality. Reversing x and y would make it more efficient.
However, the good rule is to check with the profiler: if your problem is compute-bound on the GPU, then you should concentrate on the computation side. If your problem is memory-bound, then you should look at the memory access pattern. There might be several other parts that look like candidates for optimization, but you won't know until you see what the bottleneck is. Once you know it, you might want to read the relevant chapter of the best practices guide.

2D kernel calling and launch parameters for non-square matrix

I am attempting to port the following (simplified) nested loop as a CUDA 2D kernel. The sizes of NgS and NgO will increase with larger data sets; for now I just want to get this kernel to output the correct results for all values:
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int NgS = 1859;
int NgO = 900;
// 1D flattened matrices have been initialized as:
Radio_cpu = new double [NgS*NgO];
Result_cpu = new double [NgS*NgO];
// ignoring the part where they are filled w/ data
for (m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
Result_cpu[idx(n,m,NgO)] = k0*Radio_cpu[idx(n,m,NgO)];
}
}
The examples I have come across usually deal with square loops, and I have been unable to get the correct output for all the GPU array indices compared to the CPU version. Here is the host code calling the kernel:
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
// Result_gpu and Radio_gpu are allocated versions of the CPU variables on GPU
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, Radio_gpu, Result_gpu);
Here is the kernel:
__global__ void trans(int NgO, int NgS,
double k0, double * Radio, double * Result) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
int m = blockIdx.y * blockDim.y + threadIdx.y;
if(n > NgS || m > NgO) return;
// map the two 2D indices to a single linear, 1D index
int grid_width = gridDim.x * blockDim.x;
int idxxx = m + (n * grid_width);
Result[idxxx] = k0 * Radio[idxxx];
}
With the current code, I proceeded to compare the Result_cpu variable with Result_gpu variable once copied back. When I cycle through the values I get:
// matches from NgS = 0...913
Result_gpu[NgS = 913][NgO = 0]: -56887.2
Result_cpu[Ngs = 913][NgO = 0]: -56887.2
// mismatches from NgS = 914...1858
Result_gpu[NgS = 914][NgO = 0]: -12.2352
Result_cpu[NgS = 914][NgO = 0]: 79448.6
This pattern is the same regardless of the value of NgO. I have been trying to figure out where I have made a mistake by looking at various examples for a few hours and trying out changes, but so far this scheme has worked apart from the obvious issue at hand, whereas the others have caused kernel invocation errors or left the GPU array uninitialized for all values. Since I clearly cannot see the mistake, I'd really appreciate it if someone could point me in the right direction towards a fix. I'm pretty sure it's right under my nose and I can't see it.
In case it matters, I'm testing this code on a Kepler card, compiling using MSVC 2010, CUDA 4.2 and 304.79 driver and have compiled the code with both arch=compute_20,code=sm_20 and arch=compute_30,code=compute_30 flags with no difference.
@vaca_loca: I tested the following kernel (it works for me also with non-square block dimensions):
__global__ void trans(int NgO, int NgS,
double k0, double * Radio, double * Result) {
int n = blockIdx.x * blockDim.x + threadIdx.x;
int m = blockIdx.y * blockDim.y + threadIdx.y;
if(n >= NgO || m >= NgS) return;
int ofs = m * NgO + n;
Result[ofs] = k0 * Radio[ofs];
}
void test() {
int NgS = 1859, NgO = 900;
int data_sz = NgS * NgO, bytes = data_sz * sizeof(double);
cudaSetDevice(0);
double *Radio_cpu = new double [data_sz*3],
*Result_cpu = Radio_cpu + data_sz,
*Result_gpu = Result_cpu + data_sz;
double k0 = -1.7961233;
srand48(time(NULL));
int i, j, n, m;
for(m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
Radio_cpu[m + n*NgO] = lrand48() % 234234;
Result_cpu[m + n*NgO] = k0*Radio_cpu[m + n*NgO];
}
}
double *g_Radio, *g_Result;
cudaMalloc((void **)&g_Radio, bytes * 2);
g_Result = g_Radio + data_sz;
cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice);
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result);
cudaMemcpy(Result_gpu, g_Result, bytes, cudaMemcpyDeviceToHost);
for(m=0; m<NgO; m++) {
for (n=0; n<NgS; n++) {
double c1 = Result_cpu[m + n*NgO],
c2 = Result_gpu[m + n*NgO];
if(std::abs(c1-c2) > 1e-4)
printf("(%d;%d): %.7f %.7f\n", n, m, c1, c2);
}
}
cudaFree(g_Radio);
delete []Radio_cpu;
}
Though, in my opinion, accessing data from global memory using quads might not be very cache-friendly, since the access stride is pretty large. You might consider using 2D textures instead if it's critical for your algorithm to access data with 2D locality.
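To see why the row stride and the bounds check matter, the index arithmetic of both kernels can be replayed on the CPU. The C++ sketch below counts how many of the NgS*NgO elements each mapping writes. The function name `covered_elements` is hypothetical, and it assumes exactly the launch parameters shown in the question (16x16 blocks, dimGrid.x derived from NgO, dimGrid.y from NgS). Since the question's code is simplified, this only illustrates the mapping and does not claim to reproduce the reported output.

```cpp
#include <cstddef>
#include <vector>

// Replays a kernel's index arithmetic on the CPU and returns how many of the
// NgS * NgO elements get written. fixed = false uses the question's mapping
// (stride gridDim.x * blockDim.x, guard n > NgS || m > NgO); fixed = true
// uses row stride NgO with a strict bounds check.
std::size_t covered_elements(int NgO, int NgS, int block, bool fixed) {
    const int gridX = (NgO + block - 1) / block;
    const int gridY = (NgS + block - 1) / block;
    const int width = gridX * block;   // gridDim.x * blockDim.x
    const int height = gridY * block;  // gridDim.y * blockDim.y
    const std::size_t total = static_cast<std::size_t>(NgO) * NgS;

    std::vector<char> written(total, 0);
    for (int m = 0; m < height; ++m) {      // thread index along y
        for (int n = 0; n < width; ++n) {   // thread index along x
            std::size_t idx = 0;
            if (fixed) {
                if (n >= NgO || m >= NgS) continue;
                idx = static_cast<std::size_t>(m) * NgO + n;
            } else {
                if (n > NgS || m > NgO) continue;
                idx = static_cast<std::size_t>(m) +
                      static_cast<std::size_t>(n) * width;
            }
            if (idx < total) written[idx] = 1;
        }
    }

    std::size_t count = 0;
    for (char w : written) count += (w != 0);
    return count;
}
```

With NgO = 900 and NgS = 1859, the padded grid width is 57 * 16 = 912 rather than 900, and the x index never exceeds 911. The question's mapping therefore writes only about half of the array, whereas the mapping with stride NgO writes every element exactly once.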

ideal lowpass filter with fftw

I am still trying to get my lowpass filter running, and I am at a point where I do not know why it still does not work. I based my code on FFT Filters and my previous question FFT Question in order to apply an ideal lowpass filter to the image. The code below just makes the image darker and places some white pixels in the resulting image.
// forward fft the result is in freqBuffer
fftw_execute(forward);
for (int y = 0; y < h; y++)
{
for (int x = 0; x < w; x++)
{
uint gid = y * w + x;
// shifting coordinates normalized to [-0.5 ... 0.5]
double xN = (x - (w / 2)) / (double)w;
double yN = (y - (h / 2)) / (double)h;
// max radius
double maxR = sqrt(0.5f * 0.5f + 0.5f * 0.5f);
// current radius normalized to [0 .. 1]
double r = sqrt(xN * xN + yN * yN) / maxR ;
// filter response
double filter = r > 0.7f ? 0.0f : 1.0f;
// applying filter response
freqBuffer[gid][0] *= filter;
freqBuffer[gid][1] *= filter;
}
}
// normalization (see fftw scaling)
for (uint i = 0; i < size; i++)
{
freqBuffer[i][0] /= (float)size;
freqBuffer[i][1] /= (float)size;
}
// backward fft
fftw_execute(backward);
Some help would be appreciated.
Wolf
If you have a filter with a step response in the frequency domain then you will see significant sin(x)/x ringing in the spatial domain. This is known as the Gibbs Phenomenon. You need to apply a window function to the desired frequency response to mitigate this.
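One way to soften the step is to replace the hard cutoff with a smooth roll-off. The C++ sketch below uses a raised-cosine transition between a pass radius and a stop radius; this is only one possible choice of taper, and the function names and the `r_pass`/`r_stop` parameters are hypothetical. The helper `bin_frequency` also assumes a full complex-to-complex transform in FFTW's unshifted layout, where bin 0 is DC and the upper half of the bins holds the negative frequencies, so the radius should be computed from these wrapped frequencies.

```cpp
#include <cmath>

// Normalised frequency in [-0.5, 0.5) for bin k of an n-point DFT,
// assuming the unshifted layout: bin 0 is DC, upper half is negative.
double bin_frequency(int k, int n) {
    const int shifted = (k < (n + 1) / 2) ? k : k - n;
    return static_cast<double>(shifted) / n;
}

// Low-pass response with a raised-cosine roll-off instead of a hard step:
// 1 up to r_pass, 0 from r_stop on, and a smooth transition in between.
double smooth_lowpass(double r, double r_pass, double r_stop) {
    if (r <= r_pass) return 1.0;
    if (r >= r_stop) return 0.0;
    const double pi = std::acos(-1.0);
    const double t = (r - r_pass) / (r_stop - r_pass);  // 0..1 across the band
    return 0.5 * (1.0 + std::cos(pi * t));
}
```

In the filter loop, `filter` would then be `smooth_lowpass(r, r_pass, r_stop)` with `r` computed from `bin_frequency(x, w)` and `bin_frequency(y, h)`. A wider transition band gives less ringing at the cost of a less sharp cutoff.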