cuFFT R2C batch output size doesn't match input size

I'm experimenting with batched transforms in cuFFT, but I don't think I'm getting the right output.
int NX = 16; // size of the array
int BATCH = 16; // # of batch
I'm allocating two arrays on the GPU:
float *src;
cufftComplex *dst;
cudaMalloc((void**)&src, sizeof(float)*NX*BATCH);
cudaMalloc((void**)&dst, sizeof(cufftComplex)*NX*BATCH);
I'm initializing the source array with a simple kernel like this:
__global__ void initFloatArray(float *data, const int size) {
    const int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < size) {
        data[i] = i % NX;
    }
}
So basically, each array has values that go from 0 to 15, and I get this 16 times.
I create my plan like this:
cufftPlanMany(&plan, 1, &NX, nullptr, 1, NX, nullptr, 1, NX, CUFFT_R2C, BATCH);
and then I'm executing my plan:
cufftExecR2C(plan, src, dst);
Finally, I transfer the content of dst back to the host. But when I print out the values, I'm getting this:
BATCH 0:
<120, 0>.length = 120
<-8, 40.2187>.length = 41.0066
<-8, 19.3137>.length = 20.905
<-8, 11.9728>.length = 14.3996
<-8, 8>.length = 11.3137
<-8, 5.34543>.length = 9.62152
<-8, 3.31371>.length = 8.65914
<-8, 1.5913>.length = 8.15673
<-8, 0>.length = 8
<120, 0>.length = 120
<-8, 40.2187>.length = 41.0066
<-8, 19.3137>.length = 20.905
<-8, 11.9728>.length = 14.3996
<-8, 8>.length = 11.3137
<-8, 5.34543>.length = 9.62152
<-8, 3.31371>.length = 8.65914
BATCH 1:
<-8, 1.5913>.length = 8.15673
<-8, 0>.length = 8
<120, 0>.length = 120
<-8, 40.2187>.length = 41.0066
<-8, 19.3137>.length = 20.905
<-8, 11.9728>.length = 14.3996
...
I was expecting repetitive output, but it repeats every 9 numbers instead of every 16 like it should.
Am I doing something wrong, or is there something I'm not understanding?

The DFT of a real-valued signal exhibits Hermitian symmetry (see the real-input DFT on Wikipedia). As a result, the full N complex output values of an N-point DFT can be constructed from only the first N/2+1 output values (i.e. the other outputs are redundant).
Correspondingly, and as with many FFT implementations for real-valued inputs, cuFFT does not return the redundant upper portion of the spectrum (as indicated in section 2.4 of the cuFFT library user's guide). In your case, with a 16-point FFT, you thus get 16/2 + 1 = 9 non-redundant outputs. Those 9 values per FFT are then packed back-to-back in the final dst buffer, so a new FFT result starts every 9 complex numbers.
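As a rough sketch of how the output can be sized and read back (not the original poster's code; it reuses NX, BATCH, src and the plan from the question, and relies on the fact that with inembed/onembed set to nullptr cuFFT falls back to the basic packed layout of NX/2 + 1 values per transform):
int outN = NX / 2 + 1;  // 9 non-redundant complex outputs per 16-point R2C FFT
cufftComplex *dst;
cudaMalloc((void**)&dst, sizeof(cufftComplex) * outN * BATCH);
cufftExecR2C(plan, src, dst);
// On the host, batch b occupies h_dst[b*outN] .. h_dst[b*outN + outN - 1]:
cufftComplex *h_dst = (cufftComplex*)malloc(sizeof(cufftComplex) * outN * BATCH);
cudaMemcpy(h_dst, dst, sizeof(cufftComplex) * outN * BATCH, cudaMemcpyDeviceToHost);
for (int b = 0; b < BATCH; b++) {
    for (int k = 0; k < outN; k++) {
        cufftComplex v = h_dst[b * outN + k];
        printf("<%g, %g>\n", v.x, v.y);
    }
}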


Static const array in Cuda kernel

I need to have the following in the Cuda kernel:
static const float PREDEFINED_CONSTS[16] = {...}; // 16 constants.
float c = PREDEFINED_CONSTS[threadIdx.x % 16];
/// Use c in computations.
What's the best way to provide PREDEFINED_CONSTS ?
Constant memory doesn't seem good, because different threads will access different locations.
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
What about this:
float c;
if ( threadIdx.x % 16 == 0 ) c = VAL0;
else if ( threadIdx.x % 16 == 1 ) c = VAL1;
...
else if ( threadIdx.x % 16 == 15 ) c = VAL15;
Although the last example has thread divergence, the literal VAL* values are part of the instruction opcode, so there will be no reading from memory.
What's the best way to provide PREDEFINED_CONSTS ?
If it were me, I would simply put what you have in your first example in your CUDA kernel and go with that. That is very likely the best way to do it. Later on, if you feel like you have a performance problem with your code, you can use a profiler to steer you in the direction of what needs to be addressed. I doubt it would be this. For constants, there really are only 2 possibilities:
Load them from some kind of memory
Load them as part of the instruction stream.
You've already indicated you are aware of this; you can simply benchmark both if you're really worried. Benchmarking would require more than what you have shown here, might be inconclusive, and may also depend on other factors such as how many times and in what way you are loading these constants.
As you have indicated already, __constant__ doesn't seem to be a sensible choice because the load pattern is clearly non-uniform across the warp.
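For reference, the __constant__ variant being ruled out would look something like the following minimal sketch (my illustration, not from the question). The per-thread index means threads in a warp read different constant-bank addresses, so the constant cache's broadcast mechanism cannot help and the loads get serialized:
__constant__ float PREDEFINED_CONSTS_C[16];

__global__ void use_consts(float *out) {
    // Each thread reads a different constant address -> serialized, not broadcast.
    float c = PREDEFINED_CONSTS_C[threadIdx.x % 16];
    out[threadIdx.x] = c;
}

// Host side, before launching:
// cudaMemcpyToSymbol(PREDEFINED_CONSTS_C, host_vals, 16 * sizeof(float));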
If I define them as above, will PREDEFINED_CONSTS be stored in global memory?
Yes, your first method will be stored in global memory. This can be confirmed with careful study and compilation using -Xptxas -v. Your second method has the potential (at least) to load the constants via the instruction stream. Since the second method is quite ugly from a coding perspective, and also very inflexible compared to the first method (what if I needed different constants per thread in different places in my code?), it's not what I would choose.
This strikes me as premature optimization. The first method is clearly preferred from a code flexibility and conciseness standpoint, and there is no real reason to think that loading from memory is a problem simply because it is a load. The second method is ugly, inflexible, and may not be any better from a performance perspective. Even if the data is part of the instruction stream, it still has to be loaded from memory.
Here's an example test case suggesting to me that the first case is preferred. If you come up with a different kind of test case, you may end up with a different observation:
$ cat t97.cu
#include <cstdio>
const float VAL0 = 1.1;
const float VAL1 = 2.2;
const float VAL2 = 3;
const float VAL3 = 4;
const float VAL4 = 5;
const float VAL5 = 6;
const float VAL6 = 7;
const float VAL7 = 8;
const float VAL8 = 9;
const float VAL9 = 10;
const float VAL10 = 11;
const float VAL11 = 12;
const float VAL12 = 13;
const float VAL13 = 14;
const float VAL14 = 15;
const float VAL15 = 16;
__global__ void k1(int l){
  static const float PREDEFINED_CONSTS[16] = {VAL0, VAL1, VAL2, VAL3, VAL4, VAL5, VAL6, VAL7, VAL8, VAL9, VAL10, VAL11, VAL12, VAL13, VAL14, VAL15};
  float sum = 0.0;
  for (int i = 0; i < l; i++)
    sum += PREDEFINED_CONSTS[(threadIdx.x+i) & 15];
  if (sum == 0.0) printf("%f\n", sum);
}
__device__ float get_const(int i){
  float c = VAL15;
  unsigned t = (threadIdx.x+i) & 15;
  if (t == 0) c = VAL0;
  else if (t == 1) c = VAL1;
  else if (t == 2) c = VAL2;
  else if (t == 3) c = VAL3;
  else if (t == 4) c = VAL4;
  else if (t == 5) c = VAL5;
  else if (t == 6) c = VAL6;
  else if (t == 7) c = VAL7;
  else if (t == 8) c = VAL8;
  else if (t == 9) c = VAL9;
  else if (t == 10) c = VAL10;
  else if (t == 11) c = VAL11;
  else if (t == 12) c = VAL12;
  else if (t == 13) c = VAL13;
  else if (t == 14) c = VAL14;
  return c;
}
__global__ void k2(int l){
  float sum = 0.0;
  for (int i = 0; i < l; i++)
    sum += get_const(i);
  if (sum == 0.0) printf("%f\n", sum);
}
int main(){
  int l = 1048576;
  k1<<<1,16>>>(l);
  k2<<<1,16>>>(l);
  cudaDeviceSynchronize();
}
$ nvcc -o t97 t97.cu -Xptxas -v
ptxas info : 68 bytes gmem
ptxas info : Compiling entry function '_Z2k2i' for 'sm_52'
ptxas info : Function properties for _Z2k2i
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 9 registers, 324 bytes cmem[0], 8 bytes cmem[2]
ptxas info : Compiling entry function '_Z2k1i' for 'sm_52'
ptxas info : Function properties for _Z2k1i
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 324 bytes cmem[0]
$ nvprof ./t97
==22848== NVPROF is profiling process 22848, command: ./t97
==22848== Profiling application: ./t97
==22848== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 91.76% 239.39ms 1 239.39ms 239.39ms 239.39ms k2(int)
8.24% 21.508ms 1 21.508ms 21.508ms 21.508ms k1(int)
API calls: 62.34% 260.89ms 1 260.89ms 260.89ms 260.89ms cudaDeviceSynchronize
37.48% 156.85ms 2 78.427ms 10.319us 156.84ms cudaLaunchKernel
0.13% 542.39us 202 2.6850us 192ns 117.71us cuDeviceGetAttribute
0.04% 156.19us 2 78.094us 58.411us 97.777us cuDeviceTotalMem
0.01% 59.150us 2 29.575us 26.891us 32.259us cuDeviceGetName
0.00% 10.845us 2 5.4220us 1.7280us 9.1170us cuDeviceGetPCIBusId
0.00% 1.6860us 4 421ns 216ns 957ns cuDeviceGet
0.00% 1.5850us 3 528ns 283ns 904ns cuDeviceGetCount
0.00% 667ns 2 333ns 296ns 371ns cuDeviceGetUuid
$

Created Shared Memory Code with Python Cuda

I'm struggling to get some code running to explore the shared memory features for a fast matrix multiply, but every time I try this I run into errors that I cannot fathom.
import numpy as np
from numba import cuda, types
m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)
block_size = (m,n)
grid_size = (int(m/n),int(m/n))
@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]
    c[row][column] = sum
and testing
mm[grid_size, block_size](d_a, d_b, d_c)
solution = a@b
output = d_c.copy_to_host()
keeps resulting in the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
After chatting with the author of one answer, I've updated the function, but I still cannot make it work. For the computation of the sum for each element in the output c, we need to loop over the columns of A and the rows of B, using i as the index. We have therefore n*n products. I think i is correct in the sum, but I cannot seem to get the correct row and column indices for a and b in the expression for the sum.
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]
    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]
    c[row][column] = sum
Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code I see this:
/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
327 _logger.critical(msg, _getpid(), self.pid)
328 raise CudaDriverError("CUDA initialized before forking")
--> 329 raise CudaAPIError(retcode, msg)
330
331 def get_device(self, devnum=0):
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
When I fix that I see this:
$ cuda-memcheck python somethingsometing.py
========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
========= at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
========= by thread (15,11,0) in block (3,2,0)
========= Address 0x00000ec0 is out of bounds
The why is pretty obvious:
for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]
row and column are indices in the execution grid, not the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache (note also that you seem to have lapsed into C-style 2D array indexing syntax about halfway through the code, which is a potential bug if you don't understand the difference between the two in Python).
To fix it you will have to change the indexing and then implement the rest of the code for multiplication (i.e. you must iteratively load the whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair which a block will process).
Note also that
The dimensions you have selected for c are wrong (should be m x m)
The grid size you run the kernel on is also wrong because the dimensions of C are wrong and so your code could never calculate the whole matrix
Even after fixing all of this, it is likely that the results of the multiplication will be incorrect at anything other than trivial sizes because of integer overflow.
@disruptive: Hi, did you find any solution to your problem?
I had the same problem as you but I solved it by restarting the kernel of Jupyter notebook.
My code is slightly different than yours:
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    col, row = cuda.grid(2)
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row][col]
    b_cache[cuda.threadIdx.y, cuda.threadIdx.x] = b[col][row]
    for i in range(a.shape[1]):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]
        cuda.syncthreads()
        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]
        # Wait until all threads finish computing
        cuda.syncthreads()
    c[row][col] = sum
Please let me know if you have any update.
This is the correct solution:
import numpy as np
from numba import cuda, types
@cuda.jit
def mm_shared(a, b, c):
    sum = 0
    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)
    # TODO: use each thread to populate one element each a_cache and b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    TPB = int(N)
    for i in range(a.shape[1] // TPB):  # integer division so range() gets an int
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):  # a.shape[1]
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx][j] * b_cache[j][ty]
        cuda.syncthreads()
    c[x][y] = sum

CUDA Illegal memory access with possibly 'insufficient' shared memory

I have a simple CUDA kernel that can do vector accumulation by basic reduction. I am scaling it up to handle larger data by splitting it across multiple blocks. However, my assumption about allocating an appropriate amount of shared memory for the kernel fails with an illegal memory access. The error goes away when I increase this limit, but I want to know why.
Here is the code that I am talking about:
CORE KERNEL:
__global__ static
void vec_add(int *buffer,
             int numElem,           // The actual number of elements
             int numIntermediates)  // The next power of two of numElem
{
    extern __shared__ unsigned int interim[];
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    // Copy global intermediate values into shared memory.
    interim[threadIdx.x] =
        (index < numElem) ? buffer[index] : 0;
    __syncthreads();
    // numIntermediates *must* be a power of two!
    for (unsigned int s = numIntermediates / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}
And this is the caller:
void accumulate (int* buffer, int numElem)
{
    unsigned int numReductionThreads =
        nextPowerOfTwo(numElem); // A routine to return the next higher power of 2.
    const unsigned int maxThreadsPerBlock = 1024; // deviceProp.maxThreadsPerBlock
    unsigned int numThreadsPerBlock, numReductionBlocks, reductionBlockSharedDataSize;
    while (numReductionThreads > 1) {
        numThreadsPerBlock = numReductionThreads < maxThreadsPerBlock ?
            numReductionThreads : maxThreadsPerBlock;
        numReductionBlocks = (numReductionThreads + numThreadsPerBlock - 1) / numThreadsPerBlock;
        reductionBlockSharedDataSize = numThreadsPerBlock * sizeof(unsigned int);
        vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
            (buffer, numElem, numReductionThreads);
        numReductionThreads = nextPowerOfTwo(numReductionBlocks);
    }
}
I tried this code with a sample set of 1152 elements on my GPU with the following configuration:
Type: Quadro 600
MaxThreadsPerBlock: 1024
MaxSharedMemory: 48KB
OUTPUT:
Loop 1: numElem = 1152, numReductionThreads = 2048, numReductionBlocks = 2, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 4096
Loop 2: numElem = 1152, numReductionThreads = 2, numReductionBlocks = 1, numThreadsPerBlock = 2, reductionBlockSharedDataSize = 8
CUDA Error 77: an illegal memory access was encountered
Suspecting that my 'interim' shared memory was causing illegal memory access, I arbitrarily increased the shared memory by two times in the following line:
reductionBlockSharedDataSize = 2 * numThreadsPerBlock * sizeof(unsigned int);
And my kernel started working fine!
What I do not understand is why I had to provide this extra shared memory to make the problem go away (temporarily).
As a further experiment to check this magic number I ran my code with a much larger data-set with 6912 points. This time, even 2X or 4X didn't help me.
Loop 1: numElem = 6912, numReductionThreads = 8192, numReductionBlocks = 8, numThreadsPerBlock = 1024, reductionBlockSharedDataSize = 16384
Loop 2: numElem = 6912, numReductionThreads = 8, numReductionBlocks = 1, numThreadsPerBlock = 8, reductionBlockSharedDataSize = 128
CUDA Error 77: an illegal memory access was encountered
But the problem again went away when I increased the shared memory size by 8X.
Of course, I cannot be arbitrarily picking this scaling factor for larger and larger data-sets because I will soon run out of the 48KB shared memory limit. So I want to know a legitimate way of fixing my issue.
Thanks to @havogt for pointing out the out-of-bounds access.
The issue was that I was passing the wrong argument as numIntermediates to the vec_add kernel. The intention was for each block to operate on exactly as many data points as it has threads, which should have been 1024 all the time (in general, numThreadsPerBlock). Passing numReductionThreads (2048 in the first launch) instead meant that the reduction loop started with a stride s = 1024, so threads read interim[threadIdx.x + 1024], past the end of the 1024-entry shared memory allocation.
I fixed it by using numThreadsPerBlock as the argument:
vec_add <<< numReductionBlocks, numThreadsPerBlock, reductionBlockSharedDataSize >>>
(buffer, numElem, numThreadsPerBlock);
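An alternative, defensive variant (a sketch of the same kernel, not the code from the question) is to derive the reduction bound from blockDim.x inside the kernel, so the loop can never outrun the dynamically allocated shared memory as long as the block size is a power of two:
__global__ static
void vec_add2(int *buffer, int numElem)
{
    extern __shared__ unsigned int interim[];
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    interim[threadIdx.x] = (index < numElem) ? buffer[index] : 0;
    __syncthreads();
    // Stride starts at blockDim.x / 2, so interim[threadIdx.x + s] always stays
    // within the blockDim.x entries that were allocated at launch time.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            interim[threadIdx.x] += interim[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        buffer[blockIdx.x] = interim[0];
    }
}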

Why is cuFFT so slow?

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.
In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Experiments (code download)
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
n, //dims -- this doesn't include zero-padding
depth, //howmany
h_in, //in
inembed, //inembed
depth, //istride
1, //idist
h_freq, //out
onembed, //onembed
depth, //ostride
1, //odist
FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
depth, //istride
1, //idist
onembed, //onembed
depth, //ostride
1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
In the GPU version, cudaMemcpys between the CPU and GPU are not included in my computation time.
The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (total of 10 experiments, so 80 FFT function calls).
I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.
The question might be outdated, but here is a possible explanation for the slowness of cuFFT.
When structuring your data for cufftPlanMany, the data arrangement is not very friendly to the GPU. Indeed, using an istride and ostride of 32 means no data read is coalesced. The documented read/write pattern is
input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]
With istride/ostride equal to 32 these accesses are very unlikely to be coalesced or optimal (b is the batch index). Here are the changes I applied:
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
1, // WAS: depth, //istride
nRows*cols_padded, // WAS: 1, //idist
onembed, //onembed
1, // WAS: depth, //ostride
nRows*cols_padded, // WAS:1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
Running this, I got an unspecified launch failure because of an illegal memory access. You might want to change the memory allocation: cufftComplex is two floats, so you need a factor of 2 in your allocation size (looks like a typo).
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));
Running it this way, I got an 8x performance improvement on my card.
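To spell out what the modified layout means for arranging the input (my reading of the plan parameters, not part of the original answer): with istride = 1 and idist = nRows*cols_padded, each batch is one contiguous row-major padded plane, whereas the original istride = depth, idist = 1 interleaved the 32 batches element by element. A small indexing helper, with hypothetical names, might look like this:
// Index of real input sample (r, c) of batch b under the modified plan.
// With the original plan the same sample sat at (r*colsPadded + c)*depth + b,
// i.e. all batches interleaved, which is what defeated coalescing.
inline size_t planar_index(int b, int r, int c, int nRows, int colsPadded)
{
    return (size_t)b * nRows * colsPadded   // start of batch b's plane
         + (size_t)r * colsPadded           // row r within the plane
         + c;                               // column c
}
// Usage: d_in[planar_index(b, r, c, nRows, cols_padded)] = sample;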

Generalized sliding-window computation on the GPU

Here's some Python code that implements a sliding-window computation on two 3D matrices, X and Y.
import numpy
def sliding_dot( X,Y ) :
    assert X.ndim == Y.ndim == 3
    iw,ih,id = X.shape
    fw,fh,fd = Y.shape
    assert id == fd
    assert fw < iw and fh < ih
    ow,oh = iw-fw+1,ih-fh+1
    out = numpy.zeros( [ow,oh] )
    for x in xrange(ow) :
        for y in xrange(oh) :
            window = X[x:x+fw,y:y+fh,:]
            out[x,y] = numpy.dot( window.flatten(),Y.flatten() )
    return out
#################
A_dims = (640,480,32)
B_dims = (6,6,32)
A = numpy.random.rand(*A_dims)
B = numpy.random.rand(*B_dims)
sliding_dot(A,B)
In general, Y is always much smaller than X along the first and second dimensions, but they are equal in the third dimension.
Note that we could replace numpy.dot() with any function of Y and the window. This is a little different from convolution in that Y only slides along the first and second dimensions of X. I'm looking for an effective strategy for implementing this kind of sliding-window computation efficiently using CUDA. Anybody want to offer me some direction? Cheers!
Update: you can watch me work through the optimization process, with help from other users, in my answer below.
Trying to design a "generalised" implementation which could accommodate just about any operation you might want is going to involve an enormous trade-off in an architecture like CUDA. For your concrete dot-product example, which is a typical reduction operation, this is a pretty useful implementation:
__constant__ int ldaX[3];
__constant__ int ldaY[3];
__constant__ int dimX[3];
__constant__ int dimY[3];

template<typename real,int blocksize>
__global__ void sliding_k(const real *X, const real *Y, real *out)
{
    __shared__ volatile real buffer[blocksize];
    int tid = threadIdx.x;
    int gid = blockIdx.x * gridDim.y + blockIdx.y;
    real value = (real)0;
    int xpos = (blockIdx.y * ldaX[2]) + (blockIdx.x * ldaX[1]);
    int ypos = 0;
    for(int i=0; i<dimY[0]; i++) {
        for(int jk=tid; jk<ldaY[1]; jk+=blocksize) {
            value += X[xpos+jk] * Y[ypos+jk];
        }
        xpos += ldaX[1];
        ypos += ldaY[1];
    }
    buffer[tid] = value;
    __syncthreads();
#pragma unroll
    for(int i=(tid+32); ((tid<32)&&(i<blocksize)); i+=32)
        buffer[tid] += buffer[i];
    if (tid < 16) buffer[tid] += buffer[tid + 16];
    if (tid < 8)  buffer[tid] += buffer[tid + 8];
    if (tid < 4)  buffer[tid] += buffer[tid + 4];
    if (tid < 2)  buffer[tid] += buffer[tid + 2];
    if (tid == 0) out[gid] = buffer[0] + buffer[1];
}
You could substitute any kind of reduction operator you like for the floating-point multiply/accumulate operation which a dot product uses, and the code should work fine. Each window calculation is performed by a single block; at this window size there is enough parallel work to justify a block per window. This allows coalesced global memory access and, on Fermi cards, a good amount of L1 cache hits.
I have only built one assumption into the code: that the third dimension of the source array and the window array are equal. This allows the inner two loops to be "fused" into a single operation because of the common memory layout they share. Running a test harness in Python using an improved version of your reference code, with the host code written in PyCUDA, I get this:
In [15]: %timeit -n3 -r3 out2=sliding_cuda(A,B)
3 loops, best of 3: 49.8 ms per loop
In [16]: %timeit -n3 -r3 out=sliding_dot(A,B)
3 loops, best of 3: 2.18 s per loop
In [17]: (numpy.abs(out2-out)/numpy.abs(out)).max()
Out[17]: 4.2921323635558404e-15
when run on a 3GHz Phenom II with a GTX 470, using 64-thread blocks on a 635x475 2D grid, i.e. about a 50x speed-up including module loading, setup and memory transfers using pageable host memory allocations. The kernel itself is about 100 times faster than the Python, not including memory transfers and setup overhead. Note that this is a double-precision version: Python uses double-precision floating-point arithmetic by default.
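For completeness, here is my reconstruction of the host-side setup this kernel expects (the answer's actual harness was PyCUDA and is not shown; the stride values below are my reading of the kernel's indexing, so treat them as assumptions). The __constant__ descriptors hold slice sizes of the row-major X and Y arrays, and the grid launches one 64-thread block per output window:
void launch_sliding(const double *d_X, const double *d_Y, double *d_out,
                    int iw, int ih, int id,   // X dims, e.g. (640, 480, 32)
                    int fw, int fh, int fd)   // Y dims, e.g. (6, 6, 32)
{
    // Slice sizes in elements (whole array, one slab along dim 0, one row along dim 1);
    // the kernel only reads ldaX[1], ldaX[2], ldaY[1] and dimY[0].
    int h_ldaX[3] = { iw * ih * id, ih * id, id };
    int h_ldaY[3] = { fw * fh * fd, fh * fd, fd };
    int h_dimX[3] = { iw, ih, id };
    int h_dimY[3] = { fw, fh, fd };
    cudaMemcpyToSymbol(ldaX, h_ldaX, sizeof(h_ldaX));
    cudaMemcpyToSymbol(ldaY, h_ldaY, sizeof(h_ldaY));
    cudaMemcpyToSymbol(dimX, h_dimX, sizeof(h_dimX));
    cudaMemcpyToSymbol(dimY, h_dimY, sizeof(h_dimY));

    // One block per output window: a (iw-fw+1) x (ih-fh+1) grid, 635 x 475 here.
    // d_out must hold (iw-fw+1) * (ih-fh+1) doubles.
    dim3 grid(iw - fw + 1, ih - fh + 1);
    sliding_k<double, 64><<<grid, 64>>>(d_X, d_Y, d_out);
}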
Well, here are some thoughts:
You perform ~640*480 iterations of numpy.dot, which itself processes 6*6*32 elements. Parallelizing the dot product is barely worth it: 192 parallel threads is not enough for the GPU, and reduction on CUDA brings additional trouble. So, IMO, the best way to parallelize your task is to assign one element of the output array to each thread.
Now about memory: the output array will be in global memory; there is not much choice. For the input data, A looks like a good fit for texture memory, since adjacent threads access adjacent elements. Alternatively, you can manually "cache" it in shared memory, but that does not look much more advantageous than simply using texture. For B, shared memory is not good, since it would cause bank conflicts: when you calculate the dot product, all threads in a half-warp access the same element of B (you could start the summation from different elements in different threads, but that (again) doesn't look promising). So the choice is either texture or constant memory. I vote for constant, since (a) constant memory is suited for data that is accessed uniformly by all threads on the device, and (b) you won't pollute the texture cache.
The above are just my guesses; to actually achieve good performance you had better try out different variants...
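As a minimal sketch of the constant-memory suggestion for B (illustrative only, not from the answer): B is 6*6*32 floats = 4.5 KB, which fits comfortably under the 64 KB constant memory limit, and at any given step of the dot product every thread reads the same B element, which is exactly the broadcast pattern the constant cache is built for.
__constant__ float Bc[6 * 6 * 32];   // the whole of B, 4.5 KB

// Host-side helper: copy the flattened, row-major B into constant memory once,
// before launching the kernel; the kernel then reads Bc[...] instead of a global pointer.
void upload_B(const float *h_B)
{
    cudaMemcpyToSymbol(Bc, h_B, sizeof(float) * 6 * 6 * 32);
}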
Update regarding your naive implementation
for (int Yi = 0; Yi < Ydims[0]; Yi++ )
Here you access global memory on each iteration. That's a huge performance killer. Since you have 3 dimensions, you had better replace your int *Ydims with an int3 Ydims (same for Xdims and outdims).
out[out_indx] += X[X_indx]*Y[Y_indx];
Again, a very bad idea. Create a register variable and do all operations with it. Write to a global array only once at the end of a kernel.
These optimizations are the first thing you should do. The second is to make your X and Y 3D textures, so that accesses to them are cached. I guess that after this, CUDA would outperform the CPU.
For further optimisations, you'd better read the CUDA C Best Practices Guide. It's a must-read, and you will get a much better idea of how to write efficient GPU code (right now your implementation is too naive).
v0.1 - Naive implementation
Here's my first, naive attempt at making this work:
__global__ void sliding_dot(float *out, int *outdims, float *X, int *Xdims, float *Y, int *Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    int Y_indx = 0;
    int X_indx = 0;
    if ( i < outdims[0] & j < outdims[1] )
    {
        int out_indx = j + i*outdims[1];
        for (int Yi = 0; Yi < Ydims[0]; Yi++ )
        {
            for (int Yj = 0; Yj < Ydims[1]; Yj++ )
            {
                for (int k = 0; k < Ydims[2]; k++ )
                {
                    Y_indx = k + Yj* Ydims[2] + Yi* Ydims[2]*Ydims[1];
                    X_indx = k + (j+Yj)*Xdims[2] + (i+Yi)*Xdims[2]*Xdims[1];
                    out[out_indx] += X[X_indx]*Y[Y_indx];
                }
            }
        }
    }
}
So far the results are less-than-desirable. With block size (32,32,1) and grid dimensions p,q chosen such that p*32 >= outdims[0] and q*32 >= outdims[1] :
method=[ sliding_dot ] gputime=[ 7013.280 ] cputime=[ 18.000 ] occupancy=[ 0.667 ]
method=[ sliding_dot ] gputime=[ 6945.184 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
method=[ sliding_dot ] gputime=[ 6990.816 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
method=[ sliding_dot ] gputime=[ 6931.648 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
v0.2 - texture<float,1>
I hope everybody is learning as much from this as I am! I followed @aland's suggestions and got a considerable speed-up:
texture<float,1> X;
texture<float,1> Y;
__global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if ( i < outdims.x & j < outdims.y )
    {
        int out_indx = j + i*outdims.y;
        float total = 0.0f;
        int X_indx = 0;
        int Y_indx = 0;
        for (int Yi=0; Yi<Ydims.x; Yi++ )
        {
            for (int Yj=0; Yj<Ydims.y; Yj++ )
            {
                for (int k=0; k<Ydims.z; k++ )
                {
                    Y_indx = k + Yj* Ydims.z + Yi* Ydims.z*Ydims.y;
                    X_indx = k + (j+Yj)*Xdims.z + (i+Yi)*Xdims.z*Xdims.y;
                    total += tex1Dfetch(X,X_indx)*tex1Dfetch(Y,Y_indx);
                }
            }
        }
        out[out_indx] = total;
    }
}
But we're still not running as quickly as the CPU:
method=[ dotconv ] gputime=[ 2224.928 ] cputime=[ 24.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2222.592 ] cputime=[ 7.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2225.216 ] cputime=[ 10.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2222.752 ] cputime=[ 10.000 ] occupancy=[ 0.667 ]
v0.3 - texture<float,3>
texture<float,3,cudaReadModeElementType> X;
texture<float,3,cudaReadModeElementType> Y;
__global__ void dotconv(float *out, int2 outdims, int3 Xdims, int3 Ydims )
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if ( i < outdims.x & j < outdims.y )
    {
        int out_indx = j + i*outdims.y;
        float total = 0.0f;
        for (int Yi=0; Yi<Ydims.x; Yi++ )
        {
            for (int Yj=0; Yj<Ydims.y; Yj++ )
            {
                for (int k=0; k<Ydims.z; k++ )
                {
                    total += tex3D(X,k,j+Yj,i+Yi) * tex3D(Y,k,Yj,Yi);
                }
            }
        }
        out[out_indx] = total;
    }
}
This is actually a little slower than v0.2:
method=[ dotconv ] gputime=[ 2403.360 ] cputime=[ 35.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2392.160 ] cputime=[ 15.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2396.448 ] cputime=[ 15.000 ] occupancy=[ 0.667 ]
method=[ dotconv ] gputime=[ 2398.880 ] cputime=[ 16.000 ] occupancy=[ 0.667 ]
Thanks for your suggestions!
You might want to try separating out your reads from your sums from your stores.
So each kernel should have 3 sections:
Read from texture memory, store to shared memory for the entire block:
__shared__ float blockX[ Ydims.z ][ Ydims.y ][ Ydims.x ];
__shared__ float blockY[ Ydims.z ][ Ydims.y ][ Ydims.x ];
// NOTE: make each thread load k elements * 2, rather than each thread loading Ydims.X*Y*Z elements
blockX[k][yj][yi] = ...
blockY[k][yj][yi] = ...
__syncthreads(); // <-- critical -- all threads in the block must finish writing to
                 // shared memory before any may read the values.
Unroll your for loops with #pragma unroll.
This will significantly increase your ILP, and there will be much less branching for your constant loop sizes.
Ensure that your shared memory access is strided appropriately, otherwise bank conflicts will kill your performance.
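A minimal sketch of that three-section structure (my illustration, not the answerer's code): here only the small Y window is staged in shared memory, since it is reused by every thread in the block, while X is still read directly; FW, FH and FD are compile-time stand-ins for Ydims because shared arrays need constant sizes.
#define FW 6
#define FH 6
#define FD 32

__global__ void dotconv_shared(float *out, int2 outdims, int3 Xdims,
                               const float *X, const float *Y)
{
    __shared__ float blockY[FW][FH][FD];

    // Section 1: cooperative load -- each thread loads a strided subset of Y.
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;
    for (int idx = tid; idx < FW * FH * FD; idx += nthreads) {
        int yi = idx / (FH * FD);
        int yj = (idx / FD) % FH;
        int k  = idx % FD;
        blockY[yi][yj][k] = Y[idx];
    }
    __syncthreads();   // all of Y must be written before anyone reads it

    // Section 2: accumulate the window dot product in a register.
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    if (i < outdims.x && j < outdims.y) {
        float total = 0.0f;
        for (int yi = 0; yi < FW; yi++) {
            for (int yj = 0; yj < FH; yj++) {
                #pragma unroll
                for (int k = 0; k < FD; k++) {
                    total += X[k + (j + yj) * Xdims.z + (i + yi) * Xdims.z * Xdims.y]
                           * blockY[yi][yj][k];
                }
            }
        }
        // Section 3: write the result to global memory exactly once.
        out[j + i * outdims.y] = total;
    }
}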