CUFFT: How to calculate the FFT of a pitched pointer?

I'm trying to calculate the FFT of an image using CUFFT. It seems like CUFFT only offers FFTs of plain device pointers allocated with cudaMalloc.
My input images are allocated using cudaMallocPitch, but there is no option for handling the pitch of the image pointer.
Currently, I have to remove the row alignment, then execute the FFT, and copy the results back to the pitched pointer. My current code is as follows:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
    //src and dst are device pointers allocated with cudaMallocPitch
    //Convert them to plain pointers. No padding of rows.
    float *plainSrc;
    cufftComplex *plainDst;

    cudaMalloc<float>(&plainSrc, width * height * sizeof(float));
    cudaMalloc<cufftComplex>(&plainDst, width * height * sizeof(cufftComplex));

    cudaMemcpy2D(plainSrc, width * sizeof(float), src, srcPitch, width * sizeof(float), height, cudaMemcpyDeviceToDevice);

    cufftHandle handle;
    cufftPlan2d(&handle, width, height, CUFFT_R2C);
    cufftSetCompatibilityMode(handle, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(handle, plainSrc, plainDst);
    cufftDestroy(handle);

    cudaMemcpy2D(dst, dstPitch, plainDst, width * sizeof(cufftComplex), width * sizeof(cufftComplex), height, cudaMemcpyDeviceToDevice);

    cudaFree(plainSrc);
    cudaFree(plainDst);
}
It gives the correct result, but I don't want to do two extra memory allocations and copies inside the function. I want to do something like this:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
    //src and dst are device pointers allocated with cudaMallocPitch
    //Don't know how to handle pitch here???
    cufftHandle handle;
    cufftPlan2d(&handle, width, height, CUFFT_R2C);
    cufftSetCompatibilityMode(handle, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(handle, src, dst);
    cufftDestroy(handle);
}
Question:
How do I calculate the FFT of a pitched pointer directly using CUFFT?

I think you may be interested in cufftPlanMany, which lets you do 1D, 2D, and 3D FFTs with pitches. The key here is the inembed and onembed parameters, which describe the physical storage layout (including the pitch) of the input and output.
You can look at CUDA_CUFFT_Users_Guide.pdf (pages 23-24) for more information. For your example, you'd do something like the following.
void fft_device(float* src, cufftComplex* dst,
                int width, int height,
                int srcPitch, int dstPitch)
{
    cufftHandle handle;

    int rank = 2;                 // 2D FFT
    int n[] = {height, width};    // Transform size, slowest-changing dimension first
    int istride = 1, ostride = 1; // Stride between successive elements in a row
    int idist = 1, odist = 1;     // Distance between batches (irrelevant for batch = 1)

    // inembed/onembed describe the storage layout in *elements*, not bytes, so the
    // byte pitches returned by cudaMallocPitch must be divided by the element size.
    int inembed[] = {height, srcPitch / (int)sizeof(float)};        // Input layout with pitch
    int onembed[] = {height, dstPitch / (int)sizeof(cufftComplex)}; // Output layout with pitch
    int batch = 1;

    cufftPlanMany(&handle, rank, n,
                  inembed, istride, idist,
                  onembed, ostride, odist, CUFFT_R2C, batch);
    cufftSetCompatibilityMode(handle, CUFFT_COMPATIBILITY_NATIVE);
    cufftExecR2C(handle, src, dst);
    cufftDestroy(handle);
}
P.S. I did not add return checks for the sake of the example. Always check return values in your code.
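For reference, here is a minimal calling sketch under the assumptions above (the pitches are the byte pitches returned by cudaMallocPitch, and error checks are again omitted; note that an R2C transform only needs width/2 + 1 complex values per output row):

// Hypothetical usage sketch, not from the original post.
float *d_src = NULL;
cufftComplex *d_dst = NULL;
size_t srcPitch = 0, dstPitch = 0;

cudaMallocPitch(&d_src, &srcPitch, width * sizeof(float), height);
cudaMallocPitch(&d_dst, &dstPitch, (width / 2 + 1) * sizeof(cufftComplex), height);

fft_device(d_src, d_dst, width, height, (int)srcPitch, (int)dstPitch);

cudaFree(d_src);
cudaFree(d_dst);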

Smart design for large kernel with different inputs that only changes one line of code

I am designing some kernels that I would like to be able to call in two ways: once with a standard float * device pointer as the output argument (for writing), and once with a cudaSurfaceObject_t as the output argument (for writing). The kernel itself is long (>200 lines), and ultimately only the last line needs to differ. In one case you have a standard out[idx] = val type of assignment, while in the other a surf3Dwrite() type. The rest of the kernel is identical.
Something like

__global__ void kernel(float * out, ....)
{
    // 200 lines of math
    // only difference, aside from the output argument:
    idx = ....
    out[idx] = val;
}

vs

__global__ void kernel(cudaSurfaceObject_t out, ...)
{
    // 200 lines of math
    // only difference, aside from the output argument:
    surf3Dwrite(val, out, x, y, z);
}
What is the smart way to code this without copy-pasting the entire kernel and renaming it? I checked templating, but (if I am not wrong) it is for types only; you cannot just have a completely different line of code when the type differs in a template. CUDA kernels don't seem to be able to be overloaded either.

"CUDA kernels don't seem to be able to be overloaded either."

It should be possible to overload kernels. Here is one possible approach, using overloading (and no templating):
$ cat t1648.cu
// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>

__device__ float my_common(float *d, int width, unsigned int x, unsigned int y){
    // 200 lines of common code...
    return d[y * width + x];
}

////////////////////////////////////////////////////////////////////////////////
// Kernels
////////////////////////////////////////////////////////////////////////////////
//! Write to a cuArray using surface writes
//! @param gIData input data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void WriteKernel(float *gIData, int width, int height,
                            cudaSurfaceObject_t outputSurface)
{
    // calculate surface coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // read from global memory and write to cuarray (via surface object)
    surf2Dwrite(my_common(gIData, width, x, y),
                outputSurface, x * 4, y, cudaBoundaryModeTrap);
}

__global__ void WriteKernel(float *gIData, int width, int height,
                            float *out)
{
    // calculate coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // read from global memory and write to global memory
    out[y * width + x] = my_common(gIData, width, x, y);
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    printf("starting...\n");

    unsigned width = 256;
    unsigned height = 256;
    unsigned int size = width * height * sizeof(float);

    // Allocate device memory for result
    float *dData = NULL;
    checkCudaErrors(cudaMalloc((void **) &dData, size));

    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    float *out;
    cudaMalloc(&out, size);
    checkCudaErrors(cudaMallocArray(&cuArray,
                                    &channelDesc,
                                    width,
                                    height,
                                    cudaArraySurfaceLoadStore));

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    cudaSurfaceObject_t outputSurface;
    cudaResourceDesc surfRes;
    memset(&surfRes, 0, sizeof(cudaResourceDesc));
    surfRes.resType = cudaResourceTypeArray;
    surfRes.res.array.array = cuArray;
    checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));

    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, outputSurface);
    WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, out);

    checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
    checkCudaErrors(cudaFree(dData));
    checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -I/usr/local/cuda/samples/common/inc t1648.cu -o t1648
$
The above example was hacked together quickly from the simpleSurfaceWrite CUDA sample code. It is not intended to be functional or to run "correctly". It is designed to show how overloading can be used, from a code-structure standpoint, to address the stated objective.
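For completeness, and since the question specifically wondered about templates: the same structure can also be written as a single templated kernel plus an overloaded device-side store helper, so that only the final write differs. This is a sketch of the idea (WriteKernelT and store are hypothetical names, and my_common is the shared function from the listing above):

__device__ void store(float *out, float val, unsigned int x, unsigned int y, int width)
{
    out[y * width + x] = val;
}

__device__ void store(cudaSurfaceObject_t out, float val, unsigned int x, unsigned int y, int width)
{
    surf2Dwrite(val, out, x * 4, y);
}

template <typename Dst>
__global__ void WriteKernelT(float *gIData, int width, int height, Dst out)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // ... the 200 lines of shared math would go here ...
    float val = my_common(gIData, width, x, y);
    store(out, val, x, y, width); // overload resolution picks the right final write
}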

Does the leading dimension in cuBLAS allow for accessing any submatrix?

I'm trying to understand the idea of the leading dimension in cuBLAS. It's mentioned that lda must always be greater than or equal to the number of rows in the matrix.
If I have a 100x100 matrix A and I want to access A(90:99, 0:99), what would the arguments of cublasSetMatrix be? lda specifies the spacing in elements between columns, i.e. the number of rows of the stored matrix (100 in this case), but where would I specify the 90? I can only see a way by adjusting *A.
The function definition is:

cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)

And I'm also guessing that I wouldn't be able to transfer the bottom-right 3x3 portion of a 5x5 matrix, given the length limits.
You have to "adjust *A", as you called it. The pointer that is given to this function must be the starting entry of the respective sub-matrix.
You did not say whether your matrix A is actually the input- or the output matrix, but this should not change much, conceptually.
Assuming you have the following code:
// The matrix in host memory
int rowsA = 100;
int colsA = 100;
float *A = new float[rowsA*colsA];
// Fill A with values
...
// The sub-matrix that should be copied to the device.
// The minimum index is INCLUSIVE
// The maximum index is EXCLUSIVE
int minRowA = 0;
int maxRowA = 100;
int minColA = 90;
int maxColA = 100;
int rowsB = maxRowA-minRowA;
int colsB = maxColA-minColA;
// Allocate the device matrix
float *dB = nullptr;
cudaMalloc(&dB, rowsB * colsB * sizeof(float));
Then, for the cublasSetMatrix call, you have to compute the starting element of the source matrix:
float *sourceA = A + (minRowA + minColA * rowsA);
cublasSetMatrix(rowsB, colsB, sizeof(float), sourceA, rowsA, dB, rowsB);
And this is where the 90 that you asked for comes into play: It is the minColA in the computation of the source pointer.
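Regarding the guess at the end of the question: the bottom-right 3x3 portion of a 5x5 matrix can be transferred the same way; there is no length limit that prevents it. A minimal sketch under the same column-major conventions (assuming dB was allocated to hold at least 3*3 floats):

// Bottom-right 3x3 block of a column-major 5x5 matrix:
// rows 2..4 and columns 2..4 (0-based, inclusive).
float *sourceA = A + (2 + 2 * 5); // offset = minRow + minCol * lda, with lda = 5
cublasSetMatrix(3, 3, sizeof(float), sourceA, 5, dB, 3);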

Multiplication of two cudaArrays in a kernel (using texture memory)?

I have two cudaArrays, a1 and a2 (which have the same size), which represent two matrices.
Using texture memory, I want to multiply those two cudaArrays.
Then I want to copy the result back into a normal array, let's name it *a1_h.
The fact is, I just don't know how to do it. I've managed to define and allocate my two cudaArrays and to put floats into them.
Now I want a kernel that does the multiplication.
Can somebody help me?
ROOM_X and ROOM_Y are ints; they define the width and height of the matrices.
mytex_M1 and mytex_M2 are textures defined as: texture<float, 2, cudaReadModeElementType>.
Here is my main:
int main(int argc, char * argv[]) {

    int size = ROOM_X * ROOM_Y * sizeof(float);

    // creation of arrays on host. Will be useful for filling the cudaArrays
    float *M1_h, *M2_h;

    // allocating memory on host
    M1_h = (float *)malloc(size);
    M2_h = (float *)malloc(size);

    // creation of channel descriptions for 2D textures
    cudaChannelFormatDesc channelDesc_M1 = cudaCreateChannelDesc<float>();
    cudaChannelFormatDesc channelDesc_M2 = cudaCreateChannelDesc<float>();

    // creation of 2 cudaArray*
    cudaArray *M1_array, *M2_array;

    // bind arrays and channels in order to allocate space
    cudaMallocArray(&M1_array, &channelDesc_M1, ROOM_X, ROOM_Y);
    cudaMallocArray(&M2_array, &channelDesc_M2, ROOM_X, ROOM_Y);

    // filling the matrices on host
    Matrix(M1_h);
    Matrix(M2_h);

    // copy from host to device (putting the initial values of M1 and M2 into the arrays)
    cudaMemcpyToArray(M1_array, 0, 0, M1_h, size, cudaMemcpyHostToDevice);
    cudaMemcpyToArray(M2_array, 0, 0, M2_h, size, cudaMemcpyHostToDevice);

    // set texture parameters
    mytex_M1.addressMode[0] = cudaAddressModeWrap;
    mytex_M1.addressMode[1] = cudaAddressModeWrap;
    mytex_M1.filterMode = cudaFilterModeLinear;
    mytex_M1.normalized = true; // NB: coordinates in [0,1]

    mytex_M2.addressMode[0] = cudaAddressModeWrap;
    mytex_M2.addressMode[1] = cudaAddressModeWrap;
    mytex_M2.filterMode = cudaFilterModeLinear;
    mytex_M2.normalized = true; // NB: coordinates in [0,1]

    // bind arrays to the textures
    cudaBindTextureToArray(mytex_M1, M1_array);
    cudaBindTextureToArray(mytex_M2, M2_array);

    // allocate device memory for the result
    float* M1_d;
    cudaMalloc((void**)&M1_d, size);

    // dimensions of grid and blocks
    dim3 dimGrid(ROOM_X, ROOM_Y);
    dim3 dimBlock(1, 1);

    // execution of the kernel. The result of the multiplication has to be put in M1_d
    mul_texture<<<dimGrid, dimBlock>>>(M1_d);

    // copy result from device to host
    cudaMemcpy(M1_h, M1_d, size, cudaMemcpyDeviceToHost);

    // free memory on device
    cudaFreeArray(M1_array);
    cudaFreeArray(M2_array);
    cudaFree(M1_d);

    // free memory on host
    free(M1_h);
    free(M2_h);

    return 0;
}
When you declare a texture, remember that:

"A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function."

http://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-reference-api

So, if you have successfully defined the texture references, initialized the arrays, copied them to the texture space, and prepared the output buffers (something that seems to be done according to your code), what you need to do is implement the kernel. For example:
__global__ void
mul_texture(float* M1_d, int w, int h)
{
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    // take care of the size of the image; it's good practice
    if (x < w && y < h)
    {
        // the output M1_d is actually represented as a 1D array,
        // so the offset of each value is given by its (x,y) position
        // in row-major order
        int gid = x + y * w;

        // as textures are declared at global scope,
        // we can access their content in any kernel
        float M1_value = tex2D(mytex_M1, x, y);
        float M2_value = tex2D(mytex_M2, x, y);

        // the final result is the pointwise multiplication
        M1_d[gid] = M1_value * M2_value;
    }
}
You need to change the kernel invocation to include the w and h values, corresponding to the width (number of columns) and height (number of rows) of the matrices:

mul_texture<<<dimGrid, dimBlock>>>(M1_d, ROOM_X, ROOM_Y);

Also note that your main sets normalized = true on both textures, which means tex2D expects coordinates in [0,1]; with the integer coordinates used in the kernel above, you would want normalized = false (and probably cudaFilterModePoint instead of cudaFilterModeLinear). Finally, you are not doing any error checking, something that will help you quite a lot now and in the future. I have not checked whether the kernel provided in this answer works, as your code didn't compile.
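One more suggestion, about the launch configuration rather than correctness: dimBlock(1,1) launches one thread per block, which works but wastes most of the GPU. A common alternative (a sketch, assuming the kernel above) is a 16x16 block with just enough blocks to cover the image:

// Sketch: 16x16 threads per block, grid rounded up to cover ROOM_X x ROOM_Y.
dim3 dimBlock(16, 16);
dim3 dimGrid((ROOM_X + dimBlock.x - 1) / dimBlock.x,
             (ROOM_Y + dimBlock.y - 1) / dimBlock.y);
mul_texture<<<dimGrid, dimBlock>>>(M1_d, ROOM_X, ROOM_Y);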

Finding min and max by CUDA kernel reduction

Here is my kernel code:

typedef unsigned char Npp8u;
...
// Kernel Implementation
__device__ unsigned int min_device;
__device__ unsigned int max_device;

__global__ void findMax_Min(Npp8u * data, int numEl){
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    int shared_index = threadIdx.x;
    __shared__ Npp8u data_shared_min[BLOCKDIM];
    __shared__ Npp8u data_shared_max[BLOCKDIM];

    // check index condition
    if(index < numEl){
        data_shared_min[shared_index] = data[index]; // pass values from global to shared memory
        __syncthreads();
        data_shared_max[shared_index] = data[index]; // pass values from global to shared memory

        for (unsigned int stride = BLOCKDIM/2; stride > 0; stride >>= 1) {
            if(threadIdx.x < stride){
                if(data_shared_max[threadIdx.x] < data_shared_max[threadIdx.x + stride])
                    data_shared_max[shared_index] = data_shared_max[shared_index + stride];
                if(data_shared_min[threadIdx.x] > data_shared_min[threadIdx.x + stride])
                    data_shared_min[shared_index] = data_shared_min[shared_index + stride];
            }
            __syncthreads();
        }

        if(threadIdx.x == 0){
            atomicMin(&(min_device), (unsigned int)data_shared_min[threadIdx.x]);
            //min_device = 10;
            __syncthreads();
            atomicMax(&(max_device), (unsigned int)data_shared_max[threadIdx.x]);
        }
    } else {
        data_shared_min[shared_index] = 9999;
    }
}
I have an image that is 512x512, and I want to find the min and max pixel values. data is the 1-D version of the image. This code works for the max but not for the min value. As I checked in MATLAB, the max value is 202 and the min value is 10, but this code finds 0 for the min value. Here are my host code and memcpy calls:
int main(){
    // Host parameter declarations.
    Npp8u * imageHost;
    int nWidth, nHeight, nMaxGray;

    // Load image to the host.
    std::cout << "Load PGM file." << std::endl;
    imageHost = LoadPGM("lena_before.pgm", nWidth, nHeight, nMaxGray);

    // Device parameter declarations.
    Npp8u * imageDevice;
    unsigned int max, min;

    size_t size = sizeof(Npp8u) * nWidth * nHeight;
    cudaMalloc((Npp8u**)&imageDevice, size);
    cudaMemcpy(imageDevice, imageHost, size, cudaMemcpyHostToDevice);

    int numPixels = nWidth * nHeight;
    dim3 numThreads(BLOCKDIM);
    dim3 numBlocks(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1));

    findMax_Min<<<numBlocks, numThreads>>>(imageDevice, numPixels);

    cudaMemcpyFromSymbol(&max, max_device, sizeof(max_device), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&min, min_device, sizeof(min_device), 0, cudaMemcpyDeviceToHost);

    printf("Min value for image : %i\n", min);
    printf("Max value for image : %i\n", max);
    ...
Another interesting thing is that swapping the order of the two cudaMemcpyFromSymbol calls after the kernel launch also causes malfunctioning: both values are then read as zero. I do not see the problem. Does anyone see what is wrong?
You might want to do CUDA error checking. You might also want to initialize min_device to a large value and max_device to zero. There are other problems with your reduction method related to stride (what happens in the last block of an odd-size image when you add stride to threadIdx.x? It may exceed the defined image range in shared memory), but I don't think it matters for a 512x512 image. If min_device just happened to start out at zero, all of your atomicMin operations would always leave zero there.
You can try initializing min_device and max_device like this:
__device__ unsigned int min_device = 9999;
__device__ unsigned int max_device = 0;
For the cudaMemcpyFromSymbol calls at the end: if max and min are declared as one-byte Npp8u variables, you are copying 4 bytes (the size of max_device) into a one-byte variable, and likewise for min. So that's a problem. Since you're passing pointers, the copy operation will definitely overwrite something that you don't intend. If the compiler stores the variables sequentially in the order you defined them, one copy operation overwrites the other variable, which would explain the behavior you're seeing. Declaring min and max as unsigned int quantities (as in the host code shown above) makes this problem go away.
EDIT: Since you haven't shown your actual block dimensions, it's possible that you still have a problem with your reduction. You might want to change this line:

if(threadIdx.x < stride){

to something like:

if((threadIdx.x < stride) && ((index + stride) < numEl)){

This or something like it should correct the hazard mentioned in the first paragraph. I guess you're trying to account for the hazard using this line:

data_shared_min[shared_index] = 9999;

But there's no guarantee that this line of code gets executed before the shared-memory element it sets is read by some other thread. Also note that assigning 9999 to a byte quantity truncates it (9999 mod 256 = 15), which is probably not what you expect.
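Putting those pieces together, here is a sketch of how the load phase could be restructured so that every shared-memory slot holds a valid value before any thread reads it (the padding values 0xFF and 0 are my assumption for the identity elements of min and max over Npp8u data):

// Sketch: every thread writes a padding value for out-of-range indices
// *before* __syncthreads(), so the reduction never reads stale shared memory.
data_shared_min[shared_index] = (index < numEl) ? data[index] : (Npp8u)0xFF;
data_shared_max[shared_index] = (index < numEl) ? data[index] : (Npp8u)0;
__syncthreads();
// ... reduction loop as before, now safe even in the last, partially-filled block ...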

Bound CUDA texture reads zero

I am trying to read values from a texture and write them back to global memory.
I am sure the writing part works, because I can put constant values in the kernel and see them in the output:
__global__ void
bartureKernel(float* g_odata, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    if(x < width && y < height) {
        unsigned int idx = (y * width + x);
        g_odata[idx] = tex2D(texGrad, (float)x, (float)y).x;
    }
}
The texture I want to use is a 2D float texture with two channels, so I defined it as:
texture<float2, 2, cudaReadModeElementType> texGrad;
And the code which calls the kernel initializes the texture with some constant non-zero values:
float* d_data_grad = NULL;
cudaMalloc((void**) &d_data_grad, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;

texGrad.addressMode[0] = cudaAddressModeClamp;
texGrad.addressMode[1] = cudaAddressModeClamp;
texGrad.filterMode = cudaFilterModeLinear;
texGrad.normalized = false;

cudaMemset(d_data_grad, 50, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;

cudaBindTexture(NULL, texGrad, d_data_grad, cudaCreateChannelDesc<float2>(), gradientSize * sizeof(float));

float* d_data_barture = NULL;
cudaMalloc((void**) &d_data_barture, outputSize * sizeof(float));
CHECK_CUDA_ERROR;

dim3 dimBlock(8, 8, 1);
dim3 dimGrid(((width-1) / dimBlock.x) + 1, ((height-1) / dimBlock.y) + 1, 1);

bartureKernel<<<dimGrid, dimBlock, 0>>>(d_data_barture, width, height);
I know that setting the texture bytes to all "50" doesn't make much sense in the context of floats, but it should at least give me some non-zero values to read.
I can only read zeros though...
You are using cudaBindTexture to bind your texture to the memory allocated by cudaMalloc, but in the kernel you are using the tex2D function to read values from the texture. That is why it is reading zeros.
If you bind a texture to linear memory using cudaBindTexture, it has to be read using tex1Dfetch inside the kernel.
tex2D is used only to read from textures bound to pitch linear memory (allocated with cudaMallocPitch) using the function cudaBindTexture2D, or from textures bound to a cudaArray using the function cudaBindTextureToArray.
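Given that, the quickest fix for the code as posted may be to keep the cudaBindTexture call and switch the kernel to tex1Dfetch. A sketch (the 1D texture reference and kernel name here are hypothetical, not part of the question's code):

// Sketch: reading linear memory through a 1D texture reference.
texture<float2, 1, cudaReadModeElementType> texGradLinear;

__global__ void bartureKernel1D(float* g_odata, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned int idx = y * width + x;
        g_odata[idx] = tex1Dfetch(texGradLinear, idx).x; // integer index, no filtering
    }
}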
Here is the basic table; the rest you can read in the programming guide:

Memory Type           Allocated Using       Bound Using               Read In The Kernel By
Linear memory         cudaMalloc            cudaBindTexture           tex1Dfetch
Pitch linear memory   cudaMallocPitch       cudaBindTexture2D         tex2D
cudaArray             cudaMallocArray       cudaBindTextureToArray    tex1D or tex2D
3D cudaArray          cudaMalloc3DArray     cudaBindTextureToArray    tex3D
To add on: access using tex1Dfetch is based on integer indexing. The others are indexed with floating-point coordinates, and you have to add +0.5 to hit the texel center and get the exact value you want (for example, tex2D(tex, x + 0.5f, y + 0.5f)).
I'm also curious why you allocate a float buffer but bind it to a float2 texture; that may give ambiguous results. float2 is not a 2D float texture; it is a two-component vector type, which can, for example, be used to represent a complex number:

typedef struct { float x; float y; } float2;
I think this tutorial will help you understand how to use texture memory in CUDA:
http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/218100902
The kernel you have shown does not benefit much from using a texture. However, if utilized properly, by exploiting locality, texture memory can improve performance by quite a lot. It is also useful for interpolation.
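To make the table concrete for this case, here is a minimal sketch of the pitch linear route, reusing texGrad, width, and height from the question (the rest is an assumption about intent, not tested code):

// Sketch: allocate pitch linear memory and bind it with cudaBindTexture2D,
// so that tex2D becomes the correct read function.
float2 *d_grad = NULL;
size_t pitch = 0;
cudaMallocPitch(&d_grad, &pitch, width * sizeof(float2), height);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
cudaBindTexture2D(NULL, texGrad, d_grad, desc, width, height, pitch);

// In the kernel, with texGrad.normalized == false:
//   float2 g = tex2D(texGrad, x + 0.5f, y + 0.5f);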