I'm trying to "convolve" a featWidth * featHeight * 31 cube with another modelWidth * modelHeight * 31 cube. The problem is that this kernel is quite slow (well, I manage to be quicker than sequential CPU code, but only as fast as an OpenMP version). I'm using a Quadro FX 1800 (yeah, 64 CUDA cores...).
__constant__ float d_model[31*22*22];
#define IMUL(a,b) ( __mul24((a), (b)) )
#define IMAD(a,b,c) ( __mul24((a), (b)) + (c) )
__global__ void dMatch(float *score, const int featWidth, const int featHeight, const int modelWidth, const int modelHeight, const int scoreWidth, const int scoreHeight)
{
const int x = IMAD(blockIdx.x, blockDim.x, threadIdx.x);
const int y = IMAD(blockIdx.y, blockDim.y, threadIdx.y);
if(x < scoreWidth && y < scoreHeight)
{
const int scoreIdx = IMAD(x, scoreHeight, y);
score[scoreIdx] = 0.f;
const int baseFeatIdx = IMUL(x,scoreHeight) + IMAD(modelHeight-1, x, y);
for(int z = 0; z < 31; ++z)
{
// Index positioning
int featIdx = IMAD(z, IMUL(featWidth,featHeight), baseFeatIdx);
int modelIdx = IMUL(z, IMUL(modelWidth,modelHeight));
float value = 0.f;
// filter
for(int xx=0; xx<modelWidth; xx++)
{
const int xxmodelIdx = IMAD(xx, modelHeight, modelIdx);
const int xxfeatIdx = IMAD(xx, featHeight, featIdx);
for(int yy=0; yy<modelHeight; yy++)
{
value += d_model[xxmodelIdx+yy] * tex1Dfetch(texFeatures,xxfeatIdx+yy);
}
}
score[scoreIdx] += value;
}
}
}
Anyway, I launch this kernel with 8*8 threads per block and a grid size of (scoreWidth/8)*(scoreHeight/8) (scoreWidth and scoreHeight are the dimensions of the resulting matrix).
I'd like to know if you have any clue about what is wrong or slow in my code.
Edit:
A much faster version (150 ms drop for a 480 ms process!) thanks to tera:
__global__ void dMatch(float *score, const int featWidth, const int featHeight, const int modelWidth, const int modelHeight, const int scoreWidth, const int scoreHeight)
{
const int y = IMUL(4,IMAD(blockIdx.x, blockDim.x, threadIdx.x));
const int x = IMAD(blockIdx.y, blockDim.y, threadIdx.y);
if(x < scoreWidth && y < scoreHeight)
{
const int scoreIdx = IMAD(x, scoreHeight, y);
const int baseFeatIdx = IMUL(x,scoreHeight) + IMAD(modelHeight-1, x, y);
float value=0.f, value1 = 0.f, value2 = 0.f, value3 = 0.f;
float feat,feat1,feat2,feat3;
// Index positioning
int featIdx = 0;
int modelIdx = 0;
int xxmodelIdx;
int xxfeatIdx;
float val;
for(int z = 0; z < 31; ++z)
{
featIdx = IMAD(z,IMUL(featWidth,featHeight),baseFeatIdx);
modelIdx = IMUL(z,IMUL(modelWidth,modelHeight));
// filter
for(int xx=0; xx<modelWidth; xx++)
{
xxmodelIdx = IMAD(xx, modelHeight, modelIdx);
xxfeatIdx = IMAD(xx, featHeight, featIdx);
feat=tex1Dfetch(texFeatures,xxfeatIdx+0);
feat1=tex1Dfetch(texFeatures,xxfeatIdx+1);
feat2=tex1Dfetch(texFeatures,xxfeatIdx+2);
feat3=tex1Dfetch(texFeatures,xxfeatIdx+3);
for(int yy=0; yy<modelHeight; yy++)
{
val = d_model[xxmodelIdx+yy];
value += val * feat;
value1 += val * feat1;
value2 += val * feat2;
value3 += val * feat3;
feat = feat1;
feat1 = feat2;
feat2 = feat3;
feat3 = tex1Dfetch(texFeatures,xxfeatIdx+yy+4);
}
}
}
score[scoreIdx] = value;
if(y+1 < scoreHeight)
score[scoreIdx+1] = value1;
if(y+2 < scoreHeight)
score[scoreIdx+2] = value2;
if(y+3 < scoreHeight)
score[scoreIdx+3] = value3;
}
}
It is launched with dim3 threads(16,16); dim3 grid(divup(scoreHeight,64), divup(scoreWidth,16));
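The launch line uses a divup helper that is not shown in the post. A minimal host-side sketch of what it presumably does (ceiling division) is given below; the GridDims struct and the gridFor function are hypothetical names introduced only for illustration.

```cpp
// Hypothetical helper (not shown in the post): ceiling division for
// positive integers, so that a partial last block is still launched.
constexpr int divup(int total, int chunk) {
    return (total + chunk - 1) / chunk;
}

// Illustrative stand-in for dim3, so the sketch compiles without CUDA.
struct GridDims {
    int x;
    int y;
};

// Mirrors the launch above: blocks are 16x16 threads and each thread
// handles 4 consecutive y values, so one block covers 64 values of y
// and 16 values of x.
constexpr GridDims gridFor(int scoreWidth, int scoreHeight) {
    return GridDims{divup(scoreHeight, 64), divup(scoreWidth, 16)};
}
```

With scoreWidth = 100 and scoreHeight = 130, for instance, this gives a 3 x 7 grid; the bounds checks in the kernel discard the threads that fall outside the score matrix.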
What does the profiler say? NVIDIA Nsight (a plugin for Visual Studio on Windows and for Eclipse on Linux) lets you see where the stalls are and provides various hints for optimizing performance.
My guess (without looking at the profiler) is that your blocks are too small. A warp, the basic scheduling unit, contains 32 threads, so an 8*8 block is only 2 warps. An NVIDIA GPU is fast because it hides latency by running other warps while the current one waits for its previous instruction to complete. Although an SM can hold 8 blocks (on Tesla and Fermi) or 16 (on Kepler), that still gives you only 16-32 warps per SM at peak, which can be quite low (I may be wrong, but launching a block also has a certain latency). I would consider using much larger blocks.
The texture fetch is sub-optimal if I understand the code correctly: consecutive threads in a warp differ by 1 in x, so baseFeatIdx, and therefore featIdx and xxfeatIdx, differs by scoreHeight + modelHeight - 1 between them. The fetches of a warp are therefore scattered through the texture and do not exploit data locality. Swapping x and y would make it more efficient.
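To make the access pattern concrete, here is a small host-side C++ sketch that reproduces the baseFeatIdx arithmetic of the original kernel and measures how far apart the first fetches of neighbouring threads are. The function names strideAlongX and strideAlongY are made up for this illustration, and the sizes in the usage note are example values, not measurements from the post.

```cpp
// Same arithmetic as baseFeatIdx in the kernel, without the __mul24 macros.
constexpr int baseFeatIdx(int x, int y, int scoreHeight, int modelHeight) {
    return x * scoreHeight + (modelHeight - 1) * x + y;
}

// Distance in the feature array between the first fetches of two threads
// whose x differs by 1 (neighbouring threadIdx.x in the first kernel).
constexpr int strideAlongX(int scoreHeight, int modelHeight) {
    return baseFeatIdx(1, 0, scoreHeight, modelHeight) -
           baseFeatIdx(0, 0, scoreHeight, modelHeight);
}

// Same distance for two threads whose y differs by 1.
constexpr int strideAlongY(int scoreHeight, int modelHeight) {
    return baseFeatIdx(0, 1, scoreHeight, modelHeight) -
           baseFeatIdx(0, 0, scoreHeight, modelHeight);
}
```

For example, with scoreHeight = 100 and modelHeight = 22 the stride along x is 121 elements while the stride along y is 1, which is why mapping threadIdx.x to y (as the edited kernel in the question does) gives neighbouring threads neighbouring addresses.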
However, the good rule is to check with the profiler: if your problem is compute bound on the GPU, then you should concentrate on the computation. If it is memory bound, then you should look at the memory access pattern. Several other parts may look like candidates for optimization, but you won't know until you see what the bottleneck is. Once you know it, you might want to read the relevant chapter of the best practices guide.
Related
I wrote a dilation kernel in CUDA and it works well when my input and my output images are different buffers, but I am facing what I understand to be a memory race issue when I call my kernel in an in-situ case, i.e. the input and the output buffers point to the same memory location.
I tried :
a. using cooperative groups,
b. using a mutex and an atomic addition, as suggested in this paper and in several sources on the web,
c. using a lock-free inter-block synchronization, the synchronization proposed in this same paper.
All my attempts failed because :
a. did not work because my input buffer is a const pointer and I get a compilation error when I have to cast it to a void* parameter (which makes sense), so I could not go further.
b. did not work because I faced a weird behaviour: I have 16x16 blocks, each with 32x32 threads. Synchronizing the blocks should increase the mutex to 256, but the program hangs after 48 atomic additions.
c. did not work because there seems to be no inter-block synchronization, although the code I took directly from the paper looks good to me. I could slightly reduce the race effect by adding some __syncthreads().
This is the dilation function ;
template <typename T>
__global__ void GenericDilate2dImg_knl(const ImageSizeInfo imgSizeInfo,
volatile int* syncArrayIn, volatile int* syncArrayOut,
const unsigned long localSizeX, const unsigned long localSizeY,
const int borderPolicyType, const T outOfImageValue,
const struct StructuringElementInfo seInfo,
const T* pInBuf, T* pOutBuf)
{
// Extract sizeX, sizeY, etc. from imgSizeInfo
SPLIT_SIZES_FROM_STRUCT(imgSizeInfo)
// Declare the shared buffer pSharedBuf
extern __shared__ char pSharedMem[];
T* pSharedBuf = reinterpret_cast<T*>(pSharedMem);
const unsigned long x = blockDim.x * blockIdx.x + threadIdx.x;
const unsigned long y = blockDim.y * blockIdx.y + threadIdx.y;
const unsigned long planIdx = blockDim.z * blockIdx.z + threadIdx.z;
const unsigned long nbPlans = sizeZ * sizeC * sizeT;
const unsigned long idx = x + y * sizeX + planIdx * sizeX*sizeY;
// Copy the input image data into shared memory
if (x < blockDim.x * gridDim.x && y < blockDim.y * gridDim.y && planIdx < blockDim.z * gridDim.z) {
copyDataToSharedMemory2d(pInBuf, sizeX, sizeY, planIdx,
localSizeX, localSizeY,
seInfo._paddingX, seInfo._paddingY,
borderPolicyType, outOfImageValue,
pSharedBuf);
}
// Wait to ensure that the copy is terminated
if (pInBuf == pOutBuf) {
// Grid synchronization for in-situ case
//__gpu_sync(gridDim.x * gridDim.y); // Use a mutex
__gpu_sync2(1, syncArrayIn, syncArrayOut); // Use a lock-free barrier
}
else
// The input and output buffers point to different data
// -> we simply need to synchronize the threads inside the block
__syncthreads();
// Compute the convolution for pixels inside the image
if (x < sizeX && y < sizeY && planIdx < nbPlans) {
T vMax = 0;
for (unsigned int curCoefIdx = 0; curCoefIdx < seInfo._nbOffsets; ++curCoefIdx) {
const unsigned int sx = threadIdx.x + seInfo._paddingX + seInfo._pOffsetsX[curCoefIdx];
const unsigned int sy = threadIdx.y + seInfo._paddingY + seInfo._pOffsetsY[curCoefIdx];
const unsigned long sidx = sx + sy * localSizeX;
const T curVal = pSharedBuf[sidx];
vMax = (vMax > curVal ? vMax : curVal);
}
// Round the result
pOutBuf[idx] = vMax;
}
}
My function to copy from global to shared memory is :
template <typename T>
__device__ void copyDataToSharedMemory2d(const T* pInBuf,
const unsigned long sizeX, const unsigned long sizeY, const unsigned long planIdx,
const unsigned long localSizeX, const unsigned long localSizeY,
const int paddingX, const int paddingY,
const int borderPolicyType, const T outOfImageValue,
T* pSharedBuf)
{
const int x = blockDim.x * blockIdx.x + threadIdx.x;
const int y = blockDim.y * blockIdx.y + threadIdx.y;
const int localX = threadIdx.x;
const int localY = threadIdx.y;
// Fill the shared buffer tile by tile
// A tile is related to the group size
const unsigned int groupSizeX = blockDim.x;
const unsigned int groupSizeY = blockDim.y;
// For each tile
for (int offsetY = 0; offsetY < localSizeY; offsetY += groupSizeY) {
int curLocalY = localY + offsetY;
int curGlobalY = y + offsetY - paddingY;
for (int offsetX = 0; offsetX < localSizeX; offsetX += groupSizeX) {
int curLocalX = localX + offsetX;
int curGlobalX = x + offsetX - paddingX;
// If the current coordinate is inside the shared sub-image
if (curLocalX < localSizeX && curLocalY < localSizeY) {
const int idx = curLocalX + curLocalY * localSizeX;
pSharedBuf[idx] = getPixel2d(pInBuf, sizeX, sizeY, curGlobalX, curGlobalY, planIdx, borderPolicyType, outOfImageValue);
}
}
}
}
Where getPixel2d allows me to manage the data out of the image:
template <typename T>
__device__
T getPixel2d(const T* pInBuf,
const unsigned long sizeX, const unsigned long sizeY,
const int x, const int y, const int z,
const int borderPolicyType, const T outOfImageValue)
{
int x_inside = x;
if (x < 0 || x >= sizeX) {
switch (borderPolicyType) {
case 0://outside the image, there is a constant value
return outOfImageValue;
case 1://outside the image, we propagate the data at the image borders
if (x < 0)
x_inside = 0;
else // x >= sizeX
x_inside = sizeX - 1;
break;
case 2://Mirror effect
if (x < 0)
x_inside = -(x + 1);
else // x >= sizeX
x_inside = sizeX - ((x - sizeX) + 1);
break;
}
}
// y-coordinate inside the image
int y_inside = y;
if (y < 0 || y >= sizeY) {
switch (borderPolicyType) {
case 0://outside the image, there is a constant value
return outOfImageValue;
case 1://outside the image, we propagate the data at the image borders
if (y < 0)
y_inside = 0;
else // y >= sizeY
y_inside = sizeY - 1;
break;
case 2://Mirror effect
if (y < 0)
y_inside = -(y + 1);
else // y >= sizeY
y_inside = sizeY - ((y - sizeY) + 1);
break;
default: break;
}
}
return pInBuf[x_inside + y_inside * sizeX + z * sizeX * sizeY];
}
and now, here are my inter-block synchronization functions :
// Using a mutex
__device__ volatile int g_mutex;
__device__ void __gpu_sync(int goalVal) {
//thread ID in a block
int tid_in_block = threadIdx.x * blockDim.y + threadIdx.y;
// only thread 0 is used for synchronization
if (tid_in_block == 0) {
atomicAdd((int*)&g_mutex, 1);
printf("[%d] %d Vs %d\n", blockIdx.x * gridDim.y + blockIdx.y, g_mutex, goalVal);
//only when all blocks add 1 to g_mutex
//will g_mutex equal to goalVal
while (g_mutex </*!=*/ goalVal) {
;//Do nothing here
}
}
__syncthreads();
}
// Lock-free barrier
__device__ void __gpu_sync2(int goalVal, volatile int* Arrayin, volatile int* Arrayout) {
// thread ID in a block
int tid_in_blk = threadIdx.x * blockDim.y + threadIdx.y;
int nBlockNum = gridDim.x * gridDim.y;
int bid = blockIdx.x * gridDim.y + blockIdx.y;
// only thread 0 is used for synchronization
if (tid_in_blk == 0) {
Arrayin[bid] = goalVal;
}
if (bid == 1) {
if (tid_in_blk < nBlockNum) {
while (Arrayin[tid_in_blk] != goalVal) {
;//Do nothing here
}
}
__syncthreads();
if (tid_in_blk < nBlockNum) {
Arrayout[tid_in_blk] = goalVal;
}
}
if (tid_in_blk == 0) {
while (Arrayout[bid] != goalVal) {
;//Do nothing here
}
}
__syncthreads();
}
The image I get for in-situ calculation is :
I used an 11x15 structuring element, and the size of the shared buffer is (nbThreadsPerBlock+2*paddingX) * (nbThreadsPerBlock+2*paddingY). The wrong result (shown by the arrows) appears at the top of some blocks, but always at the same location and with the same values. I'd expect a more random result from a memory race...
Is there a better approach to manage in-situ calculation or any reason that would prevent the grid synchronization to work?
EDIT
The size of the image I used is 510x509 and I run my code on a NVidia Quadro RTX 5000.
I would normally suggest a minimal reproducible example for a question like this, as well as an indication of the GPU you are running on, but we can probably proceed without that. In short, what you are trying to do will not work reliably, as you've already discovered.
You have chosen a thread strategy of assigning one thread in your grid per output point:
pOutBuf[idx] = vMax;
which is sensible and fine. I imagine based on this:
I have 16x16 blocks, each with 32x32 threads.
that your input images are 512x512 (16x32 threads in each direction, one thread per output point).
And as you've already stated, you have 256 blocks (each of 1024 threads) in your grid. Furthermore, for the in-situ case, we can simplify your kernel to the following pseudo-code:
__global__ void GenericDilate2dImg_knl(...){
read_in_image();
grid_wide_sync();
write_out_image();
}
For such a methodology to work, then, the read_in_image() step must be able to read the entire image, before any writing occurs. However your methodology will not work in the general case, and evidently not on your specific GPU, either. In order to read in the entire image as per above, we must have every threadblock in the grid simultaneously resident on the SMs in your GPU. All 256 blocks need to be deposited, and running on an SM. But the GPU provides no inherent guarantees of such a thing. If your GPU has, for example 24 SMs in it, each of which can hold a maximum of 2048 threads, then your GPU would have a "running" or "instantaneous" capacity of 24*2048 threads, or 48 of your threadblocks. There would not be enough room for all 256 threadblocks to be running. Not only does your algorithm depend on that, but all 3 of your grid sync methods depend on that notion as well.
The fact that your 2nd grid sync method stops after 48 "atomic additions" suggested the example numbers above to me. It's a plausible proximal explanation for why that method may have failed that way: your GPU only allowed 48 of your threadblocks to be resident, and the other 208 threadblocks were waiting in the wings, not yet deposited on any SM, and therefore not allowing any of their threads to run. Those threads in those 208 threadblocks need to run to pick up the relevant input data, as well as to satisfy the requirements of the grid-wide sync. But they are not running, because they are waiting for room to open up on a SM. And room never opens up on a SM, because the full SMs have threadblocks that are waiting at the grid sync point. So you have deadlock.
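The arithmetic behind the 48-block figure can be written down as a small host-side sketch. The function names are hypothetical, the 24 SM / 2048 threads-per-SM numbers are the example values from the text above (not queried from a device), and the bound only considers the thread limit; register and shared-memory usage can lower it further.

```cpp
// Upper bound on the number of blocks that can be resident at once when the
// per-SM thread limit is the only constraint.
constexpr int maxResidentBlocks(int numSMs, int maxThreadsPerSM,
                                int threadsPerBlock) {
    return numSMs * (maxThreadsPerSM / threadsPerBlock);
}

// A grid-wide barrier can only complete if every block of the grid is
// resident at the same time.
constexpr bool gridCanSync(int blocksInGrid, int residentBlocks) {
    return blocksInGrid <= residentBlocks;
}
```

With the example numbers, maxResidentBlocks(24, 2048, 1024) is 48, so a 16x16 grid of 1024-thread blocks cannot pass a grid-wide barrier. In a real program these limits would be queried from the device rather than hard-coded, for instance with cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor.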
This problem is not easily solvable in the general case. Any grid sync mechanism, including cooperative groups, has an inherent requirement that all threadblocks be actually simultaneously schedulable on your particular GPU. Therefore in the general case, where we don't know the data set size or the GPU we will be running on, the problem is quite difficult.
One possible approach is to divide your input data set into regions, and have your kernel process a region at a time. This may require multiple grid syncs, one to handle the in/out division in each region, and one to handle the progression of the kernel as it steps through regions. You would also have to handle the region edges carefully.
Another possible approach, if you know the specifics of the data set size and the GPU you are running on, is just to make sure you are running on a GPU "large enough" to handle the data set size. For example, an A100 GPU could probably have as many as 216 blocks simultaneously resident, so for that case you could handle a somewhat smaller image size, perhaps 14x32=448 height and 448 width dimensions.
Given that these approaches for in-place or in-situ work for this particular example require considerable complexity, I personally would be strongly motivated to use the methodology where output is different than input. That approach will likely run noticeably quicker as well. A grid wide sync is not a "free" construct from a performance perspective.
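As a rough illustration of why the whole neighbourhood must be read before any output is written, here is a deliberately simplified, sequential 1-D analogue in C++ (not the poster's 2-D kernel; both function names are made up for this sketch). The in-place variant picks up values that were already overwritten, so the dilation spreads further than it should.

```cpp
#include <algorithm>
#include <vector>

// Out-of-place 1-D dilation: every output is computed from untouched input.
std::vector<int> dilateOutOfPlace(const std::vector<int>& in, int radius) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(in.size());
    for (int i = 0; i < n; ++i) {
        int best = in[i];
        for (int k = std::max(0, i - radius); k <= std::min(n - 1, i + radius); ++k)
            best = std::max(best, in[k]);
        out[i] = best;
    }
    return out;
}

// In-place variant: later pixels read neighbours that were already
// replaced by their dilated value, which corrupts the result.
void dilateInPlace(std::vector<int>& data, int radius) {
    const int n = static_cast<int>(data.size());
    for (int i = 0; i < n; ++i) {
        int best = data[i];
        for (int k = std::max(0, i - radius); k <= std::min(n - 1, i + radius); ++k)
            best = std::max(best, data[k]);
        data[i] = best;
    }
}
```

For the input {0, 0, 1, 0, 0} and radius 1, the out-of-place version yields {0, 1, 1, 1, 0}, whereas the in-place version yields {0, 1, 1, 1, 1}. On the GPU the order in which blocks run is not fixed, but the mechanism is the same: a block that starts late reads halo pixels that an earlier block has already overwritten.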
I just started with CUDA and I have a question.
I have an N*N matrix and a window of size 8x8. I want to subdivide the matrix into multiple sub-matrices and find the max value of each.
For example, if I have a 64*64 matrix, I will have 64 small 8*8 matrices and need to find 64 max values. Finally I save all the max values into a new array, but their order always changes. I want to find a solution that keeps them in the right order.
__global__ void calculate_emax_kernel(float emap[],float emax[], int img_height, int img_width,int windows_size)
{
int x_index = blockIdx.x*blockDim.x+threadIdx.x;
int y_index = blockIdx.y*blockDim.y+threadIdx.y;
int num_row_block = img_height/windows_size;
int num_col_block = img_width/windows_size;
__shared__ float window_elements[256];
__shared__ int counter;
__shared__ int emax_count;
if (threadIdx.x == 0) emax_count = 0;
__syncthreads();
int index;
int emax_idx = 0;
if(y_index >= img_height|| x_index >= img_width) return;
for(int i = 0; i < num_row_block; i++)
{
for(int j = 0; j < num_col_block; j++)
{
counter = 0;
if(y_index >= i*windows_size && y_index < (i+1)*windows_size
&& x_index >= j*windows_size && x_index < (j+1)*windows_size)
{
int idx = y_index*img_height + x_index;
index = atomicAdd(&counter, 1);
window_elements[index] = emap[idx];
__syncthreads();
// reduction
unsigned int k = (windows_size*windows_size)/2;
while(k != 0)
{
if(index < k)
{
window_elements[index] = fmaxf(window_elements[index], window_elements[index+k]);
}
k /= 2;
}
if(index == 0)
{
emax[i*num_row_block+j] = window_elements[index];
}
}
__syncthreads();
}
__syncthreads();
}
__syncthreads();
}
This is my configuration
void construct_emax(float *input,float *output, int img_height, int img_width)
{
int windows_size = 4;
float * d_input, * d_output;
cudaMalloc(&d_input, img_width*img_height*sizeof(float));
cudaMalloc(&d_output, img_width*img_height*sizeof(float));
cudaMemcpy(d_input, input, img_width*img_height*sizeof(float), cudaMemcpyHostToDevice);
dim3 blocksize(16,16);
dim3 gridsize;
gridsize.x=(img_width+blocksize.x-1)/blocksize.x;
gridsize.y=(img_height+blocksize.y-1)/blocksize.y;
calculate_emax_kernel<<<gridsize,blocksize>>>(d_input,d_output,img_height,img_width,windows_size);
}
With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.
For large window size, I don't think it is a problem. You could use one thread block to reduce one window. For example if you have a 16x16 window, you could simply use 16x16 thread block. If you have even larger window size, for example 64x64, you could still use 16x16 thread block. First reduce the 64x64 window to 16x16 elements during data loading, then reduce to 1 scalar within the thread block.
For window size smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for 16 4x4 windows. But this will not be optimal because each 32-thread warp is organized in two parts (2x16). This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.
Alternatively I would suggest you use 1-D thread block with 256 threads. Every m threads reduce one mxm window. Then you could use 2-D grid to cover the whole image.
const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);
In the kernel function, you could
reduce each mxm window to a 1xm vector during global data loading;
use tree reduction method to reduce the 1xm vector to a scalar.
The following code is a conceptual demo which works when m is a power of 2 and m <= 32. You could further modify it for arbitrary m and better boundary checking.
#include <assert.h>
#include <cuda.h>
#include <thrust/device_vector.h>
__global__ void calculate_emax_kernel(const float* input, float* output,
int height, int width, int win_size,
int out_width) {
const int tid = threadIdx.x;
const int i = blockIdx.y * win_size;
const int j = blockIdx.x * 256 + tid;
const int win_id = j % win_size;
__shared__ float smax[256];
float tmax = -1e20;
if (j < width) {
for (int tile = 0; tile < win_size; tile++) {
if (i + tile < height) {
tmax = max(tmax, input[(i + tile) * width + j]);
}
}
}
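smax[tid] = tmax;
// Make every partial maximum visible before the tree reduction starts
__syncthreads();
for (int shift = win_size / 2; shift > 0; shift /= 2) {
if (win_id < shift) {
smax[tid] = max(smax[tid], smax[tid + shift]);
}
// Finish this reduction level before the next one reads smax
__syncthreads();
}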
if (win_id == 0 && j < width) {
output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
}
}
int main() {
const int height = 1024;
const int width = 1024;
const int m = 4;
thrust::device_vector<float> in(height * width);
thrust::device_vector<float> out(
((height + m - 1) / m) * ((width + m - 1) / m));
dim3 blocksize(256);
dim3 gridsize((width + 255) / 256, (height + m - 1) / m);
assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
calculate_emax_kernel<<<gridsize, blocksize>>>(
thrust::raw_pointer_cast(in.data()),
thrust::raw_pointer_cast(out.data()),
height, width, m, (width + m - 1) / m);
return 0;
}
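To check that a kernel like the one above writes each window maximum to the right slot, it can help to compare against a plain CPU reference. The sketch below is one way to write such a reference in host-side C++; the function name windowMaxReference is hypothetical, and it assumes the same row-major layout and the same output indexing (window row times out_width plus window column) as the demo kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// CPU reference: maximum of every m x m window of a row-major image.
// Window (wr, wc) is stored at out[wr * outW + wc], matching the kernel.
std::vector<float> windowMaxReference(const std::vector<float>& input,
                                      int height, int width, int m) {
    const int outH = (height + m - 1) / m;
    const int outW = (width + m - 1) / m;
    std::vector<float> out(static_cast<std::size_t>(outH) * outW,
                           std::numeric_limits<float>::lowest());
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            float& cell = out[static_cast<std::size_t>(r / m) * outW + c / m];
            cell = std::max(cell, input[static_cast<std::size_t>(r) * width + c]);
        }
    }
    return out;
}
```

For a 4x4 image holding the values 0 to 15 in row-major order and m = 2, the reference returns {5, 7, 13, 15}. Because the output position depends only on the window coordinates, the order is deterministic, unlike a scheme that assigns slots with an atomic counter.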
In case you're willing to use a library, here are a few pointers:
use NPP, a set of primitives (from NVIDIA)
https://docs.nvidia.com/cuda/npp/group__image__filter__max.html
use CUB, a lower-level library for other reduce operations and finer control over how you use the hardware (from NVIDIA / NVlabs)
http://nvlabs.github.io/cub/
I am trying to work with 3D arrays in CUDA (200x200x100).
The moment I change my z dimension (model_num) from 4 to 5, I get a segmentation fault. Why, and how can I fix it?
const int nrcells = 200;
const int nphicells = 200;
const int model_num = 5; //So far, 4 is the maximum model_num that works. At 5 and after, there is a segmentation fault
__global__ void kernel(float* mgridb)
{
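const unsigned long long int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
// Assumed mapping, inferred from the launch configuration in main
const int tx = threadIdx.x; // phi cell
const int ty = blockIdx.x; // r cell
const int tz = blockIdx.y; // model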
if(tx >= 0 && tx < nphicells && ty >=0 && ty < nrcells && tz >= 0 && tz < model_num){
//Do stuff with mgridb[i]
}
}
int main (void)
{
unsigned long long int size_matrices = nphicells*nrcells*model_num;
unsigned long long int mem_size_matrices = sizeof(float) * size_matrices;
float *h_mgridb = (float *)malloc(mem_size_matrices);
float mgridb[nphicells][nrcells][model_num];
for(int k = 0; k < model_num; k++){
for(int j = 0; j < nrcells; j++){
for(int i = 0; i < nphicells; i++){
mgridb[i][j][k] = 0;
}
}
}
float *d_mgridb;
cudaMalloc( (void**)&d_mgridb, mem_size_matrices );
cudaMemcpy(d_mgridb, h_mgridb, mem_size_matrices, cudaMemcpyHostToDevice);
int threads = nphicells;
uint3 blocks = make_uint3(nrcells,model_num,1);
kernel<<<blocks,threads>>>(d_mgridb);
cudaMemcpy( h_mgridb, d_mgridb, mem_size_matrices, cudaMemcpyDeviceToHost);
cudaFree(d_mgridb);
return 0;
}
This is getting stored on the stack:
float mgridb[nphicells][nrcells][model_num];
Your stack space is limited. When you exceed the amount you can store on the stack, you are getting a seg fault, either at the point of allocation, or as soon as you try and access it.
Use malloc instead. That allocates heap storage, which has much higher limits.
None of the above has anything to do with CUDA. Furthermore, it's not unique or specific to "3D" arrays. Any large stack-based allocation (e.g. a 1D array) is going to have the same trouble.
You may also have to adjust how you access the array, but it's not difficult to handle a flattened array using pointer indexing.
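A minimal host-side sketch of the flattened, heap-allocated alternative is shown below. It uses std::vector instead of malloc, and the helper names flatIndex and makeGrid are hypothetical; the index formula assumes the same [nphicells][nrcells][model_num] ordering as the stack array in the question, which may or may not be the layout the kernel expects.

```cpp
#include <cstddef>
#include <vector>

constexpr int nrcells = 200;
constexpr int nphicells = 200;
constexpr int model_num = 5;

// Row-major flattening of mgridb[i][j][k] with dimensions
// [nphicells][nrcells][model_num]: k varies fastest, i slowest.
constexpr std::size_t flatIndex(int i, int j, int k) {
    return (static_cast<std::size_t>(i) * nrcells + j) * model_num + k;
}

// Heap-backed, zero-initialised replacement for the stack array.
std::vector<float> makeGrid() {
    return std::vector<float>(
        static_cast<std::size_t>(nphicells) * nrcells * model_num, 0.0f);
}
```

An element is then accessed as grid[flatIndex(i, j, k)], and grid.data() can be handed to cudaMemcpy in the same way as h_mgridb in the question.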
Your code is actually strange looking, because you are creating an appropriately sized array h_mgridb using malloc and then copying that array to the device (into d_mgridb). It's not clear what purpose mgridb serves in your code. h_mgridb and mgridb are not the same.
Serial code snippet looks like this:
int i, j;
for(j=0; j<ny; j++)
{
for(i=0; i<nx; i++)
{
x[i + j*nx] *= y[i];
}
}
I converted this to CUDA using this kernel:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i,j;
for(tid = 0; tid <nx*ny; tid++)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
If this is the serial code:
int i, j;
for(j=0; j<ny; j++)
{
for(i=0; i<nx; i++)
{
x[i + j*nx] *= y[i];
}
}
then you should be doing this:
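__global__ void fn(float *x, const float *y, int nx, int ny)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= nx * ny) return; // the grid may be larger than the data
int j = tid/nx, i = tid - j * nx;
x[tid] *= y[i];
}
fn<<<(nx*ny + B - 1)/B, B>>>(x, y, nx, ny); // with B = 256, 512, etc.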
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny, and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're actually doing the loop less efficiently than you did for the CPU version; in other words, you do the same thing in each thread, just less efficiently, than you are doing in 1 thread on the CPU. It's no wonder that this is slower; it should be much, much slower. Your CUDA kernel is:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i,j;
for(tid = 0; tid <nx*ny; tid++)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
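This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU. Worse, with more than one thread every element of x is multiplied by y[i] once per thread (and those updates race with each other), so the result will generally not match the serial code.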
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Your comment on this answer:
the nx * ny = 2205; so I used no. of blocks =
(nx*ny+(threads-1))/threads and threads = 64.
implies that you intend to launch one thread per computation, so the correct CUDA implementation would just be:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx;
int i = tid - j*nx;
if (tid < (nx*ny))
x[tid] *= y[i];
If you intended each thread to perform more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU, rather than use as many threads as there are input values, and then do something like:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int gsize = blockDim.x * gridDim.x;
int i,j;
for(; tid <nx*ny; tid+=gsize)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but they would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. However, having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would need problems many orders of magnitude larger to realize a useful speedup using the GPU for this sort of low arithmetic intensity operation.
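To see that a grid-stride loop touches every element exactly once regardless of how many threads are launched, the loop can be emulated on the CPU. The sketch below is only such an emulation in plain C++ (the outer loop plays the role of the threads, and the function name is made up); it is not a replacement for the kernel.

```cpp
#include <vector>

// CPU emulation of the grid-stride kernel: `gsize` stands for
// blockDim.x * gridDim.x, and `thread` for each thread's starting tid.
void scaleRowsGridStride(std::vector<float>& x, const std::vector<float>& y,
                         int nx, int ny, int gsize) {
    for (int thread = 0; thread < gsize; ++thread) {
        for (int tid = thread; tid < nx * ny; tid += gsize) {
            const int j = tid / nx;      // row
            const int i = tid - j * nx;  // column, selects the factor y[i]
            x[tid] *= y[i];
        }
    }
}
```

With nx = 3, ny = 2, x filled with ones, y = {2, 3, 4} and gsize = 4, every element is scaled once and x becomes {2, 3, 4, 2, 3, 4}; the same holds for any positive gsize, including values larger than nx*ny.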
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.
Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
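One way to follow this advice is to run a few untimed warm-up iterations and then average over many timed ones. The helper below is a generic host-side sketch with a hypothetical name; it times any callable with std::chrono. For a CUDA kernel, the callable would need to include the launch and a cudaDeviceSynchronize(), since kernel launches return before the kernel has finished.

```cpp
#include <chrono>

// Runs `work` a few times untimed (to absorb one-off start-up costs),
// then returns the average wall-clock time of the timed runs in ms.
template <typename Work>
double averageMilliseconds(Work&& work, int warmupRuns, int timedRuns) {
    for (int i = 0; i < warmupRuns; ++i) {
        work();
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timedRuns;
}
```

Whether to include the host-to-device copies in the timed region depends on what is being compared; for a problem as small as the one in the question, they are likely to dominate.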
Try this using shared memory. One of the best implementations around:
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
int width;
int height;
int stride; // In number of elements
float *elements;
} Matrix;
// Thread block size
#define BLOCK_SIZE 16
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
Matrix Asub;
Asub.width = BLOCK_SIZE; Asub.height = BLOCK_SIZE;
Asub.stride = A.stride;
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row +
BLOCK_SIZE * col];
return Asub;
}
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
// Same as in previous example, except the followings:
// d_A.width = d_A.stride = A.width;
// d_B.width = d_B.stride = B.width;
// d_C.width = d_C.stride = C.width;
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0;
// Thread row and column within Csub
int row = threadIdx.y;
int col = threadIdx.x;
// Loop over all the sub-matrices of A and B that are
// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
{
// Get sub-matrix Asub of A and Bsub of B
Matrix Asub = GetSubMatrix(A, blockRow, m);
Matrix Bsub = GetSubMatrix(B, m, blockCol);
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load Asub and Bsub from device memory to shared memory
// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += As[row][e] * Bs[e][col];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write Csub to device memory
// Each thread writes one element
SetElement(Csub, row, col, Cvalue);
}
I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
int localMatches = 0;
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = threadIdx.x + threadIdx.y * blockDim.x;
int blockThreads = blockDim.x * blockDim.y;
__shared__ int localMatchCounts[32];
bool breaking = false;
for(int i = 0; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
{
if(texts[blockId][i] == symbol[0])
{
for(int j = 1; j < symbolLength; j++)
{
if(texts[blockId][i + j] != symbol[j])
{
breaking = true;
break;
}
}
if (breaking) continue;
localMatches++;
}
}
localMatchCounts[threadId] = localMatches;
__syncthreads();
if(threadId == 0)
{
int sum = 0;
for(int i = 0; i < 32; i++)
{
sum += localMatchCounts[i];
}
matches[blockId] = sum;
}
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64-bit, for what it's worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels, since the kernels have no access to the host's memory.
It is better to allocate a single contiguous buffer and to divide it in a manner that enables parallel access.
In this case I'd define a 1D array that contains all the strings positioned one after another, and another 1D array, of size 2*numberOfStrings, that contains the offset of each string within the first array and its length:
For example - preparation for kernel:
char* buffer = st[0] + st[1] + st[2] + ....; // pseudocode: all strings concatenated
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < 2 * numberOfStrings; cnt += 2)
{
metadata[cnt] = lastpos; // offset of string cnt/2 in buffer
metadata[cnt + 1] = length(st[cnt / 2]); // length of string cnt/2
lastpos += length(st[cnt / 2]);
}
In kernel:
currentIndex = threadId + blockId * threadsPerBlock; // global thread index
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
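The host-side preparation described above can be sketched in runnable C++ as follows. The PackedStrings struct and the packStrings function are names invented for this illustration; the layout follows the description in the answer (offset at metadata[2*s], length at metadata[2*s + 1]).

```cpp
#include <string>
#include <vector>

// Flattened representation of a list of strings.
struct PackedStrings {
    std::string buffer;         // all strings stored back to back
    std::vector<int> metadata;  // [2*s] = offset of string s, [2*s+1] = length
};

// Concatenates the strings and records where each one starts and how long it is.
PackedStrings packStrings(const std::vector<std::string>& st) {
    PackedStrings packed;
    packed.metadata.reserve(2 * st.size());
    int lastpos = 0;
    for (const std::string& s : st) {
        packed.metadata.push_back(lastpos);
        packed.metadata.push_back(static_cast<int>(s.size()));
        packed.buffer += s;
        lastpos += static_cast<int>(s.size());
    }
    return packed;
}
```

Both packed.buffer.data() and packed.metadata.data() point to contiguous storage, so each can be copied to the device with a single cudaMemcpy, and the kernel then needs only plain char* and int* parameters.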
The problem seems to be associated with the char** parameter. Turning it into a char* removed the warning, so I suspect that CUDA has problems with this form of data. Perhaps CUDA prefers that one use its specific 2D arrays in this case.