CUDA kernel fails when block number is smaller than the maximum

I have written a CUDA function that does the same thing as cublasSdgmm in cuBLAS, and I find that when I increase the block number, performance can get worse, or the kernel can even fail.
Here is the code. M = 9.6e6, S = 3, the best-performing block number is 320, my GPU is a GTX 960, and the maximum grid size in the X dimension is 2147483647.
__global__ void DgmmKernel(float *d_y, float *d_r, int M, int S)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    while (row < M) {
        d_y[row + col * M] *= d_r[row];
        row += blockDim.x * gridDim.x;
    }
}
void Dgmm(float *d_y, float *d_r, int M, int S)
{
    int xthreads_per_block = 1024;
    dim3 dimBlock(xthreads_per_block, 1);
    dim3 dimGrid(320, S);
    DgmmKernel<<<dimBlock, dimGrid>>>(d_y, d_r, M, S);
}
I guess the reason is that there is some resource limit in the GPU. Is that right?
If so, which specific resource limits the performance? The kernel just reads two vectors and does a multiplication. And is there any way to improve performance on my GPU?

You have the block and grid dimension arguments reversed in your kernel launch, so your kernel is not running with the configuration you intend. You should do something like this:
dim3 dimBlock(xthreads_per_block, 1);
dim3 dimGrid(320, S);
DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);
If your code contained appropriate runtime error checking, you would already be aware that the kernel launch is failing with an invalid configuration error for any value of S>3.
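As an illustration, a minimal sketch of that kind of runtime error checking (the macro name and placement are mine, not from the original code):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: check the result of a CUDA runtime call or kernel launch
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After the (corrected) launch:
DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);
CHECK_CUDA(cudaGetLastError());        // catches invalid configuration errors
CHECK_CUDA(cudaDeviceSynchronize());   // catches errors raised during execution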

Related

Modifying the basic VECADD example to use shared memory

I wrote the following kernel to use shared memory in the basic CUDA vecadd example (sum of two vectors). The code works, but the elapsed time for the kernel execution is the same as for the original code. Can someone suggest a way to easily speed up this code?
__global__ void vecAdd(float *in1, float *in2, float *out, long int len)
{
    __shared__ float s_in1[THREADS_PER_BLOCK];
    __shared__ float s_in2[THREADS_PER_BLOCK];
    unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    s_in1[threadIdx.x] = in1[xIndex];
    s_in2[threadIdx.x] = in2[xIndex];
    out[xIndex] = s_in1[threadIdx.x] + s_in2[threadIdx.x];
}
Can someone suggest a way to easily speed up this code?
There are basically no useful optimizations to make on an operation like vector addition. Because of the nature of the calculation, the code could only ever hope to reach 50% peak arithmetic throughput, and the requirement for three memory transactions per FLOP makes this an intrinsically memory bandwidth bound operation.
As a result, this:
__global__ void vecAdd(float *in1, float *in2, float *out, unsigned int len)
{
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (xIndex < len) {
        float x = in1[xIndex];
        float y = in2[xIndex];
        out[xIndex] = x + y;
    }
}
is about the best performing variant on most recent hardware, if the block size is selected for maximum occupancy and len is sufficiently large. For example:
int minGrid, minBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &minBlockSize, vecAdd);
int nblocks = (len / minBlockSize) + ((len % minBlockSize > 0) ? 1 : 0);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
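To verify that the kernel really is bandwidth bound, a rough check (a sketch only; the event handling is mine, and x, y, z, len are assumed to be set up as above) is to time the launch and compute the effective bandwidth from the three transactions per element:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// two loads + one store of 4 bytes per element, reported in GB/s
double gbps = (3.0 * len * sizeof(float)) / (ms * 1.0e6);

If that figure comes out close to the GPU's peak memory bandwidth, there is nothing left to gain from shared memory or any other kernel-side change.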

CUDA cudaMemcpy2D not giving expected results [duplicate]

How do I initialize device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize to any value other than 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr, value, number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset(void *devPtr, int value, size_t count)
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr will be set to 5. If devPtr were an array of integers, the result would be that each integer word has the value 84215045 (i.e. 0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T *devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16 bit and 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
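For completeness, a usage sketch for the template above, matching the original question's goal of setting an int array to 5 (the array size and launch configuration are assumptions):

const size_t nwords = 1 << 20;
int *d_vals;
cudaMalloc((void **)&d_vals, nwords * sizeof(int));

const int block = 256;
const int grid = (int)((nwords + block - 1) / block);   // round up to cover all words
initKernel<int><<<grid, block>>>(d_vals, 5, nwords);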
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride) nor, for that matter, the kernel invocation and the counter-intuitive word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx)
{
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T><<<gridDim, blockDim>>>(a, value, n, incx);
}
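As a hedged usage example (the matrix, its layout, and the names d_A, M, N, r are mine, not from the answer): to set row r of a column-major M x N matrix to zero, offset the pointer by r and stride by the leading dimension M:

// row r occupies elements r, r + M, r + 2M, ... of the column-major storage
deviceInitializeArray<float>(d_A + r, 0.0f, (size_t)M * N, (size_t)M);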

CUDA: sum-reduction --- data lost in call to device function [duplicate]

I'm aware that there are multiple questions similar to this one already answered, but I've been unable to piece together anything very helpful from them, other than that I'm probably indexing something incorrectly.
I'm trying to perform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements)
{
    extern __shared__ int S[];
    // Each thread loads one element from global to shared memory
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        S[tid] = A[i];
        __syncthreads();
        // Reduce in shared memory
        for (int t = blockDim.x / 2; t > 0; t >>= 1) {
            if (tid < t) {
                S[tid] += S[tid + t];
            }
            __syncthreads();
        }
        if (tid == 0) B[blockIdx.x] = S[0];
    }
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting an unspecified launch failure, which I've read is similar to a segfault. I've been following the NVIDIA reduction documentation closely and have tried to keep my kernel within the bounds of numElements, but I seem to be missing something key, considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is an out-of-bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.
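Note that after this kernel runs, d_B holds one partial sum per block rather than the final total. A minimal sketch of finishing the reduction on the host (the variable names are assumptions) could be:

int *h_B = (int *)malloc(blocksPerGrid * sizeof(int));
cudaMemcpy(h_B, d_B, blocksPerGrid * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < blocksPerGrid; ++b)
    total += h_B[b];
free(h_B);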

CUDA 2D Convolution kernel

I'm a beginner in CUDA and I'm trying to implement a Sobel edge detection kernel.
I'm using this code for it, but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1's and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;
    for (k1 = 0; k1 < 3; k1++)
        for (k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1*3 + k2] * gpu_P[(X - k1)*H + Y - k2];
    gpu_Edge_Hor[X*H + Y] = sum / 5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
First, your problem is that you are processing an image of 480x720 pixels. CUDA supports a maximum thread block size of 1024 threads for compute capability 2.0 and greater, and 512 for earlier devices, so you can't execute that many threads in one block. The line dim3 dimBlock(W,H); is incorrect. You should divide your threads into several blocks.
Another problem is that CUDA processes data in row-major order, so you should change your memory access pattern.
The right memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
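Putting both points together, a hedged sketch of a corrected launch and indexing (the 16x16 tile size is an assumption, not part of the answer):

// Host side: split the W x H image across many 16x16 blocks
dim3 dimBlock(16, 16);
dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,
             (H + dimBlock.y - 1) / dimBlock.y);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);

// Device side: compute global coordinates and guard the padded threads
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
if (X >= (unsigned int)W || Y >= (unsigned int)H) return;
// then index row-major, e.g. gpu_P[Y * W + X]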

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance is not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking of exploiting shared memory to avoid the memory latency. But apparently something is wrong, because my application became really slow since I started using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory)
{
    int j = 0;
    // index GPU - Transaction-wise
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    // shared variable
    extern __shared__ int shared_memory_data[];
    extern __shared__ int shared_b1_data[];
    extern __shared__ int shared_b2_data[];
    // copy from global memory into shared memory and sync threads
    shared_b1_data[tid] = b1_memory[tid];
    shared_b2_data[tid] = b2_memory[tid];
    __syncthreads();
    // AND each int bitwise
    for (j = 0; j < b1_size; j++)
        shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
    // write result for this block to global memory
    _memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the sizes of b1 and b2, since they depend on the number of customers, which I only know at runtime (but both always have the same size).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
    int* _memory_device;
    int* b1_memory;
    int* b2_memory;
    int b1_size = b1.getIntSize();
    // allocate memory on GPU
    (cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
    // copy values on GPU
    (cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    (cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    (cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice));
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);
    AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
    // return values
    (cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost));
    // Free Memory
    (cudaFree(b1_memory));
    (cudaFree(b2_memory));
    (cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits for each element. The number of elements depends on the number of customers. I also have a problem with the kernel's parameters, because if I add some blocks or threads, AndBitwiseOperation() no longer gives me the correct result. With just 1 block and 1 thread per block the result is correct, but then the kernel is not parallel.
Every piece of advice is welcome :)
Thank you
I did not really understand what your kernel is supposed to do.
You should read more about CUDA and GPU programming.
I have tried to point out some of the mistakes.
Shared memory (sm) should reduce global memory reads.
Analyze your global memory (gm) read and write operations per thread:
a. You read global memory twice and write shared memory twice.
b. (nonsense loop ignored, no use of the index) You read shared memory twice and write shared memory once.
c. You read shared memory once and write global memory once.
So in total you have gained nothing. You could use global memory directly.
You use all threads to write out one value at the block index "i".
You should only use one thread to write this data out.
It makes no sense to output the same data from multiple threads, which will get serialized.
You use a loop and never use the loop counter.
You write at "tid" and read at "i", seemingly at random.
This assignment is overhead:
unsigned int tid = threadIdx.x;
The results cannot be correct with more than one block, since with one block tid = i!
All the wrong indexing leads to wrong calculations when using more than one block.
The shared memory at "i" was never written:
_memory_device[i] = shared_memory_data[i];
My assumption of what your kernel should do:
/*
 * Call kernel with x-block usage and up to 3D grid
 */
__global__ void bitwiseAnd(int* outData_g,
                           const long long int inSize_s,
                           const int* inData1_g,
                           const int* inData2_g)
{
    // get unique block index
    const unsigned long long int blockId = blockIdx.x                            // 1D
                                         + blockIdx.y * gridDim.x                // 2D
                                         + gridDim.x * gridDim.y * blockIdx.z;   // 3D
    // get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
    // check global unique thread range
    if (threadId >= inSize_s)
        return;
    // output bitwise AND
    outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
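A hedged launch sketch for this kernel (the 256-thread block size is my choice; the pointer names follow the question's host code):

const int threadsPerBlock = 256;
const long long int nInts = b1_size;                                          // number of int words to AND
const int blocks = (int)((nInts + threadsPerBlock - 1) / threadsPerBlock);    // a 1D grid is enough here
bitwiseAnd<<<blocks, threadsPerBlock>>>(_memory_device, nInts, b1_memory, b2_memory);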
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size of the extern __shared__ arrays, and defaults to 0.
I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:
extern __shared__ float array[];
__device__ void func()          // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 = (int*)&array1[64];
}
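Applied to the kernel in the question, a hedged sketch (the partitioning and the threadsPerBlock launch parameter are illustrative) of carving one dynamic allocation into the three arrays it declares, and passing the total size through the third launch argument:

// Inside the kernel: one extern array, partitioned manually
extern __shared__ int shared_block[];
int* shared_b1_data     = shared_block;                      // blockDim.x ints
int* shared_b2_data     = shared_block + blockDim.x;         // blockDim.x ints
int* shared_memory_data = shared_block + 2 * blockDim.x;     // blockDim.x ints

// At launch: reserve room for all three partitions via Ns
size_t shmsz = 3 * (size_t)threadsPerBlock * sizeof(int);
AndBitwiseOperation<<<dimGrid, dimBlock, shmsz>>>(_memory_device, b1_size, b1_memory, b2_memory);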