Dynamically allocated shared memory in CUDA: Execution Configuration

What does Nvidia mean by this?
Ns is of type size_t and specifies the number of bytes in shared
memory that is dynamically allocated per block for this call in
addition to the statically allocated memory; this dynamically
allocated memory is used by any of the variables declared as an
external array as mentioned in __shared__; Ns is an optional
argument which defaults to 0;
The size of shared memory on my GPU is 48 kB.
For example, I want to run 4 kernels at the same time, each of them using 12 kB of shared memory.
In order to do that, should I call the kernel this way:
kernel<<< gridSize, blockSize, 12 * 1024 >>>();
or should the third argument be 48 * 1024?

Ns is a size in bytes; 12 kB is 12*1024 = 12288 bytes.
Note that the Ns value is PER BLOCK: it is the amount of dynamically allocated shared memory given to each block executing on the device, not a total for the whole kernel. So if each block of your kernel needs 12 kB, you pass 12*1024 (not 48*1024); if 12 kB were the total for a whole kernel, you would want something along the lines of 12*1024/number_of_blocks.
Kernel launching with concurrency:
If, as mentioned in a comment, you are using streams, there is a fourth argument in the kernel launch, which is the CUDA stream.
If you want to launch a kernel on another stream without any dynamic shared memory, it will look like:
kernel_name<<<128, 128, 0, mystream>>>(...);
but concurrency is a whole different issue.
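Tying it back to the original question, a minimal sketch of the concurrent case (4 launches of the same kernel, each with 12 kB of dynamic shared memory per block, on separate streams) might look like the following; kernel, gridSize and blockSize are just the placeholder names from the question, and whether the launches actually overlap depends on the device's free resources:
cudaStream_t streams[4];
for (int i = 0; i < 4; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < 4; ++i)
    // 3rd argument: 12 kB of dynamic shared memory per block of this launch
    kernel<<<gridSize, blockSize, 12 * 1024, streams[i]>>>();

for (int i = 0; i < 4; ++i)
    cudaStreamDestroy(streams[i]);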

Related

The shared memory size is limited to the maximum thread number when using the atomicAdd function

I use atomic operations to calculate a summation of values, like a histogram.
So I first use shared memory to store the values within each block, and then the values stored in shared memory in each block are saved to global memory.
The whole code follows.
__global__ void KERNEL_RIGID_force_sum(part1*P1,part3*P3,int_t*nop_sol,Real*xcm,Real*ycm,Real*zcm,Real*sum_fx,Real*sum_fy,Real*sum_fz)
{
    int_t i=threadIdx.x+blockIdx.x*blockDim.x;
    if(i>=k_num_part2) return;
    if(P1[i].i_type==3) return;
    // if(P1[i].p_type<RIGID) return;
    // initialize accumulation array in shared memory
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128],tmp_ycm[128],tmp_zcm[128];
    __shared__ Real tmp_fx[128],tmp_fy[128],tmp_fz[128];
    tmp_nop[threadIdx.x]=0;
    tmp_xcm[threadIdx.x]=0;
    tmp_ycm[threadIdx.x]=0;
    tmp_zcm[threadIdx.x]=0;
    tmp_fx[threadIdx.x]=0;
    tmp_fy[threadIdx.x]=0;
    tmp_fz[threadIdx.x]=0;
    __syncthreads();
    Real xi,yi,zi;
    Real fxi,fyi,fzi;
    int_t ptypei;
    ptypei=P1[i].p_type;
    xi=P1[i].x;
    yi=P1[i].y;
    zi=P1[i].z;
    fxi=P3[i].ftotalx;
    fyi=P3[i].ftotaly;
    fzi=P3[i].ftotalz;
    // save values to shared memory
    atomicAdd(&tmp_nop[ptypei],1);
    atomicAdd(&tmp_xcm[ptypei],xi);
    atomicAdd(&tmp_ycm[ptypei],yi);
    atomicAdd(&tmp_zcm[ptypei],zi);
    atomicAdd(&tmp_fx[ptypei],fxi);
    atomicAdd(&tmp_fy[ptypei],fyi);
    atomicAdd(&tmp_fz[ptypei],fzi);
    __syncthreads();
    // save shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x],tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x],tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x],tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x],tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x],tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x],tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x],tmp_fz[threadIdx.x]);
}
But there are some problems.
Because the thread block size in my code is 128, I allocate the shared memory and global memory arrays with size 128.
What can I do if I want to use shared memory larger than the maximum thread block size of 1,024 (i.e. when there are more than 1,024 p_type values)?
If I allocate the shared memory arrays with a size of 1,024 or higher, the compiler says
ptxas error : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)
To put it simply, I don't know what to do when the size over which the reduction is performed is more than 1,024.
Is it possible to index using something other than threadIdx.x?
Could you give me some advice?
Shared memory is limited in size. The default limit for most GPUs is 48 KB. It has no direct connection to the number of threads in the thread block. Some GPUs can go as high as 96 KB, but you haven't indicated which GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
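As a rough sketch of that idea, reusing the names and types from your code (part1, Real, k_num_part2, etc. are assumed to be defined as in the original), the shared-memory stage is simply dropped and every thread accumulates straight into the global arrays, whose size is then no longer tied to the block size:
__global__ void KERNEL_RIGID_force_sum_global(part1*P1,part3*P3,int_t*nop_sol,Real*xcm,Real*ycm,Real*zcm,Real*sum_fx,Real*sum_fy,Real*sum_fz)
{
    int_t i=threadIdx.x+blockIdx.x*blockDim.x;
    if(i>=k_num_part2) return;
    if(P1[i].i_type==3) return;
    int_t ptypei=P1[i].p_type;
    // accumulate directly into the global arrays, indexed by particle type,
    // so ptypei may exceed 1,024 as long as the arrays are allocated that large
    atomicAdd(&nop_sol[ptypei],1);
    atomicAdd(&xcm[ptypei],P1[i].x);
    atomicAdd(&ycm[ptypei],P1[i].y);
    atomicAdd(&zcm[ptypei],P1[i].z);
    atomicAdd(&sum_fx[ptypei],P3[i].ftotalx);
    atomicAdd(&sum_fy[ptypei],P3[i].ftotaly);
    atomicAdd(&sum_fz[ptypei],P3[i].ftotalz);
}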
Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.
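For the 16-bit route, here is a small sketch, assuming a cc7.x or newer device and that half precision is acceptable for your sums (the names are made up, not from the question); storing the per-block accumulators as __half halves the shared memory footprint, so roughly twice as many bins fit in the same budget:
#include <cuda_fp16.h>

__global__ void force_sum_half(const float *fx_in, const int *ptype, float *fx_out, int n)
{
    __shared__ __half tmp_fx[2048];                      // 4 KB instead of 8 KB of float
    for (int j = threadIdx.x; j < 2048; j += blockDim.x)
        tmp_fx[j] = __float2half(0.0f);
    __syncthreads();

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        atomicAdd(&tmp_fx[ptype[i]], __float2half(fx_in[i]));   // 16-bit atomic add, cc7.x+
    __syncthreads();

    for (int j = threadIdx.x; j < 2048; j += blockDim.x)
        atomicAdd(&fx_out[j], __half2float(tmp_fx[j]));
}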

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
__global__ void fun1() {
    short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
    ....
    delete[] arr;
}
fun1<<<4, 64>>>();

// using global memory
__global__ void fun2(short *d_arr) {
    ...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel with 6 blocks I got error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering: how many registers can I actually use? And how is that related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: in your code fun1, the first line is executed by every thread, hence each thread of each block allocates 10,000 16-bit values, that is 1,280,000 bytes per block. For 4 blocks that makes 5,120,000 bytes; for 6 it makes 7,680,000 bytes, which for some reason seems to overflow the preallocated limit (the default limit is 8 MB, see Heap memory allocation). This may be why you get the illegal memory access error (77).
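If you do want to keep using in-kernel new, one option (sketched here, not necessarily the right fix) is to enlarge the device heap with cudaDeviceSetLimit before launching any kernel that allocates from it:
// raise the device malloc/new heap from the default 8 MB to 16 MB
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);
fun1<<<6, 64>>>();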
Using new will make use of some preallocated global memory, as malloc would, but not registers (maybe the code you provided is not exactly the one you run). If you want registers, you need to declare the data as a fixed-size array:
func1()
{
short arr [100] ;
...
}
The compiler will then try to fit the array into registers. Note however that this register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.
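As a side note, you can check how many registers the compiler actually assigned to a kernel by asking ptxas for a verbose report at compile time (the file name below is just a placeholder):
nvcc -Xptxas -v -arch=sm_50 my_kernels.cu
ptxas then prints per-kernel register, shared memory, and spill statistics.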

Understanding this CUDA kernels launch parameters

I am trying to analyze some code I found online, and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters:
histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...);
I know that the parameters are the grid size, the block size, and the shared memory size.
So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?
Also, within the kernel itself there are calls to __syncthreads(); are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?
From the CUDA Toolkit documentation:
The execution configuration (of a global function call) is specified by inserting an expression of the form <<<Dg,Db,Ns,S>>>, where:
Dg (dim3) specifies the dimension and size of the grid.
Db (dim3) specifies the dimension and size of each block.
Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory.
S (cudaStream_t) specifies the associated stream, is an optional parameter which defaults to 0.
So, as @Fazar pointed out, the answer is yes. This memory is allocated per block.
Also, within the kernel itself there are calls to __syncthreads(); are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads in the same block.
So each __syncthreads() call synchronizes only the threads of its own block; every one of the 2500 blocks executes its __syncthreads() calls independently.
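For concreteness, the kernel itself is not shown in the question, but a typical shared-memory histogram matching that launch (where blockDim.x equals numBins) might look roughly like this sketch; all names are placeholders:
__global__ void histogram(const unsigned int *d_in, unsigned int *d_bins,
                          int numElems, int numBins)
{
    // dynamically allocated: numBins * sizeof(unsigned int) bytes per block
    extern __shared__ unsigned int s_bins[];

    s_bins[threadIdx.x] = 0;          // one thread per bin clears the shared copy
    __syncthreads();                  // executed by every one of the 2500 blocks

    // grid-stride loop: accumulate this block's share of the input
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numElems;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_bins[d_in[i] % numBins], 1u);
    __syncthreads();

    // merge the per-block histogram into the global result
    atomicAdd(&d_bins[threadIdx.x], s_bins[threadIdx.x]);
}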

Kernel calls in CUDA

I am new to CUDA, and I am confused about kernel calls.
When you call a kernel you specify the number of blocks and the threads per block, like this: kernelMethod<<< block, Threads >>>(parameters);
So why is it possible to use a 3rd parameter?
kernelMethod<<< block, Threads, ???>>>(parameters);
Using cudaDeviceProp you can read the number of threads per block in the variable maxThreadsPerBlock. But how can I know the maximum number of blocks?
Thanks!!
The third parameter specifies the amount of shared memory per block to be dynamically allocated. The programming guide provides additional detail about shared memory, as well as a description and example.
Shared memory can be allocated statically in a kernel:
__shared__ int myints[256];
or dynamically:
extern __shared__ int myints[];
In the latter case, it's necessary to pass as an additional kernel config parameter (the 3rd parameter you mention) the size of the shared memory to be allocated in bytes.
In that event, the pointer myints then points to the beginning of that dynamically allocated region.
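A minimal sketch tying the declaration and the launch together (the kernel name and sizes here are only illustrative):
__global__ void kernelMethod(int n)
{
    extern __shared__ int myints[];        // sized at launch time, in bytes
    if (threadIdx.x < n)
        myints[threadIdx.x] = threadIdx.x; // use the dynamically allocated region
    __syncthreads();
    // ...
}

// 3rd config parameter = bytes of dynamic shared memory per block
kernelMethod<<<4, 256, 256 * sizeof(int)>>>(256);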
The maximum number of blocks is specified per grid dimension (x, y, z) and can also be obtained through the device properties query. It is specified in the maxGridSize parameter. You may want to refer to the deviceQuery sample for a worked example.
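For example, a small host-side sketch that queries both limits:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // properties of device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}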

What CUDA shared memory size means

I am trying to solve this problem myself but I can't.
So I want to get your advice.
I am writing kernel code like this. My GPU is a GTX 580.
xxxx <<< blockNum, threadNum, SharedSize >>> (... threadNum ...)
(note: SharedSize is set to 2*threadNum)
__global__ void xxxx(..., int threadNum, ...)
{
    extern __shared__ int shared[];
    int* sub_arr = &shared[0];
    int* sub_numCounting = &shared[threadNum];
    ...
}
My program creates about 1085 blocks with 1024 threads per block.
(I am trying to handle a huge array.)
So the size of shared memory per block is 8192 (1024*2*4) bytes, right?
I figured out that I can use at most 49152 bytes of shared memory per block on the GTX 580 by using cudaDeviceProp.
And I know the GTX 580 has 16 multiprocessors, and thread blocks are executed on those multiprocessors.
But my program produces an error (even though 8192 bytes < 49152 bytes).
I use printf in the kernel to see whether it operates correctly, but several blocks do not run (although I create 1085 blocks, only about 50~100 blocks actually run).
And I want to know whether blocks that run on the same multiprocessor share the same shared memory address or not (and if not, whether separate memory is allocated for their shared memory).
I can't quite understand what the maximum size of shared memory per block means.
Please give me some advice.
Yes, blocks on the same multiprocessor share the same pool of shared memory, which is 48 KB per multiprocessor for your GPU (compute capability 2.0). So if you want N blocks to be resident on the same multiprocessor at the same time, the maximum shared memory per block is (48/N) KB.
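One more detail, unrelated to the per-multiprocessor limit: the third launch parameter is a byte count. With the two int arrays carved out of the dynamic allocation in your kernel, the 8192 bytes you calculate would be expressed like this (a sketch, assuming a 4-byte int):
// two int arrays of threadNum elements each -> size in bytes, not elements
size_t SharedSize = 2 * threadNum * sizeof(int);   // 1024 * 2 * 4 = 8192 bytes
xxxx<<<blockNum, threadNum, SharedSize>>>(..., threadNum, ...);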