CUDA shared memory addressing - cuda

I understand that when I declare a shared memory array in a kernel, an array of the same size is declared for all the threads. Code like
__shared__ int s[5];
will create a 20-byte array in each thread. The way I understand shared memory addressing is that it is universal across all the threads. So, if I write to subscript 10 as follows
s[10] = 1900;
it is the exact same memory location across all the threads. It won't be the case that different threads access a different shared memory address for subscript 10. Is this correct? The compiler of course warns that the subscript is out of range.

Actually it will create a 20-byte array per block, not per thread.
Every thread within the block will be able to access these 20 bytes. So if you need N bytes per thread, and a block with M threads, you'll need to create an N*M buffer per block.
In your case, if there were 128 threads, you would have needed
__shared__ int array[5*128];
And array[10] would have been a valid address for any thread within the block.
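For illustration, a minimal sketch of that per-thread-slice pattern (the kernel name and the 128-thread block size are just for the example): each thread indexes its own 5-element slice of the single per-block array.
__global__ void per_thread_scratch(int *out)
{
    const int N_PER_THREAD = 5;            // elements each thread needs
    __shared__ int array[5 * 128];         // one array per block of 128 threads
    // Each thread works on its own slice of the block-wide array.
    int *mine = &array[threadIdx.x * N_PER_THREAD];
    for (int j = 0; j < N_PER_THREAD; j++)
        mine[j] = j;
    __syncthreads();
    // array[10] is the same location for every thread in this block;
    // with 5 elements per thread it happens to fall in thread 2's slice.
    out[blockIdx.x * blockDim.x + threadIdx.x] = mine[0];
}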

Related

The shared memory size is limited to the maximum thread number when using AtomicAdd function

I use atomic operations to calculate a summation of values, like a histogram.
So, I first use shared memory to accumulate the values within each block, and the values stored in shared memory in each block are then saved to global memory.
The whole code follows.
__global__ void KERNEL_RIGID_force_sum(part1 *P1, part3 *P3, int_t *nop_sol, Real *xcm, Real *ycm, Real *zcm, Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;
    // if(P1[i].p_type<RIGID) return;
    // initialize accumulation arrays in shared memory
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128], tmp_ycm[128], tmp_zcm[128];
    __shared__ Real tmp_fx[128], tmp_fy[128], tmp_fz[128];
    tmp_nop[threadIdx.x] = 0;
    tmp_xcm[threadIdx.x] = 0;
    tmp_ycm[threadIdx.x] = 0;
    tmp_zcm[threadIdx.x] = 0;
    tmp_fx[threadIdx.x] = 0;
    tmp_fy[threadIdx.x] = 0;
    tmp_fz[threadIdx.x] = 0;
    __syncthreads();
    Real xi, yi, zi;
    Real fxi, fyi, fzi;
    int_t ptypei;
    ptypei = P1[i].p_type;
    xi = P1[i].x;
    yi = P1[i].y;
    zi = P1[i].z;
    fxi = P3[i].ftotalx;
    fyi = P3[i].ftotaly;
    fzi = P3[i].ftotalz;
    // accumulate values into shared memory
    atomicAdd(&tmp_nop[ptypei], 1);
    atomicAdd(&tmp_xcm[ptypei], xi);
    atomicAdd(&tmp_ycm[ptypei], yi);
    atomicAdd(&tmp_zcm[ptypei], zi);
    atomicAdd(&tmp_fx[ptypei], fxi);
    atomicAdd(&tmp_fy[ptypei], fyi);
    atomicAdd(&tmp_fz[ptypei], fzi);
    __syncthreads();
    // save shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x], tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x], tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x], tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x], tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x], tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x], tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x], tmp_fz[threadIdx.x]);
}
But there are some problems.
Because the number of threads per block is 128 in my code, I allocate the shared memory and global memory arrays with size 128.
What can I do if I want to use a shared memory array larger than the maximum number of threads per block, 1,024? (when there are more than 1,024 p_type values)
If I allocate the shared memory size as 1,024 or a higher value, the system says
ptxas error : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)
To put it simply, I don't know what to do when the size over which to perform the reduction is more than 1,024.
Is it possible to index using something other than threadIdx.x?
Could you give me some advice?
Shared memory is limited in size. The default limit for most GPUs is 48KB. It has no direct connection to the number of threads in the threadblock. Some GPUs can go as high as 96KB, but you haven't indicated what GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.
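For illustration, here is a rough sketch of that global-atomics alternative, reusing the question's own types (part1, part3, int_t, Real, k_num_part2) and assuming int_t is a plain int and Real is a type with an atomicAdd overload (float, or double on a cc6.0+ device). The output arrays are assumed to be allocated with one slot per p_type value, however many there are:
__global__ void rigid_force_sum_global(part1 *P1, part3 *P3, int_t *nop_sol, Real *xcm, Real *ycm, Real *zcm, Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;
    int_t ptypei = P1[i].p_type;
    // No shared-memory staging: accumulate straight into global memory.
    // This generates more global atomic traffic, but the per-type arrays
    // can be as large as global memory allows, with no per-block limit.
    atomicAdd(&nop_sol[ptypei], 1);
    atomicAdd(&xcm[ptypei], P1[i].x);
    atomicAdd(&ycm[ptypei], P1[i].y);
    atomicAdd(&zcm[ptypei], P1[i].z);
    atomicAdd(&sum_fx[ptypei], P3[i].ftotalx);
    atomicAdd(&sum_fy[ptypei], P3[i].ftotaly);
    atomicAdd(&sum_fz[ptypei], P3[i].ftotalz);
}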

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
__global__ void fun1() {
    short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
    ....
    delete[] arr;
}
fun1<<<4, 64>>>();

// using global memory
__global__ void fun2(short *d_arr) {
    ...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering: how many registers can I actually use? And how is that related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: in your code, fun1, the first line is invoked by all threads, hence each thread of each block allocates 10,000 16-bit values, that is 1,280,000 bytes per block. For 4 blocks, that makes 5,120,000 bytes; for 6, that makes 7,680,000 bytes, which for some reason seems to overflow the preallocated limit (the default limit is 8MB - see Heap memory allocation). This may be why you get this illegal memory access error (77).
Using new will make use of some preallocated global memory, as malloc would, but not registers - maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data in a fixed-size array:
__global__ void func1()
{
    short arr[100];
    ...
}
The compiler will then try to fit the array in registers. Note however that this register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.
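As an aside, if in-kernel new/delete is really what you want, the device heap that backs it can be enlarged before any kernel launch with cudaDeviceSetLimit. A minimal sketch (the 32MB value is just an illustrative choice):
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Enlarge the device-side heap used by in-kernel new/malloc
    // (default is 8MB); this must be done before any kernel has launched.
    size_t heapBytes = 32 * 1024 * 1024;  // 32MB, illustrative value
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", actual);
    return 0;
}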

Dynamically allocated shared memory in CUDA. Execution Configuration

What does Nvidia mean by this?
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;
Size of shared memory in my GPU is 48kB.
For example, I want to run 4 kernels at the same time, each of them using 12kB of shared memory.
In order to do that, should I call the kernel this way
kernel<<< gridSize, blockSize, 12 * 1024 >>>();
or should the third argument be 48 * 1024 ?
Ns is a size in bytes. If you want to reserve 12kB of shared memory you would pass 12*1024.
I doubt you want to do this. The Ns value is PER BLOCK. So it is the amount of shared memory per block executing on the device. I'm guessing you'd like to do something along the lines of 12*1024/number_of_blocks;
Kernel launching with concurrency:
If, as mentioned in a comment, you are using streams, there is a fourth argument in the kernel launch, which is the CUDA stream.
If you want to launch a kernel on another stream without any dynamic shared memory it will look like:
kernel_name<<<128, 128, 0, mystream>>>(...);
but concurrency is a whole different issue.
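For completeness, a hedged sketch combining the two: 12kB of dynamically allocated shared memory per block plus a stream as the fourth launch argument (the kernel name, sizes, and stream handle are illustrative):
extern __shared__ float s_buf[];   // dynamically sized shared array

__global__ void my_kernel(float *out)
{
    s_buf[threadIdx.x] = (float)threadIdx.x;   // each thread fills its slot
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = s_buf[threadIdx.x];
}

// Host side:
// cudaStream_t mystream;
// cudaStreamCreate(&mystream);
// my_kernel<<<gridSize, blockSize, 12 * 1024, mystream>>>(d_out);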

Kernel calls in CUDA

I am new to CUDA, and I am confused about kernel calls.
When you call a kernel you specify the number of blocks and the threads per block, like this: kernelMethod<<< block, Threads >>>(parameters);
So why is it possible to use a 3rd parameter?
kernelMethod<<< block, Threads, ???>>>(parameters);
Using cudaDeviceProp you can read the number of threads per block in the variable maxThreadsPerBlock. But how can I know the maximum number of blocks?
Thanks!!
The third parameter specifies the amount of shared memory per block to be dynamically allocated. The programming guide provides additional detail about shared memory, as well as a description and example.
Shared memory can be allocated statically in a kernel:
__shared__ int myints[256];
or dynamically:
extern __shared__ int myints[];
In the latter case, it's necessary to pass as an additional kernel config parameter (the 3rd parameter you mention) the size of the shared memory to be allocated in bytes.
In that event, the pointer myints then points to the beginning of that dynamically allocated region.
The maximum number of blocks is specified per grid dimension (x, y, z) and can also be obtained through the device properties query. It is specified in the maxGridSize parameter. You may want to refer to the deviceQuery sample for a worked example.
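For reference, a small sketch of that device properties query (device 0 is assumed):
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("sharedMemPerBlock: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}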

How to share a common value between threads in a given block?

I have a kernel that, for each block, runs a for loop with a block-specific number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance-wise) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;
    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();
    for( int i = 0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1, and if the amount of data you are accessing per block is relatively small, you should always hit in the cache after the initial compulsory miss in each thread block.
You also won't need to use any shared memory or transfer from global to shared memory.
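A hedged sketch of that constant-memory variant (the N_BLOCKS value and names are illustrative); each block reads its own iteration count, and because the index depends only on blockIdx the access is uniform and served by the constant cache:
#define N_BLOCKS 256                       // illustrative grid size

__constant__ int c_nIterBuf[N_BLOCKS];     // per-block iteration counts

__global__ void foo()
{
    int nIter = c_nIterBuf[blockIdx.x];    // uniform within the block
    for (int i = 0; i < nIter; i++) {
        // ... per-iteration work ...
    }
}

// Host side, before launching:
// cudaMemcpyToSymbol(c_nIterBuf, h_nIterBuf, N_BLOCKS * sizeof(int));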
If my info is up-to-date, the shared memory is the second fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).
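A minimal sketch of that suggestion, keeping the original shared-memory broadcast but copying the value into a per-thread local before the loop (the compiler will usually keep it in a register):
__global__ void foo(int *nIterBuf)
{
    __shared__ int s_nIter;
    if (threadIdx.x == 0)
        s_nIter = nIterBuf[blockIdx.x];
    __syncthreads();

    int nIter = s_nIter;   // per-thread local copy, typically register-resident
    for (int i = 0; i < nIter; i++) {
        // ... per-iteration work ...
    }
}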