What is the number of registers in CUDA CC 5.0? - cuda

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
__global__ void fun1() {
    short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
    ....
    delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
__global__ void fun2(short *d_arr) {
    ...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel with 6 blocks, I get error code 77:
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering: how many registers can I actually use, and how is that related to the number of blocks?

The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: in fun1, the first line is executed by every thread, so each thread allocates 10,000 16-bit values, i.e. 20,000 bytes per thread, or 1,280,000 bytes per block of 64 threads. With 4 blocks that makes 5,120,000 bytes; with 6 blocks, 7,680,000 bytes, which apparently overflows the preallocated heap limit (the default limit is 8MB - see Heap memory allocation in the CUDA documentation). That is most likely why you get the illegal memory access error (77): when the in-kernel new fails, it returns a null pointer, and dereferencing that pointer is an illegal access.
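If the device heap really is the limit being hit, it can be enlarged from the host before the first kernel launch. A minimal sketch, assuming the kernel from the question; the 16 MB figure is an arbitrary example, not a recommendation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fun1() {
    // 10,000 shorts = 20,000 bytes per thread, taken from the device heap
    short *arr = new short[100 * 100];
    if (arr == nullptr) {
        // in-kernel new returns nullptr on failure instead of throwing
        printf("heap allocation failed in thread %d\n", threadIdx.x);
        return;
    }
    // ... use arr ...
    delete[] arr;
}

int main() {
    // Raise the device malloc/new heap from the 8 MB default to 16 MB.
    // This must happen before any kernel that uses the heap has run.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 16 * 1024 * 1024);
    fun1<<<6, 64>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```

Checking the pointer returned by new inside the kernel, as above, also turns the silent illegal access into a diagnosable message.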
Using new draws on preallocated global memory, just as malloc would, but not on registers; perhaps the code you provided is not exactly the code you ran. If you want registers, you need to define the data as a fixed-size array:
__global__ void fun1()
{
    short arr[100];
    ...
}
The compiler will then try to fit the array into registers. Note however that this register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.
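You can check what the compiler actually did by asking ptxas for a resource report. A sketch, with an illustrative kernel name; the loop and unroll pragma are there because the compiler only keeps an array in registers when every index is known at compile time:

```cuda
// A fixed-size array the compiler may promote to registers. If any
// indexing is not resolvable at compile time, the array is spilled to
// per-thread local memory instead, which the ptxas report also shows.
__global__ void fun1_fixed() {
    short arr[100];
    #pragma unroll
    for (int i = 0; i < 100; ++i)
        arr[i] = (short)(threadIdx.x + i); // fully unrolled -> constant indices
    // ... use arr ...
}

// Compile with:  nvcc -arch=sm_50 -Xptxas=-v fun1_fixed.cu
// ptxas then prints per-kernel usage, e.g. "Used NN registers, ...".
```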

Related

The shared memory size is limited to the maximum thread number when using the atomicAdd function

I use atomic operations to calculate a summation of values, like a histogram.
So I first use shared memory to accumulate the values within each block, and the values stored in shared memory in each block are then added to global memory.
The whole code follows.
__global__ void KERNEL_RIGID_force_sum(part1 *P1, part3 *P3, int_t *nop_sol,
                                       Real *xcm, Real *ycm, Real *zcm,
                                       Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;
    // if (P1[i].p_type < RIGID) return;

    // initialize accumulation arrays in shared memory
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128], tmp_ycm[128], tmp_zcm[128];
    __shared__ Real tmp_fx[128], tmp_fy[128], tmp_fz[128];
    tmp_nop[threadIdx.x] = 0;
    tmp_xcm[threadIdx.x] = 0;
    tmp_ycm[threadIdx.x] = 0;
    tmp_zcm[threadIdx.x] = 0;
    tmp_fx[threadIdx.x] = 0;
    tmp_fy[threadIdx.x] = 0;
    tmp_fz[threadIdx.x] = 0;
    __syncthreads();

    Real xi, yi, zi;
    Real fxi, fyi, fzi;
    int_t ptypei;
    ptypei = P1[i].p_type;
    xi = P1[i].x;
    yi = P1[i].y;
    zi = P1[i].z;
    fxi = P3[i].ftotalx;
    fyi = P3[i].ftotaly;
    fzi = P3[i].ftotalz;

    // accumulate values in shared memory, binned by particle type
    atomicAdd(&tmp_nop[ptypei], 1);
    atomicAdd(&tmp_xcm[ptypei], xi);
    atomicAdd(&tmp_ycm[ptypei], yi);
    atomicAdd(&tmp_zcm[ptypei], zi);
    atomicAdd(&tmp_fx[ptypei], fxi);
    atomicAdd(&tmp_fy[ptypei], fyi);
    atomicAdd(&tmp_fz[ptypei], fzi);
    __syncthreads();

    // add shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x], tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x], tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x], tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x], tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x], tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x], tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x], tmp_fz[threadIdx.x]);
}
But there are some problems.
Because the number of threads per block is 128 in my code, I allocate the shared-memory and global arrays with size 128.
What can I do if I want to use shared memory larger than the maximum number of threads per block, 1,024 (i.e., when there are more than 1,024 p_type values)?
If I allocate the shared memory size as 1,024 or a higher value, the compiler says
ptxas error : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)
To put it simply, I don't know what to do when the size of the reduction is more than 1,024.
Is it possible to index using something other than threadIdx.x?
Could you give me some advice?
Shared memory is limited in size. The default limit for most GPUs is 48KB per block. It has no direct connection to the number of threads in the threadblock. Some GPUs can go as high as 96KB per block, but you haven't indicated which GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
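A sketch of that conversion for the kernel above: drop the shared-memory staging entirely and accumulate straight into the global arrays, which removes the 48KB ceiling because the type count is now bounded only by the global array sizes. Names follow the question's kernel; this assumes int_t and Real map to types atomicAdd supports in global memory (e.g. int, and double on cc6.0+):

```cuda
__global__ void force_sum_global(part1 *P1, part3 *P3, int_t *nop_sol,
                                 Real *xcm, Real *ycm, Real *zcm,
                                 Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;

    int_t ptypei = P1[i].p_type; // may now range far beyond 1,024

    // No shared staging: each thread adds its contribution directly to
    // the global accumulators, indexed by particle type.
    atomicAdd(&nop_sol[ptypei], (int_t)1);
    atomicAdd(&xcm[ptypei], P1[i].x);
    atomicAdd(&ycm[ptypei], P1[i].y);
    atomicAdd(&zcm[ptypei], P1[i].z);
    atomicAdd(&sum_fx[ptypei], P3[i].ftotalx);
    atomicAdd(&sum_fy[ptypei], P3[i].ftotaly);
    atomicAdd(&sum_fz[ptypei], P3[i].ftotalz);
}
```

The trade-off is more atomic traffic on global memory; if only a few types are hot, the shared-memory staging version will usually be faster for those, so a hybrid (shared bins for types below some cutoff, global atomics for the rest) is also worth considering.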
Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.

Dynamically allocated shared memory in CUDA. Execution Configuration

What does Nvidia mean by this?
Ns is of type size_t and specifies the number of bytes in shared
memory that is dynamically allocated per block for this call in
addition to the statically allocated memory; this dynamically
allocated memory is used by any of the variables declared as an
external array as mentioned in __shared__; Ns is an optional
argument which defaults to 0;
Size of shared memory in my GPU is 48kB.
For example, I want to run 4 kernels at the same time, each of which uses 12kB of shared memory.
In order to do that, should I call the kernel this way:
kernel<<< gridSize, blockSize, 12 * 1024 >>>();
or should the third argument be 48 * 1024?
Ns is a size in bytes, and it is specified per block. If you want each block to have 12kB of dynamically allocated shared memory, you pass 12 * 1024 as the third argument, not 48 * 1024: every block of every launch gets its own allocation, so the four kernels do not split a single 48kB request among themselves. Whether the kernels actually run concurrently then depends on whether blocks from different kernels fit together within the shared memory available on each multiprocessor.
Kernel launching with concurrency:
As mentioned in a comment, if you are using streams, there is a fourth argument in the kernel launch, which is the CUDA stream.
If you want to launch a kernel on another stream without any shared memory it will look like:
kernel_name<<<128, 128, 0, mystream>>>(...);
but concurrency is a whole different issue.
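Putting the pieces together, a sketch of the four launches from the question; the kernel body and grid/block sizes are illustrative, and the dynamic allocation is made visible inside the kernel via an extern __shared__ array, as the quoted documentation describes:

```cuda
#include <cuda_runtime.h>

__global__ void kernel()
{
    // Sized by the Ns launch argument: 12 * 1024 bytes = 3072 floats here.
    extern __shared__ float buf[];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    // ... use buf ...
}

int main()
{
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i) {
        cudaStreamCreate(&streams[i]);
        // Each launch requests 12 kB of dynamic shared memory PER BLOCK,
        // on its own stream so the kernels may overlap.
        kernel<<<128, 128, 12 * 1024, streams[i]>>>();
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 4; ++i) cudaStreamDestroy(streams[i]);
    return 0;
}
```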

Understanding this CUDA kernels launch parameters

I am trying to analyze some code I have found online and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters
histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...);
I know that the parameters are grid, block, shared memory sizes.
So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?
Also, within the kernel itself there are calls to __syncthreads(), are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
So does that mean that there are 2500 blocks of numBins threads each,
each block also having a numBins * sizeof(unsigned int) chunk of
shared memory available to its threads?
From the CUDA Toolkit documentation:
The execution configuration (of a global function call) is specified by inserting an expression of the form <<<Dg,Db,Ns,S>>>, where:
Dg (dim3) specifies the dimension and size of the grid.
Db (dim3) specifies the dimension and size of each block.
Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory.
S (cudaStream_t) specifies the associated stream, is an optional parameter which defaults to 0.
So, as @Fazar pointed out, the answer is yes. This memory is allocated per block.
Also, within the kernel itself there are calls to __syncthreads(), are
there then 2500 sets of numBins calls to __syncthreads() over the
course of the kernel call?
__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads of the same block.
The barrier is per block: each of the 2500 blocks independently waits for its own numBins threads at every __syncthreads() in the kernel. So yes, over the course of the kernel call there are 2500 separate sets of numBins threads synchronizing at each barrier.
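For concreteness, a kernel body consistent with that launch might look like the following sketch. The parameters and the binning expression are assumptions; the point is how the numBins * sizeof(unsigned int) dynamic allocation and the per-block barriers are used:

```cuda
__global__ void histogram(const unsigned int *in, unsigned int *out, int n)
{
    // One counter per thread: numBins * sizeof(unsigned int) bytes,
    // exactly the Ns value passed at launch (blockDim.x == numBins).
    extern __shared__ unsigned int bins[];
    bins[threadIdx.x] = 0;
    __syncthreads(); // executed independently by each of the 2500 blocks

    // Grid-stride loop: each thread folds many inputs into this
    // block's private histogram.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i] % blockDim.x], 1u);
    __syncthreads();

    // Merge this block's private histogram into the global one.
    atomicAdd(&out[threadIdx.x], bins[threadIdx.x]);
}
```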

CUDA shared memory addressing

I understand that when I declare a shared memory array in a kernel, the same sized array is declared by all the threads. A code like
__shared__ int s[5];
will create a 20 byte array in each thread. The way I understand addressing shared memory is that it is universal across all the threads. So, if I address subscript 10 as follows
s[10] = 1900;
it is the exact same memory location across all the threads. It won't be the case that different threads access different shared memory address for subscript 10. Is this correct? The compiler of course throws warnings that the subscript is out of range.
Actually it will create a 20-byte array per block, not per thread.
Every thread within the block will be able to access these 20 bytes. So if you need to have N bytes per thread, and a block with M threads, you'll need to create a N*M buffer per block.
In your case, if there were 128 threads, you would have had
__shared__ int array[5*128];
And array[10] would have been a valid address for any thread within the block.
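A sketch of that layout, assuming a block of 128 threads that each need 5 private ints (all names are illustrative):

```cuda
#define THREADS_PER_BLOCK 128
#define INTS_PER_THREAD 5

__global__ void per_thread_slices(int *out)
{
    // ONE shared array per block; each thread uses its own 5-int slice.
    __shared__ int array[INTS_PER_THREAD * THREADS_PER_BLOCK];
    int *mine = &array[threadIdx.x * INTS_PER_THREAD];
    for (int i = 0; i < INTS_PER_THREAD; ++i)
        mine[i] = threadIdx.x; // no two threads touch the same element
    __syncthreads();

    // array[10] is the same physical location for every thread in the
    // block: it falls inside thread 2's slice (2*5 <= 10 < 3*5).
    if (threadIdx.x == 0) out[blockIdx.x] = array[10];
}
```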

How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

I have a questions about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability CUDA 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if (K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
store the 32 bit word in shared memory
after the shared memory has been populated, (and __syncthreads() called) then each thread K can process work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word (K / numElementsPerWord). This would seem to not coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N + 3) / 4, such that each thread will be responsible for loading and processing (4/sizeof(T)) elements (each thread processes one 32-bit word - possibly 1, 2, or 4 elements depending on sizeof(T)). However I am concerned that this approach will not be as fast as possible since each thread must then do twice (if T=short) or even quadruple (if T=unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?
You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, the address accessed by thread 0 of a given warp (a collection of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so that adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
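A sketch of the suggested scheme for T = short, with illustrative names. Each of the first blockDim.x/2 threads loads one contiguous 32-bit word, so consecutive threads touch consecutive words; it assumes g_in is 4-byte aligned and blockDim.x is even:

```cuda
__global__ void process_shorts(const short *g_in, short *g_out, int n)
{
    // Per-block tile of blockDim.x shorts, declared as 32-bit words so
    // the loads below are word-sized (Ns = blockDim.x * sizeof(short)).
    extern __shared__ int s_words[];

    int base = blockIdx.x * blockDim.x; // first element of this tile
    int numWords = blockDim.x / 2;      // two shorts per 32-bit word

    // Coalesced load: threads 0..numWords-1 read consecutive 32-bit
    // words; the conditional inhibits threads past the end of the data.
    if (threadIdx.x < numWords && base + 2 * threadIdx.x + 1 < n)
        s_words[threadIdx.x] = ((const int *)g_in)[base / 2 + threadIdx.x];
    __syncthreads();

    // Every thread now works on its own element (and, in the real
    // computation, its neighbors already sitting in shared memory).
    short *s_in = (short *)s_words;
    if (base + threadIdx.x < n)
        g_out[base + threadIdx.x] = s_in[threadIdx.x]; // placeholder computation
}
```

A tail of odd-sized input would need one scalar fix-up load; that bookkeeping is omitted here to keep the coalescing pattern visible.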