CUDA: Understanding the PTX info

I'm not finding much useful information on the PTX info printed by --ptxas-options=-v.
I found a 2008 NVCC PDF that has a small blurb, but no details.
1) What does "64 bytes cmem[0], 12 bytes cmem[16]" mean? I gather it refers to constant memory. I don't use any constant memory in the code, so this must come from the compiler. (What goes into read-only memory?)
2) What does "49152+0 bytes smem" mean? Yes, it is shared memory, but what do the two numbers mean?
3) Is there a document that will help me with this? (What is it called?)
4) Where can I find a document that explains the *.ptx file? (I'd like to be able to read and understand the CUDA assembly code.)

cmem is discussed here. In your case it means 64 bytes are used to pass arguments to the kernel and 12 bytes are occupied by compiler-generated constants.
In the case of smem, the first number is the amount of shared memory your code requests, and the second number (0) indicates how much is used for system purposes.
I don't know of any official information regarding the verbose ptxas output format. For example, the "CUDA Occupancy Calculator" simply says to sum the two smem values, without any explanation.
There are several PTX documents on the NVIDIA website. The most fundamental is PTX: Parallel Thread Execution ISA Version 3.0.
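As a small illustration of the "sum the two smem values" advice, here is a minimal host-side sketch. It assumes the field has exactly the "A+B bytes smem" shape quoted in the question; the function name total_smem_bytes is hypothetical, and newer ptxas versions may print the field differently.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: parses a field shaped like "49152+0 bytes smem"
// (the format quoted in the question) and returns the sum of both numbers.
// Returns -1 if the text does not start with "<number>+<number>".
long total_smem_bytes(const char* field) {
    assert(field != nullptr);
    long requested = 0;   // shared memory requested by the kernel code
    long system_use = 0;  // shared memory reserved for system purposes
    if (std::sscanf(field, "%ld+%ld bytes smem", &requested, &system_use) != 2) {
        return -1;
    }
    return requested + system_use;
}
```

For the line in the question, total_smem_bytes("49152+0 bytes smem") gives the figure you would enter into the occupancy calculator.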

Please see "Miscellaneous NVCC Usage".
They mention that the constant bank allocation is profile-specific.
The PTX guide says that, apart from the 64KB of constant memory, there are 10 more banks for constant memory. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters.
I guess that the profile given to nvcc determines which constants go into which bank. In any case, we don't need to worry as long as each constant bank cmem[n] uses less than 64KB, because each bank is 64KB in size and is common to all threads in the grid.

Related

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
fun1() {
short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
....
delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
fun2(short *d_arr) {
...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering: how many registers can I actually use, and how is that related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: the first line of fun1 is executed by every thread, so each thread of each block allocates 10,000 16-bit values (20,000 bytes). With 64 threads per block that is 1,280,000 bytes per block: 5,120,000 bytes for 4 blocks and 7,680,000 bytes for 6 blocks. The latter is close enough to the preallocated heap limit (the default is 8MB, see Heap memory allocation) that, once allocator overhead is included, some allocations probably fail. A failed device-side allocation returns a null pointer, and writing through it would explain the illegal memory access error (77).
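The arithmetic above can be checked with a minimal host-side sketch. The helper name heap_bytes_requested and the constant kDefaultHeapBytes are illustrative, and the 8MB figure is simply the default quoted in this answer; allocator overhead is not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Default device heap size as quoted in the answer (8MB); an assumption here.
constexpr std::size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Total bytes requested from the device heap when every thread of every
// block allocates its own array. Per-allocation overhead is ignored.
std::size_t heap_bytes_requested(std::size_t blocks,
                                 std::size_t threads_per_block,
                                 std::size_t elems_per_thread,
                                 std::size_t elem_size) {
    assert(elem_size > 0);
    return blocks * threads_per_block * elems_per_thread * elem_size;
}

// Fraction of the default heap consumed by the raw payload alone.
double heap_fraction(std::size_t requested_bytes) {
    return static_cast<double>(requested_bytes) /
           static_cast<double>(kDefaultHeapBytes);
}
```

With 6 blocks of 64 threads the payload alone takes more than 90% of the default heap. If per-thread heap allocation is really wanted, the limit can be raised from the host before the first kernel launch with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes), and the kernel should check the returned pointer for null.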
Using new will make use of some preallocated global memory as malloc would, but not registers - maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data in a fixed array:
func1()
{
short arr [100] ;
...
}
The compiler will then try to fit the array in registers. Note, however, that register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.

Counting registers/thread in Cuda kernel

The Nsight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even
assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which
case using shared memory as I did doesn't make sense). I would
like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to understand it better, you will need to look at the PTX code directly: compile your code using nvcc with the -ptx switch and inspect the resulting .ptx file. You may wish to refer to the PTX documentation, as well as the nvcc documentation for the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20, which instructs the compiler to limit itself to 20 registers per thread. This tactic may not give good results, and you may need to tune the parameter to find a value that improves occupancy without sacrificing too much basic performance. If you constrain the compiler too much, it will begin to spill the registers it needs to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
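To see roughly how the register count feeds into occupancy, here is a simplified host-side sketch. The figures of 65536 registers and a 64-warp limit per multiprocessor are assumptions for a CC 5.0 device, the function name is hypothetical, and the hardware's register allocation granularity is ignored, so the real number can be somewhat lower.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kWarpSize = 32;

// Simplified estimate of how many warps can be resident on one
// multiprocessor when registers are the only limiting resource.
// Ignores register allocation granularity, shared memory and block limits.
int resident_warps_by_registers(int regs_per_sm,
                                int regs_per_thread,
                                int max_warps_per_sm) {
    assert(regs_per_thread > 0);
    const int by_registers = regs_per_sm / (regs_per_thread * kWarpSize);
    return std::min(by_registers, max_warps_per_sm);
}
```

Under these assumptions, 52 registers per thread allows about 39 of 64 warps, while 32 registers per thread or fewer no longer limits occupancy at all, which suggests there is little point in forcing the count far below 32 on such a device.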
If you want to increase occupancy, a direct way is the compiler flag -maxrregcount, which restricts register usage, but it may cost performance because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to there.
In the Debug perspective, inside the CUDA thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The "Disassembly" window will open with your kernel's PTX assembly. You can continue stepping through your kernel to track the correlation between your source code and the assembly, and so discover what each register is used for.

CUDA allocation alignment is 256 bytes - seriously?

In "CUDA C Programming Guide 5.0", p73 (also here) says "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes". I do not know the exact meaning of this sentence. Could anyone show an example for me? Many thanks.
A derivative question:
So, what about allocating a one-dimensional array of basic elements (like int) or user-defined ones? The starting address of the array will be a multiple of 256 bytes, while the address of each element in the array is not necessarily a multiple of 256 bytes?
Pointers allocated by any of the CUDA Runtime's device memory allocation functions, e.g. cudaMalloc or cudaMallocPitch, are guaranteed to be 256-byte aligned, i.e. the address is a multiple of 256.
Consider the following example:
char *ptr1, *ptr2;
int bytes = 1;
cudaMalloc((void**)&ptr1,bytes);
cudaMalloc((void**)&ptr2,bytes);
Suppose the address returned in ptr1 is some multiple of 256; then the address returned in ptr2 will be at least (ptr1 + 256).
This is a restriction imposed by the device on which the memory is being allocated. Mostly, pointers are aligned for performance reasons. (Someone from NVIDIA should be able to tell if there is some other reason as well.)
Important:
Pointer alignment is not always 256. On my device (GTX460M), it is 512. You can get the device pointer alignment by the cudaDeviceProp::textureAlignment field.
Alignment of pointers is also a requirement for binding the pointer to textures.
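Regarding the derivative question about array elements: only the base pointer carries the 256-byte guarantee, and each element sits at base + index * sizeof(element). A minimal host-side sketch of that arithmetic follows; the helper names and the base address are made up for illustration, and no CUDA call is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if addr is a multiple of alignment.
bool is_aligned(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment != 0);
    return addr % alignment == 0;
}

// Address of element `index` in an array that starts at `base`.
std::uintptr_t element_address(std::uintptr_t base,
                               std::size_t index,
                               std::size_t elem_size) {
    return base + index * elem_size;
}
```

For an int array whose base is 256-byte aligned, element 1 is only 4-byte aligned, and the next element that is again 256-byte aligned is element 64 (64 * 4 = 256).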

How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

I have a question about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if(K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
store the 32 bit word in shared memory
after the shared memory has been populated, (and __syncthreads() called) then each thread K can process work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word (K / numElementsPerWord), and only when K is a multiple of numElementsPerWord. This would seem to not coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N * sizeof(T) + 3) / 4, such that each thread will be responsible for loading and processing (4/sizeof(T)) elements (each thread processes one 32-bit word, which holds 1, 2, or 4 elements depending on sizeof(T)). However, I am concerned that this approach will not be as fast as possible, since each thread must then do twice (if T=short) or even four times (if T=unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?
You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, the address accessed by thread 0 of a given warp (a collection of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
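The index arithmetic behind "each thread does a contiguous 32-bit load" can be sketched on the host as follows. This is only the mapping, not a kernel, and all function names are hypothetical. It assumes sizeof(T) is 1, 2 or 4 and that the array starts on a 4-byte boundary.

```cpp
#include <cassert>
#include <cstddef>

// Number of 32-bit words covering n_elems elements of elem_size bytes.
// This is also the number of threads that need to perform a load.
std::size_t words_needed(std::size_t n_elems, std::size_t elem_size) {
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4);
    return (n_elems * elem_size + 3) / 4;
}

// Thread tid loads word tid; threads past the end are inhibited.
bool thread_should_load(std::size_t tid,
                        std::size_t n_elems,
                        std::size_t elem_size) {
    return tid < words_needed(n_elems, elem_size);
}

// 32-bit word that holds element k: byte offset shifted right by 2 bits.
std::size_t word_of_element(std::size_t k, std::size_t elem_size) {
    return (k * elem_size) >> 2;
}
```

With 10 shorts, threads 0 to 4 each load one word and every other thread skips the load; after __syncthreads(), thread K finds its element in word word_of_element(K, sizeof(T)) of the shared buffer.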

coalesced read short integer cuda

Say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. The best practices guide says that on devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand it correctly, if I break my data into 32-byte (16 shorts) segments, do threads 0, 16, 32, ... have to access the first element of each segment? Do I need to consider 64-byte or 128-byte alignment as well? I have a GTS 250, so I guess this is important. Advice is welcome. Thanks.
According to Section G.3.2.1 of the CUDA Programming Guide short will not coalesce on Compute Capability 1.0 and 1.1 devices under any circumstances. Specifically, it states:
The size of the words accessed by the threads must be 4, 8, or 16 bytes
You can, however, use vector types such as short2 or short4 to get coalesced access (CUDA has no built-in short8; a 16-byte access would need a type such as int4 carrying eight shorts). The coalescing rules for these types are spelled out in Section G.3.2.1 as well. As far as coalescing is concerned, a short2 is just like a 32-bit int.
FWIW, devices with Compute Capability 1.3 or greater handle types like char and short much better. Reading chars on a 1.3 device might give you as much as ~60% of peak memory bandwidth vs. ~10% of peak memory bandwidth on a 1.0 or 1.1 device.
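The two conditions quoted above for compute capability 1.0 and 1.1 (a word size of 4, 8 or 16 bytes, and a segment aligned to 16 times the word size) can be written down as a small host-side check. This is a simplified sketch with a hypothetical function name; it assumes thread k of the half warp accesses word k counted from base_addr, and it does not model the relaxed rules of later devices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHalfWarp = 16;

// Simplified CC 1.0/1.1 rule: the half warp's access can coalesce only if
// the word size is 4, 8 or 16 bytes and the first thread's address is
// aligned to 16 * word_size (64, 128 or 256 bytes).
bool half_warp_can_coalesce_cc1x(std::uintptr_t base_addr,
                                 std::size_t word_size) {
    const bool size_ok = (word_size == 4 || word_size == 8 || word_size == 16);
    if (!size_ok) {
        return false;  // e.g. plain short (2 bytes) or char (1 byte)
    }
    return base_addr % (kHalfWarp * word_size) == 0;
}
```

A plain short access (word_size 2) fails the first condition whatever the alignment, while the same data read as short2 (word_size 4) passes as long as each half warp starts on a 64-byte boundary.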