Suppose I have two __device__ CUDA functions, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__ function), with:
extern __shared__ int b[];
Is this explicitly allowed or forbidden by NVIDIA? (I don't see it in programming guide section B.2.3 on __shared__.) Do the sizes all count together towards the shared memory limit, or does only the maximum amount possibly in use at a single time count? Or is there some other rule?
This can be considered a follow-up question to this one.
Shared memory is split into two parts: statically allocated and dynamically allocated. The first part is computed during compilation, and each declaration is an actual allocation - enabling ptxas info during compilation illustrates it here:
ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]
Here we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).
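(For reference, this resource-usage report is produced by forwarding the verbose flag to ptxas at compile time; the source file name below is just a placeholder:)

nvcc -Xptxas -v shared_sample.cu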
Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access another shared memory declaration.
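As a minimal illustration of that point (a hypothetical helper, not part of the sample below), a __device__ function can receive and dereference a pointer into a shared array declared by its caller:

__device__ void fill(int *s)          // s may point into shared memory
{
    s[threadIdx.x] = threadIdx.x;     // writes the caller's shared array
}

__global__ void caller()
{
    __shared__ int a[32];
    fill(a);                          // pass the shared array by pointer
    __syncthreads();
}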
Then comes the dynamically allocated shared memory, whose reserved size is declared at kernel launch.
Here is an example of various uses in a couple of functions. Note the pointer value of each shared memory region.
#include <cstdio>

__device__ void dev1()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;
    if (threadIdx.x == 0)
        printf("dev1 : %p\n", (void*)a);
}

__device__ void dev2()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;
    if (threadIdx.x == 0)
        printf("dev2 : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];
    extern __shared__ int b[];
    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;
    dev1();
    __syncthreads();
    dev2();
    __syncthreads();
    res[threadIdx.x] = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];
    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;
    cudaMalloc((void**)&dres, 32 * sizeof(int));
    cudaMalloc((void**)&dres2, 32 * sizeof(int));
    kernel<<<1, 32, 32 * sizeof(float)>>>(dres, dres2);
    int hres[32];
    int hres2[32];
    cudaMemcpy(hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < 32; ++k)
    {
        printf("%d -- %d \n", hres[k], hres2[k]);
    }
    return 0;
}
Compiling this code, the ptxas info reports 384 bytes of smem: one array for the kernel's a array, a second for dev1's a array, and a third for dev2's a array, totalling 3*32*sizeof(int) = 384 bytes.
When running the kernel with dynamic shared memory equal to 32*sizeof(float), the pointer b starts right after these three arrays.
EDIT:
The PTX file generated by this code holds the declarations of statically-defined shared memory,
.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];
except for the entry point, where it is defined in the body of the method:
// _ZZ6kernelPiS_E1a has been demoted
The shared memory state space is defined in the PTX documentation here:
The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.
though with no detail on the runtime behavior. The programming guide mentions it here, with no further detail on mixing the two.
At PTX compilation time, the compiler may know the amount of shared memory that is statically allocated, but there might be some supplemental magic. Looking at the SASS, the first instructions use the SR_LMEMHIOFF special register:
IADD32I R1, R1, -0x8;
S2R R0, SR_LMEMHIOFF;
ISETP.GE.U32.AND P0, PT, R1, R0, PT;
and calling the functions in reverse order assigns different addresses to the statically-allocated shared memory (it looks very much like a form of stackalloc).
I believe the ptxas compiler calculates all the shared memory it might need in the worst case, when all methods may be called (when one of the methods is not used and function pointers are used instead, the address of b does not change, and the unallocated shared memory region is never accessed).
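A small runtime check (not part of the original answer) that reads back what ptxas settled on, using the kernel from the sample above and the device's per-block limit:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, kernel);
printf("static smem per block : %zu bytes\n", attr.sharedSizeBytes);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("smem limit per block  : %zu bytes\n", prop.sharedMemPerBlock);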
Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.
I am practicing an exercise on Array of Structs (AoS). A struct has been defined with and without __align__ like this:
#ifdef TESTALIGN8
struct __align__(8) InnerStruct {
    float x;
    float y;
};
#else
struct InnerStruct {
    float x;
    float y;
};
#endif
The test case is
__global__ void testGpuInnerStruct(InnerStruct *data, InnerStruct *result, const int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        result[idx].x = data[idx].x + 10;
        result[idx].y = data[idx].y + 20;
    }
}
The full file can be found in a gist.
Both cases were profiled with ncu-ui on a Quadro RTX 4000, and the Memory Workload Analysis looks like this:
Performance without __align__(8)
Performance with __align__(8)
Why is the L1 hit rate of the latter case 0%? In my mind, the minimum granularity of a load/store is 32 bytes, and sizeof(InnerStruct) is 8 bytes with or without the __align__(8) qualifier, so InnerStruct.x and InnerStruct.y should always be read in the same load, with or without the L1 cache. How does __align__ impact the performance like this?
Why is the L1 hit rate of the latter case 0%?
The __align__(8) directive allows the compiler to discover that it can convert 2 separate loads into a single load. The result is that in the non-decorated case the compiler generates 2 loads, and the 2nd load derives a benefit (cache hits) from the first load, whereas in the decorated case there is only one load instruction. Therefore there is no observed cache benefit.
For the non-decorated case, the compiler does something like this:
if (idx < n) {
    // result[idx].x = data[idx].x + 10;
    LDG R0, [data[idx].x];       // pulls two 128-byte L1 cachelines per warp: MISS L1, MISS L2
    FADD R1, R0, 10;
    STG [result[idx].x], R1;     // HIT L2
    // result[idx].y = data[idx].y + 20;
    LDG R0, [data[idx].y];       // benefits from the L1 cache: HIT L1
    FADD R1, R0, 20;
    STG [result[idx].y], R1;     // HIT L2
}
For the decorated case, the compiler does something like:
if (idx < n) {
    LDG.64 R0, [data[idx].x];    // nothing populated the cache prior to this: MISS L1, MISS L2
    // result[idx].x = data[idx].x + 10;
    FADD R0, R0, 10;
    // result[idx].y = data[idx].y + 20;
    FADD R1, R1, 20;
    STG.64 [result[idx].x], R0;  // HIT L2
}
Thus there is only one load instruction, which does not get any cache benefit.
In the non-decorated case, the compiler cannot assume that the struct is aligned to 8 bytes. It can only assume a 4-byte alignment (the natural alignment for the float type). If the struct only has 4-byte alignment (and not 8-byte alignment), then the LDG.64 instruction is not legal, because that instruction requires a "natural" 8-byte alignment. Therefore, in the non-decorated case, the compiler must use two 4-byte loads, because it cannot assume 8-byte alignment, whereas in the decorated case it knows that LDG.64 is legal, and so it uses it instead.
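Another way to see the same effect without the __align__(8) decoration (a hypothetical variant, not from the question) is to load the struct through a type that already carries 8-byte alignment, such as float2; the compiler can then emit the single 64-bit load directly:

__global__ void testGpuInnerStructVec(const float2 *data, float2 *result, const int n) {
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float2 v = data[idx];    // one LDG.64, since float2 is 8-byte aligned
        v.x += 10;
        v.y += 20;
        result[idx] = v;         // one STG.64
    }
}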
(Aside: I suspect your GPU is not actually a Quadro 4000, but instead maybe a Quadro RTX 4000, because the Quadro 4000 was a fermi-class GPU which is not supported by any recent version of CUDA, much less nsight compute.)
I believe my CUDA application could potentially benefit from shared memory, in order to keep the data near the GPU cores. Right now, I have a single kernel to which I pass a pointer to a previously allocated chunk of device memory, and some constants. After the kernel has finished, the device memory includes the result, which is copied to host memory. This scheme works perfectly and is cross-checked with the same algorithm run on the CPU.
The docs make it quite clear that global memory is much slower and has higher access latency than shared memory, but either way, to get the best performance you should coalesce and align your memory accesses. My GPU has Compute Capability 6.1 ("Pascal"), 48 kiB of shared memory per thread block, and 2 GiB of DRAM. If I refactor my code to use shared memory, how do I make sure to avoid bank conflicts?
Shared memory is organized into 32 banks, so that 32 threads from the same block may each simultaneously access a different bank without having to wait. Let's say I take the kernel from above, launch a kernel configuration with one block and 32 threads in that block, and statically allocate 48 kiB of shared memory outside the kernel. Also, each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on. Given this, I would access those 32 shared memory locations with an offset of 48 kiB / 32 banks / sizeof(double), which equals 192:
__shared__ double cache[6144];

__global__ void kernel(double *buf_out, double a, double b, double c)
{
    for(...)
    {
        // Perform calculation on shared memory
        cache[threadIdx.x * 192] = ...
    }
    // Write result to global memory
    buf_out[threadIdx.x] = cache[threadIdx.x * 192];
}
My reasoning: while threadIdx.x runs from 0 to 31, the offset together with cache being a double makes sure that each thread will access the first element of a different bank at the same time. I haven't gotten around to modifying and testing the code yet, but is this the right way to align access for the SM?
MWE added:
This is the naive CPU-to-CUDA port of the algorithm, using global memory only. Visual Profiler reports a kernel execution time of 10.3 seconds.
Environment: Win10, MSVC 2019, x64 Release Build, CUDA v11.2.
#include "cuda_runtime.h"
#include <iostream>
#include <stdio.h>
#define _USE_MATH_DEFINES
#include <math.h>
__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
double z, y, y_max;
unsigned int tid = threadIdx.x/* + blockIdx.x * blockDim.x*/;
double Z = tid * SCREEN_STEP_SIZE, Y = 0;
double temp = WAVE_NUMBER / SCREEN_DIST;
// Make sure the per-thread accumulator is zero before we begin
buf[tid] = 0;
for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
{
y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
{
buf[tid] += cos(temp * (Y * y + Z * z));
}
}
}
int main(void)
{
double *dev_mem;
double *buf = NULL;
cudaError_t cudaStatus;
unsigned int screen_elems = 1000;
if ((buf = (double*)malloc(screen_elems * sizeof(double))) == NULL)
{
printf("Could not allocate memory...");
return -1;
}
memset(buf, 0, screen_elems * sizeof(double));
if ((cudaStatus = cudaMalloc((void**)&dev_mem, screen_elems * sizeof(double))) != cudaSuccess)
{
printf("cudaMalloc failed with code %u", cudaStatus);
return cudaStatus;
}
kernel<<<1, 1000>>>(dev_mem, 1e-3, 5e-5, 50e-9, 10.0, 2 * M_PI / 5e-7);
cudaDeviceSynchronize();
if ((cudaStatus = cudaMemcpy(buf, dev_mem, screen_elems * sizeof(double), cudaMemcpyDeviceToHost)) != cudaSuccess)
{
printf("cudaMemcpy failed with code %u", cudaStatus);
return cudaStatus;
}
cudaFree(dev_mem);
cudaDeviceReset();
free(buf);
return 0;
}
The kernel below uses shared memory instead and takes approximately 10.6 seconds to execute, again measured in Visual Profiler:
__shared__ double cache[1000];

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    double z, y, y_max;
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    // Make sure the per-thread accumulator is zero before we begin
    cache[tid] = 0;
    for (z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            cache[tid] += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = cache[tid];
}
The innermost line inside the loops is typically executed several million times, depending on the five constants passed to the kernel. So instead of thrashing the off-chip global memory, I expected the on-chip shared-memory version to be much faster, but apparently it is not - what am I missing?
Let's say... each thread will only ever read from and write to the same single memory location in (shared) memory, which is specific to the algorithm I am working on.
In that case, it does not make sense to use shared memory. The whole point of shared memory is the sharing... among all threads in a block. Under your assumptions, you should keep your element in a register, not in shared memory. Indeed, in your "MWE Added" kernel - that's probably what you should do.
If your threads were to share information - then the pattern of this sharing would determine how best to utilize shared memory.
Also remember that if you don't read data repeatedly, or from multiple threads, it is much less likely that shared memory will help you - as you always have to read from global memory at least once and write to shared memory at least once to have your data in shared memory.
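For the "MWE added" kernel, that advice amounts to accumulating in a local variable (which the compiler will keep in a register) and touching global memory only once per thread at the end; a minimal sketch of that variant, keeping the original launch parameters:

__global__ void kernel(double *buf, double SCREEN_STEP_SIZE, double APERTURE_RADIUS,
                       double APERTURE_STEP_SIZE, double SCREEN_DIST, double WAVE_NUMBER)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    double Z = tid * SCREEN_STEP_SIZE, Y = 0;
    double temp = WAVE_NUMBER / SCREEN_DIST;
    double acc = 0.0;                        // per-thread accumulator stays in a register
    for (double z = -APERTURE_RADIUS; z <= APERTURE_RADIUS; z += APERTURE_STEP_SIZE)
    {
        double y_max = sqrt(APERTURE_RADIUS * APERTURE_RADIUS - z * z);
        for (double y = -y_max; y <= y_max; y += APERTURE_STEP_SIZE)
        {
            acc += cos(temp * (Y * y + Z * z));
        }
    }
    buf[tid] = acc;                          // single global write per thread
}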
I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float

// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads(); //Synchronize!
    *c = 0.00;
    // Does it need to be tid==0 that
    // undertakes this task?
    if (0 == threadIdx.x) {
        num_t sum = 0.00;
        int i;
        for (i=0; i<THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
        //WRONG: *c += sum; This read-write operation must be atomic!
    }
}

// Initialize the vectors:
void init_vector(num_t *x)
{
    int i;
    for (i=0 ; i<N ; i++){
        x[i] = 0.001 * i;
    }
}

// MAIN
int main(void)
{
    num_t *a, *b, *c;
    num_t *dev_a, *dev_b, *dev_c;
    size_t size = N * sizeof(num_t);

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    a = (num_t*)malloc(size);
    b = (num_t*)malloc(size);
    c = (num_t*)malloc(size);

    init_vector(a);
    init_vector(b);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);

    printf("a = [\n");
    int i;
    for (i=0;i<10;i++){
        printf("%g\n",a[i]);
    }
    printf("...\n");
    for (i=N-10;i<N;i++){
        printf("%g\n",a[i]);
    }
    printf("]\n\n");
    printf("a*b = %g.\n", *c);

    free(a); free(b); free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I'm getting significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I am inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on a different copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match MATLAB, more or less.
The fact that you have this line in your kernel, unprotected by any synchronization, is messing you up. Every thread of every block, whenever it happens to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
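A minimal sketch of that combination (assuming THREADS_PER_BLOCK is a power of two, and that *c is zeroed on the host before the launch as described above):

// Drop-in replacement for the kernel above: parallel in-block reduction,
// then a single atomicAdd per block.
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
    __shared__ num_t temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            temp[threadIdx.x] += temp[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(c, temp[0]);   // one atomicAdd per block instead of a serial loop
}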
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed per threadblock, since the threads in a threadblock are executing together to a large extent. Even within a threadblock, though, execution is not necessarily in perfect lockstep across all threads. Normally when we think of lockstep execution, we think of a warp, which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. If there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and if (threadIdx.x == 0), etc. But these mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore, to fix your issue, we had to zero out c prior to the kernel launch.
The CUDA manual specifies the number of 32-bit registers per multiprocessor. Does this mean that:
1. a double variable takes two registers?
2. a pointer variable takes two registers? It has to be more than one register on a Fermi with 6 GB of memory, right?
If the answer to question 2 is yes, it must be better to use fewer pointer variables and more int indices.
E.g., this kernel code:
float* p1;              // two regs
float* p2 = p1 + 1000;  // two regs
int i;                  // one reg
for ( i = 0; i < n; i++ )
{
    CODE THAT USES p1[i] and p2[i]
}
theoretically requires more registers than this kernel code:
float* p1; // two regs
int i; // one reg
int j; // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
CODE THAT USES p1[i] and p1[j]
}
The short answers to your three questions are:
1. Yes.
2. Yes, if the code is compiled for a 64-bit host operating system. Device pointer size always matches host application pointer size in CUDA.
3. No.
To expand on point 3, consider the following two simple memory copy kernels:
__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    for(int j=0; j<n; j++) {
        out[i+j] = in[i+j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    float *x = in + i;
    float *y = out + i;
    for(int j=0; j<n; j++, x++, y++) {
        *y = *x;
    }
}
By your reckoning, debunk must use fewer registers because it has only two local integer variables, whereas debunk2 has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:
$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v" pointer_size.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info : Function properties for _Z6debunkPfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
ptxas info : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info : Function properties for _Z7debunk2PfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
They compile to the exact same register count. And if you disassemble the toolchain output you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:
Trying to determine the register count from C code (or even PTX assembler) is mostly futile
Trying to second guess a very sophisticated compiler and assembler is also mostly futile.
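If you do want a ground-truth number rather than a guess, the register count chosen by ptxas can also be read back at runtime; a small sketch using the two kernels above:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, debunk);
printf("debunk  : %d registers\n", attr.numRegs);
cudaFuncGetAttributes(&attr, debunk2);
printf("debunk2 : %d registers\n", attr.numRegs);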