I just noticed that my CUDA kernel uses exactly twice the shared memory that 'theory' predicts. For example:
__global__ void foo()
{
    __shared__ double t;
    t = 1;
}
PTX info shows:
ptxas info : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]
But the size of a double is only 8 bytes.
Another example:
__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}
ptxas info : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]
Could someone explain why?
It seems that the problem has gone away in the current CUDA compiler.
__shared__ int a[1024];
compiled with command 'nvcc -m64 -Xptxas -v -ccbin /opt/gcc-4.6.3/bin/g++-4.6.3 shmem.cu' gives
ptxas info : Used 1 registers, 4112 bytes smem, 4 bytes cmem[1]
There is some shared memory overhead in this case, but the usage is not doubled.
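For what it's worth, the numbers reported by ptxas can also be cross-checked at runtime through cudaFuncGetAttributes; here is a minimal sketch, reusing the first foo kernel above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ double t;
    t = 1;
}

int main()
{
    // sharedSizeBytes is the statically allocated shared memory of the kernel,
    // numRegs the per-thread register count.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, foo);
    printf("static smem: %zu bytes, registers: %d\n",
           attr.sharedSizeBytes, attr.numRegs);
    return 0;
}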
Related
Suppose I have two __device__ CUDA functions, each having the following local variable:
__shared__ int a[123];
and another function (say it's my kernel, i.e. a __global__ function), with:
extern __shared__ int b[];
Is this explicitly allowed/forbidden by nVIDIA? (I don't see it in the programming guide section B.2.3 on __shared__.) Do the sizes all count together towards the shared memory limit, or is the limit based on the maximum possibly in use at any single time? Or some other rule?
This can be considered a follow-up question to this one.
The shared memory is split in two parts: statically allocated and dynamically allocated. The first part is calculated during compilation, and each declaration is an actual allocation - enabling the ptxas info output during compilation illustrates it here:
ptxas info : Used 22 registers, 384 bytes smem, 48 bytes cmem[0]
Here, we have 384 bytes, which is 3 arrays of 32 ints (see the sample code below).
Since Kepler, you may pass a pointer to shared memory to another function, allowing a device sub-function to access another shared memory declaration.
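For illustration, a minimal sketch of that pointer-passing pattern (the names fill and consumer are made up for this example):

__device__ void fill(int *s)
{
    // s points into the caller's shared array
    s[threadIdx.x] = threadIdx.x;
}

__global__ void consumer(int *out)
{
    __shared__ int buf[32];
    fill(buf);                       // pass a pointer to shared memory
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}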
Then comes the dynamically allocated shared memory, whose reserved size is declared in the kernel launch configuration.
Here is an example of various uses in a couple of functions. Note the pointer value of each shared memory region.
#include <cstdio>
#include <cuda_runtime.h>

__device__ void dev1()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x;
    if (threadIdx.x == 0)
        printf("dev1 : %p\n", (void*)a);
}

__device__ void dev2()
{
    __shared__ int a[32];
    a[threadIdx.x] = threadIdx.x * 5;
    if (threadIdx.x == 0)
        printf("dev2 : %p\n", (void*)a);
}

__global__ void kernel(int* res, int* res2)
{
    __shared__ int a[32];
    extern __shared__ int b[];
    a[threadIdx.x] = 0;
    b[threadIdx.x] = threadIdx.x * 3;
    dev1();
    __syncthreads();
    dev2();
    __syncthreads();
    res[threadIdx.x] = a[threadIdx.x];
    res2[threadIdx.x] = b[threadIdx.x];
    if (threadIdx.x == 0)
        printf("global a : %p\n", (void*)a);
    if (threadIdx.x == 0)
        printf("global b : %p\n", (void*)b);
}

int main()
{
    int* dres;
    int* dres2;
    cudaMalloc((void**)&dres, 32 * sizeof(int));
    cudaMalloc((void**)&dres2, 32 * sizeof(int));
    kernel<<<1, 32, 32 * sizeof(float)>>>(dres, dres2);
    int hres[32];
    int hres2[32];
    cudaMemcpy(hres, dres, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hres2, dres2, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < 32; ++k)
    {
        printf("%d -- %d\n", hres[k], hres2[k]);
    }
    return 0;
}
This code outputs the ptxas info of 384 bytes smem: one array for the kernel's a array, a second for the dev1 a array, and a third for the dev2 a array, totalling 3 * 32 * sizeof(int) = 384 bytes.
When the kernel is run with dynamic shared memory equal to 32 * sizeof(float), the pointer b starts right after these three arrays.
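As an aside, there is only one dynamically allocated region per launch, so if you need several dynamically-sized arrays they have to be carved out of that single region by hand. A rough sketch (the kernel name carve and the sizes nA and nB are made up for this illustration):

__global__ void carve(int nA, int nB)
{
    extern __shared__ char smem[];
    // Both arrays live in the single extern region; offsets are computed manually.
    int   *arrA = reinterpret_cast<int *>(smem);
    float *arrB = reinterpret_cast<float *>(smem + nA * sizeof(int));
    if (threadIdx.x < nA) arrA[threadIdx.x] = threadIdx.x;
    if (threadIdx.x < nB) arrB[threadIdx.x] = 1.0f;
}
// launched as: carve<<<grid, block, nA * sizeof(int) + nB * sizeof(float)>>>(nA, nB);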
EDIT:
The PTX file generated from this code holds the declarations of the statically-defined shared memory:
.shared .align 4 .b8 _ZZ4dev1vE1a[128];
.shared .align 4 .b8 _ZZ4dev2vE1a[128];
.extern .shared .align 4 .b8 b[];
except for the entry point, where the array is declared in the body of the method:
// _ZZ6kernelPiS_E1a has been demoted
The shared state space is defined in the PTX documentation here:
The shared (.shared) state space is a per-CTA region of memory for threads in a CTA to share data. An address in shared memory can be read and written by any thread in a CTA. Use ld.shared and st.shared to access shared variables.
though with no detail on the runtime behavior. There is a word on it in the programming guide here, with no further detail on how the two are mixed.
During PTX compilation, the compiler may know the amount of shared memory that is statically allocated. There might be some supplemental magic. Looking at the SASS, the first instructions use SR_LMEMHIOFF:
1 IADD32I R1, R1, -0x8;
2 S2R R0, SR_LMEMHIOFF;
3 ISETP.GE.U32.AND P0, PT, R1, R0, PT;
and calling the functions in reverse order assigns different locations to the statically-allocated shared memory (which looks very much like a form of stackalloc).
I believe ptxas calculates all the shared memory it might need in the worst case, when all the methods may be called (when one of the methods is not used but the calls go through function pointers, the address of b does not change, and the unallocated shared memory region is never accessed).
Finally, as einpoklum suggests in a comment, this is experimental and not part of a norm/API definition.
My running config:
- CUDA Toolkit 5.5
- NVidia Nsight Eclipse edition
- Ubuntu 12.04 x64
- CUDA device is NVidia GeForce GTX 560: cc=20, sm=21 (as you can see I can use blocks up to 1024 threads)
I render my display on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.
However, I encountered some weird behaviour when I set threads > 960.
Code:
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Here I run my kernel
    mytest<<<1, 961>>>();
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Done\n");
    return 0;
}
And... it doesn't work. The problem is the last line of the kernel, with the float division. Every time I try to divide by a float, my code compiles but doesn't work. The output error at runtime is:
error=too many resources requested for launch
Here's what I get in the debugger when I step over it:
warning: Cuda API error detected: cudaLaunch returned (0x7)
Build output using -Xptxas -v:
12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -x cu -o "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
ptxas info : 4 bytes gmem, 8 bytes cmem[14]
ptxas info : Function properties for _ZN4dim3C1Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info : Function properties for _Z6mytestv
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info : Function properties for _ZN4dim3C2Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu
Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o "block_size_test" ./src/vectorAdd.o
Finished building target: block_size_test
12:57:41 Build Finished (took 1s.659ms)
When I add the -keep flag, the compiler generates a .cubin file, but I can't read it to find out the smem and reg values, following this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-/. At least nowadays this file must have some different format.
Therefore I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.
Anyway, any help will be appreciated.
I filled in the CUDA Occupancy Calculator with the current information:
Compute capability : 2.1
Threads per block : 961
Registers per thread : 34
Shared memory : 0
I got 0% occupancy, limited by the register count.
If you set the number of threads to 960, you get 63% occupancy, which explains why it works.
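Roughly, the register arithmetic behind this (treat it as an estimate, using the CC 2.x allocation rules from the spreadsheet): 961 threads round up to 31 warps, and 34 registers x 32 threads = 1088 registers per warp, already a multiple of the 64-register allocation unit, so the block needs 31 x 1088 = 33728 registers, which exceeds the 32768 available per multiprocessor and the launch fails. With 960 threads (30 warps) the total is 32640, which just fits.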
Try limiting the register count to 32 and setting the number of threads to 1024 to get 67% occupancy.
To limit the register count, use the following option:
nvcc [...] --maxrregcount=32
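Alternatively, if you prefer a per-kernel limit over the global flag, __launch_bounds__ is the usual mechanism; here is a sketch applied to the mytest kernel above:

// Tells the compiler the kernel must be launchable with 1024 threads per block,
// so it limits per-thread register usage (spilling if necessary) to meet that bound.
__global__ void __launch_bounds__(1024) mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}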
The CUDA manual specifies the number of 32-bit registers per multiprocessor. Does this mean that:
1. A double variable takes two registers?
2. A pointer variable takes two registers? It has to be more than one register on a Fermi with 6 GB of memory, right?
3. If the answer to question 2 is yes, is it better to use fewer pointer variables and more int indices?
E.g., this kernel code:
float* p1; // two regs
float* p2 = p1 + 1000; // two regs
int i; // one reg
for ( i = 0; i < n; i++ )
{
    CODE THAT USES p1[i] and p2[i]
}
theoretically requires more registers than this kernel code:
float* p1; // two regs
int i; // one reg
int j; // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
    CODE THAT USES p1[i] and p1[j]
}
The short answers to your three questions are:
1. Yes.
2. Yes, if the code is compiled for a 64-bit host operating system. Device pointer size always matches host application pointer size in CUDA.
3. No.
To expand on point 3, consider the following two simple memory copy kernels:
__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x * blockDim.x);
    for (int j = 0; j < n; j++) {
        out[i + j] = in[i + j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x * blockDim.x);
    float *x = in + i;
    float *y = out + i;
    for (int j = 0; j < n; j++, x++, y++) {
        *y = *x;
    }
}
By your reckoning, debunk must use fewer registers because it has only two local integer variables, whereas debunk2 has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:
$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v" pointer_size.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info : Function properties for _Z6debunkPfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
ptxas info : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info : Function properties for _Z7debunk2PfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
They compile to exactly the same register count. And if you disassemble the toolchain output (see the command after these rules), you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:
Trying to determine the register count from C code (or even PTX assembler) is mostly futile.
Trying to second-guess a very sophisticated compiler and assembler is also mostly futile.
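For reference, the disassembly can be dumped from the object file produced by the compile command above (assuming it is still named pointer_size.o):

$ cuobjdump -sass pointer_size.o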
A CUDA kernel compiled with the option --ptxas-options=-v seems to display erroneous lmem (local memory) statistics when the sm_20 GPU architecture is specified. The same code gives meaningful lmem statistics with the sm_10 / sm_11 / sm_12 / sm_13 architectures.
Can someone clarify whether the sm_20 lmem statistics need to be read differently, or whether they are plain wrong?
Here is the kernel:
__global__ void fooKernel( int* dResult )
{
    const int num = 1000;
    int val[num];

    for ( int i = 0; i < num; ++i )
        val[i] = i * i;

    int result = 0;
    for ( int i = 0; i < num; ++i )
        result += val[i];

    *dResult = result;
    return;
}
--ptxas-options=-v and sm_20 report:
1>ptxas info : Compiling entry function '_Z9fooKernelPi' for 'sm_20'
1>ptxas info : Used 5 registers, 4+0 bytes lmem, 36 bytes cmem[0]
--ptxas-options=-v and sm_10 / sm_11 / sm_12 / sm_13 report:
1>ptxas info : Compiling entry function '_Z9fooKernelPi' for 'sm_10'
1>ptxas info : Used 3 registers, 4000+0 bytes lmem, 4+16 bytes smem, 4 bytes cmem[1]
sm_20 reports an lmem of 4 bytes, which is simply not possible if you look at the 4 x 1000 byte array used in the kernel. The older GPU architectures report the correct 4000-byte lmem statistic.
This was tried with CUDA 3.2. I have referred to the Printing Code Generation Statistics section of the NVCC manual (v3.2), but it does not help explain this anomaly.
The compiler is correct. Through clever optimization the array doesn't need to be stored: what you compute is essentially result += i * i, without ever storing the temporaries in val.
A look at the generated PTX code won't show any differences for sm_10 vs. sm_20. Decompiling the generated cubins with decuda will reveal the optimization.
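To see a non-trivial lmem figure for sm_20 as well, you can prevent the optimization, for example by making the indexing depend on a runtime value. A rough sketch (the extra idx parameter is made up for this illustration); with runtime-dependent indexing the array can no longer live entirely in registers, so it should show up as local memory:

__global__ void fooKernel2( int* dResult, int idx )
{
    const int num = 1000;
    int val[num];

    // Runtime-dependent indexing keeps the compiler from
    // collapsing the array into registers.
    for ( int i = 0; i < num; ++i )
        val[(i + idx) % num] = i * i;

    int result = 0;
    for ( int i = 0; i < num; ++i )
        result += val[(i + idx) % num];

    *dResult = result;
}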
BTW: Try to avoid local memory! It is as slow as global memory.