CUDA can't use all available constant memory

I have some code that uses cooperative groups to perform some operations. Therefore I compile my code with:
/usr/local/cuda/bin/nvcc -arch=sm_61 -gencode=arch=compute_61,code=sm_61, --device-c -g -O2 foo.cu
Then I try to invoke the device linker:
/usr/local/cuda/bin/nvcc -arch=sm_61 -gencode=arch=compute_61,code=sm_61, -g -dlink foo.o
It then yields the error:
ptxas error : File uses too much global constant data (0x10100 bytes, 0x10000 max)
The problem is caused by the way I allocated constant memory:
__constant__ float d_cnst_centers[CONST_MEM / sizeof(float)];
where CONST_MEM = 65536 bytes, which I got from a device query for SM_61. However, if I reduce the constant memory to something like 64536 bytes, the problem disappears. It's almost as if some of the constant memory is "reserved" for other purposes during compilation. I've searched through the CUDA documentation but didn't find a satisfying answer. Is it safe to use the maximum amount of constant memory available to you? Why would this problem happen?
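For reference, the 65536 figure is what a device query along these lines reports (a minimal sketch; device 0 is assumed to be the SM_61 card):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    // totalConstMem reports 65536 bytes on this device
    printf("total constant memory: %zu bytes\n", prop.totalConstMem);
    return 0;
}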
EDIT: this is the code snippet that triggers the error on SM_61:
#include <algorithm>
#include <vector>
#include <type_traits>
#include <cuda_runtime.h>
#include <cfloat>
#include <iostream>
#include <cooperative_groups.h>
using namespace cooperative_groups;
struct foo_params {
    float * points;
    float * centers;
    int * centersDist;
    int * centersIndex;
    int numPoints;
};

__constant__ float d_cnst_centers[65536 / sizeof(float)];

template <int R, int C>
__device__ int
nearestCenter(float * points, float * pC) {
    float mindist = FLT_MAX;
    int minidx = 0;
    int clistidx = 0;
    for (int i = 0; i < C; i++) {
        clistidx = i * R;
        float dist;
        {
            float *point = points;
            float *center = &pC[clistidx];
            float accum = 0.0f;
            for (int i = 0; i < R; i++) {
                float delta = point[i] - center[i];
                accum += delta * delta;
            }
            dist = sqrt(accum);
        }
        /* ... */
    }
    return minidx;
}

template<int R, int C, bool bRO, bool ROWMAJ=true>
__global__ void getNeatestCenter(struct foo_params params) {
    float * points = params.points;
    float * centers = params.centers;
    int * centersDist = params.centersDist;
    int * centersIndex = params.centersIndex;
    int numPoints = params.numPoints;
    grid_group grid = this_grid();
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numPoints) {
            centersIndex[idx] = nearestCenter<R,C>(&points[idx*R], d_cnst_centers);
        }
    }
    /* ... other code */
}

int main () {
    // foo parameters, for illustration purposes
    struct foo_params param;
    param.points = NULL;
    param.centers = NULL;
    param.centersDist = NULL;
    param.centersIndex = NULL;
    param.numPoints = 1000000;

    void *p_params = &param;

    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &blockSize,
        (void*)getNeatestCenter<128, 64, true>,
        0,
        0);
    dim3 dimGrid(minGridSize, 1, 1), dimBlock(blockSize, 1, 1);
    cudaLaunchCooperativeKernel((void *)getNeatestCenter<32, 32, true>, dimGrid, dimBlock, &p_params);
}
The problem seems to be caused by the line:
grid_group grid = this_grid();
which seems to use approximately 0x100 bytes of constant memory, for no obvious reason.

This answer is speculative, because a minimal but complete repro case was not provided by the OP.
GPUs contain multiple constant memory banks used for different parts of program storage. One of those banks is for use by the programmer. Importantly, CUDA standard math library code uses the same bank, because the math library code becomes part of the programmer's code by function inlining. In the past, this was blatantly obvious, as the entire CUDA math library initially was just a couple of header files.
Some math functions need small tables of constant data internally. Particular examples are sin, cos, tan. When these math functions are used, the amount of __constant__ data available to programmers is reduced from 64KB by a small amount. Here are some example programs for demonstration purposes, compiled with the CUDA 8 toolchain and -arch=sm_61:
#include <stdio.h>
#include <stdlib.h>
#define CONST_MEM (65536)
__constant__ float d_cnst_centers[CONST_MEM / sizeof(float)] = {1};
__global__ void kernel (int i, float f)
{
    float r = d_cnst_centers[i] * expf(f);
    printf ("r=%15.8f\n", r);
}

int main (void)
{
    kernel<<<1,1>>>(0,25.0f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
This compiles fine and prints r=72004902912.00000000 at run time. Now let's change expf into sinf:
#include <stdio.h>
#include <stdlib.h>
#define CONST_MEM (65536)
__constant__ float d_cnst_centers[CONST_MEM / sizeof(float)] = {1};
__global__ void kernel (int i, float f)
{
    float r = d_cnst_centers[i] * sinf(f);
    printf ("r=%15.8f\n", r);
}

int main (void)
{
    kernel<<<1,1>>>(0,25.0f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
This throws an error during compilation:
ptxas error : File uses too much global constant data (0x10018 bytes, 0x10000 max)
If we use the double-precision function sin instead, even more constant memory is needed:
#include <stdio.h>
#include <stdlib.h>
#define CONST_MEM (65536)
__constant__ float d_cnst_centers[CONST_MEM / sizeof(float)] = {1};
__global__ void kernel (int i, float f)
{
    float r = d_cnst_centers[i] * sin((double)f);
    printf ("r=%15.8f\n", r);
}

int main (void)
{
    kernel<<<1,1>>>(0,25.0f);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
We get the error message:
ptxas error : File uses too much global constant data (0x10110 bytes, 0x10000 max)
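As an aside, if the reduced accuracy is acceptable, the hardware intrinsic __sinf (or compiling with -use_fast_math) should avoid the math library's constant-memory tables, leaving the full 64KB available; a minimal sketch of the same kernel using the intrinsic:
__global__ void kernel (int i, float f)
{
    // __sinf maps to the hardware approximation instruction, so no
    // argument-reduction table in constant memory is needed (at lower accuracy)
    float r = d_cnst_centers[i] * __sinf(f);
    printf ("r=%15.8f\n", r);
}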

In order to document what exactly is happening in this use case, I have cobbled together the following walk-through of the compilation process. Hopefully it will shed some light on how this problem arises, point at some useful diagnostic tools, and dispel a few misconceptions at the same time.
Note this is a work in progress and may be updated periodically as more information comes to light. Please edit and contribute as you see fit.
To start, as noted in comments, it is perfectly possible to allocate every byte of constant memory up to the 64KB limit. This example is pretty much the use case described in the original question:
const int sz = 65536;
const int NMax = sz / sizeof(float);
__constant__ float buffer[NMax];
__global__
void akernel(const float* __restrict__ arg1, float* __restrict__ arg2, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        float ans = 0;
#pragma unroll 128
        for (int i = 0; i < NMax; i++) {
            float val = buffer[i];
            float y = (i % 2 == 0) ? 1.f : -1.f;
            float x = val / 255.f;
            ans = ans + y * sinf(x);
        }
        arg2[tid] = ans + arg1[tid];
    }
}
and it compiles without a problem (Godbolt link here). This proves that the linker phase in the question must be pulling in additional constant memory allocations from other code, whether that is user code, other device libraries, or device runtime support.
So let's turn our attention to the repro case posted in the updated question, mildly modified so that it will pass the compilation and link phase by reducing the constant memory footprint slightly, with a buffer of 64536 bytes:
$ nvcc -arch=sm_61 --device-c -g -O2 -Xptxas="-v" -o constmemuse.cu.o constmemuse.cu
constmemuse.cu(51): warning: variable "centers" was declared but never referenced
constmemuse.cu(52): warning: variable "centersDist" was declared but never referenced
constmemuse.cu(31): warning: variable "dist" was set but never used
detected during instantiation of "void getNeatestCenter<R,C,bRO,ROWMAJ>(foo_params) [with R=128, C=64, bRO=true, ROWMAJ=true]"
constmemuse.cu(26): warning: variable "mindist" was declared but never referenced
detected during instantiation of "void getNeatestCenter<R,C,bRO,ROWMAJ>(foo_params) [with R=128, C=64, bRO=true, ROWMAJ=true]"
ptxas info : 0 bytes gmem, 64536 bytes cmem[3]
ptxas info : Function properties for cudaDeviceGetAttribute
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z16getNeatestCenterILi128ELi64ELb1ELb1EEv10foo_params' for 'sm_61'
ptxas info : Function properties for _Z16getNeatestCenterILi128ELi64ELb1ELb1EEv10foo_params
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 360 bytes cmem[0]
ptxas info : Function properties for cudaMalloc
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaGetDevice
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z16getNeatestCenterILi32ELi32ELb1ELb1EEv10foo_params' for 'sm_61'
ptxas info : Function properties for _Z16getNeatestCenterILi32ELi32ELb1ELb1EEv10foo_params
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 5 registers, 360 bytes cmem[0]
ptxas info : Function properties for cudaFuncGetAttributes
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
A few points:
64536 bytes cmem[3] shows the size of the user-controllable constant memory bank, as we specified it
ptxas info : Used 5 registers, 360 bytes cmem[0] shows the register usage of the function, and cmem[0] is the internally reserved constant memory bank which is used for holding kernel arguments and anything else the compiler places in constant memory. Note that register spilling goes to local memory, not constant memory.
So now let's run the device linking step:
$ nvcc -arch=sm_61 -gencode=arch=compute_61,code=sm_61, -g -dlink -Xnvlink="-v" -o constmemuse.o constmemuse.cu.o
nvlink info : 9944 bytes gmem, 64792 bytes cmem[3] (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 20 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 23 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 28 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 23 registers, 0 stack, 2056 bytes smem, 448 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 10 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 12 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 17 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 14 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 4 bytes cmem[2], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_': (target: sm_61)
nvlink info : used 16 registers, 0 stack, 2056 bytes smem, 416 bytes cmem[0], 4 bytes cmem[2], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 16 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 14 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 17 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 8 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 11 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 12 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 4 bytes cmem[2], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 11 registers, 0 stack, 0 bytes smem, 400 bytes cmem[0], 4 bytes cmem[2], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '__nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_': (target: sm_61)
nvlink info : used 21 registers, 0 stack, 0 bytes smem, 424 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '_Z16getNeatestCenterILi32ELi32ELb1ELb1EEv10foo_params': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 360 bytes cmem[0], 0 bytes lmem (target: sm_61)
nvlink info : Function properties for '_Z16getNeatestCenterILi128ELi64ELb1ELb1EEv10foo_params': (target: sm_61)
nvlink info : used 6 registers, 0 stack, 0 bytes smem, 360 bytes cmem[0], 0 bytes lmem (target: sm_61)
Some more remarks:
9944 bytes gmem, 64792 bytes cmem[3] now shows the global and constant memory reservations for the linked module. As you can see, we have inherited 256 additional bytes in constant bank 3, the user-controllable bank, plus 9944 bytes of statically reserved global memory. If the array allocation had been 65536 bytes, as in the question, the link would fail because this would exceed the 64KB limit.
You can see that a number of device runtime library functions have been linked automagically during the linkage phase (memcpy and memset)
It is clear that the additional constant memory usage comes from linking in the device runtime; this can be confirmed with cuobjdump post hoc. The object from compilation:
$ cuobjdump -res-usage constmemuse.cu.o
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
Resource usage:
Common:
GLOBAL:0 CONSTANT[3]:64536
Function cudaDeviceGetAttribute:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _Z16getNeatestCenterILi128ELi64ELb1ELb1EEv10foo_params:
REG:5 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:360 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaMalloc:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaOccupancyMaxActiveBlocksPerMultiprocessor:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaGetDevice:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _Z16getNeatestCenterILi32ELi32ELb1ELb1EEv10foo_params:
REG:5 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:360 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaFuncGetAttributes:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags:
REG:5 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
Fatbin ptx code:
================
arch = sm_61
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
ptxasOptions = -v --compile-only
and the object after linking:
$ cuobjdump -res-usage constmemuse.o
Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
Resource usage:
Common:
GLOBAL:9944 CONSTANT[3]:64792
Function _Z16getNeatestCenterILi128ELi64ELb1ELb1EEv10foo_params:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:360 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _Z16getNeatestCenterILi32ELi32ELb1ELb1EEv10foo_params:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:360 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:21 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:11 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 CONSTANT[2]:4 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:12 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 CONSTANT[2]:4 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:11 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi0ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:8 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceIjLi1ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:400 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:17 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:14 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi0ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:16 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi0ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi0ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi1ELi0EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memset_3d_deviceImLi1ELi1ELi1EEvPhhjT_S1_S1_S1_S1_jjjjjjjS1_S0_:
REG:6 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:424 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:16 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 CONSTANT[2]:4 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:14 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 CONSTANT[2]:4 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:17 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi0ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:12 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceIjLi1ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:416 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:23 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:28 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:23 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi0ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:20 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi0ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi0ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi1ELi0EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function __nv_static_51__38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37__Z16memcpy_3d_deviceImLi1ELi1ELi1EEvPKhPhT_S3_S3_S3_S3_S3_S3_jjjjjjjjS3_S1_S2_:
REG:10 STACK:0 SHARED:2056 LOCAL:0 CONSTANT[0]:448 TEXTURE:0 SURFACE:0 SAMPLER:0
Function cudaCGGetIntrinsicHandle:
REG:6 STACK:0 SHARED:0 LOCAL:0 TEXTURE:0 SURFACE:0 SAMPLER:0
It has been demonstrated in the accepted answer that the math library can reserve constant memory for coefficients and lookup tables for some trigonometric and transcendental functions. However, in this case, the cause seems to be the support boilerplate emitted by the use of cooperative groups in the kernel. Delving further into the exact source of the additional bank 3 constant memory would require disassembly and reverse engineering of that code, which I am not going to do for now.
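In practical terms, the most robust approach is probably just to leave a little headroom below the 64KB limit for whatever the device runtime reserves. A minimal sizing sketch, where the 1024-byte margin is an assumption rather than a documented figure (the observed reservation here was only 0x100 bytes):
// Keep the user __constant__ array below the 64KB bank limit, leaving room
// for reservations made by the device runtime / libraries.
// The 1024-byte margin is a guess, not a documented figure.
constexpr size_t kConstBankBytes  = 65536;   // cmem[3] size on sm_61
constexpr size_t kRuntimeHeadroom = 1024;    // assumed safety margin
__constant__ float d_cnst_centers[(kConstBankBytes - kRuntimeHeadroom) / sizeof(float)];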

Related

CUDA thread block size 1024 doesn't work (cc=20, sm=21)

My running config:
- CUDA Toolkit 5.5
- NVidia Nsight Eclipse edition
- Ubuntu 12.04 x64
- CUDA device is NVidia GeForce GTX 560: cc=20, sm=21 (as you can see I can use blocks up to 1024 threads)
I render my display on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.
However, I encountered some weird behaviour when I set threads > 960.
Code:
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Here I run my kernel
    mytest<<<1, 961>>>();
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Done\n");
    return 0;
}
And... it doesn't work. The problem is in the last line of the kernel, the float division. Every time I try to divide by a float, my code compiles but doesn't work. The output error at runtime is:
error=too many resources requested for launch
Here's what I get in debug, when I step it over:
warning: Cuda API error detected: cudaLaunch returned (0x7)
Build output using -Xptxas -v:
12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -x cu -o "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
ptxas info : 4 bytes gmem, 8 bytes cmem[14]
ptxas info : Function properties for _ZN4dim3C1Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info : Function properties for _Z6mytestv
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info : Function properties for _ZN4dim3C2Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu
Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o "block_size_test" ./src/vectorAdd.o
Finished building target: block_size_test
12:57:41 Build Finished (took 1s.659ms)
When I add the -keep flag, the compiler generates a .cubin file, but I can't read it to find out the values of smem and reg, following this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-/. At least nowadays this file must have a different format.
Therefore I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.
Anyway. Any help will be appreciated.
I filled the CUDA Occupancy calculator file with the current information:
Compute capability : 2.1
Threads per block : 961
Registers per thread : 34
Shared memory : 0
I got 0% occupancy, limited by the register count: registers are allocated per warp on this architecture, so 961 threads round up to 31 warps, and 31 × 32 × 34 ≈ 33.7K registers exceeds the 32K-register file of a CC 2.1 SM, so not even one block fits.
If you set the number of threads to 960 (30 warps), the allocation fits and you get 63% occupancy, which explains why it works.
Try limiting the register count to 32 and setting the number of threads to 1024 to get 67% occupancy.
To limit the register count, use the following option:
nvcc [...] --maxrregcount=32
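Alternatively, the register limit can be requested per kernel rather than globally with __launch_bounds__; a minimal sketch (the compiler chooses the exact register count, so the result may differ from 32):
// Ask the compiler to make this kernel launchable with up to 1024 threads
// per block; it will cap register usage accordingly, spilling if necessary.
__global__ void __launch_bounds__(1024) mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}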

When to use volatile with register/local variables

What is the meaning of declaring register arrays in CUDA with the volatile qualifier?
When I tried the volatile keyword with a register array, it removed the register spills to local memory (i.e., it seems to force CUDA to use registers instead of local memory). Is this the intended behavior?
I did not find any information about the use of volatile with register arrays in the CUDA documentation.
Here is the ptxas -v output for both versions
With volatile qualifier
__volatile__ float array[32];
ptxas -v output
ptxas info : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info : Function properties for _Z2swPcS_PfiiiiS0_
88 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 47 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
Without volatile qualifier
float array[32];
ptxas -v output
ptxas info : Compiling entry function '_Z2swPcS_PfiiiiS0_' for 'sm_20'
ptxas info : Function properties for _Z2swPcS_PfiiiiS0_
96 bytes stack frame, 100 bytes spill stores, 108 bytes spill loads
ptxas info : Used 51 registers, 16640 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
The volatile qualifier specifies to the compiler that all references to a variable (read or write) should result in a memory reference and those references must be in the order specified in the program. The use of the volatile qualifier is illustrated in Chapter 12 of the Shane Cook book, "CUDA Programming".
The use of volatile will prevent some optimizations the compiler can do and so changes the number of registers used. The best way to understand what volatile is actually doing is to disassemble the relevant __global__ function with and without the qualifier.
Indeed, consider the following kernel functions:
__global__ void volatile_test() {
    volatile float a[3];
    for (int i = 0; i < 3; i++) a[i] = (float)i;
}

__global__ void no_volatile_test() {
    float a[3];
    for (int i = 0; i < 3; i++) a[i] = (float)i;
}
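Such a disassembly can be obtained, for example, by compiling the two kernels to a cubin and running cuobjdump on it (the file names here are placeholders):
$ nvcc -arch=sm_20 -cubin -o volatile_test.cubin volatile_test.cu
$ cuobjdump -sass volatile_test.cubin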
Disassembling the above kernel functions, one obtains:
code for sm_20
Function : _Z16no_volatile_testv
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ EXIT ; /* 0x8000000000001de7 */
Function : _Z13volatile_testv
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ ISUB R1, R1, 0x10; /* 0x4800c00040105d03 */ R1 = address of a[0]
/*0010*/ MOV32I R2, 0x3f800000; /* 0x18fe000000009de2 */ R2 = 1
/*0018*/ MOV32I R0, 0x40000000; /* 0x1900000000001de2 */ R0 = 2
/*0020*/ STL [R1], RZ; /* 0xc8000000001fdc85 */ a[0] = 0;
/*0028*/ STL [R1+0x4], R2; /* 0xc800000010109c85 */ a[1] = R2 = 1;
/*0030*/ STL [R1+0x8], R0; /* 0xc800000020101c85 */ a[2] = R0 = 2;
/*0038*/ EXIT ; /* 0x8000000000001de7 */
As you can see, when NOT using the volatile keyword, the compiler realizes that a is set but never used (indeed, the compiler returns the following warning: variable "a" was set but never used) and there is practically no disassembled code.
In contrast, when using the volatile keyword, all references to a are translated into memory references (writes in this case).

CUDA_ERROR_INVALID_IMAGE during cuModuleLoad

I've created a very simple kernel (can be found here) which I successfully compile using
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\bin\nvcc.exe" --cl-version 2012 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.5\include" -cudart static -cubin temp.cu
and subsequently use the following code to load the kernel in
CUresult err = cuInit(0);
CUdevice device;
err = cuDeviceGet(&device, 0);
CUcontext ctx;
err = cuCtxCreate(&ctx, 0, device);
CUmodule module;
string path = string(dir) + "\\temp.cubin";
err = cuModuleLoad(&module, path.c_str());
cuCtxDetach(ctx);
Unfortunately, during cuModuleLoad I get a result of CUDA_ERROR_INVALID_IMAGE. Can someone tell me why this could be happening? The kernel's valid and compiles without issues.
The CUDA_ERROR_INVALID_IMAGE error should only be returned by cuModuleLoad when the module file is invalid. If it is missing or contains an architecture mismatch you should probably see a CUDA_ERROR_FILE_NOT_FOUND or CUDA_ERROR_INVALID_SOURCE error. You haven't given us enough details or code to say for certain what is happening, but in principle at least, the API code you have should work.
To show how this should work, consider the following working example on Linux with CUDA 5.5:
First your kernel:
#include <cmath>
using namespace std;
__device__ __inline__ float trim(unsigned char value)
{
    return fminf((unsigned char)255, fmaxf(value, (unsigned char)0));
}

__constant__ char z = 1;

__global__ void kernel(unsigned char* img, const float* a)
{
    int ix = blockIdx.x;
    int iy = threadIdx.x;
    int tid = iy*blockDim.x + ix;

    float x = (float)ix / blockDim.x;
    float y = (float)iy / gridDim.x;

    //placeholder
    img[tid*4+0] = trim((a[0]*z*z+a[1]*z+a[2]) * 255.0f);
    img[tid*4+1] = trim((a[3]*z*z+a[4]*z+a[5]) * 255.0f);
    img[tid*4+2] = trim((a[6]*z*z+a[7]*z+a[8]) * 255.0f);
    img[tid*4+3] = 255;
}
Then a simple program to load the cubin into a context at runtime:
#include <cuda.h>
#include <string>
#include <iostream>
#define Errchk(ans) { DrvAssert((ans), __FILE__, __LINE__); }
inline void DrvAssert( CUresult code, const char *file, int line)
{
    if (code != CUDA_SUCCESS) {
        std::cout << "Error: " << code << " " << file << "#" << line << std::endl;
        exit(code);
    } else {
        std::cout << "Success: " << file << "#" << line << std::endl;
    }
}

int main(void)
{
    Errchk( cuInit(0) );
    CUdevice device;
    Errchk( cuDeviceGet(&device, 0) );
    CUcontext ctx;
    Errchk( cuCtxCreate(&ctx, 0, device) );

    CUmodule module;
    std::string path = "qkernel.cubin";
    Errchk( cuModuleLoad(&module, path.c_str()) );

    cuCtxDetach(ctx);
    return 0;
}
Build the cubin for the architecture of the device present in the host (a GTX670 in this case):
$ nvcc -arch=sm_30 -Xptxas="-v" --cubin qkernel.cu
ptxas info : 11 bytes gmem, 1 bytes cmem[3]
ptxas info : Compiling entry function '_Z6kernelPhPKf' for 'sm_30'
ptxas info : Function properties for _Z6kernelPhPKf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 10 registers, 336 bytes cmem[0]
and the host program:
$ nvcc -o qexe qmain.cc -lcuda
then run:
$ ./qexe
Success: qmain.cc#18
Success: qmain.cc#20
Success: qmain.cc#22
Success: qmain.cc#26
The module code loads. If I delete the cubin and run again, I see this:
$ rm qkernel.cubin
$ ./qexe
Success: qmain.cc#18
Success: qmain.cc#20
Success: qmain.cc#22
Error: 301 qmain.cc#26
If I compile for an incompatible architecture, I see this:
$ nvcc -arch=sm_10 -Xptxas="-v" --cubin qkernel.cu
ptxas info : 0 bytes gmem, 1 bytes cmem[0]
ptxas info : Compiling entry function '_Z6kernelPhPKf' for 'sm_10'
ptxas info : Used 5 registers, 32 bytes smem, 4 bytes cmem[1]
$ ./qexe
Success: qmain.cc#18
Success: qmain.cc#20
Success: qmain.cc#22
Error: 300 qmain.cc#26
If I compile to an object file, not a cubin, I see this:
$ nvcc -arch=sm_30 -Xptxas="-v" -c -o qkernel.cubin qkernel.cu
ptxas info : 11 bytes gmem, 1 bytes cmem[3]
ptxas info : Compiling entry function '_Z6kernelPhPKf' for 'sm_30'
ptxas info : Function properties for _Z6kernelPhPKf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 10 registers, 336 bytes cmem[0]
$ ./qexe
Success: qmain.cc#18
Success: qmain.cc#20
Success: qmain.cc#22
Error: 200 qmain.cc#26
This is the only way I can get the code to emit a CUDA_ERROR_INVALID_IMAGE error. All I can suggest is to try my code and recipe and see if you can get it to work.
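For reference, the numeric codes above are CUresult values: 200 is CUDA_ERROR_INVALID_IMAGE, 300 is CUDA_ERROR_INVALID_SOURCE and 301 is CUDA_ERROR_FILE_NOT_FOUND. On CUDA 6.0 and later the helper above could print the names directly; a minimal sketch of such a variant:
inline void DrvAssert(CUresult code, const char *file, int line)
{
    if (code != CUDA_SUCCESS) {
        // cuGetErrorName is available from CUDA 6.0 onwards
        const char *name = NULL;
        cuGetErrorName(code, &name);
        std::cout << "Error: " << (name ? name : "unknown") << " (" << code
                  << ") " << file << "#" << line << std::endl;
        exit(code);
    } else {
        std::cout << "Success: " << file << "#" << line << std::endl;
    }
}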
This also happens if you compile for a different machine type, for example 32-bit vs 64-bit.
If you have a 32-bit app, add --machine 32 to the nvcc parameters and it will be fine.

CUDA shared memory occupy twice the space than needed

I just noticed that my CUDA kernel uses exactly twice the shared memory space calculated by 'theory'. For example:
__global__ void foo( )
{
    __shared__ double t;
    t = 1;
}
PTX info shows:
ptxas info : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]
But the size of a double is only 8 bytes.
Another example:
__global__ void foo( )
{
    __shared__ int t[1024];
    t[0] = 1;
}
ptxas info : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]
Could someone explain why?
The problem seems to have gone away in the current CUDA compiler.
__shared__ int a[1024];
compiled with command 'nvcc -m64 -Xptxas -v -ccbin /opt/gcc-4.6.3/bin/g++-4.6.3 shmem.cu' gives
ptxas info : Used 1 registers, 4112 bytes smem, 4 bytes cmem[1]
There is some shared memory overhead in this case, but the usage is not doubled.
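One way to cross-check what is actually allocated at run time, rather than what ptxas reports at compile time, is cudaFuncGetAttributes; a minimal sketch, assuming a kernel named foo as above:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, foo);
    // sharedSizeBytes is the statically allocated shared memory per block
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}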

CUDA: Erroneous lmem statistics displayed for sm_20?

A CUDA kernel compiled with the option --ptxas-options=-v seems to be displaying erroneous lmem (local memory) statistics when the sm_20 GPU architecture is specified. The same kernel gives meaningful lmem statistics with the sm_10 / sm_11 / sm_12 / sm_13 architectures.
Can someone clarify if the sm_20 lmem statistics need to be read differently or they are plain wrong?
Here is the kernel:
__global__ void fooKernel( int* dResult )
{
    const int num = 1000;
    int val[num];

    for ( int i = 0; i < num; ++i )
        val[i] = i * i;

    int result = 0;
    for ( int i = 0; i < num; ++i )
        result += val[i];

    *dResult = result;
    return;
}
--ptxas-options=-v and sm_20 report:
1>ptxas info : Compiling entry function '_Z9fooKernelPi' for 'sm_20'
1>ptxas info : Used 5 registers, 4+0 bytes lmem, 36 bytes cmem[0]
--ptxas-options=-v and sm_10 / sm_11 / sm_12 / sm_13 report:
1>ptxas info : Compiling entry function '_Z9fooKernelPi' for 'sm_10'
1>ptxas info : Used 3 registers, 4000+0 bytes lmem, 4+16 bytes smem, 4 bytes cmem[1]
sm_20 reports an lmem usage of 4 bytes, which is simply not possible if you look at the 4×1000-byte array being used in the kernel. The older GPU architectures report the expected 4000-byte lmem statistic.
This was tried with CUDA 3.2. I have referred to the Printing Code Generation Statistics section of the NVCC manual (v3.2), but it does not help explain this anomaly.
The compiler is correct. Through clever optimization the array doesn't need to be stored at all. What you do is essentially calculate result += i * i without ever storing the temporaries to val.
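In other words, the sm_20 compiler effectively reduces the kernel to something like the following sketch (not the actual compiler output, just equivalent source):
__global__ void fooKernel( int* dResult )
{
    int result = 0;
    // val[i] = i * i is folded directly into the accumulation,
    // so the local array never needs to exist
    for ( int i = 0; i < 1000; ++i )
        result += i * i;
    *dResult = result;
}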
A look at the generated ptx code won't show any differences for sm_10 vs. sm_20. Decompiling the generated cubins with decuda will reveal the optimization.
BTW: Try to avoid local memory! It is as slow as global memory.