cuda-gdb: print constant memory

In cuda-gdb I can print global memory with ((@global float *)array)[0],
but how do I print constant memory in the debugger?
I tried ((@parameter float *)const_array), without success.
I declared const_array like this:
__constant__ float const_array[1 << 14];
I also tried with 1 << 5, and the problem is the same.

I don't seem to have any trouble with it. In order to print device memory, you must be stopped at a breakpoint in device code.
Example:
$ cat t1973.cu
const int cs = 1 << 14;
__constant__ int cdata[cs];
__global__ void k(int *gdata){
    gdata[0] = cdata[0];
}
int main(){
    int *hdata = new int[cs];
    for (int i = 0; i < cs; i++) hdata[i] = i+1;
    cudaMemcpyToSymbol(cdata, hdata, cs*sizeof(cdata[0]));
    int *gdata;
    cudaMalloc(&gdata, sizeof(gdata[0]));
    cudaMemset(gdata, 0, sizeof(gdata[0]));
    k<<<1,1>>>(gdata);
    cudaDeviceSynchronize();
}
$ nvcc -o t1973 t1973.cu -g -G -arch=sm_70
$ cuda-gdb ./t1973
sh: python3: command not found
Unable to determine python3 interpreter version. Python integration disabled.
NVIDIA (R) CUDA Debugger
11.4 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t1973...
(cuda-gdb) b 5
Breakpoint 1 at 0x403b0c: file t1973.cu, line 6.
(cuda-gdb) run
Starting program: /home/user2/misc/t1973
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 22872]
[New Thread 0x7fffef475700 (LWP 22879)]
[New Thread 0x7fffeec74700 (LWP 22880)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "t1973" hit Breakpoint 1, k<<<(1,1,1),(1,1,1)>>> (
gdata=0x7fffcdc00000) at t1973.cu:5
5 gdata[0] = cdata[0];
(cuda-gdb) print gdata[0]
$1 = 0
(cuda-gdb) print cdata[0]
$2 = 1
(cuda-gdb) s
6 }
(cuda-gdb) print gdata[0]
$3 = 1
(cuda-gdb) print cdata[0]
$4 = 1
(cuda-gdb) print cdata[1]
$5 = 2
(cuda-gdb)

Try putting your __constant__ declaration into a .cuh header, then use it as a classic C global variable.
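A minimal sketch of that layout (the file names here are just for illustration):

// const_data.cuh (hypothetical header)
#pragma once
__constant__ float const_array[1 << 14];

// kernel.cu
#include "const_data.cuh"
__global__ void use_const(float *out) {
    out[0] = const_array[0]; // visible like an ordinary file-scope global
}

One caveat: under whole-program compilation, every .cu file that includes the header gets its own copy of the symbol, so initialize it with cudaMemcpyToSymbol() from the same translation unit that launches the kernel.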

Related

cuda-gdb set breakpoint at another line of __global__ function

Problem
I have a __global__ function in CUDA that I want to debug with cuda-gdb, but I cannot set a breakpoint inside the kernel: cuda-gdb moves it to another line. Here is my code:
// include stuff
// ...
#define blockNUM 1
#define threadNUM 1
// ...
int main() {
    // ... (define d_R_0, d_R_1, d_R_2, and d_H)
    cudaSetDevice(0);
    dim3 threadsPerBlock(threadNUM);
    dim3 numBlocks(blockNUM);
    decode<<<numBlocks,threadsPerBlock>>>(d_R_0, d_R_1, d_R_2, d_H);
    // ... (other code goes here)
}
__global__ void decode(uint *d_R_0, uint *d_R_1, uint *d_R_2, uint *d_H) {
    uint idx = (blockIdx.x * blockDim.x + threadIdx.x); // --> I want to set the breakpoint here! (line 197) <--
    // ... (implementation of the function)
} // --> But cuda-gdb sets the breakpoint here! (line 288) <--
And here is the cuda-gdb session:
(cuda-gdb) break 197
Breakpoint 1 at 0xa7f6: file /home/matin/main.cu, line 288.
Extra Info
I compile main.cu using this command:
$ nvcc -g -G main.cu
I also have the same problem with the "A First CUDA C Program" snippet on NVIDIA's website.
Specs:
GNU gdb (GDB) 10.1
NVIDIA (R) CUDA Debugger: 11.5 release
CUDA Version: 12.0
Ubuntu Version: 22.04
After updating my NVIDIA drivers, I encountered the same issue. I hope that this solution works for you too.
You have to set the breakpoint using the kernel function name. For example, for the First CUDA C Program you should follow these steps:
Set a breakpoint using the kernel function name:
(cuda-gdb) b saxpy
Breakpoint 1 at 0x338: file /home/nahid/temp/saxpy.cu, line 5.
Run to reach the breakpoint.
(cuda-gdb) r
Finally, set the breakpoint at the line you want:
(cuda-gdb) b 7
Breakpoint 2 at 0xfffe3258e10: file saxpy.cu, line 7
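For reference, the kernel from that post looks roughly like this (a from-memory sketch of NVIDIA's saxpy example; line numbers in your copy may differ from the session above):

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // cuda-gdb resolves 'b saxpy' to the first statement
    if (i < n)
        y[i] = a * x[i] + y[i]; // once focus is inside the kernel, 'b <line>' lands where expected
}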

What does nvprof output: "No kernels were profiled" mean, and how to fix it

I have recently installed CUDA on my Arch Linux machine through the system's package manager, and I have been trying to test whether it works by running a simple vector addition program.
I simply copy-pasted the code from this tutorial (both the single-kernel and the multi-kernel version) into a file titled cuda_test.cu and ran
> nvcc cuda_test.cu -o cuda_test
In either case the program runs, and I get no errors (the program doesn't crash, and its output reports no errors). But when I try to run the CUDA profiler on the program:
> sudo nvprof ./cuda_test
I get result:
==3201== NVPROF is profiling process 3201, command: ./cuda_test
Max error: 0
==3201== Profiling application: ./cuda_test
==3201== Profiling result:
No kernels were profiled.
No API activities were profiled.
==3201== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
The latter warning is not my main problem or the topic of my question; my problem is the message saying that no kernels and no API activities were profiled.
Does this mean that the program ran entirely on my CPU? Or is it an error in nvprof?
I have found a discussion about the same error here, but there the answer was that the wrong version of CUDA was installed; in my case the installed version is the latest available through the system's package manager (version 10.1.243-1).
Is there any way I can get nvprof to display the expected output?
Edit
Trying to adhere to the warning at the end does not solve the problem.
Adding a call to cudaProfilerStop() (or cuProfilerStop()), adding cudaDeviceReset() at the end as suggested, including the appropriate header (cuda_profiler_api.h or cudaProfiler.h), and compiling with
> nvcc cuda_test.cu -o cuda_test -lcuda
yields a program which can still run, but which, when nvprof is run on it, returns:
==12558== NVPROF is profiling process 12558, command: ./cuda_test
Max error: 0
==12558== Profiling application: ./cuda_test
==12558== Profiling result:
No kernels were profiled.
No API activities were profiled.
==12558== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
This has not solved the original problem, and has in fact created a new error; the same happens when cudaProfilerStop() is used on its own or alongside cuProfilerStop() and cudaDeviceReset().
The code
The code is, as mentioned, copied from a tutorial to test whether CUDA is working, though I have also included calls to cudaProfilerStop() and cudaDeviceReset(). For clarity, it is included here:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;
    cudaProfilerStart();
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    cudaDeviceReset();
    cudaProfilerStop();
    return 0;
}
This problem was apparently somewhat well known. After some searching I found this thread about the error code in the edited version; the solution discussed there is to call nvprof with the flag --unified-memory-profiling off:
> sudo nvprof --unified-memory-profiling off ./cuda_test
This makes nvprof work as expected, even without the call to cudaProfilerStop().
You can solve the problem by using
sudo nvprof --unified-memory-profiling per-process-device <your program>
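Independent of the profiler flags, it is also worth confirming that the kernel actually launched: if the binary was built for the wrong GPU architecture, the launch fails silently (the tutorial code checks no CUDA return values), and nvprof would likewise see no kernels. A minimal check, as a sketch that slots into the main() above around the bare launch:

add<<<1, 1>>>(N, x, y);
cudaError_t err = cudaGetLastError(); // reports launch-configuration/architecture errors
if (err != cudaSuccess)
    std::cerr << "launch failed: " << cudaGetErrorString(err) << std::endl;
err = cudaDeviceSynchronize(); // reports errors raised during kernel execution
if (err != cudaSuccess)
    std::cerr << "kernel failed: " << cudaGetErrorString(err) << std::endl;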

Determining which gencode (compute_, arch_) values I need for nvcc - within CMake

I'm using CMake as a build system for my code, which involves CUDA. I was thinking of automating the task of deciding which compute_XX and sm_XX values I need to pass to nvcc in order to compile for the GPU(s) on my current machine.
Is there a way to do this:
With the NVIDIA GPU deployment kit?
Without the NVIDIA GPU deployment kit?
Does CMake's FindCUDA help you in determining the values for these switches?
My strategy has been to compile and run a bash script that probes the card and returns the gencode for CMake. Inspiration came from the University of Chicago's SLURM setup. To handle errors, multiple GPUs, or other circumstances, modify as necessary.
In your project folder create a file cudaComputeVersion.bash and ensure it is executable from the shell. Into this file put:
#!/bin/bash
# create a 'here document' that is code we compile and use to probe the card
cat << EOF > /tmp/cudaComputeVersion.cu
#include <stdio.h>
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop,0);
    int v = prop.major * 10 + prop.minor;
    printf("-gencode arch=compute_%d,code=sm_%d\n",v,v);
}
EOF
# probe the card and cleanup
/usr/local/cuda/bin/nvcc /tmp/cudaComputeVersion.cu -o /tmp/cudaComputeVersion
/tmp/cudaComputeVersion
rm /tmp/cudaComputeVersion.cu
rm /tmp/cudaComputeVersion
And in your CMakeLists.txt put:
# at cmake-build-time, probe the card and set a cmake variable
# (strip the trailing newline so it doesn't end up in the compile flags)
execute_process(COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/cudaComputeVersion.bash
    OUTPUT_VARIABLE GENCODE
    OUTPUT_STRIP_TRAILING_WHITESPACE)
# at project-compile-time, include the gencode into the compile options
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; "${GENCODE}")
# this makes CMake all chatty and allows you to see that GENCODE was set correctly
set(CMAKE_VERBOSE_MAKEFILE TRUE)
cheers
You can use the cuda_select_nvcc_arch_flags() macro in the FindCUDA module for this without any additional scripts when using CMake 3.7 or newer.
include(FindCUDA)
set(CUDA_ARCH_LIST Auto CACHE STRING
"List of CUDA architectures (e.g. Pascal, Volta, etc) or \
compute capability versions (6.1, 7.0, etc) to generate code for. \
Set to Auto for automatic detection (default)."
)
cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS ${CUDA_ARCH_LIST})
list(APPEND CUDA_NVCC_FLAGS ${CUDA_ARCH_FLAGS})
The above sets CUDA_ARCH_FLAGS to -gencode arch=compute_61,code=sm_61 on my machine, for example.
The CUDA_ARCH_LIST cache variable can be configured by the user to generate code for specific compute capabilities instead of automatic detection.
Note: the FindCUDA module has been deprecated since CMake 3.10. However, no equivalent alternative to the cuda_select_nvcc_arch_flags() macro appears to be provided yet in the latest CMake release (v3.14). See this relevant issue at the CMake issue tracker for further details.
A slight improvement over @orthopteroid's answer: it pretty much ensures a unique temporary file is generated, and it requires only one temporary file instead of two.
The following goes into scripts/get_cuda_sm.sh:
#!/bin/bash
#
# Prints the compute capability of the first CUDA device installed
# on the system, or alternatively the device whose index is the
# first command-line argument
device_index=${1:-0}
timestamp=$(date +%s.%N)
gcc_binary=$(which g++)
gcc_binary=${gcc_binary:-g++}
cuda_root=${CUDA_DIR:-/usr/local/cuda}
CUDA_INCLUDE_DIRS=${CUDA_INCLUDE_DIRS:-${cuda_root}/include}
CUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY:-${cuda_root}/lib64/libcudart.so}
generated_binary="/tmp/cuda-compute-version-helper-$$-$timestamp"
# create a 'here document' that is code we compile and use to probe the card
source_code="$(cat << EOF
#include <stdio.h>
#include <cuda_runtime_api.h>
int main()
{
    cudaDeviceProp prop;
    cudaError_t status;
    int device_count;
    status = cudaGetDeviceCount(&device_count);
    if (status != cudaSuccess) {
        fprintf(stderr,"cudaGetDeviceCount() failed: %s\n", cudaGetErrorString(status));
        return -1;
    }
    if (${device_index} >= device_count) {
        fprintf(stderr, "Specified device index %d exceeds the maximum (the device count on this system is %d)\n", ${device_index}, device_count);
        return -1;
    }
    status = cudaGetDeviceProperties(&prop, ${device_index});
    if (status != cudaSuccess) {
        fprintf(stderr,"cudaGetDeviceProperties() for device ${device_index} failed: %s\n", cudaGetErrorString(status));
        return -1;
    }
    int v = prop.major * 10 + prop.minor;
    printf("%d\\n", v);
}
EOF
)"
echo "$source_code" | $gcc_binary -x c++ -I"$CUDA_INCLUDE_DIRS" -o "$generated_binary" - -x none "$CUDA_CUDART_LIBRARY"
# probe the card and cleanup
$generated_binary
rm $generated_binary
and the following goes into CMakeLists.txt or a CMake module:
if (NOT CUDA_TARGET_COMPUTE_CAPABILITY)
if("$ENV{CUDA_SM}" STREQUAL "")
set(ENV{CUDA_INCLUDE_DIRS} "${CUDA_INCLUDE_DIRS}")
set(ENV{CUDA_CUDART_LIBRARY} "${CUDA_CUDART_LIBRARY}")
set(ENV{CMAKE_CXX_COMPILER} "${CMAKE_CXX_COMPILER}")
execute_process(COMMAND
bash -c "${CMAKE_CURRENT_SOURCE_DIR}/scripts/get_cuda_sm.sh"
OUTPUT_VARIABLE CUDA_TARGET_COMPUTE_CAPABILITY_)
else()
set(CUDA_TARGET_COMPUTE_CAPABILITY_ $ENV{CUDA_SM})
endif()
set(CUDA_TARGET_COMPUTE_CAPABILITY "${CUDA_TARGET_COMPUTE_CAPABILITY_}"
CACHE STRING "CUDA compute capability of the (first) CUDA device on \
the system, in XY format (like the X.Y format but no dot); see table \
of features and capabilities by capability X.Y value at \
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications")
execute_process(COMMAND
bash -c "echo -n $(echo ${CUDA_TARGET_COMPUTE_CAPABILITY})"
OUTPUT_VARIABLE CUDA_TARGET_COMPUTE_CAPABILITY)
execute_process(COMMAND
bash -c "echo ${CUDA_TARGET_COMPUTE_CAPABILITY} | sed 's/^\\([0-9]\\)\\([0-9]\\)/\\1.\\2/;' | xargs echo -n"
OUTPUT_VARIABLE FORMATTED_COMPUTE_CAPABILITY)
message(STATUS
"CUDA device-side code will assume compute capability \
${FORMATTED_COMPUTE_CAPABILITY}")
endif()
set(CUDA_GENCODE
"arch=compute_${CUDA_TARGET_COMPUTE_CAPABILITY}, code=compute_${CUDA_TARGET_COMPUTE_CAPABILITY}")
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode ${CUDA_GENCODE} )
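As a sanity check, you can also run the probe script by hand. On a machine whose first device has compute capability 6.1, for example, the session would look like this (illustrative output):

$ bash scripts/get_cuda_sm.sh 0
61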

Invalid device error returned by cuPointerGetAttribute()

This question is the same as one previously asked here, which was never answered:
https://stackoverflow.com/questions/22996075/invalid-device-error-return-by-cupointergetattribute.
I use a GTX 680, the CUDA 6.5 toolkit, and the NVIDIA 340.46 kernel module. The GPU has unified addressing capability and compute capability 3.0.
The following code returns CUDA_ERROR_INVALID_DEVICE:
CUDA_DR_ASSERT(cuMemAlloc(&dev_ptr, size));
CUDA_DR_ASSERT(cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr));
Has anyone (Sankar?) had similar problems and found the reason?
Edit: this is the code I get the errors from:
CUDA_DR_ASSERT( cuInit(0) );
CUdevice dev;
CUDA_DR_ASSERT( cuDeviceGet(&dev, 0) );
CUDA_ASSERT(cudaSetDevice(dev));
CUdeviceptr dev_ptr;
std::size_t size = 2*65536;
CUDA_DR_ASSERT( cuMemAlloc( &dev_ptr, size ) );
uint flag = 1; // set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS (set to 0 for unsetting this option)
CUDA_DR_ASSERT( cuPointerSetAttribute(&flag, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, dev_ptr) );
CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
CUDA_DR_ASSERT( cuPointerGetAttribute( &tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr ) );
I have a system with a Quadro GPU at device 0 and a GeForce GPU at device 1.
Here's a fully worked example:
$ cat t642.cpp
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>
int main(int argc, char *argv[]){
    int my_dev = 0;
    int dev_count = 0;
    if (argc > 1) my_dev=atoi(argv[1]);
    CUcontext my_ctx;
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&dev_count));
    if (my_dev > dev_count-1) {printf("device does not exist\n"); return 1;}
    char deviceName[256];
    checkCudaErrors(cuDeviceGetName(deviceName, 256, my_dev));
    printf("using device %d, %s\n", my_dev, deviceName);
    checkCudaErrors(cuCtxCreate(&my_ctx, 0, my_dev));
    CUdeviceptr dev_ptr;
    size_t size = 256;
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    checkCudaErrors(cuMemAlloc(&dev_ptr, size));
    checkCudaErrors(cuPointerGetAttribute(&tokens, CU_POINTER_ATTRIBUTE_P2P_TOKENS, dev_ptr));
    printf("success!\n");
    return 0;
}
$ g++ -I/usr/local/cuda/include -I/usr/local/cuda/samples/common/inc t642.cpp -lcuda -o t642
$ ./t642 0
using device 0, Quadro 5000
success!
$ ./t642 1
using device 1, GeForce GT 640
checkCudaErrors() Driver API error = 0101 "CUDA_ERROR_INVALID_DEVICE (device specified is not a valid CUDA device)" from file <t642.cpp>, line 22.
$
Using a GeForce GPU with this mechanism (which is designed in support of GPUDirect RDMA) is not supported. This is documented in the GPUDirect RDMA documentation, which states:
GPUDirect RDMA is available on both Tesla and Quadro GPUs.
And while it is not the crux of your issue, you may also wish to read the GPUDirect RDMA release notes, which indicate that this token mechanism was deprecated in CUDA 6.0.
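Note that it is specifically the P2P-tokens query, a GPUDirect RDMA feature, that is restricted; other pointer attributes work on GeForce parts as well. For instance, a query along these lines (a sketch, not from the original answer) should succeed on device 1 in the example above:

CUmemorytype mem_type;
// CU_POINTER_ATTRIBUTE_MEMORY_TYPE is not tied to GPUDirect RDMA
checkCudaErrors(cuPointerGetAttribute(&mem_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, dev_ptr));
// expect CU_MEMORYTYPE_DEVICE for a cuMemAlloc'd pointer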

CUDA thread block size 1024 doesn't work (cc=20, sm=21)

My running config:
- CUDA Toolkit 5.5
- NVIDIA Nsight Eclipse Edition
- Ubuntu 12.04 x64
- CUDA device is an NVIDIA GeForce GTX 560: cc=20, sm=21 (so I can use blocks of up to 1024 threads)
I render my display on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.
However, I encountered some weird behaviour when I set threads > 960.
Code:
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;
    // Here I run my kernel
    mytest<<<1, 961>>>();
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    printf("Done\n");
    return 0;
}
And... it doesn't work. The problem is the float division in the kernel: every time I divide by a float, my code compiles but fails at runtime. The error output at runtime is:
error=too many resources requested for launch
Here's what I get in the debugger when I step over it:
warning: Cuda API error detected: cudaLaunch returned (0x7)
Build output using -Xptxas -v:
12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -x cu -o "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
ptxas info : 4 bytes gmem, 8 bytes cmem[14]
ptxas info : Function properties for _ZN4dim3C1Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info : Function properties for _Z6mytestv
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info : Function properties for _ZN4dim3C2Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu
Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o "block_size_test" ./src/vectorAdd.o
Finished building target: block_size_test
12:57:41 Build Finished (took 1s.659ms)
When I add the -keep flag, the compiler generates a .cubin file, but I can't read it to find out the smem and reg values by following this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-. At least nowadays this file must have a different format.
Therefore I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.
Anyway. Any help will be appreciated.
I filled in the CUDA Occupancy Calculator with the current information:
Compute capability: 2.1
Threads per block: 961
Registers per thread: 34
Shared memory: 0
I got 0% occupancy, limited by the register count. On cc 2.x devices registers are allocated per warp in units of 64: 961 threads round up to 31 warps, and 31 warps * 1088 registers per warp (34 * 32, already a multiple of 64) comes to 33728 registers, more than the 32768 available per multiprocessor, so the launch fails.
If you set the number of threads to 960 (30 warps, 32640 registers), you get 63% occupancy, which explains why it works.
Limit the register count to 32 and set the number of threads to 1024 to get 67% occupancy.
To limit the register count, use the following option:
nvcc [...] --maxrregcount=32
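Alternatively, if you prefer a per-kernel limit rather than a global compiler switch, you can hint the compiler with __launch_bounds__; a minimal sketch applied to the kernel above:

// Promises the compiler that mytest will be launched with at most
// 1024 threads per block, so ptxas caps register usage accordingly.
__global__ void __launch_bounds__(1024) mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}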