Why is my CUDA warp shuffle sum using the wrong offset for one shuffle step?

Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214.
I'm writing a numerical simulation program that is giving subtly-incorrect results in Release mode, but seemingly correct results in Debug mode. The original program used curand for random sampling, but I've reduced it to a much simpler and more deterministic MVCE which launches a single kernel of 1 block * 1 warp (of 32 threads), where each thread:
Performs a computation with a loop that will likely become warp-divergent, especially near the end as some threads complete their task before others.
Syncs the threads back together.
Attempts to butterfly-shuffle data with fellow threads in the warp to obtain a single sum.
[not needed in the MVCE] thread 0 would write the sum back to global memory so it can be copied to the host
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void test_kernel()
{
    int cSteps = 0;
    int cIters = 0;
    float pos = 0;
    //curandState localState = state[threadIdx.x];
    while (true) {
        float rn = threadIdx.x * 0.01 + 0.001;
        pos += rn;
        cSteps++;
        if (pos > 1.0f) {
            pos = 0;
            cIters++;
            if (cSteps > 1024) {
                break;
            }
        }
    }
    printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
    __syncthreads();
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
    printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
    printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
    printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
    printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
    printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
}
int main()
{
    test_kernel <<<1, 32>>> ();
    return 0;
}
In debug mode, the shuffle works as expected. I see each thread start out with its own value:
0: Th 0 cI 2
0: Th 1 cI 12
0: Th 2 cI 22
0: Th 3 cI 32
0: Th 4 cI 41
// ...
after the first shuffle xor 1, each pair of threads agrees on the same number:
1: Th 0 cI 14
1: Th 1 cI 14
1: Th 2 cI 54
1: Th 3 cI 54
after the shuffle xor 2, each group of four threads agrees:
2: Th 0 cI 68
2: Th 1 cI 68
2: Th 2 cI 68
2: Th 3 cI 68
2: Th 4 cI 223
2: Th 5 cI 223
2: Th 6 cI 223
2: Th 7 cI 223
and so on. After the last shuffle, all threads in the warp agree on the same value (4673).
As soon as I enable Release mode, I get results that are subtly wrong. The values entering the shuffle are the same, and the values after the first round of the shuffle agree with the debug build (and agree within each pair as before). As soon as I do a shuffle xor 2, the results fall apart:
2: Th 0 cI 28
2: Th 1 cI 28
2: Th 2 cI 108
2: Th 3 cI 108
2: Th 4 cI 186
2: Th 5 cI 186
2: Th 6 cI 260
2: Th 7 cI 260
In fact, this is the exact output that a debug build (and hand inspection) would produce if the shuffle sequence were replaced by this specific broken one:
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32); // 2 changed to 1
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32); // 2 changed to 1
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
The full diff of the output is here.
Hardware and software environment is as follows:
GA103 3080 Ti (mobile), at manufacturer-recommended clocks, 16 GB VRAM. The machine doesn't appear to corrupt results in other CUDA programs (tested with primegrid-CUDA, with tasks verified against double-checks)
CUDA 11.0
MSVC host compiler 14.29.30133
Full debug command line as follows:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Full release command line as follows:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc142.pdb /FS /Zi /MD " -o x64\Release\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Things I tried without resolution:
Adding/removing __syncthreads() calls (where one is shown, and between shuffle calls), even though they shouldn't be necessary since each shuffle synchronizes
Changing the compute capability to 8.0 to better match my card
Forcing base clocks on the GPU
Shuffling in the opposite order (16/8/4/2/1)
Using __shfl_down_sync instead of xor, with the same pattern of offsets.
Having each thread write to global memory and then summing on the host CPU does produce correct results.
Replacing all the shuffles with calls to __shfl_sync and manually-calculated lane IDs works. Replacing just the broken shuffle xor 2 with a __shfl_sync doesn't. Replacing just the first shuffle xor 1 (which worked correctly) with a __shfl_sync does seem to fix it. (These two workarounds apply to my MVCE; I have not had a chance to evaluate whether they apply to the full program)
// unexpectedly working
int id = threadIdx.x;
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_sync(0xffffffff, cSteps, id ^ 1, 32);
cIters += __shfl_sync(0xffffffff, cIters, id ^ 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
Even though I have a workaround, I'm afraid that I'm still hitting undefined behavior somewhere and my fix might be brittle.
Can anyone shed light on this? Is there indeed UB in my program? Is this a known compiler bug?

The CUDA engineering team confirmed this to be a compiler bug. A fix is on the way, per a communication from them:
The fix is targeting a future major CUDA release after CUDA 11. The JIT fix will possibly come a little earlier, in a driver branch after the latest R515.
Edit: It doesn't appear to be fixed in the 516.94 Game Ready driver. It does seem fixed in 522.25 with CUDA 11.8.
They also confirm that turning off optimizations avoids the issue; they did not comment on any workarounds that work reliably with optimizations still on.
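If turning off all optimization (effectively a Debug/-G build) is too heavy-handed, a lighter variant that I have not verified against this specific bug is to lower only the ptxas optimization level for the affected translation unit, for example:
nvcc -Xptxas -O0 -gencode=arch=compute_52,code=sm_52 kernel.cu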
The following workarounds worked for me on my hardware and compiler, but YMMV:
using __shfl_sync with manually-calculated lane IDs instead of __shfl_xor_sync / __shfl_down_sync
using __reduce_add_sync (available on devices of compute capability 8.0 and newer)
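To illustrate the second workaround, here is a minimal sketch (a hypothetical variant of the MVCE kernel, not my original program) of replacing the butterfly-shuffle reduction with __reduce_add_sync; it requires a compute capability 8.0+ device and an sm_80+ build target:
__global__ void test_kernel_reduce()
{
    int cSteps = 0;
    int cIters = 0;
    float pos = 0;
    while (true) {
        float rn = threadIdx.x * 0.01f + 0.001f;
        pos += rn;
        cSteps++;
        if (pos > 1.0f) {
            pos = 0;
            cIters++;
            if (cSteps > 1024) break;
        }
    }
    // __reduce_add_sync sums the value across every lane named in the mask,
    // replacing the five shuffle/add steps with a single reduction per variable.
    cSteps = __reduce_add_sync(0xffffffff, cSteps);
    cIters = __reduce_add_sync(0xffffffff, cIters);
    printf("Th %2d total cI %d\n", threadIdx.x, cIters);
}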

Related

Using CUDA-gdb with NVRTC

I have an application which generates CUDA C++ source code, compiles it into PTX at runtime using NVRTC, and then creates CUDA modules from it using the CUDA driver API.
If I debug this application using cuda-gdb, it displays the kernel (where an error occurred) in the backtrace, but does not show the line number.
I export the generated source code into a file, and give the directory to cuda-gdb using the --directory option. I also tried passing its file name to nvrtcCreateProgram() (name argument). I use the compile options --device-debug and --generate-line-info with NVRTC.
Is there a way to let cuda-gdb know the location of the generated source code file, and display the line number information in its backtrace?
For those who may not be familiar with nvrtc, it is a CUDA facility that allows runtime-compilation of CUDA C++ device code. As a result, device code generated at runtime (including modifications) can be used on a CUDA GPU. There is documentation for nvrtc and there are various CUDA sample codes for nvrtc, most or all of which have _nvrtc in the file name.
I was able to do kernel source-level debugging on a nvrtc-generated kernel with cuda-gdb as follows:
start with vectorAdd_nvrtc sample code
modify the compileFileToPTX routine (provided by nvrtc_helper.h) to add the --device-debug switch during the compile-cu-to-ptx step.
modify the loadPTX routine (provided by nvrtc_helper.h) to add the CU_JIT_GENERATE_DEBUG_INFO option (set to 1) for the cuModuleLoadDataEx load/JIT PTX-to-binary step.
compile the main function (vectorAdd.cpp) with -g option.
Here is a complete test case/session. I'm only showing the vectorAdd.cpp file from the project because that is the only file I modified. Other project file(s) are identical to what is in the sample project:
$ cat vectorAdd.cpp
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/**
* Vector addition: C = A + B.
*
* This sample is a very basic sample that implements element by element
* vector addition. It is the same as the sample illustrating Chapter 2
* of the programming guide with some additions like error checking.
*/
#include <stdio.h>
#include <cmath>
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda.h>
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <nvrtc_helper.h>
#include <iostream>
#include <fstream>
/**
* Host main routine
*/
void my_compileFileToPTX(char *filename, int argc, char **argv, char **ptxResult,
size_t *ptxResultSize, int requiresCGheaders) {
std::ifstream inputFile(filename,
std::ios::in | std::ios::binary | std::ios::ate);
if (!inputFile.is_open()) {
std::cerr << "\nerror: unable to open " << filename << " for reading!\n";
exit(1);
}
std::streampos pos = inputFile.tellg();
size_t inputSize = (size_t)pos;
char *memBlock = new char[inputSize + 1];
inputFile.seekg(0, std::ios::beg);
inputFile.read(memBlock, inputSize);
inputFile.close();
memBlock[inputSize] = '\x0';
int numCompileOptions = 0;
char *compileParams[2];
std::string compileOptions;
if (requiresCGheaders) {
char HeaderNames[256];
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#else
snprintf(HeaderNames, sizeof(HeaderNames), "%s", "cooperative_groups.h");
#endif
compileOptions = "--include-path=";
std::string path = sdkFindFilePath(HeaderNames, argv[0]);
if (!path.empty()) {
std::size_t found = path.find(HeaderNames);
path.erase(found);
} else {
printf(
"\nCooperativeGroups headers not found, please install it in %s "
"sample directory..\n Exiting..\n",
argv[0]);
}
compileOptions += path.c_str();
compileParams[0] = reinterpret_cast<char *>(
malloc(sizeof(char) * (compileOptions.length() + 1)));
#if defined(WIN32) || defined(_WIN32) || defined(WIN64) || defined(_WIN64)
sprintf_s(compileParams[0], sizeof(char) * (compileOptions.length() + 1),
"%s", compileOptions.c_str());
#else
snprintf(compileParams[0], compileOptions.size(), "%s",
compileOptions.c_str());
#endif
numCompileOptions++;
}
compileOptions = "--device-debug ";
compileParams[numCompileOptions] = reinterpret_cast<char *>(malloc(sizeof(char) * (compileOptions.length() + 1)));
snprintf(compileParams[numCompileOptions], compileOptions.size(), "%s", compileOptions.c_str());
numCompileOptions++;
// compile
nvrtcProgram prog;
NVRTC_SAFE_CALL("nvrtcCreateProgram",
nvrtcCreateProgram(&prog, memBlock, filename, 0, NULL, NULL));
nvrtcResult res = nvrtcCompileProgram(prog, numCompileOptions, compileParams);
// dump log
size_t logSize;
NVRTC_SAFE_CALL("nvrtcGetProgramLogSize",
nvrtcGetProgramLogSize(prog, &logSize));
char *log = reinterpret_cast<char *>(malloc(sizeof(char) * logSize + 1));
NVRTC_SAFE_CALL("nvrtcGetProgramLog", nvrtcGetProgramLog(prog, log));
log[logSize] = '\x0';
if (strlen(log) >= 2) {
std::cerr << "\n compilation log ---\n";
std::cerr << log;
std::cerr << "\n end log ---\n";
}
free(log);
NVRTC_SAFE_CALL("nvrtcCompileProgram", res);
// fetch PTX
size_t ptxSize;
NVRTC_SAFE_CALL("nvrtcGetPTXSize", nvrtcGetPTXSize(prog, &ptxSize));
char *ptx = reinterpret_cast<char *>(malloc(sizeof(char) * ptxSize));
NVRTC_SAFE_CALL("nvrtcGetPTX", nvrtcGetPTX(prog, ptx));
NVRTC_SAFE_CALL("nvrtcDestroyProgram", nvrtcDestroyProgram(&prog));
*ptxResult = ptx;
*ptxResultSize = ptxSize;
#ifdef DUMP_PTX
std::ofstream my_f;
my_f.open("vectorAdd.ptx");
for (int i = 0; i < ptxSize; i++)
my_f << ptx[i];
my_f.close();
#endif
if (requiresCGheaders) free(compileParams[0]);
}
CUmodule my_loadPTX(char *ptx, int argc, char **argv) {
CUmodule module;
CUcontext context;
int major = 0, minor = 0;
char deviceName[256];
// Picks the best CUDA device available
CUdevice cuDevice = findCudaDeviceDRV(argc, (const char **)argv);
// get compute capabilities and the devicename
checkCudaErrors(cuDeviceGetAttribute(
&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, cuDevice));
checkCudaErrors(cuDeviceGetAttribute(
&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, cuDevice));
checkCudaErrors(cuDeviceGetName(deviceName, 256, cuDevice));
printf("> GPU Device has SM %d.%d compute capability\n", major, minor);
checkCudaErrors(cuInit(0));
checkCudaErrors(cuDeviceGet(&cuDevice, 0));
checkCudaErrors(cuCtxCreate(&context, 0, cuDevice));
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_DEBUG_INFO;
void **vals = new void *[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));
free(ptx);
return module;
}
int main(int argc, char **argv) {
char *ptx, *kernel_file;
size_t ptxSize;
kernel_file = sdkFindFilePath("vectorAdd_kernel.cu", argv[0]);
my_compileFileToPTX(kernel_file, argc, argv, &ptx, &ptxSize, 0);
CUmodule module = my_loadPTX(ptx, argc, argv);
CUfunction kernel_addr;
checkCudaErrors(cuModuleGetFunction(&kernel_addr, module, "vectorAdd"));
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = reinterpret_cast<float *>(malloc(size));
// Allocate the host input vector B
float *h_B = reinterpret_cast<float *>(malloc(size));
// Allocate the host output vector C
float *h_C = reinterpret_cast<float *>(malloc(size));
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL) {
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i) {
h_A[i] = rand() / static_cast<float>(RAND_MAX);
h_B[i] = rand() / static_cast<float>(RAND_MAX);
}
// Allocate the device input vector A
CUdeviceptr d_A;
checkCudaErrors(cuMemAlloc(&d_A, size));
// Allocate the device input vector B
CUdeviceptr d_B;
checkCudaErrors(cuMemAlloc(&d_B, size));
// Allocate the device output vector C
CUdeviceptr d_C;
checkCudaErrors(cuMemAlloc(&d_C, size));
// Copy the host input vectors A and B in host memory to the device input
// vectors in device memory
printf("Copy input data from the host memory to the CUDA device\n");
checkCudaErrors(cuMemcpyHtoD(d_A, h_A, size));
checkCudaErrors(cuMemcpyHtoD(d_B, h_B, size));
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
dim3 cudaBlockSize(threadsPerBlock, 1, 1);
dim3 cudaGridSize(blocksPerGrid, 1, 1);
void *arr[] = {reinterpret_cast<void *>(&d_A), reinterpret_cast<void *>(&d_B),
reinterpret_cast<void *>(&d_C),
reinterpret_cast<void *>(&numElements)};
checkCudaErrors(cuLaunchKernel(kernel_addr, cudaGridSize.x, cudaGridSize.y,
cudaGridSize.z, /* grid dim */
cudaBlockSize.x, cudaBlockSize.y,
cudaBlockSize.z, /* block dim */
0, 0, /* shared mem, stream */
&arr[0], /* arguments */
0));
checkCudaErrors(cuCtxSynchronize());
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
checkCudaErrors(cuMemcpyDtoH(h_C, d_C, size));
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i) {
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
checkCudaErrors(cuMemFree(d_A));
checkCudaErrors(cuMemFree(d_B));
checkCudaErrors(cuMemFree(d_C));
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
$ nvcc -g -I/usr/local/cuda/samples/common/inc -o test vectorAdd.cpp -lnvrtc -lcuda
$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./test...done.
(cuda-gdb) break vectorAdd
Function "vectorAdd" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (vectorAdd) pending.
(cuda-gdb) r
Starting program: /home/user2/misc/junk/vectorAdd_nvrtc/test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffedc00700 (LWP 16789)]
> Using CUDA Device [1]: Tesla K40m
> GPU Device has SM 3.5 compute capability
[New Thread 0x7fffed3ff700 (LWP 16790)]
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "test" hit Breakpoint 1, vectorAdd<<<(196,1,1),(256,1,1)>>> (A=0x7fffce800000, B=0x7fffce830e00, C=0x7fffce861c00, numElements=50000) at ./vectorAdd_kernel.cu:21
21 int i = blockDim.x * blockIdx.x + threadIdx.x;
(cuda-gdb) step
23 if (i < numElements) {
(cuda-gdb) step
24 C[i] = A[i] + B[i];
(cuda-gdb) step
26 }
(cuda-gdb) quit
A debugging session is active.
Inferior 1 [process 16777] will be killed.
Quit anyway? (y or n) y
$
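As an aside that I have not tested in this session: if disabling device optimization via --device-debug is undesirable, NVRTC also accepts --generate-line-info, and the matching driver-API JIT option is CU_JIT_GENERATE_LINE_INFO. The PTX load step would then look roughly like this sketch:
// Sketch: request only line-number information at PTX JIT time (no full device debug).
CUjit_option opt[1];
opt[0] = CU_JIT_GENERATE_LINE_INFO;
void *vals[1];
vals[0] = (void *)(size_t)1;
checkCudaErrors(cuModuleLoadDataEx(&module, ptx, 1, opt, vals));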

-ta=tesla:managed:cuda8 but cuMemAllocManaged returned error 2: Out of memory

I'm new to OpenACC. I like it very much so far, as I'm already familiar with OpenMP.
I have two 1080 Ti cards, each with 9 GB, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware would page data in and out on demand.
Here's my full C code test :
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 10
int main()
{
float *a;
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
float sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
As per the "Enable Unified Memory" section of this article, I compile it with:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
20, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
28, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
I need to understand those messages, but for now I don't think they are relevant. Then I run it:
$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)
This works fine if I change GB to 8. I expected 10 GB to work (despite the GPU card having 9 GB) thanks to the Pascal 1080 Ti and CUDA 8.
Have I misunderstood something, or what am I doing wrong? Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61
Besides what Bob mentioned, I made a few more fixes.
First, you're not actually generating an OpenACC compute region since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages where it's only showing host code optimizations.
Second, the "i" index should be declared as a "long". Otherwise, you'll overflow the index.
Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal based GPU.
% cat mi.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc parallel loop reduction (+:sum)
for (long i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
21, Accelerator kernel generated
Generating Tesla code
21, Generating reduction(+:sum)
22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
I believe a problem is here:
size_t n = GB*1024*1024*1024/sizeof(float);
when I compile that line of code with g++, I get a warning about integer overflow: the constants are plain int, so GB*1024*1024*1024 is evaluated in 32-bit integer arithmetic and overflows before the result is converted to size_t. For some reason the PGI compiler does not warn, but the same badness is occurring under the hood. After the declarations of s and n, if I add a printout like this:
size_t n = GB*1024*1024*1024/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s); // add this line
and compile with PGI 17.04, and run (on a P100, with 16GB) I get output like this:
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$
so it's evident that n and s are not what you intended.
We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:
$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#define GB 20ULL
int main()
{
float *a;
size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
size_t s = n * sizeof(float);
printf("n = %lu, s = %lu\n", n, s);
a = (float *)malloc(s);
if (!a) { printf("Failed to malloc.\n"); return 1; }
printf("Initializing ... ");
for (int i = 0; i < n; ++i) {
a[i] = 0.1f;
}
printf("done\n");
double sum=0.0;
#pragma acc loop reduction (+:sum)
for (int i = 0; i < n; ++i) {
sum+=a[i];
}
printf("Sum is %f\n", sum);
free(a);
return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
16, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop
22, Loop not fused: function call before adjacent loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$
Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
And, as @Mat Colgrove pointed out in his answer, I missed a few other things as well.

Incorrect results with CUB ReduceByKey when specifying gencode

In one of my projects, I'm seeing some incorrect results when using CUB's
DeviceReduce::ReduceByKey. However, using the same inputs/outputs with thrust::reduce_by_key produces the expected results.
#include "cub/cub.cuh"
#include <vector>
#include <iostream>
#include <cuda.h>
struct AddFunctor {
__host__ __device__ __forceinline__
float operator()(const float & a, const float & b) const {
return a + b;
}
} reduction_op;
int main() {
int n = 7680;
std::vector < uint64_t > keys_h(n);
for (int i = 0; i < 4000; i++) keys_h[i] = 1;
for (int i = 4000; i < 5000; i++) keys_h[i] = 2;
for (int i = 5000; i < 7680; i++) keys_h[i] = 3;
uint64_t * keys;
cudaMalloc(&keys, sizeof(uint64_t) * n);
cudaMemcpy(keys, &keys_h[0], sizeof(uint64_t) * n, cudaMemcpyDefault);
uint64_t * unique_keys;
cudaMalloc(&unique_keys, sizeof(uint64_t) * n);
std::vector < float > values_h(n);
for (int i = 0; i < n; i++) values_h[i] = 1.0;
float * values;
cudaMalloc(&values, sizeof(float) * n);
cudaMemcpy(values, &values_h[0], sizeof(float) * n, cudaMemcpyDefault);
float * aggregates;
cudaMalloc(&aggregates, sizeof(float) * n);
int * remaining;
cudaMalloc(&remaining, sizeof(int));
size_t size = 0;
void * buffer = NULL;
cub::DeviceReduce::ReduceByKey(
buffer,
size,
keys,
unique_keys,
values,
aggregates,
remaining,
reduction_op,
n);
cudaMalloc(&buffer, sizeof(char) * size);
cub::DeviceReduce::ReduceByKey(
buffer,
size,
keys,
unique_keys,
values,
aggregates,
remaining,
reduction_op,
n);
int remaining_h;
cudaMemcpy(&remaining_h, remaining, sizeof(int), cudaMemcpyDefault);
std::vector < float > aggregates_h(remaining_h);
cudaMemcpy(&aggregates_h[0], aggregates, sizeof(float) * remaining_h, cudaMemcpyDefault);
for (int i = 0; i < remaining_h; i++) {
std::cout << i << ", " << aggregates_h[i] << std::endl;
}
cudaFree(buffer);
cudaFree(keys);
cudaFree(unique_keys);
cudaFree(values);
cudaFree(aggregates);
cudaFree(remaining);
}
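For reference, the thrust::reduce_by_key call I compared against looked roughly like the following sketch (it reuses the device allocations keys, values, unique_keys, aggregates and the count n from the code above; this is an illustration rather than my exact project code):
#include <thrust/reduce.h>
#include <thrust/device_ptr.h>
#include <thrust/pair.h>
// Wrap the raw device pointers so thrust dispatches the reduction to the device.
thrust::device_ptr<uint64_t> keys_d(keys);
thrust::device_ptr<float> values_d(values);
thrust::device_ptr<uint64_t> unique_keys_d(unique_keys);
thrust::device_ptr<float> aggregates_d(aggregates);
// The default equality (equal_to) and reduction (plus) match the CUB call above.
thrust::pair<thrust::device_ptr<uint64_t>, thrust::device_ptr<float> > ends =
    thrust::reduce_by_key(keys_d, keys_d + n, values_d, unique_keys_d, aggregates_d);
int num_segments = ends.first - unique_keys_d; // 3 segments expected for these keys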
When I include "-gencode arch=compute_35,code=sm_35" (for a Kepler GTX Titan), it produces the wrong results, but when I leave these flags out entirely, it works.
$ nvcc cub_test.cu
$ ./a.out
0, 4000
1, 1000
2, 2680
$ nvcc cub_test.cu -gencode arch=compute_35,code=sm_35
$ ./a.out
0, 4000
1, 1000
2, 768
I use a handful of other CUB calls without issue, just this one is misbehaving. I've also tried running this code on a GTX 1080 Ti (with
compute_61, sm_61) and see the same behavior.
Is the right solution to omit these compiler flags?
tried on one machine with:
cuda 8.0
ubuntu 16.04
gcc 5.4.0
cub 1.6.4
Kepler GTX Titan (compute capability 3.5)
and another with:
cuda 8.0
ubuntu 16.04
gcc 5.4.0
cub 1.6.4
Pascal GTX 1080 Ti (compute capability 6.1)
Sounds like you should file a bug report at the CUB repository issues page.
Edit: I can reproduce this issue:
[joeuser@myhost:/tmp]$ nvcc -I/opt/cub -o a a.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
[joeuser@myhost:/tmp]$ ./a
0, 4000
1, 1000
2, 2680
[joeuser@myhost:/tmp]$ nvcc -I/opt/cub -o a a.cu -gencode arch=compute_30,code=sm_30
[joeuser@myhost:/tmp]$ ./a
0, 4000
1, 1000
2, 512
Relevant info:
CUDA: 8.0.61
nVIDIA driver: 375.39
Distribution: GNU/Linux Mint 18.1
Linux kernel: 4.4.0
GCC: 5.4.0-6ubuntu1~16.04.4
cub: 1.6.4
GPU: GTX 650 Ti (Compute Capability 3.0)

Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04

I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication improves dramatically if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cuBLAS cousin by around 100 times. This can most probably be attributed to some automatic detection of sparsity and a suitable format conversion by the cblas_sgemm function.
Unfortunately, no such behavior is exhibited by its CUDA counterpart, i.e. cublasSgemm.
So, the question is, how can I get the same type of optimization on cublasSgemm for matrices that may have a large number of zeros.
And what technique does cblas_sgemm use to automatically adjust to sparse matrices?
Please do not recommend cuSPARSE / CUSP etc., because
I am not sure about the sparsity of input matrices beforehand
I am working on an iterative algorithm where for initial few iterations the matrices may be sparse but gradually become dense as time goes on.
Thanks in advance
Edited to include code to reproduce the above scenario
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
using namespace std;
int main()
{
const int m = 5000;
timespec blas_start, blas_end, cublas_start, cublas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m
// input matrix A: every 32nd element is non-zero
for(i = 0; i < m; i++)
{
A[i] = 0;
if( i % 32 == 0) //adjust for sparsity
A[i] = i;
}
// input matrix B: identity matrix
// col major = row major
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
{
if (i==j)
B[j*m + i] = 1;
else
B[j*m + i] = 0;
}
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1, A, m, B, m, 0, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
/*
for(i = 0; i < 12; i++)
printf("%f ", C[i]);
*/
//cublas section
cudaError_t cudaStat;
cublasHandle_t handle;
cublasCreate(&handle);
//Declaring Device Variables
float *A_d, *B_d, *C_d;
//Allocating Memory for Device Variables
cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");
cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");
cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");
// Moving values of A, B onto Device variables
cublasSetVector(m, sizeof(float), A, 1, A_d, 1);
cublasSetMatrix(m, m, sizeof(float), B, m, B_d, m);
// Do the actual multiplication
float alpha = 1.0f, beta = 0.0f;
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_start);
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, m, m, &alpha, A_d, 1, B_d, m, &beta, C_d, 1);
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_end);
cublasGetVector(m, sizeof(float), C, 1, C_d, 1);
/*
for(i = 0; i < 12; i++)
printf("%f ", C[i]);
*/
// Print times
// blas time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"BLAS Time = "<< totaltime << "\n";
//cublas
totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"CUBLAS Time = "<< totaltime << "\n";
return 0;
}
Ran it to get the following results
malang@ubuntu:~/uas/stackoverflow$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
malang@ubuntu:~/uas/stackoverflow$ ./blascomp.o
BLAS Time = 0.000964504
CUBLAS Time = 0.0365322
EDIT
Edited after the answer of @Eric
Use of cublasSgemv has greatly enhanced the performance on the GPU. But I still have this problem of cblas_sgemm being much more efficient for sparse matrices on the CPU. What could be the possible reasons?
EDIT Executed the following commands on the suggestion of @Eric, @osgx and @Robert Crovella
erisp@ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
erisp@ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 مارچ 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 مارچ 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 مارچ 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 مارچ 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
Your code has a problem: you are using the wrong BLAS API. You use the matrix-matrix multiplication routine gemm() to do a vector-matrix multiplication.
For vec-mat-mul or mat-vec-mul you should use gemv(). Of course gemm() can give a correct result with a matrix having only one row, but that is an unexpected corner case that gemv() is meant to handle, so you may not get peak performance on the GPU and/or CPU.
You could change to gemv() and benchmark again.
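As a sketch of what that change might look like for your GPU path (reusing the handle, A_d, B_d, C_d, alpha and beta already in your code, and treating the row-vector product C = A*B as the equivalent column-major operation C = B^T * A):
// gemv computes y = alpha * op(A) * x + beta * y.
// Here the matrix argument is B_d (m x m, column-major, lda = m),
// x is the length-m vector A_d, and y is the length-m result C_d.
cublasSgemv(handle, CUBLAS_OP_T,
            m, m,
            &alpha,
            B_d, m,
            A_d, 1,
            &beta,
            C_d, 1);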
EDIT
Here's my benchmark result with single-threaded MKL. The values of A and B are the same as in your code. I cannot reproduce your result of '0.000964504s' on the CPU; you could check the correctness of your result vector. There's a chance that your cblas library has a bug.
Using gemm()
BLAS Time = 0.0169784
CUBLAS Time = 0.00356155
Using gemv()
BLAS Time = 0.0167557
CUBLAS Time = 0.0013809
EDIT2
I can now reproduce the FAST result on Ubuntu 14.04 with the package libblas-dev.
The reason is answered in the following question.
cblas gemm time dependent on input matrix values - Ubuntu 14.04
In that particular version of BLAS, there is code that checks for zero elements. The checking cost is O(n^2), so it is worth doing for matrix-matrix multiplication, whose cost is O(n^3).
For GPU gemm(), because the order of computation is different (block by block instead of line by line), such an optimization may not be feasible. But it is doable for GPU gemv(), where the time spent loading the matrix from global memory could be saved.