compute-sanitizer reports both "Address is out of bounds" and "is inside the nearest allocation" - cuda

I'm having trouble interpreting this CUDA error message. I've used compute-sanitizer to track it down to a memory access in a particular kernel, a libcublas batch matrix multiplication. I don't understand the error, because it reports both "Address is out of bounds" and "Address is inside the nearest allocation". If it's inside an allocation, how is it out of bounds? What's actually going wrong here?
========= Invalid __global__ write of size 4 bytes
========= at 0x18e0 in void gemv2N_kernel<int, int, float, float, float, float, (int)128, (int)8, (int)4, (int)4, (int)1, (bool)0, cublasGemvParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>, float>>(T13)
========= by thread (0,0,0) in block (0,0,137)
========= Address 0x7fd8c0000224 is out of bounds
========= and is inside the nearest allocation at 0x7fd8aa000000 of size 503316480 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x209e4a]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so
========= Host Frame: [0x21caf9b]
========= in /usr/local/cuda-11.5.0/lib64/libcublasLt.so.11
========= Host Frame: [0x2224d18]
========= in /usr/local/cuda-11.5.0/lib64/libcublasLt.so.11
========= Host Frame: [0x8c257c]
========= in /usr/local/cuda-11.5.0/lib64/libcublasLt.so.11
========= Host Frame: [0x8c7717]
========= in /usr/local/cuda-11.5.0/lib64/libcublasLt.so.11
========= Host Frame: [0x67d937]
========= in /usr/local/cuda-11.5.0/lib64/libcublasLt.so.11
========= Host Frame:cublasLtSSSMatmul [0x6a3e03]
<a bunch of stack trace of my calling code omitted>

I don't understand the error, because it reports both "Address is out of bounds" and "Address is inside the nearest allocation". If it's inside an allocation, how is it out of bounds?
There is no contradiction here.
The sanitizer names the allocation nearest to the faulting address, and the address does lie inside that allocation, which you made within the current CUDA context. It does not lie inside the allocation you passed to the kernel to operate on.
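To make the distinction concrete, here is a minimal host-side C++ sketch of the address arithmetic. The nearest-allocation base, its size, and the faulting address are copied from the report above. The `kHypotheticalKernelBuffer` range is invented purely for illustration, because the post does not show which buffer was actually passed to cuBLAS.

```cpp
#include <cassert>
#include <cstdint>

// A half-open address range [base, base + size).
struct Range {
    std::uint64_t base;
    std::uint64_t size;
};

// True if addr lies inside the range. The subtraction only happens
// after the lower-bound check, so it cannot wrap around.
bool contains(const Range& r, std::uint64_t addr) {
    return addr >= r.base && addr - r.base < r.size;
}

// Byte offset of addr from the start of the range; addr must be inside it.
std::uint64_t offset_in(const Range& r, std::uint64_t addr) {
    assert(contains(r, addr));
    return addr - r.base;
}

// Values copied from the sanitizer report.
const Range kNearestAllocation{0x7fd8aa000000ULL, 503316480ULL};
const std::uint64_t kFaultingAddress = 0x7fd8c0000224ULL;

// HYPOTHETICAL: a made-up range standing in for the buffer the kernel
// was meant to write. The real range is not shown in the post.
const Range kHypotheticalKernelBuffer{0x7fd8c8000000ULL, 1ULL << 20};
```

With these numbers, `contains(kNearestAllocation, kFaultingAddress)` is true and `offset_in` gives 369099300 bytes, while `contains(kHypotheticalKernelBuffer, kFaultingAddress)` is false. Both statements in the report can therefore hold at once: the address belongs to some live allocation, just not to the one the kernel should have been writing.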

Related

RTX 2080 Ti cuda-memcheck hit error at the beginning of creating Cublas context

I run the same program on both a GTX 1080 Ti and an RTX 2080 Ti. When I check the program with the cuda-memcheck tool, I always get the following errors on the RTX 2080 Ti.
========= CUDA-MEMCHECK
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaFuncSetAttribute.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x359363]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x79a03c]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x72c2ab]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x72c610]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7) [0x14b337]
========= Host Frame:./GPU_LMM (main + 0x43) [0xb633]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:./GPU_LMM (_start + 0x2a) [0xb77a]
=========
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x359363]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x79deb3]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x72c2b8]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 [0x72c610]
========= Host Frame:/usr/local/cuda/lib64/libcublas.so.10.0 (cublasCreate_v2 + 0x1ce7) [0x14b337]
At this point the program does nothing except create a cuBLAS context. I am not sure what the problem is. Is it caused by a version mismatch between CUDA 10.0 and the RTX 2080 Ti?
The information about my server is as follows.
NVIDIA-SMI 410.93 Driver Version: 410.93 CUDA Version: 10.0
The RTX 2080 Ti should be supported in the latest CUDA version, 10.0.130.
Make sure your driver is up to date too: on Linux that means driver version >= 410.48, and on Windows >= 411.31.
cuBLAS got Turing support in version 10, too.
The real problem is that the cuBLAS library is not compatible with CUDA 10 and the RTX GPU.

Trouble compiling/running CUDA code involving dynamic parallelism

I am trying to use dynamic parallelism with CUDA, but I cannot get past the compilation step.
I am working on a GPU with compute capability 3.5 and CUDA version 7.5.
Depending on the switches I use in the compile command, I get different error messages, but by following the documentation I arrived at one line that compiles successfully:
nvcc -arch=compute_35 -rdc=true cudaDynamic.cu -o cudaDynamic.out -lcudadevrt
But when the program is launched, the whole program fails. Under cuda-memcheck, every call to an API function produces the same error message:
========= CUDA-MEMCHECK
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to ...
I have also tried this line (taken from CUDA dynamic samples makefile):
nvcc -ccbin g++ -I../../common/inc -m64 -dc -gencode arch=compute_35,code=compute_35 -o cudaDynamic.out -c cudaDynamic.cu
But upon execution, I get:
cudaDynamic.out: Permission denied
I would like to understand how to correctly compile CUDA code that uses dynamic parallelism, because all the other compilation lines that I have tried so far have failed.
I fixed the problem by fully reinstalling CUDA.
I'm now able to compile both the CUDA samples and my own code.

Whatever Cuda function call returns cudaErrorMemoryAllocation

I was wondering why none of the simple CUDA code examples I found on the Internet work for me, and I found that even this very simple code causes an error:
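#include <stdio.h>
#include <cuda_runtime.h>  // declares cudaMemGetInfo and cudaError_t

int main(int argc, char ** argv) {
    size_t available, total;
    // This is the first runtime API call, so it also initializes the CUDA context.
    cudaError_t err = cudaMemGetInfo(&available, &total);
    if (err == cudaErrorMemoryAllocation) {
        printf("cudaErrorMemoryAllocation");
    } else {
        printf("OK or not memory allocation error");
    }
    return 0;
}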
The code above always prints out "cudaErrorMemoryAllocation".
Here is the output of cuda-memcheck test for this program:
cudaErrorMemoryAllocation
========= CUDA-MEMCHECK
========= Program hit error 2 on CUDA API call to cudaMemGetInfo
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuD3D11CtxCreate + 0x118a92) [0x137572]
========= Host Frame:D:\Cuda\a.exe [0x1223]
========= Host Frame:D:\Cuda\a.exe [0x101c]
========= Host Frame:D:\Cuda\a.exe [0x901f]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]
=========
========= ERROR SUMMARY: 1 error
Platform Windows 8 64-bit
Compiler Visual Studio 2008
Compute capability 1.1 (GeForce 8800 GT)
CUDA version 5.5
When creating a CUDA context, a lot of stuff is allocated, so your available memory might not be enough to initialize it. That might explain the cudaErrorMemoryAllocation error you're getting.
cudaMemGetInfo itself doesn't return that specific error, so it must come from something else:
Note that this function may also return error codes from previous,
asynchronous launches.
The cuD3D11CtxCreate in the stack trace also creates a CUDA context, so that might be it.
Also, if multiple applications are contending for your device, that might be the cause as well.
The problem was solved. I'm still not sure what caused it: either a conflict between the old video card driver and a new one silently installed under the "Recommended" option in the CUDA installer, or a botched VS 2008 installation. In any case, I decided to reinstall everything and try VS 2012 instead of VS 2008, and now everything works fine.

Cuda kernel error if I use cuda-memcheck

I have a CUDA kernel that runs well under the Nsight CUDA profiler or when I run it directly from the terminal. But if I use this command
cuda-memcheck --leak-check full ./CudaTT 1 ../../file.jpg
it crashes with "unspecified launch failure". I run this check after each kernel:
e=cudaDeviceSynchronize();
if (e != cudaSuccess) printf("Fail in kernel 2 %s",cudaGetErrorString(e));
and cuda-memcheck shows several of these:
========= Program hit error 4 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 4 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
in the end it shows
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
Any idea why this happens?
Edit:
I commented out another kernel, which was failing to launch because it used too many registers, and the error for the kernel above has now changed to "the launch timed out and was terminated". Again, it runs fine under the CUDA profiler and from the terminal without cuda-memcheck, but under cuda-memcheck it shows this
========= Program hit error 6 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 6 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
========= Host Frame:[0xbf913ea8]
And the same 10 errors in the end
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
Error 6 appears to be caused by a kernel running too long and timing out, but how come it works without cuda-memcheck? The profiler shows that the kernel takes 3.771 seconds.
Another strange behavior: I print some values after the calculations, and they differ depending on whether I run under cuda-memcheck.
A better link would be http://docs.nvidia.com/cuda/cuda-memcheck/index.html. cuda-memcheck can and does alter the run time of the application's CUDA kernels. If the GPU is being used for display, a watchdog timeout prevents any kernel from running longer than a fixed limit (on Linux, this is usually ~5 seconds). Given that the uninstrumented kernel takes 3.7 seconds, it is very likely that the instrumented version run by memcheck exceeds the watchdog limit, so the kernel launch times out. There are a couple of options in such cases:
Run on a system where X has not been started
Launch the X server in non-interactive mode using Option "Interactive" "off" in /etc/X11/xorg.conf. Note that in this mode, the display will not update while the CUDA kernel is running.
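As a rough worked example of the timing argument, the sketch below computes how much slowdown the kernel can absorb before it crosses the watchdog limit. The 3.771 second baseline is the profiler figure from the question. The 5.0 second limit is only the approximate Linux value mentioned above, not a guaranteed constant, and the function names are made up for illustration.

```cpp
// Largest slowdown factor a kernel can absorb before its run time
// exceeds the watchdog limit.
double max_slowdown_before_timeout(double baseline_seconds,
                                   double watchdog_seconds) {
    return watchdog_seconds / baseline_seconds;
}

// True if a kernel slowed down by the given factor would run longer
// than the watchdog allows.
bool would_time_out(double baseline_seconds, double slowdown_factor,
                    double watchdog_seconds) {
    return baseline_seconds * slowdown_factor > watchdog_seconds;
}
```

Under these assumptions, `max_slowdown_before_timeout(3.771, 5.0)` is about 1.33, so even a modest instrumentation overhead is enough to push this kernel past the limit.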
It appears kernels run much slower with cuda-memcheck:
people.maths.ox.ac.uk/gilesm/cuda/doc/cuda-memcheck.pdf
Page 16
"Applications run much slower under CUDA-MEMCHECK. This may cause some kernel launches to fail with a launch timeout error when running with CUDA-MEMCHECK enabled."

How to decipher error code from Cuda-Memcheck

Problem:
Running cuda-memcheck in a console on Windows reports
Program hit error 2 on CUDA API call to cudaLaunch.
The question is what "error 2" means. I thought it was the error code from the cudaError enum, but I am not sure.
Platform: Windows 7 (64-bit)
Compiler: Visual Studio 2010
Compute capability: 1.1 (Quadro FX 3800M)
CUDA version: 5.0
Question: How do you find out what the error number means?
Cuda-Memcheck results
Running 1 test case...
========= CUDA-MEMCHECK
========= Program hit error 2 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\system32\nvcuda.dll
(cuD3D11CtxCreate + 0x11f702) [0x144812]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\cudart64_50_35.dll
(cudaLaunch + 0x2a5) [0x235c5]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(__device_stub__ZN17signal_processing6kernel24cfar_move_valid_elementsEPjPK6float2PS2_fS1_S1_ + 0xab) [0x313cb]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(signal_processing::kernel::cfar_move_valid_elements + 0x1d) [0x313ed]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(signal_processing::algorithm::constant_false_alarm_rate::identify_valid + 0x1af) [0x2a46f]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(signal_processing::algorithm::constant_false_alarm_rate::process + 0x2e3) [0x2bc03]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(server::stream::operator() + 0x5c2) [0x20ce2]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::`anonymous namespace'::thread_start_function + 0x21) [0xe01a1]
========= Host Frame:C:\Windows\system32\MSVCR100.dll
(endthreadex + 0x43) [0x21d9f]
========= Host Frame:C:\Windows\system32\MSVCR100.dll
(endthreadex + 0xdf) [0x21e3b]
========= Host Frame:C:\Windows\system32\kernel32.dll
(BaseThreadInitThunk + 0xd) [0x1652d]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll
(RtlUserThreadStart + 0x21) [0x2c521]
=========
========= Program hit error 17 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\system32\nvcuda.dll
(cuD3D11CtxCreate + 0x11f702) [0x144812]
========= Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\cudart64_50_35.dll
(cudaFree + 0x248) [0x24a98]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(signal_processing::algorithm::pulse_compression::~pulse_compression + 0x38) [0x29228]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::detail::sp_counted_impl_pd<signal_processing::algorithm::pulse_compression * __ptr64,boost::detail::sp_ms_deleter<signal_processing::algorithm::pulse_compression> >::dispose + 0x18) [0x18538]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(server::stream::~stream + 0xe7) [0x167e7]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::detail::thread_data<server::stream>::`scalar deleting destructor' + 0x26) [0x14246]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::thread::join + 0x12d) [0xe146d]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::thread_group::join_all + 0x49) [0x13fd9]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(unit_tests::test_stream::test_method + 0x430) [0x14a50]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::ut_detail::callback0_impl_t<boost::unit_test::ut_detail::unused,void (__cdecl*)(void)>::invoke + 0xc) [0x6dcc]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::ut_detail::callback0_impl_t<int,boost::unit_test::`anonymous namespace'::zero_return_wrapper_t<boost::unit_test::callback0<boost::unit_test::ut_detail::unused> > >::invoke + 0x16) [0xdab16]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::execution_monitor::catch_signals + 0xb1) [0xdb561]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::execution_monitor::execute + 0x37) [0xdb627]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::unit_test_monitor_t::execute_and_translate + 0x5e) [0xdaa8e]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::framework_impl::visit + 0x122) [0x92cc2]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::traverse_test_tree + 0xae) [0x8d82e]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::traverse_test_tree + 0xae) [0x8d82e]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::framework::run + 0x4a2) [0x93c62]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(boost::unit_test::unit_test_main + 0x7e) [0xe5ffe]
========= Host Frame:C:\Users\Stephen.Torri\Documents\Visual Studio 2010\Projects\signal_processing\Win64\Bin\RelWithDebInfo\test_stream.exe
(__tmainCRTStartup + 0x11a) [0xe2082]
========= Host Frame:C:\Windows\system32\kernel32.dll
(BaseThreadInitThunk + 0xd) [0x1652d]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x2c521]
=========
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 2 errors
Please refer to cudaLaunch() in the CUDA runtime API documentation.
You can find the meaning of the error code there.
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
On the other hand, I would suggest showing more of the cuda-memcheck output; it usually contains more human-readable information than just a code.
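For quick reference, here is a small host-only C++ lookup covering just the numeric codes whose meaning is stated in the logs and answers on this page (2, 4, 6, 11 and 30). It is a sketch, not a replacement for the real API: the struct and function names are made up for illustration, and the numeric values reflect the older toolkits shown in these logs, so they should not be assumed stable across CUDA versions. In a program that links the CUDA runtime, calling `cudaGetErrorString` on the returned `cudaError_t` is the more reliable route.

```cpp
#include <string>

// One entry per error code that appears, with its meaning, in the logs above.
struct LegacyCudaError {
    int code;
    const char* name;
    const char* message;
};

// Numbering as reported by the older toolkits in these threads.
const LegacyCudaError kLegacyErrors[] = {
    {2,  "cudaErrorMemoryAllocation", "unable to allocate enough memory"},
    {4,  "cudaErrorLaunchFailure",    "unspecified launch failure"},
    {6,  "cudaErrorLaunchTimeout",    "the launch timed out and was terminated"},
    {11, "cudaErrorInvalidValue",     "invalid argument"},
    {30, "cudaErrorUnknown",          "unknown error"},
};

// Enumerator name for a code, or a placeholder if the table lacks it.
std::string legacy_error_name(int code) {
    for (const LegacyCudaError& e : kLegacyErrors) {
        if (e.code == code) return e.name;
    }
    return "not in this table";
}

// Short description for a code, or a placeholder if the table lacks it.
std::string legacy_error_message(int code) {
    for (const LegacyCudaError& e : kLegacyErrors) {
        if (e.code == code) return e.message;
    }
    return "not in this table";
}
```

For the report in the question, `legacy_error_name(2)` gives `cudaErrorMemoryAllocation`, matching the answer above. Codes outside the table, such as the 17 reported for cudaFree, are deliberately left unmapped here and should be looked up in the documentation for the toolkit version in use.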