Sometimes, after running fine for a while, I get an error like this with Theano / CUDA:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[512 2048], a.dim=[512 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(512, 493), (493, 2048)]
Inputs strides: [(493, 1), (2048, 1)]
Inputs values: ['not shown', 'not shown']
As my code runs fine for a while (I do neural network training; most of the time it runs all the way through, and even when this error occurred, it had already run fine for >2000 mini-batches), I wonder about the cause. Maybe some hardware fault?
This is with CUDA 6.0 and a very recent Theano (yesterday from Git), Ubuntu 12.04, GTX 580.
I also got the error with CUDA 6.5 on a K20:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2899 2000], a.dim=[2899 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2899, 493), (493, 2000)]
Inputs strides: [(493, 1), (2000, 1)]
Inputs values: ['not shown', 'not shown']
(Another error I sometimes got in the past is now this instead. Not sure if it is related.)
Via Markus, who got the same error:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2 100], a.dim=[2 9919], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuFlatten{2}.0, weight_hidden_)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2, 9919), (9919, 100)]
Inputs strides: [(9919, 1), (100, 1)]
Inputs values: ['not shown', 'not shown']
With CUDA 6.5, Windows 8.1, Python 2.7, GTX 970M.
The error only occurs in my own network; if I run the LeNet example from Theano, it runs fine. The network also compiles and runs fine on the CPU (and on the GPU for some colleagues using Linux). Does anyone have an idea what the problem could be?
Just for reference in case anyone stumbles upon this:
This doesn't occur for me anymore. I'm not exactly sure what fixed it, but I think the main difference is that I avoid any multithreading and forks (without exec). This caused many similar problems, e.g. Theano CUDA error: an illegal memory access was encountered (StackOverflow), and Theano CUDA error: an illegal memory access was encountered (Google Groups discussion). Especially the discussion on Google Groups is very helpful.
Theano functions are not thread-safe. That is not a problem for me, because I only use them in one thread, but I still think that other threads might cause these problems. Maybe it is related to Python's garbage collector, which can free some CudaNdarray in another thread while the theano.function is running. I looked a bit at the relevant Theano code and I'm not sure it covers all such cases.
Note that you might not even be aware that you have background threads; some Python stdlib code can spawn them, e.g. multiprocessing.Queue will do that.
I cannot avoid having multiple threads, so until this is fixed in Theano, I create a new subprocess with a single thread where I do all the Theano work. This also has several advantages: a clearer separation of the code, being faster in some cases because everything really runs in parallel, and being able to use multiple GPUs.
Note that just using the multiprocessing module did not work that well for me, because a few libs (NumPy and others, and maybe Theano itself) can behave badly in a forked process (depending on the versions, the OS, and race conditions). Thus I needed a real subprocess (fork + exec, not just fork).
My code is here, in case anyone is interested. There is ExecingProcess, which is modeled after multiprocessing.Process but does a fork+exec. (Btw, on Windows the multiprocessing module does this anyway, because there is no fork on Windows.) And there is AsyncTask, which adds a duplex pipe on top of this and works with both ExecingProcess and the standard multiprocessing.Process.
See also: Theano Wiki: Using multiple GPUs
Ran into a similar issue, and fwiw, in my case it was solved by eliminating the import of another library that used pycuda. It appears theano really does not like to share.
Related
I have completed writing my CUDA kernel, and confirmed it runs as expected when I compile it using nvcc directly, by:
Validating with test data over 100 runs (just in case)
Using cuda-memcheck (memcheck, synccheck, racecheck, initcheck)
Yet the results printed to the terminal while the application is being profiled with Nsight Compute differ from run to run. I am curious whether the difference is a cause for concern, or if this is the expected behavior.
Note: The application also gives correct and consistent results while being profiled by nvprof.
I followed up on the NVIDIA forums but will post here as well for tracking:
What inconsistencies are you seeing in the output? Nsight Compute runs a kernel multiple times to collect all of its information, so things like print statements in the kernel will show up multiple times. Could it be related to that, or is it a value being calculated differently? One other issue: with Unified Memory (UVM) or zero-copy memory, Nsight Compute is not able to restore those values before each replay. Are you using that in your application? If so, the application replay mode could help. It may be worth trying to see if anything changes.
I was able to resolve the issue by addressing my shared memory initializations. Since Nsight Compute runs a kernel multiple times, as @Jackson stated, the effects of uninitialized memory were amplified (I was performing atomicAdd into uninitialized memory).
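For illustration, here is a minimal sketch of the kind of fix this implies (my own example, not the original kernel): zero the shared memory before accumulating into it with atomicAdd, so every replayed launch starts from the same state.

__global__ void blockHistogram(const int *data, int n, int *out, int numBins)
{
    extern __shared__ int bins[];  // size passed at launch: numBins * sizeof(int)

    // Shared memory is not zero-initialized; do it explicitly.
    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();  // all threads must see the initialized values

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[data[idx] % numBins], 1);
    __syncthreads();

    // Fold the block-local counts into the global result.
    for (int i = threadIdx.x; i < numBins; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);
}

Without the initialization loop, each replay starts from whatever the previous launch left in shared memory, which is one way to get results that drift from run to run only under the profiler.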
I'm using two identical GPUs, GeForce RTX 2080 Ti, on Ubuntu 16.04 for deep learning.
Since a week ago, one of the GPUs has suddenly been giving me trouble.
One GPU has worked very well, but the other one shows errors.
The error is "An illegal memory access was encountered".
While searching for a solution, I updated CUDA to version 10.2 and the NVIDIA driver to 440.64.00,
and I modified /etc/X11/xorg.conf:
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
Option "Interactive" "0" # I added this**
EndSection
Now it seems to work, but only intermittently.
Most of the time, it shows a CUDA runtime error (700), a GPU memory access error.
Even when it worked, I found different performance between the two GPUs.
The normal GPU showed:
avg_loss 0.04434689141832936
max_acc 0.9197530864197536
avg_acc 0.6771604938271607
The other one, the abnormal GPU, showed:
avg_loss 0.16801862874683451
max_acc 0.9197530864197536
avg_acc 0.541358024691358
I ran the same command on both GPUs, like:
CUDA_VISIBLE_DEVICES={gpu_num} python main.py --test config/www.yml
I also tried other open-source code, with the same result.
Maybe the abnormal GPU is broken, but I don't know.
So, does anyone have a solution to fix the GPU (the illegal memory access)?
I don't think it's a driver compatibility issue, the code, or some mistake on my side, because the problem occurs only on the abnormal GPU.
I have a rather large struct in my CUDA code:
struct cDevData {
~5GB worth of stuff ...
};
I allocate the space required to hold that structure during system setup with cudaMalloc, because Windows limits static code and data to 2 GB. Annoying, but fine. Obviously I'm compiling a 64-bit application, but when I do, I get the following error for the Debug configuration:
ptxas C : /Users/user/AppData/Local/Temp/tmpxft_0000123c_00000000-4_kernel.ptx, line 2897; error : Value out of range for type .b32
ptxas fatal : Ptx assembly aborted due to errors
And curiously a different one for Release configuration:
error C2089: 'cDevData' : 'struct' too large
It only started happening when I increased the size of this structure beyond 4 GB.
I've also tried to compile a 32-bit application just to check, and I get a different (expected) error: class is too large.
What's going on, and is there a way around it?
System: Windows 7, Visual Studio 2012, CUDA toolkit 8.0, GPU = Titan.
This is a combination of two bugs - one in NVCC and another in VS2012. From NVIDIA's response:
The error generated in the Release configuration "error C2089: 'cDevData' : 'struct' too large" is coming from the host compiler. So this issue is due to a limitation in the host compiler on Windows. We will fix the other issue exposed in Debug configuration. However, even after the fix, the compilation will fail on Windows due to the host compiler limitation.
I don't know if this issue is fixed in VS2015 or later.
In the meantime, I circumvented the problem(s) through an industrious use of boost::mpl and some macros, to retain struct-like semantics (accessing fields by name, complex field types like multi-dimensional arrays retaining their dimensions, etc.). Thus, in the end, my code required minimal changes (changing the field dereference operator -> into a macro, and replacing sizeof(cDevData) with another macro).
I am using the ManagedCuda library in a C# project to utilise the GPU. Currently I am following along with this tutorial regarding how to write code compatible between C# and C++, after failing to achieve it with OpenCV.
Everything seems to be working fine with my code: the kernel is found and built, and the method call is executed. However, I am getting an error:
An unhandled exception of type 'ManagedCuda.CudaException' occurred in ManagedCuda.dll
Additional information: ErrorIllegalAddress: While executing a kernel, the device
encountered a load or store instruction on an invalid memory address.
The context cannot be used, so it must be destroyed (and a new one should be created).
I understand that C# is complaining that a valid address is not found when it attempts to pass the device pointer to the kernel. The only difference I can tell between my code and the post in the cited tutorial is that ManagedCuda seems to have recently had a facelift which allows users to use lambdas. I've done some reading and haven't found anything to clarify whether or not this is what's causing my problem:
static Func<int, int, int> cudaAdd = (a, b) =>
{
    // init output parameters
    CudaDeviceVariable<int> result_dev = 0;
    int result_host = 0;
    // run CUDA method
    addWithCuda.Run(a, b, result_dev.DevicePointer); // <-- code throws the error here
    // copy return to host
    result_dev.CopyToHost(ref result_host);
    return result_host;
};
In the original tutorial code the OP uses CudaDeviceVariable result_dev = 0;. Could this be the problem? I don't see why it would be, but maybe my cast is wrong?
For clarity here is the kernel which is being called:
__global__ void kernel(int a, int b, int *c)
{
    *c = (a + b) * (a + b);
}
TL;DR: The exception is related to the 32/64-bit settings in your C# project. Either set the platform target to x86, or, if you have it on Any CPU, make sure to tick Prefer 32-bit.
How I found out:
Made a solution in .NET 4.5 according to https://algoslaves.wordpress.com/2013/08/25/nvidia-cuda-hello-world-in-managed-c-and-f-with-use-of-managedcuda/ (same as OP)
used NuGet to add ManagedCuda 6.5 - Standalone
Worked fine
Backup point A.
Changed .NET version to 4.0, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Additional information: ErrorIllegalAddress: While executing a kernel, the device encountered a load or store instruction on an invalid memory address.
Switched .NET version to 4.5, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Same as above.
Clean solution, Rebuild solution, Build solution:
Exception thrown: Same as above.
Manually deleted every single file/folder generated by Visual Studio:
Exception thrown: Same as above.
Reopen project backed up from point A:
Worked fine.
Manually deleted every single file/folder generated by Visual Studio in both the working and the not-working project.
Visually compared all files.
Every file was the same, except that the not-working version had an extra <Prefer32Bit>false</Prefer32Bit> in MangedCudaTest.csproj.
Deleted the lines with <Prefer32Bit>false</Prefer32Bit>.
The not-working version finally worked fine.
Made the same changes in my main project; it finally worked fine.
From the comments it seems that the thrown exception is due to running on different threads. In a single-threaded environment the sample code runs fine, returning the right result. In order to use CUDA in a multithreaded application, proper synchronization of the threads and binding of the CUDA context to the currently active thread would be necessary.
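ManagedCuda wraps the CUDA driver API, where a context is made current per CPU thread. As a rough illustration of that idea in plain driver-API C++ (my own sketch, not taken from the answer), a worker thread has to bind the context before doing any CUDA work:

#include <cuda.h>
#include <thread>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // the context is current on this (creating) thread

    std::thread worker([ctx]() {
        cuCtxSetCurrent(ctx);    // bind the same context to this worker thread
        // ... allocate memory, launch kernels, copy results on this thread ...
    });
    worker.join();

    cuCtxDestroy(ctx);
    return 0;
}

If a second thread uses the library without such binding (and without synchronization), illegal-address-style failures like the one above are a plausible outcome.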
I had the same issue; the solution for me was to set the .NET project to 64-bit, NOT 32-bit as suggested in aeroson's answer. I think this may be because I'm using CUDA SDK 7.5, and I read somewhere about 32-bit support being phased out.
I am using a Quadro K2000M card, CUDA capability 3.0, CUDA driver 5.5, runtime 5.0, programming with Visual Studio 2010. My GPU algorithm runs many parallel breadth-first searches (BFS) of a tree (which is constant). The threads are independent except for reading from a constant array and the tree. In each thread there can be some malloc/free operations, following the BFS algorithm with queues (no recursion). There are N threads; the number of tree leaf nodes is also N. I used 256 threads per block and (N+256-1)/256 blocks per grid.
Now the problem is that the program works for N=100000 threads or fewer, but fails for more than that. It also works on the CPU, or on the GPU thread by thread. When N is large (e.g. >100000), the kernel crashes and then the cudaMemcpy from device to host also fails. I tried Nsight, but it is too slow.
Now I set cudaDeviceSetLimit(cudaLimitMallocHeapSize, 268435456); I also tried larger values, up to 1 GB; cudaDeviceSetLimit succeeded, but the problem remains.
Does anyone know a common reason for the above problem, or any hints for further debugging? I tried to put in some printf's, but there is a ton of output. Moreover, once a thread crashes, all remaining printf output is discarded, which makes it hard to identify the problem.
"CUDA Driver 5.5, runtime 5.0" -- that seems odd.
You might be running into a Windows TDR event. Based on your description, I would check that first. If, as you increase the number of threads, the kernel begins to take more than about 2 seconds to execute, you may hit the Windows timeout.
You should also add proper CUDA error checking to your code, for all kernel calls and CUDA API calls. A Windows TDR event will be more easily evident based on the error codes you receive, or the error codes may steer you in another direction.
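For reference, a minimal sketch of what such checking can look like (the helper name is mine, not from any CUDA header):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message whenever a CUDA call reports an error.
static void cudaCheck(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error in %s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Typical usage around a kernel launch:
//   kernel<<<blocks, threads>>>(...);
//   cudaCheck(cudaGetLastError(), "kernel launch");          // launch/configuration errors
//   cudaCheck(cudaDeviceSynchronize(), "kernel execution");  // errors raised during execution, e.g. after a TDR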
Finally, I would run your code with cuda-memcheck in both the passing and failing cases, looking for out-of-bounds accesses in the kernel or other issues.
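One more thing worth ruling out in this particular setup (my addition, not part of the answer above): device-side malloc returns NULL once the heap set by cudaDeviceSetLimit is exhausted, and dereferencing that NULL pointer looks exactly like a kernel crash that only appears at large N. A minimal sketch of guarding against it:

__global__ void perThreadQueueKernel(int *allocFailed, int queueCapacity)
{
    // Per-thread allocation from the device heap, as in the BFS described above.
    int *queue = (int *)malloc(queueCapacity * sizeof(int));
    if (queue == NULL) {
        // Heap exhausted: record the failure instead of dereferencing a NULL pointer.
        atomicExch(allocFailed, 1);
        return;
    }
    // ... per-thread work using the queue ...
    free(queue);
}

// On the host, copy *allocFailed back after the kernel; if it is set, raise
// cudaLimitMallocHeapSize further or reduce the per-thread allocation size.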