CUDA: Compilation failure on large struct >4GB

I have a rather large struct in my CUDA code:
struct cDevData {
    ~5GB worth of stuff ...
};
I allocate the space required to hold that structure during system setup with cudaMalloc, because Windows limits static code and data to 2GB. Annoying, but fine. Obviously I'm compiling a 64-bit application, but when I do, I get the following error for the Debug configuration:
ptxas C : /Users/user/AppData/Local/Temp/tmpxft_0000123c_00000000-4_kernel.ptx, line 2897; error : Value out of range for type .b32
ptxas fatal : Ptx assembly aborted due to errors
And, curiously, a different one for the Release configuration:
error C2089: 'cDevData' : 'struct' too large
It only started happening when I increased the size of the structure beyond 4GB.
I've also tried compiling a 32-bit application just to check, and I get a different (expected) error: class is too large.
What's going on, and is there a way around it?
System: Windows 7, Visual Studio 2012, CUDA toolkit 8.0, GPU = Titan.

This is a combination of two bugs - one in NVCC and another in VS2012. From NVIDIA's response:
The error generated in the Release configuration "error C2089: 'cDevData' : 'struct' too large" is coming from the host compiler. So this issue is due to a limitation in the host compiler on Windows. We will fix the other issue exposed in Debug configuration. However, even after the fix, the compilation will fail on Windows due to the host compiler limitation.
I don't know if this issue is fixed in VS2015 or later.
In the meantime I worked around the problem(s) through industrious use of boost::mpl and some macros, retaining struct-like semantics (fields are accessed by name, complex field types such as multi-dimensional arrays keep their dimensions, etc.). In the end my code required only minimal changes (turning the field dereference operator -> into a macro, and replacing sizeof(cDevData) with another macro).
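For illustration, here is a minimal sketch of the general idea (much simplified relative to what I actually did, and with invented field names and sizes): each field gets a hand-computed 64-bit byte offset, the whole blob is a single cudaMalloc, and a macro stands in for the usual p->field access, so neither nvcc nor the host compiler ever sees a single type larger than 4GB.
#include <cstdint>
#include <cuda_runtime.h>

#define N_WEIGHTS (1ull << 30)   // 2^30 floats = 4 GiB
#define N_BIASES  (1ull << 28)   // 2^28 floats = 1 GiB

#define OFF_WEIGHTS 0ull
#define OFF_BIASES  (OFF_WEIGHTS + N_WEIGHTS * sizeof(float))
#define DEVDATA_BYTES (OFF_BIASES + N_BIASES * sizeof(float))   // ~5 GiB total

// stands in for data->weights etc.; offsets are 64-bit, so no .b32 overflow
#define FIELD(base, NAME, type) ((type *)((char *)(base) + OFF_##NAME))

__global__ void initWeights(void *data)
{
    uint64_t i = blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;
    if (i < N_WEIGHTS)
        FIELD(data, WEIGHTS, float)[i] = 0.0f;
}

int main()
{
    void *d_data = nullptr;
    cudaMalloc(&d_data, DEVDATA_BYTES);   // a single >4GB allocation is fine
    initWeights<<<(unsigned)((N_WEIGHTS + 255) / 256), 256>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}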

Related

Two identical GPUs, but different performance in deep learning

I'm using two identical GPUs, GeForce RTX 2080 Ti, on Ubuntu 16.04 for deep learning.
A week ago, one of the GPUs suddenly started giving me trouble.
One GPU has been working fine, but the other has been showing errors.
The error is "An illegal memory access was encountered".
I searched for a solution and updated CUDA to 10.2 and the NVIDIA driver to 440.64.00.
I also modified /etc/X11/xorg.conf:
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
Option "Interactive" "0" # I added this**
EndSection
Now it sometimes works, but only randomly; most of the time it shows CUDA runtime error (700), a GPU memory access error.
When it did work, I found that the two GPUs performed differently.
The normal GPU showed:
avg_loss 0.04434689141832936
max_acc 0.9197530864197536
avg_acc 0.6771604938271607
The other one, the abnormal GPU, showed:
avg_loss 0.16801862874683451
max_acc 0.9197530864197536
avg_acc 0.541358024691358
I ran the same command for each GPU, like:
CUDA_VISIBLE_DEVICES={gpu_num} python main.py --test config/www.yml
I also tried other open-source code, but the situation was the same.
Maybe the abnormal GPU is broken, but I don't know.
So, does anyone have a solution for fixing the GPU (the illegal memory access)?
I don't think it's a driver compatibility issue, the code, or a mistake on my part, because the problem occurs only on the abnormal GPU.

Theano: cublasSgemm failed (14) an internal operation failed

Sometimes, after running fine for a while, I get an error like this with Theano/CUDA:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[512 2048], a.dim=[512 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(512, 493), (493, 2048)]
Inputs strides: [(493, 1), (2048, 1)]
Inputs values: ['not shown', 'not shown']
Since my code runs fine for a while (I'm doing neural network training; it usually runs all the way through, and even when this error occurred it had already run fine for >2000 mini-batches), I wonder about the cause. Maybe some hardware fault?
This is with CUDA 6.0 and a very recent Theano (yesterday from Git), Ubuntu 12.04, GTX 580.
I also got the error with CUDA 6.5 on a K20:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2899 2000], a.dim=[2899 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2899, 493), (493, 2000)]
Inputs strides: [(493, 1), (2000, 1)]
Inputs values: ['not shown', 'not shown']
(In the past I sometimes got a different error instead. I'm not sure whether it's related.)
Via Markus, who got the same error:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2 100], a.dim=[2 9919], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuFlatten{2}.0, weight_hidden_)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2, 9919), (9919, 100)]
Inputs strides: [(9919, 1), (100, 1)]
Inputs values: ['not shown', 'not shown']
With CUDA 6.5, Windows 8.1, Python 2.7, GTX 970M.
The error only occurs in my own network; if I run the LeNet example from Theano, it runs fine. The network also compiles and runs fine on the CPU (and on the GPU for some colleagues using Linux). Does anyone have an idea what the problem could be?
Just for reference in case anyone stumbles upon this:
This no longer occurs for me. I'm not exactly sure what fixed it, but I think the main difference is that I avoid any multithreading and any forks without exec. Forking caused many similar problems, e.g. Theano CUDA error: an illegal memory access was encountered (StackOverflow) and Theano CUDA error: an illegal memory access was encountered (Google Groups discussion). The Google Groups discussion in particular is very helpful.
Theano functions are not thread-safe. That in itself is not a problem for me, because I only use them in one thread. However, I still think that other threads might cause these problems. Maybe it is related to Python's GC, which may free some CudaNdarray in another thread while the theano.function is running.
I looked a bit at the relevant Theano code, and I'm not sure it covers all such cases.
Note that you might not even be aware that you have background threads. Some Python stdlib code can spawn them; e.g., multiprocessing.Queue will do that.
I cannot avoid having multiple threads, so until this is fixed in Theano, I create a new subprocess with a single thread in which I do all the Theano work. This also has several advantages: clearer separation of the code, being faster in some cases because everything really runs in parallel, and being able to use multiple GPUs.
Note that just using the multiprocessing module did not work that well for me, because a few libraries (NumPy and others, and maybe Theano itself) can misbehave in a forked process (depending on the versions, the OS, and race conditions). Thus I needed a real subprocess (fork + exec, not just fork).
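For illustration, here is a minimal POSIX sketch in C of the fork + exec idea (the worker script name is invented; my actual implementation is in Python). The point is that exec replaces the forked image with a freshly started process, so no library or CUDA state is inherited from the parent:
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: replace the forked image with a clean interpreter */
        execlp("python", "python", "theano_worker.py", (char *)NULL);
        _exit(127);               /* only reached if exec failed */
    }
    int status;
    waitpid(pid, &status, 0);     /* parent waits for the worker */
    return 0;
}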
My code is here, in case anyone is interested in this.
There is ExecingProcess, which is modeled after multiprocessing.Process but does a fork + exec. (By the way, on Windows the multiprocessing module does this anyway, because there is no fork on Windows.)
And there is AsyncTask, which adds a duplex pipe on top of this and works with both ExecingProcess and the standard multiprocessing.Process.
See also: Theano Wiki: Using multiple GPUs
I ran into a similar issue, and FWIW, in my case it was solved by eliminating the import of another library that used PyCUDA. It appears Theano really does not like to share.

ManagedCuda: ErrorIllegalAddress: While executing a kernel, the device encountered a load or store instruction on an invalid memory address

I am using the ManagedCuda library in a C# project to utilise the GPU. Currently I am following along with this tutorial on how to write code that works between C# and C++, after failing to achieve this with OpenCV.
Everything seems to be working fine with my code: the kernel is found and built, and the method call is executed. However, I am getting an error:
An unhandled exception of type 'ManagedCuda.CudaException' occurred in ManagedCuda.dll
Additional information: ErrorIllegalAddress: While executing a kernel, the device
encountered a load or store instruction on an invalid memory address.
The context cannot be used, so it must be destroyed (and a new one should be created).
I understand that C# is complaining that a valid address is not found when it attempts to pass the device pointer to the kernel. The only difference I can tell between my code and the cited tutorial is that ManagedCuda seems to have recently had a facelift that allows users to use lambdas. I've done some reading and haven't found anything to clarify whether this is what's causing my problem:
static Func<int, int, int> cudaAdd = (a, b) =>
{
    // init output parameters
    CudaDeviceVariable<int> result_dev = 0;
    int result_host = 0;
    // run CUDA method
    addWithCuda.Run(a, b, result_dev.DevicePointer); // <-- code throws the error here
    // copy return to host
    result_dev.CopyToHost(ref result_host);
    return result_host;
};
In the original tutorial code the OP uses CudaDeviceVariable result_dev = 0;. Could this be the problem? I don't see why it would be, but maybe my cast is wrong?
For clarity, here is the kernel being called:
__global__ void kernel(int a, int b, int *c)
{
    *c = (a + b) * (a + b);
}
TL;DR: The exception is related to the 32/64-bit settings in your C# project. Either set the platform target to x86, or, if you have it on Any CPU, make sure to tick Prefer 32-bit.
How I found out:
Made a solution in .NET 4.5 according to https://algoslaves.wordpress.com/2013/08/25/nvidia-cuda-hello-world-in-managed-c-and-f-with-use-of-managedcuda/ (same as OP)
used NuGet to add ManagedCuda 6.5 - Standalone
Worked fine
Backup point A.
Changed .NET version to 4.0, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Additional information: ErrorIllegalAddress: While executing a kernel, the device encountered a load or store instruction on an invalid memory address.
Switched .NET version to 4.5, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Same as above.
Clean solution, Rebuild solution, Build solution:
Exception thrown: Same as above.
Manually deleted every single file/folder generated by Visual Studio:
Exception thrown: Same as above.
Reopen project backed up from point A:
Worked fine.
Manually deleted every single file/folder generated by Visual Studio in both the working and the not-working project.
Visually compared all files.
Every file is the same, except that the not-working version has an extra <Prefer32Bit>false</Prefer32Bit> in MangedCudaTest.csproj.
Deleted the lines with <Prefer32Bit>false</Prefer32Bit>.
The not-working version finally worked fine.
Made the same changes in my main project; it finally worked fine.
From the comments it seems that the thrown exception is due to running on different threads. In a single-threaded environment the sample code runs fine and returns the right result. To use CUDA in a multithreaded application, proper synchronization of the threads and binding the CUDA context to the currently active thread are necessary.
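As a rough sketch in terms of the CUDA driver API, which ManagedCuda wraps (the worker function here is invented for illustration), a context created on one thread has to be made current on whichever thread actually issues the CUDA calls:
#include <cuda.h>

void workerThread(CUcontext ctx)
{
    /* bind the context created on another thread to this one
       before making any CUDA calls from here */
    cuCtxSetCurrent(ctx);
    /* ... launch kernels, copy memory, etc. ... */
}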
I had the same issue; the solution for me was to set the .NET project to 64-bit, NOT 32-bit as suggested in aeroson's answer. I think this may be because I'm using CUDA SDK 7.5, and I read somewhere about 32-bit support being phased out.

In CUDA, how can we call a device function in another translation unit?

I'm pretty new to CUDA. I use Microsoft Visual Studio 2010, where I don't need to worry about writing a makefile. A problem arose when I tried to call, in one .cu file, a device function that was declared in a .h file and defined in another .cu file. At the end of the build, I received an error message:
1>ptxas : fatal error : Unresolved extern function '_Z22atomicAddEmulateDoublePdd'
This appears in both CUDA 4.2 and 5.0. I'm wondering how I should configure my Visual Studio to avoid this error. Sorry for the newbie question, and thanks for any suggestions!
CUDA 4.2 and earlier do not support static linking of device code, so device functions must be defined in the same compilation unit that uses them. A common technique is to write the device function in a .cuh file and include it in each .cu file that needs it.
CUDA 5.0 supports a new feature called separate compilation. The CUDA 5.0 VS msbuild rules should be available in the CUDA 5.0 RC download.
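For reference, a minimal sketch of what separate compilation looks like. The function body below is the usual CAS-based atomicAdd emulation for doubles, which is presumably what the mangled name _Z22atomicAddEmulateDoublePdd in the error refers to; treat the details as illustrative:
// atomicAddEmulate.cuh -- declaration shared by all translation units
__device__ double atomicAddEmulateDouble(double *addr, double val);

// atomicAddEmulate.cu -- definition in its own translation unit
#include "atomicAddEmulate.cuh"
__device__ double atomicAddEmulateDouble(double *addr, double val)
{
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// kernel.cu -- caller in a different translation unit
#include "atomicAddEmulate.cuh"
__global__ void accumulate(double *sum, const double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAddEmulateDouble(sum, x[i]);
}
This is built with relocatable device code and device linking, e.g. nvcc -arch=sm_20 -rdc=true atomicAddEmulate.cu kernel.cu -o app. In Visual Studio, the equivalent is enabling "Generate Relocatable Device Code" under CUDA C/C++ in the project properties.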

Memory access exception handling with MinGW on XP

I am trying to use the MinGW GCC toolchain on XP with some vendor code from an embedded project that accesses high memory (>0xFFFF0000), which is, I believe, beyond the virtual memory address space allowed for 'civilian' processes on XP.
I want to handle the memory access exceptions myself in some way that will permit execution to continue at the instruction following the exception, i.e. ignore it. Is there some way to do this with MinGW? Or with the MS toolchain?
The vastly simplified picture is thus:
/////////////
// MyFile.c
void MyFunc(void) {
    VendorFunc_A();
}

/////////////////
// VendorFile.c
void VendorFunc_A(void) {
    VendorFunc_DoSomeDesirableSideEffect();
    VendorFunc_B();
    VendorFunc_DoSomeMoreGoodStuff();
}

int VendorFunc_B(void) {
    int *pHW_Reg = (int *)0xFFFF0000;
    *pHW_Reg = 1; // Mem Access EXCEPTION HERE
    return 0;     // I want to continue here
}
More detail:
I am developing an embedded project on an Atmel AVR32 platform with FreeRTOS, using the avr32-gcc toolchain. It is desirable to develop/debug high-level application code independent of the hardware (and of the slow AVR32 simulator). Various gcc, makefile and macro tricks permit me to build my AVR32/FreeRTOS project in the MinGW/Win32 FreeRTOS port environment, and I can debug in Eclipse/GDB. But the high-mem HW access in the (vendor-supplied) AVR32 code crashes the MinGW exe (due to the memory access exception).
I am contemplating some combination of these approaches:
1) Manage the access exceptions in SW. Ideally I'd be creating a kind of HW simulator, but that'd be difficult and involve some gnarly assembly code, I think. A lot of the exceptions can likely just be ignored.
2) Create a modified copy of the AVR32 header files so as to relocate the HW register #defines into user-process address space (and create some structs and linker sections that commit those areas of virtual memory space).
3) Conditionally compile the function calls that result in high-mem/HW access, or alternatively use more macro tricks, so as to minimize code cruft in the 'real' HW target code. (There are other developers on this project.)
Any suggestions or helpful links would be appreciated.
This page is on the right track, but it seems overly complicated, and it is C++, which I'd like to avoid. But I may yet try it, absent other suggestions.
http://www.programmingunlimited.net/siteexec/content.cgi?page=mingw-seh
You need to figure out why the vendor code wants to write 1 to address 0xFFFF0000 in the first place, and then write a custom VendorFunc_B() that emulates this behavior. It is likely that 0xFFFF0000 is a hardware register that does something special when written to (e.g. change the baud rate on a serial port, or power up the laser, or ...). Once you know what happens when you write to this register on the target hardware, you can rewrite the vendor code to do something appropriate in the Windows build (e.g. write the string "Starting laser" to a log file). It is safe to assume that writing 1 to address 0xFFFF0000 on Windows XP will not be the right thing to do, and the Windows XP memory protection system detects this and terminates your program.
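For example, a host-build replacement along those lines might look like this (a sketch; the log message and the choice of behavior are invented):
#include <stdio.h>

int VendorFunc_B(void)
{
    /* on target hardware this write triggers the device;
       on Windows we just record that it would have happened */
    fprintf(stderr, "VendorFunc_B: would write 1 to HW reg 0xFFFF0000\n");
    return 0;
}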
I had a similar issue recently, and this is the solution I settled on:
Trap memory accesses inside a standard executable built with MinGW
First of all, you need to find a way to remap those address ranges (maybe some #undef/#define combos) to some usable memory. If you can't do this, maybe you can hook a seg-fault handler and handle the write yourself.
I also use this to "simulate" some specific HW behavior inside a single executable, for some already-written code. In my case, I found a way to redefine all the register-access macros early on; a sketch of that idea follows.
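For example (a sketch; the macro and array names are invented, and the real definitions live in the vendor's AVR32 headers):
#ifdef HOST_SIMULATION
/* host build: the "registers" are an ordinary array, so writes succeed */
static volatile unsigned int sim_hw_regs[0x4000];
#define HW_REG_BASE ((volatile unsigned int *)sim_hw_regs)
#else
/* target build: the real memory-mapped register block */
#define HW_REG_BASE ((volatile unsigned int *)0xFFFF0000)
#endif

#define HW_REG(idx) (*(HW_REG_BASE + (idx)))

/* vendor code like VendorFunc_B() then becomes a harmless store on the host:
   HW_REG(0) = 1; */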