CUDA device code doesn't throw an error when it segfaults

Having a CUDA kernel like this
__global__ void kernel(...)
{
    free((void*)48190);
}
is probably not a good idea.
Let's say that 48190 is not a valid address to deallocate during the current execution.
If we were on the host, the runtime would probably stop execution right away, throw a segfault error and give us some nasty description of what happened, like "A heap has been corrupted" or something.
But what if it did all that except for the message? What if, when it hits that point, it blows up and exits without telling me anything about what happened? That is what this code does. If I wrote the above kernel on my machine, it would compile and run, and if calling that kernel was everything my program did, it would happily exit with no error message. I only find out later, when I try to do a cudaMemcpy, that something went wrong, because it fails with error code 30: unknown error.
My question is: is this supposed to happen? Is there any way to enable some kind of error description when something goes wrong in a kernel call?

is there any way to enable some kind of message description when something goes wrong in a kernel call?
Yes, do cuda error checking. It's described here.
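As a rough sketch of what such error checking can look like: the gpuErrchk/gpuAssert helper below is an illustrative wrapper, not part of the CUDA runtime, and the kernel writes through a null pointer as a stand-in for the invalid free() in the question. The launch is checked twice, once for launch errors and once, after synchronizing, for errors raised while the kernel ran.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper (not a CUDA API): print the error and exit on failure.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        exit(EXIT_FAILURE);
    }
}

// Stand-in for the faulty kernel: writes through whatever pointer it gets.
__global__ void kernel(int *p)
{
    *p = 1;
}

int main()
{
    int *bad = nullptr;                  // deliberately invalid device pointer
    kernel<<<1, 1>>>(bad);
    gpuErrchk(cudaGetLastError());       // errors detected at launch time
    gpuErrchk(cudaDeviceSynchronize());  // errors raised while the kernel ran
    printf("kernel finished without error\n");
    return 0;
}
```

With both checks in place, a failing kernel should be reported at the synchronize call rather than at some later, unrelated cudaMemcpy. Running the program under a tool such as compute-sanitizer can give more detail about the faulting access.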

Related

CUDA error checking on cudaDeviceSynchronize after kernel launch may not catch every error?

I recently found a comment on @talonmies' accepted answer stating the following:
Note that, unlike all other CUDA errors, kernel launch errors will not be reported by subsequent synchronizing calls to the CUDA runtime API. Just putting gpuErrchk() around the next cudaMemcpy() or cudaDeviceSynchronize() call is thus insufficient to catch all possible error conditions. I'd argue it is better style to call cudaGetLastError() instead of cudaPeekAtLastError() immediately after a kernel launch even though they have the same effect, to aid the unwitting reader.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected in a cudaDeviceSynchronize? Shouldn’t any error that hasn’t been cleaned be returned by cudaDeviceSynchronize?
I always do error checking around API calls and after a kernel launch I call cudaDeviceSynchronize (since my kernels take way longer than the data transfer so I have no significant performance loss) and I thought I was safe this way. In what scenarios could this approach fail?
A description of asynchronous (sticky) vs. synchronous (non-sticky) errors is covered here. Furthermore, I cover this exact topic in some detail in this online training series unit 12.
Please familiarize yourself with those. I'm not going to give a full recital or repeat of all related ideas here.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected in a cudaDeviceSynchronize? Shouldn’t any error that hasn’t been cleaned be returned by cudaDeviceSynchronize?
No.
Most CUDA runtime API calls, including such examples as cudaMemcpy(), cudaMalloc(), cudaStreamCreate(), cudaDeviceSynchronize() and many others, will return errors that fit the following descriptions:
1. Any previously occurring asynchronous error. Such errors occur during the execution of device code, they corrupt the CUDA context, and they cannot be cleared except by destruction of the underlying CUDA context.
2. Any synchronous error that occurs as the result of the runtime call itself, only.
That means if I call cudaMemcpy(), it will report any async errors per item 1, and any synchronous errors that occur as a result of the cudaMemcpy() call, not any other. Likewise for cudaDeviceSynchronize().
So what is missed?
A synchronous error as a result of a kernel call. For example:
mykernel<<<1,1025>>>(...);
We immediately know that such a launch cannot proceed, because 1025 threads per block is illegal in CUDA (currently). An error of that type is not occurring as a result of device code execution but rather as a result of inspection of the kernel launch request. It is a synchronous error, not asynchronous.
If you do this:
__global__ void mykernel(){}
int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaDeviceSynchronize();
}
the err variable will contain cudaSuccess (more accurately, the enum token that corresponds to cudaSuccess, and likewise for all other such references in this answer). On the other hand if you do this:
int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaGetLastError();
    cudaDeviceSynchronize();
}
the err variable will contain something like cudaErrorInvalidConfiguration (enum token 9, see here).
You might wish to try this experiment yourself.
Anyway, this answer has been carefully crafted, by an expert. I personally wouldn't discount any part of it.
Yes, the behavior of cudaGetLastError() (and cudaPeekAtLastError()) differs from the error reporting of most other CUDA runtime API calls that I described in the section containing items 1 and 2 above. They will (among other things) report a synchronous error from a previously occurring runtime API call (or kernel launch) that has not yet been cleared.
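A minimal sketch of the experiment described above, for anyone who wants to run it: the program launches the same illegal 1025-thread configuration and then queries the error state three ways. The helper name show is made up for illustration; everything else is standard CUDA runtime API. Going by the answer, the synchronize call should report success while the first cudaGetLastError() reports the invalid configuration. The second cudaGetLastError() call is there on the assumption that reading the error also resets it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel() {}

// Hypothetical helper: print a label next to the error's name and description.
static void show(const char *label, cudaError_t err)
{
    printf("%-28s %s (%s)\n", label,
           cudaGetErrorName(err), cudaGetErrorString(err));
}

int main()
{
    mykernel<<<1, 1025>>>();  // illegal: more than 1024 threads per block

    // Reports asynchronous errors and its own synchronous errors only.
    show("cudaDeviceSynchronize():", cudaDeviceSynchronize());
    // Reports the pending synchronous launch error and resets it.
    show("first cudaGetLastError():", cudaGetLastError());
    // Queried again to see whether the previous call cleared the error.
    show("second cudaGetLastError():", cudaGetLastError());
    return 0;
}
```

Swapping the first cudaGetLastError() for cudaPeekAtLastError() is a quick way to see the difference between the two: the peek variant returns the error without clearing it.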

mret does not return to pc [duplicate]

In an assembly program, the .text section is loaded at 0x08048000; the .data and .bss sections come after that.
What would happen if I don't put an exit syscall in the .text section? Would it lead to the .data and .bss sections being interpreted as code, causing "unpredictable" behavior? When will the program terminate -- probably after every "instruction" is executed?
I can easily write a program without the exit syscall, but I don't know how to test whether .data and .bss get executed, because I guess I would have to know the real machine code that is generated under the hood to understand that.
I think this question is more about "How would OS and CPU handle such a scenario?" than assembly language, but it is still interesting to know for assembly programmers etc.
The processor does not know where your code ends. It faithfully executes one instruction after another until execution is redirected elsewhere (e.g. by a jump, call, interrupt, system call, or similar).
If your code ends without jumping elsewhere, the processor continues executing whatever is in memory after your code. It is fairly unpredictable what exactly happens, but eventually, your code typically crashes because it tries to execute an invalid instruction or tries to access memory that it is not allowed to access.
If neither happens and no jump occurs, eventually the processor tries to execute unmapped memory or memory that is marked as “not executable” as code, causing a segmentation violation. On Linux, this raises a SIGSEGV or SIGBUS. When unhandled, these terminate your process and optionally produce core dumps.
If you're curious, run under a debugger and look at disassembly of the faulting instruction.

States of memory data after cuda exceptions

The CUDA documentation is not clear on how data in memory changes after a CUDA application throws an exception.
For example, if a kernel launch (dynamic) encounters an exception (e.g. Warp Out-of-range Address), the current kernel launch will be stopped. After this point, will data on the device (e.g. __device__ variables) still be kept, or is it removed along with the exception?
A concrete example would be like this:
1. The CPU launches a kernel.
2. The kernel updates the value of __device__ variableA to be 5 and then crashes.
3. The CPU copies the value of variableA from device to host. What value does the CPU get in this case: 5 or something else?
Can someone show the rationale behind this?
The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.
This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.
Non-sticky errors are cleared automatically after they are returned by a cuda API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.
Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.
An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) cuda API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the cuda context is exactly the same as if the invalid operation had never been requested.
This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:
synchronous, non-sticky, non-cuda-context-corrupting:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
asynchronous, sticky, cuda-context-corrupting:
cudaErrorMisalignedAddress = 74
The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
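The sticky versus non-sticky behavior described above can be observed with a short program. This is only a sketch under some assumptions: the oversized allocation (2^60 bytes) is assumed to exceed device memory, the null-pointer write is an arbitrary way to crash a kernel, and report is a made-up helper. Because the state after a crash is undefined, the exact error names printed may differ between CUDA versions and GPUs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Arbitrary crashing kernel: stores through the pointer it is handed.
__global__ void crash(int *p)
{
    *p = 5;
}

// Hypothetical helper: print what each call returned.
static void report(const char *what, cudaError_t err)
{
    printf("%-36s -> %s\n", what, cudaGetErrorString(err));
}

int main()
{
    // Non-sticky: an allocation that is assumed to be far too large.
    void *huge = nullptr;
    report("cudaMalloc (2^60 bytes)", cudaMalloc(&huge, (size_t)1 << 60));

    // The context is still usable, so a small allocation should succeed.
    int *d_value = nullptr;
    report("cudaMalloc (4 bytes) afterwards",
           cudaMalloc(&d_value, sizeof(int)));

    // Sticky: crash a kernel by writing through a null device pointer.
    int *bad = nullptr;
    crash<<<1, 1>>>(bad);
    report("cudaDeviceSynchronize after crash", cudaDeviceSynchronize());

    // Later calls are expected to keep returning the same error.
    int h_value = 0;
    report("cudaMemcpy afterwards",
           cudaMemcpy(&h_value, d_value, sizeof(int),
                      cudaMemcpyDeviceToHost));
    void *more = nullptr;
    report("cudaMalloc afterwards", cudaMalloc(&more, 16));
    return 0;
}
```

The first two lines of output illustrate the out-of-memory case from the answer. The last three illustrate why the value read back in step 3 of the question cannot be trusted.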

Why does this PUSH instruction cause a UNDEFINED_INSTRUCTION exception at my ARM processor?

I am working with a Cortex-A9 and my program crashes because of an UNDEFINED_INSTRUCTION exception. The assembly line that causes this exception is according to my debugger's trace:
Trace #9999 : S:0x022D9A7C E92D4800 ARM PUSH {r11,lr}
Exception: UNDEFINED_INSTRUCTION (9)
I program in C and don't write assembly or binary and I am using gcc. Is this really the instruction that causes the exception, i.e. is the encoding of this PUSH instruction wrong and hence a compiler/assembler bug? Or is the encoding correct and something strange is going on? Scrolling back in the trace I found another PUSH instruction, that does not cause errors and looks like this:
Trace #9966 : S:0x022A65FC E52DB004 ARM PUSH {r11}
And of course there are a lot of other PUSH instructions too. But I did not find any other that tries to push specifically R11 and LR, so I can't compare.
I can't answer my own question, so I edit it:
Sorry guys, I don't exactly know what happened. I tried it several times and got the same error again and again. Then I turned the device off, went away and tried it again later, and now it works fine...
Maybe the memory was corrupted somehow due to overheating or something? I don't know. Thanks for your answers anyway.
I use gcc 4.7.2 btw.
I suspect something is corrupting the SP register. Load/store multiple (of which PUSH is one alias) to unaligned addresses are undefined in the architecture, so if SP gets overwritten with something that's not a multiple of 4, then a subsequent push/pop will throw an undef exception.
Now, if you're on ARM Linux, there is (usually) a kernel trap for unaligned accesses left over from the bad old days which if enabled will attempt to fix up most unaligned load/store multiple instructions (despite them being architecturally invalid). However if the address is invalid (as is likely in the case of SP being overwritten with nonsense) it will give up and leave the undef handler to do its thing.
In the (highly unlikely) case that the compiler has somehow generated bad code that is fix-uppable most of the time,
cat /proc/cpu/alignment
would show some non-zero fixup counts, but as I say, it's most likely corruption - a previous function has smashed the stack in such a way that an invalid SP is loaded on return, that then shows up at the next stack operation. Best double-check your pointer and array accesses.

Crashing a kernel gracefully

A follow up to: CUDA: Stop all other threads
I'm looking for a way to exit a kernel if a "bad condition" occurs.
The programming manual says NVCC does not support exception handling. I'm wondering if there is a user-defined CUDA error code. In other words, if "bad" happens, then terminate with this user error code. I doubt there is one, so my other idea would be to cause one.
Something like, if "bad" happens, divide by zero. But I'm unsure if one thread does a divide-by-zero, is that enough to crash the whole kernel, or just that thread?
Is there a better approach to terminating a kernel?
You should first read this question and the answers by harrism and tera (asked/answered yesterday).
You may be tempted to use something like
if (there_is_an_error) {
    *status = MY_ERROR_CODE; // store to device pointer
    __threadfence(); // ensure store issued before trap
    asm("trap;"); // kill kernel with error
}
This does not exactly satisfy your condition of "graceful", in my opinion. Trap causes the kernel to exit and the runtime to report cudaErrorUnknown. But since kernel execution is asynchronous, you will need to synchronize your stream / device in order to catch this error, which means synchronizing after every kernel call, unless you are OK with having imprecise errors (i.e. you may not catch the error code until after subsequent CUDA API calls).
But this is just the way kernel error handling is in CUDA, and well-written codes should be synchronizing in debug builds to check kernel errors, and settling for imprecise error messages in release builds. Unfortunately, I don't think there is a more graceful way than that.
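A fuller sketch of the trap idea is given below. It departs from the snippet above in one respect, which is an assumption of this example rather than something the answer prescribes: the status word lives in mapped (zero-copy) host memory and the kernel uses __threadfence_system(), so that the host can still read the code directly after the trap, when further CUDA API calls may fail. MY_ERROR_CODE, worker and bad_input are hypothetical names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MY_ERROR_CODE 42  // hypothetical user-defined code

// Hypothetical kernel: records a status code, then kills itself with trap.
__global__ void worker(volatile int *status, int bad_input)
{
    if (bad_input) {
        *status = MY_ERROR_CODE;  // store to host-mapped memory
        __threadfence_system();   // order the store with respect to the host
        asm("trap;");             // abort the kernel with an error
    }
}

int main()
{
    int *h_status = nullptr;  // host view of the status word
    int *d_status = nullptr;  // device view of the same memory

    cudaSetDeviceFlags(cudaDeviceMapHost);
    if (cudaHostAlloc(reinterpret_cast<void **>(&h_status), sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        fprintf(stderr, "could not allocate mapped host memory\n");
        return 1;
    }
    *h_status = 0;
    cudaHostGetDevicePointer(reinterpret_cast<void **>(&d_status),
                             h_status, 0);

    worker<<<1, 1>>>(d_status, 1);
    cudaError_t launch_err = cudaGetLastError();       // launch problems
    cudaError_t exec_err   = cudaDeviceSynchronize();  // the trap shows up here

    printf("launch: %s\n", cudaGetErrorString(launch_err));
    printf("exec:   %s\n", cudaGetErrorString(exec_err));
    printf("status word read on the host: %d\n", *h_status);
    return 0;
}
```

Since the trap leaves the context unusable, the status word is read through the host pointer rather than with cudaMemcpy.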
edit: on Compute capability 2.0 and later you can use assert() to exit with an error in debug builds. It was unclear if this is what you want though.
The assertion may help you. You can find it in section B.15 of the CUDA C Programming Guide.
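For illustration, a device-side assert might be used roughly as follows. The kernel name checked and the non-negativity condition are invented for this sketch. Note that device asserts, like host asserts, are compiled out when NDEBUG is defined.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every input element is expected to be non-negative.
__global__ void checked(const int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        assert(data[i] >= 0);  // a failed assertion aborts the kernel
    }
}

int main()
{
    const int n = 4;
    const int host[n] = {1, 2, -3, 4};  // the -3 violates the assumption
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    checked<<<1, n>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();  // assert failures surface here
    if (err == cudaErrorAssert) {
        printf("kernel stopped by a device-side assert\n");
    } else {
        printf("synchronize returned: %s\n", cudaGetErrorString(err));
    }
    return 0;
}
```

As with trap, the failure only becomes visible on the host at the next synchronizing call, so this does not remove the need to check errors after the launch.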