Crashing a kernel gracefully - cuda

A follow-up to: CUDA: Stop all other threads
I'm looking for a way to exit a kernel if a "bad condition" occurs.
The programming manual says NVCC does not support exception handling. I'm wondering if there is a user-defined CUDA error code; in other words, if "bad" happens, terminate with this user error code. I doubt there is one, so my other idea would be to cause one.
Something like: if "bad" happens, divide by zero. But I'm unsure whether one thread doing a divide-by-zero is enough to crash the whole kernel, or just that thread.
Is there a better approach to terminating a kernel?

You should first read this question and the answers by harrism and tera (asked/answered yesterday).
You may be tempted to use something like
if (there_is_an_error) {
    *status = MY_ERROR_CODE; // store error code to a device pointer
    __threadfence();         // ensure the store is issued before the trap
    asm("trap;");            // kill the kernel with an error
}
This does not exactly satisfy your condition of "graceful", in my opinion. Trap causes the kernel to exit and the runtime to report cudaErrorUnknown. But since kernel execution is asynchronous, you will need to synchronize your stream/device in order to catch this error. That means synchronizing after every kernel call, unless you are OK with imprecise errors (i.e. you may not catch the error code until some later CUDA API call).
But this is just the way kernel error handling is in CUDA, and well-written codes should be synchronizing in debug builds to check kernel errors, and settling for imprecise error messages in release builds. Unfortunately, I don't think there is a more graceful way than that.
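For illustration, here is a minimal sketch of that debug-build pattern, assuming the status-pointer idea from the snippet above (the kernel body, the stand-in error value, and the thread choice are mine):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(int *status) {
    if (threadIdx.x == 13) {   // stand-in for your "bad condition"
        *status = 42;          // hypothetical MY_ERROR_CODE
        __threadfence();       // make the store visible before trapping
        asm("trap;");          // abort the kernel
    }
}

int main() {
    int *d_status;
    cudaMalloc(&d_status, sizeof(int));
    mykernel<<<1, 32>>>(d_status);
    // Synchronize so the trap is reported here, not at some later API call.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    // Caution: after trap the context is corrupted, so reading *d_status
    // back with cudaMemcpy may itself fail (see the answers further down).
    return 0;
}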
edit: on compute capability 2.0 and later you can use assert() to exit with an error in debug builds. It's unclear whether this is what you want, though.

The assertion may help you. You can find it in section B.15 of the CUDA C Programming Guide.
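As a rough sketch of device-side assert() (my own example with made-up data; it requires compute capability 2.0+ and is compiled out if NDEBUG is defined):
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void checked_kernel(const int *data) {
    // Fires per-thread; a failed assert aborts the kernel,
    // and the host then sees an assert error on synchronization.
    assert(data[threadIdx.x] >= 0);
}

int main() {
    int h[4] = {1, 2, -3, 4};   // the -3 trips the assertion
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    checked_kernel<<<1, 4>>>(d);
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err)); // expect an assert error
    return 0;
}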

Related

CUDA error checking on cudaDeviceSynchronize after kernel launch may not catch every error?

I recently found a comment on the accepted answer by talonmies stating the following:
Note that, unlike all other CUDA errors, kernel launch errors will not be reported by subsequent synchronizing calls to the CUDA runtime API. Just putting gpuErrchk() around the next cudaMemcpy() or cudaDeviceSynchronize() call is thus insufficient to catch all possible error conditions. I'd argue it is better style to call cudaGetLastError() instead of cudaPeekAtLastError() immediately after a kernel launch, even though they have the same effect, to aid the unwitting reader.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?
I always do error checking around API calls, and after a kernel launch I call cudaDeviceSynchronize (since my kernels take far longer than the data transfers, there is no significant performance loss), and I thought I was safe this way. In what scenarios could this approach fail?
Asynchronous (sticky) vs. synchronous (non-sticky) errors are described here. Furthermore, I cover this exact topic in some detail in unit 12 of this online training series.
Please familiarize yourself with those; I'm not going to repeat all of the related ideas here.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?
No.
Most CUDA runtime API calls, including such examples as cudaMemcpy(), cudaMalloc(), cudaStreamCreate(), cudaDeviceSynchronize() and many others, will return errors that fit the following descriptions:
1. Any previously occurring asynchronous error. Such errors occur during the execution of device code; they corrupt the CUDA context and cannot be cleared except by destruction of the underlying CUDA context.
2. Any synchronous error that occurs as a result of the runtime call itself, only.
That means if I call cudaMemcpy(), it will report any async errors per item 1, and any synchronous errors that occur as a result of the cudaMemcpy() call itself, but no others. Likewise for cudaDeviceSynchronize().
So what is missed?
A synchronous error as a result of a kernel call. For example:
mykernel<<<1,1025>>>(...);
We immediately know that such a launch cannot proceed, because 1025 threads per block is illegal in CUDA (currently). An error of that type is not occurring as a result of device code execution but rather as a result of inspection of the kernel launch request. It is a synchronous error, not asynchronous.
If you do this:
__global__ void mykernel(){}

int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaDeviceSynchronize();
}
the err variable will contain cudaSuccess (more accurately, the enum token that corresponds to cudaSuccess, and likewise for all other such references in this answer). On the other hand if you do this:
int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaGetLastError();
    cudaDeviceSynchronize();
}
the err variable will contain something like cudaErrorInvalidConfiguration (enum token 9, see here).
You might wish to try this experiment yourself.
Anyway, this answer has been carefully crafted by an expert. I personally wouldn't discount any part of it.
Yes, the behavior of cudaGetLastError() (and cudaPeekAtLastError()) differs from the error reporting of most other CUDA runtime API calls that I described in items 1 and 2 above. They will (among other things) report a synchronous error from a previously occurring runtime API call (or kernel launch) that has not yet been cleared.
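Putting the two together, here is a sketch of the pattern the quoted comment recommends, checking both the launch itself and the subsequent device activity:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel(){}

int main() {
    mykernel<<<1, 1025>>>();                         // illegal configuration
    cudaError_t launchErr = cudaGetLastError();      // catches the synchronous launch error
    cudaError_t asyncErr  = cudaDeviceSynchronize(); // catches asynchronous device-side errors
    printf("launch: %s\nasync: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(asyncErr));
    return 0;
}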

States of memory data after cuda exceptions

The CUDA documentation is not clear on how memory data changes after a CUDA application throws an exception.
For example, if a (dynamic) kernel launch encounters an exception (e.g. Warp Out-of-range Address), the current kernel launch will be stopped. After this point, will data (e.g. __device__ variables) on the device still be kept, or is it removed along with the exception?
A concrete example would be like this:
1. CPU launches a kernel.
2. The kernel updates the value of __device__ variableA to be 5 and then crashes.
3. CPU memcpys the value of variableA from device to host.
What is the value the CPU gets in this case, 5 or something else? Can someone explain the rationale behind this?
The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.
This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.
Non-sticky errors are cleared automatically after they are returned by a CUDA API call (with the exception of cudaPeekAtLastError()). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.
Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine, after such an error.
An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) CUDA API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the CUDA context is exactly the same as if the invalid operation had never been requested.
This distinction between sticky and non-sticky error is called out in many of the documented error code descriptions, for example:
synchronous, non-sticky, non-CUDA-context-corrupting:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
asynchronous, sticky, CUDA-context-corrupting:
cudaErrorMisalignedAddress = 74
The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
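To make the distinction concrete, here is a small sketch of my own (the oversize allocation should report cudaErrorMemoryAllocation and then clear; the crashed kernel should leave a sticky error, typically cudaErrorIllegalAddress, on every later call):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bad_store(int *p) { *p = 0; } // p will be nullptr: illegal address

int main() {
    void *p;
    // Non-sticky: the failed allocation returns an error, then clears.
    printf("oversize: %s\n", cudaGetErrorString(cudaMalloc(&p, ~(size_t)0)));
    printf("retry:    %s\n", cudaGetErrorString(cudaMalloc(&p, 4))); // succeeds

    // Sticky: after the kernel crashes, every subsequent call reports the error.
    bad_store<<<1, 1>>>(nullptr);
    cudaDeviceSynchronize();
    printf("after crash: %s\n", cudaGetErrorString(cudaMalloc(&p, 4)));
    return 0;
}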

OS development: How to avoid an infinite loop after an exception routine

For some months I've been working on a "home-made" operating system.
Currently, it boots and goes into 32-bit protected mode.
I've loaded the interrupt table, but haven't set up paging (yet).
Now while writing my exception routines I've noticed that when an instruction throws an exception, the exception routine is executed, but then the CPU jumps back to the instruction which threw the exception! This does not apply to every exception (for example, a div by zero exception will jump back to the instruction AFTER the division instruction), but let's consider the following general protection exception:
MOV EAX, 0x8
MOV CS, EAX
My routine is simple: it calls a function that displays a red error message.
The result: MOV CS, EAX fails -> My error message is displayed -> CPU jumps back to MOV CS -> infinite loop spamming the error message.
I've talked about this issue with a teacher in operating systems and unix security.
He told me he knows Linux has a way around it, but he doesn't know which one.
The naive solution would be to parse the throwing instruction from within the routine, in order to get the length of that instruction.
That solution is pretty complex, and I feel a bit uncomfortable adding a call to a relatively heavy function in every affected exception routine...
Therefore, I was wondering if there is another way around the problem. Maybe there's a "magic" register that contains a bit that can change this behaviour?
--
Thank you very much in advance for any suggestion/information.
--
EDIT: It seems many people wonder why I want to skip over the problematic instruction and resume normal execution.
I have two reasons for this:
First of all, killing a process would be a possible solution, but not a clean one. That's not how it's done in Linux, for example, where (AFAIK) the kernel sends a signal (I think SIGSEGV) but does not immediately break execution. It makes sense, since the application can block or ignore the signal and resume its own execution. It's a very elegant way to tell the application it did something wrong IMO.
Another reason: what if the kernel itself performs an illegal operation? Could be due to a bug, but could also be due to a kernel extension. As I've stated in a comment: what should I do in that case? Shall I just kill the kernel and display a nice blue screen with a smiley?
That's why I would like to be able to jump over the instruction. "Guessing" the instruction size is obviously not an option, and parsing the instruction seems fairly complex (not that I mind implementing such a routine, but I need to be sure there is no better way).
Different exceptions have different causes. Some exceptions are normal, and the exception only tells the kernel what it needs to do before allowing the software to continue running. Examples of this include a page fault telling the kernel it needs to load data from swap space, an undefined instruction exception telling the kernel it needs to emulate an instruction that the CPU doesn't support, or a debug/breakpoint exception telling the kernel it needs to notify a debugger. For these it's normal for the kernel to fix things up and silently continue.
Some exceptions indicate abnormal conditions (e.g. that the software crashed). The only sane way of handling these types of exceptions is to stop running the software. You may save information (e.g. core dump) or display information (e.g. "blue screen of death") to help with debugging, but in the end the software stops (either the process is terminated, or the kernel goes into a "do nothing until user resets computer" state).
Ignoring abnormal conditions just makes it harder for people to figure out what went wrong. For example, imagine instructions to go to the toilet:
enter bathroom
remove pants
sit
start generating output
Now imagine that step 2 fails because you're wearing shorts (a "can't find pants" exception). Do you want to stop at that point (with a nice easy to understand error message or something), or ignore that step and attempt to figure out what went wrong later on, after all the useful diagnostic information has gone?
If I understand correctly, you want to skip the instruction that caused the exception (e.g. mov cs, eax) and continue executing the program at the next instruction.
Why would you want to do this? Normally, shouldn't the rest of the program depend on the effects of that instruction being successfully executed?
Generally speaking, there are three approaches to exception handling:
1. Treat the exception as an unrepairable condition and kill the process. For example, division by zero is usually handled this way.
2. Repair the environment and then execute the instruction again. For example, page faults are sometimes handled this way.
3. Emulate the instruction using software and skip over it in the instruction stream. For example, complicated arithmetic instructions are sometimes handled this way.
What you're seeing is the characteristic behavior of the General Protection Exception. The Intel System Programming Guide clearly states (6.15 Exception and Interrupt Reference / Interrupt 13 - General Protection Exception (#GP)):
Saved Instruction Pointer: The saved contents of the CS and EIP registers point to the instruction that generated the exception.
Therefore, you need to write an exception handler that will skip over that instruction (which would be kind of weird), or simply kill the offending process with "General Protection Exception at $SAVED_EIP" or a similar message.
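For the "kill the offending process" option, a hedged sketch (this only makes sense inside a kernel; the frame layout assumes a 32-bit assembly stub that passes the CPU-pushed error code and saved state, and kprintf/kill_current_process are hypothetical helpers):
#include <stdint.h>

extern "C" void kprintf(const char *fmt, ...); // hypothetical kernel printf
extern "C" void kill_current_process(void);    // hypothetical; never returns

struct gpf_frame {
    uint32_t error_code; // pushed by the CPU for #GP
    uint32_t eip;        // saved EIP: points AT the faulting instruction
    uint32_t cs;
    uint32_t eflags;
};

extern "C" void gpf_handler(struct gpf_frame *frame) {
    // Returning without adjusting frame->eip would re-run the faulting
    // instruction -- the infinite loop described in the question.
    kprintf("General Protection Exception at %x\n", frame->eip);
    kill_current_process();
}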
I can imagine a few situations in which one would want to respond to a GPF by parsing the failed instruction, emulating its operation, and then returning to the instruction after. The normal pattern would be to set things up so that the instruction, if retried, would succeed, but one might e.g. have some code that expects to access some hardware at addresses 0x000A0000-0x000AFFFF and wish to run it on a machine that lacks such hardware. In such a situation, one might not want to ever bank in "real" memory in that space, since every single access must be trapped and dealt with separately. I'm not sure whether there's any way to handle that without having to decode whatever instruction was trying to access that memory, although I do know that some virtual-PC programs seem to manage it pretty well.
Otherwise, I would suggest that you have, for each thread, a jump vector to be used when the system encounters a GPF. Normally that vector should point to a thread-exit routine, but code which is about to do something "suspicious" with pointers could set it to an error handler suitable for that code (and unset the vector when leaving the region where that handler would have been appropriate).
I can imagine situations where one might want to emulate an instruction without executing it, and cases where one might want to transfer control to an error-handler routine, but I can't imagine any where one would want to simply skip over an instruction that would have caused a GPF.

CUDA host to device (or device to host) memcpy operations with application rendering graphics using OpenGL on the same graphics card

I have posted my problem in the CUDA forums, but I'm not sure if it's appropriate to post a link here for more ideas, in case the two forums have significantly different audiences. The link is here. I apologize for any inconvenience and appreciate any comments on this question, as I haven't heard back yet on some specifics of a particular CUDA memory access and management problem. Thanks in advance.
I'm not sure if this is relevant without seeing more of your code, but where is CudaObj's destructor being called from?
you said:
However, if I do it this way, I run into errors exiting the application on the line of CudaObj's destructor where cudaFree() is called. Because of the error, the memory cleanup code after the CUDA context's cleanup code is not executed, leaving behind a mess of memory leaks.
That comes after your description of how you moved the CUDA setup code to the beginning of thread2's main function. If you're calling CudaObj's destructor from a different thread, then the cudaFree() cleanup will fail for the same reason you had to move the CUDA initialization into thread 2. It sounds like you know this already, but the CUDA context is specific to a single thread in your process. Cleaning up from a different thread is not supported according to the documentation, though I've never tried it myself.
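A minimal sketch of what that implies (the names are mine; the point is just that allocation and cudaFree() happen on the same thread that set up the context):
#include <cuda_runtime.h>

// Hypothetical thread-2 entry point: all CUDA work, from first allocation
// to final cudaFree(), stays on this one thread that owns the context.
void thread2_main() {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 1024 * sizeof(float)); // context is bound to this thread
    // ... launch kernels, copy data, render ...
    cudaFree(d_buf); // clean up here, before the thread exits --
                     // not from a destructor running on another thread
}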
Hope this helps