Do failed CUDA memory allocations require a cudaDeviceReset call? [duplicate]

The CUDA documentation is not clear on how memory contents change after a CUDA application throws an exception.
For example, if a (dynamic) kernel launch encounters an exception (e.g. a warp out-of-range address), the current kernel launch is stopped. After this point, is data on the device (e.g. __device__ variables) still kept, or is it discarded along with the exception?
A concrete example would be like this:
1. The CPU launches a kernel.
2. The kernel updates the value of __device__ variable variableA to 5 and then crashes.
3. The CPU memcpys the value of variableA from device to host. What value does the CPU get in this case, 5 or something else?
Can someone explain the rationale behind this?

The behavior is undefined in the event of a CUDA error which corrupts the CUDA context.
This type of error is evident because it is "sticky", meaning once it occurs, every single CUDA API call will return that error, until the context is destroyed.
Non-sticky errors are cleared automatically after they are returned by a CUDA API call (with the exception of cudaPeekAtLastError, which does not clear the error it reports). Any "crashed kernel" type error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error on the result of the cudaMemcpy call to transfer variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable -- it is as if the cudaMemcpy operation also failed in some unspecified way.
Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations, or in general the state of the machine after such an error.
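To make the question's scenario concrete, here is a minimal sketch of my own (not from the original post; the deliberate null-pointer store is just one way to provoke a context-corrupting fault):

#include <cstdio>
#include <cuda_runtime.h>

__device__ int variableA;
__device__ int *bad_ptr;        // never assigned, so it is null at run time

__global__ void crasher() {
    variableA = 5;              // step 2: the update happens...
    *bad_ptr = 0;               // ...then the kernel faults with an invalid address
}

int main() {
    crasher<<<1, 1>>>();        // steps 1 and 2

    // step 3: the copy-back returns the sticky error, so host_a is unreliable
    int host_a = 0;
    cudaError_t err = cudaMemcpyFromSymbol(&host_a, variableA, sizeof(int));
    printf("err = %s, host_a = %d (undefined)\n", cudaGetErrorString(err), host_a);
    return 0;
}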
An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) CUDA API calls can complete successfully, without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the CUDA context is exactly the same as if the invalid operation had never been requested.
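For instance, a sketch of my own illustrating the non-sticky case (the oversized request is just one way to provoke an out-of-memory error):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int *p = nullptr;

    // Non-sticky: an absurdly large request fails with cudaErrorMemoryAllocation...
    cudaError_t err = cudaMalloc(&p, (size_t)1 << 50);
    printf("big malloc:   %s\n", cudaGetErrorString(err));

    // ...but the error is cleared on return, so a valid request still succeeds.
    err = cudaMalloc(&p, 1024);
    printf("small malloc: %s\n", cudaGetErrorString(err));   // cudaSuccess

    cudaFree(p);
    return 0;
}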
This distinction between sticky and non-sticky errors is called out in many of the documented error code descriptions, for example:
Synchronous, non-sticky, non-CUDA-context-corrupting:
cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
Asynchronous, sticky, CUDA-context-corrupting:
cudaErrorMisalignedAddress = 74
The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.

Related

CUDA error checking on cudaDeviceSynchronize after kernel launch may not catch every error?

I recently found a comment on the accepted answer by talonmies stating the following:
Note that, unlike all other CUDA errors, kernel launch errors will not be reported by subsequent synchronizing calls to the CUDA runtime API. Just putting gpuErrchk() around the next cudaMemcpy() or cudaDeviceSynchronize() call is thus insufficient to catch all possible error conditions. I'd argue it is better style to call cudaGetLastError() instead of cudaPeekAtLastError() immediately after a kernel launch, even though they have the same effect, to aid the unwitting reader.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?
I always do error checking around API calls, and after a kernel launch I call cudaDeviceSynchronize (since my kernels take far longer than the data transfers, the performance loss is insignificant), and I thought I was safe this way. In what scenarios could this approach fail?
A description of asynchronous (sticky) vs. synchronous (non-sticky) errors is covered here. Furthermore, I cover this exact topic in some detail in unit 12 of this online training series.
Please familiarize yourself with those; I'm not going to give a full recital of all the related ideas here.
My question is, how is it possible that cudaGetLastError may catch an error that would not be detected by cudaDeviceSynchronize? Shouldn't any error that hasn't been cleared be returned by cudaDeviceSynchronize?
No.
Most CUDA runtime API calls, including such examples as cudaMemcpy(), cudaMalloc(), cudaStreamCreate(), cudaDeviceSynchronize(), and many others, will return errors that fit the following descriptions:
1. Any previously occurring asynchronous error. Such errors occur during the execution of device code; they corrupt the CUDA context and cannot be cleared except by destruction of the underlying CUDA context.
2. Any synchronous error that occurs as a result of the runtime call itself, only.
That means if I call cudaMemcpy(), it will report any async errors per item 1, and any synchronous errors that occur as a result of the cudaMemcpy() call, not any other. Likewise for cudaDeviceSynchronize().
So what is missed?
A synchronous error as a result of a kernel call. For example:
mykernel<<<1,1025>>>(...);
We immediately know that such a launch cannot proceed, because 1025 threads per block is illegal in CUDA (currently). An error of that type does not occur as a result of device code execution, but rather as a result of inspection of the kernel launch request. It is a synchronous error, not an asynchronous one.
If you do this:
__global__ void mykernel(){}

int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaDeviceSynchronize();
}
the err variable will contain cudaSuccess (more accurately, the enum token that corresponds to cudaSuccess; likewise for all other such references in this answer). On the other hand, if you do this:
int main(){
    mykernel<<<1,1025>>>();
    cudaError_t err = cudaGetLastError();
    cudaDeviceSynchronize();
}
the err variable will contain something like cudaErrorInvalidConfiguration (enum token 9, see here).
You might wish to try this experiment yourself.
Anyway, this answer has been carefully crafted by an expert. I personally wouldn't discount any part of it.
Yes, the behavior of cudaGetLastError() (and cudaPeekAtLastError()) differs from the error reporting of most other CUDA runtime API calls, described in items 1 and 2 above. They will (among other things) report a synchronous error from a previously occurring runtime API call (or kernel launch) that has not yet been cleared.
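A minimal sketch of the error-checking pattern this implies (my own code, not from the answer above): check cudaGetLastError() immediately after the launch to catch synchronous launch errors, and check the synchronizing call separately to catch asynchronous execution errors.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void mykernel() {}

int main() {
    mykernel<<<1, 1025>>>();    // illegal configuration: more than 1024 threads per block

    // Synchronous launch error: only cudaGetLastError()/cudaPeekAtLastError() report it.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(launchErr));

    // Asynchronous (device-execution) errors surface here instead.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess)
        printf("async error: %s\n", cudaGetErrorString(syncErr));

    return 0;
}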

In CUDA programming how to understand if the GPU kernel has completed its task?

In CUDA programming, suppose I am calling a kernel function from the host. Suppose the kernel function is:
__global__ void my_kernel_func(){
    // doing some tasks utilizing multiple threads
}
Now from the host I am calling it using,
my_kernel_func<<<grid,block>>>();
In the NVIDIA examples, they call three more functions afterwards:
cudaGetLastError()
CUDA Doc : Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess.
cudaMemcpy()
CUDA Doc : Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
and then
cudaDeviceSynchronize()
CUDA Doc : Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
Now I have tried to put a print statement at the end of the kernel function:
__global__ void my_kernel_func(){
    // doing some tasks utilizing multiple threads
    printf("D\n");
}
As well as printing at different locations in the sequential flow:
cudaGetLastError();
printf("A\n");
cudaMemcpy(...);
printf("B\n");
cudaDeviceSynchronize();
printf("C\n");
This prints in the following order:
A
D
B
C
Basically, I need the time at which the kernel completes its task, but I am unsure where to take the ending timestamp. Considerable time is taken to copy the data back, so if I put the ending timestamp after that, it will incorporate the copy time as well.
Is there any other function available to catch the ending?
As pointed out in the documentation, cudaMemcpy() exhibits synchronous behavior, so the cudaDeviceSynchronize() is effectively a no-op: the synchronization has already happened at the memcpy.
The cudaGetLastError() call checks whether the kernel launch itself was valid.
If you want to time the kernel and not the memcpys, switch the order of the cudaMemcpy()/cudaDeviceSynchronize() calls: start the timer just before the kernel call, then read the timer immediately after the cudaDeviceSynchronize() call. Make sure to test the result of the cudaDeviceSynchronize() call as well.
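A sketch of that ordering (my own code; the kernel body, sizes, and launch configuration are placeholders), using CUDA events rather than a host timer:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel_func(float *d_data) { /* doing some tasks */ }

int main() {
    float h_data[256];
    float *d_data;
    cudaMalloc(&d_data, sizeof(h_data));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                     // enqueued just before the kernel
    my_kernel_func<<<1, 256>>>(d_data);
    cudaEventRecord(stop);                      // enqueued right after the kernel

    cudaError_t err = cudaDeviceSynchronize();  // wait for the kernel; test the result
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);     // kernel time only
    printf("kernel took %.3f ms\n", ms);

    // The copy back happens after the timing, so it is not included.
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}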

Would the CPU continue executing the next line of code while the GPU kernel is running? Would it cause an error?

I am reading the book "CUDA by Example" by Sanders, where the author mentions (p. 441): "For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes."
I am wondering if this statement is correct. For example, what if the next instruction the CPU executes depends on variables that the GPU kernel outputs? Would it cause an error? From my experience, it does not cause an error. So what does the author really mean?
Many thanks!
Yes, the author is correct. Suppose my kernel launch looks like this:
int *h_in_data, *d_in_data, *h_out_data, *d_out_data;
// code to allocate host and device pointers, and initialize host data
...
// copy host data to device
cudaMemcpy(d_in_data, h_in_data, size_of_data, cudaMemcpyHostToDevice);
mykernel<<<grid, block>>>(d_in_data, d_out_data);
// some other host code happens here
// at this point, h_out_data does not point to valid data
...
cudaMemcpy(h_out_data, d_out_data, size_of_data, cudaMemcpyDeviceToHost);
//h_out_data now points to valid data
Immediately after the kernel launch, the CPU continues executing host code. But the data generated by the device (either d_out_data or h_out_data) is not ready yet. If the host code attempts to use whatever is pointed to by h_out_data, it will just be garbage data. This data only becomes valid after the 2nd cudaMemcpy operation.
Note that using the data (h_out_data) before the 2nd cudaMemcpy will not generate an error, if by that you mean a segmentation fault or some other run time error. But any results generated will not be correct.
Kernel launches in CUDA are asynchronous by default, i.e., control returns to the CPU immediately after the launch. If the next instruction is another kernel launch issued to the same stream, you don't need to worry: it will execute only after the previously launched kernel has finished.
However, if the next instruction is a CPU instruction that accesses the results of the kernel, it may read garbage values. Therefore, due care must be taken, and device synchronization should be done as and when needed, as in the sketch below.
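A minimal sketch of that synchronization (my own code; the kernel and sizes are placeholders):

#include <cuda_runtime.h>

__global__ void mykernel(int *out) { out[0] = 42; }

int main() {
    int *d_out;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));

    mykernel<<<1, 1>>>(d_out);      // asynchronous: control returns to the CPU immediately

    // The host must not consume the result yet; block until the kernel finishes:
    cudaDeviceSynchronize();

    // A cudaMemcpy in the default stream would also synchronize implicitly:
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    return 0;
}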

"invalid device ordinal" error on kernel launch

I'm trying to use multiple CUDA devices from multiple OpenMP threads. The devices are initialized (i.e. memory is allocated on them) from the main thread, and then I use cudaSetDevice from different threads to launch kernels on different devices. Threads are not sharing devices; each thread has exclusive access to its device.
From what I understand, this should work fine. However, as soon as I launch a kernel on a device from an OpenMP thread that is not the main one (i.e. omp_get_thread_num() != 0), I get an "invalid device ordinal" error from CUDA:
kernel<<<...>>>(...);
error = cudaDeviceSynchronize(); // returns cudaSuccess
error = cudaGetLastError(); // returns invalid device ordinal error
Am I missing something? Has anyone seen something like this before? I'm using CUDA 5.0.
Just to close this issue: the problem was a result of my using cudaGetLastError to check for errors after a kernel launch, while not checking the error return value of a previous call. The cudaGetLastError call after the kernel launch was therefore returning the error code from a call to cudaGetDeviceInfo, which I mistakenly inferred to be coming from the launch itself. If you see this error, I would advise making sure that you check the error values returned by all previous calls to the CUDA API, for example with a macro like the one sketched below.
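One common way to enforce that discipline is to wrap every runtime API call in a checking macro, as in this sketch (the CHECK macro is my own, not from the original post):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Fail loudly at the first call whose error code would otherwise be dropped.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t e = (call);                                      \
        if (e != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(e), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

__global__ void kernel() {}

int main() {
    CHECK(cudaSetDevice(0));
    kernel<<<1, 1>>>();
    CHECK(cudaGetLastError());        // a launch error is attributed to the launch
    CHECK(cudaDeviceSynchronize());   // an execution error is attributed here
    return 0;
}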