What thread runs the callback passed to cudaStreamAddCallback?

If I register a callback via cudaStreamAddCallback(), what thread is going to run it?
The CUDA documentation says that cudaStreamAddCallback
adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each cudaStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
but says nothing about how the callback itself is called.

Just to flesh out comments so that this question has an answer and will fall off the unanswered queue:
The short answer is that this is an internal implementation detail of the CUDA runtime and you don't need to worry about it.
The longer answer is that if you look carefully at the operation of the CUDA runtime, you will notice that context establishment on a device (be it explicit via the driver API, or implicit via the runtime API) spawns a small thread pool. It is these threads which are used to implement runtime features like stream command queues and callback operations. Again, this is an internal implementation detail which the programmer doesn't need to know about.
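To make this concrete, here is a minimal sketch (the kernel, launch configuration, and names are illustrative, not from the question) showing where the callback fires: on a runtime-managed worker thread, not on the host thread that registered it and not on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Callback signature required by cudaStreamAddCallback. It is invoked by a
// CUDA runtime worker thread, NOT by the host thread that registered it and
// NOT by any GPU thread. Making CUDA API calls from inside it is not allowed.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    printf("callback ran, status = %d, tag = %s\n",
           (int)status, (const char *)userData);
}

__global__ void work() { /* some device work */ }

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    work<<<1, 1, 0, stream>>>();
    cudaStreamAddCallback(stream, myCallback, (void *)"done", 0);

    cudaStreamSynchronize(stream);  // the callback has fired by the time this returns
    cudaStreamDestroy(stream);
    return 0;
}
```

In newer toolkits, cudaLaunchHostFunc is the preferred replacement for cudaStreamAddCallback; the same "runs on a runtime-internal thread" caveat applies.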

Related

Does importScripts pause worker thread execution? (Dedicated worker)

The Web API specs mention that importScripts loads scripts synchronously.
Here's the chromium code which implements this.
However, my understanding is that Web APIs run in a separate thread pool. Since importScripts is a Web API offering to begin with, it should run separately from the worker thread and not pause worker execution. So what does synchronously mean in this context?
(Answering my own question, because I have an insight)
There are two facts pointed out in the question:
importScripts is synchronous
importScripts does not run on the worker thread (i.e. it runs outside the V8 runtime)
My misconception was that these two facts are contradictory. They're not!
Being a Web API, it is implemented as native browser code, but that does not guarantee concurrency.
Another example is calling XMLHttpRequest.send() on a request that was opened with async = false.
Bottom line: running separately from the JS runtime thread does not guarantee asynchronous behaviour.

Are asynchronous Windows Runtime calls guaranteed to expire?

The Windows Runtime heavily uses asynchronous patterns, offloading long(-er) running tasks to the thread pool. I've read through all articles in Threading and async programming, but couldn't find an answer to my question:
Are all Windows Runtime asynchronous calls guaranteed to return at some point?
As @Paulo mentions in the comments, this depends entirely on how the code is written. It is easy to write your own async code that never returns, and it is trivial to deadlock your application using platform APIs by doing a .Wait() on the UI thread.
Fundamentally, an async operation is a function that returns an object (often called a "promise" or a "future") and then that object either sets an event or calls a callback function at some future point in time (this is the "logical" return value of the async operation).
Either part of this could fail -- the initial function might never get around to returning the promise object, or the promise might never get around to calling the callback / setting the event.

CUDA kernel launched after call to thrust is synchronous or asynchronous?

I am having some trouble with the results of my computations; for some reason they are not correct. I checked the code and it seems right (although I will check it again).
My question is whether custom CUDA kernels are synchronous or asynchronous when launched after a call to Thrust, e.g.
thrust::sort_by_key(args);
arrangeData<<<blocks,threads>>>(args);
will the kernel arrangeData run after thrust::sort_by_key has finished?
Assuming your code looks like that, and there is no usage of streams going on (neither the kernel call nor the thrust call indicates any stream usage as you have posted it), then both activities are issued to the default stream. I also assume (although it would not change my answer in this case) that the args passed to the thrust call are device arguments, not host arguments (e.g. device_vector, not host_vector).
All CUDA API and kernel calls issued to the default stream (or any given single stream) will be executed in order.
The arrangeData kernel will not begin until any kernels launched by the thrust::sort_by_key call are complete.
You can verify this using a profiler, e.g. nvvp.
Note that synchronous vs. asynchronous may be a bit confusing. When we talk about kernel launches being asynchronous, we are almost always referring to the host CPU activity, i.e. the kernel launch is asynchronous with respect to the host thread, which means it returns control to the host thread immediately, and its execution will occur at some unspecified time with respect to the host thread.
CUDA API calls and kernel calls issued to the same stream are always synchronous with respect to each other. A given kernel will not begin execution until all prior cuda activity issued to that stream (even things like cudaMemcpyAsync) has completed.
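Assuming the arguments are device data as described, the ordering can be sketched as follows (the kernel body and sizes are placeholders built from the names in the question, not a complete program):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

__global__ void arrangeData(int *keys, int *vals, int n)
{
    /* consumes the sorted data */
}

int main()
{
    int n = 1 << 20;
    thrust::device_vector<int> keys(n), vals(n);

    // Thrust issues its sorting kernels into the default stream...
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // ...so this kernel, also issued to the default stream, will not begin
    // until every kernel behind sort_by_key has completed. No explicit
    // cudaDeviceSynchronize() is needed for device-side ordering.
    arrangeData<<<256, 256>>>(thrust::raw_pointer_cast(keys.data()),
                              thrust::raw_pointer_cast(vals.data()), n);

    cudaDeviceSynchronize();  // only needed before the host inspects results
    return 0;
}
```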

Do cudaBindTextureToArray and cudaUnbindTexture break GPU-CPU concurrency?

I want my CPU and GPU computation to overlap; however, my GPU code contains some synchronous function calls like cudaBindTextureToArray() and cudaUnbindTexture(), for which no asynchronous counterparts exist. Will these calls break GPU-CPU concurrency?
In general, the functions that may be asynchronous are listed here:
- Kernel launches;
- Memory copies between two addresses to the same device memory;
- Memory copies from host to device of a memory block of 64 KB or less;
- Memory copies performed by functions that are suffixed with Async;
- Memory set function calls.
Asynchronous functions usually have an Async suffix, and they will usually accept a stream parameter.
Functions that don't meet the above description should be assumed to be synchronous. Specific exceptions (like cudaSetDevice()) are usually evident from their description.
In the context of a single-device system, synchronous functions (with the exception of specific stream synchronizing functions like cudaStreamSynchronize and cudaStreamWaitEvent) will:
1. Wait to begin until all prior cuda activity has completed (i.e. all previous cuda API calls and kernel calls have completed)
2. Execute their designated activity (e.g. cudaMemcpy() will begin the designated copy operation after step 1 is complete)
3. Release the calling (host) thread after step 2 is complete
Therefore the calling (host) thread is blocked from the moment the cudaMemcpy() call is made until all previous cuda activity is complete and the cudaMemcpy() call is complete. I think most people would say this may "break" GPU-CPU concurrency, because for the duration of the sequence described above (steps 1-3) the CPU thread is effectively doing nothing.
Whether or not it makes much difference in your application will depend on what is happening before and after the synchronous call in question.
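A sketch of the contrast, using cudaMemcpy versus cudaMemcpyAsync (buffer sizes and names are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 24;
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);  // pinned host memory, required for a
                                         // truly asynchronous host<->device copy
    cudaMalloc((void **)&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous: returns to the host thread immediately, so the CPU can
    // overlap its own work with the copy and any queued kernels.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    /* ... CPU work here overlaps with GPU activity ... */

    // Synchronous: blocks the host thread until all prior work in the context
    // has finished AND this copy completes (steps 1-3 described above), so
    // CPU-GPU overlap is lost for that interval.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    cudaStreamDestroy(stream);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```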

How to uninitialise CUDA?

CUDA implicitly initialises when the first CUDA runtime function is called.
I'm timing the runtime of my code, repeating 100 times via a loop (for([100 times]) {[Time CUDA code and log]}), and the timing also needs to take into account the initialisation time for CUDA at each iteration. Thus I need to uninitialise CUDA after every iteration - how can I do this?
I've tried using cudaDeviceReset(), but it does not seem to have uninitialised CUDA.
Many thanks.
cudaDeviceReset is the canonical way to destroy a context in the runtime API (and calling cudaFree(0) is the canonical way to create a context). Those are the only levels of "re-initialization" available to a running process. There are other per-process events which happen when a process loads the CUDA driver and runtime libraries and connects to the kernel driver, but there is no way I am aware of to make those happen programmatically, short of forking a new process.
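Under those constraints, a per-iteration timing loop might look like this sketch (the kernel is a placeholder; cudaFree(0) and cudaDeviceReset() bracket each iteration so the context creation cost lands inside the timed region):

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work() { /* the code being timed */ }

int main()
{
    for (int i = 0; i < 100; ++i) {
        auto t0 = std::chrono::steady_clock::now();

        cudaFree(0);               // forces context creation, so the
                                   // initialisation cost is included
        work<<<1, 256>>>();
        cudaDeviceSynchronize();

        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("iteration %d: %.3f ms\n", i, ms);

        cudaDeviceReset();         // destroy the context; the next iteration
                                   // pays the re-initialisation cost again
    }
    return 0;
}
```

Note that host-side wall-clock timing is used here rather than CUDA events, since event creation itself requires the context whose setup time is being measured.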
But I really doubt you want or should be needing to account for this sort of setup time when calculating performance metrics anyway.