How to detect current stream in kernel code? - cuda

Is it possible to check within the cuda kernel in which stream (https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) it is executing? In particular, I am interested in checking if I am running in default stream or not.
I am thinking that, potentially, this information could be extracted from the %envreg or %pt registers (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers), but I didn't find any documentation on this.

There is currently no method to retrieve the host-side stream handle in CUDA device code.
As suggested in the comments, you may be able to serve your needs by passing the needed information via kernel arguments.
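For example, a minimal sketch of that approach (the kernel and variable names here are hypothetical; the host simply tells the kernel which kind of stream it chose at launch time):

__global__ void myKernel(float *out, bool onDefaultStream)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (onDefaultStream) {
        // ... work that should only happen for default-stream launches ...
    }
    out[i] = (float)i;
}

// Host side: the launching code already knows which stream it uses,
// so it passes that knowledge down as an ordinary kernel argument.
myKernel<<<grid, block>>>(d_out, true);                  // default stream
myKernel<<<grid, block, 0, myStream>>>(d_out, false);    // non-default stream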

Related

Should I make cublas handle global and reuse them in different (host) functions?

I think this saves some configuration time, but I am not sure whether this will cause unexpected behaviours.
If you need to issue calls in any sort of thread concurrency scenario, it's recommended to use independent handles:
https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
The library is thread safe and its functions can be called from multiple host threads, as long as threads do not share the same cuBLAS handle simultaneously.
Also note that the device associated with a particular cuBLAS handle is expected to remain unchanged for the duration of the handle's use:
https://docs.nvidia.com/cuda/cublas/index.html#cublas-context
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls.
Otherwise, using a single handle should be fine amongst cublas calls belonging to the same device and host thread, even if shared amongst multiple streams.
An example of using a single "global" handle with multiple streamed CUBLAS calls (from the same host thread, on the same GPU device) is given in the CUDA batchCUBLAS sample code.
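The pattern boils down to something like this rough sketch (the array names, sizes and streams are placeholder assumptions; both streams belong to the same device and the same host thread):

cublasHandle_t handle;                 // one "global" handle, created once
cublasCreate(&handle);

cublasSetStream(handle, stream1);      // subsequent cuBLAS calls are issued into stream1
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, d_A1, n, d_B1, n, &beta, d_C1, n);

cublasSetStream(handle, stream2);      // the same handle now issues work into stream2
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, d_A2, n, d_B2, n, &beta, d_C2, n);

cublasDestroy(handle);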

NVRTC and __device__ functions

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I identified a specific __device__ function whose performances can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
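For concreteness, a minimal sketch of what such a runtime link session looks like with the driver API (the PTX file names and the kernel entry point are hypothetical; both pieces are assumed to be compiled with relocatable device code, and every symbol must be present before cuLinkComplete):

#include <cuda.h>

CUlinkState linkState;
cuLinkCreate(0, NULL, NULL, &linkState);

// Add every precompiled piece; exactly one instance of each symbol must be supplied here.
cuLinkAddFile(linkState, CU_JIT_INPUT_PTX, "main_kernel.ptx", 0, NULL, NULL);
cuLinkAddFile(linkState, CU_JIT_INPUT_PTX, "device_func_v2.ptx", 0, NULL, NULL);

// Finalise the link and load the resulting image into the current context.
void  *cubin;
size_t cubinSize;
cuLinkComplete(linkState, &cubin, &cubinSize);

CUmodule   module;
CUfunction kernel;
cuModuleLoadData(&module, cubin);
cuModuleGetFunction(&kernel, module, "mainKernel");

cuLinkDestroy(linkState);   // the cubin buffer is owned by the link state, so load before destroying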
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using nvrtc. You'd need to go about doing something like this:
Instead of compiling the main kernel ahead of time, store it as a string to be compiled and linked at runtime.
Let's say the main kernel calls "myFunc", which is a __device__ function chosen at runtime.
You can generate the appropriate "myFunc" source from your equations at run time.
Now you can create an NVRTC program from those sources with nvrtcCreateProgram.
That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.
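A rough sketch of those steps with the NVRTC and driver APIs (mainKernelSrc, myFuncSrc and the "mainKernel" entry name are hypothetical placeholders for your own code; error checking is omitted):

#include <nvrtc.h>
#include <cuda.h>
#include <string>
#include <vector>

// mainKernelSrc holds the stored source of the "main" kernel;
// myFuncSrc is the __device__ function generated from the equations at run time.
std::string fullSrc = std::string(myFuncSrc) + mainKernelSrc;

nvrtcProgram prog;
nvrtcCreateProgram(&prog, fullSrc.c_str(), "runtime_kernels.cu", 0, NULL, NULL);

const char *opts[] = { "--gpu-architecture=compute_50" };
nvrtcCompileProgram(prog, 1, opts);          // check the compilation log on failure in real code

size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
std::vector<char> ptx(ptxSize);
nvrtcGetPTX(prog, ptx.data());
nvrtcDestroyProgram(&prog);

// Load the PTX through the driver API and fetch the main kernel for launching.
CUmodule   module;
CUfunction mainKernel;
cuModuleLoadDataEx(&module, ptx.data(), 0, NULL, NULL);
cuModuleGetFunction(&mainKernel, module, "mainKernel");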
There is one problem I foresee: NVRTC may not find the curand device calls, which may cause some issues. One workaround would be to look at the header the device function call is in and use nvcc to compile the appropriate device function to PTX. You can store the resulting PTX as text and use cuLinkAddData to link it with your module. You can find more information in this section.

Using CuRand in ManagedCuda

I'm currently working with ManagedCuda, and want to generate random numbers on the device. However I can't seem to find a simple example how to do this (browsing through objects in the ManagedCuda.CudaRand namespace and comparing with the C++ equivalent doesn't get me any further).
Actual question: How can I generate random numbers in a kernel when using managedCuda instead of the regular C++ API?
Since it seems you only want to use the device-side API of CURAND, you are entirely independent of ManagedCuda's CURAND wrapper: all you need to do in ManagedCuda is allocate a chunk of device memory large enough to hold the curandState objects. You don't even need a reference to ManagedCuda's CudaRand.dll.
Then you create an init kernel that calls curand_init() for each thread, and in your actual kernel you use curand_normal() or any of the other rand functions. A step-by-step example is given in chapter 3.6 of the curand manual.
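A minimal sketch of those two kernels (the curandState buffer is the chunk of device memory you allocate from ManagedCuda; grid and block sizes are up to you):

#include <curand_kernel.h>

__global__ void setupStates(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Same seed for every thread, different sequence number, no offset.
    curand_init(seed, id, 0, &states[id]);
}

__global__ void generate(curandState *states, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState localState = states[id];   // copy the state to a register for speed
    out[id] = curand_normal(&localState);  // one normally distributed sample
    states[id] = localState;               // write the advanced state back
}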

How to uninitialise CUDA?

CUDA implicitly initialises when the first CUDA runtime function is called.
I'm timing the runtime of my code and repeating 100 times via a loop (for([100 times]) {[Time CUDA code and log]}), which also needs to take into account the initialisation time for CUDA at each iteration. Thus I need to uninitialise CUDA after every iteration - how to do this?
I've tried using cudaDeviceReset(), but seems not to have uninitialised CUDA.
Many thanks.
cudaDeviceReset is the canonical way to destroy a context in the runtime API (and calling cudaFree(0) is the canonical way to create a context). Those are the only levels of "re-initialization" available to a running process. There are other per-process events which happen when a process loads the CUDA driver and runtime libraries and connects to the kernel driver, but there is no way I am aware of to make those happen programmatically short of forking a new process.
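So the closest you can get to "re-initialising" per iteration is a create/destroy cycle, as in this minimal sketch (the timed work is a placeholder):

#include <cuda_runtime.h>

int main()
{
    for (int i = 0; i < 100; ++i) {
        cudaFree(0);            // forces context creation (the implicit initialisation)
        // ... launch and time the CUDA code under test here ...
        cudaDeviceReset();      // destroys the context before the next iteration
    }
    return 0;
}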
But I really doubt you want or should be needing to account for this sort of setup time when calculating performance metrics anyway.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated through device memory, but how do I synchronize them?
The CUDA Programming Model only supports communication between threads in the same thread block (see the CUDA C Programming Guide, end of section 2.2 Thread Hierarchy). What you describe cannot be reliably implemented through the current CUDA API. If you try, you may find partial success; however, it will fail on different OSes, on different executions of your application, and it will be broken by future driver updates and new hardware (GK110 supports an enhanced concurrency model).
If I understood your question correctly, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As far as I know, there are no reliable facilities for inter-kernel synchronization provided by CUDA, and I'm unaware of any suitable trick that could be applied here.
The CUDA C Programming Guide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. Off the top of my head, calling cudaStreamSynchronize for every stream in turn should do the trick, but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must send this data back to the second kernel
Your code would look something like this:
float *dataToExchange_d;
cudaMalloc((void **)&dataToExchange_d, N * sizeof(float));  // N = number of elements to exchange
kernel1<<<M1, N1, 0, stream1>>>(dataToExchange_d);
cudaStreamSynchronize(stream1);
kernel2<<<M2, N2, 0, stream2>>>(dataToExchange_d);
But note that stream synchronization slows down the process, so you should avoid it as much as possible.
You can also get stream synchronization through CUDA events; it's less obvious and doesn't give a particular advantage here, but it's useful to know about ;-)
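For reference, a sketch of the event-based variant (reusing the hypothetical names from the snippet above); cudaStreamWaitEvent makes stream2 wait for kernel1 on the device, without blocking the host:

cudaEvent_t kernel1Done;
cudaEventCreateWithFlags(&kernel1Done, cudaEventDisableTiming);

kernel1<<<M1, N1, 0, stream1>>>(dataToExchange_d);
cudaEventRecord(kernel1Done, stream1);        // mark the point where kernel1 finishes
cudaStreamWaitEvent(stream2, kernel1Done, 0); // stream2 waits for that point before proceeding
kernel2<<<M2, N2, 0, stream2>>>(dataToExchange_d);

cudaEventDestroy(kernel1Done);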