Accessing CUDA device memory while the kernel is running

I have allocated memory on the device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from the host before the kernel finishes its execution?

The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition: the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
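For illustration, a minimal sketch of that idiom; the produce kernel and the array size are made up, and the cudaSetDeviceFlags call is only needed on older, pre-UVA setups:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2 * i;                     // each element is written exactly once
}

int main()
{
    const int N = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned memory

    int *h_out;                             // host pointer to pinned, mapped memory
    cudaHostAlloc((void **)&h_out, N * sizeof(int), cudaHostAllocMapped);

    int *d_out;                             // device alias of the same allocation
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);

    produce<<<(N + 255) / 256, 256>>>(d_out, N);

    // Values land in host memory as the kernel writes them, but only after this
    // synchronization are *all* of them guaranteed to be visible.
    cudaDeviceSynchronize();
    printf("h_out[100] = %d\n", h_out[100]);

    cudaFreeHost(h_out);
    return 0;
}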

It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills location "0" and the cudaMemcpy from that location back to the host into stream 0, the kernel that fills location "1" and the cudaMemcpy from "1" into stream 1, and so on. What will happen then is that the GPU will overlap copying from "0" with executing "1".
Check the CUDA documentation; it's documented somewhere (in the Best Practices Guide, I think).
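Roughly, such a multi-stream pipeline could look like the sketch below; the fill kernel, the chunk size and the number of streams are placeholders, and the host buffer must be pinned (cudaMallocHost) for the async copies to overlap at all:

#include <cuda_runtime.h>

__global__ void fill(float *chunk, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = value;            // stand-in for the real kernel
}

int main()
{
    const int nStreams = 2, chunk = 1 << 20;
    float *d_buf, *h_buf;
    cudaMalloc((void **)&d_buf, nStreams * chunk * sizeof(float));
    cudaMallocHost((void **)&h_buf, nStreams * chunk * sizeof(float));   // pinned

    cudaStream_t stream[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&stream[s]);

    for (int s = 0; s < nStreams; ++s) {
        float *d = d_buf + s * chunk;
        float *h = h_buf + s * chunk;
        fill<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d, chunk, (float)s);
        // The copy of chunk s can overlap with the kernel running in the other stream.
        cudaMemcpyAsync(h, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}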

You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel has finished writing to it, then the answer depends on the compute capability of your device, but all except the very oldest chips can perform data transfers while a kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
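For completeness, a rough sketch of what that polling approach could look like; the producer kernel, chunk layout and ready flags are all invented for illustration, __threadfence_system() is needed so the flag is not seen before the data, and on Windows/WDDM this kind of host-side spinning can hang because launches are batched:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void producer(volatile int *data, volatile int *ready,
                         int nChunks, int chunkSize)
{
    // Toy single-block producer: fill one chunk at a time, then publish it.
    for (int c = 0; c < nChunks; ++c) {
        for (int i = threadIdx.x; i < chunkSize; i += blockDim.x)
            data[c * chunkSize + i] = c;
        __threadfence_system();             // push this thread's writes out to the host
        __syncthreads();                    // all threads have finished (and fenced) chunk c
        if (threadIdx.x == 0)
            ready[c] = 1;                   // publish: the host may now read chunk c
        __syncthreads();
    }
}

int main()
{
    const int nChunks = 8, chunkSize = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *h_data, *h_ready;
    cudaHostAlloc((void **)&h_data,  nChunks * chunkSize * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_ready, nChunks * sizeof(int),             cudaHostAllocMapped);
    for (int c = 0; c < nChunks; ++c) h_ready[c] = 0;

    int *d_data, *d_ready;
    cudaHostGetDevicePointer((void **)&d_data,  (void *)h_data,  0);
    cudaHostGetDevicePointer((void **)&d_ready, (void *)h_ready, 0);

    producer<<<1, 256>>>(d_data, d_ready, nChunks, chunkSize);

    for (int c = 0; c < nChunks; ++c) {
        while (h_ready[c] == 0) { }         // spin: the kernel is still working on chunk c
        printf("chunk %d is ready, first value = %d\n", c, h_data[c * chunkSize]);
    }
    cudaDeviceSynchronize();
    cudaFreeHost((void *)h_data);
    cudaFreeHost((void *)h_ready);
    return 0;
}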

When you launch the kernel, it is an asynchronous (non-blocking) call. Calling cudaMemcpy next will block until the kernel has finished.
If you want the result for debugging purposes, it may be possible for you to use a CUDA debugger such as cuda-gdb, where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the kernel code.
Or run only a thread block of size (1,1) if you are interested in that specific result.

Related

Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:
for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array
        Kernel 2 - sort array
        Kernel 3 - operate on array
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2), so it is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
First of all, consider that there is a kernel launch queue that can hold only so many pending launches. Once the launch queue is full, additional kernel launches of any kind are blocking: the host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue long before it has reached 10000 iterations. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (Eventually, however, once the remaining kernels fit in the queue because some have already completed, you would see the "Waiting..." message before all kernels have actually completed, if there were no other blocking behavior.)
Secondly, thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation under the hood. This cudaMalloc operation is blocking: when cudaMalloc is issued from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around the second issue, there seem to be at least two possible approaches:
1. Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (But see the discussion below.)
2. Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above), as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, though, this might be almost a "no-op". The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator would effectively have to do something like a cudaMalloc, which would throw a wrench in the works.
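As a sketch of approach 2, something along the following lines should avoid any per-iteration cudaMalloc; dKey, dValue and N are taken from the question, while sort_loop, dKeyOut and dValueOut are names I made up:

// Query cub's temp-storage requirement once, allocate it once, and reuse it
// inside the loop so no cudaMalloc happens per iteration.
#include <cuda_runtime.h>
#include <cub/cub.cuh>

void sort_loop(unsigned int *dKey, unsigned int *dValue, int N, int iterations)
{
    unsigned int *dKeyOut, *dValueOut;            // cub sorts out of place here
    cudaMalloc((void **)&dKeyOut,   N * sizeof(unsigned int));
    cudaMalloc((void **)&dValueOut, N * sizeof(unsigned int));

    // A first call with a null temp pointer only computes the required size.
    void  *dTemp     = NULL;
    size_t tempBytes = 0;
    cub::DeviceRadixSort::SortPairs(dTemp, tempBytes,
                                    dKey, dKeyOut, dValue, dValueOut, N);
    cudaMalloc(&dTemp, tempBytes);                // one allocation, up front

    for (int i = 0; i < iterations; ++i) {
        // ... Kernel 1 (update the array) ...
        cub::DeviceRadixSort::SortPairs(dTemp, tempBytes,
                                        dKey, dKeyOut, dValue, dValueOut, N);
        // ... Kernel 3 (operate on the sorted dKeyOut/dValueOut) ...
    }

    cudaFree(dTemp);
    cudaFree(dKeyOut);
    cudaFree(dValueOut);
}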

Share a variable across CPU and GPU in CUDA

I need to access a variable on both the CPU and the CUDA GPU. Currently, I am transferring that variable to the CPU after the kernel finishes, but it is turning out to be a bottleneck in my application. Is there any faster way to access a variable on the CPU after the GPU finishes execution? Can pinned memory help me here?
You are asking if you should use pinned memory, so I assume that you are not using it, which also means that you are not doing asynchronous memcpy, because that would require pinned memory.
So to answer your question: yes, you should use pinned memory together with streams and the async memory transfer functions to get the result as fast as possible.
Please see also http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution and http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory
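A minimal sketch of that combination might look like this; the compute kernel and the single float result are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(float *d_result)    // placeholder kernel
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *d_result = 42.0f;
}

int main()
{
    float *d_result, *h_result;
    cudaMalloc((void **)&d_result, sizeof(float));
    cudaMallocHost((void **)&h_result, sizeof(float));  // pinned, required for async copies

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    compute<<<1, 32, 0, stream>>>(d_result);
    cudaMemcpyAsync(h_result, d_result, sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    // ... unrelated CPU work can go here while the stream is busy ...
    cudaStreamSynchronize(stream);                      // wait only for this stream
    printf("result = %f\n", *h_result);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}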

Persistent GPU shared memory

I am new to CUDA programming, and I mostly work with shared memory per block for performance reasons. The way my program is structured right now, I use one kernel to load the shared memory and another kernel to read the pre-loaded shared memory. But, as I understand it, shared memory cannot persist between two different kernels.
I have two solutions in mind; I am not sure about the first one, and the second might be slow.
First solution: Instead of using two kernels, I use one kernel. After loading the shared memory, the kernel may wait for an input from the host, perform the operation and then return the value to the host. I am not sure whether a kernel can wait for a signal from the host.
Second solution: After loading the shared memory, copy the shared memory values into global memory. When the next kernel is launched, copy the values from global memory back into shared memory and then perform the operation.
Please comment on the feasibility of the two solutions.
I would use a variation of your proposed first solution: as you already suspected, you can't wait for host input in a kernel - but you can synchronise the threads of your kernel at a given point. Just call __syncthreads() in your kernel after loading your data into shared memory.
I don't really understand your second solution: why would you copy data to shared memory just to copy it back to global memory in the first kernel? Or would this first kernel also compute something? In this case I guess it will not help to store the preliminary results in the shared memory first, I would rather store them directly in global memory (however, this might depend on the algorithm).
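A sketch of that single-kernel variant, with an invented load_then_process kernel and a fixed tile size of 256 just for illustration:

#include <cuda_runtime.h>

__global__ void load_then_process(const float *in, float *out, int n)
{
    __shared__ float tile[256];             // one tile per block (size is illustrative)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: what used to be the "loading" kernel.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    __syncthreads();  // every thread in the block now sees the fully loaded tile

    // Phase 2: what used to be the "processing" kernel, on the preloaded shared memory.
    if (i < n)
        out[i] = tile[threadIdx.x] * 2.0f;
}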

CUDA: CPU code in parallel to GPU code

I have a program where I do a bunch of calculations on the GPU, then I do memory operations with those results on the CPU, then I take the next batch of data and do the same all over. Now it would be a lot faster if I could do the first set of calculations and then start on the second batch while my CPU churns away at the memory operations. How would I do that?
All CUDA kernel calls (e.g. function<<<blocks, threads>>>()) are asynchronous -- they return control immediately to the calling host thread. Therefore you can always perform CPU work in parallel with GPU work just by putting the CPU work after the kernel call.
If you also need to transfer data from GPU to CPU at the same time, you will need a GPU that has the deviceOverlap field set to true (check using cudaGetDeviceProperties()), and you need to use cudaMemcpyAsync() from a separate CUDA stream.
There are examples to demonstrate this functionality in the NVIDIA CUDA SDK -- For example the "simpleStreams" and "asyncAPI" examples.
The basic idea can be something like this:
Do 1st batch of calculations on GPU
Enter a loop: {
    Copy results from device mem to host mem
    Do next batch of calculations on the GPU (the kernel launch is asynchronous and control returns immediately to the CPU)
    Process results of the previous iteration on the CPU
}
Copy results from the last iteration from device mem to host mem
Process results of the last iteration
You can get finer control over asynchronous work between CPU and GPU by using cudaMemcpyAsync, CUDA streams and CUDA events.
As @harrism said, you need your device to support deviceOverlap to do memory transfers and execute kernels at the same time, but even if it does not have that option you can at least execute a kernel asynchronously with other computations on the CPU.
Edit: deviceOverlap has been deprecated; one should use the asyncEngineCount property instead.
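Putting the pieces above together, a sketch of the pattern could look like this; process_batch, cpu_work and the single reused buffer are stand-ins for whatever the real program does:

#include <cuda_runtime.h>

__global__ void process_batch(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;           // stand-in for the real GPU work
}

void cpu_work(float *h_data, int n)
{
    for (int i = 0; i < n; ++i) h_data[i] *= 2.0f;  // stand-in for the CPU memory ops
}

void pipeline(float *d_data, float *h_data, int n, int nBatches)
{
    process_batch<<<(n + 255) / 256, 256>>>(d_data, n);          // batch 0, asynchronous

    for (int b = 1; b <= nBatches; ++b) {
        // Blocks until the previous kernel and this copy are done.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        if (b < nBatches)
            process_batch<<<(n + 255) / 256, 256>>>(d_data, n);  // next batch, asynchronous
        cpu_work(h_data, n);     // runs on the CPU while the next batch runs on the GPU
    }
}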

Copying an integer from GPU to CPU

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit - yes. Issuing the command, waiting for the GPU to respond, etc., etc. The amount of data (1 int vs. 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands of memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would probably be better to do the whole task on a CPU).
What is the best way to do this?
Well, I would suggest simply trying it yourself. As you said: you can either use mapped pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernel still has some work to do after sending the integer back; in that case, the latency of sending it to the host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadSynchronize() (or cudaDeviceSynchronize() in newer CUDA versions) to make sure the kernel has ended its execution. Kernel calls are asynchronous.
You can use cudaMemcpyAsync, which is also asynchronous, but the GPU cannot run a kernel and execute a cudaMemcpyAsync in parallel unless you use streams.
I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will, however, need to somehow prevent the loop from iterating too many times; CUDA events may be helpful.
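To make the two options concrete, here is a rough sketch; the step kernel and its "flag" semantics are invented for illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void step(int *d_flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *d_flag = 1;                        // placeholder: e.g. a "converged" indicator
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapped pinned memory

    // Option (a): mapped pinned memory, no explicit memcpy call.
    int *h_flag, *d_flag_mapped;
    cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag_mapped, h_flag, 0);

    step<<<1, 32>>>(d_flag_mapped);
    cudaDeviceSynchronize();                // kernel launches are asynchronous
    printf("mapped flag = %d\n", *h_flag);

    // Option (b): device memory plus a small cudaMemcpy after each launch.
    int *d_flag, h_copy = 0;
    cudaMalloc((void **)&d_flag, sizeof(int));
    for (int i = 0; i < 10; ++i) {
        step<<<1, 32>>>(d_flag);
        cudaMemcpy(&h_copy, d_flag, sizeof(int), cudaMemcpyDeviceToHost); // implicit sync
        if (h_copy == 1) break;             // the host decides whether to keep looping
    }

    cudaFree(d_flag);
    cudaFreeHost(h_flag);
    return 0;
}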
Why not use pinned memory, if your system supports it? See the CUDA C Programming Guide's section on page-locked host memory.
Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value, then this will result in very slow performance; don't do it.
What you are describing sounds like a serial algorithm; your algorithm needs to be parallelised in order to make it worth doing with CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, and a single write of multiple data back to the CPU, then your algorithm should be done on the CPU.
If you need the value computed in the previous kernel call to launch the next one, then it is serialized and your choice is a cudaMemcpy(dst, src, size=1, ...);
If all the kernel launch parameters do not depend on the previous launch, then you can store the result of each kernel invocation in GPU memory and then download all the results at once.
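A sketch of the second case, with an invented iterate kernel that produces one value per launch:

#include <cuda_runtime.h>

__global__ void iterate(int *d_results, int iter)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        d_results[iter] = iter * iter;      // one value per kernel invocation
}

void run(int nIters, int *h_results /* array of size nIters */)
{
    int *d_results;
    cudaMalloc((void **)&d_results, nIters * sizeof(int));

    for (int i = 0; i < nIters; ++i)
        iterate<<<1, 32>>>(d_results, i);   // launches queue up asynchronously

    // One transfer at the end instead of one per launch.
    cudaMemcpy(h_results, d_results, nIters * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_results);
}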