Copying an integer from GPU to CPU - cuda

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?

Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit - yes. Issuing the command, waiting for GPU to respond, etc, etc... The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would be probably better to do the whole task on a CPU)
what is the best way to do this?
Well, I would suggest simply trying it yourself. As you said: you can either use mapped-pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadsynchronize() to make sure the kernel ended its execution. Kernel calls are asynchronous.
You can use cudaMemcpyAsync which is also asynchronous, but GPU cannot have kernel running and having cudaMemcpyAsync executed in parallel, unless you use streams.
I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will need however to somehow prevent the loop from iterating too many times, CUDA events may be helpful.

Why not use pinned memory? If your system supports it -- see CUDA C Programming Guide's section on pinned memory.

Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value then this will result in very slow performance, don't do it.
What you are describing sounds like a serial algorithm, your algorithm needs to be parallelised in order to make it worth doing using CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, single write of multiple data back to CPU; then your algorithm should be done on CPU.

If you need the value computed in the previous kernel call to launch the next one then is serialized and your choice is to cudaMemcpy(dst,src, size =1, ...);
If all the kernel launch parameters do not depend on the previous launch then you can store all the result of each kernel invocation in GPU memory and then download all the results at once.

Related

How do you keep data in fast GPU memory (l1/shared) across kernel invocations?

How do you keep data in fast GPU memory across kernel invocations?
Let's suppose, I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.
One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, then I spend a lot of time just copying 1.5MB of data that really could persist across queries.
Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory. And then the kernel waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory, and volatile+monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the values in the memory with dummy values like -42 that indicate that the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....
Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, cuda streams, cuda events, and still a couple of volatile+monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for the cudaMemcpyAsync to work correctly. But, the async copy isn't blocked -- its effects just aren't observable. I suspect with enough grit, I can make this work too.
However, all of the above makes me think "I'm doing it wrong." How do you keep extremely re-usable data in the GPU caches so it can be accessed from one query to the next?
First of all to observe the effects of the streams and Async copying
you definitely need to pin the host memory. Then you can observe
concurrent kernel invocations "almost" happening at the same time.
I'd rather used Async copying since it makes me feel in control of
the situation.
Secondly you could just hold on to the data in global memory and load
it in the shared memory whenever you need it. To my knowledge shared
memory is only known to the kernel itself and disposed after
termination. Try using Async copies while the kernel is running and
synchronize the streams accordingly. Don't forget to __syncthreads()
after loading to the shared memory. I hope it helps.

Transferring data to GPU while kernel is running to save time

GPU is really fast when it comes to paralleled computation and out performs CPU with being 15-30 ( some have reported even 50 ) times faster however,
GPU memory is very limited compared to CPU memory and communication between GPU memory and CPU is not as fast.
Lets say we have some data what won't fit into GPU ram but we still want to use
it's wonders to compute. What we can do is split that data into pieces and feed it into GPU one by one.
Sending large data to GPU can take time and one might think, what if we would split a data piece into two and feed the first half, run the kernel and then feed the other half while kernel is running.
By that logic we should save some time because data transfer should be going on while computation is, hopefully not interrupting it's job and when finished, it can just, well, continue it's job without needs for waiting a new data path.
I must say that I'm new to gpgpu, new to cuda but I have been experimenting around with simple cuda codes and have noticed that the function cudaMemcpy used to transfer data between CPU and GPU will block if kerner is running. It will wait until kernel is finished and then will do its job.
My question, is it possible to accomplish something like that described above and if so, could one show an example or provide some information source of how it could be done?
Thank you!
is it possible to accomplish something like that described above
Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.
The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
I would also suggest this presentation which covers most of what is needed.
Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.

Accessing cuda device memory when the cuda kernel is running

I have allocated memory on device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is into something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills the location "0" and cudaMemcpy from that location back to host into stream 0, kernel that fills the location "1" and cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" and executing "1".
Check CUDA documentation, it's documented somewhere (in the best practices guide, I think).
You can't access GPU memory directly from the host regardless of a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
When you launch the Kernel it is an asynchronous (non blocking) call. Calling cudaMemcpy next will block until the Kernel has finished.
If you want to have the result for Debug purposes maybe it is possible for you to use cudaDebugging where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the Kernel code.
Or run only a threadblock of size (1,1) if you are interested in that specific result.

Is there a good way use a read only hashmap on cuda?

I am really new to programming and Cuda. Basically I have a C function that reads a list of data and then checks each item against a hashmap (I'm using uthash for this in C). It works well but I want to run this process in Cuda (once it gets the value for the hash key then it does a lot of processing), but I'm unsure the best way to create a read only hash function that's as quick as possible in Cuda.
Background
Basically I'm trying to value a very very large batch of portfolio as quickly as possible. I get several million portfolio constantly that are in the form of two lists. One has the stock name and the other has the weight. I then use the stock name to look up a hashtable to get other data(value, % change,etc..) and then process it based on the weight. On a CPU in plain C it takes about 8 minutes so I am interesting in trying it on a GPU.
I have read and done the examples in cuda by example so I believe I know how to do most of this except the hash function(there is one in the appendix but it seems focused on adding to it while I only really want it as a reference since it'll never change. I might be rough around the edges in cuda for example so maybe there is something I'm missing that is helpful for me in this situation, like using textual or some special form of memory for this). How would I structure this for best results should each block have its own access to the hashmap or should each thread or is one good enough for the entire GPU?
Edit
Sorry just to clarify, I'm only using C. Worst case I'm willing to use another language but ideally I'd like something that I can just natively put on the GPU once and have all future threads read to it since to process my data I'll need to do it in several large batches).
This is some thoughts on potential performance issues of using a hash map on a GPU, to back up my comment about keeping the hash map on the CPU.
NVIDIA GPUs run threads in groups of 32 threads, called warps. To get good performance, each of the threads in a warp must be doing essentially the same thing. That is, they must run the same instructions and they must read from memory locations that are close to each other.
I think a hash map may break with both of these rules, possibly slowing the GPU down so much that there's no use in keeping the hash map on the GPU.
Consider what happens when the 32 threads in a warp run:
First, each thread has to create a hash of the stock name. If these names differ in length, this will involve a different number of rounds in the hashing loop for the different lengths and all the threads in the warp must wait for the hash of the longest name to complete. Depending on the hashing algorithm, there might different paths that the code can take inside the hashing algorithm. Whenever the different threads in a warp need to take different paths, the same code must run multiple times (once for each code path). This is called warp divergence.
When all the threads in warp each have obtained a hash, each thread will then have to read from different locations in slow global memory (designated by the hashes). The GPU runs optimally when each of the 32 threads in the warp read in a tight, coherent pattern. But now, each thread is reading from an essentially random location in memory. This could cause the GPU to have to serialize all the threads, potentially dropping the performance to 1/32 of the potential.
The memory locations that the threads read are hash buckets. Each potentially containing a different number of hashes, again causing the threads in the warp to have to do different things. They may then have to branch out again, each to a random location, to get the actual structures that are mapped.
If you instead keep the stock names and data structures in a hash map on the CPU, you can use the CPU to put together arrays of information that are stored in the exact pattern that the GPU is good at handling. Depending on how busy the CPU is, you may be able to do this while the GPU is processing the previously submitted work.
This also gives you an opportunity to change the array of structures (AoS) that you have on the CPU to a structure of arrays (SoA) for the GPU. If you are not familiar with this concept, essentially, you convert:
my_struct {
int a;
int b;
};
my_struct my_array_of_structs[1000];
to:
struct my_struct {
int a[1000];
int b[1000];
} my_struct_of_arrays;
This puts all the a's adjacent to each other in memory so that when the 32 threads in a warp get to the instruction that reads a, all the values are neatly laid out next to each other, causing the entire warp to be able to load the values very quickly. The same is true for the b's, of course.
There is a hash_map extension for CUDA Thrust, in the cuda-thrust-extensions library. I have not tried it.
Because of your hash map is so large, I think it can be replaced by a database, mysql or other products will all be OK, they probably will be fast than hash map design by yourself. And I agree with Roger's viewpoint, it is not suitable to move it to GPU, it consumes too large device memory (may be not capable to contain it) and it is terribly slow for kernel function access global memory on device.
Further more, which part of your program takes 8 minutes, finding in hash map or process on weight? If it is the latter, may be it can be accelerated by GPU.
Best regards!

Is cudaMemcpy from host to device executed in parallel?

I am curious if cudaMemcpy is executed on the CPU or the GPU when we copy from host to device?
I other words, it the copy a sequential process or is it done in parallel?
Let me explain why I ask this: I have an array of 5 million elements . Now, I want to copy 2 sets of 50,000 elements from different parts of the array. SO, i was thinking will it be faster to first form a large array of all the elements i want to copy on the CPU and then do just 1 large transfer or should i just call 2 cudaMemcpy, one for each set.
If cudaMemcpy is done in parallel, then i think the 2nd approach will be faster as you dont have to copy 100000 elements sequentially on the CPU first
I am curious if cudaMemcpy is executed on the CPU or the GPU when we
copy from host to device?
In the case of the synchronous API call with regular pageable user allocated memory, the answer is it runs on both. The driver must first copy data from the source memory to a DMA mapped source buffer on the host, then signal to the GPU that data is waiting for transfer. The GPU then executes the transfer. The process is repeated as many times as necessary for the complete copy from source memory to the GPU.
The throughput of process can be improved by using pinned memory, which the driver can directly DMA to or from without intermediate copying (although pinning has a large initialization/allocation overhead which needs to be amortised as well).
As to the rest of the question, I suspect that two memory copies directly from the source memory would be more efficient than the alternative, but this is the sort of question that can only be conclusively answered by benchmarking.
I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus and, as such, it doesn't really make sense (even if it was physically possible) to run multiple operations in parallel.
I doubt you'll get any performance gain from concatenating the data before transferring it. The bottleneck will likely be the transfer itself. The copies should be queued and executed sequentially with minimal overhead.