Persistent GPU function/operation - cuda

I'm wondering if it is possible to write a persistent GPU function. I have my doubts, but I'm not sure how the scheduler works.
I'm looking to process an unknown number of data points, (approx 50 million). The data arrives in chunks of 20 or so. It would be good if I could drop these 20 points into a GPU 'bucket', and have this 'persistent' operation grab and process them as they come in. When done, grab the result.
I could keep the GPU busy w/ dummy data while the bucket is empty. But I think race conditions on a partially empty bucket will be an issue.
I suspect I wouldn't be able to run any other operations on the GPU while this persistent operation is running. i.e. Put other undedicated SM's to work.
Is this a viable (fermi) GPU approach, or just a bad idea?

I'm not sure whether this persistent kernel is possible, but it surely would be very inefficient. Although the idea is elegant, it doesn't fit the GPU: You would have to globally communicate which thread picks which element out of the bucket, some threads might never even start as they wait for others to finish and the bucket would have to be declared volatile and therefore slow down your entire input data.
A more common solution to your problem is to divide the data into chunks and asynchronosly copy the chunks onto the GPU. You would use two streams, one working on the last sent chunk and the other sending a new chunk from the host. This actually will be done simultaneously. That way you are likely to hide most of the transfer. But don't let the chunks become too small or your kernel will suffer from low occupancy and performance will degrade.

Related

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
Map + memcpy + unmap into the staging buffer, followed by a transient (single command) command buffer that contains a vkCmdCopyBuffer, submit it to the graphics queue and wait for the queue to idle, then free the transient command buffer. After that submit the regular frame draw queue to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
Similar to above, only instead use additional semaphores to signal after the staging buffer copy submit, and wait in the regular frame draw submit, thus skipping the "wait-for-idle" command.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because it synchronizes the CPU with the GPU, but I've never seen it used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?
First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.

How do you keep data in fast GPU memory (l1/shared) across kernel invocations?

How do you keep data in fast GPU memory across kernel invocations?
Let's suppose, I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.
One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, then I spend a lot of time just copying 1.5MB of data that really could persist across queries.
Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory. And then the kernel waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory, and volatile+monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the values in the memory with dummy values like -42 that indicate that the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....
Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, cuda streams, cuda events, and still a couple of volatile+monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for the cudaMemcpyAsync to work correctly. But, the async copy isn't blocked -- its effects just aren't observable. I suspect with enough grit, I can make this work too.
However, all of the above makes me think "I'm doing it wrong." How do you keep extremely re-usable data in the GPU caches so it can be accessed from one query to the next?
First of all to observe the effects of the streams and Async copying
you definitely need to pin the host memory. Then you can observe
concurrent kernel invocations "almost" happening at the same time.
I'd rather used Async copying since it makes me feel in control of
the situation.
Secondly you could just hold on to the data in global memory and load
it in the shared memory whenever you need it. To my knowledge shared
memory is only known to the kernel itself and disposed after
termination. Try using Async copies while the kernel is running and
synchronize the streams accordingly. Don't forget to __syncthreads()
after loading to the shared memory. I hope it helps.

Is there a good way use a read only hashmap on cuda?

I am really new to programming and Cuda. Basically I have a C function that reads a list of data and then checks each item against a hashmap (I'm using uthash for this in C). It works well but I want to run this process in Cuda (once it gets the value for the hash key then it does a lot of processing), but I'm unsure the best way to create a read only hash function that's as quick as possible in Cuda.
Background
Basically I'm trying to value a very very large batch of portfolio as quickly as possible. I get several million portfolio constantly that are in the form of two lists. One has the stock name and the other has the weight. I then use the stock name to look up a hashtable to get other data(value, % change,etc..) and then process it based on the weight. On a CPU in plain C it takes about 8 minutes so I am interesting in trying it on a GPU.
I have read and done the examples in cuda by example so I believe I know how to do most of this except the hash function(there is one in the appendix but it seems focused on adding to it while I only really want it as a reference since it'll never change. I might be rough around the edges in cuda for example so maybe there is something I'm missing that is helpful for me in this situation, like using textual or some special form of memory for this). How would I structure this for best results should each block have its own access to the hashmap or should each thread or is one good enough for the entire GPU?
Edit
Sorry just to clarify, I'm only using C. Worst case I'm willing to use another language but ideally I'd like something that I can just natively put on the GPU once and have all future threads read to it since to process my data I'll need to do it in several large batches).
This is some thoughts on potential performance issues of using a hash map on a GPU, to back up my comment about keeping the hash map on the CPU.
NVIDIA GPUs run threads in groups of 32 threads, called warps. To get good performance, each of the threads in a warp must be doing essentially the same thing. That is, they must run the same instructions and they must read from memory locations that are close to each other.
I think a hash map may break with both of these rules, possibly slowing the GPU down so much that there's no use in keeping the hash map on the GPU.
Consider what happens when the 32 threads in a warp run:
First, each thread has to create a hash of the stock name. If these names differ in length, this will involve a different number of rounds in the hashing loop for the different lengths and all the threads in the warp must wait for the hash of the longest name to complete. Depending on the hashing algorithm, there might different paths that the code can take inside the hashing algorithm. Whenever the different threads in a warp need to take different paths, the same code must run multiple times (once for each code path). This is called warp divergence.
When all the threads in warp each have obtained a hash, each thread will then have to read from different locations in slow global memory (designated by the hashes). The GPU runs optimally when each of the 32 threads in the warp read in a tight, coherent pattern. But now, each thread is reading from an essentially random location in memory. This could cause the GPU to have to serialize all the threads, potentially dropping the performance to 1/32 of the potential.
The memory locations that the threads read are hash buckets. Each potentially containing a different number of hashes, again causing the threads in the warp to have to do different things. They may then have to branch out again, each to a random location, to get the actual structures that are mapped.
If you instead keep the stock names and data structures in a hash map on the CPU, you can use the CPU to put together arrays of information that are stored in the exact pattern that the GPU is good at handling. Depending on how busy the CPU is, you may be able to do this while the GPU is processing the previously submitted work.
This also gives you an opportunity to change the array of structures (AoS) that you have on the CPU to a structure of arrays (SoA) for the GPU. If you are not familiar with this concept, essentially, you convert:
my_struct {
int a;
int b;
};
my_struct my_array_of_structs[1000];
to:
struct my_struct {
int a[1000];
int b[1000];
} my_struct_of_arrays;
This puts all the a's adjacent to each other in memory so that when the 32 threads in a warp get to the instruction that reads a, all the values are neatly laid out next to each other, causing the entire warp to be able to load the values very quickly. The same is true for the b's, of course.
There is a hash_map extension for CUDA Thrust, in the cuda-thrust-extensions library. I have not tried it.
Because of your hash map is so large, I think it can be replaced by a database, mysql or other products will all be OK, they probably will be fast than hash map design by yourself. And I agree with Roger's viewpoint, it is not suitable to move it to GPU, it consumes too large device memory (may be not capable to contain it) and it is terribly slow for kernel function access global memory on device.
Further more, which part of your program takes 8 minutes, finding in hash map or process on weight? If it is the latter, may be it can be accelerated by GPU.
Best regards!

Is it possible to split Cuda jobs between GPU & CPU?

I'm having a bit of problems understanding how or if its possible to share a work load between a gpu and cpu. I have a large log file that I need to read each line then run about 5 million operations on(testing for various scenarios). My current approach has been to read a few hundred lines, add it to an array and then send it to each GPU, which is working fine but because there is so much work per line and so many lines it takes a long time. I noticed that while this is going on my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad core Xeon & 2 Tesla GPUs, one cpu core reads the file(running the main program) and the GPU's do the work so I'm wondering how or what can I do to involve the other 7 cores into the process?
I'm a bit confused at how to design a program to balance the tasks between GPU/CPU because they both would finish the jobs at different times so I couldn't just send it to them all at the same time. I thought about setting up a queue(I'm new to c, so not sure if this is possible yet) but then is there a way to know when a GPU job is completed(since I thought sending jobs to Cuda was asynchronous)? I kernel is very similar to a normal c function so converting it for cpu usage is not problem just balancing the work seems to be the issue. I went though 'Cuda by example' again but couldn't really find anything referring to this type of balancing.
Any suggestions would be great.
I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
Since you say line-by-line may be you can split the jobs across 2 different process -
One CPU + GPU Process
One CPU process that utilized remaining 7 cores
You can start of each process with different offsets - like 1st process reads the lines 1-50, 101-150 etc while the 2nd one reads 51-100, 151-200 etc
This will avoid you the headache of optimizing CPU-GPU interaction

Is cudaMemcpy from host to device executed in parallel?

I am curious if cudaMemcpy is executed on the CPU or the GPU when we copy from host to device?
I other words, it the copy a sequential process or is it done in parallel?
Let me explain why I ask this: I have an array of 5 million elements . Now, I want to copy 2 sets of 50,000 elements from different parts of the array. SO, i was thinking will it be faster to first form a large array of all the elements i want to copy on the CPU and then do just 1 large transfer or should i just call 2 cudaMemcpy, one for each set.
If cudaMemcpy is done in parallel, then i think the 2nd approach will be faster as you dont have to copy 100000 elements sequentially on the CPU first
I am curious if cudaMemcpy is executed on the CPU or the GPU when we
copy from host to device?
In the case of the synchronous API call with regular pageable user allocated memory, the answer is it runs on both. The driver must first copy data from the source memory to a DMA mapped source buffer on the host, then signal to the GPU that data is waiting for transfer. The GPU then executes the transfer. The process is repeated as many times as necessary for the complete copy from source memory to the GPU.
The throughput of process can be improved by using pinned memory, which the driver can directly DMA to or from without intermediate copying (although pinning has a large initialization/allocation overhead which needs to be amortised as well).
As to the rest of the question, I suspect that two memory copies directly from the source memory would be more efficient than the alternative, but this is the sort of question that can only be conclusively answered by benchmarking.
I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus and, as such, it doesn't really make sense (even if it was physically possible) to run multiple operations in parallel.
I doubt you'll get any performance gain from concatenating the data before transferring it. The bottleneck will likely be the transfer itself. The copies should be queued and executed sequentially with minimal overhead.