Overlapped input and output for cublas functions - cuda

I'm working with some large data using the cublas library for matrix multiplication. To save memory space, I want something like A=A*B where A and B are both n-by-n square matrices, i.e. using the same memory space for the output and one of the input matrices.
While some old posts say this is not allowed in the cublas library, I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly the calculation is totally correct, and is stable across repeated runs. So I'm wondering if overlapped input and output is supported by the current cublas library. If yes, how much memory does it actually save? Intuitively the function at least needs some extra memory to store intermediate results, since each output element A_ij = sum_k A_ik * B_kj depends on a whole row of A. Is this particularly memory-saving for batched gemms?

While some old posts say this is not allowed in the cublas library,
And they are completely correct (noting that the "old posts" were referring to the standard GEMM calls, not the batched implementations you are asking about).
I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly the calculation is totally correct, and is stable across repeated runs
This isn't documented as being safe and I suspect you are probably only getting stable results by luck, given that small matrices are probably preloaded into shared memory or registers and so an in-place operation works. If you went to larger matrices, I guess you would see failures, because eventually there would be a case where a single GEMM could not be performed without multiple trips to the source matrix after a write cycle, which would corrupt the source matrix.
I would not recommend in-place operations even if you find it works for one case. Different problem sizes, library versions, and hardware could produce failures which you simply haven't tested. The choice and associated risk is up to you.
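If you do want A = A*B semantics while staying within documented behaviour, a conservative pattern is to give the batched GEMM a separate output buffer and copy the result back over A afterwards. A minimal sketch, assuming handle, d_A, d_B, n and batch already exist and the n-by-n matrices are packed contiguously per batch entry:

    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    size_t bytes = sizeof(cuDoubleComplex) * n * n * batch;

    cuDoubleComplex *d_C;
    cudaMalloc(&d_C, bytes);                    // one extra n-by-n buffer per matrix in the batch

    cublasZgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n,
                              &one,
                              d_A, n, (long long)n * n,
                              d_B, n, (long long)n * n,
                              &zero,
                              d_C, n, (long long)n * n,   // C does not alias A or B
                              batch);

    cudaMemcpy(d_A, d_C, bytes, cudaMemcpyDeviceToDevice);  // emulate A = A*B
    cudaFree(d_C);                                           // or keep d_C around for reuse

The extra memory cost is one output matrix per batch entry, which is the price of staying off the undocumented in-place path.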

Related

Optimal use of GPU resources in case of many interdependent tasks

In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
For any of the solutions, I would create an array in memory of 500 node blocks * 10,000 floats (= 20 MB), with each set of 10,000 floats stored as one contiguous block. (The number of floats should preferably be divisible by 32, e.g. 10,016 floats, for memory-alignment reasons.)
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and to create (by printing the source code into a string) a small program which calls the functions in turn. Each function should read its input from its predecessor blocks in memory and store its output in its own output block. Python should output the glue code (as a string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to store the intermediate results not in memory, but in local variables. These will be enough for all your specified cases (there is a maximum of 255 registers per thread), but it of course makes the program a little more complicated. The variables can be named freely, and you can have 500 of them; the compiler will optimize the assignment to registers and the reuse of registers. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
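A hedged sketch of what the generated glue kernel could look like, assuming (purely for illustration) element-wise node functions f_45, f_182, f_416 and f_352 that each produce one float per thread:

    __global__ void run_graph(const float *input, float *output)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // One local variable per node output; the compiler assigns them to
        // registers and reuses a register once its value is no longer needed.
        float node45  = f_45(input[i]);
        float node182 = f_182(input[i]);
        float node416 = f_416(node45);
        float node352 = f_352(node45, node182, node416);
        // ... one line per remaining node, in dependency order ...

        output[i] = node352;
    }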
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
In the device kernel, a for loop goes through the order list sequentially and calls the function from the list at each step.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU like the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure if it works in Python).
Or, regardless of host programming language, a separate kernel can create a translation table of device function pointers (an assign_kernel). The list from Python would then contain indices into this table.
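A hedged sketch of the translation-table approach; the function names, table size, and block-offset arrays are placeholders, not part of the original answer:

    typedef void (*node_fn_t)(const float *in, float *out);

    __device__ void f_add(const float *in, float *out) { /* ... */ }
    __device__ void f_mul(const float *in, float *out) { /* ... */ }

    __device__ node_fn_t d_fn_table[2];

    // Device code may take the address of __device__ functions, host code may
    // not, hence a separate assign_kernel that fills the table once at startup.
    __global__ void assign_kernel()
    {
        d_fn_table[0] = f_add;
        d_fn_table[1] = f_mul;
    }

    // The order list produced by Python then simply holds indices into the
    // table, plus input/output block offsets for each step.
    __global__ void run_order(const int *order, const int *in_off,
                              const int *out_off, int steps, float *blocks)
    {
        for (int s = 0; s < steps; ++s)
            d_fn_table[order[s]](blocks + in_off[s], blocks + out_off[s]);
    }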
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid may have to be swapped to memory (which could take a maximum of 860 MB per level, though probably not for your program). This could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
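A minimal dynamic-parallelism sketch under stated assumptions (compute capability 3.5 or higher, compilation with -rdc=true and linking against cudadevrt); node_kernel and the order list handling are placeholders:

    __global__ void node_kernel(int node, float *blocks)
    {
        // ... read this node's input blocks and write its output block ...
    }

    __global__ void scheduler_kernel(const int *order, int steps, float *blocks)
    {
        // A single scheduler thread launches one child grid per node. Launches
        // issued into the same device-side stream are expected to execute in
        // order, so a topological order is sufficient here; check the dynamic
        // parallelism chapter of the programming guide for the exact default
        // stream semantics of your CUDA version.
        if (threadIdx.x == 0 && blockIdx.x == 0)
            for (int s = 0; s < steps; ++s)
                node_kernel<<<64, 256>>>(order[s], blocks);
    }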
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling are done from Python, but the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
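A hedged sketch of the per-stream kernel plus host-callback pattern described above (cudaLaunchHostFunc requires CUDA 10+; on_node_finished, node_kernel and the scheduler hook are hypothetical):

    void CUDART_CB on_node_finished(void *user_data)
    {
        int node_id = *static_cast<int *>(user_data);
        // ... notify the host-side scheduler that this node's successors may
        //     now be queued (node_id must stay alive until this runs) ...
    }

    // For each node, on that node's own stream:
    node_kernel<<<grid, block, 0, node_stream>>>(inputs, output);
    cudaLaunchHostFunc(node_stream, on_node_finished, &node_id);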
For all methods
You can try different launch configurations, from 32 (or better, 64) threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large, and that you have many distinct functions. If the former does not hold, it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-grained dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your function's inter-dependency (or the buffer processing dependency). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
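A hedged sketch of that stream/event pattern, assuming two hypothetical kernels where kernel_B consumes the buffer kernel_A produces:

    cudaStream_t s1, s2;
    cudaEvent_t a_done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreate(&a_done);

    kernel_A<<<grid, block, 0, s1>>>(buf_a);         // produces buf_a
    cudaEventRecord(a_done, s1);                     // mark "A finished" on s1

    cudaStreamWaitEvent(s2, a_done, 0);              // s2 will not run B until A is done
    kernel_B<<<grid, block, 0, s2>>>(buf_a, buf_b);  // consumes buf_a

    cudaStreamSynchronize(s2);                       // host waits for the final result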
Using CUDA task graphs
With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
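One way to get a task graph without describing every node explicitly is stream capture; a minimal sketch, assuming the stream and kernels from the previous section already exist (note that cudaGraphInstantiate's signature was simplified in later CUDA releases):

    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // ... enqueue the same kernels, copies, and event waits as before ...
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);

    // Replay the whole dependency graph as a single unit, e.g. once per input batch.
    for (int b = 0; b < num_batches; ++b)
        cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);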
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = c[i] * d[i], you could instead write a kernel which performs the second action after the first, with inputs a, b, d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
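A hedged sketch of that fusion, with the intermediate c[i] kept in a register instead of a global buffer:

    __global__ void fused(const float *a, const float *b, const float *d,
                          float *e, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float c = a[i] + b[i];   // the first kernel's work, kept in a register
            e[i] = c * d[i];         // the second kernel's work
        }
    }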

Why are CUDA indices 2D? [duplicate]

I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this thread:
Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.
Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so, since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
I want to know if this situation still holds today. I want to know the reason why there is a need for a multidimensional "outer" grid.
What I'm trying to understand is whether or not there is a significant purpose to this (e.g. an actual benefit from spatial locality) or is it there for convenience (e.g. in an image processing context, is it there only so that we can have CUDA be aware of the x/y "patch" that a particular block is processing so it can report it to the CUDA Visual Profiler or something)?
A third option is that this is nothing more than a holdover from earlier versions of CUDA, where it was a workaround for hardware indexing limits.
There is definitely a benefit in the use of the multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers; see the PTX ISA:
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of this data can be used without further processing, you not only save arithmetic instructions (potentially at each indexing step of multi-dimensional data), but more importantly you save registers, which are a very scarce resource on any hardware.
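As an illustration (hypothetical kernels, not from the original answer), both of the following address a width x height image; the 2D launch reads its coordinates straight from the %ctaid/%tid special registers, while the 1D launch spends extra integer instructions and registers deriving x and y:

    __global__ void process_2d(float *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // straight from special registers
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] *= 2.0f;
    }

    __global__ void process_1d(float *img, int width, int height)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int x = idx % width;                             // extra arithmetic and registers
        int y = idx / width;
        if (y < height)
            img[y * width + x] *= 2.0f;
    }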

Is there a good way use a read only hashmap on cuda?

I am really new to programming and CUDA. Basically I have a C function that reads a list of data and then checks each item against a hashmap (I'm using uthash for this in C). It works well, but I want to run this process in CUDA (once it gets the value for the hash key, it does a lot of processing), and I'm unsure of the best way to create a read-only hashmap that's as quick as possible in CUDA.
Background
Basically I'm trying to value a very, very large batch of portfolios as quickly as possible. I constantly get several million portfolios in the form of two lists: one has the stock name and the other has the weight. I then use the stock name to look up a hashtable to get other data (value, % change, etc.) and then process it based on the weight. On a CPU in plain C it takes about 8 minutes, so I am interested in trying it on a GPU.
I have read and done the examples in CUDA by Example, so I believe I know how to do most of this except the hash function (there is one in the appendix, but it seems focused on adding to it, while I only really want it as a reference since it'll never change. I might be rough around the edges in CUDA, so maybe there is something I'm missing that would help in this situation, like using texture memory or some other special form of memory for this). How would I structure this for best results: should each block have its own access to the hashmap, should each thread, or is one good enough for the entire GPU?
Edit
Sorry, just to clarify: I'm only using C. Worst case I'm willing to use another language, but ideally I'd like something that I can natively put on the GPU once and have all future threads read from it (since to process my data I'll need to do it in several large batches).
These are some thoughts on the potential performance issues of using a hash map on a GPU, to back up my comment about keeping the hash map on the CPU.
NVIDIA GPUs run threads in groups of 32 threads, called warps. To get good performance, each of the threads in a warp must be doing essentially the same thing. That is, they must run the same instructions and they must read from memory locations that are close to each other.
I think a hash map may break with both of these rules, possibly slowing the GPU down so much that there's no use in keeping the hash map on the GPU.
Consider what happens when the 32 threads in a warp run:
First, each thread has to create a hash of the stock name. If these names differ in length, this will involve a different number of rounds in the hashing loop for the different lengths, and all the threads in the warp must wait for the hash of the longest name to complete. Depending on the hashing algorithm, there might be different paths that the code can take inside the hashing algorithm. Whenever the different threads in a warp need to take different paths, the same code must run multiple times (once for each code path). This is called warp divergence.
When all the threads in the warp have obtained a hash, each thread will then have to read from a different location in slow global memory (designated by the hash). The GPU runs optimally when each of the 32 threads in the warp reads in a tight, coherent pattern. But now each thread is reading from an essentially random location in memory. This could cause the GPU to have to serialize all the threads, potentially dropping the performance to 1/32 of its potential.
The memory locations that the threads read are hash buckets, each potentially containing a different number of entries, again causing the threads in the warp to have to do different things. They may then have to branch out again, each to a random location, to get the actual structures that are mapped.
If you instead keep the stock names and data structures in a hash map on the CPU, you can use the CPU to put together arrays of information that are stored in the exact pattern that the GPU is good at handling. Depending on how busy the CPU is, you may be able to do this while the GPU is processing the previously submitted work.
This also gives you an opportunity to change the array of structures (AoS) that you have on the CPU to a structure of arrays (SoA) for the GPU. If you are not familiar with this concept, essentially, you convert:
struct my_struct {
    int a;
    int b;
};
struct my_struct my_array_of_structs[1000];
to:
struct my_struct {
    int a[1000];
    int b[1000];
} my_struct_of_arrays;
This puts all the a's adjacent to each other in memory so that when the 32 threads in a warp get to the instruction that reads a, all the values are neatly laid out next to each other, causing the entire warp to be able to load the values very quickly. The same is true for the b's, of course.
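A hedged sketch of why this helps on the device (read_soa and out are hypothetical): with the SoA layout, the 32 threads of a warp touch 32 consecutive elements of a[] in one coalesced transaction.

    __global__ void read_soa(const struct my_struct *s, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s->a[i] + s->b[i];   // a[i] and b[i] are each laid out contiguously
    }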
There is a hash_map extension for CUDA Thrust, in the cuda-thrust-extensions library. I have not tried it.
Because your hash map is so large, I think it could be replaced by a database; MySQL or other products would all be OK, and they would probably be faster than a hash map you design yourself. And I agree with Roger's viewpoint: it is not suitable to move it to the GPU, since it consumes too much device memory (which may not even be able to contain it), and it is terribly slow for kernel functions to access global memory on the device.
Furthermore, which part of your program takes 8 minutes: the lookup in the hash map or the processing of the weights? If it is the latter, maybe it can be accelerated by the GPU.
Best regards!

Can CUDA handle its own work queues?

Sorry if this is obvious, but I'm studying c++ and Cuda right now and wanted to know if this was possible so I could focus more on the relevant sections.
Basically my problem is highly parallelizable; in fact I'm running it on multiple servers currently. My program gets a work item (a very small list), runs a loop on it, and makes one of 3 decisions:
keep the data (saves it),
discard the data (doesn't do anything with it),
process the data further (it's unsure of what to do, so it modifies the data and resends it to the queue to process).
This used to be a recursion, but I made each part independent, and although I'm no longer bound by one CPU, the negative effect is that there are a lot of messages passing back and forth. I understand at a high level how CUDA works and how to submit work to it, but is it possible for CUDA to manage the queue on the device itself?
My current thought process was to manage the queue on the C++ host and then send the processing to the device, after which the results are returned to the host and sent back to the device (and so on). I think that could work, but I wanted to see if it was possible to have the queue in CUDA memory itself and have kernels take work from it and send work directly to it.
Is something like this possible with CUDA or is there a better way to do this?
I think what you're asking is if you can keep intermediate results on the device. The answer to that is yes. In other words, you should only need to copy new work items to the device and only copy finished items from the device. The work items that are still undetermined can stay on the device between kernel calls.
You may want to look into CUDA Thrust for this. Thrust has efficient algorithms for transformations, which can be combined with custom logic (search for "kernel fusion" in the Thrust manual.) It sounds like maybe your processing can be considered to be transformations, where you take a vector of work items and create two new vectors, one of items to keep and one of items that are still undetermined.
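A hedged Thrust sketch of that idea; WorkItem, its state field, and the "decided" convention are hypothetical placeholders:

    #include <thrust/device_vector.h>
    #include <thrust/partition.h>

    struct WorkItem {
        int id;
        int state;   // hypothetical: 0 = still undetermined, nonzero = decided
    };

    struct IsDecided {
        __host__ __device__ bool operator()(const WorkItem &w) const {
            return w.state != 0;
        }
    };

    void split_work(thrust::device_vector<WorkItem> &items)
    {
        // Decided items end up before 'middle'; undetermined items after it
        // stay on the device for the next processing pass.
        auto middle = thrust::partition(items.begin(), items.end(), IsDecided());
        (void)middle;  // copy [begin, middle) back to the host, keep the rest
    }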
Is the host aware of (or can it monitor) memory on the device? My concern is how to be aware of and deal with data that starts to exceed the GPU's onboard memory.
It is possible to allocate and free memory from within a kernel but it's probably not going to be very efficient. Instead, manage memory by running CUDA calls such as cudaMalloc() and cudaFree() or, if you're using Thrust, creating or resizing vectors between kernel calls.
With this "manual" memory management you can keep track of how much memory you have used with cudaMemGetInfo().
Since you will be copying completed work items back to the host, you will know how many work items are left on the device and thus, what the maximum amount of memory that might be required in a kernel call is.
Maybe a good strategy will be to swap source and destination vectors for each transform. To take a simple example, say you have a set of work items that you want to filter in multiple steps. You create vector A and fill it with work items. Then you create vector B of the same size and leave it empty. After the filtering, some portion of the work items in A have been moved to B, and you have the count. Now you run the filter again, this time with B as the source and A as the destination.

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we're needing to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
Then the following is repeated an arbitrary number of times based on the request (dozens):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-gpu case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use Exclusive compute modes (ie. we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you will need to potentially create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to do sophisticated dependencies.
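A hedged CUDA C sketch of that round-robin idea (PyCUDA exposes equivalent calls); dev_count, the kernels, and the per-device buffers are placeholders, and the host buffers would need to be pinned for the async copies to actually overlap:

    for (int dev = 0; dev < dev_count; ++dev) {
        cudaSetDevice(dev);                  // no explicit context management needed (CUDA 4.0+)
        cudaStreamCreate(&streams[dev]);
        cudaEventCreate(&done[dev]);
    }

    for (int i = 0; i < num_items; ++i) {
        int dev = i % dev_count;             // round-robin assignment
        cudaSetDevice(dev);
        cudaMemcpyAsync(d_in[dev], h_in[i], bytes, cudaMemcpyHostToDevice, streams[dev]);
        slow_kernel<<<grid, block, 0, streams[dev]>>>(d_in[dev], d_out[dev]);
        cudaMemcpyAsync(h_out[i], d_out[dev], bytes, cudaMemcpyDeviceToHost, streams[dev]);
        cudaEventRecord(done[dev], streams[dev]);
        // the CPU can prepare the next item's data while this one runs on its GPU;
        // reuse of d_in[dev]/d_out[dev] is safe because each device's work is
        // serialized on its own stream
    }

    for (int dev = 0; dev < dev_count; ++dev) {
        cudaSetDevice(dev);
        cudaEventSynchronize(done[dev]);
    }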
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.