Can you predict the runtime of a CUDA kernel? - cuda

To what degree can one predict / calculate the performance of a CUDA kernel?
Having worked a bit with CUDA, this seems non-trivial.
But a colleague of mine, who does not work with CUDA, told me that it can't be hard if you know the memory bandwidth, the number of processors, and their speed.
What he said does not seem consistent with what I have read. This is what I imagine could work. What do you think?
Memory processed
------------------ = runtime for memory bound kernels ?
Memory bandwidth
or
Flops
------------ = runtime for computation bound kernels?
Max GFlops

Such a calculation will rarely give a good prediction. Many factors hurt performance, and they interact with each other in extremely complicated ways. Your calculation gives an upper bound on the performance (that is, a lower bound on the runtime), which in most cases is far from the actual performance.
For example, among memory-bound kernels, those with many cache misses will perform differently from those with mostly hits. The same goes for kernels with divergent branches, kernels with barriers, and so on.
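To make the best-case bound concrete, here is a minimal host-side sketch of the two ratios from the question, combined by taking the slower of the two. The `DeviceLimits` struct, the function name and the example numbers are assumptions for illustration; real peak values have to come from the data sheet or a benchmark of your card.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical device description. The numbers must come from your GPU's
// data sheet or a benchmark; nothing is queried from the driver here.
struct DeviceLimits {
    double peak_bandwidth_bytes_per_s;  // e.g. 100e9 for 100 GB/s
    double peak_flops_per_s;            // e.g. 1e12 for 1 TFLOP/s
};

// Lower bound on the kernel runtime in seconds: the kernel cannot finish
// faster than its memory traffic or its arithmetic allows, whichever is
// slower. Cache misses, divergence and barriers only add to this.
double best_case_runtime_s(double bytes_moved, double flops,
                           const DeviceLimits& dev) {
    assert(dev.peak_bandwidth_bytes_per_s > 0.0 && dev.peak_flops_per_s > 0.0);
    const double memory_time = bytes_moved / dev.peak_bandwidth_bytes_per_s;
    const double compute_time = flops / dev.peak_flops_per_s;
    return std::max(memory_time, compute_time);
}
```

For instance, with the assumed limits above, a kernel that moves 1 GB and performs 1 GFLOP is bounded by its memory traffic (about 10 ms) rather than by its arithmetic (about 1 ms). The measured runtime can be much higher than this bound.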
I suggest reading this paper, which might give you more ideas on the problem: "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness".
Hope it helps.

I think you can predict a best case with a bit of work, as you said, using instruction counts, memory bandwidth, input size, etc.
However, predicting the actual or worst case is much trickier.
First off, there are factors like memory access patterns. E.g., with older CUDA-capable cards, you had to take care to distribute your global memory accesses so that they wouldn't all contend for a single memory bank. (The newer CUDA cards use a hash between logical and physical addresses to resolve this.)
Secondly, there are non-deterministic factors like: how busy is the PCI bus? How busy is the host kernel? Etc.
I suspect the easiest way to get close to actual run-times is basically to run the kernel on subsets of the input and see how long it actually takes.

Related

Using dynamic parallelism results in 30x worse performance

Note: I don't have my computer and GPU with me, so this is me typing from memory. I timed this and compiled it correctly, so ignore any odd typos should they exist.
I don't know whether the overhead of what I describe below is the problem, whether I'm doing this wrong, or why launching kernels from kernels is slower than one big kernel in which a lot of threads are predicated off and never used. Maybe I'm not swamping the GPU with enough work to notice the saturation.
Suppose we're doing something simple for the sake of this example, like multiplying all the values in a square matrix by two. The matrices can be any size, but they won't be larger than 16x16.
Now suppose I have 200 matrices all in the device memory ready to go. I launch a kernel like
// One matrix given to each block
__global__ void matrixFunc(Matrix** matrices)
{
    Matrix* m = matrices[blockIdx.x];
    int area = m->width * m->height;
    if (threadIdx.x < area)
    {
        // Heavy calculations
    }
}
// Assume 200 matrices, no larger than 16x16
matrixFunc<<<200, 256>>>(ptrs);
whereby I'm using one block per matrix, and an abundance of threads such that I know I'm never going to have fewer threads per block than cells in a matrix.
The above runs in 0.17 microseconds.
This seems wasteful. I know that I have a bunch of small matrices (so 256 threads is overkill when a 2x2 matrix can function on 4 threads), so why not launch a bunch of them dynamically from a kernel to see what the runtime overhead is? (for learning reasons)
I change my code to be like the following:
// Must be __global__ (not __device__) to be launched with <<<...>>>
__global__ void matrixFunc(float* matrix)
{
    // Heavy calculations (on threadIdx.x for the cell)
}
__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;
    matrixFunc<<<1, area>>>(m->data);
}
matrixFuncCaller<<<1, 200>>>(ptrs);
But this performs a lot worse at 11.3 microseconds.
I realize I could put them all on a stream, so I do that. I then change this to make a new stream:
__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;
    // Create `stream`
    matrixFunc<<<1, area, 0, stream>>>(m->data);
    // Destroy `stream`
}
This does better, it's now 3 microseconds instead of 11, but it's still much worse than 0.17 microseconds.
I want to know why this is worse.
Is this kernel launching overhead? I figure that maybe my examples are small enough such that the overhead drowns out the work seen here. In my real application which I cannot post, there is a lot more work done than just "2 * matrix", but it still is probably small enough that there might be decent overhead.
Am I doing anything wrong?
In short: the benchmark is certainly biased and the computation is latency bound.
I do not know how you measured the timings, but I do not believe "0.17 microseconds" is even possible. The overhead of launching a kernel is typically a few microseconds (I have never seen an overhead smaller than 1 microsecond). Launching a kernel typically requires a system call, which is expensive and known to cost at least about 1000 cycles. An example of overhead analysis can be found in this research paper (confirming that a launch should take several microseconds). In addition, a RAM access takes at least 50-100 ns on current mainstream x86-64 platforms, and a memory access on a GPU takes several hundred cycles. While it is possible for everything to fit in both the CPU and GPU caches, this is very unlikely to be the case for these kernels (especially since the GPU may be used for other tasks between kernel executions). For more information about this, please read this research paper. Thus, what you measured almost certainly has nothing to do with the kernel execution. To measure the cost of a kernel, you need to take care of synchronization (e.g. call cudaDeviceSynchronize), since kernels are launched asynchronously.
When multiple kernels are launched, you may pay the overhead of an implicit synchronization, since the launch queue is certainly bounded (for the sake of performance). In fact, as pointed out by @talonmies in the comments, the number of concurrent kernels is limited to 16-128 (so fewer than the number of matrices).
Using multiple streams reduces the need for synchronization, hence the better performance, but there is almost certainly still some synchronization. That being said, for the comparison to be fair, you need to either add a synchronization in all cases or measure the execution time on the GPU itself (excluding the launch overhead), again in all cases.
Profilers like nvvp help a lot in understanding what is going on in such a case. I strongly advise you to use one.
As for the computation, please note that GPUs are designed for computationally heavy, SIMT-friendly kernels, not for low-latency kernels operating on small variable-sized matrices stored in unpredictable memory locations. In fact, the overhead of a global memory access is so big that it should be much bigger than the actual matrix computation. If you want GPUs to be useful, you need to submit more work to them (to provide more parallelism and thereby overlap the high latencies). If you cannot provide more work, the latency cannot be overlapped, and if you care about microsecond latencies, GPUs are clearly not suited for the task.
By the way, note that Nvidia GPUs operate on warps of typically 32 threads. The threads of a warp should perform coalesced memory loads/stores to be efficient (otherwise the accesses are split into many load/store requests). Operating on very small matrices like this likely prevents that, not to mention that most threads will do nothing. Flattening the matrices and sorting them by size, as proposed by @sebastian in the comments, helps a bit, but the computations and memory accesses will still be very inefficient for a GPU (not SIMT-friendly). Note that using fewer threads and unrolling the loops should also be a bit more efficient (but still far from great). CPUs are better suited for such a task (thanks to a higher frequency and instruction-level parallelism combined with out-of-order execution). For fast low-latency kernels like this, FPGAs can be even better suited (though they are hard to program).
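As a rough illustration of the flattening idea mentioned above, here is a minimal host-side sketch that packs all matrices into one contiguous buffer, sorted by size, with an offset table. The original `Matrix` struct is not shown in the question, so `HostMatrix`, `FlatBatch` and `flatten_sorted_by_size` are hypothetical names, and the row-major layout is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side matrix type; the question's Matrix struct is not shown.
struct HostMatrix {
    int width;
    int height;
    std::vector<float> data;  // assumed row-major, width * height cells
};

// All matrices packed back to back, plus where each one starts.
struct FlatBatch {
    std::vector<float> values;         // every cell of every matrix
    std::vector<std::size_t> offsets;  // offsets[i] = first cell of matrix i,
                                       // offsets.back() = total cell count
};

FlatBatch flatten_sorted_by_size(std::vector<HostMatrix> matrices) {
    // Sort by cell count so that neighbouring work items have similar sizes.
    std::stable_sort(matrices.begin(), matrices.end(),
                     [](const HostMatrix& a, const HostMatrix& b) {
                         return a.data.size() < b.data.size();
                     });
    FlatBatch batch;
    batch.offsets.push_back(0);
    for (const HostMatrix& m : matrices) {
        assert(m.data.size() == static_cast<std::size_t>(m.width) * m.height);
        batch.values.insert(batch.values.end(), m.data.begin(), m.data.end());
        batch.offsets.push_back(batch.values.size());
    }
    return batch;
}
```

With such a layout, `values` could be copied to the device once and processed by a single kernel launch with one thread per cell, instead of one launch per matrix. Whether that pays off for the real workload still has to be measured.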

random access gpgpu performance drop?

I've heard there is a drop in performance when performing computations on arrays with random access on a gpu.
My question is how severe is this performance drop?
Searching around, some comments seemed to imply the code ran faster on the CPU. But given the vast difference in integer and floating-point throughput between GPUs and CPUs, it seems difficult to believe performance would drop that badly.
I think it is related to cache misses. GPUs also have L1 and L2 caches, and if you access random memory locations, you are more likely to miss in the cache. GPUs also rely on a special memory access pattern called memory coalescing: the accesses of neighboring threads are combined into wide memory transactions. This is why GPUs are so fast when they run SIMD-friendly code. Random accesses break memory coalescing. I think it would be good to read the CUDA documentation to see how GPUs work.
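To get a feeling for how much coalescing matters, the following sketch counts how many aligned memory segments a single warp touches for a given access pattern. It is a simplified CPU-side model, not a measurement: the 128-byte segment size and the array starting on a segment boundary are assumptions, and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Counts how many aligned memory segments one warp touches when thread i
// reads element indices[i] of an array that starts on a segment boundary.
// Fewer segments means fewer memory transactions. The 128-byte default is
// an assumption; the real transaction size depends on the GPU generation.
std::size_t segments_touched(const std::vector<std::size_t>& indices,
                             std::size_t element_bytes,
                             std::size_t segment_bytes = 128) {
    assert(element_bytes > 0 && segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t index : indices) {
        segments.insert(index * element_bytes / segment_bytes);
    }
    return segments.size();
}
```

In this model, 32 threads reading 32 consecutive 4-byte floats touch one segment, while 32 threads reading floats that are 32 elements apart touch 32 segments, so the same amount of useful data costs up to 32 times the memory traffic. Random indices into a large array tend to behave like the second case.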

ArrayFire versus raw CUDA programming?

I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.
I tried rewriting my program with ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, < 100% speedup), and the returned results are not quite right (< 1% error compared to CPU routine, assuming the CPU routine's results are correct).
My task is mainly element-wise float-32 maths operations on large matrices (300MB-500MB size), with little if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory since there is a lot of data-reading, etc. The GPU I tested is a GeForce 580GTX with 3GB of video memory.
Is there still significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there are some memory-access tricks for faster data access and reducing bank conflicts. Does ArrayFire use these general tricks automatically or not?
Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.
First and foremost, code is really required here for anyone to help with specificity. Can you share the code you wrote?
Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that provides you with the ability to write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, painstaking, hand-optimized CUDA code. ArrayFire (and some other GPU libraries like CUBLAS) have many man-years of optimizations poured into them, and are typically going to give better results than most normal people will have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or other libraries). There are variables that can and should be tweaked in the usage of ArrayFire library calls to get the best performance. If you post your code, we can help share some of those here.
Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.
Fourth, yes, ArrayFire uses all the optimizations that are available in the NVIDIA CUDA Programming Guide (e.g. for faster data transfer and reducing memory bank conflicts, as you mention). That's where the bulk of ArrayFire development is focused: on optimizing those sorts of things.
Finally, the data discrepancies you noticed are likely due to the nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU, but rather that they are both working with finite amounts of precision in slightly different ways. If you're using single-precision instead of double, you might consider that. Posting code will let us help on that too.
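As a small illustration of how finite precision can produce different results from mathematically identical computations, the sketch below sums the same single-precision values in two different orders. This is only one possible source of such discrepancies and an assumption about the cause here, since the code in question has not been posted; the function names are made up for this example.

```cpp
#include <vector>

// Sums the values left to right in single precision.
float sum_forward(const std::vector<float>& values) {
    float total = 0.0f;
    for (float v : values) total += v;
    return total;
}

// Sums the same values right to left. Mathematically identical,
// but the intermediate results are rounded at different points.
float sum_backward(const std::vector<float>& values) {
    float total = 0.0f;
    for (auto it = values.rbegin(); it != values.rend(); ++it) total += *it;
    return total;
}
```

For the input {1.0f, 1e8f, -1e8f}, on a typical IEEE 754 platform without fast-math flags, the forward sum is 0 (the 1 is lost when added to 1e8) while the backward sum is 1. Neither device is "wrong" in such a case; the two results simply reflect different rounding along the way.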
Happy to expand my answer once code is posted.

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically with several values for the number of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that threads in a block may share some specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are distributed over blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Is that possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it is a very limited resource and it is not unlikely for kernels to have very specific needs that constrain the many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor you might want to opt for larger (48KB) L1-caches by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application, however, having the client code adjust optimally to both different device and launch parameters can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe automatically adjusting the block and thread sizes is a very difficult problem. If it were easy, CUDA would most probably provide this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
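To show how register usage and shared memory limit occupancy, here is a simplified host-side estimate. The limits in `SmLimits` are placeholder values (assumptions), the names are made up for this example, and the calculation ignores the allocation granularity that real hardware applies, so treat the result as a rough guide only.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. These values are placeholders (assumptions);
// read the real ones from cudaGetDeviceProperties or the programming guide.
struct SmLimits {
    int max_threads = 2048;
    int max_blocks = 32;
    int registers = 65536;
    int shared_mem_bytes = 49152;
};

// Estimated fraction of the multiprocessor's thread slots that are occupied.
// Simplified: ignores register and shared-memory allocation granularity.
double estimate_occupancy(int threads_per_block, int registers_per_thread,
                          int shared_bytes_per_block, const SmLimits& sm) {
    assert(threads_per_block > 0 && registers_per_thread > 0);
    // Resident blocks are capped by the block limit and the thread limit...
    int blocks = std::min(sm.max_blocks, sm.max_threads / threads_per_block);
    // ...by the register file...
    blocks = std::min(blocks,
                      sm.registers / (registers_per_thread * threads_per_block));
    // ...and by shared memory, if the kernel uses any.
    if (shared_bytes_per_block > 0) {
        blocks = std::min(blocks, sm.shared_mem_bytes / shared_bytes_per_block);
    }
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

With the placeholder limits, 256 threads per block at 32 registers per thread fill the multiprocessor, while doubling the register usage to 64 halves the estimate. The CUDA runtime also offers cudaOccupancyMaxPotentialBlockSize, which suggests a block size that maximizes occupancy for a given kernel; as noted above, maximal occupancy is not the same as the fastest configuration, so profiling is still needed.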
I have quite a good answer here. In short, computing the optimal distribution over blocks and threads is a difficult problem.

Why do GPU based algorithms perform faster

I just implemented an algorithm on the GPU that computes the difference between consecutive elements of an array. I compared it with a CPU-based implementation and noticed that for large arrays, the GPU-based implementation performs faster.
I am curious WHY the GPU-based implementation performs faster. Please note that I know the surface reasoning: a GPU has several cores and can thus do the operation in parallel, i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone tell me a deeper reason why GPUs perform faster? What is so different about their architecture that they can beat a CPU-based implementation?
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There is no obvious way to compare a CPU with a GPU. You can only discuss whether (and why) CPU implementation A of an algorithm is faster or slower than GPU implementation B of the same algorithm.
This ended up kind of vague, so here is the tip of the iceberg of concrete reasons: the strengths of a CPU are random memory access, branch prediction, etc. A GPU excels when there is a large amount of computation with high data locality, so that your implementation can achieve a good compute-to-memory-access ratio. SIMD makes GPU implementations slower than CPU ones where there is a lot of unpredictable branching to many code paths, for example.
The real reason is that a GPU not only has several cores, but has many cores, typically hundreds of them! Each GPU core, however, is much slower than a low-end CPU core.
But the programming model is not at all like that of multi-core CPUs, so most programs cannot be ported to GPUs or benefit from them.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPUs and GPUs differ in performance so much for certain problems is the design decisions made on how to allocate the chip's resources. CPUs devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated). By contrast, GPUs devote their chip space to packing as many floating-point ALUs on the chip as they can possibly get away with. The original purpose of GPUs was to multiply matrices as fast as possible (because that is the primary type of computation involved in graphics rendering). Since matrix multiplication is an embarrassingly parallel problem (i.e. each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALUs follow the instructions decoded by a single instruction decoder, since they are all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores. Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPUs are implemented such that each of these is a single-cycle operation and, in fact, even contain a fused multiply-and-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation.
Again, the trade-off here is that the chip space is devoted to the floating-point math hardware and other instructions (such as control flow) are often much slower per core than on a CPU or sometimes even just don't exist on a GPU at all.
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.