Utilizing GPU worth it? - cuda

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.

If you have fewer than 20 entities to simulate in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother with a GPU.
A multi-core CPU is much easier to program, and you avoid having to translate all your operations into GPU operations.
Also, as you already suggested, the performance gain from a GPU will be small (or even negative) with this small number of processes.

There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
How many different potentials do you need to simulate? Are there enough that you could reuse the structure from the N-body example and still keep the whole GPU busy?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
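As a rough illustration of the one-thread-per-pair scheme described above, here is a minimal CPU-side sketch in Python rather than CUDA. The names `pair_from_index`, `net_forces` and the toy `spring` force are assumptions made for illustration; the two loops stand in for the two phases that would be separated by __syncthreads() on the device.

```python
def pair_from_index(t, n):
    """Map a flat thread index t to a particle pair (i, j) with i < j."""
    i = 0
    row = n - 1  # number of pairs whose first member is particle i
    while t >= row:
        t -= row
        i += 1
        row -= 1
    return i, i + 1 + t


def net_forces(positions, pair_force):
    """Emulate one block: per-pair forces first, then per-particle sums."""
    n = len(positions)
    n_pairs = n * (n - 1) // 2

    # Phase 1: one "thread" per pair writes its force into a shared buffer.
    shared = [0.0] * n_pairs
    for t in range(n_pairs):
        i, j = pair_from_index(t, n)
        shared[t] = pair_force(positions[i], positions[j])

    # (On the GPU, a __syncthreads() barrier would go here.)

    # Phase 2: one "thread" per particle sums the forces acting on it.
    forces = [0.0] * n
    for k in range(n):
        for t in range(n_pairs):
            i, j = pair_from_index(t, n)
            if k == i:
                forces[k] += shared[t]  # force on i due to j
            elif k == j:
                forces[k] -= shared[t]  # equal and opposite force on j
    return forces


# Toy 1D example with a hypothetical linear (spring-like) pair force.
def spring(xi, xj):
    return xj - xi


print(net_forces([0.0, 1.0, 3.0], spring))
```

For N = 20 this mapping yields 190 pairs, so phase 1 would use 190 threads of a block and phase 2 only 20 of them.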

Related

Theoretical Speedup - GPU [duplicate]

I have a couple of doubts regarding the application of Amdahl's law to GPUs. For instance, I have a kernel that I launch with some number of threads, say N. So in Amdahl's law, the number of processors will be N, right? Also, for any CUDA program using a large number of threads, is it safe to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code?
Thanks
"For instance, I have a kernel code that I have launched with a number of threads, say N. So, in Amdahl's law the number of processors will be N, right?"
Not exactly. A GPU does not have as many physical cores (K) as the number of threads you can launch (N); usually K is around 10^3 and N is in the range 10^4 to 10^6. However, a significant portion of kernel time is (usually) spent just waiting for data to be read from or written to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K but actually depends on your kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
"Also, for any CUDA programming using a large number of threads, is it safe for me to assume that Amdahl's law is reduced to 1/(1-p), wherein p stands for the parallel code?"
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
That is the case if, for example, you compute the sum of two 100-element vectors on the GPU: device initialization, data copying, kernel launch overhead, etc. (the serial part) take much more time than kernel execution (the parallel part). Usually, however, this assumption does not hold.
Also, an individual GPU core does not have the same performance as a CPU core, so you should apply some scaling, making Amdahl's law 1 / [(1-p) + k*p/N] (at its simplest, k = Frequency(CPU) / Frequency(GPU); sometimes k is increased further to account for architectural differences, such as a CPU core having a SIMD unit).
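To make the scaled formula concrete, here is a small Python sketch that evaluates 1 / [(1-p) + k*p/N]. The function name `amdahl_speedup` and the sample values of p, N and k are assumptions chosen for illustration, not measurements.

```python
def amdahl_speedup(p, n, k=1.0):
    """Speedup for parallel fraction p on n cores, each k times slower than a CPU core."""
    return 1.0 / ((1.0 - p) + k * p / n)


# Hypothetical numbers: 95% parallel code, 1000 cores.
p = 0.95
print("k = 1:", amdahl_speedup(p, 1000))          # cores as fast as the CPU core
print("k = 3:", amdahl_speedup(p, 1000, k=3.0))   # each core 3x slower
print("limit 1/(1-p):", 1.0 / (1.0 - p))          # upper bound as n grows
```

Even with many cores the result stays below 1/(1-p), and a larger k pushes it further down.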
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not capture some non-trivial effects.
First, Amdahl's law assumes that given an infinite number of cores, the parallel part executes instantly. This assumption is not true (though sometimes it is fairly accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but that somewhat "breaks" the idea.
How to correctly account in Amdahl's law for barrier synchronization, critical sections, atomic operations, etc. is, to the best of my knowledge, an unresolved question. Such operations belong to the parallel part, but the wall time of their execution is at best independent of the number of threads and at worst grows with it.
A simple example: broadcast time between computational nodes in a CPU cluster scales as O(log N). Some initialization can take up to O(N) time.
In simple cases one can roughly estimate the benefit of parallelizing an algorithm, but (as is often the case with CUDA) the static overhead of parallel processing may cost more time than the parallel processing itself saves.
So, in my opinion, it is usually simpler to write the application, measure its performance, and use the measurements to plot Amdahl's curve than to try to estimate a priori all the nuances of the algorithm and hardware. In cases where such estimates can be made easily, they are usually obvious without any "laws".
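One possible way to do such a fit, sketched in Python: the unscaled model T(n)/T(1) = (1-p) + p/n can be rewritten as 1 - T(n)/T(1) = p * (1 - 1/n), which is a straight line through the origin, so p follows from a one-parameter least-squares fit. The function name `fit_parallel_fraction` and the timings below are made up for illustration; the sketch ignores the scaling factor k and any saturation at N0.

```python
def fit_parallel_fraction(worker_counts, times, t1):
    """Least-squares estimate of p in T(n)/T(1) = (1 - p) + p/n (all n > 1)."""
    num = 0.0
    den = 0.0
    for n, t in zip(worker_counts, times):
        x = 1.0 - 1.0 / n   # model term multiplying p
        y = 1.0 - t / t1    # observed fractional time saving
        num += x * y
        den += x * x
    return num / den


# Hypothetical measurements: 10 s on one worker, then 2, 4, 8 and 16 workers.
t1 = 10.0
counts = [2, 4, 8, 16]
times = [5.6, 3.3, 2.2, 1.7]
p = fit_parallel_fraction(counts, times, t1)
print("estimated parallel fraction:", p)
print("implied speedup limit 1/(1-p):", 1.0 / (1.0 - p))
```

If the fitted curve matches the measurements poorly, that is itself a sign that overheads outside the simple model dominate.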

Median selection in CUDA kernel

I need to compute the median of an array of size p inside a CUDA kernel (in my case, p is small e.g. p = 10). I am using an O(p^2) algorithm for its simplicity, but at the cost of time performance.
Is there a "function" to find the median efficiently that I can call inside a CUDA kernel?
I know I could implement a selection algorithm, but I'm looking for a function and/or tested code.
Thanks!
Here are a few hints:
Use a better selection algorithm: QuickSelect is a faster version of QuickSort for selecting the kth element in an array. For compile-time-constant mask sizes, sorting networks are even faster, thanks to high TLP and an O(log^2 n) critical path. If you only have 8-bit values, you can use a histogram-based approach. This paper describes an implementation that takes constant time per pixel, independent of mask size, which makes it very fast for very large mask sizes. You can parallelize it by using a minimal launch strategy (only run as many threads as you need to keep all SMs at max capacity), tiling the image, and letting threads of the same block cooperate on each kernel histogram.
Sort in registers. For small mask sizes, you can keep the entire array in registers, making median selection with a sorting network much faster. For larger mask sizes, you can use shared memory.
Copy all pixels used by the block to shared memory first, and then copy to thread-local buffers that are also in shared memory.
If you only have a few masks that need to go really fast (such as 3x3 and 5x5), use templates to make them compile time constants. This can speed things up a lot because the compiler can unroll loops and re-order a lot more instructions, possibly improving load batching and other goodies, leading to large speed-ups.
Make sure your reads are coalesced and aligned.
There are many other optimizations you can do. Make sure you read through the CUDA documents, especially the Programming Guide and the Best Practices Guide.
When you really want to gun for high performance, don't forget to take a good look at a CUDA profiler, such as the Visual Profiler.
Even in a single thread one can sort the array and pick the value in the middle in O(p*log(p)), which makes O(p^2) look excessive. If you have p threads at your disposal it's also possible to sort the array as fast as O(log(p)), although that may not be the fastest solution for small p. See the top answer here:
Which parallel sorting algorithm has the best average case performance?
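For reference, the QuickSelect idea mentioned in the hints can be sketched as follows. This is a plain Python illustration, not tested CUDA code; the names `quickselect` and `median` are assumptions, and a device version would work on a small per-thread array instead of a Python list.

```python
def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    a = list(values)  # work on a copy, like a per-thread local buffer
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        # Hoare-style partition of a[lo..hi] around the pivot value.
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Continue only in the part that contains index k.
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            return a[k]  # k lies between j and i, so a[k] equals the pivot
    return a[k]


def median(values):
    """Upper median for even lengths, exact median for odd lengths."""
    return quickselect(values, len(values) // 2)


print(median([9, 2, 7, 4, 5, 0, 8, 1, 6, 3]))
```

The expected cost is O(p) instead of O(p^2), although for p around 10 the data-dependent branches may matter more on a GPU than the operation count.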

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: how can I compare my CUDA program with another CUDA program reported in a paper (the two run on different GPUs and I cannot access the source code)?
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. Normalizing per core helps even out large differences between, say, a 200-core and a 2000-core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
The speed of a CUDA program on different GPUs depends on many properties of the GPU, such as memory bandwidth, core clock speed, number of cores, and the number of threads, registers and amount of shared memory available, so it is difficult to compare performance across different GPUs.
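The reporting advice above boils down to a little arithmetic, sketched here in Python. The function names and all timings are hypothetical; the point is only that a kernel-only speedup and an end-to-end speedup (including transfers) can differ a lot, and that per-core throughput is one way to normalize across devices.

```python
def speedup_report(cpu_time, kernel_time, h2d_time, d2h_time):
    """Speedup with and without host-to-device / device-to-host transfer times."""
    gpu_total = h2d_time + kernel_time + d2h_time
    return {
        "kernel_only": cpu_time / kernel_time,
        "end_to_end": cpu_time / gpu_total,
    }


def throughput_per_core(units_of_work, seconds, cores):
    """Units of work per second per processor core."""
    return units_of_work / (seconds * cores)


# Made-up timings in seconds, for illustration only.
report = speedup_report(cpu_time=2.0, kernel_time=0.0625,
                        h2d_time=0.125, d2h_time=0.0625)
print(report)
print(throughput_per_core(1_000_000, 0.25, 200))
```

With these made-up numbers the kernel alone looks four times better than the end-to-end figure, which is why the transfer times belong in the report.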

How does the speed of CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which has 240 processor cores and compute capability 1.3. Since every 8 cores are controlled by a single multiprocessor, and each block of threads is assigned to a single multiprocessor, I would expect a grid of 30 blocks to take the same execution time as a single block. However, things don't scale that nicely, and I never got this scaling even with 8 threads per block. At the other extreme, with 512 threads per block, I get approximately the same time as for one block only when the grid contains at most 5 blocks. This was disappointing compared with the same task parallelized with MPI on an 8-core CPU machine.
Can someone explain this to me?
By the way, the computer actually contains two of these Tesla cards, so are blocks distributed between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: At Pedro's request, here is a plot showing the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads per block is 512. The numbers are rough, since I observed quite a large variance in the times for large numbers of blocks.
The speed is not a simple linear function of the number of blocks. It depends on many things, for example memory usage, the number of instructions executed per block, etc.
If you want to do multi-GPU computing, you need to modify your code; otherwise you can only use one GPU card.
It seems to me that you have simply taken a C program and compiled it in CUDA without much thought.
Dear friend, this is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access pattern - there is a number of memory systems in a GPU and each requires consideration on how to use it best
thread divergence problems - performance will only be good if most of your threads follow the same code path most of the time
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The thing is that the memory areas of the two are separate and not easily 'visible' to each other, so you have to design your algorithm to take this into account.
A typical C program written in the pre-GPU era will often not be easily transplantable unless it was originally written with MPI in mind.
To make each MPI process work with a different GPU card, you can use cudaSetDevice().
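One common convention for choosing the device index is to take the process's rank on the node modulo the number of devices. That convention is an assumption here, not something stated above. The sketch below only computes the index in Python; in a real program the result would be passed to cudaSetDevice(), and `device_for_rank` is a hypothetical helper name.

```python
def device_for_rank(local_rank, device_count):
    """Pick a device index for an MPI process by round-robin over the devices."""
    if device_count < 1:
        raise ValueError("no CUDA devices available")
    return local_rank % device_count


# Two cards, four processes on the same node: ranks alternate between devices.
for rank in range(4):
    print("rank", rank, "-> device", device_for_rank(rank, 2))
```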

Where does super-linear speedup come from?

In parallel computing, super-linear speedup is theoretically not possible. But in practice we do see such cases. One reason is the cache effect, but I fail to understand what role it plays. There are also other factors involved, but what are they? In summary,
How are super-linear speedups possible?
I'm a beginner with respect to parallel computing.
Suppose you have an 8 processor machine, each processor has a 1MB cache, and your computation uses 6MB of data.
On 1 processor the computation will be doing a lot of data movement between CPU, cache and RAM. On 8 processors each one handles about 0.75MB, which fits in its own 1MB cache, so the computation will only have to move data between CPU and cache. This way you can achieve super-linear speedup.
These figures and this analysis have been simplified for exposition for a beginner.
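The same example can be put into numbers with a deliberately crude cost model, sketched in Python. The relative costs (1 per MB served from cache, 10 per MB served from RAM) and the function name `run_time` are assumptions for illustration, not measurements of any real machine.

```python
def run_time(data_mb, processors, cache_mb=1.0, cache_cost=1.0, ram_cost=10.0):
    """Toy time per processor: its share of the data, split into cache and RAM parts."""
    share = data_mb / processors
    in_cache = min(share, cache_mb)  # what fits in this processor's cache
    in_ram = share - in_cache        # the rest has to come from RAM
    return in_cache * cache_cost + in_ram * ram_cost


t1 = run_time(6.0, 1)  # 1MB from cache, 5MB from RAM
t8 = run_time(6.0, 8)  # 0.75MB per processor, all from cache
print("speedup on 8 processors:", t1 / t8)
```

Under this toy model the speedup on 8 processors comes out far above 8, purely because the working set now fits in the combined caches.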
In short, superlinear speedup is achieved when the total amount of work processors do is strictly less than the total work performed by a single processor.
This can happen in three ways:
The original sequential algorithm was really bad. Running the parallel version of the algorithm on one processor will usually do away with the superlinear speedup.
The parallel algorithm uses some kind of search, such as a random walk: the more processors that are walking, the less distance has to be walked in total before you reach what you are looking for.
Modern processors have faster and slower memories, and typically try to keep the data you are using in the fast memory. Suppose your data is larger than the fast memory of one processor. If you use n processors you have n times as much fast memory, so more of the data fits in fast memory and the same task takes less time (and thus less work).