Where does super-linear speedup come from? - language-agnostic

In parallel computing theoretically super-linear speedup is not possible. But in practice we do see such cases. One reason is cache effect but I fail to understand what does it play. Also, there are other things involved but what are they? In summary,
How are super-linear speedups possible?
I'm a beginner with respect to parallel computing.

Suppose you have an 8 processor machine, each processor has a 1MB cache, and your computation uses 6MB of data.
On 1 processor the computation will be doing a lot of data movement between CPU, cache and RAM. On 8 processors the computation will only have to move data between CPU and cache. This way you can achieve super-linear speedup.
These figures and this analysis have been simplified for exposition for a beginner.

In short, superlinear speedup is achieved when the total amount of work processors do is strictly less than the total work performed by a single processor.
This can happen in three ways:
The original sequential algorithm was really bad, using the parallel version of the algorithm on one processor will usually do away with the superlinear speedup.
The parallel algorithm uses some search like a random walk, the more processors that are walking, the less distance has to be walked in total before you reach what you are looking for.
Modern processors have faster and slower memories. Typically it will try to keep the data you are using in the fast memory. We can safely say your amount of data is larger than the amount of fast memory. If you use n processors you have n times the amount of faster memory. More data fits in the fast memory which makes it possible to take less time (thus amount of work) on the same task.

Related

Why is solving system of linear equations using cula(dgesv) slower than mkl (dgesv) for small data sets

I have written a CUDA C and C program to solve a matrix equation Ax=b using CULA routine dgesv and MKL routine dgesv. It seems like for a small data set, the CPU program is faster than the GPU program. But the GPU overcomes the CPU as the data set increases past 500. I am using my dell laptop which has i3 CPU and Geforce 525M GPU. What is the best explanation for the initial slow performance of the GPU?
I wrote another program which takes two vectors, multiplies them and add the result. This is just like the dot product just that the result is a vector sum not a scalar. In this program, the GPU is faster than the CPU even for small data set. I am using the same notebook. Why is the GPU faster in this program even for small data set as compared to the one explained above? Is it because there is not much computation involved in the summation?
It's not uncommon for GPUs to be less interesting on small data sets as compared to large data sets. The reasons for this will vary depending on the specific algorithm. GPUs generally have a higher main memory bandwidth than CPUs and also can usually outperform them for heavy-duty number crunching. But GPUs usually only work well when there is parallelism inherent in the problem, which can be exposed. Taking advantage of this parallelism allows an algorithm to tap into the greater memory bandwidth as well as the higher compute capability.
However, before the GPU can do anything, it's necessary to get the data to the GPU. And this creates a "cost" to the GPU version of the code that will not normally be present in the CPU version.
To be more precise, the GPU will provide a benefit when the reduction in computation time on the GPU (over the CPU) exceeds the cost of the data transfer. I believe that solving a system of linear equations is somewhere between O(n^2) and O(n^3) complexity. For very small n, this computational complexity may not be large enough to offset the cost of data transfer. But clearly as n becomes larger it should. On the other hand your vector operation may only be O(n) complexity. So the benefit scenario will look different.
For the O(n^2) or O(n^3) case, as we move to larger data sets, the "cost" to transfer the data increases as O(n), but the compute requirements for solution increase as O(n^2) (or O(n^3)). Therefore larger data sets should have exponentially larger compute workloads, reducing the effect of the "cost" of the data transfer. An O(n) problem on the other hand, probably won't have this scaling dynamic. The workload increases at the same rate as the "cost" of data transfer.
Also note that if the "cost" of transferring data to the GPU can be hidden by overlapping it with computation work, then the "cost" for the overlapped portion becomes "free", i.e. it does not contribute to the overall solution time.

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: How can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs and I cannot access the source code).
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per processor core will help normalize large differences between, say, a 200 core and a 2000 core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
The speed of cuda program on different GPUs depends on many factors of the GPU like memory bandwidth, core clock speed, cores, number of threads/registers/shared memory available. so it is difficult to compare the performance in different GPUs

GPGPU Applications other than Image processing?

I am in search for few cpu applications which can be ported to gpgpu for better efficiency.
Else where can gpgpu be used other than image processing area ?
This is actually for my graduate project.
The specialized processing architectures of GPU compute engines are useful for just about any data crunching problem where you have:
a non-trivial amount of data
a non-trivial computation to perform on every element of that data,
the input data needed to compute each output element fits in GPU memory, or can be choreographed to arrive in GPU memory when it is needed.
It helps if the computation can be performed independently on all data elements at the same time, but this is not strictly required.
Image processing happens to be one example of that scenario - a finite (but large) number of pixels to process, and many image algorithms can be executed on each pixel in parallel.
Other examples include: generalized signal analysis, such as processing audio signals. Image processing is just a specialized form of signal analysis. Pattern recognition, where much of the challenge is to separate the signal from the noise. Voice recognition, anyone? 3 dimensional surface matching, such as figuring out the shapes of organic compounds based on the flex angles of their chemical bonds, or figuring out if two organic compounds are likely to interact in interesting ways (eg, bioreceptors). Physical modelling of all kinds (collision simulations, seismic analysis, etc). And of course, cryptography, where you can always spend more compute time going over the same data again and again.
GPU compute engines are not well suited for problems where the volume of data significantly dwarfs the computations to be performed. GPUs work well on stuff in memory. Moving data into or out of GPU memory is often the most expensive step of an entire computation, so you want to make sure you have enough computation going on to "make up" for the cost of loading the data into memory. If the data is too big to fit into memory you have to adopt distributed computing tactics.
For example, calculating a primary key index of a petabyte database probably isn't a great fit for a GPU since most of the effort will probably be spent just getting the data off the hard disk into memory. The index computation itself is fairly trivial, which doesn't make for a very interesting GPU win, and while I'm sure the data could be carved up into chunks and the chunks indexed independently by a boatload of GPU cores, variability in the data will likely prevent the GPU from operating at its full capacity. (GPU code works best when all "oarsmen" (processor cores / threads) are pulling in the same direction - uniform execution on separate data) While database indexing might see some benefit by using a GPU approach, it certainly won't be as big of a performance improvement over CPU baseline as something better suited to the GPU execution model constraints - like signal processing.
Brute force crypto attacks? MD5 has been done by Whitepixel and SHA-256 has been done by all the Bitcoin miners. On the other hand, I am not aware of any GPU implementations of bcrypt() or scrypt(), but an academic working in the area is probably a better person to ask.
Easiest way to figure out what kinds of applications are well-suited for GPGPU is to look at the speedups other groups have achieved. Here are a couple of links with that information:
NVIDIA's case studies
AccelerEyes' case studies
Looks like military/aerospace, life science, energy, finance, manufacturing, media, and a few other smaller industries have examples of strong speedups.

Why do GPU based algorithms perform faster

I just implemented an algorithm on the GPU that computes the difference btw consecutive indices of an array. I compared it with a CPU based implementation and noticed that for large sized array, the GPU based implementation performs faster.
I am curious WHY does the GPU based implementation perform faster. Please note that i know the surface reasoning that a GPU has several cores and can thus do the operation is parallel i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone tell me a deeper reason as to why GPU's perform faster. What is so different about their architecture that they can beat a CPU based implementation
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There are no obvious way to compare a CPU versus a GPU. You can only discuss whether (and why) the CPU implementation A of an algorithm is faster or slower than a GPU implementation B of this algorithm.
This ended up kind of vague, so a tip of an iceberg of concrete reasons would be: The strong side of CPU is random memory access, branch prediction, etc. GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice ratio of compute-to-memory-access. SIMD makes GPU implementations slower than CPU where there's a lot of unpredictable braching to many code paths, for example.
The real reason is that a GPU not only has several cores, but it has many cores, typically hundreds of them! Each GPU core however is much slower than a low-end CPU.
But the programming mode is not at all like multi-cores CPUs. So most programs cannot be ported to or take benefit from GPUs.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPU's and GPU's differ in performance so much for certain problems is design decisions made on how to allocate the chip's resources. CPU's devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated.) By contrast, GPU's devote their chip space to packing as many floating-point ALU's on the chip as they can possibly get away with. The original purpose of GPU's was to multiply matricies as fast as possible (because that is the primary type of computation involved in graphics rendering.) Since matrix multiplication is an embarrasingly parallel problem (e.g. each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALU's follow the instructions decoded by a single instruction decoder, since they're all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores. Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPU's are implemented such that each of these are single-cycle operations and, in fact, even contain a fused multiply-and-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off here is that the chip space is devoted to the floating-point math hardware and other instructions (such as control flow) are often much slower per core than on a CPU or sometimes even just don't exist on a GPU at all.
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.