Why do GPU based algorithms perform faster - cuda

I just implemented an algorithm on the GPU that computes the difference btw consecutive indices of an array. I compared it with a CPU based implementation and noticed that for large sized array, the GPU based implementation performs faster.
I am curious WHY does the GPU based implementation perform faster. Please note that i know the surface reasoning that a GPU has several cores and can thus do the operation is parallel i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone tell me a deeper reason as to why GPU's perform faster. What is so different about their architecture that they can beat a CPU based implementation

They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There are no obvious way to compare a CPU versus a GPU. You can only discuss whether (and why) the CPU implementation A of an algorithm is faster or slower than a GPU implementation B of this algorithm.
This ended up kind of vague, so a tip of an iceberg of concrete reasons would be: The strong side of CPU is random memory access, branch prediction, etc. GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice ratio of compute-to-memory-access. SIMD makes GPU implementations slower than CPU where there's a lot of unpredictable braching to many code paths, for example.

The real reason is that a GPU not only has several cores, but it has many cores, typically hundreds of them! Each GPU core however is much slower than a low-end CPU.
But the programming mode is not at all like multi-cores CPUs. So most programs cannot be ported to or take benefit from GPUs.

While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPU's and GPU's differ in performance so much for certain problems is design decisions made on how to allocate the chip's resources. CPU's devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated.) By contrast, GPU's devote their chip space to packing as many floating-point ALU's on the chip as they can possibly get away with. The original purpose of GPU's was to multiply matricies as fast as possible (because that is the primary type of computation involved in graphics rendering.) Since matrix multiplication is an embarrasingly parallel problem (e.g. each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALU's follow the instructions decoded by a single instruction decoder, since they're all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores. Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPU's are implemented such that each of these are single-cycle operations and, in fact, even contain a fused multiply-and-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off here is that the chip space is devoted to the floating-point math hardware and other instructions (such as control flow) are often much slower per core than on a CPU or sometimes even just don't exist on a GPU at all.
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.

Related

Theoretical Speedup - GPU [duplicate]

I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code?
Thanks
For instance, I have a kernel code that I have launched with a number
of threads, say N. So,in the amdahl's law the number of processors
will be N right?
Not exactly. GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 103, N is in range 104 -- 106). However, significant portion of kernel time is (usually) spend just waiting for the data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon you kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
Also, for any CUDA programming using a large number of threads, is it
safe for me to assume that the Amdahl's law is reduced to 1/(1-p)
wherein p stands for the parallel code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
E.g. if you compute the sum of two 100-element vectors on GPU, then initializing of device, data copying, kernel launch overhead etc (serial part) takes much more time than kernel execution (parallel part). However, usually this is not true.
Also, the individual GPU core does not have the same performance as CPU core, so you should do some scaling, making Amdah'l law 1 / [(1-p) + k*p/N] (at it's simplest, k = Frequency(CPU) / Frequency(GPU), sometimes k is increased even more to take into account architectural differences, like CPU core having SIMD block).
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.
First, Amdahl's law assumes that given infinite number of cores the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quanta", or include it in serial portion of algorithm, but it somewhat "breaks" the idea.
How to correctly estimate in Amdahl's law the effect of barrier synchronization, critical section, atomic operations etc. is, to the best of my knowledge, unresolved mystery. Such operations belong to parallel part, but walltime of their execution is at best independent of the number of threads and, at worst, is positively dependent.
Simple example: broadcasting time between computational nodes in CPU cluster scales as O(log N). Some initial initialization can take up to O(N) time.
In simple cases one can somewhat estimate the benefit of parallelisation of the algorithm, but (as often the case with CUDA) the static overhead of using the parallel processing might take more time, than parallel processing itself saves.
So, in my opinion, it is usually simpler to write application, measure it's performance and use it to plot Amdahl's curve than trying to a priori correctly estimate all the nuances of algorithm and hardware. In case where such estimations could be easily made, they are usually obvious without any "laws".

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue of how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is a non-sense. But I want to give my audience an element of comparison.
After a few months working with NVidia's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if they have a name for that at Intel). Is it fair ? After all, when looking at an assembly level, those programming models have much in common (at least with GCN, take a look at p2-6 of the ISA manual).
This article states that an Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVidia parlance, how many Cuda-cores does this processor have ? I would say 8 per CPU-core for 32 bits operations, but this is just a guess based on the SIMD width.
Of course there is many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing is working.
PS: All pointers to CPU hardware documentations or CPU/GPU presentations are greatly appreciated !
EDIT:
Thanks for your answers, sadly I had to chose only one of them. I marked Igor's answer because it sticks the most to my initial question and gave me enough informations to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very caution on making this kind of comparison. After all even in the GPU world the term "core" depending on the context has really different capability: the new AMD GCN is quite different from the old VLIW4 one which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with CPU and that's it. If I were you I'd still go for a more detailed (can still be quick) comparison. For instance someone used to CPU and with little knowledge of GPU, might wonder how come a GPU can have so many registers though it's so expensive (in the CPU world). An explanation to that question is given at the end of this post as well as some more comparison GPU vs CPU.
This other article gives a nice comparison between these two kind of processing units by explaining how GPUs work but also how they evolved and showing the differences with CPUs. It addresses topics like data flow, memory hierarchy but also for what kind of applications a GPU is useful. After all the power a GPU can developed is accessible (efficiently) only for some types of problems.
And personally, If I had to make a presentation about GPU and had the possibility to make only one reference to CPU it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus even though it's not related directly to your presentation here is an article that put GPGPU in perspective, showing that some speedup claimed by some people are overrated (this is linked to my last point btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operation/sec that they can perorm. But of course don't compare one cpu core to a multi-core gpu.
A SandyBridge core can perform 2 AVX op/cycles, that is crunch 8 double precision numbers/cycle. Hence, a computer with 16 Sandy-Bridge cores clocked at 2.6 GHz has a peak power of 333 Gflops.
A K20 computing module GK110 has a peak of 1170 Gflops, that is 3.5 times more. This is a fair comparaison in my opinion, and it should be emphasized that the peak performance is much easier to reach on CPU (some applications reach 80%-90% of peak) than on GPU (best cases I know are less than 50% of peak).
So to summerize, I would not go into architecture details, but rather state some shear numbers with the perspective that the peak is often far from reach on GPUs.
It's more fair to compare GPU to vectorized CPU units however if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similar knowledge of vectorized SSE instructions.
For audiences such as these it's important to point out the high level differences, like how blocks of "cores" on the gpu share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

Why is solving system of linear equations using cula(dgesv) slower than mkl (dgesv) for small data sets

I have written a CUDA C and C program to solve a matrix equation Ax=b using CULA routine dgesv and MKL routine dgesv. It seems like for a small data set, the CPU program is faster than the GPU program. But the GPU overcomes the CPU as the data set increases past 500. I am using my dell laptop which has i3 CPU and Geforce 525M GPU. What is the best explanation for the initial slow performance of the GPU?
I wrote another program which takes two vectors, multiplies them and add the result. This is just like the dot product just that the result is a vector sum not a scalar. In this program, the GPU is faster than the CPU even for small data set. I am using the same notebook. Why is the GPU faster in this program even for small data set as compared to the one explained above? Is it because there is not much computation involved in the summation?
It's not uncommon for GPUs to be less interesting on small data sets as compared to large data sets. The reasons for this will vary depending on the specific algorithm. GPUs generally have a higher main memory bandwidth than CPUs and also can usually outperform them for heavy-duty number crunching. But GPUs usually only work well when there is parallelism inherent in the problem, which can be exposed. Taking advantage of this parallelism allows an algorithm to tap into the greater memory bandwidth as well as the higher compute capability.
However, before the GPU can do anything, it's necessary to get the data to the GPU. And this creates a "cost" to the GPU version of the code that will not normally be present in the CPU version.
To be more precise, the GPU will provide a benefit when the reduction in computation time on the GPU (over the CPU) exceeds the cost of the data transfer. I believe that solving a system of linear equations is somewhere between O(n^2) and O(n^3) complexity. For very small n, this computational complexity may not be large enough to offset the cost of data transfer. But clearly as n becomes larger it should. On the other hand your vector operation may only be O(n) complexity. So the benefit scenario will look different.
For the O(n^2) or O(n^3) case, as we move to larger data sets, the "cost" to transfer the data increases as O(n), but the compute requirements for solution increase as O(n^2) (or O(n^3)). Therefore larger data sets should have exponentially larger compute workloads, reducing the effect of the "cost" of the data transfer. An O(n) problem on the other hand, probably won't have this scaling dynamic. The workload increases at the same rate as the "cost" of data transfer.
Also note that if the "cost" of transferring data to the GPU can be hidden by overlapping it with computation work, then the "cost" for the overlapped portion becomes "free", i.e. it does not contribute to the overall solution time.

Amdahl's law and GPU

I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code?
Thanks
For instance, I have a kernel code that I have launched with a number
of threads, say N. So,in the amdahl's law the number of processors
will be N right?
Not exactly. GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 103, N is in range 104 -- 106). However, significant portion of kernel time is (usually) spend just waiting for the data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon you kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
Also, for any CUDA programming using a large number of threads, is it
safe for me to assume that the Amdahl's law is reduced to 1/(1-p)
wherein p stands for the parallel code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
E.g. if you compute the sum of two 100-element vectors on GPU, then initializing of device, data copying, kernel launch overhead etc (serial part) takes much more time than kernel execution (parallel part). However, usually this is not true.
Also, the individual GPU core does not have the same performance as CPU core, so you should do some scaling, making Amdah'l law 1 / [(1-p) + k*p/N] (at it's simplest, k = Frequency(CPU) / Frequency(GPU), sometimes k is increased even more to take into account architectural differences, like CPU core having SIMD block).
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.
First, Amdahl's law assumes that given infinite number of cores the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quanta", or include it in serial portion of algorithm, but it somewhat "breaks" the idea.
How to correctly estimate in Amdahl's law the effect of barrier synchronization, critical section, atomic operations etc. is, to the best of my knowledge, unresolved mystery. Such operations belong to parallel part, but walltime of their execution is at best independent of the number of threads and, at worst, is positively dependent.
Simple example: broadcasting time between computational nodes in CPU cluster scales as O(log N). Some initial initialization can take up to O(N) time.
In simple cases one can somewhat estimate the benefit of parallelisation of the algorithm, but (as often the case with CUDA) the static overhead of using the parallel processing might take more time, than parallel processing itself saves.
So, in my opinion, it is usually simpler to write application, measure it's performance and use it to plot Amdahl's curve than trying to a priori correctly estimate all the nuances of algorithm and hardware. In case where such estimations could be easily made, they are usually obvious without any "laws".

Can you predict the runtime of a CUDA kernel?

To what degree can one predict / calculate the performance of a CUDA kernel?
Having worked a bit with CUDA, this seems non trivial.
But a colleage of mine, who is not working on CUDA, told me, that it cant be hard if you have the memory bandwidth, the number of processors and their speed?
What he said seems not to be consistent with what I read. This is what I could imagine could work. What do you think?
Memory processed
------------------ = runtime for memory bound kernels ?
Memory bandwidth
or
Flops
------------ = runtime for computation bound kernels?
Max GFlops
Such calculation will barely give good prediction. There are many factors that hurt the performance. And those factors interact with each other in a extremely complicated way. So your calculation will give the upper bound of the performance, which is far away from the actual performance (in most cases).
For example, for memory bound kernels, those with a lot cache misses will be different with those with hits. Or those with divergences, those with barriers...
I suggest you to read this paper, which might give you more ideas on the problem: "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness".
Hope it helps.
I think you can predict a best-case with a bit of work. Like you said, with instruction counts, memory bandwidth, input size, etc.
However, predicting the actual or worst-case is much trickier.
First off, there are factors like memory access patterns. Eg: with older CUDA capable cards, you had to pay attention to distribute your global memory accesses so that they wouldn't all contend for a single memory bank. (The newer CUDA cards use a hash between logical and physical addresses to resolve this).
Secondly, there are non-deterministic factors like: how busy is the PCI bus? How busy is the host kernel? Etc.
I suspect the easiest way to get close to actual run-times is basically to run the kernel on subsets of the input and see how long it actually takes.