I am a beginner on CUDA. Now I am calculating the number of clock cycles per one instruction (e.g. addition). In https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions, it only gives the instruction throughput for different arithmetic operations. For example, the throughput in 7.x is 64 for 32-bit floating-point add. So, can i take 64/32=2 as the number of clock cycles per one instruction? If not, how can i calculate it?
In the general case, the CUDA documentation does not give you enough information to calculate the number of clock cycles that a particular instruction requires. This would be related to the pipeline depth for the instruction (i.e. for the functional unit servicing that instruction) and this is not documented. The throughput table is largely useless for this exercise.
This is one reason why you will find various microbenchmarking papers for CUDA. Here is one such example.
It has to be measured empirically (and carefully), for each architecture of interest, and for each SASS instruction of interest; it is not documented.
Related
I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code?
Thanks
For instance, I have a kernel code that I have launched with a number
of threads, say N. So,in the amdahl's law the number of processors
will be N right?
Not exactly. GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 103, N is in range 104 -- 106). However, significant portion of kernel time is (usually) spend just waiting for the data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon you kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
Also, for any CUDA programming using a large number of threads, is it
safe for me to assume that the Amdahl's law is reduced to 1/(1-p)
wherein p stands for the parallel code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
E.g. if you compute the sum of two 100-element vectors on GPU, then initializing of device, data copying, kernel launch overhead etc (serial part) takes much more time than kernel execution (parallel part). However, usually this is not true.
Also, the individual GPU core does not have the same performance as CPU core, so you should do some scaling, making Amdah'l law 1 / [(1-p) + k*p/N] (at it's simplest, k = Frequency(CPU) / Frequency(GPU), sometimes k is increased even more to take into account architectural differences, like CPU core having SIMD block).
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.
First, Amdahl's law assumes that given infinite number of cores the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quanta", or include it in serial portion of algorithm, but it somewhat "breaks" the idea.
How to correctly estimate in Amdahl's law the effect of barrier synchronization, critical section, atomic operations etc. is, to the best of my knowledge, unresolved mystery. Such operations belong to parallel part, but walltime of their execution is at best independent of the number of threads and, at worst, is positively dependent.
Simple example: broadcasting time between computational nodes in CPU cluster scales as O(log N). Some initial initialization can take up to O(N) time.
In simple cases one can somewhat estimate the benefit of parallelisation of the algorithm, but (as often the case with CUDA) the static overhead of using the parallel processing might take more time, than parallel processing itself saves.
So, in my opinion, it is usually simpler to write application, measure it's performance and use it to plot Amdahl's curve than trying to a priori correctly estimate all the nuances of algorithm and hardware. In case where such estimations could be easily made, they are usually obvious without any "laws".
I wanted to ask.We say that using --ptxas-options=-v doesn't give the exact number of registers that our program uses.
1) Then , how am I going to supply the occupancu calculator with registers per thread and shared memory per block?
2) In my program I use also thrust calls which generate ptx code.I am having 2 kernels but I can see the thrust functions to produce ptx as well.So , I am taking into account these numbers also when I am counting the total number of registers I use? (I think yes!)
(the same applies for the shared memory)
1) Then , how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here) I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within a granular amount of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
2) In my program I use also thrust calls which generate ptx code.I am having 2 kernels but I can see the thrust functions to produce ptx as well.So , I am taking into account these numbers also when I am counting the total number of registers I use? (I think yes!) (the same applies for the shared memory)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels, to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).
I have a presentation to make to people who have (almost) no clue of how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is a non-sense. But I want to give my audience an element of comparison.
After a few months working with NVidia's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if they have a name for that at Intel). Is it fair ? After all, when looking at an assembly level, those programming models have much in common (at least with GCN, take a look at p2-6 of the ISA manual).
This article states that an Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVidia parlance, how many Cuda-cores does this processor have ? I would say 8 per CPU-core for 32 bits operations, but this is just a guess based on the SIMD width.
Of course there is many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing is working.
PS: All pointers to CPU hardware documentations or CPU/GPU presentations are greatly appreciated !
EDIT:
Thanks for your answers, sadly I had to chose only one of them. I marked Igor's answer because it sticks the most to my initial question and gave me enough informations to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very caution on making this kind of comparison. After all even in the GPU world the term "core" depending on the context has really different capability: the new AMD GCN is quite different from the old VLIW4 one which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with CPU and that's it. If I were you I'd still go for a more detailed (can still be quick) comparison. For instance someone used to CPU and with little knowledge of GPU, might wonder how come a GPU can have so many registers though it's so expensive (in the CPU world). An explanation to that question is given at the end of this post as well as some more comparison GPU vs CPU.
This other article gives a nice comparison between these two kind of processing units by explaining how GPUs work but also how they evolved and showing the differences with CPUs. It addresses topics like data flow, memory hierarchy but also for what kind of applications a GPU is useful. After all the power a GPU can developed is accessible (efficiently) only for some types of problems.
And personally, If I had to make a presentation about GPU and had the possibility to make only one reference to CPU it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus even though it's not related directly to your presentation here is an article that put GPGPU in perspective, showing that some speedup claimed by some people are overrated (this is linked to my last point btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operation/sec that they can perorm. But of course don't compare one cpu core to a multi-core gpu.
A SandyBridge core can perform 2 AVX op/cycles, that is crunch 8 double precision numbers/cycle. Hence, a computer with 16 Sandy-Bridge cores clocked at 2.6 GHz has a peak power of 333 Gflops.
A K20 computing module GK110 has a peak of 1170 Gflops, that is 3.5 times more. This is a fair comparaison in my opinion, and it should be emphasized that the peak performance is much easier to reach on CPU (some applications reach 80%-90% of peak) than on GPU (best cases I know are less than 50% of peak).
So to summerize, I would not go into architecture details, but rather state some shear numbers with the perspective that the peak is often far from reach on GPUs.
It's more fair to compare GPU to vectorized CPU units however if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similar knowledge of vectorized SSE instructions.
For audiences such as these it's important to point out the high level differences, like how blocks of "cores" on the gpu share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.
I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, I have a kernel code that I have launched with a number of threads, say N. So,in the amdahl's law the number of processors will be N right? Also, for any CUDA programming using a large number of threads, is it safe for me to assume that the Amdahl's law is reduced to 1/(1-p) wherein p stands for the parallel code?
Thanks
For instance, I have a kernel code that I have launched with a number
of threads, say N. So,in the amdahl's law the number of processors
will be N right?
Not exactly. GPU does not have as many physical cores (K) as the number of threads you can launch (N) (usually, K is around 103, N is in range 104 -- 106). However, significant portion of kernel time is (usually) spend just waiting for the data to be read/written from/to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than K, but actually depends upon you kernel function.
In my opinion, the best way to determine this N0 is to experimentally measure performance of your application and then use this data to fit parameters of Amdahl's law :)
Also, for any CUDA programming using a large number of threads, is it
safe for me to assume that the Amdahl's law is reduced to 1/(1-p)
wherein p stands for the parallel code?
This assumption basically means that you neglect the time for the parallel part of your code (it is executed infinitely fast) and only consider time for serial part.
E.g. if you compute the sum of two 100-element vectors on GPU, then initializing of device, data copying, kernel launch overhead etc (serial part) takes much more time than kernel execution (parallel part). However, usually this is not true.
Also, the individual GPU core does not have the same performance as CPU core, so you should do some scaling, making Amdah'l law 1 / [(1-p) + k*p/N] (at it's simplest, k = Frequency(CPU) / Frequency(GPU), sometimes k is increased even more to take into account architectural differences, like CPU core having SIMD block).
I could also argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.
First, Amdahl's law assumes that given infinite number of cores the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quanta", or include it in serial portion of algorithm, but it somewhat "breaks" the idea.
How to correctly estimate in Amdahl's law the effect of barrier synchronization, critical section, atomic operations etc. is, to the best of my knowledge, unresolved mystery. Such operations belong to parallel part, but walltime of their execution is at best independent of the number of threads and, at worst, is positively dependent.
Simple example: broadcasting time between computational nodes in CPU cluster scales as O(log N). Some initial initialization can take up to O(N) time.
In simple cases one can somewhat estimate the benefit of parallelisation of the algorithm, but (as often the case with CUDA) the static overhead of using the parallel processing might take more time, than parallel processing itself saves.
So, in my opinion, it is usually simpler to write application, measure it's performance and use it to plot Amdahl's curve than trying to a priori correctly estimate all the nuances of algorithm and hardware. In case where such estimations could be easily made, they are usually obvious without any "laws".
I just implemented an algorithm on the GPU that computes the difference btw consecutive indices of an array. I compared it with a CPU based implementation and noticed that for large sized array, the GPU based implementation performs faster.
I am curious WHY does the GPU based implementation perform faster. Please note that i know the surface reasoning that a GPU has several cores and can thus do the operation is parallel i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone tell me a deeper reason as to why GPU's perform faster. What is so different about their architecture that they can beat a CPU based implementation
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There are no obvious way to compare a CPU versus a GPU. You can only discuss whether (and why) the CPU implementation A of an algorithm is faster or slower than a GPU implementation B of this algorithm.
This ended up kind of vague, so a tip of an iceberg of concrete reasons would be: The strong side of CPU is random memory access, branch prediction, etc. GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice ratio of compute-to-memory-access. SIMD makes GPU implementations slower than CPU where there's a lot of unpredictable braching to many code paths, for example.
The real reason is that a GPU not only has several cores, but it has many cores, typically hundreds of them! Each GPU core however is much slower than a low-end CPU.
But the programming mode is not at all like multi-cores CPUs. So most programs cannot be ported to or take benefit from GPUs.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPU's and GPU's differ in performance so much for certain problems is design decisions made on how to allocate the chip's resources. CPU's devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated.) By contrast, GPU's devote their chip space to packing as many floating-point ALU's on the chip as they can possibly get away with. The original purpose of GPU's was to multiply matricies as fast as possible (because that is the primary type of computation involved in graphics rendering.) Since matrix multiplication is an embarrasingly parallel problem (e.g. each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALU's follow the instructions decoded by a single instruction decoder, since they're all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores. Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPU's are implemented such that each of these are single-cycle operations and, in fact, even contain a fused multiply-and-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off here is that the chip space is devoted to the floating-point math hardware and other instructions (such as control flow) are often much slower per core than on a CPU or sometimes even just don't exist on a GPU at all.
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.