CUDA: Differences between HtoD and DtoH bandwidth - cuda

Yet another bandwidth related question. I expected the plots of Device-to-host bandwidth and that of Host-to-Device to be similar, but I see that there is a significant difference between the two. Considering both following the same route, so the effective bandwidth should be the same, isn't it? The testbed consists of total 12 Intel Westmere CPUs on two sockets, 4 Tesla C2050 GPUs with 4 PCIe Gen2 Express slots. Using the bandwidthtest program from NVidia code samples.
What are the overheads of doing a cudamemCpy from the host vs the device?

First, I would say those two curves are similar. I can honestly say that I've never seen symmetric PCI-e bandwidth on any system I have used -- and that includes both CUDA and graphics (OpenGL/D3D) tests, so I don't think it's something (especially this small difference) that should concern you.
As with your other PCI-e bandwidth question, the answer is similar -- the driver may use different strategies for different types and sizes of transfers, attempting to get the highest throughput possible.
Actual throughput depends on many factors, including the type of GPU, and especially on the host chipset in use.

Related

CUDA vs OpenCL performance comparison

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels for each platform (they differ in the platform specific keywords). They only read and write global memory, each thread different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the configuration for OpenCL - 50,000 global work size and 250 local work size.
The OpenCL code runs faster. Is this possible or am I timing it wrong? My understanding is that the NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this and perhaps some differences between CUDA and OpenCL as implemented by NVIDIA?
Kernels executing on a modern GPU are almost never compute bound, and are almost always memory bandwidth bound. (Because there are so many compute cores running compared to the available path to memory.)
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)

Do all GPUs use the same architecture?

I have some experience with nVIDIA CUDA and am now thinking about learning openCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as nVIDIA (multi-processors, SIMT stracture, global memory, local memory, registers, cashes, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc) and include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters with identifying hardware differences, the terminology used by CUDA is different than that used by OpenCL, for example, the following are roughly equivalent in meaning:
CUDA: OpenCL:
Thread Work-item
Thread block Work-group
Global memory Global memory
Constant memory Constant memory
Shared memory Local memory
Local memory Private memory
More comparisons and discussion can be found here.
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example: CUDA "warp" == OpenCL "wavefront").

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue of how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is a non-sense. But I want to give my audience an element of comparison.
After a few months working with NVidia's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if they have a name for that at Intel). Is it fair ? After all, when looking at an assembly level, those programming models have much in common (at least with GCN, take a look at p2-6 of the ISA manual).
This article states that an Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVidia parlance, how many Cuda-cores does this processor have ? I would say 8 per CPU-core for 32 bits operations, but this is just a guess based on the SIMD width.
Of course there is many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing is working.
PS: All pointers to CPU hardware documentations or CPU/GPU presentations are greatly appreciated !
EDIT:
Thanks for your answers, sadly I had to chose only one of them. I marked Igor's answer because it sticks the most to my initial question and gave me enough informations to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very caution on making this kind of comparison. After all even in the GPU world the term "core" depending on the context has really different capability: the new AMD GCN is quite different from the old VLIW4 one which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with CPU and that's it. If I were you I'd still go for a more detailed (can still be quick) comparison. For instance someone used to CPU and with little knowledge of GPU, might wonder how come a GPU can have so many registers though it's so expensive (in the CPU world). An explanation to that question is given at the end of this post as well as some more comparison GPU vs CPU.
This other article gives a nice comparison between these two kind of processing units by explaining how GPUs work but also how they evolved and showing the differences with CPUs. It addresses topics like data flow, memory hierarchy but also for what kind of applications a GPU is useful. After all the power a GPU can developed is accessible (efficiently) only for some types of problems.
And personally, If I had to make a presentation about GPU and had the possibility to make only one reference to CPU it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus even though it's not related directly to your presentation here is an article that put GPGPU in perspective, showing that some speedup claimed by some people are overrated (this is linked to my last point btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operation/sec that they can perorm. But of course don't compare one cpu core to a multi-core gpu.
A SandyBridge core can perform 2 AVX op/cycles, that is crunch 8 double precision numbers/cycle. Hence, a computer with 16 Sandy-Bridge cores clocked at 2.6 GHz has a peak power of 333 Gflops.
A K20 computing module GK110 has a peak of 1170 Gflops, that is 3.5 times more. This is a fair comparaison in my opinion, and it should be emphasized that the peak performance is much easier to reach on CPU (some applications reach 80%-90% of peak) than on GPU (best cases I know are less than 50% of peak).
So to summerize, I would not go into architecture details, but rather state some shear numbers with the perspective that the peak is often far from reach on GPUs.
It's more fair to compare GPU to vectorized CPU units however if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similar knowledge of vectorized SSE instructions.
For audiences such as these it's important to point out the high level differences, like how blocks of "cores" on the gpu share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

Why do GPU based algorithms perform faster

I just implemented an algorithm on the GPU that computes the difference btw consecutive indices of an array. I compared it with a CPU based implementation and noticed that for large sized array, the GPU based implementation performs faster.
I am curious WHY does the GPU based implementation perform faster. Please note that i know the surface reasoning that a GPU has several cores and can thus do the operation is parallel i.e., instead of visiting each index sequentially, we can assign a thread to compute the difference for each index.
But can someone tell me a deeper reason as to why GPU's perform faster. What is so different about their architecture that they can beat a CPU based implementation
They don't perform faster, generally.
The point is: Some algorithms fit better into a CPU, some fit better into a GPU.
The execution model of GPUs differs (see SIMD), the memory model differs, the instruction set differs... The whole architecture is different.
There are no obvious way to compare a CPU versus a GPU. You can only discuss whether (and why) the CPU implementation A of an algorithm is faster or slower than a GPU implementation B of this algorithm.
This ended up kind of vague, so a tip of an iceberg of concrete reasons would be: The strong side of CPU is random memory access, branch prediction, etc. GPU excels when there's a high amount of computation with high data locality, so that your implementation can achieve a nice ratio of compute-to-memory-access. SIMD makes GPU implementations slower than CPU where there's a lot of unpredictable braching to many code paths, for example.
The real reason is that a GPU not only has several cores, but it has many cores, typically hundreds of them! Each GPU core however is much slower than a low-end CPU.
But the programming mode is not at all like multi-cores CPUs. So most programs cannot be ported to or take benefit from GPUs.
While some answers have already been given here and this is an old thread, I just thought I'd add this for posterity and what not:
The main reason that CPU's and GPU's differ in performance so much for certain problems is design decisions made on how to allocate the chip's resources. CPU's devote much of their chip space to large caches, instruction decoders, peripheral and system management, etc. Their cores are much more complicated and run at much higher clock rates (which produces more heat per core that must be dissipated.) By contrast, GPU's devote their chip space to packing as many floating-point ALU's on the chip as they can possibly get away with. The original purpose of GPU's was to multiply matricies as fast as possible (because that is the primary type of computation involved in graphics rendering.) Since matrix multiplication is an embarrasingly parallel problem (e.g. each output value is computed completely independently of every other output value) and the code path for each of those computations is identical, chip space can be saved by having several ALU's follow the instructions decoded by a single instruction decoder, since they're all performing the same operations at the same time. By contrast, each of a CPU's cores must have its own separate instruction decoder since the cores are not following identical code paths, which makes each of a CPU's cores much larger on the die than a GPU's cores. Since the primary computations performed in matrix multiplication are floating-point multiplication and floating-point addition, GPU's are implemented such that each of these are single-cycle operations and, in fact, even contain a fused multiply-and-add instruction that multiplies two numbers and adds the result to a third number in a single cycle. This is much faster than a typical CPU, where floating-point multiplication is often a many-cycle operation. Again, the trade-off here is that the chip space is devoted to the floating-point math hardware and other instructions (such as control flow) are often much slower per core than on a CPU or sometimes even just don't exist on a GPU at all.
Also, since GPU cores run at much lower clock rates than typical CPU cores and don't contain as much complicated circuitry, they don't produce as much heat per core (or use as much power per core.) This allows more of them to be packed into the same space without overheating the chip and also allows a GPU with 1,000+ cores to have similar power and cooling requirements to a CPU with only 4 or 8 cores.

What are the measured latency values of an memory hierarchy in a example CUDA capable device?

Even though this question is similar to mine, there still isn't any published latency values for the different types. I'd appreciate an actual measurement and an explanation of the methods and reasoning for their approach. Any CUDA capable discrete NVidia card would be ideal.
Things to measure:
Register
Shared Memory
Constant Cache Hit
Device Memory
Global Memory
This paper is pretty much the gold standard benchmarking example for a CUDA GPU. It exposes most of the information you are interested in by very thorough micro benchmarking, using the Tesla C1060/GTX 285 "GT200" class GPU.