OpenCL vs CUDA performance on Nvidia's device - cuda

I coded a program to create a color lookup table. I did it in CUDA and OpenCL, from my point of view both programs are pretty much the same, i.e. use the same amount of constant memory, global memory, same loops and branching code, etc.
I measure the running time and CUDA performed slightly better than OpenCL. My question is if using CUDA+NvidiaGPU is faster than OpenCL+NvidiaGPU because CUDA is the native way of programming such GPU?
Could you share some links to info related on this topic?

OpenCL and CUDA are equally fast if they are tweaked correctly for the target architecture. However, tweaking may negatively impact portability.
Links:
http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6047190&tag=1

Related

CUDA vs OpenCL performance comparison

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels for each platform (they differ in the platform specific keywords). They only read and write global memory, each thread different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the configuration for OpenCL - 50,000 global work size and 250 local work size.
The OpenCL code runs faster. Is this possible or am I timing it wrong? My understanding is that the NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this and perhaps some differences between CUDA and OpenCL as implemented by NVIDIA?
Kernels executing on a modern GPU are almost never compute bound, and are almost always memory bandwidth bound. (Because there are so many compute cores running compared to the available path to memory.)
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)

Is "cudaMallocManaged" slower than "cudaMalloc"?

I downloaded CUDA 6.0 RC and tested the new unified memory by using "cudaMallocManaged" in my application.However, I found this kernel is slowed down.
Using cudaMalloc followed by cudaMemcpy is faster (~0.56), compared to cudaMallocManaged (~0.63).Is this expected?
One of the website claims that cudaMallocManged is for "faster prototyping of cuda kernel", so I was wondering which is a better option for application in terms of performance?
Thanks.
cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some are suggested below).
Today's implementation of Unified Memory and cudaMallocManaged will not be faster than intelligently written code written by a proficient CUDA programmer, to do the same thing. The machine (cuda runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIE bus or general machine architectural limitations disappear.
Fast prototyping refers to the time it takes you to write the code, not the speed of the code.
cudaMallocManaged may be of interest to a proficient cuda programmer in the following situations:
You're interested in quickly getting a prototype together -i.e. you don't care about the last ounce of performance.
You are dealing with a complicated data structure which you use infrequently (e.g. a doubly linked list) which would otherwise be a chore to port to CUDA (since deep copies using ordinary CUDA code tend to be a chore). It's necessary for your application to work, but not part of the performance path.
You would ordinarily use zero-copy. There may be situations where using cudaMallocManaged could be faster than a naive or inefficient zero-copy approach.
You are working on a Jetson device.
cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve. (However, note that naive usage of cudaMallocManaged may result in a CUDA kernels running slower than expected, see here and here.)
Although Maxwell is mentioned in the comments, CUDA UM will offer major new features with the Pascal generation of GPUs, in some settings, for some GPUs. In particular, Unified Memory in these settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even when the kernel is running. You can read more about it here.

Do all GPUs use the same architecture?

I have some experience with nVIDIA CUDA and am now thinking about learning openCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as nVIDIA (multi-processors, SIMT stracture, global memory, local memory, registers, cashes, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc) and include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters with identifying hardware differences, the terminology used by CUDA is different than that used by OpenCL, for example, the following are roughly equivalent in meaning:
CUDA: OpenCL:
Thread Work-item
Thread block Work-group
Global memory Global memory
Constant memory Constant memory
Shared memory Local memory
Local memory Private memory
More comparisons and discussion can be found here.
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example: CUDA "warp" == OpenCL "wavefront").

ArrayFire versus raw CUDA programming?

I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.
I tried rewriting my program with ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, < 100% speedup), and the returned results are not quite right (< 1% error compared to CPU routine, assuming the CPU routine's results are correct).
My task is mainly element-wise float-32 maths operations on large matrices (300MB-500MB size), with little if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory since there is a lot of data-reading, etc. The GPU I tested is a GeForce 580GTX with 3GB of video memory.
Is there still some significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there is some memory-access tricks there for faster data-access and reducing bank-conflicts. Does ArrayFire use these general tricks automatically or not?
Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.
First and foremost, code is really required here for anyone to help with specificity. Can you share the code you wrote?
Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that provides you with the ability to write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, time-staking, hand-optimized CUDA code. ArrayFire (and some other GPU libraries like CUBLAS) have many man-years of optimizations poured into them, and are typically going to give better results than most normal people will have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or other libraries). There are variables that can and should be tweaked in the usage of ArrayFire library calls to get the best performance. If you post your code, we can help share some of those here.
Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.
Fourth, yes, ArrayFire uses all the optimizations that are available in the NVIDIA CUDA Programming Guide for (e.g. faster data-transfer and reducing memory bank conflicts like you mention). That's where the bulk of ArrayFire development is focused, on optimizing those sorts of things.
Finally, the data discrepancies you noticed are likely due to that nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU, but rather that they are both working with finite amounts of precision in slightly different ways. If you're using single-precision instead of double, you might consider that. Posting code will let us help on that too.
Happy to expand my answer once code is posted.

Cuda optimization techniques

I have written a CUDA code to solve an NP-Complete problem, but the performance was not as I suspected.
I know about "some" optimization techniques (using shared memroy, textures, zerocopy...)
What are the most important optimization techniques CUDA programmers should know about?
You should read NVIDIA's CUDA Programming Best Practices guide: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf
This has multiple different performance tips with associated "priorities". Here are some of the top priority tips:
Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
Coalesce all memory accesses
Prefer shared memory access to global memory access
Avoid code execution branching within a single warp as this serializes the threads
The new NVIDIA Visual Profiler (v4.1) supports automated performance analysis to identify performance improvement opportunities in your application. It also links directly to the most useful sections of the Best Practices Guide for the issues it detects. And the Visual Profiler is available for free as part of the CUDA Toolkit on NVIDIA's developer web site: http://www.nvidia.com/getcuda.