CUDA optimization techniques

I have written CUDA code to solve an NP-Complete problem, but the performance was not what I expected.
I know about "some" optimization techniques (using shared memory, textures, zero-copy...)
What are the most important optimization techniques CUDA programmers should know about?

You should read NVIDIA's CUDA Programming Best Practices guide: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide.pdf
It contains many performance tips, each with an associated priority. Here are some of the top-priority ones:
Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
Coalesce all memory accesses
Prefer shared memory access to global memory access (this tip and the coalescing tip are illustrated in the sketch after this list)
Avoid code execution branching within a single warp as this serializes the threads
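As a concrete illustration of the coalescing and shared memory tips, here is a minimal sketch (the kernel names and tile size are my own, and the launch is assumed to use TILE_DIM x TILE_DIM thread blocks): a naive matrix transpose whose global memory writes are strided, next to a tiled version that stages data in shared memory so that both reads and writes are coalesced.

```cuda
#include <cuda_runtime.h>

#define TILE_DIM 32   // assumed tile/block size; launch with dim3 block(TILE_DIM, TILE_DIM)

// Naive transpose: reads are coalesced, but writes stride through global memory.
__global__ void transposeNaive(float *out, const float *in, int width, int height)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}

// Tiled transpose: stage a tile in shared memory, then write it out transposed so
// consecutive threads write consecutive global addresses (coalesced).
__global__ void transposeTiled(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // barrier outside any divergent branch

    // Swap block indices for the output tile; threads still move along threadIdx.x,
    // so the writes are coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```

The arithmetic in the two kernels is identical; any difference in runtime comes entirely from the memory access pattern.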

The new NVIDIA Visual Profiler (v4.1) supports automated performance analysis to identify performance improvement opportunities in your application. It also links directly to the most useful sections of the Best Practices Guide for the issues it detects. And the Visual Profiler is available for free as part of the CUDA Toolkit on NVIDIA's developer web site: http://www.nvidia.com/getcuda.

Related

CUDA vs OpenCL performance comparison

I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels, one for each platform (they differ only in the platform-specific keywords). They only read and write global memory, each thread at a different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the OpenCL configuration: a global work size of 50,000 and a local work size of 250.
The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this, and perhaps point out some differences between CUDA and OpenCL as implemented by NVIDIA.
Kernels executing on a modern GPU are almost never compute bound; they are almost always memory bandwidth bound, because there are so many compute cores relative to the available path to memory.
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)
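On the "am I timing it wrong?" point: a common way to rule out measurement error on the CUDA side is to time the kernel with CUDA events and convert the bytes it moves into an achieved bandwidth that you can compare against the device's peak. A minimal sketch, with a placeholder kernel and problem size standing in for the ones in the question:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads and writes one element of global memory,
// mirroring the access pattern described in the question.
__global__ void copyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                 // placeholder problem size
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // events time the GPU work, not host overhead

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write per element; compare against the device's peak bandwidth.
    double gbps = 2.0 * bytes / (ms * 1.0e6);
    printf("kernel: %.3f ms, effective bandwidth: %.1f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If both the CUDA and OpenCL versions land close to the device's peak bandwidth, the remaining gap is most plausibly the kind of toolchain and access-pattern difference described above.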

Is "cudaMallocManaged" slower than "cudaMalloc"?

I downloaded the CUDA 6.0 RC and tested the new unified memory by using "cudaMallocManaged" in my application. However, I found that the kernel is slowed down.
Using cudaMalloc followed by cudaMemcpy is faster (~0.56) compared to cudaMallocManaged (~0.63). Is this expected?
One website claims that cudaMallocManaged is for "faster prototyping of CUDA kernels", so I was wondering which is the better option for an application in terms of performance?
Thanks.
cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some are suggested below).
Today's implementation of Unified Memory and cudaMallocManaged will not be faster than code intelligently written by a proficient CUDA programmer to do the same thing. The machine (the CUDA runtime) is not smarter than you are as a programmer, and cudaMallocManaged does not magically make the PCIe bus or general machine architectural limitations disappear.
Fast prototyping refers to the time it takes you to write the code, not the speed of the code.
cudaMallocManaged may be of interest to a proficient cuda programmer in the following situations:
You're interested in quickly getting a prototype together, i.e. you don't care about the last ounce of performance.
You are dealing with a complicated data structure which you use infrequently (e.g. a doubly linked list) which would otherwise be a chore to port to CUDA (since deep copies using ordinary CUDA code tend to be a chore). It's necessary for your application to work, but not part of the performance path.
You would ordinarily use zero-copy. There may be situations where using cudaMallocManaged could be faster than a naive or inefficient zero-copy approach.
You are working on a Jetson device.
cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve. (However, note that naive usage of cudaMallocManaged may result in CUDA kernels running slower than expected; see here and here.)
Although Maxwell is mentioned in the comments, CUDA Unified Memory will offer major new features with the Pascal generation of GPUs, in some settings. In particular, Unified Memory in those settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even while a kernel is running. You can read more about it here.
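For reference, here is a minimal sketch of the two approaches being compared in the question (the kernel and sizes are made up). The managed version is less code to write, which is the "fast prototyping" benefit, but the runtime decides when and how pages migrate, whereas the explicit version moves the data in one deliberate bulk copy:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to compare the two allocation styles.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Explicit style: cudaMalloc + cudaMemcpy, with the programmer choosing when data
// crosses the PCIe bus (one bulk transfer each way).
void runExplicit(float *host, int n)
{
    size_t bytes = n * sizeof(float);
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(host, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
}

// Managed style: one pointer valid on host and device; the runtime migrates pages
// on demand, which is convenient but not necessarily faster.
void runManaged(float *host, int n)
{
    size_t bytes = n * sizeof(float);
    float *m;
    cudaMallocManaged(&m, bytes);
    for (int i = 0; i < n; ++i) m[i] = host[i];   // host touches the pages directly
    scale<<<(n + 255) / 256, 256>>>(m, n);
    cudaDeviceSynchronize();                      // required before the host reads again
    for (int i = 0; i < n; ++i) host[i] = m[i];
    cudaFree(m);
}
```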

Do all GPUs use the same architecture?

I have some experience with NVIDIA CUDA and am now thinking about learning OpenCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as NVIDIA's (multiprocessors, SIMT structure, global memory, local memory, registers, caches, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc) and include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
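As a small illustration of how much the hardware details vary even within one vendor's lineup, you can query them at runtime. A minimal sketch using the CUDA runtime API (OpenCL exposes comparable queries through clGetDeviceInfo):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Print a few of the properties that differ between GPU models.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("%s: compute capability %d.%d, %d multiprocessors, "
               "%zu KB shared memory per block, warp size %d\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount,
               prop.sharedMemPerBlock / 1024, prop.warpSize);
    }
    return 0;
}
```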
To complicate matters further when identifying hardware differences, the terminology used by CUDA differs from that used by OpenCL. For example, the following terms are roughly equivalent in meaning:
CUDA:              OpenCL:
Thread             Work-item
Thread block       Work-group
Global memory      Global memory
Constant memory    Constant memory
Shared memory      Local memory
Local memory       Private memory
More comparisons and discussion can be found here.
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts go by different names (for example, a CUDA "warp" corresponds to what AMD's OpenCL documentation calls a "wavefront").
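To make the mapping concrete, here is a tiny CUDA kernel with the (roughly) equivalent OpenCL constructs noted in comments; the kernel itself is just an illustrative placeholder and assumes blocks of at most 256 threads:

```cuda
// CUDA kernel annotated with the OpenCL equivalents of each construct.
__global__ void addOne(float *data, int n)             // OpenCL: a __kernel function
{
    __shared__ float tile[256];                        // OpenCL: __local memory

    int local  = threadIdx.x;                          // OpenCL: get_local_id(0)  (work-item id in work-group)
    int global = blockIdx.x * blockDim.x + threadIdx.x;// OpenCL: get_global_id(0)

    if (global < n)
        tile[local] = data[global];                    // stage the value in on-chip memory
    __syncthreads();                                   // OpenCL: barrier(CLK_LOCAL_MEM_FENCE)
    if (global < n)
        data[global] = tile[local] + 1.0f;
}
```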

OpenCL vs CUDA performance on Nvidia's device

I coded a program to create a color lookup table. I implemented it in both CUDA and OpenCL; from my point of view the two programs are pretty much the same, i.e. they use the same amount of constant memory and global memory, the same loops and branching code, etc.
I measured the running time, and CUDA performed slightly better than OpenCL. My question is whether CUDA + NVIDIA GPU is faster than OpenCL + NVIDIA GPU because CUDA is the native way of programming such GPUs.
Could you share some links to information related to this topic?
OpenCL and CUDA are equally fast if they are tweaked correctly for the target architecture. However, tweaking may negatively impact portability.
Links:
http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6047190&tag=1
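For concreteness, the kind of constant-memory lookup-table kernel the question describes might look like the following in CUDA (the table size and names are made up); note that constant memory is cached and is most efficient when the threads of a warp read the same entry:

```cuda
#include <cuda_runtime.h>

#define LUT_SIZE 256   // assumed table size

// Read-only color table placed in constant memory.
__constant__ uchar4 d_lut[LUT_SIZE];

// Hypothetical kernel: map each 8-bit input pixel to an RGBA color via the table.
__global__ void applyLut(uchar4 *out, const unsigned char *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = d_lut[in[i]];
}

// Host side: fill the table and upload it once before launching the kernel.
void uploadLut(const uchar4 *hostTable)
{
    cudaMemcpyToSymbol(d_lut, hostTable, LUT_SIZE * sizeof(uchar4));
}
```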

Best way of using CUDA

There are several ways of using CUDA:
auto-parallelizing tools such as PGI Workstation;
wrappers such as Thrust (in STL style);
the NVIDIA GPU SDK (runtime/driver API).
Which one is better in terms of performance, learning curve, or other factors?
Any suggestions?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, it will be next to impossible to beat the performance of hand-rolled code written with the GPU SDK, using all the tricks in the book, because of the control it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95% or better of the efficiency of hand-rolled CUDA. The reductions, scans, and many useful iterators it provides apply to a wide class of problems too.
Auto-parallelizing tools tend not to do quite as good a job with the different memory types, as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
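As a minimal sketch of that Thrust-first workflow (the data and operations are made up for illustration), the library gives you transforms, reductions, and scans without writing any kernels by hand:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);   // data lives on the device
    thrust::device_vector<float> y(n);

    // Element-wise transform (runs as a CUDA kernel under the hood).
    thrust::transform(x.begin(), x.end(), y.begin(), thrust::negate<float>());

    // Reduction and inclusive prefix sum, both tuned by NVIDIA.
    float sum = thrust::reduce(x.begin(), x.end(), 0.0f, thrust::plus<float>());
    thrust::inclusive_scan(x.begin(), x.end(), y.begin());

    printf("sum = %f, last prefix = %f\n", sum, (float)y[n - 1]);
    return 0;
}
```

Anything that does not map cleanly onto these primitives can then be written as a plain CUDA kernel operating on raw pointers obtained from the same device vectors.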
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application; there are great articles about this on the web, and a texture-memory sketch follows below.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
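Since the texture path comes up less often in introductory material, here is a minimal sketch of reading a linear buffer through the texture cache using the texture object API (available since CUDA 5.0); the sizes and names are placeholders and the kernel launch is elided:

```cuda
#include <cuda_runtime.h>

// Hypothetical gather kernel: indirect reads go through the texture cache.
__global__ void gather(float *out, cudaTextureObject_t tex, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, idx[i]);
}

int main()
{
    const int n = 1 << 20;   // placeholder size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Describe the underlying linear buffer...
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_data;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    // ...and how it should be sampled.
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &texDesc, nullptr);

    // (allocate idx/out and launch gather<<<...>>> here)

    cudaDestroyTextureObject(tex);
    cudaFree(d_data);
    return 0;
}
```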
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK samples; I personally, as well as many others, have run into problems and bugs in them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust offers, like reduction and prefix sum, then Thrust is definitely worth a try, and I bet you can't write the code faster yourself in pure CUDA C.
However, if you're porting already-parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I have already had successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know you're launching new kernels for each Thrust call; if you want to have everything in one big fat kernel (taking too-frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" into the equation.