If I want to rewrite my application so that it leverages the power of NVIDIA's CUDA SDK, are there any differences at all in runtime performance between the different SDK offerings: C++, Java, Python?
Is there any difference at all between these three SDKs, besides the obvious one of the language being used?
There will be a measurable performance impact on the CPU-bound portions of your processing. For instance, if your CUDA data requires pre-processing before reaching the GPU, writing that numerical routine in Python would be suboptimal.
If your CUDA routines dominate the computation time (the CPU remains relatively idle), any of the bindings is a good choice.
It may be best to prototype first in a language such as Python and, if you identify a performance bottleneck, move that code to C++.
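One way to act on this advice is to time the CPU-side and GPU-side stages separately before porting anything. The sketch below is a minimal illustration in plain Python; `preprocess` and `gpu_stage` are hypothetical stand-ins (the latter only simulates work on the CPU) and are not part of any CUDA binding.

```python
import time


def time_stages(stages, data):
    """Run each (name, function) stage in order and record its wall-clock time."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def preprocess(values):
    # Hypothetical CPU-bound pre-processing step (pure Python, hence slow).
    return [v * 0.5 + 1.0 for v in values]


def gpu_stage(values):
    # Hypothetical placeholder for a call into a CUDA binding.
    return sum(values)


result, timings = time_stages(
    [("preprocess", preprocess), ("gpu_stage", gpu_stage)],
    list(range(100_000)),
)
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s ({100 * seconds / total:.0f}% of total)")
```

If the pre-processing stage turns out to take most of the time, that stage is the candidate for a move to C++; if the GPU stage dominates, the choice of binding matters much less.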
Related
I downloaded CUDA 6.0 RC and tested the new unified memory by using "cudaMallocManaged" in my application. However, I found that my kernel slowed down.
Using cudaMalloc followed by cudaMemcpy is faster (~0.56), compared to cudaMallocManaged (~0.63). Is this expected?
One website claims that cudaMallocManaged is for "faster prototyping of cuda kernel", so I was wondering which is the better option for an application in terms of performance?
Thanks.
cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some are suggested below).
Today's implementation of Unified Memory and cudaMallocManaged will not be faster than code intelligently written by a proficient CUDA programmer to do the same thing. The machine (the CUDA runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIe bus or general machine architectural limitations disappear.
Fast prototyping refers to the time it takes you to write the code, not the speed of the code.
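To see why managed memory cannot hide the bus, it can help to estimate the transfer time by hand. The following is a back-of-envelope sketch in Python; the buffer size, bandwidth, and kernel time are made-up placeholder values, not measurements of any real device, and the model assumes no overlap between copies and compute.

```python
def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to move num_bytes across a link."""
    return num_bytes / bandwidth_bytes_per_s


def end_to_end_seconds(num_bytes_in, num_bytes_out, kernel_seconds, bandwidth_bytes_per_s):
    """Copy in, run the kernel, copy out, with no overlap between the steps."""
    return (
        transfer_seconds(num_bytes_in, bandwidth_bytes_per_s)
        + kernel_seconds
        + transfer_seconds(num_bytes_out, bandwidth_bytes_per_s)
    )


# Placeholder numbers, assumed only for illustration.
buffer_bytes = 512 * 1024 * 1024      # 512 MiB each way
assumed_bandwidth = 8 * 1024 ** 3     # 8 GiB/s, hypothetical link speed
assumed_kernel_seconds = 0.05         # hypothetical kernel run time

total = end_to_end_seconds(
    buffer_bytes, buffer_bytes, assumed_kernel_seconds, assumed_bandwidth
)
print(f"estimated total: {total:.3f} s")
```

Whichever allocation call is used, the data still has to cross the same link, so the transfer terms do not go away.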
cudaMallocManaged may be of interest to a proficient CUDA programmer in the following situations:
You're interested in quickly getting a prototype together, i.e. you don't care about the last ounce of performance.
You are dealing with a complicated data structure which you use infrequently (e.g. a doubly linked list) which would otherwise be a chore to port to CUDA (since deep copies using ordinary CUDA code tend to be a chore). It's necessary for your application to work, but not part of the performance path.
You would ordinarily use zero-copy. There may be situations where using cudaMallocManaged could be faster than a naive or inefficient zero-copy approach.
You are working on a Jetson device.
cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve. (However, note that naive usage of cudaMallocManaged may result in CUDA kernels running slower than expected, see here and here.)
Although Maxwell is mentioned in the comments, CUDA UM will offer major new features with the Pascal generation of GPUs, in some settings, for some GPUs. In particular, Unified Memory in these settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even when the kernel is running. You can read more about it here.
I have code doing a lot of operations with objects which can be represented as arrays.
When does it make sense to use GPGPU environments (like CUDA) in an application? Can I predict performance gains before writing real code?
Whether it pays off depends on a number of factors. Elementwise independent operations on large arrays/matrices are a good candidate.
For your particular problem (machine learning/fuzzy logic), I would recommend reading some related documents, such as
Large Scale Machine Learning using NVIDIA CUDA
and
Fuzzy Logic-Based Image Processing Using Graphics Processor Units
to get a feel for the speedups achieved by other people.
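For a rough prediction before writing any GPU code, Amdahl's law gives an upper bound on the overall speedup when only part of the runtime is accelerated. The sketch below is a minimal Python version; the 90% fraction and the 20x figure are assumptions chosen purely for illustration, and the model ignores host-device transfer time.

```python
def amdahl_speedup(parallel_fraction, accelerated_speedup):
    """Overall speedup when only parallel_fraction of the runtime is accelerated."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accelerated_speedup <= 0:
        raise ValueError("accelerated_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerated_speedup)


# Hypothetical profile: 90% of the runtime is in array operations that
# might run 20x faster on a GPU (both numbers are assumptions).
print(f"{amdahl_speedup(0.90, 20.0):.2f}x overall")
```

The parallel fraction has to come from profiling the existing CPU code; the per-kernel speedup can only be guessed from published results for similar workloads until a prototype exists.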
As already mentioned, you should specify your problem. However, if large parts of your code involve operations on your objects that are independent, in the sense that the operation on object n does not have to wait for the results of the operations on objects 0 to n-1, GPUs may enhance performance.
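The distinction between independent and dependent operations can be shown with a small Python sketch (plain CPU code; `scale_and_offset` is a hypothetical example function).

```python
from itertools import accumulate


def scale_and_offset(x):
    # Each output depends only on its own input: independent work items.
    return 2.0 * x + 1.0


values = [1.0, 2.0, 3.0, 4.0]

# Independent: the elements can be processed in any order (or all at once).
in_order = [scale_and_offset(v) for v in values]
reverse_order = [scale_and_offset(v) for v in reversed(values)][::-1]

# Dependent: written as a loop, element n needs the result for element n-1.
running_total = list(accumulate(values))

print(in_order)
print(running_total)
```

The first computation gives the same result whatever order the elements are visited in, which is what makes it a natural fit for one GPU thread per element. The running total is not hopeless on a GPU, since parallel scan algorithms exist, but it needs a different algorithm rather than a direct port of the loop.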
You could visit CUDA Zone to get a general idea of what CUDA can do, and what it does better than a CPU.
https://developer.nvidia.com/category/zone/cuda-zone
CUDA already comes with many performance libraries, tools, and a surrounding ecosystem that reduce the development effort. Browsing them can also help you understand what kinds of operations CUDA is good at.
https://developer.nvidia.com/cuda-tools-ecosystem
Furthermore, NVIDIA provides a benchmark report on some of the most common and representative operations. You can check whether your code could benefit from them.
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/CUDA_5.0_Math_Libraries_Performance.pdf
I coded a program to create a color lookup table. I did it in both CUDA and OpenCL. From my point of view both programs are pretty much the same, i.e. they use the same amount of constant memory and global memory, the same loops and branching code, etc.
I measured the running time and CUDA performed slightly better than OpenCL. My question is whether CUDA on an NVIDIA GPU is faster than OpenCL on an NVIDIA GPU because CUDA is the native way of programming such a GPU.
Could you share some links to info related on this topic?
OpenCL and CUDA are equally fast if they are tweaked correctly for the target architecture. However, tweaking may negatively impact portability.
Links:
http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6047190&tag=1
I did my physics simulation project using C++ and OpenGL in Visual Studio 10. Later I used OpenMP for CPU parallelization. Now I want to port my C++ code to CUDA so that I can achieve higher performance. Is it possible to convert my code to CUDA, or to run it on GPU devices in general?
CUDA and C++ are different programming languages (even if they look syntactically similar) with different programming paradigms.
You'll have to recode, and perhaps even redesign, your project to take advantage of CUDA (or of OpenCL).
Actually, you'll need to identify the numerical kernels that might take advantage of your GPGPU and then recode these kernels (in CUDA, or in OpenCL); you'll also have to write some glue code to make all this work together.
You can determine which parts of your project can be parallelized and then reimplement these parts in Cuda. You can take a look at Fast N-Body Simulation with CUDA.
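As an illustration of what such a parallelizable part can look like in a physics simulation, here is a minimal all-pairs gravitational acceleration in plain Python (a CPU sketch, not CUDA code). The gravitational constant is taken as 1, and `softening` is an assumed parameter that avoids division by zero when bodies come close together.

```python
def accelerations(positions, masses, softening=1e-3):
    """All-pairs gravitational acceleration for 3D bodies (G = 1)."""
    result = []
    # Each iteration of the outer loop is independent of the others,
    # so each body could in principle be handled by its own GPU thread.
    for i, (xi, yi, zi) in enumerate(positions):
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            inv_dist_cubed = dist_sq ** -1.5
            ax += masses[j] * dx * inv_dist_cubed
            ay += masses[j] * dy * inv_dist_cubed
            az += masses[j] * dz * inv_dist_cubed
        result.append((ax, ay, az))
    return result


bodies = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(accelerations(bodies, masses=[1.0, 1.0]))
```

The outer loop reads all positions but writes only its own body's result, which is the access pattern that makes this kind of routine a candidate for reimplementation as a kernel.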
There are several ways of using CUDA:
1. auto-parallelizing tools such as PGI Workstation;
2. wrappers such as Thrust (in STL style);
3. the NVIDIA GPU SDK (runtime/driver API).
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, it will be next to impossible to beat the performance of hand-rolled code that uses every trick in the book with the GPU SDK, because of the control it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators it provides are useful for a wide class of problems too.
As karlphillip mentioned, auto-parallelizing tools tend not to do quite as good a job with the different memory types.
My preferred workflow is to write as much as I can with Thrust and then use the GPU SDK for the rest. This is largely a matter of not trading away too much performance in exchange for reduced development time and better maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application. There are great articles about this on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I, like many others, have run into problems and bugs in them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust can offer, like reduction and prefix sum, then Thrust is definitely worth a try, and I bet you can't write faster code yourself in pure CUDA C.
However, if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I have already had successful projects with a good speedup going this route, and the CPU and GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know you launch new kernels for each Thrust call. If you want everything in one big fat kernel (taking too-frequent kernel launches out of the equation), you have to use plain CUDA C with the SDK.
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on in the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" to the equation.
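The launch-count concern can be pictured with a CPU analogy in plain Python (this is not Thrust or CUDA code, and the function names are hypothetical). In the first version each step is a separate pass that materializes an intermediate array, loosely analogous to one kernel launch per library call; in the second, the steps are fused into a single pass.

```python
def sum_of_squares_two_pass(values):
    # Pass 1: transform every element and store the intermediate result.
    squared = [v * v for v in values]
    # Pass 2: reduce the intermediate array.
    return sum(squared)


def sum_of_squares_fused(values):
    # Single pass: transform and accumulate without an intermediate array.
    total = 0.0
    for v in values:
        total += v * v
    return total


data = [1.0, 2.0, 3.0]
print(sum_of_squares_two_pass(data), sum_of_squares_fused(data))
```

Thrust does offer some combined algorithms, such as `thrust::transform_reduce`, that fuse a transform and a reduction into one call, so the trade-off applies mainly to longer chains of steps that have no ready-made fused form.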