Are there any warp voting functions in OpenCL? - cuda

In CUDA there are __ballot(), __any(), __all(), __popc() and a number of lanemask functions for performing warp voting operations across all lanes of a warp (usually 32 lanes). I'm wondering whether any such functions exist in OpenCL to perform the same operations within one wavefront. If not, I may need to implement them myself as inline functions for use in my project.

According to the OpenCL v. 1.1 specification, section 6.11 "Built-in Functions", I believe that the answer is no.
However on NVIDIA GPUs, you can probably use inline PTX to implement these things (or at least this blogger was able to use inline PTX).

Actually, check out OpenCL subgroups. They define some cross-lane functions like sub_group_all() and sub_group_any(), as well as some other interesting things.
Subgroups are a relatively new critter and I am not sure who supports them. The Intel GPU implementation (an extension, actually) has a few more interesting shuffle functions to permute lanes (within the register file) as well as to make explicit block writes and reads. I bet AMD also supports subgroups, but I am not sure about NVIDIA.
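To make the semantics of these voting operations concrete, here is a minimal host-side C++ model. It is only a sketch: the names model_ballot, model_any, model_all and model_popc are hypothetical, the width of 32 lanes is an assumption, and nothing here is OpenCL or CUDA API. Each array element stands in for the predicate that one lane would pass to the vote.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// Host-side model only: one bool per lane stands in for each work-item's predicate.
constexpr int kLanes = 32;  // assumed subgroup/warp width; real hardware may differ
using Predicates = std::array<bool, kLanes>;

// Models a ballot: bit i of the result is set when lane i's predicate is true.
inline std::uint32_t model_ballot(const Predicates& p) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kLanes; ++lane) {
        if (p[lane]) mask |= (std::uint32_t{1} << lane);
    }
    return mask;
}

// Models an "any" vote: true when at least one lane votes true.
inline bool model_any(const Predicates& p) { return model_ballot(p) != 0; }

// Models an "all" vote: true only when every one of the 32 lanes votes true.
inline bool model_all(const Predicates& p) { return model_ballot(p) == 0xFFFFFFFFu; }

// Models a population count: the number of set bits in a 32-bit mask.
inline int model_popc(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());
}
```

With only lanes 0 and 5 voting true, model_ballot returns 0x21, model_any is true, model_all is false, and model_popc of the ballot is 2. On a device the hardware computes all of this in one instruction; the loop exists only to spell out the definition.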

Related

What do the %envregN special registers hold?

I've read: CUDA PTX code %envreg<32> special registers . The poster there was satisfied with not trying to treat OpenCL-originating PTX as a regular CUDA PTX. But - their question about %envN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.
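The offsets mentioned in the quoted explanation are visible at the OpenCL level, even though what each %envreg register holds remains undocumented. As a sketch of that user-visible arithmetic only, the following C++ helper recomputes a one-dimensional global ID from its parts. The name model_global_id is hypothetical, and this makes no claim about how the driver actually uses the %envreg registers.

```cpp
#include <cstddef>

// Hypothetical helper (not a driver or PTX API): recomputes a 1-D OpenCL
// global ID the way the OpenCL execution model defines it:
//   global_id = global_offset + group_id * local_size + local_id
inline std::size_t model_global_id(std::size_t global_offset,
                                   std::size_t group_id,
                                   std::size_t local_size,
                                   std::size_t local_id) {
    return global_offset + group_id * local_size + local_id;
}
```

For example, with a global offset of 100, work-group 2, a work-group size of 64 and local ID 3, the global ID is 231. A CUDA C launch has no such offset, which is consistent with the statement that the two launch types initialize different registers.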

Does CUDA signbit() remove divergence?

I have seen that some people suggest that using signbit() can eliminate warp divergence and improve performance. If this is correct, then how is it implemented in the GPU? Is there some dedicated hardware for this function in, e.g., special function units (SFU)?
The implementation of signbit() is in the open in CUDA versions up to, and including, CUDA 6.5. It can be found in the header file math_functions.h. For newer versions of CUDA, you could inspect the machine code with cuobjdump --dump-sass to see how it is implemented.
Looking at the header file in CUDA 6.5, one sees that signbit() is a macro that maps to an inline function that extracts the sign bit from the raw bit representation for the floating-point operand. On GPUs this is easily doable since integer and floating-point operands share the same register file. In case of CUDA 6.5, the sign bit is extracted with a single right-shift instruction.
So the implementation of signbit() is branchless and efficient. However, there is no dedicated hardware instruction for it, as this is unnecessary.
In general, CUDA programmers do not need to worry about branches all that often, especially where if-then-else constructs with small bodies are concerned. The compiler frequently renders these into branchless code using either predication or select-type instructions (the machine equivalent of the C/C++ ternary operator). It may also combine uniform branches with predication.
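The shift-based extraction described above can be sketched in plain C++. This is an illustration of the technique, not the actual code from math_functions.h; the name sign_bit_of is hypothetical, and a 32-bit IEEE-754 float is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative only (not the CUDA header's code): extracts the sign bit of a
// float by reinterpreting its bits as an integer and shifting right by 31.
inline int sign_bit_of(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "assumes a 32-bit IEEE-754 float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the raw bits
    return static_cast<int>(bits >> 31);  // single shift, no branch
}
```

The function returns 1 for -1.5f and for -0.0f, and 0 for 2.0f and 0.0f. There is no comparison and no branch, so every thread in a warp executes the same instruction sequence regardless of its input.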

CUDA development on different cards?

I'm just starting to learn CUDA development (using version 4) and was wondering whether it is possible to develop on a different card than the one I plan to use. As I learn, it would be nice to know this so I can keep an eye out for differences that are going to impact me.
I have a mid-2010 MacBook Pro with an NVIDIA GeForce 320M graphics card (a pretty basic integrated laptop card), but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if it's possible to develop locally on my laptop and then run the code on EC2 with minimal changes (I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question: I heard that recursion is supported on newer cards (and maybe not on my laptop's). What if I run recursive code on my laptop GPU? Will it kick out an error, or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is the kind of thing I'm getting at.)
If this is going to be a problem, are there emulators for features not available on my current card? Or will the SDK emulate them for me?
Sorry if this question is too basic.
Yes, it's pretty common practice to use different GPUs for development and production. NVIDIA GPU generations are backward-compatible, so if your program runs on an older card (that is, the 320M (CC 1.3)), it will certainly run on the M2070 (CC 2.0).
If you want maximum performance, you should, however, profile your program on the same architecture you are going to run it on, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provides a much worse view of what's going on than running on a GPU, no matter how old.
Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is at detecting recursion), but it certainly won't work: the 1.x architecture had no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend that you avoid recursion at any cost: it goes against the GPGPU programming paradigm and would certainly lead to very poor performance. Most algorithms are easily rewritten without recursion, and that is the preferable way to implement them, not only on the GPU but on the CPU as well.
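As an illustration of such a rewrite, here is a loop-based version of the factorial function from the error message above. It is a sketch in plain C++; the name factorial_iterative is hypothetical, and in CUDA the same body could be marked __device__.

```cpp
#include <cstdint>

// Iterative replacement for a recursive factorial: a loop needs no call
// stack, so it does not depend on hardware support for recursion.
// Note: the result overflows a 64-bit integer for n > 20.
inline std::uint64_t factorial_iterative(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}
```

Calling factorial_iterative(5) returns 120. Because the function contains no call to itself, the compiler can inline it fully, which is what the 1.x architecture requires.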
The CUDA version is not that important at first. More important is the compute capability (cc) of your card.
If you program your kernels for cc 1.0 and they are scalable, you won't have any problems in the future.
Choose the minimum cc level you need for your application.
Calculate the necessary parameters from the device properties and use PTX JIT compilation:
If your kernel can handle input data of arbitrary size and your kernel launch configuration scales across thousands of threads, it will scale across future versions.
In my projects all my kernels used a fixed number of threads per block, equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor, to reach 100% occupancy.
Some kernels need a multiple-of-two number of threads per block, so I handled this case as well, since the above equation does not guarantee a multiple-of-two block size for every cc version.
Some kernels used shared memory, and its size was also derived from the cc level properties.
This data was retrieved using cudaGetDeviceProperties in a utility class, and with PTX JIT compilation my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on the latest CUDA cards without any changes!
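The block-size rule described above can be sketched as follows. This is a host-side C++ illustration under stated assumptions: SmLimits is a hypothetical struct standing in for values that would come from cudaGetDeviceProperties or the compute capability tables, and "multiple of two" is read here as "power of two", which is an interpretation rather than something the text confirms.

```cpp
#include <cassert>

// Hypothetical container for per-multiprocessor limits; in a real program
// these values would be looked up for the device's compute capability.
struct SmLimits {
    int max_threads_per_sm;  // resident threads per streaming multiprocessor
    int max_blocks_per_sm;   // resident blocks per streaming multiprocessor
};

// Threads per block such that the maximum number of resident blocks
// exactly fills the multiprocessor's thread capacity.
inline int threads_per_block_for_full_occupancy(const SmLimits& limits) {
    assert(limits.max_blocks_per_sm > 0);
    return limits.max_threads_per_sm / limits.max_blocks_per_sm;
}

// Largest power of two that does not exceed n (assumed reading of the
// "multiple of two" requirement for kernels such as tree reductions).
inline int round_down_to_power_of_two(int n) {
    assert(n > 0);
    int p = 1;
    while (p * 2 <= n) {
        p *= 2;
    }
    return p;
}
```

Assuming limits of 768 resident threads and 8 resident blocks per multiprocessor, the rule gives 96 threads per block, which is not a power of two; rounding down gives 64. Rounding down keeps the kernel correct but may cost some occupancy, so it is a trade-off rather than a free fix.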
All kernels were programmed to work with 64-bit length input data and to utilize all dimensions of the 3D grid. (I am pretty sure I will still be working on this project in a year, so this was necessary.)
All my kernels except one stayed within the cc 1.0 register limit at 100% occupancy. So if the card's cc was below 1.2, I added a maxregcount setting to my kernel to still enforce 100% occupancy.
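The register limit in question follows from simple arithmetic: the register file of a multiprocessor is shared by all resident threads. The helper below is a sketch with a hypothetical name, and the figures used in the example are assumptions that should be checked against the CUDA Programming Guide for the target device.

```cpp
#include <cassert>

// Largest per-thread register count that still lets the multiprocessor hold
// its full complement of resident threads (integer division rounds down).
inline int max_registers_per_thread(int registers_per_sm,
                                    int resident_threads_per_sm) {
    assert(resident_threads_per_sm > 0);
    return registers_per_sm / resident_threads_per_sm;
}
```

Assuming 8192 registers and 768 resident threads per multiprocessor, the budget is 10 registers per thread; assuming 16384 registers and 1024 threads, it is 16. A value like this is what one would pass as a register cap, for example through nvcc's --maxrregcount option.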
This does not guarantee the best possible performance!
For the best possible performance, each kernel should be analyzed with regard to its parameters and resources.
This may not be practicable for all applications and requirements.
The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.

best way of using cuda

There are several ways of using CUDA:
1. auto-parallelizing tools such as PGI Workstation;
2. wrappers such as Thrust (in STL style);
3. the NVIDIA GPU SDK (runtime/driver API).
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book using the GPU SDK due to the control that it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and shown on several problems to have 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators they have are useful for a wide class of problems too.
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK; I, as well as many others, have run into problems and bugs in them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about all of the low-level implementation details as you do in the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust can offer, like reduction and prefix sum, then Thrust is definitely worth a try, and I bet you can't write faster code yourself in pure CUDA C.
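For readers unfamiliar with these two primitives, here is what they compute, written with the C++ standard library rather than Thrust so that it runs without a GPU. The function names are hypothetical. Thrust's thrust::reduce and thrust::inclusive_scan take iterator ranges in the same STL style, which is why code like this ports to Thrust with few changes.

```cpp
#include <numeric>
#include <vector>

// Reduction: collapses a sequence into a single value (here, the sum).
inline int sum_of(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Inclusive prefix sum (scan): element i of the result is v[0] + ... + v[i].
inline std::vector<int> inclusive_prefix_sum(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}
```

For the input {1, 2, 3, 4}, sum_of returns 10 and inclusive_prefix_sum returns {1, 3, 6, 10}. Both are sequential here; the parallel versions are considerably harder to write well by hand, which is the point made above about using a library.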
However if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I had already successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know each Thrust call launches new kernels. If you want to have everything in one big fat kernel (taking overly frequent kernel launches out of the equation), you have to use plain CUDA C with the SDK.
I find the pure CUDA C actually easier to learn, as it gives you quite a good understanding on what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" to the equation.

CUDA vs DirectX 10 for parallel mathematics. Any thoughts you have about it?

CUDA vs DirectX 10 for parallel mathematics. Any thoughts you have about it?
CUDA is probably a better option, if you know your target architecture is using nVidia chips. You have complete control over your data transfers, instruction paths and order of operations. You can also get by with a lot less __syncthreads calls when you're working on the lower level.
DirectX 10 will be easier to interface against, I should think, but if you really want to push your speed optimization, you have to bypass the extra layer. DirectX 10 will also not know when to use texture memory versus constant memory versus shared memory as well as you will depending on your particular algorithm.
If you have access to a Tesla C1060 or something like that, CUDA is by far the better choice hands down. You can really speed things up if you know the specifics of your GPGPU - I've seen 188x speedups in one particular algorithm on a Tesla versus my desktop.
I find CUDA awkward. It's not C, but a subset of it. It doesn't support double-precision floating point natively; it is emulated. For single precision it's okay, though. It depends on the type of task you throw at it. You have to spend more time computing in parallel than you spend passing the data around for it to be worth using. But that issue is not unique to CUDA.
I'd wait for Apple's OpenCL which seems like it will be the industry standard for parallel computing.
Well, CUDA is portable... That's a big win if you ask me...
CUDA has nothing to do with support for double-precision floating-point operations.
That depends on the hardware available. The 9, 100, 200 and Tesla series support double-precision floating-point operations.
It should be easy to decide between them.
If your app can tolerate being Windows specific, you can still consider DirectX Compute. Otherwise, use CUDA or OpenCL.
If your app cannot tolerate a vendor lock on NVIDIA, you cannot use CUDA, you must use OpenCL or DirectX Compute.
If your app is doing DirectX interop, consider that CUDA/OpenCL will incur context switch overhead doing graphics API interop, and DirectX Compute will not.
Unless one or more of those criteria affect your application, use the great granddaddy of massively parallel toolchains: CUDA.