Does CUDA signbit() remove divergence?

I have seen that some people suggest that using signbit() can eliminate warp divergence and improve performance. If this is correct, then how is it implemented in the GPU? Is there some dedicated hardware for this function in, e.g., special function units (SFU)?

The implementation of signbit() is in the open in CUDA versions up to, and including, CUDA 6.5. It can be found in the header file math_functions.h. For newer versions of CUDA, you could inspect the machine code with cuobjdump --dump-sass to see how it is implemented.
Looking at the header file in CUDA 6.5, one sees that signbit() is a macro that maps to an inline function which extracts the sign bit from the raw bit representation of the floating-point operand. On GPUs this is easily done, since integer and floating-point operands share the same register file. In the case of CUDA 6.5, the sign bit is extracted with a single right-shift instruction.
So the implementation of signbit() is branchless and efficient; however, there is no dedicated hardware instruction for it, as none is necessary.
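For illustration only, here is a minimal sketch of the idea (a hypothetical helper, not the actual header code), using the __float_as_int() intrinsic to reinterpret the operand's bits:

__device__ __forceinline__ int my_signbit(float x)
{
    // Reinterpret the float's bits as an integer and shift the sign bit down:
    // returns 1 if the sign bit is set, 0 otherwise.
    return (int)((unsigned int)__float_as_int(x) >> 31);
}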
In general, CUDA programmers do not need to worry about branches all that often, especially where if-then-else constructs with small bodies are concerned. The compiler frequently renders these into branchless code using either predication or select-type instructions (the machine equivalent of the C/C++ ternary operator). It may also combine uniform branches with predication.
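As a hypothetical illustration (the actual code generation depends on the compiler version and target architecture), a construct like the following is typically compiled to a select or predicated instruction rather than an actual branch:

__device__ float clamp_negative_to_zero(float x)
{
    float y;
    if (x < 0.0f)      // small if-then-else body; usually no real branch is emitted
        y = 0.0f;
    else
        y = x;
    return y;          // behaves like the ternary: x < 0.0f ? 0.0f : x
}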


What do the %envregN special registers hold?

I've read: CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about the %envN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.
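For anyone who wants to try the reverse-engineering route mentioned above, here is a minimal sketch of how the registers could be dumped from a kernel, assuming they can be read with mov like other PTX special registers (the kernel name and structure are purely illustrative):

__global__ void dump_envregs(unsigned int *out)
{
    unsigned int r0, r1;
    // Inline PTX: read two of the %envreg<32> special registers.
    asm volatile("mov.u32 %0, %%envreg0;" : "=r"(r0));
    asm volatile("mov.u32 %0, %%envreg1;" : "=r"(r1));
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        out[0] = r0;
        out[1] = r1;
    }
}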

How is WebGL or CUDA code actually translated into GPU instructions?

When you write shaders and such in WebGL or CUDA, how is that code actually translated into GPU instructions?
I want to learn how you can write super low-level code that optimizes graphic rendering to the extreme, in order to see exactly how GPU instructions are executed, at the hardware/software boundary.
I understand that, for CUDA for example, you buy their graphics card (GPU), which is somehow implemented to optimize graphics operations. But then how do you program on top of that (in a general sense), without C?
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
If you look at docs like CUDA by Example, that's all just C code (though they do have things like cudaMalloc and cudaFree, and I don't know what those are doing behind the scenes). But under the hood, that C must be compiled to assembly or at least machine code or something, right? And if so, how is that accessing the GPU?
Basically I am not seeing how, at a level below C or GLSL, the GPU itself is being instructed to perform operations. Can you please explain? Is there some snippet of assembly that demonstrates how it works, or anything like that? Or is there another set of some sort of "GPU registers" in addition to the 16 "CPU registers" on x86, for example?
The GPU driver compiles it to something the GPU understands, which is something entirely different from x86 machine code. For example, here's a snippet of AMD R600 assembly code:
00 ALU: ADDR(32) CNT(4) KCACHE0(CB0:0-15)
0 x: MUL R0.x, KC0[0].x, KC0[1].x
y: MUL R0.y, KC0[0].y, KC0[1].y
1 z: MUL R0.z, KC0[0].z, KC0[1].z
w: MUL R0.w, KC0[0].w, KC0[1].w
01 EXP_DONE: PIX0, R0
END_OF_PROGRAM
The machine code version of that would be executed by the GPU. The driver orchestrates the transfer of the code to the GPU and instructs it to run it. That is all very device-specific and, in the case of NVIDIA, undocumented (at least, not officially documented).
The R0 in that snippet is a register, but on GPUs registers usually work a bit differently. They exist "per thread", and are in a way a shared resource (in the sense that using many registers in a thread means that fewer threads will be active at the same time). In order to have many threads active at once (which is how GPUs tolerate memory latency, whereas CPUs use out of order execution and big caches), GPUs usually have tens of thousands of registers.
Those languages are translated to machine code via a compiler. That compiler is just part of the drivers/runtimes of the various APIs, and is totally implementation-specific. There are none of the families of common instruction sets we are used to in CPU land, like x86 or ARM; different GPUs all have their own incompatible instruction sets. Furthermore, there are no APIs with which to upload and run arbitrary binaries on those GPUs. And, depending on the vendor, there is little publicly available documentation for any of this.
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
Well, you can. In theory, at least. If you do not care that your code will only work on a small family of ASICs, if you have all the necessary documentation, and if you are willing to implement some interface to the GPU that allows you to run those binaries, you can do it. If you want to go that route, you could look at the Mesa3D project, as it provides open-source drivers for a number of GPUs, including an LLVM-based compiler infrastructure to generate the binaries for the particular architecture.
In practice, there is no useful way of bare metal GPU programming on a large scale.
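To make this concrete on the CUDA side, here is a trivial kernel together with the commands one might use to look at the intermediate PTX and the final machine code (the file name and the target architecture are arbitrary examples):

// add.cu - a minimal kernel, just so there is something to compile. For example:
//   nvcc -ptx add.cu                  emits add.ptx, the "assembly" for the virtual architecture
//   nvcc -cubin -arch=sm_35 add.cu    emits add.cubin, a binary for one real architecture
//   cuobjdump --dump-sass add.cubin   disassembles the actual GPU machine code (SASS)
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}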

Differences between virtual and real architecture of CUDA

I am trying to understand the differences between the virtual and real architectures of CUDA, and how the different configurations will affect the performance of the program, e.g.
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
...
The following explanation is given in the NVCC manual:
GPU compilation is performed via an intermediate representation, PTX ([...]), which can be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics processor, such a virtual GPU is defined entirely by the set of capabilities, or features, that it provides to the application. In particular, a virtual GPU architecture provides a (largely) generic instruction set, and binary instruction encoding is a non-issue because PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a compute architecture to specify the virtual intermediate architecture, plus a real GPU architecture to specify the intended processor to execute on. For such an nvcc command to be valid, the real architecture must be an implementation (someway or another) of the virtual architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that the application requires: using a smallest virtual architecture still allows a widest range of actual architectures for the second nvcc stage. Conversely, specifying a virtual architecture that provides features unused by the application unnecessarily restricts the set of possible GPUs that can be specified in the second nvcc stage.
But I still don't quite get how the performance will be affected by different configurations (or does it perhaps only affect the selection of the physical GPU devices?). In particular, this statement is most confusing to me:
In particular, a virtual GPU architecture provides a (largely) generic instruction set, and binary instruction encoding is a non-issue because PTX programs are always represented in text format.
The NVIDIA CUDA Compiler Driver NVCC User Guide Section on GPU Compilation provides a very thorough description of virtual and physical architecture and how the concepts are used in the build process.
The virtual architecture specifies the feature set that is targeted by the code. The table listed below shows some of the evolution of the virtual architecture. When compiling you should specify the lowest virtual architecture that has a sufficient feature set to enable the program to be executed on the widest range of physical architectures.
Virtual Architecture Feature List (from the User Guide)
compute_10 Basic features
compute_11 + atomic memory operations on global memory
compute_12 + atomic memory operations on shared memory, + vote instructions
compute_13 + double precision floating point support
compute_20 + Fermi support
compute_30 + Kepler support
The physical architecture specifies the implementation of the GPU. This provides the compiler with the instruction set, instruction latency, instruction throughput, resource sizes, etc. so that the compiler can optimally translate the virtual architecture to binary code.
It is possible to specify multiple virtual and physical architecture pairs to the compiler and have the compiler pack the final PTX and binary code into a single fat binary. At runtime the CUDA driver will choose the best representation for the physical device that is installed. If binary code for that device is not provided in the fatbinary file, the driver can JIT compile the best available PTX version.
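As an illustration (the file name and the particular architectures are arbitrary), a fat binary containing machine code for two real architectures plus PTX that the driver can JIT compile for other devices could be built like this:

nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_30,code=compute_30 \
     -o app app.cu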
"Virtual architecture" code will get compiled by a just-in-time compiler before being loaded on the device. AFAIK, it is the same compiler as the one NVCC invokes when building "physical architecture" code offline - so I don't know if there will be any differences in the resulting application performance.
Basically, every generation of the CUDA hardware is binary-incompatible with the previous generation - imagine the next generation of Intel processors sporting the ARM instruction set. This way, virtual architectures provide an intermediate representation of the CUDA application that can be compiled for compatible hardware. Every hardware generation introduces new features (e.g. atomics, CUDA Dynamic Parallelism) that require new instructions - that's why you need new virtual architectures.
Basically, if you want to use CDP you should compile for SM 3.5. You can compile it to a device binary that contains machine code for a specific CUDA device generation, or you can compile it to PTX code that can be compiled into device machine code for any device generation that provides these features.
The virtual architecture specifies what capabilities a GPU has and the real architecture specifies how it does it.
I can't think of any specific examples off hand. A (probably not correct) example might be: a virtual GPU specifies the number of cores a card has, so code is generated targeting that number of cores, whereas the real card may have a few more for redundancy (or a few less due to manufacturing errors) and some method of mapping to the cores that are actually in use, which can be layered on top of the more generic code generated in the first step.
You can think of the PTX code sort of like assembly code that targets a certain architecture, which can then be compiled into machine code for a specific processor. Targeting the assembly code at the right kind of processor will, in general, generate better machine code.
Well, usually what NVIDIA writes as documentation causes people (including myself) to become more confused! (Or maybe that's just me.)
You are concerned with performance; basically what the documentation says is not to be (probably), but you should be. Basically, the GPU architecture is like nature: they run something on it and something happens, then they try to explain it, and then they feed that explanation to you.
In the end you should probably run some tests and see which configuration gives the best result.
The virtual architecture is designed to let you think freely. You should take advantage of that: use as many threads as you want, assign virtually anything as the number of threads and blocks, it doesn't matter - it will be translated to PTX and the device will run it.
The only problem is that if you assign more than 1024 threads to a single block you will get zeros as the result, because the device (the real architecture) doesn't support it.
Or, for example, if your device only supports compute capability 1.2, you can declare double precision variables in your code, but again you will get zeros as the result, simply because the device can't run them.
Performance-wise, you have to know that every 32 threads (i.e. a warp) have to access memory in a suitable pattern, or else the accesses will be serialized, and so on.
So I hope you've got the point by now. It is a relatively new field, and the GPU is a really sophisticated piece of hardware architecture; everybody is trying to make the best of it, but it's a game of testing plus a little knowledge of the actual architecture behind CUDA. I suggest you search for GPU architecture and see how the virtual threads and thread blocks are actually implemented.

Are there any warp voting functions in OpenCL?

In CUDA there are __ballot(), __any(), __all(), __popc() and a bunch of lane-mask functions to perform warp voting operations across all lanes (usually 32 of them) within a warp. I'm wondering whether there are any such functions implemented in OpenCL to perform the same operations within one wavefront. If there are none, I may need to implement them as inline functions myself for use in my project.
According to the OpenCL v. 1.1 specification, section 6.11 "Built-in Functions", I believe that the answer is no.
However on NVIDIA GPUs, you can probably use inline PTX to implement these things (or at least this blogger was able to use inline PTX).
Actually, check out OpenCL subgroups. They define some cross-lane functions like sub_group_all() and sub_group_any(), as well as some other interesting things.
Subgroups are a relatively new critter and I am not sure who all supports them. The Intel GPU implementation (an extension, actually) has a few more interesting shuffle functions to permute lanes (within the register file), as well as explicit block writes and reads. I bet AMD also supports subgroups, but I am not sure about NVIDIA.
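For a rough idea of what using the subgroup votes looks like, here is a minimal sketch (assuming a device and driver that expose cl_khr_subgroups or OpenCL 2.1+; the kernel itself is purely illustrative):

#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void vote_example(__global const int *in, __global int *out)
{
    int gid  = get_global_id(0);
    int pred = (in[gid] > 0);

    int all_set = sub_group_all(pred);   // non-zero if pred holds on every lane of the subgroup
    int any_set = sub_group_any(pred);   // non-zero if pred holds on any lane of the subgroup

    out[gid] = 2 * all_set + any_set;
}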

CUDA: calling library function in kernel

I know that there is the restriction to call only __device__ functions in the kernel. This prevents me from calling standard functions like strcmp() and so on in the kernel.
At this point I am not able to understand/find the reasons for this. Couldn't the compiler just follow the includes in strings.h and so on while inlining the calls to strcmp() in the kernel? I guess the reason I am looking for is simple and I am missing something here.
Is reimplementing all the functions and data types I need in the kernel computation the only way? Is there a codebase with such reimplementations?
Yes, the only way to use stdlib functions from a kernel is to reimplement them. But I strongly advise you to reconsider this idea, since it's highly unlikely you would need to run code that uses strcmp() on the GPU. Please add additional details about your problem, so that a better solution can be proposed (I highly doubt that serial string comparison on the GPU is what you really need).
It's hardly possible to simply recompile the whole stdlib for the GPU, since it depends heavily on system calls (like memory allocation), which cannot be used on the GPU (well, in recent versions of the CUDA toolkit you can allocate device memory from a kernel, but it's not the "CUDA way", it is supported only by the newest hardware, and it is very bad for performance).
Besides, the CPU versions of most functions are far from being "good" for GPUs. So, in the vast majority of cases, compiling your ordinary CPU functions for the GPU would do no good, and the compiler doesn't even try it.
Standard functions like strcmp() have not been compiled for the CUDA architecture. I have not seen any standard C libraries for CUDA.
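If you do end up reimplementing something like strcmp(), a minimal sketch of a __device__ version (illustrative only, with a hypothetical name, not a drop-in replacement for the C library function) could look like this:

__device__ int my_strcmp(const char *a, const char *b)
{
    // Walk both strings until the characters differ or one of them ends.
    while (*a && (*a == *b)) {
        ++a;
        ++b;
    }
    // Compare as unsigned char, as the C standard specifies for strcmp().
    return (int)(unsigned char)*a - (int)(unsigned char)*b;
}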