Differences between virtual and real architecture of CUDA

I'm trying to understand the differences between the virtual and real architectures of CUDA, and how different configurations will affect the performance of the program, e.g.
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
...
The following explanation was given in NVCC manual,
GPU compilation is performed via an intermediate representation, PTX
([...]), which can
be considered as assembly for a virtual GPU architecture. Contrary to an actual graphics
processor, such a virtual GPU is defined entirely by the set of capabilities, or features,
that it provides to the application. In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a compute
architecture to specify the virtual intermediate architecture, plus a real GPU architecture
to specify the intended processor to execute on. For such an nvcc command to be valid,
the real architecture must be an implementation (someway or another) of the virtual
architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that
the application requires: using a smallest virtual architecture still allows a widest range
of actual architectures for the second nvcc stage. Conversely, specifying a virtual
architecture that provides features unused by the application unnecessarily restricts the
set of possible GPUs that can be specified in the second nvcc stage.
But I still don't quite get how performance will be affected by different configurations (or does it maybe only affect which physical GPU devices can be selected?). In particular, this statement is the most confusing to me:
In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.

The NVIDIA CUDA Compiler Driver NVCC User Guide Section on GPU Compilation provides a very thorough description of virtual and physical architecture and how the concepts are used in the build process.
The virtual architecture specifies the feature set that is targeted by the code. The table listed below shows some of the evolution of the virtual architecture. When compiling you should specify the lowest virtual architecture that has a sufficient feature set to enable the program to be executed on the widest range of physical architectures.
Virtual Architecture Feature List (from the User Guide):
compute_10  Basic features
compute_11  + atomic memory operations on global memory
compute_12  + atomic memory operations on shared memory, + vote instructions
compute_13  + double precision floating point support
compute_20  + Fermi support
compute_30  + Kepler support
The physical architecture specifies the implementation of the GPU. This provides the compiler with the instruction set, instruction latency, instruction throughput, resource sizes, etc. so that the compiler can optimally translate the virtual architecture to binary code.
It is possible to specify multiple virtual and physical architecture pairs to the compiler and have the compiler pack the resulting PTX and binary code into a single fatbinary. At runtime the CUDA driver will choose the best representation for the physical device that is installed. If no suitable binary code is provided in the fatbinary file, the driver can use the JIT runtime on the best available PTX implementation.
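For illustration, a hypothetical invocation that packs two SASS versions plus a PTX fallback into one fatbinary could look like this (the source and output file names are just placeholders):
nvcc x.cu -o app \
  -gencode arch=compute_20,code=sm_20 \
  -gencode arch=compute_20,code=sm_21 \
  -gencode arch=compute_20,code=compute_20
A cc 2.0 or cc 2.1 device would pick its matching SASS directly, while a newer device would fall back to JIT-compiling the embedded compute_20 PTX.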

"Virtual architecture" code will get compiled by a just-in-time compiler before being loaded on the device. AFAIK, it is the same compiler as the one NVCC invokes when building "physical architecture" code offline - so I don't know if there will be any differences in the resulting application performance.
Basically, every generation of CUDA hardware is binary-incompatible with the previous generation - imagine a next generation of Intel processors sporting the ARM instruction set. Virtual architectures provide an intermediate representation of the CUDA application that can be compiled for any compatible hardware. Every hardware generation introduces new features (e.g. atomics, CUDA Dynamic Parallelism) that require new instructions - that's why you need new virtual architectures.
Basically, if you want to use CDP you should compile for SM 3.5. You can compile to a device binary that contains assembly code for a specific CUDA device generation, or you can compile to PTX code that can be compiled into device assembly for any device generation that provides these features; an example build is sketched below.
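As a concrete sketch (file names invented), a CDP build typically targets SM 3.5, enables relocatable device code and links the device runtime, and can additionally embed compute_35 PTX so newer devices can JIT-compile it:
nvcc my_kernel.cu -o my_app -rdc=true -lcudadevrt \
  -gencode arch=compute_35,code=sm_35 \
  -gencode arch=compute_35,code=compute_35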

The virtual architecture specifies what capabilities a GPU has and the real architecture specifies how it does it.
I can't think of any specific examples offhand. A (probably not correct) example might be: a virtual GPU specifies the number of cores a card has, so code is generated targeting that number of cores, whereas the real card may have a few more for redundancy (or a few fewer due to manufacturing errors) plus some method of mapping to the cores that are actually in use, which can be layered on top of the more generic code generated in the first step.
You can think of the PTX code sort of like assembly code, which targets a certain architecture, which can then be compiled to machine code for a specific processor. Targeting the assembly code for the right kind of processor will, in general, generate better machine code.
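If you want to look at both representations yourself, nvcc and cuobjdump can dump them; a quick sketch, assuming a source file named kernel.cu:
nvcc -ptx kernel.cu -o kernel.ptx            (human-readable PTX for the virtual architecture)
nvcc -cubin -arch=sm_20 kernel.cu -o kernel.cubin
cuobjdump -sass kernel.cubin                 (SASS, the machine code of the real architecture)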

Well, usually what NVIDIA writes as documentation causes people (including myself) to become more confused! (Maybe just me.)
You are concerned with performance; basically what this says is don't be (probably), but you should be. The GPU architecture is a bit like nature: they run something on it, something happens, and then they try to explain it, and then they feed that explanation to you.
In the end you should probably run some tests and see which configuration gives the best result.
The virtual architecture is designed to let you think freely: use as many threads as you want, assign virtually anything as the number of threads and blocks, it doesn't matter, it will be translated to PTX and the device will run it.
The only problem is that if you ask for more than 1024 threads in a single block, the launch will fail (and you may just get zeros as the result) because the device (the real architecture) doesn't support it.
Similarly, if your device only supports compute capability 1.2 and you use double precision variables in your code, you will again get wrong results because the device simply can't run them.
Performance-wise, you also have to know that the 32 threads of a warp should access neighbouring memory locations, otherwise the accesses will be serialized into multiple transactions, and so on.
So I hope you've got the point by now. It is a relatively new field and the GPU is a really sophisticated piece of hardware; everybody is trying to make the best of it, but it's a game of testing plus a little knowledge of the actual architecture behind CUDA. I suggest searching for GPU architecture material to see how the virtual threads and thread blocks are actually implemented.
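If you want to guard against the kinds of failures mentioned above (too many threads per block, missing double precision support), you can query the device at runtime. A minimal sketch using the standard CUDA runtime API:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                    // properties of device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    if (prop.major == 1 && prop.minor < 3)                // double precision needs cc 1.3 or later
        printf("Warning: this device has no double precision support\n");
    return 0;
}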

Related

What do the %envregN special registers hold?

I've read CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX, but their question about the %envregN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envregN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.

How to detect NVIDIA CUDA Architecture [duplicate]

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.
From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.
I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.
After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
and after some reading, I found that multiple device architectures could be compiled for in a single binary file - in this case sm_20, sm_21.
My questions are: why are so many arch/code pairs necessary? Are all the values of "arch" above actually used?
What is the difference between that and, say:
-arch compute_20
-code sm_20
-code sm_21
Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?
Is there any other compilation and runtime behaviour I should be aware of?
I've read the manual, http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation and I'm still not clear regarding what happens at compilation or runtime.
Roughly speaking, the code compilation flow goes like this:
CUDA C/C++ device code source --> PTX --> SASS
The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc. assuming the GPU driver supports those architectures.
One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).
One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.
If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.
One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.
It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.
Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
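To make those trade-offs concrete, two hypothetical builds (file names are placeholders; the cc 3.0 values just match the example above):
nvcc app.cu -o app -gencode arch=compute_30,code=compute_30     (PTX only: runs via JIT on any cc >= 3.0 GPU the driver supports)
nvcc app.cu -o app -gencode arch=compute_30,code=sm_30          (SASS only: no JIT on cc 3.0 devices, but no PTX for future GPUs)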
The purpose of multiple -arch flags is to use the __CUDA_ARCH__ macro for conditional compilation (i.e. using #if/#ifdef) of differently-optimized code paths.
See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
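A minimal sketch of that pattern (the kernel and the chosen threshold are only for illustration): the same source is compiled once per virtual architecture, and __CUDA_ARCH__ selects the path in each pass.
__global__ void my_kernel(float *data) {
#if __CUDA_ARCH__ >= 350
    // path used when compiling for compute_35 and above; can rely on cc 3.5 features such as __ldg()
    data[threadIdx.x] = __ldg(&data[threadIdx.x]) * 2.0f;
#else
    // fallback path for older virtual architectures
    data[threadIdx.x] = data[threadIdx.x] * 2.0f;
#endif
}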

How is WebGL or CUDA code actually translated into GPU instructions?

When you write shaders and such in WebGL or CUDA, how is that code actually translated into GPU instructions?
I want to learn how you can write super low-level code that optimizes graphic rendering to the extreme, in order to see exactly how GPU instructions are executed, at the hardware/software boundary.
I understand that, for CUDA for example, you buy their graphics card (GPU), which is somehow implemented to optimize graphics operations. But then how do you program on top of that (in a general sense), without C?
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
If you look at docs like CUDA by Example, that's all just C code (though they do have things like cudaMalloc and cudaFree, and I don't know what those are doing behind the scenes). But under the hood, that C must be compiled to assembly, or at least machine code, right? And if so, how is that accessing the GPU?
Basically I am not seeing how, at a level below C or GLSL, the GPU itself is being instructed to perform operations. Can you please explain? Is there some snippet of assembly that demonstrates how it works, or anything like that? Or is there some other set of "GPU registers" in addition to the 16 "CPU registers" on x86, for example?
The GPU driver compiles it to something the GPU understands, which is something else entirely than x86 machine code. For example, here's a snippet of AMD R600 assembly code:
00 ALU: ADDR(32) CNT(4) KCACHE0(CB0:0-15)
0 x: MUL R0.x, KC0[0].x, KC0[1].x
y: MUL R0.y, KC0[0].y, KC0[1].y
1 z: MUL R0.z, KC0[0].z, KC0[1].z
w: MUL R0.w, KC0[0].w, KC0[1].w
01 EXP_DONE: PIX0, R0
END_OF_PROGRAM
The machine code version of that would be executed by the GPU. The driver orchestrates the transfer of the code to the GPU and instructs it to run it. That is all very device specific, and in the case of nvidia, undocumented (at least, not officially documented).
The R0 in that snippet is a register, but on GPUs registers usually work a bit differently. They exist "per thread", and are in a way a shared resource (in the sense that using many registers in a thread means that fewer threads will be active at the same time). In order to have many threads active at once (which is how GPUs tolerate memory latency, whereas CPUs use out of order execution and big caches), GPUs usually have tens of thousands of registers.
Those languages are translated to machine code by a compiler. That compiler is just part of the drivers/runtimes of the various APIs, and is totally implementation specific. There are none of the families of common instruction sets we are used to in CPU land - like x86, ARM or whatever. Different GPUs all have their own incompatible instruction sets. Furthermore, there are no APIs with which to upload and run arbitrary binaries on those GPUs. And there is little publicly available documentation for that, depending on the vendor.
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
Well, you can. In theory, at least. If you do not care that your code will only work on a small family of ASICs, if you have all the necessary documentation for them, and if you are willing to implement some interface to the GPU that allows you to run those binaries, you can do it. If you want to go that route, you could look at the Mesa3D project, as it provides open source drivers for a number of GPUs, including an LLVM-based compiler infrastructure to generate the binaries for a particular architecture.
In practice, there is no useful way of bare metal GPU programming on a large scale.

Is there a reason GPU/CPU pointers aren't more strongly typed?

Is there a reason the language designers didn't make pointers more strongly typed, so that the compiler could differentiate between a GPU-pointer and a CPU-pointer and eliminate the ridiculously common bug of mixing the two?
Is there ever a need to have a pointer refer to both a GPU-memory location and a CPU-memory location at once (is that even possible)?
Or is this just an incredibly glaring oversight in the design of the language?
[Edit] Example: C++/CLI has two different types of pointers, which cannot be mixed. They introduced separate notation so that this requirement could be enforced by the compiler:
int* a; //Normal pointer
int^ b; //Managed pointer
//pretend a is assigned here
b = a; //Compiler error!
Is there a reason (other than laziness/oversight) that CUDA does not do the same thing?
Nvidia's nvcc CUDA C "compiler" is not a full compiler, but a rather simple driver program that calls some other tools (cudafe and the C preprocessor) to separate host and device code, and feeds them to their respective compilers.
Only the device code compiler (cicc, or nvopencc in previous CUDA releases) is provided by Nvidia. The host portion of the code is just passed on to the host's native C compiler, which frees Nvidia from the burden of providing a competitive compiler itself.
Generating error messages on improper pointer use would require parsing the host C code. While that would certainly be possible (teaching e.g. sparse or clang about the CUDA peculiarities), to my knowledge nobody has invested the effort into this so far.
Nvidia has written up a document on the NVIDIA CUDA Compiler Driver NVCC that explains the compilation process and the tools involved in more detail.
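To illustrate the class of bug being discussed (names invented for the sketch): both pointers below have exactly the same C type, so neither nvcc nor the host compiler can tell that h_data must not be dereferenced on the device; the mistake only shows up at runtime.
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *h_data = (float*)malloc(n * sizeof(float));    // host pointer
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));               // device pointer - same static type
    scale<<<4, 256>>>(h_data, n);                          // compiles cleanly, fails at runtime
    scale<<<4, 256>>>(d_data, n);                          // the correct call
    cudaFree(d_data);
    free(h_data);
    return 0;
}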
All pointers you define are stored in RAM, no matter whether they are "GPU pointers" or "CPU pointers", and you have to copy data to the GPU yourself. To the language there is no GPU or CPU pointer; it is just a variable that holds an address to a location in some memory. Where you use it is what matters: if you dereference it on the GPU, the GPU will look up that address in the memory it can access, which may even be a location in host RAM if you have pinned and mapped it for the GPU.
The most important thing is that on the CPU you don't have direct access to a physical location in RAM, because the CPU address space is virtual and your data might even be paged out to a hard drive; this isn't the case on a GPU, where a memory address maps directly to the location. That makes it impossible to unify both address spaces.

Is just-in-time (jit) compilation of a CUDA kernel possible?

Does CUDA support JIT compilation of a CUDA kernel?
I know that OpenCL offers this feature.
I have some variables which are not changed during runtime (i.e. they only depend on the input file), therefore I would like to define these values with a macro at kernel compile time (i.e. at runtime).
If I define these values manually at compile time my register usage drops from 53 to 46, which greatly improves performance.
It became available with the NVRTC library in CUDA 7.0. With this library you can compile your CUDA code at runtime.
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
But what kind of advantages can you gain? In my view, I couldn't find that dynamic compilation brings dramatic advantages.
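For completeness, a rough sketch of how NVRTC could be used for exactly the use case in the question, baking a value in via a -D define at runtime (error checking omitted; the kernel source and the TILE macro are invented for illustration):
#include <cstdio>
#include <nvrtc.h>

const char *src =
    "extern \"C\" __global__ void k(float *out) {"
    "    out[threadIdx.x] = TILE;"
    "}";

int main() {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, NULL, NULL);
    const char *opts[] = { "-DTILE=16", "--gpu-architecture=compute_35" };
    nvrtcCompileProgram(prog, 2, opts);                   // compile to PTX with TILE baked in
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char *ptx = new char[ptxSize];
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);
    // the PTX string can now be loaded with cuModuleLoadDataEx / cuModuleGetFunction
    // and launched with cuLaunchKernel (CUDA driver API)
    printf("Generated %zu bytes of PTX\n", ptxSize);
    delete[] ptx;
    return 0;
}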
If it is feasible for you to use Python, you can use the excellent pycuda module to compile your kernels at runtime. Combined with a templating engine such as Mako, you will have a very powerful meta-programming environment that will allow you to dynamically tune your kernels for whatever architecture and specific device properties happen to be available to you (obviously some things will be difficult to make fully dynamic and automatic).
You could also consider just maintaining a few distinct versions of your kernel with different parameters, between which your program could choose at runtime based on whatever input you are feeding to it.
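A sketch of that "few distinct versions" idea in plain CUDA C++ (kernel and parameter invented for illustration): each template instantiation is compiled ahead of time with its parameter fixed, and the host code simply picks one at runtime.
template <int BLOCK>
__global__ void process(float *data, int n) {
    int i = blockIdx.x * BLOCK + threadIdx.x;             // BLOCK is a compile-time constant here
    if (i < n)
        data[i] *= 2.0f;
}

void launch(float *d_data, int n, int block) {            // runtime choice between the variants
    switch (block) {
        case 128: process<128><<<(n + 127) / 128, 128>>>(d_data, n); break;
        case 256: process<256><<<(n + 255) / 256, 256>>>(d_data, n); break;
        default:  process<512><<<(n + 511) / 512, 512>>>(d_data, n); break;
    }
}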