"register" keyword in CUDA - cuda

I have a large program that uses all the registers I allocated per thread (64) and spills to local memory. I would like to be able to tell the compiler which variables should remain in registers at all cost, and which ones I don't really care about. Does the "register" C/C++ keyword work in nvcc? Is there a different mechanism perhaps?
Thanks!

You can use register in CUDA C/C++ if you want to. In any context, it is only a hint to the compiler. It may be ignored. There is no stated guarantee that it does anything at all.
I think these statements are pretty much true for most language implementations of register.
I also think it's quite likely that the compiler can do a better job than you can of deciding what should be in registers, and with what priority.
The typical CUDA C/C++ mechanisms for controlling register usage work at a higher level. They are:
the -maxrregcount compiler switch
the __launch_bounds__() directive.
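For example, a minimal sketch of the __launch_bounds__() form (the kernel and the bounds values are made up for illustration):

// Tell the compiler this kernel is launched with at most 256 threads per
// block and that we want at least 4 blocks resident per multiprocessor;
// the compiler uses this to bound register usage per thread.
__global__ void __launch_bounds__(256, 4) scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

By contrast, nvcc -maxrregcount=64 caps register usage for every kernel in the compilation unit rather than for a single kernel.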

Related

cudaMemcpy D2D flag - semantics w.r.t. multiple devices, and is it necessary?

I've not had the need before to memcpy data between 2 GPUs. Now, I'm guessing I'm going to do it with cudaMemcpy() and the cudaMemcpyDeviceToDevice flag, but:
is the cudaMemcpyDeviceToDevice flag used both for copying data within a single device's memory space and between the memory spaces of all devices?
If it is,
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
And if that's the case, then
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
For devices with UVA in effect, you can use the mechanism you describe. This doc section may be of interest (both the one describing device-to-device transfers as well as the subsequent section on UVA implications). Otherwise there is a cudaMemcpyPeer() API available, which has somewhat different semantics.
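For reference, a minimal sketch of the cudaMemcpyPeer() form mentioned above (pointer names, device IDs and size are made up for illustration):

// Copy 'bytes' from a buffer allocated on device 1 to a buffer on device 0.
// Source and destination devices are named explicitly, so the copy does not
// rely on UVA to work out which device each pointer belongs to.
cudaMemcpyPeer(dst_on_dev0, 0,    // destination pointer, destination device
               src_on_dev1, 1,    // source pointer, source device
               bytes);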
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
Yes, see the previously referenced doc sections.
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
cudaMemcpyDefault is the transfer flag that was added when UVA first appeared, to enable the use of generically-flagged transfers, where the direction is inferred by the runtime upon inspection of the supplied pointers.
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver API?
I'm assuming the generically-flagged method described above would meet whatever needs you have (or perhaps I'm not understanding this question).
Such discussions could engender the question, "Why would I ever use anything but cudaMemcpyDefault?"
One possible reason I can think of to use an explicit flag is that the runtime API will do explicit error checking if you supply one. If you're sure that a given invocation of cudaMemcpy will always be an H2D transfer, for example, then explicitly using cudaMemcpyHostToDevice will cause the runtime API to throw an error if the supplied pointers do not match the indicated direction. Whether you attach any value to such a concept is probably a matter of opinion.
As a matter of lesser importance (IMO), code that uses the explicit flags does not depend on UVA being available, but such execution scenarios are "disappearing" with newer environments.
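As an illustration of the two styles (pointer and size names are hypothetical):

// Direction inferred by the runtime from the (UVA) pointer values:
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);

// Explicit direction: the runtime can additionally flag an error if the
// supplied pointers do not match the stated direction.
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);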

How is WebGL or CUDA code actually translated into GPU instructions?

When you write shaders and such in WebGL or CUDA, how is that code actually translated into GPU instructions?
I want to learn how you can write super low-level code that optimizes graphics rendering to the extreme, in order to see exactly how GPU instructions are executed at the hardware/software boundary.
I understand that, for CUDA for example, you buy their graphics card (GPU), which is somehow implemented to optimize graphics operations. But then how do you program on top of that (in a general sense), without C?
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
If you look at docs like CUDA by Example, that's all just C code (though they do have things like cudaMalloc and cudaFree, and I don't know what those are doing behind the scenes). But under the hood, that C must be compiled to assembly or at least machine code or something, right? And if so, how is that accessing the GPU?
Basically I am not seeing how, at a level below C or GLSL, the GPU itself is being instructed to perform operations. Can you please explain? Is there some snippet of assembly that demonstrates how it works, or anything like that? Or is there some other set of "GPU registers" in addition to, for example, the 16 "CPU registers" on x86?
The GPU driver compiles it to something the GPU understands, which is something else entirely than x86 machine code. For example, here's a snippet of AMD R600 assembly code:
00 ALU: ADDR(32) CNT(4) KCACHE0(CB0:0-15)
0 x: MUL R0.x, KC0[0].x, KC0[1].x
y: MUL R0.y, KC0[0].y, KC0[1].y
1 z: MUL R0.z, KC0[0].z, KC0[1].z
w: MUL R0.w, KC0[0].w, KC0[1].w
01 EXP_DONE: PIX0, R0
END_OF_PROGRAM
The machine code version of that would be executed by the GPU. The driver orchestrates the transfer of the code to the GPU and instructs it to run it. That is all very device specific, and in the case of nvidia, undocumented (at least, not officially documented).
The R0 in that snippet is a register, but on GPUs registers usually work a bit differently. They exist "per thread", and are in a way a shared resource (in the sense that using many registers in a thread means that fewer threads will be active at the same time). In order to have many threads active at once (which is how GPUs tolerate memory latency, whereas CPUs use out of order execution and big caches), GPUs usually have tens of thousands of registers.
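For NVIDIA GPUs in particular, even though the machine-level details are not officially documented, you can still look at what the toolchain produces; a sketch of the usual commands (file names are made up):

nvcc -ptx kernel.cu -o kernel.ptx         # PTX: NVIDIA's virtual ISA, human-readable
nvcc -cubin -arch=sm_70 kernel.cu -o kernel.cubin
cuobjdump -sass kernel.cubin              # SASS: the actual machine code for that GPU generation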
Those languages are translated to machine code via a compiler. That compiler is just part of the drivers/runtimes of the various APIs, and is totally implementation specific. There are no common instruction set families like the ones we are used to in CPU land - x86, ARM or whatever. Different GPUs all have their own incompatible instruction sets. Furthermore, there are no APIs with which to upload and run arbitrary binaries on those GPUs. And there is little publicly available documentation for that, depending on the vendor.
The reason for this question is because on a previous question, I got the sense that you can't program the GPU directly by using assembly, so I am a bit confused.
Well, you can. In theory, at least. If you do not care that your code will only work on a small family of ASICs, if you have all the necessary documentation, and if you are willing to implement some interface to the GPU that allows you to run those binaries, you can do it. If you want to go that route, you could look at the Mesa3D project, as it provides open source drivers for a number of GPUs, including an LLVM-based compiler infrastructure to generate the binaries for the particular architecture.
In practice, there is no useful way of bare metal GPU programming on a large scale.

cuda inline and noinline device functions

According to the documentation, in devices of compute capability 1.x the compiler will inline __device__ functions by default, but for devices of compute capability 2.x and higher it will only do so if deemed appropriate by the compiler. When is it appropriate not to? There are also qualifiers such as __noinline__ and __forceinline__. In which cases is it better not to inline a __device__ function?
The compiler heuristic for inlining presumably weighs the potential performance benefit from eliminating function call overhead against other characteristics, including compile time. Aggressive inlining can lead to very large code that causes very long compile times. From observing the code generated for many different kernels, the CUDA compiler seems to inline in the vast majority of cases. Note that in some cases inlining is currently not possible, for example when the called function is in a different, separately compiled, compilation unit.
In my experience, the instances in which it makes sense to override the compiler's inlining heuristic are rare. I have used __noinline__ to limit code size and thus reduce excessive compile times. Use of __noinline__ has no predictable effect on register pressure that I am aware of. Inlining may allow more aggressive code movement such as load scheduling and this may increase register pressure, while not inlining may increase register pressure due to ABI restrictions on the use of registers. I have never found a case where use of __noinline__ improved performance, but of course such cases could exist, possibly due to instruction cache effects.
In my experience, forcing a __device__ function call to be inlined can cut the runtime in half. In a recent case, I made a function call (which passed just 5 variables to the function) inline, and kernel execution time decreased from 9.5 ms to 4.5 ms (almost half). And if you consider that you may want to execute the same kernel hundreds of millions of times, with a total runtime of a week or more (as in my case and many others working on CFD or MD projects), the increase in compile time is nothing compared to the huge saving in runtime.
All in all, I think it is worth trying the effect of inlining function calls on runtime, especially for codes with very long runtimes.
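For reference, the qualifiers go directly on the __device__ function definition; a minimal sketch (function names and bodies are made up for illustration):

// Ask the compiler not to inline this helper (e.g. to limit code size):
__device__ __noinline__ float heavy_helper(float x)
{
    return x * x + 1.0f;
}

// Ask the compiler to inline this helper even if its heuristic would not:
__device__ __forceinline__ float light_helper(float x)
{
    return 2.0f * x;
}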

Is there a reason GPU/CPU pointers aren't more strongly typed?

Is there a reason the language designers didn't make pointers more strongly typed, so that the compiler could differentiate between a GPU-pointer and a CPU-pointer and eliminate the ridiculously common bug of mixing the two?
Is there ever a need to have a pointer refer to both a GPU-memory location and a CPU-memory location at once (is that even possible)?
Or is this just an incredibly glaring oversight in the design of the language?
[Edit] Example: C++/CLI has two different types of pointers, which cannot be mixed. They introduced separate notation so that this requirement could be enforced by the compiler:
int* a; //Normal pointer
int^ b; //Managed pointer
//pretend a is assigned here
b = a; //Compiler error!
Is there a reason (other than laziness/oversight) that CUDA does not do the same thing?
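For concreteness, a minimal sketch of the kind of mix-up meant here (names are made up); both pointers have exactly the same C type, so the compiler cannot object:

const size_t N = 1024;
int *h_data = (int *)malloc(N * sizeof(int));  // host pointer
int *d_data;
cudaMalloc(&d_data, N * sizeof(int));          // device pointer, same C type

h_data[0] = 42;  // fine
d_data[0] = 42;  // compiles cleanly, but dereferences a device pointer on the
                 // host and crashes or silently misbehaves at runtime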
Nvidia's nvcc CUDA C "compiler" is not a full compiler, but a rather simple driver program that calls some other tools (cudafe and the C preprocessor) to separate host and device code, and feeds them to their respective compilers.
Only the device code compiler (cicc, or nvopencc in previous CUDA releases) is provided by Nvidia. The host portion of the code is just passed on to the host's native C compiler, which frees Nvidia from the burden of providing a competitive compiler itself.
Generating error messages on improper pointer use would require parsing the host C code. While that would certainly be possible (teaching e.g. sparse or clang about the CUDA peculiarities), to my knowledge nobody has invested the effort into this so far.
Nvidia has written up a document on the NVIDIA CUDA Compiler Driver NVCC that explains the compilation process and the tools involved in more detail.
All pointers you define are stored in RAM, no matter whether it is a "GPU pointer" or a "CPU pointer", and you then have to copy it to the GPU yourself. There is no GPU or CPU pointer as such; it is just a variable that holds an address of a location in memory. Where you use it is what matters: if you use it on the GPU, the GPU will look up that address in its accessible memory, which can even be a location in RAM if you have pinned it for access from the graphics card.
The most important difference is that you don't have direct access to a location in RAM, because the address space on the CPU is virtual; your data might even be swapped out to a hard drive. This isn't the case on the GPU, where a memory address maps directly to the location. That makes it impossible to unify both address spaces.

CUDA: calling library function in kernel

I know that there is the restriction to call only __device__ functions in the kernel. This prevents me from calling standard functions like strcmp() and so on in the kernel.
At this point I am not able to understand/find the reasons for this. Couldn't the compiler just follow each include in string.h and so on while inlining the calls to strcmp() in the kernel? I guess the reason I am looking for is easy and I am missing something here.
Is the only way to reimplement all the functions and datatypes I need in the kernel computation? Is there a codebase with such reimplementations?
Yes, the only way to use stdlib functions from a kernel is to reimplement them. But I strongly advise you to reconsider this idea, since it's highly unlikely you really need to run code that uses strcmp() on the GPU. Please add more details about your problem, so a better solution can be proposed (I highly doubt that serial string comparison on the GPU is what you really need).
It's hardly possible to simply recompile the whole stdlib for the GPU, since it depends heavily on system calls (like memory allocation) which cannot be used on the GPU (well, in recent versions of the CUDA toolkit you can allocate device memory from a kernel, but it's not the "CUDA way", is supported only by the newest hardware, and is very bad for performance).
Besides, the CPU versions of most functions are far from being "good" for GPUs. So, in the vast majority of cases, compiling your ordinary CPU functions for the GPU would lead to no good, and the compiler doesn't even try it.
Standard functions like strcmp() have not been compiled for the CUDA architecture. I have not seen any standard C libraries for CUDA.
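If you do need a string comparison inside a kernel, a reimplementation along these lines is straightforward (a minimal sketch, not tuned in any way):

// Device-side equivalent of strcmp(): compares byte by byte and
// returns <0, 0 or >0 like the C library version.
__device__ int dev_strcmp(const char *a, const char *b)
{
    while (*a && (*a == *b)) {
        ++a;
        ++b;
    }
    return (unsigned char)*a - (unsigned char)*b;
}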