CUDA: Why must texture references be declared globally at file scope?

I am trying to understand how texture memory works. I read in the book CUDA by Example that
texture references must be declared globally at file scope,
but I could not find the reason why. Are there any pointers that would help me understand this in a little more detail?
As pointed out by njuffa in a comment below, some hardware limitations caused this restriction to be imposed at the HLL level, but what are those limitations, and how does the restriction follow from them?

Disclaimer: What follows is speculation about a possible reason, based on my experience with graphics programming. It should not be taken as ground truth, or even as correct.
This is how it has been done in shaders for ages.
The graphics pipeline is a state machine, and every state transition is global. The current snapshot reflects the current state of the driver and hardware, and affects the entire program.
Traditionally, shader languages reference resources at file scope, at the top of the file, be it a texture buffer, a sampler or any other resource. On the application side, you bind the appropriate buffer to the appropriate texture slot and bind a sampler object. Then, usually, you can use the bound resources in any of the shader stages.
I guess that when CUDA was developed, nobody bothered to change the high-level representation, although that would certainly be possible. For example, one could allow texture resources to be declared in any scope, but from the hardware's point of view the resource binding would stay global anyway, and would have to be unbound or rebound to another resource when leaving the scope. The compiler could of course insert those instructions. The implementation would be similar to that of destructors in C++.
One reason not to change the API might be to reflect explicitly that state transitions are global and usually very costly, so hiding this from the API user would not be wise (for performance reasons).
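The destructor analogy can be sketched in plain C++. This is a host-only model under loud assumptions: `g_boundTexture` stands in for a single global texture slot, and `ScopedBinding` and `bindingInsideNestedScope` are hypothetical names. None of it is CUDA API; it only shows how a scope-local declaration could be lowered onto global binding state.

```cpp
#include <string>
#include <utility>

// Stand-in for one global hardware texture slot (hypothetical, not CUDA API).
static std::string g_boundTexture = "none";

// Binds a resource for the lifetime of a scope and restores the previous
// binding on exit, the way a compiler could emit a rebind at scope end.
class ScopedBinding {
public:
    explicit ScopedBinding(std::string resource)
        : previous_(std::exchange(g_boundTexture, std::move(resource))) {}
    ~ScopedBinding() { g_boundTexture = std::move(previous_); }
    ScopedBinding(const ScopedBinding&) = delete;
    ScopedBinding& operator=(const ScopedBinding&) = delete;

private:
    std::string previous_;  // binding that was active before this scope
};

// Returns the binding that is visible inside the innermost scope.
std::string bindingInsideNestedScope() {
    ScopedBinding outer("heightmap");
    std::string seen;
    {
        ScopedBinding inner("noise");
        seen = g_boundTexture;  // the global slot now holds "noise"
    }
    return seen;
}
```

Every scope entry and exit touches the global slot, which is exactly the hidden cost the last paragraph warns about.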

How to implement handles for a CUDA driver API library?

Note: The question has been updated to address the questions that were raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime and Driver APIs.
The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with Driver and Runtime contexts and with multiple threads and devices. The documentation lists several scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of the information does not seem to be entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the simplest usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
If the library can be implemented internally using only the Runtime API, things may be clear: context management is solely the responsibility of the user. If they create their own driver context, the rules stated in the documentation about Runtime and Driver context management apply. Otherwise, the Runtime API functions take care of handling the primary contexts.
However, a library may have to use the Driver API internally, for example in order to load PTX files as CUmodule objects and obtain the CUfunction objects from them. When the library should behave like a Runtime library for the user, but has to use the Driver API internally, some questions arise about how the context handling has to be implemented "under the hood".
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
CUcontext context;
boolean usingPrimaryContext;
CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
cuInit(0);
// Obtain the current context
CUcontext context;
cuCtxGetCurrent(&context);
CUdevice device;
// If there is no context, use the primary context
boolean usingPrimaryContext = false;
if (context == nullptr)
{
usingPrimaryContext = true;
// Obtain the device that is currently selected via the runtime API
int deviceIndex;
cudaGetDevice(&deviceIndex);
// Obtain the device and its primary context
cuDeviceGet(&device, deviceIndex);
cuDevicePrimaryCtxRetain(&context, device);
cuCtxSetCurrent(context);
}
else
{
cuCtxGetDevice(&device);
}
// Create the actual handle. This might internally allocate
// memory or do other things that are specific for the context
// for which the handle is created
Handle handle = new Handle(device, context, usingPrimaryContext);
return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
cuCtxSetCurrent(handle.context);
callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
if (handle.usingPrimaryContext)
{
cuDevicePrimaryCtxRelease(handle.device);
}
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)
I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expansion of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means that your outwards-facing class is not a handle; it represents the object to which you have a handle.
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and, to be honest, I have almost never used CUBLAS), but if it were well designed and implemented, it would
create its own context and try not to impinge on the rest of your code, i.e. it would always do the following:
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
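The push, work, pop sequence maps naturally onto a scope guard. The sketch below is a host-only model: `g_contextStack` stands in for the per-thread context stack that the driver manipulates through cuCtxPushCurrent and cuCtxPopCurrent, a context is just an integer, and `ScopedContext`, `currentContext` and the body of `someLibraryFunction` are hypothetical. In real code the constructor and destructor would call the two driver functions instead.

```cpp
#include <vector>

// Host-only stand-in for the per-thread driver context stack.
// A "context" is just an integer id here; nothing below calls CUDA.
using Context = int;
static thread_local std::vector<Context> g_contextStack;

// Returns the current context, or 0 when the stack is empty.
Context currentContext() {
    return g_contextStack.empty() ? 0 : g_contextStack.back();
}

// Pushes the library's context on construction and pops it on destruction,
// so the caller's current context is restored even on early return or throw.
class ScopedContext {
public:
    explicit ScopedContext(Context ctx) { g_contextStack.push_back(ctx); }
    ~ScopedContext() { g_contextStack.pop_back(); }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
};

// Hypothetical library entry point: works under its own context and
// reports which context was current while it ran.
Context someLibraryFunction(Context libraryContext) {
    ScopedContext guard(libraryContext);
    return currentContext();
}
```

With this shape, whatever context the caller had made current before the call is current again afterwards, which is the "do not impinge" property described above.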
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources.
Allow for both reference-type and value-type use of Handles, i.e. it may be a handle I created, but it may also be a handle I got from somewhere else that isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take on that responsibility.
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish wrappers for CUDA contexts, streams, modules, devices etc. without even knowing about raw handles for anything.
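The first two recommendations (RAII, plus owning and non-owning use of the same wrapper type) could look roughly like the following sketch. `acquireRaw` and `releaseRaw` are hypothetical stand-ins for a driver's acquire/release pair, such as retaining and releasing a primary context; here they only count calls so that the behavior can be checked without a GPU. This is not how cuda-api-wrappers or CUBLAS implement it.

```cpp
#include <utility>

// Hypothetical stand-ins for a driver's acquire/release pair.
static int g_liveResources = 0;
int acquireRaw() { ++g_liveResources; return 1; }
void releaseRaw(int) { --g_liveResources; }

// Move-only wrapper around a raw handle. It releases the resource in its
// destructor only when it owns it, so it can also wrap a handle that is
// someone else's responsibility.
class Handle {
public:
    static Handle create() { return Handle(acquireRaw(), true); }
    static Handle wrap(int raw) { return Handle(raw, false); }

    Handle(Handle&& other) noexcept
        : raw_(other.raw_), owning_(std::exchange(other.owning_, false)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            reset();
            raw_ = other.raw_;
            owning_ = std::exchange(other.owning_, false);
        }
        return *this;
    }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    ~Handle() { reset(); }

    int raw() const { return raw_; }

private:
    Handle(int raw, bool owning) : raw_(raw), owning_(owning) {}
    void reset() {
        if (owning_) { releaseRaw(raw_); owning_ = false; }
    }
    int raw_;
    bool owning_;
};

// Returns the number of live resources while three wrappers are in scope,
// or -1 if the non-owning view does not alias the owned handle.
int liveWhileInScope() {
    Handle owner = Handle::create();
    Handle view = Handle::wrap(owner.raw());  // non-owning alias
    Handle moved = std::move(owner);          // ownership transfers once
    return view.raw() == moved.raw() ? g_liveResources : -1;
}
```

The design choice is that ownership is a runtime flag rather than a separate type; two distinct types (an owner and a view) would catch misuse at compile time at the cost of a larger interface.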
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.

How can RISC-V SYSTEM instructions be implemented as trap?

I am currently studying the specifications for RISC-V with specification version 2.2 and Privileged Architecture version 1.10. In Chapter 2 of RISC-V specification, it is mentioned that "[...] though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps [...]"
However, when I look at the privileged specification, the instruction MRET is also a SYSTEM instruction, and it is required to return from a trap. Right now I am confused about how much of the Machine-level ISA is required: is it possible to omit all M-level CSRs and use a software handler for all SYSTEM instructions, as stated in the specification? If so, how does one pass in information such as the return address and trap cause? Is that done through the regular registers x1-x31?
Alternatively, is it enough to implement only the following M-level CSRs, if I am aiming for a simple embedded core with only M-level privilege?
mvendorid
marchid
mimpid
mhartid
misa
mscratch
mepc
mcause
Finally, how many of these CSRs can be omitted?
ECALL/EBREAK instructions are traps anyway. CSR instructions need to be carefully parsed to make sure they specify existent registers being accessed in allowed modes, which sounds like a job for your favorite sparse matrix, whether PLA or if/then.
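As a rough illustration of the "if/then" variant of that parsing, here is a sketch of a SYSTEM decoder for a minimal M-mode-only core that implements just the CSRs listed in the question. The encodings and CSR addresses are the ones from the RISC-V specifications; `SystemOp`, `isImplementedCsr` and `decodeSystem` are hypothetical names, and treating everything else (WFI included) as illegal is an assumption of this sketch, not a statement about what the specification requires.

```cpp
#include <cstdint>

enum class SystemOp { Ecall, Ebreak, Mret, CsrAccess, Illegal };

// CSRs the question lists for a minimal M-mode-only core.
bool isImplementedCsr(std::uint32_t csr) {
    switch (csr) {
        case 0xF11:  // mvendorid
        case 0xF12:  // marchid
        case 0xF13:  // mimpid
        case 0xF14:  // mhartid
        case 0x301:  // misa
        case 0x340:  // mscratch
        case 0x341:  // mepc
        case 0x342:  // mcause
            return true;
        default:
            return false;
    }
}

SystemOp decodeSystem(std::uint32_t insn) {
    if ((insn & 0x7F) != 0x73) return SystemOp::Illegal;  // not SYSTEM
    const std::uint32_t funct3 = (insn >> 12) & 0x7;
    const std::uint32_t rs1 = (insn >> 15) & 0x1F;  // register index or uimm
    const std::uint32_t csr = insn >> 20;

    if (funct3 == 0) {
        if (insn == 0x00000073) return SystemOp::Ecall;
        if (insn == 0x00100073) return SystemOp::Ebreak;
        if (insn == 0x30200073) return SystemOp::Mret;
        return SystemOp::Illegal;  // e.g. WFI, not handled in this sketch
    }
    if (funct3 == 4) return SystemOp::Illegal;  // not a CSR encoding

    // CSRRW/CSRRWI always write; CSRRS/CSRRC and their immediate forms
    // write only when rs1 (or uimm) is non-zero.
    const bool isWrite = ((funct3 & 0x3) == 1) || rs1 != 0;
    const bool readOnly = ((csr >> 10) & 0x3) == 0x3;  // csr[11:10] == 0b11
    if (!isImplementedCsr(csr)) return SystemOp::Illegal;
    if (isWrite && readOnly) return SystemOp::Illegal;
    return SystemOp::CsrAccess;
}
```

The privilege check on csr[9:8] is omitted because a core with only M-mode always runs at the highest level; a core with more modes would need it.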
You could emulate all SYSTEM instructions, but, as you see, you need to be able to access information inside the hardware that is not part of the normal ISA. This implies that you need to add "instruction extensions."
I would also recommend making the SYSTEM instructions atomic, meaning that exceptions should be masked or avoided within each emulated instruction.
Since I am not a very trusting person, I would create a new mode that would enable the instruction extensions that would allow you to read the exception address directly from the hardware, for example, and fetch instructions from a protected area of memory. Interrupts would be disabled automatically. The mode would be exited by branching to epc+4 or the illegal instruction handler. I would not want to have anything outside the RISC-V spec available even in M-mode, just to be safe.
In my experience, it is better to say "I do everything," than it is to explain to each customer, or worse, have a competitor explain to your customers, what it is that you do not do. But perhaps someone who knows the CSRs better could help; it is not something I do.

Function call and context save to stack

I am very interested in real-time operating systems for microcontrollers, so I am researching the topic in depth. At a high level I understand all the general mechanisms of an OS.
To learn more, I decided to write a very simple kernel that does nothing but the context switch. This raised a lot of additional, practical questions. I was able to cope with many of them, but I am still in doubt about the main thing: saving the context (all the CPU registers and the stack pointer) of the current task and restoring the context of a new task.
In general, an OS uses some function (let's say OSContextSwitch()) that performs all the actions for the context switch. The body of OSContextSwitch() is mainly written in assembly (inline assembly in the body of a C function). But when OSContextSwitch() is called by the scheduler, as far as I know, some of the CPU registers are saved on the stack on the function call by the compiler (actually, by the code the compiler generates).
Finally, the question is: how do I know which CPU registers have already been saved to the stack by the compiler, so that I can save the rest? If I saved all the registers regardless of the compiler's behaviour, there would obviously be some stack leakage.
Such a function should be written either in pure assembly (so NOT as an assembly block inside a C function) or as a "naked" C function containing nothing more than an assembly block. Doing anything in between is a straight road to messing things up.
As for which registers to save: you need to know the ABI for your platform, which specifies that some registers are saved by the caller and some by the callee. You generally need to save and restore only the ones that are normally saved by the callee. If you save all of them, nothing will go wrong; your code will only be slightly slower and use a little more RAM, and this is a good place to start.
Here's a typical context switch implementation for ARM Cortex-M microcontrollers - https://github.com/DISTORTEC/distortos/blob/master/source/architecture/ARM/ARMv6-M-ARMv7-M/ARMv6-M-ARMv7-M-PendSV_Handler.cpp#L76
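To make the caller-saved/callee-saved split concrete for Cortex-M, here is a host-side sketch of the saved-context layout and of preparing the stack for a task that has never run. On exception entry the hardware stacks r0-r3, r12, lr, pc and xPSR by itself; the switch code then has to save r4-r11, which are the callee-saved registers. The order in which r4-r11 are stored, the names `FrameIndex` and `initializeStack`, and the absence of floating-point state are assumptions of this sketch and are not taken from the linked implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Word offsets within a saved context, lowest address first.
enum FrameIndex : std::size_t {
    // Saved by the context-switch code (callee-saved registers).
    R4, R5, R6, R7, R8, R9, R10, R11,
    // Stacked automatically by Cortex-M hardware on exception entry.
    R0, R1, R2, R3, R12, LR, PC, XPSR,
    FrameWords  // total size of the saved context in 32-bit words
};

// Prepares the stack of a new task so that the first "restore" behaves like
// a return into entryAddress. Returns the initial stack pointer value that
// would be stored in the task control block.
std::uint32_t* initializeStack(std::uint32_t* stackTop,
                               std::uint32_t entryAddress) {
    std::uint32_t* sp = stackTop - FrameWords;
    for (std::size_t i = 0; i < FrameWords; ++i) {
        sp[i] = 0;  // all general-purpose registers start at zero
    }
    sp[PC] = entryAddress;
    sp[XPSR] = 0x01000000u;  // Thumb state bit must be set
    return sp;
}
```

A real port would also put a meaningful value into the stacked LR (for example the address of a task-exit handler) and keep the stack 8-byte aligned.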

cudaMemcpy D2D flag - semantics w.r.t. multiple devices, and is it necessary?

I've not had the need before to memcpy data between 2 GPUs. Now, I'm guessing I'm going to do it with cudaMemcpy() and the cudaMemcpyDeviceToDevice flag, but:
Is the cudaMemcpyDeviceToDevice flag used both for copying data within a single device's memory space and between the memory spaces of all devices?
If it is,
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
And if that's the case, then
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver?
For devices with UVA in effect, you can use the mechanism you describe. This doc section may be of interest (both the one describing device-to-device transfers as well as the subsequent section on UVA implications). Otherwise there is a cudaMemcpyPeer() API available, which has somewhat different semantics.
How are pointers to memory on different devices distinguished? Is it using the specifics of the Unified Virtual Address Space mechanism?
Yes, see the previously referenced doc sections.
Why even have the H2D, D2H, D2D flags at all for cudaMemcpy? Doesn't it need to check which device it needs to address anyway?
cudaMemcpyDefault is the transfer flag that was added when UVA first appeared, to enable the use of generically-flagged transfers, where the direction is inferred by the runtime upon inspection of the supplied pointers.
Can't we implement a flag-free version of cudaMemcpy using cuPointerGetAttribute() from the CUDA low-level driver?
I'm assuming the generically-flagged method described above would meet whatever needs you have (or perhaps I'm not understanding this question).
Such discussions could engender the question "Why would I ever use anything but cudaMemcpyDefault"?
One possible reason I can think of to use an explicit flag would be that the runtime API will do explicit error checking if you supply an explicit flag. If you're sure that a given invocation of cudaMemcpy would always be in a H2D transfer direction, for example, then explicitly using cudaMemcpyHostToDevice will cause the runtime API to throw an error if the supplied pointers do not match the indicated direction. Whether you attach any value to such a concept is probably a matter of opinion.
As a matter of lesser importance (IMO), code that uses the explicit flags does not depend on UVA being available, but such execution scenarios are "disappearing" with newer environments.
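The trade-off between a generic flag and an explicit one can be modelled without any CUDA at all. The sketch below is a host-only model and entirely hypothetical: `Space`, `Kind`, `inferKind` and `resolveKind` only mirror the idea of the cudaMemcpyKind flags, and the exception stands in for the error code the runtime would return. It is not how the runtime is implemented.

```cpp
#include <stdexcept>

// Hypothetical model types; nothing here calls the CUDA runtime.
enum class Space { Host, Device };
enum class Kind { HostToHost, HostToDevice, DeviceToHost, DeviceToDevice, Default };

// What a generically flagged copy has to do: derive the direction from
// where the two pointers live.
Kind inferKind(Space src, Space dst) {
    if (src == Space::Host) {
        return dst == Space::Host ? Kind::HostToHost : Kind::HostToDevice;
    }
    return dst == Space::Host ? Kind::DeviceToHost : Kind::DeviceToDevice;
}

// Returns the direction actually used, or throws when an explicit flag
// contradicts the pointer locations.
Kind resolveKind(Kind requested, Space src, Space dst) {
    const Kind actual = inferKind(src, dst);
    if (requested != Kind::Default && requested != actual) {
        throw std::invalid_argument("copy kind does not match pointer locations");
    }
    return actual;
}
```

With `Kind::Default` a swapped source and destination goes unnoticed, whereas an explicit kind turns the same mistake into an error, which is the extra checking described above.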

Garbage Collection

There are a few things about garbage collection that I do not understand.
Firstly, where is data allocated, i.e. on the stack or on the heap? (As far as I know, all static or global variables are assigned space on the stack and local variables are assigned space on the heap.)
Second, does GC run on data on the stack or on the heap? I.e. a GC algorithm like mark/sweep would treat data on the stack as the root set, right? And then it would map all the reachable variables on the heap by checking which variables on the heap refer to the root set.
What if a program does not have a global variable? How does the algorithm work then?
Regards,
darkie
It might help to clarify what platform's GC you are asking about - JVM, CLR, Lisp, etc. That said:
First, to take a step back: certain local variables are generally allocated on the stack. The specifics can vary by language, however. To take C# as an example, only local value types and method parameters are stored on the stack. So, in C#, foo would be allocated on the stack:
public void bar() {
int foo = 2;
...
}
In contrast, dynamically allocated variables use memory from the heap. This should make intuitive sense, as otherwise the stack would have to grow dynamically each time new is called. It would also mean that such variables could only be used as locals within the function that allocated them, which is of course not true, because we can have (for example) class member variables. So, to take another example from C#, in the following case the object that result refers to is allocated on the heap:
public class MyInt
{
public int MyValue;
}
...
MyInt result = new MyInt();
result.MyValue = foo + 40;
...
Now with that background in mind, memory on the heap is garbage-collected. Memory on the stack has no need for GC as the memory will be reclaimed when the current function returns. At a high level, a GC algorithm works by keeping track of all objects that are dynamically allocated on the heap. Once allocated via new, the object will be tracked by GC, and collected when it is no longer in scope and there are no more references to it.
Check out the book Garbage Collection: algorithms for automatic dynamic memory management.
Firstly, how is data allocated space? i.e. on stack or heap (As per my knowledge, all static or global variables are assigned space on stack and local variables are assigned space on heap).
No. The stack holds the frames of method calls and their local variables. A stack frame is created when a method is called and popped off when it returns.
Memory in Java and C# is allocated on the heap by calling "new".
Second, GC runs on data on stacks or heaps? i.e. a GC algorithm like Mark/Sweep would refer to data on stack as root set right? And then map all the reachable variables on heap by checking which variables on heap refer to the root set.
GC is used on the heap.
Mark and sweep would not be considered a cutting edge GC algorithm. Both Java and .NET GC use generational models now.
What if a program does not have a global variable? How does the algorithm work then?
What does "global variable" mean in languages like Java and C# where everything belongs to a class?
The root of the object graph is arbitrary. I'll admit that I don't know how it's chosen.
Read this article. It is a very good survey on uniprocessor garbage collection techniques. It will give you the basic understanding and terminology on GC. Then, follow up with the Jones and Lins book "Garbage Collection: Algorithms for Automatic Dynamic Memory Management". Contrary to the survey article I point to above, the book is not available for free on the Web; you have to buy it; but it is worth it.
Richard and Carl have a very nice show on the Windows Memory Model, including the .NET model and GC, in their .NET Rocks! archives:
Jeffrey Richter on the Windows Memory Model
You might find the short summary of Garbage Collection on the Memory Management Reference useful.
Ultimately, garbage collection has to start at the registers of the processor(s), since any objects that can't be reached by the processor can be recycled. Depending on the language and run-time system, it makes sense to assume statically that the stacks and registers of threads are also reachable, as well as “global variables”.
Stacks probably get you local variables. So in simple GCs you start by scanning thread contexts, their stacks, and the global variables. But that's certainly not true in every case. Some languages don't use stacks or have global variables as such. What's more, GCs can use a barrier so that they don't have to look at every stack or global every time. Some specialised hardware, such as the Symbolics Lisp Machine, had barriers on registers!
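To tie the answers together, here is a toy mark-and-sweep collector. It is a sketch under assumptions, not any real runtime's algorithm: `Object` and `Heap` are hypothetical, and the root set is passed in explicitly as a list of pointers, standing in for whatever the runtime finds in registers, thread stacks and globals. It also shows that globals are not required: the roots can come from the stack alone, and with an empty root set everything is garbage.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Object {
    std::vector<Object*> references;  // outgoing pointers to other objects
    bool marked = false;
};

// Toy heap: owns every allocated object so that sweep can free dead ones.
class Heap {
public:
    Object* allocate() {
        objects_.push_back(std::make_unique<Object>());
        return objects_.back().get();
    }

    // Marks everything reachable from the roots, then frees the rest.
    // Returns the number of objects reclaimed.
    std::size_t collect(const std::vector<Object*>& roots) {
        // Mark phase: depth-first traversal from the root set.
        std::vector<Object*> worklist(roots.begin(), roots.end());
        while (!worklist.empty()) {
            Object* current = worklist.back();
            worklist.pop_back();
            if (current == nullptr || current->marked) continue;
            current->marked = true;
            for (Object* next : current->references) worklist.push_back(next);
        }

        // Sweep phase: keep marked objects, drop everything else.
        const std::size_t before = objects_.size();
        std::vector<std::unique_ptr<Object>> survivors;
        for (auto& object : objects_) {
            if (object->marked) {
                object->marked = false;  // reset for the next cycle
                survivors.push_back(std::move(object));
            }
        }
        objects_ = std::move(survivors);
        return before - objects_.size();
    }

    std::size_t size() const { return objects_.size(); }

private:
    std::vector<std::unique_ptr<Object>> objects_;
};
```

Because liveness is decided by reachability rather than by reference counts, two objects that only point at each other are reclaimed as soon as no root reaches them.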