I have code working on a single GPU. In that code, I used
__device__ uint32_t aaa;
This line at the beginning of the code declares a global variable on the only device involved.
Now I want to use multiple devices (two or more), but I don't know how to allocate global variables in this case.
I think I should use cudaSetDevice(), but I wonder where I should call this function.
When you create a variable like this:
__device__ int myval;
It is created at global scope. An allocation for it is made in the GPU memory of each device that is present when your application is launched.
In host code (when using such functions as cudaMemcpyFromSymbol()), you will be accessing whichever one corresponds to your most recent cudaSetDevice() call. In device code, you will be accessing whichever one corresponds to the device that your device code is executing on.
The __device__ declaration is at global scope (and statically allocated) in your program. Variables at global scope are set up without the help of any runtime activity, so there is no opportunity to specify which devices the variable should be instantiated on; CUDA therefore instantiates such variables on all devices present. Dynamically allocated device variables, on the other hand, are allocated with cudaMalloc and copied with cudaMemcpy, so these calls can be preceded by a cudaSetDevice call in a multi-GPU system. The CUDA runtime then manages those allocations on a per-device basis, which is consistent with the behavior of most CUDA runtime API calls: they operate on the device most recently selected via cudaSetDevice.
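For illustration, here is a minimal sketch of that pattern (the kernel name useAaa is hypothetical): the host selects a device with cudaSetDevice() before touching either the statically declared symbol or a dynamic allocation, and each device gets its own copy of aaa.

#include <cstdint>
#include <cstdio>

__device__ uint32_t aaa;    // one copy is instantiated on every device that is present

__global__ void useAaa()    // hypothetical kernel: reads the copy on the device it runs on
{
    printf("aaa = %u\n", aaa);
}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                              // all following calls target device 'dev'

        uint32_t value = 100 + dev;
        cudaMemcpyToSymbol(aaa, &value, sizeof(value));  // writes this device's copy of 'aaa'
        useAaa<<<1, 1>>>();                              // the kernel sees this device's copy

        uint32_t *dbuf = nullptr;                        // dynamic allocation follows the same rule:
        cudaMalloc((void **)&dbuf, sizeof(uint32_t));    // allocated on device 'dev'
        cudaMemcpy(dbuf, &value, sizeof(value), cudaMemcpyHostToDevice);

        cudaDeviceSynchronize();
        cudaFree(dbuf);
    }
    return 0;
}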
I'm a beginner at CUDA programming and have a question.
When I pass parameters by value, like this:
__global__ void add(int a, int b, int *c) {
// some operations
}
Since variables a and b are passed to the kernel function add by value, I guessed some memory space would be needed to hold the copies.
If I'm right, is that additional memory space (where those parameters are copied) in the GPU or in the host's main memory?
The reason I ask is that I need to pass a big struct to a kernel function.
I also thought about passing a pointer to the struct, but that approach seems to require calling cudaMalloc for the struct and for each of its member variables.
The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API into a dedicated argument buffer on the GPU. At present, this buffer is stored in constant memory and there is a limit of 4 KB of arguments per kernel launch -- see here.
In more detail, the PTX standard (technically, since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device function parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. In the first iteration of CUDA hardware, it actually mapped to shared memory for kernel arguments and registers for device function arguments. However, since compute 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, this is not the same constant memory which maps to the .const state space accessible by defining a variable as __constant__ in CUDA C or the equivalent in Fortran or Python. Rather, it is an internal pool of device memory managed by the driver and not directly accessible to the programmer.
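As an illustration of why you usually don't need cudaMalloc for the struct itself, here is a minimal sketch (the Params struct and its fields are hypothetical): the whole struct is passed by value as a kernel argument, and only the data referenced by pointer members needs a device allocation. This assumes the struct stays within the kernel argument size limit mentioned above.

struct Params {              // hypothetical struct: plain data members travel by value
    int   a;
    int   b;
    float coeffs[16];
    int  *results;           // pointer members must point to device memory
};

__global__ void add(Params p)   // the entire struct is copied into the kernel's parameter buffer
{
    int i = threadIdx.x;
    p.results[i] = p.a + p.b + (int)p.coeffs[i % 16];
}

int main()
{
    Params p = {};
    p.a = 1;
    p.b = 2;
    for (int i = 0; i < 16; ++i) p.coeffs[i] = (float)i;

    cudaMalloc((void **)&p.results, 32 * sizeof(int));   // only the pointed-to data needs cudaMalloc

    add<<<1, 32>>>(p);           // sizeof(Params) is far below the per-launch argument limit
    cudaDeviceSynchronize();

    cudaFree(p.results);
    return 0;
}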
I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared at global (variable) scope over dynamically allocated global memory. The amount of global memory I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use, however, and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be a performance improvement for your code from doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids taking the time to dynamically allocate memory in performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
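To make the difference in host-side access concrete, here is a minimal sketch (buffer and kernel names are hypothetical) contrasting a statically allocated __device__ array with a cudaMalloc allocation; the device-side access inside the kernel looks the same for both.

#include <cstdio>

#define N 256

__device__ float staticBuf[N];          // statically allocated global memory (lives in device DRAM)

__global__ void scale(float *dynBuf)    // hypothetical kernel reading one buffer, writing the other
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) dynBuf[i] = 2.0f * staticBuf[i];
}

int main()
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    // Static allocation: host access goes through the symbol API.
    cudaMemcpyToSymbol(staticBuf, h_in, sizeof(h_in));

    // Runtime-API allocation: host access goes through ordinary cudaMemcpy.
    float *dynBuf = nullptr;
    cudaMalloc((void **)&dynBuf, N * sizeof(float));

    scale<<<(N + 127) / 128, 128>>>(dynBuf);

    cudaMemcpy(h_out, dynBuf, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[10] = %f\n", h_out[10]);  // expected 20.0

    cudaFree(dynBuf);
    return 0;
}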
First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On the latest GPUs, coalesced memory reads give roughly the same performance as constant memory reads, so always make your memory reads and writes as coalesced as possible.
You can also use texture memory (if the data fits into it), which has its own caching mechanism. Texture memory was previously used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing issue still exists. Also, global memory declared at global (variable) scope is not possible in the case of CUDA global memory; the variables that can be declared globally (in the program) are constant memory variables and textures, which we are not required to pass to kernels as arguments.
For memory optimizations, please see the Memory Optimizations section of the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
I have some questions regarding CUDA register memory:
1) Is there any way to free registers in a CUDA kernel? I have scalar variables and 1D and 2D arrays in registers (max array size 48).
2) If I use device functions, what happens to the registers used in a device function after it finishes executing? Will they be available to the calling kernel or to other device functions?
3) How does nvcc optimize register usage? Please share the points important for optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA that takes a lot of registers for computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler makes its own choices, striving for performance while sparing registers.
Register usage can be limited using the --maxrregcount option of the nvcc compiler.
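For illustration, a minimal sketch of both approaches (the kernel name and the chosen limits are arbitrary examples, not values from your code):

// Compilation-unit-wide cap (the value 32 is just an example):
//   nvcc -maxrregcount=32 mykernel.cu
//
// Per-kernel cap via __launch_bounds__, which tells the compiler the intended launch
// configuration so it can limit register usage accordingly (hypothetical kernel):
__global__ void __launch_bounds__(256, 4)   // at most 256 threads/block, target 4 blocks per SM
scaleKernel(float *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2.0f;
}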
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached on-chip, close to the registers, much like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple __global__ kernel calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper use of streams and asynchronous data transfers, is very successful most of the time.
Regarding the use of device functions, I'm not sure, but I guess the calling function's register contents will be moved/stored to local memory (L1 cache or so), in the same way register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can be used again by the caller, refilled with the previously saved content.
Keep in mind that small device functions defined in the same source file as the global kernel may be inlined by the compiler for performance reasons; when this happens, the resulting kernel will in general require more registers.
I have a basic question about calling a device function from a global CUDA kernel. Can we specify the number of blocks and threads when calling a device function?
I posted a question earlier about a min reduction (here) and I want to call that function inside another global kernel. However, the reduction code needs a certain number of blocks and threads.
There are two types of functions that can be called on the device:
__device__ functions are like ordinary C or C++ functions: they operate in the context of a single (CUDA) thread. It's possible to call these from any number of threads in a block, but from the standpoint of the function itself, it does not automatically create a set of threads like a kernel launch does.
__global__ functions or "kernels" can only be called using a kernel launch method (e.g. my_kernel<<<...>>>(...); in the CUDA runtime API). When calling a __global__ function via a kernel launch, you specify the number of blocks and threads to launch as part of the kernel configuration (<<<...>>>). If your GPU is of compute capability 3.5 or higher, then you can also call a __global__ function from device code (using essentially the same kernel launch syntax, which allows you to specify blocks and threads for the "child" kernel). This employs CUDA Dynamic Parallelism which has a whole section of the programming guide dedicated to it.
There are many CUDA sample codes that demonstrate:
calling a __device__ function, such as simpleTemplates
calling a __global__ function from the device, such as cdpSimplePrint
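Here is a minimal sketch contrasting the two (the kernel and function names are hypothetical; the child launch assumes compute capability 3.5+ and compilation with relocatable device code, i.e. -rdc=true plus linking against cudadevrt):

// An ordinary __device__ function: it runs in the context of the single calling thread
// and no launch configuration is involved.
__device__ int smaller(int a, int b)
{
    return a < b ? a : b;
}

// A hypothetical child kernel, used to illustrate CUDA Dynamic Parallelism.
__global__ void childReduce(const int *data, int n, int *result)
{
    // ... the min-reduction over 'data' would go here ...
}

__global__ void parentKernel(const int *data, int n, int *result)
{
    // Calling a __device__ function: a plain per-thread call, no <<<...>>> involved.
    int firstTwoMin = smaller(data[0], data[1]);
    if (threadIdx.x == 0 && blockIdx.x == 0) *result = firstTwoMin;

    // Calling a __global__ function from the device: a real kernel launch, so we
    // specify blocks and threads for the child grid.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        childReduce<<<(n + 255) / 256, 256>>>(data, n, result);
    }
}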
To avoid really long and incoherent functions, I am calling a number of device functions from a kernel. I allocate a shared buffer at the beginning of the kernel call (which is per thread block) and pass pointers to it to all the device functions that perform some processing step in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a global function, how can the device functions that I pass a pointer to distinguish between the possible address spaces (global device memory or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a __shared__ modifier, according to the CUDA Programming Guide. The only ways, IMHO, it could be implemented are:
a) by putting markers on the allocated memory,
b) by passing invisible parameters with the call, or
c) by having a virtual unified address space with separate segments for global and shared memory, so that a threshold check on the pointer can be used.
So my question is: do I need to worry about this, or how should one proceed otherwise without inlining all functions into the main kernel?
===========================================================================================
As an aside, I was horrified to find today that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect that I have to declare all device functions inline, and the separation of header/source files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function, how can the device functions that I pass a pointer to distinguish between the possible address spaces (global device memory or shared memory) that the pointer could refer to?
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
As an aside, I was horrified to find today that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect that I have to declare all device functions inline, and the separation of header/source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.
This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined, and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability <2.0, you may have encountered the warning 'Warning: Cannot tell what pointer points to, assuming global memory space.' This is what you get when the compiler can't follow your pointers around correctly.
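For illustration, here is a minimal sketch (names are hypothetical) of the pattern described in the question: a __shared__ buffer is allocated in the kernel and a pointer to it is passed to a __device__ helper. With the helper inlined (or, on compute capability >= 2.0, with generic addressing), the compiler resolves the address space for you.

__device__ void stageOne(float *buf, int tid)
{
    buf[tid] *= 2.0f;                    // 'buf' happens to point to shared memory here
}

__global__ void processKernel(float *globalData)
{
    __shared__ float buf[256];           // per-block shared buffer (launch with 256 threads/block)

    int tid = threadIdx.x;
    buf[tid] = globalData[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    stageOne(buf, tid);                  // pointer to the shared buffer passed to a device function
    __syncthreads();

    globalData[blockIdx.x * blockDim.x + tid] = buf[tid];
}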