Global device variables in CUDA: bad practice?

I am designing a library that has a large contingent of CUDA kernels to perform parallel computations. All the kernels will be acting on a common object, say a computational grid, which is defined using C++ style objects. The computational domain doesn't necessarily need to be accessed from the host side, so creating it on the device side and keeping it there makes sense for now. I'm wondering if the following is considered "good practice":
Suppose my computational grid class is called Domain. First, I define a global device-side variable to store the computational domain:
__device__ Domain* D;
Then I initialize the computational domain using a CUDA kernel:
__global__ void initDomain(paramType P) {
    D = new Domain(P);
}
Then, I perform computations using this domain with other kernels:
__global__ void doComputation(double *x, double *y) {
    D->doThing(x, y);
    //...
}
If my domain remains fixed (i.e. kernels don't modify the domain once it's created), is this OK? Is there a better way? I initially tried creating the Domain object on the host side and copying it over to the device, but this turned out to be a hassle because Domain is a relatively complex type, which makes it a pain to copy over using e.g. cudaMemcpy or even thrust::device_new (at least, I couldn't get it to work nicely).

Yes, it's OK.
You may be able to improve performance by using
__constant__
With this keyword, your object will be available in all your kernels, served from very fast, cached memory.
To copy your object over, you must use cudaMemcpyToSymbol. Note the restrictions: the object will be read-only in your device code, and it must not have a non-trivial constructor (the copy to constant memory is bitwise, so the type must be trivially copyable).
You can find more information in the CUDA C Programming Guide.
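For illustration, here is a minimal sketch of how that could look, assuming the question's Domain can be reduced to a trivially copyable struct (the fields and the setupDomain helper are hypothetical):

// Hypothetical trivially copyable stand-in for the question's Domain.
struct Domain {
    int nx, ny;
    double dx, dy;
};

__constant__ Domain D;   // read-only on the device, served from the constant cache

__global__ void doComputation(double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] = x[i] * D.dx;   // every thread reads D through fast constant memory
}

// Host side: build the object normally, then copy it to the symbol.
void setupDomain(const Domain &h_D)
{
    cudaMemcpyToSymbol(D, &h_D, sizeof(Domain));
}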
If your object is complex and hard to copy, look into unified (managed) memory; then you can construct the object in managed memory on the host and simply pass a pointer to your kernels.
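A minimal sketch of the unified-memory route, reusing Domain, paramType, and doThing from the question (the placement new is one way to run the constructor on the host; note that any pointers held inside Domain would also have to point to managed or device memory):

#include <new>   // placement new

__global__ void doComputation(Domain *D, double *x, double *y)
{
    D->doThing(x, y);
}

void setup(paramType P)
{
    Domain *D = nullptr;
    cudaMallocManaged(&D, sizeof(Domain));  // one allocation visible to host and device
    new (D) Domain(P);                      // run the constructor on the host
    // ... later, launch as usual:
    // doComputation<<<blocks, threads>>>(D, x, y);
}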

Related

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // note the doubled %%: a literal '%' must be escaped inside an inline-PTX string
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
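For completeness, here is the mentioned companion function, following the same pattern, together with a hypothetical kernel (report_smem_sizes is a made-up name) that prints both values:

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

__global__ void report_smem_sizes()
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("dynamic: %u bytes, total: %u bytes\n",
               dynamic_smem_size(), total_smem_size());
    }
}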

CUDA: calling a __host__ function() from a __global__ function() is not allowed [duplicate]

As the following error implies, calling a host function ('rand') is not allowed in kernel, and I wonder whether there is a solution for it if I do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately, you cannot call functions from device code unless they are marked with the __device__ modifier. If you need random numbers in device code, look at the CUDA random number generator cuRAND: http://developer.nvidia.com/curand
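For illustration, a minimal sketch of the cuRAND device API (the kernel name, seed handling, and sizes are arbitrary choices, not part of the original answer):

#include <curand_kernel.h>

__global__ void random_fill(float *out, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState state;
    curand_init(seed, i, 0, &state);   // one subsequence per thread
    out[i] = curand_uniform(&state);   // uniform float in (0, 1]
}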
If you have your own host function that you want to call from a kernel, use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add(int a, int b)
{
    return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the function are generated: one callable from host code and another callable from device code. That is why this function can now be called from both host and device code.
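A hypothetical pair of call sites, just to make the two compiled versions concrete (kernel and host_side are made-up names):

__global__ void kernel(int *out)
{
    *out = add(1, 2);        // the device-compiled version runs here
}

void host_side()
{
    int three = add(1, 2);   // the host-compiled version runs here
    (void)three;
}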
The short answer is that there is no solution to that issue.
Everything that normally runs on a CPU must be tailored for the CUDA environment, with no guarantee that it is even possible. Host functions are just CUDA's name for ordinary C functions: functions running on a conventional CPU/memory von Neumann architecture, as all C/C++ code on PCs has up to this point. GPUs give you tremendous amounts of computing power, but the cost is that they are not nearly as flexible or compatible. Most importantly, device functions cannot access host main memory, and the memory they can access is limited.
If what you are trying to get is a random number generator, you are in luck: Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function and is described in an earlier post of mine here. If anyone finds a better link describing this functionality, please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While this does not apply to rand(), a few host functions like printf are available when compiling with compute capability >= 2.0.
e.g.:
nvcc.exe -gencode=arch=compute_10,code="sm_10,compute_10" ...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
The same code compiles and works with sm_20,compute_20.
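For example, this device-side printf works when compiled for compute capability 2.0 or later (the kernel name is taken from the error message above; the body is illustrative):

#include <cstdio>

__global__ void myKernel()
{
    printf("hello from thread %d\n", threadIdx.x);   // works on sm_20 and later
}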
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code like a CD which you put into a CD player, and of the device code like, say, an SD card which you put into a miniature music player. OP's question is "how can I shove a disc into my miniature music player?" You can't, and it makes no sense to want to. It might essentially be the same music (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

Fastest way to access device vector elements directly on host

I refer you to the following page: http://code.google.com/p/thrust/wiki/QuickStartGuide#Vectors. Please see the second paragraph, where it says:
Also note that individual elements of a device_vector can be accessed
using the standard bracket notation. However, because each of these
accesses requires a call to cudaMemcpy, they should be used sparingly.
We'll look at some more efficient techniques later.
I searched all over the document, but I could not find the more efficient techniques. Does anyone know the fastest way to do this, i.e. how to access a device vector / device pointer from the host?
The "more efficient techniques" the guide alludes to are the Thrust algorithms. It's more efficient to access (or copy across the PCI-E bus) millions of elements at once than it is to access a single element because the fixed cost of CPU/GPU communication is amortized.
There's no faster way to copy data from the GPU to the CPU than calling cudaMemcpy, as it is the most primitive means a CUDA programmer has of implementing the task.
If you have a device_vector which you need to do more processing on, try to keep the data on the device and process it with Thrust algorithms or your own kernels. If you only need to read a few values from the device_vector, just access them directly with bracket notation. If you need to access more than a few values, copy the device_vector over to a host_vector and read the values from there:
thrust::device_vector<int> D;
...
thrust::host_vector<int> H = D;
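Putting the answer's advice together in one sketch (the sizes and the sequence fill are arbitrary):

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <iostream>

int main()
{
    thrust::device_vector<int> D(1000);
    thrust::sequence(D.begin(), D.end());   // fill 0..999 on the device

    // One value: a single hidden cudaMemcpy - fine if done sparingly.
    int first = D[0];

    // Many values: one bulk copy, then cheap host-side reads.
    thrust::host_vector<int> H = D;
    long sum = 0;
    for (int v : H) sum += v;

    std::cout << first << " " << sum << std::endl;
    return 0;
}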

Advantage/Disadvantage of function pointers

So the problem I am having has not actually happened yet. I am planning out some code for a game I am currently working on, and I know that I am going to need to conserve memory usage as much as possible from step one. My question is: if I have, for example, 500k objects that will need to be constantly constructed and destructed, would it save me any memory to have the functions those classes are going to use be function pointers?
e.g. without function pointers:
class MyClass {
public:
    void Update();
    void Draw();
    ...
};
e.g. with function pointers:
class MyClass {
public:
    void (*Update)();
    void (*Draw)();
    ...
};
Would this save me any memory, or would any new creation of MyClass just access the same functions that were defined for the rest of them? If it does save me any memory, would it be enough to be worthwhile?
Assuming those are not virtual functions, you'd use more memory with function pointers.
The first example
There is no per-instance memory cost for the member functions (beyond the minimal object size required so that new returns unique pointers, plus whatever data members are hiding in the ... you ellipsized).
Non-virtual member functions are basically static functions taking a this pointer.
The advantage is that your objects are really simple, and you'll have only one place to look to find the corresponding code.
The disadvantage is that you lose some flexibility, and have to cram all your code into single update/draw functions. This will be hard to manage for anything but a tiny game.
The second example
You allocate two extra pointers per object instance. Pointers are usually 4 to 8 bytes each (depending on your target platform and your compiler).
The advantage is that you gain a lot of flexibility: you can change the function you're pointing to at runtime, and you can have a multitude of functions that implement it, thus supporting (but not guaranteeing) better code organization.
The disadvantage is that it will be harder to tell which function each instance will point to when you're debugging your application, or when you're simply reading through the code.
Other options
Using function pointers this specific way (instance data members) usually makes more sense in plain C, and less sense in C++.
If you want those functions to be bound at runtime, you may want to make them virtual instead. The cost is compiler/implementation dependent, but I believe it is (usually) going to be one v-table pointer per object instance.
Plus, you can take full advantage of virtual function syntax to dispatch based on the object's type, rather than on whatever you bound the pointers to. It will be easier to debug than the function pointer option, since you can simply look at the derived type to figure out which function a particular instance will call.
You also won't have to initialize those function pointers - the C++ type system does the equivalent initialization automatically (by building the v-table and initializing each instance's v-table pointer).
See: http://www.parashift.com/c++-faq-lite/virtual-functions.html
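To make the size trade-off discussed above concrete, a small sketch (the struct names are made up; exact sizes are platform-dependent, with typical 64-bit values noted in the comment):

#include <cstdio>

// Option 1: plain member functions - no per-instance overhead.
struct Plain   { void Update() {} void Draw() {} };

// Option 2: function pointers as data members - two pointers per instance.
struct FnPtrs  { void (*Update)(FnPtrs *self); void (*Draw)(FnPtrs *self); };

// Option 3: virtual functions - one v-table pointer per instance.
struct Virtual { virtual void Update() {} virtual void Draw() {} };

int main()
{
    // Typical 64-bit output: 1 16 8
    printf("%zu %zu %zu\n", sizeof(Plain), sizeof(FnPtrs), sizeof(Virtual));
    return 0;
}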