I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I search for this error, it seems CUDA doesn't allow us to redefine libary function, is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ the Heap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or using different function names. But don't try and redefine anything from the standard library in device or host code. It will break things.
Related
I have a specific function I am trying to find the source definition for, specifically what the nvcc compiler is using. This question is phrased to apply to any function (or symbol I suppose), which is used in a __device__ function. Given:
__device__ void Foo(){
int x = round( 0.0f );
}
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Normally I use Visual Studio's F1 "Go to Definition", or search for "round" in project files, etc. I also search the CUDA Toolkit documentation and CUDA MATH API. In this case, I find the VS cmath definition. But how do I determine which definition the nvcc compiler uses?
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Disassembly. Most inbuilt functions exist as stubs in headers that are expanded into inline assembly sequences as part of a device compiler code generating pass. There is no input code to view.
Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
unsigned ret;
asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
As the following error implies, calling a host function ('rand') is not allowed in kernel, and I wonder whether there is a solution for it if I do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately you can not call functions in device that are not specified with __device__ modifier. If you need in random numbers in device code look at cuda random generator curand http://developer.nvidia.com/curand
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the functions are compiled: one callable by host code and another callable by device code. And this is why this function can now be called both by host and device code.
The short answer is that here is no solution to that issue.
Everything that normally runs on a CPU must be tailored for a CUDA environment without any guarantees that it is even possible to do. Host functions are just another name in CUDA for ordinary C functions. That is, functions running on a CPU-memory Von Neumann architecture like all C/C++ has been up to this point in PCs. GPUs give you tremendous amounts of computing power but the cost is that it is not nearly as flexible or compatible. Most importantly, the functions run without the ability to access main memory and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While not applicable to 'rand()' but a few host functions like "printf" are available when compiling with compute compatibility >= 2.0
e.g:
nvcc.exe -gencode=arch=compute_10,code=\sm_10,compute_10\...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
Compiles and works with sm_20,compute_20
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code like a CD which you put into a CD player; and on the device code like a, say, SD card which you put into a a miniature music player. OP's question is "how can I shove a disc into my miniature music player"? You can't, and it makes no sense to want to. It might be the same music essentially (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.
I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray), and gםt the error message: "unsupported type in argument specification cuDoubleComplex".
how do I set MATLAB to know this type?
The short answer, you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need either modify you code to use double2 in place of cuDoubleComplex, or supply Matlab with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
I know that non-POD types can't be passed as parameters to CUDA kernel launches in general.
But where I can find an explanation for this, I mean a reliable source like a book, a CUDA manual, etc.
The entire premise of this question is incorrect. CUDA kernel arguments are not limited to POD types.
You are free to pass any complete type as an argument, either by reference or by value. There is a limit of 255 bytes or 4kb for the total size of the argument list depending on which architecture you compile for, but that is the only restriction on kernel arguments. When passing an instance of a class to a CUDA kernel, there are a number of simple restrictions which you must follow, including:
Any pointers in the class instance which the device code will dereference must be valid device pointers
Any member functions in the class which the the device code will call must be valid __device__ functions
Is it illegal to pass a class containing virtual functions or derived from a virtual base type as as a kernel argument
Classes which access namespace anonymous unions are not supported in device code
All of the features and limitations of C++ support in CUDA kernel code is described in the CUDA Programming Guide, a copy of which ships in every version of the CUDA toolkit. All you need to do is read it.