The behavior of the __CUDA_ARCH__ macro

In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly the code path of the current device.
However, if __CUDA_ARCH__ is used within device code, it will generate a different code path for each device architecture specified in the compilation options (-arch).
Can anyone confirm this is correct?

When used in device code, __CUDA_ARCH__ carries a numeric value that reflects the virtual architecture currently being compiled (e.g. 200 when compiling for compute capability 2.0).
It is not intended to be used in host code. From the nvcc manual:
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
Usage of __CUDA_ARCH__ in host code is therefore undefined (at least by CUDA). As pointed out by @tera in the comments, since the macro is not defined in host code, it can be used to differentiate the host and device paths, for example in a __host__ __device__ function definition:
#ifndef __CUDA_ARCH__
//host code here
#else
//device code here
#endif
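As a minimal self-contained sketch of that pattern (the function name and bodies here are illustrative, not from the question):

__host__ __device__ int clamp_to_byte(int v)
{
#ifndef __CUDA_ARCH__
    // Host path: __CUDA_ARCH__ is never defined when the host compiler runs.
    return v < 0 ? 0 : (v > 255 ? 255 : v);
#else
    // Device path: __CUDA_ARCH__ holds the virtual architecture, e.g. 200 for sm_20.
    return min(max(v, 0), 255);
#endif
}

The same source lines compile differently in the host and device passes, which is exactly how nvcc processes a .cu file.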

Related

How to find the CUDA __device__ definition of a function?

I have a specific function I am trying to find the source definition for, specifically what the nvcc compiler is using. This question is phrased to apply to any function (or symbol I suppose), which is used in a __device__ function. Given:
__device__ void Foo(){
    int x = round( 0.0f );
}
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Normally I use Visual Studio's F1 "Go to Definition", or search for "round" in project files, etc. I also search the CUDA Toolkit documentation and CUDA MATH API. In this case, I find the VS cmath definition. But how do I determine which definition the nvcc compiler uses?
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Disassembly. Most built-in functions exist as stubs in headers that are expanded into inline assembly sequences during a device-compiler code-generation pass. There is no source definition to view.
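If you want to see what was actually generated, one approach (a sketch; the file and binary names are placeholders) is to inspect the compiler's output directly:

nvcc -arch=sm_20 -ptx kernel.cu -o kernel.ptx   # emit PTX, the virtual-architecture assembly
cuobjdump -sass my_program                      # dump the SASS machine code embedded in a binary

The round() call will show up as an expanded instruction sequence rather than as a call into any viewable source definition.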

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // Read the special register; note the %% escape required for a literal % inside inline PTX.
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size (shown below). This function makes the compiler emit an explicit PTX instruction, just as asm can be used in host code to emit CPU assembly instructions directly. The function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
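For completeness, here is the analogous function for %total_smem_size, plus a small usage sketch (the kernel is illustrative, not part of the original answer):

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

__global__ void report_smem(unsigned *out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        out[0] = dynamic_smem_size();  // bytes passed at launch in <<<grid, block, smem>>>
        out[1] = total_smem_size();    // static + dynamic shared memory per block, in bytes
    }
}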

CUDA: calling a __host__ function() from a __global__ function() is not allowed [duplicate]

As the following error implies, calling a host function ('rand') is not allowed in kernel code, and I wonder whether there is a solution if I really need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately, you cannot call functions from device code unless they are marked with the __device__ qualifier. If you need random numbers in device code, look at the CUDA random number generator cuRAND: http://developer.nvidia.com/curand
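A minimal sketch of the cuRAND device API (the kernel and names are illustrative):

#include <curand_kernel.h>

__global__ void fill_random(float *out, unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);  // arguments: seed, subsequence, offset, state
    out[tid] = curand_uniform(&state);  // uniform float in (0, 1]
}

The device API above only needs the header; linking against -lcurand is required only for the host API.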
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
    return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the function are compiled: one callable from host code and another callable from device code. That is why this function can now be called from both host and device code.
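For instance (a sketch; names are illustrative):

__global__ void add_on_device(int *out)
{
    *out = add(1, 2);  // invokes the device-compiled version
}

void call_from_host()
{
    int sum = add(1, 2);  // invokes the host-compiled version of the same source function
}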
The short answer is that there is no solution to that issue.
Everything that normally runs on a CPU must be tailored for the CUDA environment, with no guarantee that this is even possible. "Host function" is just CUDA's name for an ordinary C function - that is, a function running on a CPU/main-memory Von Neumann architecture, like all C/C++ code on PCs up to this point. GPUs give you tremendous amounts of computing power, but the cost is that they are not nearly as flexible or compatible. Most importantly, the functions run without the ability to access main memory, and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While this is not applicable to 'rand()', a few host functions like "printf" are available when compiling with compute capability >= 2.0.
e.g:
nvcc.exe -gencode=arch=compute_10,code=\"sm_10,compute_10\" ...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
This compiles and works with sm_20,compute_20.
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: think of the host (CPU) code as a CD which you put into a CD player, and of the device code as, say, an SD card which you put into a miniature music player. OP's question is then "how can I shove a disc into my miniature music player?" You can't, and it makes no sense to want to. It might essentially be the same music (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB, with several cuDoubleComplex pointers. I launched the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?
The short answer: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain and still compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration that maps cuDoubleComplex to double2. For example
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in MATLAB as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
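If you take the first route, a sketch of the kernel written with double2 might look like this (double2 has the same layout as cuDoubleComplex: .x is the real part, .y the imaginary part; the body is illustrative):

__global__ void mykernel(double2 *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    double2 v = a[i];
    a[i] = make_double2(v.x * v.x - v.y * v.y, 2.0 * v.x * v.y);  // complex square v*v
}

For the second route, the PTX file passed to parallel.gpu.CUDAKernel can be produced with nvcc -ptx mykernel.cu.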

Write a CUDA program that compiles for both sm_1x and sm_2x

My problem is very similar to the one in this link, but I am not able to fix it.
I have a CUDA program using a CUDA layered texture. This feature is only available on the Fermi architecture (compute capability greater than or equal to 2.0). If the GPU is not Fermi, I use a 3D texture as a substitute for the layered texture. I use __CUDA_ARCH__ in my code when declaring the texture reference (texture references need to be global), like this:
#if __CUDA_ARCH__ >= 200
texture<float, cudaTextureType2DLayered> depthmapsTex;
#else
texture<float, cudaTextureType3D> depthmapsTex;
#endif
The problem I have is that it seems __CUDA_ARCH__ is not defined.
The things I have tried:
1) __CUDA_ARCH__ works correctly within a CUDA kernel. I know from the NVCC documentation that __CUDA_ARCH__ does not work correctly within host code. But I have to define the texture reference as a global variable; does it belong to host code? The extension of the file being compiled is .cu.
2) I have a program that works correctly using a layered texture. Then I add the __CUDA_ARCH__ macro in two ways:
#ifdef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
and
#ifndef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
I found that neither of them works; both produce the same error: error : identifier "depthmapsTex" is undefined. It looks as if the macro __CUDA_ARCH__ is both defined and not defined at the same time. I suspect this relates to the fact that the compilation has two passes (one for host code, where the macro is undefined, and one for device code, where it is defined), and only one of the passes can see __CUDA_ARCH__, but I am not sure exactly what happens.
I use CMake + Visual Studio 10 to set up the project and compile the code, and I wonder whether something is wrong there.
I am not sure if I have provided enough information. Any help is appreciated. Thank you!
Edit:
I tried to find any example that uses __CUDA_ARCH__ in Nvidia CUDA SDK 5.0. The following code is extracted from line 20 to line 24 in file GPUHistogram.h in the project grabcutNPP.
#if __CUDA_ARCH__<300
#define PARALLEL_HISTS 64
#else
#define PARALLEL_HISTS 8
#endif
And from line 216 to line 219, it uses the macro PARALLEL_HISTS:
int gpuHistogramTempSize(int n_bins)
{
    return n_bins * PARALLEL_HISTS * sizeof(int);
}
But I found there is a problem here: PARALLEL_HISTS is not correctly defined. If I change the first clause to #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 300, I find that __CUDA_ARCH__ is not defined. Does the CUDA SDK example use __CUDA_ARCH__ in the wrong way?
I am not sure I understand the exact problem, which may well have an elegant solution. Here is an inelegant brute-force approach I have used in the past: create two kernels with identical signatures but different names (e.g. foo_sm10(), foo_sm20()) in two separate .cu files. Compile one file for sm_10 and the other file for sm_20. Move common code that is independent of compute capability into a header file, and include it from both of the previously mentioned .cu files. In the host code, create a function pointer to invoke the architecture-dependent kernels, and initialize the function pointer to the appropriate architecture-dependent kernel based on the compute capability detected at runtime.
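A sketch of that approach (all names are illustrative; error checking omitted):

// foo_sm10.cu, compiled with -arch=sm_10, defines: __global__ void foo_sm10(float *data)
// foo_sm20.cu, compiled with -arch=sm_20, defines: __global__ void foo_sm20(float *data)

__global__ void foo_sm10(float *data);  // declarations of the kernels defined in the other files
__global__ void foo_sm20(float *data);

void launch_foo(float *d_data, int n)
{
    int dev;
    cudaDeviceProp props;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&props, dev);

    // Pick the architecture-dependent kernel once, based on the device detected at runtime.
    void (*kernel)(float *) = (props.major >= 2) ? foo_sm20 : foo_sm10;
    kernel<<<(n + 255) / 256, 256>>>(d_data);
}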
If you want to figure out the compute capability of your GPU, you could try something like:
int devID;
cudaDeviceProp props;
// CUDA_SAFE_CALL is an error-checking macro from the old SDK's cutil library
CUDA_SAFE_CALL( cudaGetDevice(&devID) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&props, devID) );
float cc = props.major + props.minor * 0.1f;
printf("\n:: CC: %.1f", cc);
But I have no idea how to solve your problem.