How to find the CUDA __device__ definition of a function?

I am trying to find the source definition for a specific function, specifically the definition the nvcc compiler uses. The question applies to any function (or symbol, I suppose) used in a __device__ function. Given:
__device__ void Foo() {
    int x = round( 0.0f );
}
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Normally I use Visual Studio's F1 "Go to Definition", or search for "round" in project files, etc. I also search the CUDA Toolkit documentation and CUDA MATH API. In this case, I find the VS cmath definition. But how do I determine which definition the nvcc compiler uses?

What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Disassembly. Most built-in functions exist as stubs in headers that are expanded into inline assembly sequences during a device compiler code-generation pass. There is no input source code to view.
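For example, you can look at the generated code yourself. A hypothetical workflow (mykernel.cu stands in for your own source file, and the architecture flag is illustrative): have nvcc emit PTX, or dump the final SASS from a cubin, and see what round() actually lowers to:
# Emit the device code as PTX (virtual assembly)
nvcc -arch=sm_70 -ptx mykernel.cu -o mykernel.ptx
# Or build a cubin and dump the final machine code (SASS)
nvcc -arch=sm_70 -cubin mykernel.cu -o mykernel.cubin
cuobjdump -sass mykernel.cubin
In the output, round() typically appears as a short inlined instruction sequence rather than a call to any visible function definition.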

Related

PyCuda C++ kernel "error: this declaration may not have extern "C" linkage"

I tried using std::tuple in my kernel code, but received many "error: this declaration may not have extern "C" linkage" errors that pointed to the <utility> and <tuple> headers.
It complains on the include. The following repros for me.
from pycuda.compiler import SourceModule
mod = SourceModule("""#include <tuple>""")
Do I need to do something special in my kernel code or in my Python code to specify I want to use the C++ compiler?
Cuda version: 11.8
PyCuda version: 2022.2.1
Do I need to do something special in my kernel code or in my Python code to specify I want to use the C++ compiler?
To be clear, you are using the C++ compiler. But PyCUDA automagically wraps the code you pass into a SourceModule instance in extern “C” unless you explicitly tell it not to:
Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling.
The underlying reason from a C++ perspective is that templated instances of types and functions can’t resolve with C linkage, thus the error.
However, even after you fix that problem, prepare to be disappointed. CUDA supports a lot of C++ language features, but it doesn't support the standard library, and you can't use std::tuple within kernel code. NVIDIA does provide its own (very limited) reimplementation of the C++ standard library, and it does have a basic tuple type. That might work for you.
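A minimal sketch of both fixes together (the kernel name is illustrative, and the <cuda/std/tuple> header assumes the libcu++ version bundled with your toolkit provides a tuple type, which CUDA 11.8's should):
from pycuda.compiler import SourceModule

# no_extern_c=True stops PyCUDA wrapping everything in extern "C",
# so C++ headers can be included; the kernel itself is then marked
# extern "C" so its name stays unmangled for get_function().
mod = SourceModule("""
#include <cuda/std/tuple>

extern "C" __global__ void kernel()
{
    cuda::std::tuple<int, float> t(1, 2.0f);
}
""", no_extern_c=True)

kernel = mod.get_function("kernel")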

How to redefine malloc/free in CUDA?

I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I searched for this error; it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ functions at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names. But don't try to redefine anything from the standard library in device or host code. It will break things.
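As an illustration, a minimal sketch of the namespace workaround (the myheap namespace and the trivial bump allocator are hypothetical stand-ins, not a real API):
#include <cstddef>

namespace myheap {
__device__ char pool[1 << 20];            // fixed device-side pool
__device__ unsigned long long offset = 0; // next free byte

// Same shape as the asker's functions, but not named ::malloc/::free
__device__ void* malloc(std::size_t n) {
    unsigned long long p = atomicAdd(&offset, (unsigned long long)n);
    return (p + n <= sizeof(pool)) ? pool + p : nullptr;
}
__device__ void free(void*) { } // bump allocator: freeing is a no-op
}

__global__ void kernel() {
    int* x = static_cast<int*>(myheap::malloc(sizeof(int)));
    if (x) *x = 42;
}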

matrix as a parameter in CUDA

I have two problems in CUDA programming.
I want to pass a matrix as a function parameter in a CUDA program. I tried the following. GCC compiles the following code, but the NVIDIA CUDA compiler does not, and prompts an error. (I have installed CUDA 7.5.)
void printMatrix( size_t rows, size_t cols, int a[][cols] )
and
void printMatrix(int row, int col, int matrix[row][col])
Neither works; both give the "a parameter is not allowed" error.
Inside the main method I want to declare a matrix
int a[n][n];
where n runs from 1 to 5 (in a for loop). It gives an "expression must have a constant value" error.
Where am I making the error?
I have tried to compile the code from this question with gcc and nvcc compiler, gcc compiles and nvcc does not.
Because you are doing this on Windows, nvcc (note that nvcc is a compiler driver, not a compiler itself) uses the Visual Studio compiler to compile host code. Visual Studio does not support C99 language features, so you cannot use them in any host code you will compile on Windows in conjunction with CUDA. You will have to rewrite your code without using C99 language features in your host code.
If you were doing this on Linux, you would be using gcc to compile the host code via nvcc, and C99 language features would be available if you provided the correct command-line options and passed the file with a .c extension, as you have ably demonstrated in your question. C99 features and CUDA cannot be mixed in a .cu file, because CUDA requires a C++ compiler to compile host code containing CUDA language extensions.
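As an illustration, a minimal C++-friendly sketch of the usual workaround (printMatrix and the row-major indexing are illustrative, not taken from the question's full code): flatten the matrix and index it manually, and use heap storage for the run-time size:
#include <cstdio>
#include <cstddef>
#include <vector>

// C++ has no C99 variable-length arrays, so pass a pointer
// plus dimensions instead of int a[][cols]
void printMatrix(std::size_t rows, std::size_t cols, const int *a)
{
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c)
            std::printf("%d ", a[r * cols + c]); // element (r, c), row-major
        std::printf("\n");
    }
}

int main()
{
    for (std::size_t n = 1; n <= 5; ++n) {
        std::vector<int> a(n * n, 0); // heap storage replaces int a[n][n]
        printMatrix(n, n, a.data());
    }
    return 0;
}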

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray), and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I set up MATLAB to recognize this type?
The short answer is that you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain if it is to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
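For reference, a hedged sketch of the first option (the scaling operation is purely illustrative): double2 has the same layout as cuDoubleComplex, with .x holding the real part and .y the imaginary part, so the kernel can be written against double2 directly:
__global__ void mykernel(double2 *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double2 v = a[i];
        a[i] = make_double2(2.0 * v.x, 2.0 * v.y); // scale the complex value
    }
}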

Visual Studio + Nsight : __syncthreads() undefined [duplicate]

At the moment Visual Studio already recognizes key CUDA C/C++ functions such as cudaMalloc, cudaFree, cudaEventCreate, etc.
It also recognizes certain types like dim3 and cudaEvent_t.
However, it doesn't recognize other functions and types, such as the texture template, the __syncthreads function, or the atomicCAS function.
Everything compiles just fine, but I'm tired of seeing red underlining all over the place, and I want to see the parameter hints displayed when I type in any recognized function.
How do I get VS to catch these functions?
You could create a dummy #include file of the following form:
#pragma once
#ifdef __INTELLISENSE__
void __syncthreads();
...
#endif
This should hide the fake prototypes from the CUDA and Visual C++ compilers, but still make them visible to IntelliSense.
Source for __INTELLISENSE__ macro: http://blogs.msdn.com/b/vcblog/archive/2011/03/29/10146895.aspx
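A slightly fuller version of such a header might look like this (a hedged sketch; the exact set of declarations is up to you, and they only need to be close enough for IntelliSense, since no real compiler ever sees them):
#pragma once
#ifdef __INTELLISENSE__
// Seen only by IntelliSense, never by nvcc or cl.exe
void __syncthreads();
int atomicCAS(int *address, int compare, int val);
#define __global__
#define __device__
#define __shared__
#endif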
You need to add CUDA-specific keywords like __syncthreads to the usertype.dat file for Visual Studio. An example usertype.dat file is included with the NVIDIA CUDA SDK. You also need to make sure that Visual Studio recognizes .cu files as C/C++ files, as described in this post:
Note however that where that post uses $(CUDA_INC_PATH), with recent versions of CUDA you should use $(CUDA_PATH)/include.
Also, I would recommend Visual Assist X -- not free, but worth the money -- to improve intellisense. It works well with CUDA if you follow these instructions:
http://www.wholetomato.com/forum/topic.asp?TOPIC_ID=5481
http://forums.nvidia.com/index.php?showtopic=53690