CUDA constant memory symbols

I am using CUDA 5.0 and I have modules which are compiled separately.
I would like to access the same value in the constant memory from all modules.
The problem is the following, when I define the symbol in each
module the linker claims that the symbol has been redefined.
Is there a workaround or a solution for this problem?
Thank you for helping.

In CUDA separate compilation mode, there is a true linker, and every symbol which is linked into the final device binary payload must be uniquely defined. This means __constant__ memory symbols must be defined in only one place in all the code which is linked together.
The solution is to declare the symbol as extern at translation unit scope in every module except the one which contains the definition of the symbol. Note that this is the only case where it is valid to use extern with __constant__ symbols; otherwise they are implicitly static. There is a general discussion of the separate compilation model which covers this scenario buried in the documentation (both the programming guide and the nvcc manual, IIRC).
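As a minimal sketch (file and symbol names are illustrative, not from the question), the shared symbol would be declared in a common header and defined in exactly one translation unit:

// common.cuh - included by every module that uses the symbol
extern __constant__ float scale_factor;

// constants.cu - the one and only definition
#include "common.cuh"
__constant__ float scale_factor;

// other_module.cu - uses the symbol, but does not define it
#include "common.cuh"
__global__ void scale(float *data)
{
    data[threadIdx.x] *= scale_factor;
}

All modules must be built with relocatable device code (nvcc -rdc=true) and device-linked together; the host can still set the value with cudaMemcpyToSymbol from any translation unit that sees the extern declaration.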

Related

PyCuda C++ kernel "error: this declaration may not have extern "C" linkage"

I tried using std::tuple in my kernel code, but received many `error: this declaration may not have extern "C" linkage` errors pointing at the <utility> and <tuple> headers.
It complains about the include itself. The following reproduces it for me:
from pycuda.compiler import SourceModule
mod = SourceModule("""#include <tuple>""")
Do I need to do something special in my kernel code or in my Python code to specify I want to use the C++ compiler?
Cuda version: 11.8
PyCuda version: 2022.2.1
Do I need to do something special in my kernel code or in my Python code to specify I want to use the C++ compiler?
To be clear, you are using the C++ compiler. But PyCUDA automagically wraps the code you pass into a SourceModule instance in extern “C” unless you explicitly tell it not to:
Unless no_extern_c is True, the given source code is wrapped in extern “C” { … } to prevent C++ name mangling.
The underlying reason from a C++ perspective is that templated instances of types and functions can’t resolve with C linkage, thus the error.
However, even after you fix that problem, prepare to be disappointed. CUDA supports a lot of C++ language features, but it doesn't support the standard library, and you can't use std::tuple within kernel code. NVIDIA does provide its own (very limited) reimplementation of parts of the C++ standard library (libcu++), and it does have a basic tuple type. That might work for you.
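As an untested sketch: pass no_extern_c=True to SourceModule on the Python side, wrap only the kernel itself in extern "C" so PyCUDA can still look it up by its unmangled name, and use libcu++'s cuda::std::tuple in place of std::tuple. The kernel source might look something like this (names are illustrative):

#include <cuda/std/tuple>   // libcu++: NVIDIA's device-side tuple

// Only the kernel is wrapped in extern "C"; the templated headers stay
// outside the block, so no templates end up with C linkage.
extern "C" __global__ void kernel(float *out)
{
    cuda::std::tuple<int, float> t(threadIdx.x, 2.0f);
    out[threadIdx.x] = cuda::std::get<0>(t) * cuda::std::get<1>(t);
}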

Is there any documentation for NVCC's `#pragma nv_exec_check_disable` and/or `#pragma hd_warning_disable`?

Some projects use
#pragma nv_exec_check_disable
and/or
#pragma hd_warning_disable
to silence NVCC warnings about
warning: calling a __host__ function from a __host__ __device__ function is not allowed
However they seem completely undocumented, e.g. in the CUDA 9.1 reference.
Is there any relevant documentation anywhere ?
As was indicated in comments and a now (incorrectly) moderator-deleted answer, all pragmas supported by cicc (the front-end parser for device code) remain undocumented.
However, if you are truly interested in what might or might not be supported, you can look through the strings stored in the cicc executable and see that there is an apparent cornucopia of feature-control pragmas. All undocumented, unfortunately.
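For what it's worth, the usage pattern observed in projects that rely on these pragmas (such as Thrust's headers) is to place the pragma immediately before the function whose body would otherwise trigger the warning. Since the pragmas are undocumented, treat this purely as an observed pattern, not a guaranteed interface:

// Suppresses the "calling a __host__ function from a __host__ __device__
// function" diagnostic for the declaration that follows (observed behaviour).
#pragma nv_exec_check_disable
template <typename T>
__host__ __device__ T twice(const T &x)
{
    return x + x;   // operator+ may be a host-only function for some T
}

#pragma hd_warning_disable is used in the same position, before the offending declaration.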

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // note the doubled %%: a literal '%' must be escaped inside inline PTX
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
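For completeness, the companion reader for %total_smem_size follows exactly the same pattern (a sketch analogous to the function above; %total_smem_size reports the static plus dynamic shared memory allocated to the block, in bytes):

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    // static + dynamic shared memory allocated to this block, in bytes
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}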

Passing non-POD types as __global__ function parameters in CUDA

I know that non-POD types can't be passed as parameters to CUDA kernel launches in general.
But where can I find an explanation for this? I mean a reliable source like a book, the CUDA manual, etc.
The entire premise of this question is incorrect. CUDA kernel arguments are not limited to POD types.
You are free to pass any complete type as an argument, either by reference or by value. There is a limit of 256 bytes or 4KB on the total size of the argument list, depending on which architecture you compile for, but that is the only restriction on kernel arguments. When passing an instance of a class to a CUDA kernel, there are a number of simple restrictions which you must follow, including:
Any pointers in the class instance which the device code will dereference must be valid device pointers
Any member functions in the class which the device code will call must be valid __device__ functions
It is illegal to pass a class containing virtual functions, or derived from a virtual base type, as a kernel argument
Classes which access namespace-scope anonymous unions are not supported in device code
All of the features and limitations of C++ support in CUDA kernel code are described in the CUDA Programming Guide, a copy of which ships with every version of the CUDA toolkit. All you need to do is read it.
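As a minimal sketch (illustrative names, not an official example): a class with a user-provided constructor and a __device__ member function is not a POD type, yet it can be passed to a kernel by value, provided the pointer it carries is a valid device pointer:

struct DeviceSpan {
    float *data;   // must be a valid device pointer (e.g. from cudaMalloc)
    int    n;
    __host__ __device__ DeviceSpan(float *d, int n_) : data(d), n(n_) {}  // user-provided ctor: non-POD
    __device__ float &operator[](int i) { return data[i]; }               // member called from device code
};

__global__ void scale(DeviceSpan s, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < s.n)
        s[i] *= factor;
}

A launch would then look like scale<<<blocks, threads>>>(DeviceSpan(d_ptr, n), 2.0f), with d_ptr obtained from cudaMalloc.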

Using STL containers in GNU Assembler

Is it possible to "link" the STL to an assembly program, e.g. similar to linking the glibc to use functions like strlen, etc.? Specifically, I want to write an assembly function which takes as an argument a std::vector and will be part of a lib. If this is possible, is there any documentation on this?
Any use of C++ templates will require the compiler to generate instantiations of those templates. So you don't really "link" something like the STL into a program; the compiler generates object code based upon your use of templates in the library.
However, if you write some C++ code that forces the templates to be instantiated for whatever types and other arguments you need to use, and then wrap the uses of those template instantiations in C-linkage functions, you should be able to call those wrappers from your assembly code.
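A sketch of that idea (function names are illustrative): the C++ side instantiates std::vector<float> and exposes flat, C-linkage entry points that the assembly code can call:

#include <vector>

// C-linkage shims around the std::vector<float> instantiation; the assembly
// code only ever sees plain pointers and scalars.
extern "C" void vec_push_back(std::vector<float> *v, float x)
{
    v->push_back(x);
}

extern "C" float *vec_data(std::vector<float> *v)
{
    return v->empty() ? nullptr : &(*v)[0];
}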
I strongly believe you're doing it wrong. Using assembler is not going to speed up your handling of the data. If you must use existing assembly code, simply pass raw buffers.
std::vector is by definition (in the standard) compatible with raw buffers (arrays); the standard mandates contiguous allocation. Only reallocation can invalidate the memory region that contains the element data. In short, if the C++ code can know the (max) capacity required and reserve()/resize() appropriately, you can pass &vector[0] as the buffer address and be perfectly happy.
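Concretely, assuming a hypothetical assembly routine process_buffer with a C-compatible signature, the C++ side could look like this:

#include <cstddef>
#include <vector>

extern "C" void process_buffer(float *buf, std::size_t n);  // implemented in assembly (hypothetical)

void run(std::size_t n)
{
    std::vector<float> v(n);           // contiguous storage, sized up front
    process_buffer(&v[0], v.size());   // no reallocation while the assembly works on it
}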
If the assembly code needs to decide how (much) to reallocate, let it use malloc. Once done, you should be able to use that array as an STL container:
std::accumulate(buf, buf+n, 0, &dosomething);
Alternatively, you can use the fact that std::tr1::array<T, n> or boost::array<T, n> are POD, and use placement new right on the buffer allocated in the library (see here: placement new + array + alignment or How to make tr1::array allocate aligned memory?)
Side note
I have the suspicion that you are using assembly for the wrong reasons. Optimizing compilers will leverage the full potential of modern processors (including SIMD such as SSE1-4).
E.g. for gcc, have a look at:
__attribute__ (e.g. for pointer restrictions such as alignment and aliasing guarantees: this enables the more powerful vectorization options in the compiler);
-ftree-vectorize, -ftree-vectorizer-verbose=2 and -march=native
Note also that since the compiler can't be sure what registers an external (or even inline) assembly procedure clobbers, it must assume all registers are clobbered, leading to potential performance degradation. See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for ways to use inline assembly with proper hints to gcc (a minimal example follows below).
probably completely off-topic: -fopenmp and gnu::parallel
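For illustration, a minimal extended-asm fragment with explicit operand and (empty) clobber lists, so gcc knows exactly what the statement touches (a sketch, x86-64 AT&T syntax):

long add_in_asm(long a, long b)
{
    long result = a;
    // Explicit output/input lists (and no clobbers) mean gcc does not have to
    // assume every register was trashed by the asm statement.
    asm ("add %1, %0"
         : "+r"(result)   // read-write output
         : "r"(b));       // input
    return result;
}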
Bonus: the following references on (premature) optimization in assembly and C++ might come in handy:
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms
Optimizing subroutines in assembly language: An optimization guide for x86 platforms
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
And some other relevant resources