How to disable specific nvcc compiler warnings - cuda

I want to disable a specific compiler warning with nvcc, specifically
warning: NULL reference is not allowed
The code I am working on uses NULL references as part of SFINAE, so they can't be avoided.
An ideal solution would be a #pragma in just the source file where we want to disable the warnings, but a compiler flag would also be fine, if one exists to turn off only the warning in question.

It is actually possible to disable specific warnings on the device with NVCC. It took me ages to figure out how to do it.
You need to use the -Xcudafe flag combined with a token listed on this page. For example, to disable the "controlling expression is constant" warning, pass the following to NVCC:
-Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant"
For other warnings, see the above link.

Just to add to the previous answer about -Xcudafe (not enough reputation yet to leave a comment)
MAJOR EDIT:
cudaFE is apparently Nvidia's custom version of C++ Front End from the Edison Design Group. You can find the docs for it here: http://www.edg.com/docs/edg_cpp.pdf. I am currently referring to page numbers from the July 2019 v5.1 manual.
@einpoklum notes that just doing push/pop, as I had stated in the initial post, doesn't work, and specifically that #pragma push generates a warning that it is ignored. I could not replicate the warning, but in the test program below the push/pop indeed did nothing under either CUDA 10.1 or CUDA 9.2 (note that the two commented lines in return3 generate no warning).
However, on page 75 of this manual they go through how to do localized diagnostic severity control without push/pop:
The following example suppresses the “pointless friend declaration”
warning on the declaration of class A:
#pragma diag_suppress 522
class A { friend class A; };
#pragma diag_default 522
class B { friend class B; };
The #pragma diag_default returns the warning to default state. Another example would be:
#pragma diag_suppress = code_is_unreachable
...
#pragma diag_default = code_is_unreachable
The equals sign is optional. Testing shows that this works and truly localizes the severity control. Further testing reveals that diagnostic suppressions added this way accumulate with previous suppressions rather than replacing them. Also of note: unreachable code generated a warning in CUDA 9.2 but not in CUDA 10.1. Finally, page 77 of the manual mentions another push/pop syntax:
#pragma push_macro("identifier")
#pragma pop_macro("identifier")
But I couldn't get it to work in the program below. (These two pragmas save and restore the definition of the named macro, so they would not be expected to affect diagnostic state.)
All of the above is tested in the program below, compiled with nvcc -std=c++14 test_warning_suppression.cu -o test_warning_suppression:
#include <cuda_runtime.h>
__host__ __device__ int return1(){
    int j = -1; //warning given for both CUDA 9.2 and 10.1
    return 1;
    if(false){ return 0; } //warning given here for CUDA 9.2
}
#pragma push
#pragma diag_suppress = code_is_unreachable
#pragma diag_suppress = declared_but_not_referenced
__host__ __device__ int return2(){
    int j = -1;
    return 2;
    if(false){ return 0; }
}
#pragma pop
__host__ __device__ int return3(){
    int j = -1; //no warning given here
    return 3;
    if(false){ return 0; } //no warning here even in CUDA 9.2
}
//push/pop did not localize warning suppression, so reset to default
#pragma diag_default = declared_but_not_referenced
#pragma diag_default = code_is_unreachable
//warning suppression localized to lines above by diag_default!
__host__ __device__ int return4(){
    int j = -1; //warning given for both CUDA 9.2 and 10.1
    return 4;
    if(false){ return 0; } //warning given here for CUDA 9.2
}
/* below does not work as of CUDA 10.1
#pragma push_macro("identifier")
#pragma diag_suppress = code_is_unreachable
__device__ int return5(){
    return 5;
    if(false){ return 0; }
}
#pragma pop_macro("identifier")
__device__ int return6(){
    return 6;
    if(false){ return 0; }
} */
int main(){ return 0; }
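If the same source also has to build with a host-only compiler, the localized suppression can be guarded so the pragmas are only seen by nvcc. This is a minimal sketch, not something from the EDG manual: the HOST_DEVICE macro and the function names are hypothetical, and only the two pragmas and the diagnostic name come from the test program above. A host compiler may still emit its own unused-variable warning, since these pragmas only address the nvcc front end.

```cpp
#include <cassert>

// Hypothetical helper macro: expands to nothing when nvcc is not the compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Suppress the diagnostic only around the one function that needs it.
#ifdef __CUDACC__
#pragma diag_suppress = declared_but_not_referenced
#endif
HOST_DEVICE inline int quiet_unused_variable() {
    int j = -1;  // declared but never referenced
    return 1;
}
// Return the diagnostic to its default severity for the rest of the file.
#ifdef __CUDACC__
#pragma diag_default = declared_but_not_referenced
#endif

// Host-side sanity check that the function itself is unaffected.
inline void example_usage() {
    assert(quiet_unused_variable() == 1);
}
```

Because diag_default is issued right after the function, any later function with an unreferenced variable should warn as usual under nvcc.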

To augment user2333829's answer: if you know the warning name you can disable it like this:
-Xcudafe "--diag_suppress=boolean_controlling_expr_is_constant"
If you don't know the name, get warning numbers by compiling with:
-Xcudafe --display_error_number
And then with:
-Xcudafe --diag_suppress=<warning_number>
(Note: both options at the same time apparently don't work.)

You can use the -w flag to suppress all warnings (not just a specific one):
nvcc -w

I struggled to find a matching -Xcudafe for my warning. So here is another way.
You can pass a compiler flag to CL.exe that will disable a specific warning. For example, to disable the warnings about unchecked iterators you can pass /wd4996.
warning C4996: 'std::_Copy_impl': Function call with parameters that may be
unsafe - this call relies on the caller to check that the passed values are
correct. To disable this warning, use -D_SCL_SECURE_NO_WARNINGS. See
documentation on how to use Visual C++ 'Checked Iterators'
The tricky thing here is that by default the arguments from the host compiler settings are not passed to nvcc, so you need to add it via the CUDA C/C++ dialog.

I was using nvcc with the Ubuntu g++ compilers (in my case OpenMPI's mpic++). For g++'s "-Wunused-result" warning, the corresponding suppression flag is "-Wno-unused-result", so passing it to nvcc as -Xcompiler "-Wno-unused-result" worked for me.

References for the gcc/clang warning options:
https://docs.adacore.com/live/wave/gcc-12.x/html/gcc/gcc.html#Warning-Options
https://releases.llvm.org/12.0.0/tools/clang/docs/DiagnosticsReference.html#diagnostic-flags

Related

Why is CUDA function cudaLaunchKernel passed a function pointer to host-code function?

I compile axpy.cu with the following command.
nvcc --cuda axpy.cu -o axpy.cu.cpp.ii
Within axpy.cu.cpp.ii, I observe that the cudaLaunchKernel call nested in __device_stub__Z4axpyfPfS_ accepts a function pointer to axpy, which is defined in axpy.cu.cpp.ii. So my confuse is that shouldn't cudaLaunchKernel have been passed an function pointer to kernel function axpy? Why is there function definition with the same name as kernel function? Any help would be appreciated! Thanks in advance
Both functions are shown below.
void __device_stub__Z4axpyfPfS_(float __par0, float *__par1, float *__par2){
    void * __args_arr[3];
    int __args_idx = 0;
    __args_arr[__args_idx] = (void *)(char *)&__par0;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par1;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par2;
    ++__args_idx;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(float, float *, float *))axpy));
        dim3 __gridDim, __blockDim;
        size_t __sharedMem;
        cudaStream_t __stream;
        if (__cudaPopCallConfiguration(&__gridDim, &__blockDim, &__sharedMem, &__stream) != cudaSuccess)
            return;
        if (__args_idx == 0) {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[__args_idx], __sharedMem, __stream);
        } else {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[0], __sharedMem, __stream);
        }
    };
}
void axpy( float __cuda_0,float *__cuda_1,float *__cuda_2)
# 3 "axpy.cu"
{
    __device_stub__Z4axpyfPfS_( __cuda_0,__cuda_1,__cuda_2);
}
So my confuse [sic] is that shouldn't cudaLaunchKernel have been passed an function pointer to kernel function axpy?
No, because that isn't the design which NVIDIA chose, and your assumptions about how this function works are probably not correct. As I understand it, the first argument to cudaLaunchKernel is treated as a key, not a function pointer which is called. Elsewhere in the nvcc emitted code you will find something like this boilerplate:
static void __nv_cudaEntityRegisterCallback( void **__T0)
{
    __nv_dummy_param_ref(__T0);
    __nv_save_fatbinhandle_for_managed_rt(__T0);
    __cudaRegisterEntry(__T0, ((void ( *)(float, float *, float *))axpy), _Z4axpyfPfS_, (-1));
}
You can see that the __cudaRegisterEntry call takes both a static pointer to axpy and a form of the mangled symbol for the compiled GPU kernel. __cudaRegisterEntry is an internal, completely undocumented API from the CUDA runtime API. Many years ago, I satisfied myself, by reverse engineering an earlier version of the CUDA runtime API, that there is an internal lookup mechanism which allows the correct instance of a host kernel stub to be matched to the correct instance of the compiled GPU code at runtime. The compiler emits a large amount of boilerplate and statically defined objects holding all the necessary definitions to make the runtime API work seamlessly, without all of the additional API overhead that you need to use in the CUDA driver API or comparable compute APIs like OpenCL.
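The "key, not a callable" idea can be pictured with a small host-only model. This is purely a conceptual sketch: KernelRegistry, register_entry, lookup, and key_of are hypothetical names invented for illustration, and the real runtime's internal data structures are undocumented and may look nothing like this. The only point it makes is that the host stub's address can identify a kernel without ever being invoked through the table.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an internal table: host stub address -> device symbol name.
class KernelRegistry {
public:
    // Analogous in spirit to a registration call made at start-up.
    void register_entry(const void* host_stub, std::string device_symbol) {
        table_[host_stub] = std::move(device_symbol);
    }
    // The host pointer is only compared as a key; it is never called here.
    const std::string* lookup(const void* host_stub) const {
        auto it = table_.find(host_stub);
        return it == table_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<const void*, std::string> table_;
};

// Host-side stub that shares its name with the kernel (body irrelevant for the lookup).
inline void axpy(float, float*, float*) {}

// Turns the stub's address into an opaque key.
inline const void* key_of(void (*stub)(float, float*, float*)) {
    return reinterpret_cast<const void*>(stub);
}

inline void example_usage() {
    KernelRegistry registry;
    registry.register_entry(key_of(axpy), "_Z4axpyfPfS_");
    const std::string* symbol = registry.lookup(key_of(axpy));
    assert(symbol != nullptr && *symbol == "_Z4axpyfPfS_");
}
```

In this model a "launch" would start from the host pointer, find the mangled device symbol, and hand that to whatever actually loads and runs GPU code.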
Why is there function definition with the same name as kernel function?
Because that is how NVIDIA decided to implement the runtime API. What you are asking about are undocumented, internal implementation details of the runtime API. You as a programmer are not supposed to have to see or use them, and they are not guaranteed to be the same from version to version.
If, as indicated in comments, you want to implement some additional compiler infrastructure in the CUDA compilation process, use the CUDA driver API, not the runtime API.

Does Cuda C++ not have tuples in device code?

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) {return x + 1; }; // Works.
    auto t = std::make_tuple(1, 2, 3); // Does not work.
    c[i] = a[i] + b[i];
}
NVCC has lambdas at least, but std::make_tuple fails to compile. Are tuples not allowed in the current version of Cuda?
I've just tried this out, and tuple metaprogramming with std:: (std::tuple, std::get, etc.) will work in device code with C++14 and --expt-relaxed-constexpr enabled during compilation (e.g. nvcc -std=c++14 xxxx.cu -o yyyyy --expt-relaxed-constexpr). CUDA 9 is required for C++14, but basic std::tuple should work in CUDA 8 if you are limited to that. thrust::tuple works but has some drawbacks: it is limited to 10 items and lacks some of the std::tuple helper functions (e.g. std::tuple_cat). Because tuples and their related functions are largely constexpr, --expt-relaxed-constexpr should enable your std::tuple to "just work".
#include <cstdio>
#include <tuple>
__global__ void kernel()
{
    auto t = std::make_tuple(1, 2, 3);
    printf("%d\n",std::get<0>(t));
}
int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}
#include <thrust/tuple.h>
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) {return x + 1; }; // Works.
    auto t = thrust::make_tuple(1, 2, 3);
    c[i] = a[i] + b[i];
}
I needed to get the ones from the Thrust library instead to make them work it seems. The above does compile.
Support for the standard C++ library on the device side is problematic for CUDA, as the standard library does not have the necessary __host__ or __device__ annotations.
That said, both clang and nvcc do have partial support for some functionality. Usually it's limited to constexpr functions that are considered to be __host__ __device__ if you pass --expt-relaxed-constexpr to nvcc (or by default in clang). Clang also has a bit more support for standard math functions. Neither supports anything that relies on C++ runtime (except for memory allocation, printf and assert) as that does not exist on device side.
So, in short -- most of the standard C++ library is unusable on device side in CUDA, though things do slowly improve as more and more functions in the standard library become constexpr.
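To make the constexpr point concrete, here is a minimal sketch of the kind of function this applies to. The name sum3 is hypothetical. The function carries no __host__ or __device__ annotation, so, going by the description above, calling it from a kernel would rely on --expt-relaxed-constexpr with nvcc; the block itself is plain C++14 and is only checked on the host here.

```cpp
#include <cassert>
#include <tuple>

// constexpr and unannotated: evaluable at compile time, and (per the text above)
// a candidate for implicit __host__ __device__ treatment under --expt-relaxed-constexpr.
constexpr int sum3(const std::tuple<int, int, int>& t) {
    return std::get<0>(t) + std::get<1>(t) + std::get<2>(t);
}

// Compile-time check: std::make_tuple and std::get are constexpr since C++14.
static_assert(sum3(std::make_tuple(1, 2, 3)) == 6, "sum3 must be usable in constant expressions");

// Run-time check on the host.
inline void example_usage() {
    assert(sum3(std::make_tuple(4, 5, 6)) == 15);
}
```

Anything that needs the C++ runtime rather than constexpr evaluation would not be covered by this approach.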
Indeed, CUDA itself does not offer a device-side-capable version of std::tuple. However, I have a full tuple implementation as part of my cuda-kat library (still very much under initial development at the time of writing). thrust's tuple class is limited in the following senses:
Limited to 10 tuple elements.
Recursively expands templated types for every tuple element.
No/partial support for rvalues (e.g. in get())
The tuple implementation in cuda-kat is an adaptation of the EASTL tuple, which in turn is an adaptation of the LLVM project's libc++ tuple. Unlike the EASTL's, however, it is C++11-compatible, so you don't have to have the absolute latest CUDA version. It is possible to extract only the tuple class from the library with oh, I think 4 files or so, if you need just that.

Allocating device memory inside a __global__ function in CUDA

I want to write this program in CUDA.
1. In "main.cpp":
struct Center{
    double * Data;
    int dimension;
};
typedef struct Center Center;
//I allocate an array of M Center elements with cudaMalloc, as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M =100, n=4;
cudaStatus = cudaMalloc((void**)&V_dev,M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, n); //I always know the dimension n before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
    V[threadIdx.x].dimension = dimension;
    V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
    for(int i=0; i<dimension; i++)
        V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that I want to be able to put here
}
I'm on Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed
I would like to know how I can do this. (I know that malloc and other CPU memory allocation functions are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.

nvidia cuda thrust abort() called on find_if

I'm trying to execute some sample code from the Thrust Quick Start Guide; it's pasted below. What is killing me is that when I run it, I get the exception "R6010 - abort() has been called" whenever I hit the find_if.
I've tried this using both the 4.1 and 4.2 runtimes. I'm building this in Visual Studio 2010 Ultimate using the latest NSight release candidate (downloaded May 4th, 2012). My graphics card is an NVidia NVS 3100m.
I can run the vector addition sample generated in a new VS project (that doesn't use Thrust) and it works okay. Adding Thrust however gives me this weirdness.
Any suggestions are appreciated.
mj
thrust::device_vector<int> input(4);
input[0] = 0;
input[1] = 5;
input[2] = 3;
input[3] = 7;
thrust::device_vector<int>::iterator iter;
iter = thrust::find_if(input.begin(), input.end(), greater_than_four());
iter = thrust::find_if(input.begin(), input.end(), greater_than_ten());
EDIT1
Another tidbit of information. I'm digging in deeper into this and seeing that an error is caught during cudaThreadSynchronize(). The message is "launch_closure_by_value".
I figured it out. The __host__ and __device__ tags were missing.
struct greater_than_four
{
    __host__ __device__
    bool operator()(int x)
    {
        return x > 4;
    }
};
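One way to keep such functors usable from both nvcc and a plain host compiler is a small qualifier macro. The sketch below is an illustration under assumptions, not part of the Quick Start Guide: HOST_DEVICE and first_match_index are hypothetical names, and std::find_if on a std::vector stands in for thrust::find_if on a device_vector so the predicate logic can be checked on the host.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper macro: adds the qualifiers only when compiling with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

struct greater_than_four {
    HOST_DEVICE bool operator()(int x) const { return x > 4; }
};

struct greater_than_ten {
    HOST_DEVICE bool operator()(int x) const { return x > 10; }
};

// Host-side reference: index of the first match, or input.size() if none.
template <typename Pred>
int first_match_index(const std::vector<int>& input, Pred pred) {
    auto it = std::find_if(input.begin(), input.end(), pred);
    return static_cast<int>(it - input.begin());
}

inline void example_usage() {
    const std::vector<int> input{0, 5, 3, 7};
    assert(first_match_index(input, greater_than_four()) == 1);  // 5 is the first value > 4
    assert(first_match_index(input, greater_than_ten()) == 4);   // no match: end position
}
```

With the qualifiers in place, the same functor types could be passed to thrust::find_if in a .cu file.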

Using assert within kernel invocation

Is there a convenient way to use asserts within kernel invocations in device mode?
CUDA now has a native assert function. Use assert(...). If its argument is zero, it will stop kernel execution and return an error. (or trigger a breakpoint if in CUDA debugging.)
Make sure to include "assert.h". Also, this requires compute capability 2.x or higher, and is not supported on MacOS. For more details see CUDA C Programming Guide, Section B.16.
The programming guide also includes this example:
#include <assert.h>
__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;
    // This will have no effect
    assert(is_one);
    // This will halt kernel execution
    assert(should_be_one);
}
int main(int argc, char* argv[])
{
    testAssert<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
#define MYASSERT(condition) \
    if (!(condition)) { return; }
MYASSERT(condition);
If you need something fancier, you can use cuPrintf(), which is available from the CUDA site for registered developers.