When compiling CUDA programs which use Google Test, nvcc will emit false-positive warnings:
function <name> was declared but never referenced
An MCVE:
// test.cu
#include <gtest/gtest.h>
namespace {
__global__ void a_kernel() {
    printf("Works");
}
TEST(ExampleTest, ExampleTestCase) {
    a_kernel<<<1, 1>>>();
}
}
Compiling it gives:
$ nvcc test.cu -lgtest -lgtest_main
test.cu(9): warning: function "<unnamed>::ExampleTest_ExampleTestCase_Test::ExampleTest_ExampleTestCase_Test()" was declared but never referenced
This is confirmed with the master branch of Google Test and CUDA 9.1 (I believe it started happening with CUDA 9.0; the bug is not present in CUDA 8.0). The problem doesn't happen if the test is in the global namespace.
Is there a way to disable these warnings? I know I can use -w to disable all warnings, but I would like to keep other types of warnings.
You could try the brute force way:
#pragma push
#pragma diag_suppress 177 // suppress the "function was declared but never referenced" warning
.. your function ..
#pragma pop
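Scoped to a single region, the suppression might look like the sketch below. Two things here are assumptions that are not part of the answer above: the `__CUDACC__` guard, which keeps a plain host compiler from seeing pragmas it does not understand, and the `nv_diag_suppress` spelling, which newer nvcc releases (CUDA 11.5 and later, as far as I know) prefer and advertise through `__NVCC_DIAG_PRAGMA_SUPPORT__`. `helper_only_used_in_tests` is a hypothetical name standing in for "your function" (or your TEST(...) block).

```cpp
// Sketch: limit the suppression of diagnostic 177 to one region of the file.
#if defined(__CUDACC__)                       // only nvcc understands these pragmas
#  pragma push                                // save the current diagnostic state
#  if defined(__NVCC_DIAG_PRAGMA_SUPPORT__)   // assumption: defined by newer nvcc releases
#    pragma nv_diag_suppress 177
#  else
#    pragma diag_suppress 177                 // spelling used in the answer above
#  endif
#endif

namespace {
// Hypothetical function that would otherwise be reported as
// "declared but never referenced". A TEST(...) block would go here instead.
int helper_only_used_in_tests(int x) { return x + 1; }
}  // namespace

#if defined(__CUDACC__)
#  pragma pop                                 // restore the saved diagnostic state
#endif
```

Code after the final `#pragma pop` still gets warning 177, so other genuinely unused functions keep being reported.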
Please look at this code:
void bar() {}
__host__ __device__ void foo()
{
    bar();
}
__global__ void kernel()
{
    foo();
}
int main()
{
    kernel<<<1, 1>>>();
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    return 0;
}
I spent hours trying to solve the "an illegal memory access was encountered" runtime error. As it turned out, the cause is the bar() function: it's not declared as __device__. But! But the code compiles! It produces a warning, but compiles! The warning says:
warning: calling a __host__ function("bar") from a __host__ __device__
function("Test::foo") is not allowed
Since the compilation for my project produces a lot of output, I simply didn't see that warning. But if I remove the __device__ attribute from the foo() function, I get the expected error:
error: identifier "foo" is undefined in device code
The question is why the compiler prints only a warning and how to turn it into an error?
"The question is why the compiler prints only a warning..."
The compiler prints only a warning because it doesn't know (at the point of compilation of the calling function) if the function will actually be called at runtime, in the objectionable configuration (i.e. on or from device code).
"...and how to turn it into an error?"
From the nvcc manual you can add either:
-Werror all-warnings
to flag all warnings as errors
or
-Werror cross-execution-space-call
to only flag this type of warning as an error.
Also see here. For those who will ask why I didn't flag this as a duplicate: that other question does not ask (and its answer does not explain) why the compiler behaves this way.
"I spent hours trying to solve the... error. ... But the code compiles! It produces a warning, but compiles!"
You need to revisit your debugging methodology right there :-(
Any warning that you have not positively proven to yourself to be immaterial is where you need to look for your errors. And it is far easier and more rewarding to resolve warnings than to prove them invalid. (And by resolve, I mean addressing the underlying condition, not suppressing the warning, or const_cast'ing etc.)
So, don't turn warnings into errors with the compiler, turn them into essentially-errors in your mind. Clean, warning-free code = happy life.
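For the code in the question, addressing the underlying condition means giving bar() the same execution-space qualifiers as its caller. A minimal sketch follows. The `HOST_DEVICE` macro is a hypothetical convenience added here so the same code also compiles with a plain host compiler, and the function bodies are placeholders.

```cpp
// HOST_DEVICE is a hypothetical helper macro, not part of CUDA itself.
#ifdef __CUDACC__
#  define HOST_DEVICE __host__ __device__
#else
#  define HOST_DEVICE                 // plain C++ build: the qualifiers vanish
#endif

// bar() is now callable from both host and device code, so the call inside
// foo() no longer crosses execution spaces and the warning disappears.
HOST_DEVICE inline int bar() { return 42; }   // placeholder body

HOST_DEVICE inline int foo() { return bar(); }

#ifdef __CUDACC__
// Kernel from the question, writing the result so the call is observable.
__global__ void kernel(int* out) { *out = foo(); }
#endif
```

If bar() really has to stay host-only (for example because it touches host-only libraries), the alternative is to stop calling foo() from device code rather than to silence the diagnostic.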
I have a class that calls a kernel in its constructor, as follows:
"ScalarField.h"
#include <iostream>
void ERROR_CHECK(cudaError_t err,const char * msg) {
    if(err!=cudaSuccess) {
        std::cout << msg << " : " << cudaGetErrorString(err) << std::endl;
        std::exit(-1);
    }
}
class ScalarField {
public:
    float* array;
    int dimension;
    ScalarField(int dim): dimension(dim) {
        std::cout << "Scalar Field" << std::endl;
        ERROR_CHECK(cudaMalloc(&array, dim*sizeof(float)),"cudaMalloc");
    }
};
"classA.h"
#include "ScalarField.h"
static __global__ void KernelSetScalarField(ScalarField v) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < v.dimension) v.array[index] = 0.0f;
}
class A {
public:
    ScalarField v;
    A(): v(ScalarField(3)) {
        std::cout << "Class A" << std::endl;
        KernelSetScalarField<<<1, 32>>>(v);
        ERROR_CHECK(cudaGetLastError(),"Kernel");
    }
};
"main.cu"
#include "classA.h"
A a_object;
int main() {
std::cout << "Main" << std::endl;
return 0;
}
If I instantiate this class inside main (A a_object;) I get no errors. However, if I instantiate it outside main, just after defining it (class A {...} a_object;), I get an "invalid device function" error when the kernel launches. Why does that happen?
EDIT
Updated code to provide a more complete example.
EDIT 2
Following the advice in the comment by Raxvan, I should add that the dimensions variable used in the ScalarField constructor is also defined (in another class) outside main, but before everything else. Could that be the explanation? The debugger was showing the right value for dimensions, though.
The short version:
The underlying reason for the problem when class A is instantiated outside of main is that a particular hook routine, which is required to initialise the CUDA runtime library with your kernels, is not run before the constructor of class A is called. This happens because there are no guarantees about the order in which static objects are instantiated and initialised in the C++ execution model. Your global scope class is being instantiated before the global scope objects which do the CUDA setup are initialised. Your kernel code is never loaded into the context before it is called, and a runtime error results.
As best as I can tell, this is a genuine limitation of the CUDA runtime API and not something easily fixed in user code. In your trivial example, you could replace the kernel call with a call to cudaMemset or one of the non-symbol based runtime API memset functions and it will work. This problem is completely limited to user kernels or device symbols loaded at runtime via the runtime API. For this reason, an empty default constructor would also solve your problem. From a design point of view, I would be very dubious of any pattern which calls kernels in the constructor. Adding a specific method for class GPU setup/teardown which doesn't rely on the default constructor or destructor would be a much cleaner and less error prone design, IMHO.
In detail:
There is an internally generated routine (__cudaRegisterFatBinary) which must be run to load and register kernels, textures and statically defined device symbols contained in the fatbin payload of any runtime API program with the CUDA driver API before the kernel can be called without error. This is a part of the "lazy" context initialisation feature of the runtime API. You can confirm this for yourself as follows:
Here is a gdb trace of the revised example you posted. Note I insert a breakpoint into __cudaRegisterFatBinary, and that isn't reached before your static A constructor is called and the kernel launch fails:
talonmies@box:~$ gdb a.out
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary'
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 10774)]
Class A
Kernel : invalid device function
[Thread 0x7ffff5a63700 (LWP 10774) exited]
[Inferior 1 (process 10771) exited with code 0377]
Here is the same procedure, this time with A instantiation inside main (which is guaranteed to happen after the objects which perform lazy setup have been initialised):
talonmies@box:~$ cat main.cu
#include "classA.h"
int main() {
A a_object;
std::cout << "Main" << std::endl;
return 0;
}
talonmies@box:~$ nvcc --keep -arch=sm_30 -g main.cu
talonmies@box:~$ gdb a.out
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/talonmies/a.out...done.
(gdb) break '__cudaRegisterFatBinary'
Breakpoint 1 at 0x403180
(gdb) run
Starting program: /home/talonmies/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, 0x0000000000403180 in __cudaRegisterFatBinary ()
(gdb) cont
Continuing.
Scalar Field
[New Thread 0x7ffff5a63700 (LWP 11084)]
Class A
Main
[Thread 0x7ffff5a63700 (LWP 11084) exited]
[Inferior 1 (process 11081) exited normally]
If this is really a crippling problem for you, I would suggest contacting NVIDIA developer support and raising a bug report.
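The ordering problem can be reproduced without a GPU. The sketch below is a host-only analogy, not the CUDA runtime's actual code: `registry`, `launch`, `EarlyUser` and `Registrar` are hypothetical names, with `Registrar` playing the role that `__cudaRegisterFatBinary` plays above. Within one translation unit the order is deterministic (definition order), which is what makes the failure repeatable here. In the real case the user has no control over where the registration object is initialised relative to their own globals.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for the runtime's table of registered kernels.
// A function-local static guarantees the table itself exists on first use.
std::set<std::string>& registry() {
    static std::set<std::string> table;
    return table;
}

// Stand-in for a kernel launch: succeeds only if the kernel was registered.
bool launch(const std::string& name) {
    return registry().count(name) != 0;
}

// User object at namespace scope that "launches a kernel" in its constructor,
// like class A in the question.
struct EarlyUser {
    bool launched;
    EarlyUser() : launched(launch("KernelSetScalarField")) {}
};

// Stand-in for the generated registration hook.
struct Registrar {
    Registrar() { registry().insert("KernelSetScalarField"); }
};

// Dynamic initialisation runs in definition order within this file:
// early_user is constructed before registrar, so its launch fails.
EarlyUser early_user;
Registrar registrar;

// By the time main() runs, every namespace-scope object above has been
// initialised, so the same launch succeeds when called from there.
bool launch_from_main() { return launch("KernelSetScalarField"); }
```

Moving the launch out of the constructor and into a function called from main, as suggested in the short version, corresponds to calling `launch_from_main()` here.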
I have a function in my program called float valueAt(float3 v). It's supposed to return the value of a function at the given point. The function is user-specified. I have an interpreter for this function at the moment, but others recommended I compile the function online so it's in machine code and is faster.
How do I do this? I believe I know how to load the function when I have PTX generated, but I have no idea how to generate the PTX.
CUDA provides no way of runtime compilation of non-PTX code.
What you want can be done, but not using the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code which includes behind the scenes forking of the toolchain to compile to device code and loading using the runtime API. The (possible) downside is that you need to use Python for the top level of your application, and if you are shipping code to third parties, you might need to ship a working Python distribution too.
The only other alternative I can think of is OpenCL, which does support runtime compilation (that is all it supported until recently). The C99 language base is a lot more restrictive than what CUDA offers, and I find the APIs to be very verbose, but the runtime compilation model works well.
I've thought about this problem for a while, and while I don't think this is a "great" solution, it does seem to work so I thought I would share it.
The basic idea is to use Linux to spawn processes to compile and then run the compiled code. I think this is pretty much a no-brainer, but since I put together the pieces, I'll post instructions here in case it's useful for somebody else.
The problem statement in the question is to be able to take a file that contains a user-defined function, let's assume it is a function of a single variable f(x), i.e. y = f(x), and that x and y can be represented by float quantities.
The user would edit a file called fx.txt that contains the desired function. This file must conform to C syntax rules.
fx.txt:
y=1/x
This file then gets included in the __device__ function that will be holding it:
user_testfunc.cuh:
__device__ float fx(float x){
    float y;
    #include "fx.txt"
    ;
    return y;
}
which gets included in the kernel that is called via a wrapper.
cudalib.cu:
#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"
__global__ void my_kernel(float x, float *y){
    *y = fx(x);
}
float cudalib_compute_fx(float x){
    float *d;
    float h_y = 0.0f;  // host copy of the result
    cudaMalloc(&d, sizeof(float));
    my_kernel<<<1,1>>>(x, d);
    // the blocking copy also waits for the kernel to finish
    cudaMemcpy(&h_y, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);  // release the device buffer
    return h_y;
}
cudalib.h:
float cudalib_compute_fx(float x);
The above files get built into a shared library:
nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so
We need a main application to use this shared library.
t452.cu:
#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"
int main(int argc, char* argv[]){
    if (argc == 1){
        // recompile lib, and spawn new process
        int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so");
        char scmd[128];
        // snprintf truncates rather than overflowing scmd for a long argv[0]
        snprintf(scmd, sizeof(scmd), "%s skip", argv[0]);
        retval = system(scmd);
    }
    else { // compute f(x) at x = 2.0
        printf("Result is: %f\n", cudalib_compute_fx(2.0));
    }
    return 0;
}
Which is compiled like this:
nvcc -arch=sm_20 -o t452 t452.cu -L. -lmycudalib
At this point, the main application (t452) can be executed and it will produce the result of f(2.0) which is 0.5 in this case:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 0.500000
The user can then modify the fx.txt file:
$ vi fx.txt
$ cat fx.txt
y = 5/x
And just re-run the app, and the new functional behavior is used:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 2.500000
This method takes advantage of the fact that upon recompilation/replacement of a shared library, a new Linux process will pick up the new shared library. Also note that I've omitted several kinds of error checking for clarity. At a minimum I would check CUDA errors, and I would also probably delete the shared object (.so) library before recompiling it, and then test for its existence after compilation, as a basic check that the compilation succeeded.
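The delete-then-verify step described above could be sketched as follows. This is only an illustration under assumptions: `file_exists` and `rebuild_library` are hypothetical helper names, and the build command is passed in as a string so the same helper works for any nvcc command line.

```cpp
#include <cstdio>    // std::remove
#include <cstdlib>   // std::system
#include <fstream>
#include <string>

// Returns true if a file at `path` can be opened for reading.
bool file_exists(const std::string& path) {
    std::ifstream f(path.c_str());
    return f.good();
}

// Hypothetical helper: delete the old library, run the build command, and
// report success only if the command returned 0 and the library now exists.
bool rebuild_library(const std::string& build_cmd, const std::string& lib_path) {
    std::remove(lib_path.c_str());            // failure is fine: the file may not exist yet
    int rc = std::system(build_cmd.c_str());  // e.g. the nvcc command used to build the library
    return rc == 0 && file_exists(lib_path);
}
```

In the main application this would replace the bare system() call, for example `rebuild_library("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so", "libmycudalib.so")`, with the child process spawned only when it returns true.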
This method uses only the runtime API, so the user would have to have the CUDA toolkit installed on their machine and set up so that nvcc is available in the PATH. Using the driver API with PTX code would make this process much cleaner (and not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the NVIDIA LLVM compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even in the driver.
A similar approach can be arranged using separate compilation and linking of device code, such that the only source code that needs to be exposed to the user is in user_testfunc.cu (and fx.txt).
EDIT: There is now a CUDA runtime compilation facility (NVRTC), which should be used in place of the above.
I'm building a project that uses both Thrust (CUDA API) and OpenMP. The main purpose of my program is to present an interface for calculating something in parallel.
In order to do that I've decided to use the Strategy design pattern, which basically means that we define a base class with a virtual function, and then other classes derive from that base class and implement the needed function.
My problem starts here:
1. Can my project have more than one .cu file?
2. Can .cu files contain class declarations?
class foo
{
    int m_name;
    void doSomething();
};
3. This one continues 2. I've heard that device kernels cannot be declared inside classes and have to be done like this:
//header file
__DEVICE__ void kernel(int x, int y)
{.....
}
class a : foo
{
    void doSomething();
};
//cu file
void a::doSomething()
{
    kernel<<<1,1>>>......();
}
Is this the right way?
4. Last question: when I use Thrust, must I use .cu files as well?
Thanks, igal
1. Yes, you can use multiple .cu files in your project.
2. Yes, but there are restrictions. According to the CUDA C Programming Guide v4.0, section 3.1.5:
"The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in Appendix D. As a consequence of the use of C++ syntax rules, void pointers (e.g., returned by malloc()) cannot be assigned to non-void pointers without a typecast."
3. You're ALMOST correct. You have to use the __global__ keyword when declaring your kernel.
__global__ void kernel(int x, int y)
{.....
}
4. Well, yes. Actually, your Thrust-based device code should be compiled with nvcc. See the Thrust documentation for details.
In general, you will compile your programs like this:
$ nvcc -c device.cu
$ g++ -c host.cpp -I/usr/local/cuda/include/
$ nvcc device.o host.o
Alternatively, you can use g++ to perform the final linking step.
$ g++ -o tester device.o host.o -L/usr/local/cuda/lib64 -lcudart
On Windows, change the paths after -I and -L. Also, as far as I know, you have to use the cl compiler (MS Visual Studio).
Note 1:
Watch out for x86/x64 compatibility: if you use the 64-bit CUDA Toolkit, also use a 64-bit compiler (see also the -m32 and -m64 options of nvcc).
Note 2:
device.cu contains kernels and a function that invokes kernel(s). This function has to be annotated with extern "C".
It can contain classes (limitations apply).
host.cpp contains pure C++ code with an extern "C" declaration of the function that is in device.cu (NOT the kernel).
I am developing a CUDA 4.0 application running on a Fermi card. According to the specs, Fermi has Compute Capability 2.0 and therefore should support non-inlined function calls.
I compile every class I have with nvcc 4.0 in a distinct obj file. Then, I link them all with g++-4.4.
Consider the following code :
[File A.cuh]
#include <cuda_runtime.h>
struct A
{
    __device__ __host__ void functionA();
};
[File B.cuh]
#include <cuda_runtime.h>
struct B
{
    __device__ __host__ void functionB();
};
[File A.cu]
#include "A.cuh"
#include "B.cuh"
void A::functionA()
{
    B b;
    b.functionB();
}
Attempting to compile A.cu with nvcc -o A.o -c A.cu -arch=sm_20 outputs Error: External calls are not supported (found non-inlined call to _ZN1B9functionBEv).
I must be doing something wrong, but what?
As explained on this thread on the NVidia forums, it appears that even though Fermi supports non-inlined functions, nvcc still needs to have all the functions available during compilation, i.e. in the same source file: there is no linker (yep, that's a pity...).
functionB is declared but not defined in this translation unit, so the call to it is considered an external call. As the error says, external calls are not supported. Define functionB in the same source file and it will work.
True, CUDA 5.0 supports this, though not by default. I can't get it to expose external device variables, but device methods work just fine.
The nvcc option is "-rdc=true". In Visual Studio and Nsight it is an option in the project properties under Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code.