I'm attempting to build a project with nvcc, and I am getting the most vexing "nvlink error :" messages I've ever seen.
Here is the link statement:
nvcc -rdc=true -arch=sm_21 -O3 -Xcompiler -fPIC -I"/usr/local/ACE_wrappers" -I"/usr/local/ACE_wrappers/TAO" -I"/usr/local/DDS" -I"/usr/include/Qt" -I"/usr/include/QtCore" -I"/usr/include/QtGui" -I"../../include" -I"../../include/DDS" -I"../../include/CoordinateTransforms" -I"../../include/DDS/IDLBrokerTemplates" -I"../../def/IDL" -I"../../def/CMD" -I"../../def/XSD" -I"../../src/NetAcquire" -I"/usr/local/ACE_wrappers/TAO/orbsvcs" -I"/usr/local/include/lct.7.5.4" -L"." -L"/usr/local/ACE_wrappers/lib" -L"/usr/local/DDS/lib" -L"/usr/lib64" -L"/usr/local/lib64" -L"../../def/IDL/lib" -L"../../def/XSD" -L"/usr/local/lib" .obj/../../src/Component.o .obj/../../src/COM.o .obj/../../src/DDS/EntityManager.o .obj/../../src/IDLBrokerTemplates/CommandManager.o .obj/../../src/IDLBrokerTemplates/OptionManager.o .obj/../../include/ApplicationProcessStateReporter_moc.o .obj/../../src/Application.o .obj/../../src/CoordinateTransforms/Site.o .obj/../../src/CoordinateTransforms/Geodesy.o .obj/../../src/CoordinateTransforms/Earth.o .obj/../../src/CoordinateTransforms/StateVector.o .obj/../../src/CoordinateTransforms/KeplerianImpact.o .obj/../../src/CoordinateTransforms/GeodeticPosition.o .obj/../../src/IDLBrokerTemplates/MeasurandSubscription.o .obj/../../src/NetAcquire/NetAcquire.o .obj/DataLossFlightTimeImpl.o .obj/DataLossFlightTime.o .obj/DftTable.o .obj/OptionListener.o .obj/PrimaryListener.o .obj/MissionTimeListener.o .obj/DeadMan.o .obj/main.o .obj/../../src/XML/spline.o .obj/../../src/XML/FpTable.o -l"naps-x86_64" -l"naio-x86_64" -l"nalct-x86_64" -l"curl" -l"TAO_Messaging" -l"TAO_Valuetype" -l"TAO_PI_Server" -l"TAO_PI" -l"TAO_CodecFactory" -l"TAO_CosNaming" -l"armadillo" -l"boost_filesystem" -l"boost_system" -l"xerces-c" -l"jarssXSD" -l"OpenDDS_Tcp" -l"JARSSRTv10" -l"QtNetwork" -l"fontconfig" -l"QtGui" -l"QtCore" -l"OpenDDS_Rtps_Udp" -l"OpenDDS_Rtps" -l"OpenDDS_Multicast" -l"OpenDDS_Udp" -l"OpenDDS_InfoRepoDiscovery" -l"OpenDDS_Dcps" -l"TAO_PortableServer" -l"TAO_AnyTypeCode" -l"TAO" -l"ACE" -o "DFT"
And I'm getting:
nvlink error : Undefined reference to '_ZN5JARSS15KeplerianImpactC1ERKdS2_S2_S2_S2_S2_'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact9getStatusEv'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact13getImpactTimeEv'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact11getPlhStateEv'
nvlink error : Undefined reference to '_ZN5JARSS15KeplerianImpactD1Ev'
nvlink error : Undefined reference to '_ZN5JARSS7Geodesy12EFG2GeodeticERKdS2_S2_PdS3_S3_'
I'm certain that these functions and files are included in the compile. You can see from the link statement that the objects for KeplerianImpact.cpp and Geodesy.cpp are in there.
Is there any way to make the link output easier to read so I can debug this?
Use c++filt to demangle the names. For instance:
$ c++filt _ZN5JARSS15KeplerianImpactC1ERKdS2_S2_S2_S2_S2_
JARSS::KeplerianImpact::KeplerianImpact(double const&, double const&, double const&, double const&, double const&, double const&)
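You can also pipe the whole link step through c++filt to demangle every name at once. A sketch, where the ellipsis stands for the full link line shown above:
$ nvcc ... 2>&1 | c++filt
nvlink error : Undefined reference to 'JARSS::KeplerianImpact::KeplerianImpact(double const&, double const&, double const&, double const&, double const&, double const&)'
...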
I faced this problem earlier; I guess you haven't linked the device code using the device linker.
Generate relocatable code for the device by compiling as shown below (-dc is the device equivalent of -c; see the manual for more information):
nvcc -arch=sm_21 -dc a.cu b.cu
Link the device parts of the code by calling nvlink or nvcc -dlink before the final host link:
nvlink -arch=sm_21 a.o b.o -o link.o
or
nvcc -arch=sm_21 -dlink a.o b.o -o link.o
Finally, form an executable using the host compiler:
g++ a.o b.o link.o -L<path> -lcudart
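Alternatively, a single nvcc invocation with -rdc=true performs the device link implicitly before the host link (this is the switch the link statement in the question already uses):
nvcc -arch=sm_21 -rdc=true a.cu b.cu -o app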
I figured this out.
I needed to define my functions in the correct files. For example, in Foo.h:
class Foo {
public:
__host__ __device__
Foo();
};
and the function definition goes in Foo.cu, not Foo.cpp as I originally thought:
Foo::Foo() {}
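For completeness, a minimal sketch of how such a split might be built with relocatable device code; file names other than Foo.cu are illustrative:
nvcc -rdc=true -c Foo.cu
nvcc -rdc=true -c main.cu
nvcc -rdc=true Foo.o main.o -o app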
For the constant variables, I needed to implement a slightly different strategy.
Here is an example of the C++ class that I started with:
class Foo {
public:
static double const epsilon;
static void functionThatUsesEpsilon();
// ...
};
This had to be converted so that the epsilon definition lives at namespace scope (a __constant__ variable cannot be a static class member):
namespace foo {
extern __constant__ double epsilon; // device copy, in constant memory
}
class Foo {
public:
// same stuff as before, with the addition of this function
__host__ __device__
static inline double getEpsilon() {
#ifdef __CUDACC__
    return foo::epsilon; // compiled by nvcc: use the __constant__ copy
#else
    return epsilon;      // compiled by the host compiler: use Foo::epsilon
#endif
}
static void functionThatUsesEpsilon() {
    if (bar < getEpsilon()) { /* etc. */ }
}
};
The #ifdef above returns the correct version of the variable for either the host or the device code. Everywhere I had referenced Foo::epsilon, I needed to replace it with Foo::getEpsilon() so the correct epsilon was returned.
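One piece the snippet above doesn't show is how the device-side constant gets its value. A minimal sketch, assuming the host value is Foo::epsilon and a helper (initDeviceConstants is a hypothetical name) copies it into constant memory at startup:
// Foo.cu
namespace foo {
__constant__ double epsilon; // device copy, lives in constant memory
}
double const Foo::epsilon = 1e-9; // host copy; the value is illustrative
// hypothetical helper: call once before any kernel reads epsilon
void initDeviceConstants() {
    cudaMemcpyToSymbol(foo::epsilon, &Foo::epsilon, sizeof(double));
}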
Hope this helps someone in the future. Thanks to @RobertCrovella for getting me thinking.
Related
I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on the Turing architecture.
__global__ void async_scatter_kernel(float* first,
float* last,
int* map,
float* output)
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
}
I know thrust uses dynamic parallelism to launch its kernels when called from the device; however, I couldn't find a way to specify the stream.
The following code compiles cleanly for me on CUDA 10.1.243:
$ cat t1518.cu
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>
__global__ void async_scatter_kernel(float* first,
float* last,
int* map,
float* output)
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
}
int main(){
float *first = NULL;
float *last = NULL;
float *output = NULL;
int *map = NULL;
async_scatter_kernel<<<1,1>>>(first, last, map, output);
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -rdc=true t1518.cu -o t1518
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$
The -arch=sm_35 (or similar) and -rdc=true are necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the -rdc=true switch, you get an error similar to what you describe:
$ nvcc -arch=sm_35 t1518.cu -o t1518
t1518.cu(11): error: calling a __host__ function("thrust::cuda_cub::par_t::on const") from a __global__ function("async_scatter_kernel") is not allowed
t1518.cu(11): error: identifier "thrust::cuda_cub::par_t::on const" is undefined in device code
2 errors detected in the compilation of "/tmp/tmpxft_00003a80_00000000-8_t1518.cpp1.ii".
$
So, for the example you have shown here, your compilation error can be eliminated by updating to the latest CUDA version, by specifying the proper command-line switches, or both.
I downloaded the example from https://www.olcf.ornl.gov/tutorials/mixing-openacc-with-gpu-libraries/
The code is given in the link mentioned above.
1) Using pgcc:
pgc++ -c cuFFT.cu
pgcc -acc -Mcudalib=cufft fft.c cuFFT.o
This works perfectly fine.
2) Using pgc++:
pgc++ -c cuFFT.cu
pgc++ -acc -Mcudalib=cufft fft.cpp (or .c, same files) cuFFT.o
I get the following error:
undefined reference to launchCUFFT(float*, int, void*)
pgacclnk: child process exit status 1: /usr/bin/ld
You're running into a C/C++ linking mismatch.
To get this to work in PGI 17.9 tools I had to:
rename cuFFT.cu to cuFFT.cpp
edit the malloc line in fft.c from:
float *data = malloc(2*n*sizeof(float));
to:
float *data = (float *)malloc(2*n*sizeof(float));
modify the declaration in fft.c from:
extern void launchCUFFT(float *d_data, int n, void *stream);
to:
extern "C" void launchCUFFT(float *d_data, int n, void *stream);
Issue the following compile commands:
$ pgc++ -c cuFFT.cpp
$ pgc++ -acc -Mcudalib=cufft fft.c cuFFT.o
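The reason the extern "C" matters: pgc++ gives launchCUFFT a C++-mangled symbol name, while a C-style reference looks for the plain name; extern "C" makes both sides agree. A sketch of the definition side, with the body elided:
// cuFFT.cpp -- without extern "C", pgc++ would emit a mangled symbol
// (something like _Z11launchCUFFTPfiPv) that the C caller cannot resolve
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    /* ... set up the cuFFT plan, associate the stream, execute ... */
}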
When I try to call a CUDA kernel (a __global__ function) through a function pointer, everything appears to work just fine. However, if I forget to provide a launch configuration when calling the kernel, NVCC emits no error or warning; the program compiles and then crashes when I run it.
#include <cstdio>

__global__ void bar(float x) { printf("foo: %f\n", x); }
typedef void (*FuncPtr)(float);
void invoker(FuncPtr func)
{
    func<<<1, 1>>>(1.0);
}
int main()
{
    invoker(bar);
    cudaDeviceSynchronize();
}
Compile and run the above; everything works. Then remove the kernel's launch configuration (i.e., <<<1, 1>>>). The code still compiles, but it crashes when you try to run it.
Any idea what is going on? Is this a bug, or I am not supposed to pass around pointers of __global__ functions?
CUDA version: 8.0
OS version: Debian (Testing repo)
GPU: NVIDIA GeForce 750M
If we take a slightly more complex version of your repro, and look at the code emitted by the CUDA toolchain front-end, it becomes possible to see what is happening:
#include <cstdio>
__global__ void bar_func(float x) { printf("foo: %f\n", x); }
typedef void(*FuncPtr)(float);
void invoker(FuncPtr passed_func)
{
#ifdef NVCC_FAILS_HERE
bar_func(1.0);
#endif
bar_func<<<1,1>>>(1.0);
passed_func(1.0);
passed_func<<<1,1>>>(2.0);
}
So let's compile it a couple of ways:
$ nvcc -arch=sm_52 -c -DNVCC_FAILS_HERE invoker.cu
invoker.cu(10): error: a __global__ function call must be configured
i.e. the front-end can detect that bar_func is a global function and requires launch parameters. Another attempt:
$ nvcc -arch=sm_52 -c -keep invoker.cu
As you note, this produces no compile error. Let's look at what happened:
void bar_func(float x) ;
# 5 "invoker.cu"
typedef void (*FuncPtr)(float);
# 7 "invoker.cu"
void invoker(FuncPtr passed_func)
# 8 "invoker.cu"
{
# 12 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : (bar_func)((1.0));
# 13 "invoker.cu"
passed_func((2.0));
# 14 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : passed_func((3.0));
# 15 "invoker.cu"
}
The standard kernel invocation syntax <<<>>> gets expanded into an inline call to cudaConfigureCall, and then a host wrapper function is called. The host wrapper has the API internals required to launch the kernel:
void bar_func( float __cuda_0)
# 3 "invoker.cu"
{__device_stub__Z8bar_funcf( __cuda_0); }
void __device_stub__Z8bar_funcf(float __par0)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
{ volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(float))bar_func));
(void)cudaLaunch(((char *)((void ( *)(float))bar_func)));
};
}
So the stub only handles arguments and launches the kernel via cudaLaunch. It doesn't handle the launch configuration.
The underlying reason for the crash (actually an undetected runtime API error) is that the kernel launch happens without a prior configuration. Obviously this happens because the CUDA front end (and C++ for that matter) can't do pointer introspection at compile time and detect that your function pointer is a stub function for calling a kernel.
I think the only way to describe this is a "limitation" of the runtime API and compiler. I wouldn't say what you are doing is wrong, but I would probably be using the driver API and explicitly managing the kernel launch myself in such a situation.
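For illustration, a minimal driver API sketch in which the launch configuration is an explicit argument to cuLaunchKernel and so cannot be forgotten. The cubin file name is hypothetical, and it assumes the kernel was declared extern "C" so its symbol name is unmangled:
#include <cuda.h>

void launch_bar(float x)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod, "bar.cubin");      // hypothetical precompiled module
    cuModuleGetFunction(&fn, mod, "bar"); // unmangled name via extern "C"

    void *args[] = { &x };
    cuLaunchKernel(fn,
                   1, 1, 1,   // grid dimensions -- explicit, not optional
                   1, 1, 1,   // block dimensions
                   0, NULL,   // shared memory bytes, stream
                   args, NULL);
    cuCtxSynchronize();
}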
I compiled it the way I would with gcc:
nvcc a.cu ut.cu
but the compiler shows:
ptxas fatal : Unresolved extern function '_Z1fi'
The problem only occurs when the called function is a __device__ function.
[File ut.h]
__device__ int f(int);
[File ut.c]
#include "ut.h"
__device__ int f(int a){
return a*a;
}
[File a.cu]
#include "ut.h"
__global__ void mk(){
f(5);
}
int main(){
mk<<<1,1>>>();
}
When a __device__ or __global__ function calls a __device__ function (or __global__ function, in the case of dynamic parallelism) in another translation unit (i.e. file), then it is necessary to use device linking. To enable device linking with your simple compile command, just add the -rdc=true switch:
nvcc -rdc=true a.cu ut.cu
That should fix the issue.
Note that in your compile command you list "ut.cu", but in your question you show "ut.c"; I assume that should be the file "ut.cu". If not, you will also need to rename the file from "ut.c" to "ut.cu".
You can read more about device linking in the nvcc manual.
I want to use Zgemv in parallel.
__global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    ....
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S+i*n*n, n, A+n*i, 1, &beta, B+i*n, 1);
}

void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    dim3 grid = dim3(1,1,1);
    dim3 block = dim3(32,1,1);
    S_Cphir<<<grid,block>>>(S,A,B,n,l);
}
my compile command is
nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device-code true
nvcc -o ./main.v2 time_propagation_cublas.o -lcublas
The first line works, but the second line fails:
In function `__sti____cudaRegisterAll_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356()':
tmpxft_000032b7_00000000-3_time_propagation_cublas.cudafe1.cpp:(.text+0x17a4): undefined reference to `__cudaRegisterLinkedBinary_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356'
collect2: ld returned 1 exit status
I searched for "cudaRegisterLinkedBinary" but found nothing. I know nvcc supports calling CUBLAS from a kernel.
Use the CUBLAS Device Library sample code as your reference. On a standard CUDA 5.5 install, you'll find it at:
/usr/local/cuda/samples/7_CUDALibraries/simpleDevLibCUBLAS
Referring to the Makefile in that directory, your compile commands should be like this:
nvcc -arch=sm_35 -rdc=true -o main.v2 time_propagation_cublas.cu -lcublas -lcublas_device -lcudadevrt