I want my compiled CUDA code to work on any Nvidia GPU, so I compile each .cu file with the options:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_32,code=sm_32
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_53,code=sm_53
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_61,code=sm_61
-gencode arch=compute_61,code=compute_61
(This is using CUDA 8.0 so I don't have the newer architectures listed yet.)
The issue is that nvcc compiles for each of these targets one after another, which can take quite a long time. Is there a way to split this work across multiple CPU cores? I'm using a Make-based build system.
I can easily produce the .ptx or .cubin file for each architecture in a separate, asynchronous nvcc invocation by giving each architecture its own Make target. However, how do I then combine these into a single .o file that can be linked with my host code?
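A trimmed sketch of what I mean by separate asynchronous invocations, launched concurrently from a shell (with Make, each line would simply be its own per-architecture target built under make -j; the full command lines I actually use are in Edit 1 below):
nvcc -gencode arch=compute_30,code=sm_30 -cubin -o ms_30.cubin ms.cu &
nvcc -gencode arch=compute_35,code=sm_35 -cubin -o ms_35.cubin ms.cu &
wait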
This:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
seems to imply that I should take multiple .cubin files and combine them into a .fatbin file. However, when I try to do that I get the error:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Is this possible? What am I missing?
Thanks!
Edit 1:
Following talonmies' reply, I've tried:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_30,code=sm_30 -o ms_30.cubin ms.cu
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_35,code=sm_35 -o ms_35.cubin ms.cu
And then link with:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -o out.o -dlink ms_35.cubin ms_30.cubin -I"F:/SDKs/CUDASDK/9.2/include"
However I get the error:
fatbinary fatal : fatbinary elf mismatch: elf arch '35' does not match '30'
All the examples of device linking that I can find only ever use a single architecture. Is it possible to combine architectures this way?
nvcc is merely a front-end issuing commands to a number of other tools.
If you add the --dryrun flag to your nvcc invocation, it will print the exact commands you need to run to replace your use of nvcc.
From there it should be easy to convert this list of commands into a script or makefile.
Update: nvcc from CUDA 11.3 finally supports this out of the box via the -t flag.
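For example, using the file name from the question (check nvcc --help on your toolkit for the exact option spellings):
# print the exact cicc/ptxas/fatbinary commands nvcc would run, without running them
nvcc --dryrun -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -c ms.cu -o ms.o
# CUDA 11.3 or later: let nvcc run the independent per-architecture steps in parallel itself (0 = one thread per CPU)
nvcc -t 0 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -c ms.cu -o ms.o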
The toolchain doesn't support this, and you shouldn't expect to be able to do by hand what nvcc does internally either.
However, you can certainly script some sort of process to:
Execute parallel compilation of the code to multiple cubin files, one for each target architecture
Perform a device link pass to combine the cubins into a single ELF payload
Link the final executable with the resulting object file emitted by the device link phase
You will probably need to enable separate device code compilation and you might also need to refactor your code slightly as a result. Caveat Emptor and all that.
Since I did not have access to an NVIDIA card, I was using GPUOcelot to compile and run my programs. Because my CUDA kernel and my C++ program are in two separate files (I was using C++11 features), I was doing the following to build and run my program:
nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp
g++ my_kernel.o my_main.o -o latest_output.o `OcelotConfig -l`
I have recently been given access to a Windows box that has an NVIDIA card. I downloaded the CUDA toolkit for Windows and MinGW g++. Now I run:
nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp
The nvcc call now produces my_kernel.obj instead of my_kernel.o. When I try to link the two and build the executable with g++ as I did before:
g++ my_kernel.obj my_main.o -o m
I get the following error:
my_kernel.obj: file not recognized: File format not recognized
collect2.exe: error: ld returned 1 status
Could you please help me resolve this problem? Thanks.
nvcc is a compiler wrapper that invokes the device compiler and the host compiler under the hood (it can also invoke the host linker, but since you're passing -c, no linking is done). On Windows, the supported host compiler is cl.exe from Visual Studio.
Linking object files produced by two different C++ compilers is typically not possible, even for CPU-only code, because their ABIs differ. The error message you are seeing simply means that the object file format produced by cl.exe (via nvcc) is not one that g++ recognizes.
You need to compile my_main.cpp with cl.exe; if that produces errors, then that's a different question!
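A minimal sketch of that workflow, run from a Visual Studio command prompt so that cl.exe is on the PATH (the output name m.exe just follows the question; note that cl.exe has no -std=c++0x switch, and its C++11 support depends on the Visual Studio version):
nvcc -c my_kernel.cu -arch=sm_20
cl /EHsc /c my_main.cpp
REM linking through nvcc adds the CUDA runtime library automatically
nvcc my_kernel.obj my_main.obj -o m.exe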
Can I use Thrust with the OpenMP device system if my machine doesn't have a CUDA GPU? If so, do I still require the CUDA toolkit?
I just found this in the Thrust documentation:
When using either the OpenMP or TBB systems, nvcc isn't required. In general, nvcc is only required when targeting Thrust at CUDA. For example, we could compile the previous code directly with g++ with this command line:
$ g++ -O2 -o monte_carlo monte_carlo.cpp -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp -I<path-to-thrust-headers>
https://github.com/thrust/thrust/wiki/Device-Backends
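A build-and-run sketch based on that snippet (the Thrust checkout location is just an example path; since the OpenMP backend runs on the host CPU, OMP_NUM_THREADS controls how many threads it uses):
g++ -O2 -o monte_carlo monte_carlo.cpp -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp -I$HOME/thrust
OMP_NUM_THREADS=4 ./monte_carlo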
I want to compile and run bandwidthTest.cu from the CUDA SDK. I get the following two errors when I compile it with:
nvcc -arch=sm_20 bandwidthTest.cu -o bTest
cutil_inline.h: no such file or directory
shrUtils.h: no such file or directory
How can I solve this problem?
Add the current directory to your include search path.
nvcc -I. -arch=sm_20 bandwidthTest.cu -o bTest
Probably the two header files you tried to #include are not present in that directory. If you use the Visual Studio IDE, you can see the red underlining on includes it cannot resolve.
Find the directory containing cutil_inline.h and the directory containing shrUtils.h, and add them to the compilation line like this:
nvcc -I<path-to-cutil_inline.h> -I<path-to-shrUtils.h> -arch=sm_20 bandwidthTest.cu -o bTest
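For example, with the old GPU Computing SDK layout, cutil_inline.h usually lives under the SDK's C/common/inc directory and shrUtils.h under shared/inc (the <sdk-root> below is a placeholder for wherever your SDK is installed):
nvcc -I"<sdk-root>/C/common/inc" -I"<sdk-root>/shared/inc" -arch=sm_20 bandwidthTest.cu -o bTest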
Also, consider using a makefile for the compilation in case you aren't.
We are using Dell SUSE Enterprise. No choice.
SUSE doesn't have libpcap-devel or anything similar in the zypper repositories.
I've downloaded libpcap from its Git repository and installed it. libpcap requires both flex and bison to build; flex 2.5.35 is available in the repositories, as is bison.
However, I can't get any program that uses libpcap to compile. The autoconf script fails when it tries to link against libpcap.so:
configure:3633: $? = 1
configure:3636: checking whether we are using the GNU C++ compiler
configure:3665: g++ -c conftest.cpp >&5
configure:3672: $? = 0
configure:3689: result: yes
configure:3698: checking whether g++ accepts -g
configure:3728: g++ -c -g conftest.cpp >&5
configure:3735: $? = 0
configure:3836: result: yes
configure:3861: checking dependency style of g++
configure:3952: result: gcc3
configure:3981: checking for a BSD-compatible install
configure:4049: result: /usr/bin/install -c
configure:4067: checking for pcap_lookupdev in -lpcap
configure:4102: gcc -o conftest -g -O2 conftest.c -lpcap >&5
/usr/local/lib/libpcap.so: undefined reference to `pcap_lex'
collect2: ld returned 1 exit status
configure:4109: $? = 1
configure: failed program was:
Running nm on the shared library, I find:
$ nm /usr/local/lib/libpcap.so | grep pcap_lex
U pcap_lex
Of course, pcap_lex is really yylex, renamed via a #define.
I'm not in over my head here; I'm just trying to figure out why none of this stuff compiles properly on SUSE. Does anybody have a clue?
Somehow, whatever you did to compile libpcap caused it not to build correctly.
Without:
the config.log file from the libpcap directory;
the Makefile from the libpcap directory;
the output of the build in the libpcap directory;
it will be impossible to determine what went wrong, and thus impossible to fix the process so that libpcap builds correctly. I have never seen a problem like this with libpcap built on Linux, so I cannot determine what's happening here.
(If you supply this information, you will be helping not only yourself but all the people who have reported similar problems but have refused to respond to similar questions and thus have not supplied any of the information in question.)