Parallel compilation of multiple CUDA architectures on the same .cu file - cuda
I want my compiled CUDA code to work on any Nvidia GPU, so I compile each .cu file with the options:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_30,code=sm_30
-gencode arch=compute_32,code=sm_32
-gencode arch=compute_35,code=sm_35
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_52,code=sm_52
-gencode arch=compute_53,code=sm_53
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_61,code=sm_61
-gencode arch=compute_61,code=compute_61
(This is using CUDA 8.0 so I don't have the newer architectures listed yet.)
The issue is that nvcc compiles each of these targets sequentially, which can take quite a long time. Is there a way to split this up across multiple CPU cores? I'm using a Make build system.
I can easily produce the .ptx or .cubin file for each architecture in a separate nvcc invocation by using a different Make target per architecture. But how do I then combine these into a final .o file that can be linked with my host code?
This section of the nvcc documentation:
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#cuda-compilation-trajectory
seems to imply that I should take multiple .cubin files and combine them into a .fatbin file. However, when I try to do that I get the error:
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
Is this possible? What am I missing?
Thanks!
Edit 1:
Following talonmies' reply, I've tried:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_30,code=sm_30 -o ms_30.cubin ms.cu
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -ccbin=C:/MVS14/VC/bin --machine=64 --ptxas-options=-v -D_DEBUG -D_CONSOLE -Xcompiler /EHsc,/MDd,-Od,-Z7,/W2,/RTCs,/RTCu,/we4390,/wd4251,/we4150,/we4715,/we4047,/we4028,/we4311,/we4552,/we4553,/we4804,/we4806,/we4172,/we4553,/we4700,/we4805,/we4743,/we4717,/we4551,/we4533,/we6281,/we4129,/we4309,/we4146,/we4133,/we4083,/we4477,/we4473,/FS,/J,/EHsc -I"F:/SDKs/CUDASDK/9.2/include" -DWIN32 --device-c -cubin -gencode arch=compute_35,code=sm_35 -o ms_35.cubin ms.cu
And then link with:
F:/SDKs/CUDASDK/9.2/bin/WIN64/bin/nvcc -o out.o -dlink ms_35.cubin ms_30.cubin -I"F:/SDKs/CUDASDK/9.2/include"
However, I get the error:
fatbinary fatal : fatbinary elf mismatch: elf arch '35' does not match '30'
All the examples of device linking that I've found only ever use a single architecture. Is it possible to combine architectures this way?
nvcc is merely a front-end issuing commands to a number of other tools.
If you add the --dryrun flag to your nvcc invocation, it will print the exact commands you need to run to replace your use of nvcc.
From there it should be easy to convert this list of commands into a script or makefile.
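For example, a minimal sketch using the source file from the question (nvcc_steps.txt is just an illustrative name for the captured output; --dryrun writes the command list to stderr):

# Dump, without executing, the exact commands nvcc would run for this build.
# The per-architecture device-compile steps (cicc/ptxas) appear as separate
# commands, so they can become independent Make targets and run in parallel;
# the later fatbinary and host-compile steps depend on all of them.
nvcc --dryrun -gencode arch=compute_30,code=sm_30 \
              -gencode arch=compute_35,code=sm_35 \
              -c ms.cu -o ms.o 2> nvcc_steps.txt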
Update: nvcc from CUDA 11.3 finally supports this out of the box via the -t flag.
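For instance, reusing a few of the -gencode targets from the question (a minimal sketch):

# -t 4 lets nvcc run up to four of the independent per-architecture
# compilation steps in parallel threads within a single invocation.
nvcc -t 4 -gencode arch=compute_60,code=sm_60 \
          -gencode arch=compute_61,code=sm_61 \
          -gencode arch=compute_61,code=compute_61 \
          -c ms.cu -o ms.o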
The toolchain doesn't support this directly, and you shouldn't expect to be able to do it by hand the way nvcc does internally, either.
However, you can certainly script some sort of process to:
Execute parallel compilation of the code to multiple cubin files, one for each target architecture
Perform a device link pass to combine the cubins into a single ELF payload
Link the final executable with the resulting object file emitted by the device link phase
You will probably need to enable separate device code compilation and you might also need to refactor your code slightly as a result. Caveat Emptor and all that.
Related
How can I compile a CUDA application that targets both Kepler and Maxwell Architectures?
I do development on desktops, which have a Titan X card (Maxwell architecture). However, the production code runs on servers which have K40 cards (Kepler architecture). How can I build my code so that it runs optimally on both systems? So far, I have used compute_20,sm_20 but I think that this setting is not optimal.
The first thing you would want to do is build a fat binary that contains machine code (SASS) for sm_35 (the architecture of the K40) and sm_52 (the architecture of the Titan X), plus intermediate code (PTX) for compute_52, for JIT compilation on future GPUs. You do so via the -gencode switch of nvcc:

nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52

This ensures that the executable code generated is best suited to, and makes full use of, each of the specified architectures. When the CUDA driver or runtime loads a kernel when running with a specific GPU, it will automatically select the version with the matching machine code.

What building a fat binary does not do is adjust various parameters of your code, such as the launch configurations of kernels, to be optimal for the different architectures. So if you need to achieve the best possible performance on either platform, you would want to profile the application and consider machine-specific source code adjustments based on the results of the profiling experiments.
While njuffa's answer is semantically correct, I'd like to point out a shorthand for nvcc's -gencode option. Specifically, we can shorten:

-gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52

into this:

-gencode arch=compute_52,code=\"sm_52,compute_52\"

which is described in the Nvidia documentation.
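As a concrete sketch, the two invocations below request the same fat binary for an example file kernel.cu (the file name is illustrative):

# Long form: one -gencode per code target
nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 -c kernel.cu -o kernel.o
# Shorthand from the answer above: a single -gencode naming both code targets
nvcc -gencode arch=compute_52,code=\"sm_52,compute_52\" -c kernel.cu -o kernel.o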
CUDA *.cpp files
Is there a flag I can pass nvcc to treat a .cpp file like it would a .cu file? I would rather not have to do a cp x.cpp x.cu; nvcc x.cu; rm x.cu. I ask because I have cpp files in my library that I would like to compile with/without CUDA based on particular flags passed to the Makefile.
Yes. Referring to the nvcc documentation, the flag is -x: nvcc -x cu test.cpp will compile test.cpp as if it were a .cu file (i.e. pass it through the CUDA toolchain).
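For the Makefile-flag part of the question, a minimal sketch could look like this (USE_CUDA and the file names are illustrative, not from the asker's build system):

# Build the same .cpp either through the CUDA toolchain or as plain C++,
# depending on whether USE_CUDA is set (e.g. `make USE_CUDA=1`).
ifdef USE_CUDA
COMPILE := nvcc -x cu -O2
else
COMPILE := g++ -O2
endif

# Recipe line must start with a tab.
mylib.o: mylib.cpp
	$(COMPILE) -c $< -o $@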
Compilation of cuda samples 'opensuse 13.1'
I have CUDA toolkit 6.5 installed on my openSUSE 13.1 and have a problem compiling a CUDA sample. The output after the make command is:

~# make
make[1]: Entering directory `/home/user/NVIDIA_CUDA-6.5_Samples/0_Simple/simpleStreams
/usr/local/cuda-6.5/bin/nvcc -ccbin g++ -I../../common/inc -m64 -gencode arch=compute_11,code=sm_11 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 -o simpleStreams.o -c simpleStreams.cu
nvcc warning : The 'compute_11', 'compute_12', 'compute_13', 'sm_11', 'sm_12', and 'sm_13' architectures are deprecated, and may be removed in a future release.
g++: No such file or directory
make[1]: *** [simpleStreams.o] Error 1
make[1]: Leaving directory `/home/user/NVIDIA_CUDA-6.5_Samples/0_Simple/simpleStreams
make: *** [0_Simple/simpleStreams/Makefile.ph_build] Error 2

Versions of my nvcc and gcc are:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12

gcc version 4.8.1 20130909 [gcc-4_8-branch revision 202388] (SUSE Linux)

Can someone help me to solve this problem?
nvcc doesn't like the compute_1X flags where X is 1, 2 or 3. Simply remove the

-gencode arch=compute_11,code=sm_11

part from the Makefile and the warning goes away. Although it is just a warning, it is advisable to fix all warnings to avoid trouble; depending on the nvcc configuration, compilation may even fail when a warning occurs.

The actual error arises because nvcc cannot find the g++ compiler. The most probable cause is that you haven't installed the GNU C++ compiler; the less probable cause is that you installed it manually and it is not on the PATH. To install the C++ compiler, follow this link. If that doesn't work, the problem lies elsewhere.
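For instance, on openSUSE the missing C++ front end can usually be installed and verified like this (a sketch; gcc-c++ is assumed to be the standard package name):

sudo zypper install gcc-c++   # install the GNU C++ compiler
which g++                     # confirm g++ is on the PATH
g++ --version                 # confirm the version nvcc will pick up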
nvcc compiled object on windows not a valid file format
Since I did not have access to an NVIDIA card, I was using GPUOcelot to compile and run my programs. Because I had separated my CUDA kernel and the C++ program into two files (I was using C++11 features), I was doing the following to build and run my program:

nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp
g++ my_kernel.o my_main.o -o latest_output.o `OcelotConfig -l`

I have recently been given access to a Windows box which has an NVIDIA card. I downloaded the CUDA toolkit for Windows and MinGW g++. Now I run:

nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp

The nvcc call now produces my_kernel.obj instead of my_kernel.o. And when I try to link them using g++ as I did before:

g++ my_kernel.obj my_main.o -o m

I get the following error:

my_kernel.obj: file not recognized: File format not recognized
collect2.exe: error: ld returned 1 status

Could you please resolve the problem? Thanks.
nvcc is a compiler wrapper that invokes the device compiler and the host compiler under the hood (it can also invoke the host linker, but you're using -c, so no linking is done). On Windows, the supported host compiler is cl.exe from Visual Studio. Linking two object files created with two different C++ compilers is typically not possible, even for CPU-only code, because the ABIs differ. The error message you are seeing is simply telling you that the object file format produced by cl.exe (via nvcc) is incompatible with g++. You need to compile my_main.cpp with cl.exe; if that produces errors, then that's a different question!
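For example, from a Visual Studio developer command prompt the build might look like this (a sketch using the file names from the question, letting nvcc drive the final link so the CUDA runtime library is added automatically):

REM compile the kernel; nvcc will invoke cl.exe for the host parts
nvcc -c my_kernel.cu -arch=sm_20
REM compile the host code with cl.exe instead of g++
cl /EHsc /c my_main.cpp
REM link both objects with nvcc so cudart is picked up automatically
nvcc my_kernel.obj my_main.obj -o latest_output.exe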
Thrust OpenMP without CUDA?
Can I use Thrust with the OpenMP device system if my machine doesn't have a CUDA GPU? If so, do I still require the CUDA toolkit?
I just found this in the CUDA documentation:

"When using either the OpenMP or TBB systems, nvcc isn't required. In general, nvcc is only required when targeting Thrust at CUDA. For example, we could compile the previous code directly with g++ with this command line:"

$ g++ -O2 -o monte_carlo monte_carlo.cpp -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -lgomp -I<path-to-thrust-headers>

https://github.com/thrust/thrust/wiki/Device-Backends