CUDA: nvcc fatal error when trying to make CUDASW++ - cuda

I keep getting:
nvcc fatal : Value 'sm_20' is not defined for option 'gpu-name'
My GPU is a GTX 590 and is indeed version 2.0 so that's not the problem. I switched to a lower version (sm_20) and get tons of errors with .h files.
Any ideas on what to try? I'm using cuda 5.0.

You could try compute_20 instead of sm_20.
Looking at the nvcc documentation in CUDA 5.0, the --gpu-name command line option is not mentioned. I guess it is an old option and you should probably instead use the -arch and/or -code options.

Related

Undefined reference to `cublasCreate_v2’ in ‘/tmp/tmpxft_0000120b_0000000-10_my_program”

I have tried to compile a code using CUDA 9.0 toolkit on NVIDIA Tesla P100 graphic card (Ubuntu version 16.04) and CUBLAS library is used in the code. For compilation, I have used the following command to compile “my_program.cu”
nvcc -std=c++11 -L/usr/local/cuda-9.0/lib64 my_program.cu -o mu_program.o -lcublas
But, I have got the following error:
nvlink error: Undefined reference to 'cublasCreate_v2’in '/tmp/tmpxft_0000120b_0000000-10_my_program’
As I have already linked the library path in the compilation command, why do I still get the error. Please help me to solve this error.
It seems fairly evident that you are trying to use the CUBLAS library in device code. This is different than ordinary host usage and requires special compilation/linking steps. You need to:
compile for the correct device architecture (must be cc3.5 or higher)
use relocatable device code linking
link in the cublas device library (in addition to the cublas host library)
link in the CUDA device runtime library
Use a CUDA toolkit prior to CUDA 10.0
The following additions to your compile command line should get you there:
nvcc -std=c++11 my_program.cu -o my_program.o -lcublas -arch=sm_60 -rdc=true -lcublas_device -lcudadevrt
The above assumes you are actually using a proper install of CUDA 9.0. The CUBLAS device library was deprecated and is now removed from newer CUDA toolkits (see here).

Using CUDA 8.0 with GCC 6.x - bad function overloading complaint

I'm trying to build some CUDA code using GCC 6.2.1, the default compiler of my distribution (Note: not a GCC version officially supported by CUDA, so you can call this experimental). This is code which builds fine with GCC 4.9.3 and both CUDA versions 7.5 and 8.0.
Well, if I build the following (close-to) minimal example:
#include <tuple>
int main() { return 0; }
with the command-line
nvcc -std=c++11 -Wno-deprecated-gpu-targets -o main main.cu
I get the following errors:
/usr/local/cuda/bin/../targets/x86_64-linux/include/math_functions.h(8897): error: cannot overload functions distinguished by return type alone
/usr/local/cuda/bin/../targets/x86_64-linux/include/math_functions.h(8901): error: cannot overload functions distinguished by return type alone
2 errors detected in the compilation of "/tmp/tmpxft_000071fe_00000000-9_b.cpp1.ii".
Why is that? How can I correct/circumvent this?
TL;DR: Forget about it. Only use CUDA 8.x with GCC 5.x , and CUDA 9 or later with GCC 6.x
It seems other people have seen this issue with GCC 6.1.x and the suggestion is to add the following flags to nvcc: -Xcompiler -D__CORRECT_ISO_CPP11_MATH_H_PROTO (yes, two successive flags; see nvcc --help for details). (But I can't report complete success since other issues pop up instead.)
But remember that GCC 5.4.x is the latest supported version, and that probably has a good reason, so it's somewhat of a wild goose chase to force GCC 6.x onto it - especially when CUDA 9 is available now.

How to force cubin file generation for a higher compute version

In the samples provided with CUDA 6.0, I'm running the following compile command with error output:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc cdpSimpleQuicksort.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
cdpSimpleQuicksort.cu(105): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
cdpSimpleQuicksort.cu(114): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
2 errors detected in the compilation of "/tmp/tmpxft_0000241a_00000000-6_cdpSimpleQuicksort.cpp1.ii".
I then altered the command to this, with a new failure:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc -gencode arch=compute_35,code=sm_35 cdpSimpleQuicksort.cu
cdpSimpleQuicksort.cu(105): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
cdpSimpleQuicksort.cu(114): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
2 errors detected in the compilation of "/tmp/tmpxft_000024f3_00000000-6_cdpSimpleQuicksort.cpp1.ii".
Does this have anything to do with the fact that the machine I'm on is only Compute 2.1 capable and the build tools are blocking me? What's the resolution... I'm not finding anything in the documentation that is clearly handling this error.
I looked at this question, and that... a link to documentation is simply not helping. I need to know how I have to modify the compile command.
Look at the makefile that comes with that cdpSimpleQuicksort project. It shows some additional switches that are needed to compile it, due to CUDA dynamic parallelism (which is essentially the second set of errors you are seeing.) Go back and study that makefile, and see if you can figure out how to combine some of the compile commands there with --cubin.
The readers digest version is that this should compile without error:
nvcc --cubin -rdc=true -I../../common/inc -arch=sm_35 cdpSimpleQuicksort.cu
Having said all that, you should be able to compile for whatever kind of target you want, but you won't be able to run a cdp code on a cc2.1 architecture.
cdp documentation
and here

CUDA: Why does compute_20 code fail on compute_35 device?

For a computer with Titan GPU (compute_35,sm_35), I compiled some code using this line in CMakeLists.txt:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_35,code=sm_35)
The code compiles and also runs fine.
I wanted to check what compilation problems this code would cause for a friend who uses a GTS 450 (compute_20,sm_21). So, I changed the above line to:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=sm_21)
The code compiles without any errors on my computer with Titan. But when I run it (again on my Titan computer), its fails after a thrust::copy call with the following error:
$ ./foobar
terminate called after throwing an instance of 'thrust::system::system_error'
what(): invalid device function
"foobar" terminated by signal SIGABRT (Abort)
Google says the above error is caused due to GPU architecture mismatch.
The strangest part is that with the above line (arch=compute_20,code=sm_21), the code compiles and runs without error on my friend's computer with GTS 450! Except for the GPU, her Ubuntu 12.04, gcc and CUDA SDK 5.5 versions are the same as mine.
Is this the real cause of this error? Why cannot Titan run compute_20 code? Isn't a CUDA GPU supposed to be backwards compatible with PTX or SASS code? Even if it isn't, why cannot the driver JIT compile the compute_20 PTX to the SASS of sm_35?
If you specify:
-gencode arch=compute_20,code=compute_20
your code should run (via JIT) on either GPU.
According to the nvcc manual, JIT is directly enabled when you specify a virtual architecture for the code switch. You can make multiple specifications in a single command:
-arch=compute_20 -code=compute20,sm_21,sm_35
(note this is in lieu of specifying -gencode ...)
which would allow JIT from sm_20 PTX, and non-JIT execution directly on cc2.1 or cc3.5 devices.

Cudafy - how to pass machine32 to nvcc

Am trying to get the cudafy.net examples to compile on my 64bit machine. Am stuck on the error:
fatal error... '(null)' could not be found
which seems to be fixed here:
by passing --machine 32 argument to nvcc call
Question: How do I do that in the VS2010 solution?
CudafyTranslator.Cudafy() method has several overloads.
For example this overload takes an ePlatform Enum value that defines for what platform (Auto, x86, x64 or both) you want your kernels to be compiled.