Cudafy - how to pass machine32 to nvcc - cuda

Am trying to get the cudafy.net examples to compile on my 64bit machine. Am stuck on the error:
fatal error... '(null)' could not be found
which seems to be fixed here:
by passing --machine 32 argument to nvcc call
Question: How do I do that in the VS2010 solution?

CudafyTranslator.Cudafy() method has several overloads.
For example this overload takes an ePlatform Enum value that defines for what platform (Auto, x86, x64 or both) you want your kernels to be compiled.

Related

Undefined reference to `cublasCreate_v2’ in ‘/tmp/tmpxft_0000120b_0000000-10_my_program”

I have tried to compile a code using CUDA 9.0 toolkit on NVIDIA Tesla P100 graphic card (Ubuntu version 16.04) and CUBLAS library is used in the code. For compilation, I have used the following command to compile “my_program.cu”
nvcc -std=c++11 -L/usr/local/cuda-9.0/lib64 my_program.cu -o mu_program.o -lcublas
But, I have got the following error:
nvlink error: Undefined reference to 'cublasCreate_v2’in '/tmp/tmpxft_0000120b_0000000-10_my_program’
As I have already linked the library path in the compilation command, why do I still get the error. Please help me to solve this error.
It seems fairly evident that you are trying to use the CUBLAS library in device code. This is different than ordinary host usage and requires special compilation/linking steps. You need to:
compile for the correct device architecture (must be cc3.5 or higher)
use relocatable device code linking
link in the cublas device library (in addition to the cublas host library)
link in the CUDA device runtime library
Use a CUDA toolkit prior to CUDA 10.0
The following additions to your compile command line should get you there:
nvcc -std=c++11 my_program.cu -o my_program.o -lcublas -arch=sm_60 -rdc=true -lcublas_device -lcudadevrt
The above assumes you are actually using a proper install of CUDA 9.0. The CUBLAS device library was deprecated and is now removed from newer CUDA toolkits (see here).

CUDA: safeCall() Runtime API error invalid device symbol [duplicate]

I use the cmake gui tool to configure my cuda project in vs2013.
CMakeLists.txt is as below:
project(CUDA_PART)
# required cmake version
cmake_minimum_required(VERSION 3.0)
include_directories(${CUDA_PART_SOURCE_DIR}/common)
# packages
find_package(CUDA REQUIRED)
# nvcc flags
set(CUDA_NVCC_FLAGS -gencode arch=compute_20,code=sm_20;-G;-g)
set(CUDA_VERBOSE_BUILD ON)
#FILE(GLOB SOURCES "*.cu" "*.cpp" "*.c" "*.h")
CUDA_ADD_EXECUTABLE(CUDA_PART hist_gpu_shmem_atomics.cu)
The .cu file is from Cuda by example source code hist_gpu_shmem_atomics.cu
There are two problems:
After the line histo_kernel <<<blocks * 2, 256 >>>(dev_buffer, SIZE, dev_histo);an "invalid device function" error occurs.
When I use the CUDA debugging tool to debug, its cannot trigger breakpoints in the device code.
But when I create a project with the same code by the cuda project temple in visual studio 2013.It works correctly!
So, is there something wrong in the CMakeLists.txt ?
OS: Win7 64bit;GPU: GTX960;CUDA: CUDA 7.5;VS: 2013 (and also 2010)
When I use set the "Code Generation" in vs2013 as follow :
The CUDA_NVCC_FLAGES turns out to be -gencode=arch=compute_20,code=\"sm_20,compute_20\"
It equals to:
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_20,code=compute_20
So, I guess it will generate 2 versions machine code: the first one(SASS) with virtual and real architectures and the second one(PTX) with only virtual architecture. Since my GTX960 is a cc5.2 device, it chooses the second one (PTX) and convert it to a suitable SASS.
This is a problem:
set(CUDA_NVCC_FLAGS -gencode arch=compute_20,code=sm_20;-G;-g)
Those flags will cause nvcc to generate SASS code (only) for a cc 2.0 device (only). Such cc2.0 SASS code will not run on your cc5.2 device (GTX960). "Invalid device function" is exactly the error you would get when trying to launch a kernel in such a scenario. Since the kernel will never launch, trying to hit breakpoints in device code won't work.
I'm not a CMake expert, so there might be other, more sensible approaches, but one possible way to try to fix this might be:
set(CUDA_NVCC_FLAGS -gencode arch=compute_52,code=sm_52;-G;-g)
which should generate code for your cc5.2 device. There are undoubtedly other possible settings here, you may want to read this or the nvcc manual for more background on compile options to target specific devices.
Also note that -G generates device debug code, which is fine if that is what you want. However it will generally run slower than code compiled without that switch. If you want to debug, however, that switch is necessary.

How to force cubin file generation for a higher compute version

In the samples provided with CUDA 6.0, I'm running the following compile command with error output:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc cdpSimpleQuicksort.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
cdpSimpleQuicksort.cu(105): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
cdpSimpleQuicksort.cu(114): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
2 errors detected in the compilation of "/tmp/tmpxft_0000241a_00000000-6_cdpSimpleQuicksort.cpp1.ii".
I then altered the command to this, with a new failure:
foo#foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc -gencode arch=compute_35,code=sm_35 cdpSimpleQuicksort.cu
cdpSimpleQuicksort.cu(105): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
cdpSimpleQuicksort.cu(114): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
2 errors detected in the compilation of "/tmp/tmpxft_000024f3_00000000-6_cdpSimpleQuicksort.cpp1.ii".
Does this have anything to do with the fact that the machine I'm on is only Compute 2.1 capable and the build tools are blocking me? What's the resolution... I'm not finding anything in the documentation that is clearly handling this error.
I looked at this question, and that... a link to documentation is simply not helping. I need to know how I have to modify the compile command.
Look at the makefile that comes with that cdpSimpleQuicksort project. It shows some additional switches that are needed to compile it, due to CUDA dynamic parallelism (which is essentially the second set of errors you are seeing.) Go back and study that makefile, and see if you can figure out how to combine some of the compile commands there with --cubin.
The readers digest version is that this should compile without error:
nvcc --cubin -rdc=true -I../../common/inc -arch=sm_35 cdpSimpleQuicksort.cu
Having said all that, you should be able to compile for whatever kind of target you want, but you won't be able to run a cdp code on a cc2.1 architecture.
cdp documentation
and here

CUDA: Why does compute_20 code fail on compute_35 device?

For a computer with Titan GPU (compute_35,sm_35), I compiled some code using this line in CMakeLists.txt:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_35,code=sm_35)
The code compiles and also runs fine.
I wanted to check what compilation problems this code would cause for a friend who uses a GTS 450 (compute_20,sm_21). So, I changed the above line to:
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=sm_21)
The code compiles without any errors on my computer with Titan. But when I run it (again on my Titan computer), its fails after a thrust::copy call with the following error:
$ ./foobar
terminate called after throwing an instance of 'thrust::system::system_error'
what(): invalid device function
"foobar" terminated by signal SIGABRT (Abort)
Google says the above error is caused due to GPU architecture mismatch.
The strangest part is that with the above line (arch=compute_20,code=sm_21), the code compiles and runs without error on my friend's computer with GTS 450! Except for the GPU, her Ubuntu 12.04, gcc and CUDA SDK 5.5 versions are the same as mine.
Is this the real cause of this error? Why cannot Titan run compute_20 code? Isn't a CUDA GPU supposed to be backwards compatible with PTX or SASS code? Even if it isn't, why cannot the driver JIT compile the compute_20 PTX to the SASS of sm_35?
If you specify:
-gencode arch=compute_20,code=compute_20
your code should run (via JIT) on either GPU.
According to the nvcc manual, JIT is directly enabled when you specify a virtual architecture for the code switch. You can make multiple specifications in a single command:
-arch=compute_20 -code=compute20,sm_21,sm_35
(note this is in lieu of specifying -gencode ...)
which would allow JIT from sm_20 PTX, and non-JIT execution directly on cc2.1 or cc3.5 devices.

CUDA: nvcc fatal error when trying to make CUDASW++

I keep getting:
nvcc fatal : Value 'sm_20' is not defined for option 'gpu-name'
My GPU is a GTX 590 and is indeed version 2.0 so that's not the problem. I switched to a lower version (sm_20) and get tons of errors with .h files.
Any ideas on what to try? I'm using cuda 5.0.
You could try compute_20 instead of sm_20.
Looking at the nvcc documentation in CUDA 5.0, the --gpu-name command line option is not mentioned. I guess it is an old option and you should probably instead use the -arch and/or -code options.