Segmentation fault compiling CUDA "hello, world" with relocatable-device-code

I'm trying my hand at using the relocatable-device-code flag. I have a large project that would be easier to maintain if it were split into smaller compilation units.
I was able to get the project to compile, but when trying to run it I get a hard crash. Running it under the debugger gives:
(gdb) where
#0 0x0000000000000001 in ?? ()
#1 0x00007fffffffe39c in ?? ()
#2 0x0000000000000000 in ?? ()
I've never seen a stack trace like that! I then reduced the amount of code until I arrived at a minimal case: the main.cu file contains only
#include <iostream>

int main(void) {
    std::cout << "hello, world" << std::endl;
    return 0;
}
This still fails. I'm compiling my main.cu file with the following flags:
nvcc -shared -rdc=true -arch=sm_20 -Xcompiler -fPIC -g -G
Do these make sense? Why the segmentation fault for such a simple program?

Remove the -shared switch. That option is not applicable when you are trying to generate an executable.
From the documentation:
Generate a shared library during linking. Note: when other linker options are required for controlling dll generation, use option -Xlinker.
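For reference, a sketch of the corrected invocation, building an executable from the question's main.cu with the remaining flags kept (-Xcompiler -fPIC is only really needed for shared libraries, but it is harmless here):
nvcc -rdc=true -arch=sm_20 -Xcompiler -fPIC -g -G main.cu -o main
./main
If you actually do want a shared library, keep -shared, but do not try to run the resulting file directly; link it into an executable instead.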

NVCC won't look for libraries in /usr/lib/x86_64-linux-gnu - why?

Consider the following CUDA program, in a file named foo.cu:
#include <cooperative_groups.h>
#include <stdio.h>

__global__ void my_kernel() {
    auto g = cooperative_groups::this_grid();
    g.sync();
}

int main(int, char **) {
    cudaLaunchCooperativeKernel( (const void*) my_kernel, 2, 2, nullptr, 0, nullptr);
    cudaDeviceSynchronize();
}
This program needs to be compiled with -rdc=true (see this question) and needs to be explicitly linked against libcudadevrt. Ok, no problem... or is it?
$ nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 foo.cu -lcudadevrt
nvlink error : Undefined reference to 'cudaCGGetIntrinsicHandle' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
nvlink error : Undefined reference to 'cudaCGSynchronizeGrid' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
Only if I explicitly add the library's folder with -L/usr/lib/x86_64-linux-gnu, is it willing to build my program.
This is strange, because all of the CUDA libraries on my system are in that folder. Why isn't NVCC/nvlink looking in there?
Notes:
I'm using Devuan GNU/Linux 3.0.
CUDA 10.1 is installed as a distribution package.
An x86_64 machine with a GeForce 1050 Ti card.
NVCC, or perhaps nvlink, looks for paths in an environment variable named LIBRARIES. But before doing so, the settings file /etc/nvcc.profile is read (at least, it is on Devuan).
On Devuan 3.0, that file has a line saying:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu/stubs
so that is where your NVCC looks by default.
You can therefore do one of two things:
Set the environment variable outside NVCC, e.g. in your ~/.profile or ~/.bashrc file:
export LIBRARIES=-L/usr/lib/x86_64-linux-gnu/
Change that nvcc.profile line to say:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/stubs
Either way, NVCC will then successfully build your binary.
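Alternatively, without editing any configuration file, the command-line workaround already mentioned in the question also works; the build line from above then becomes:
nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 -L/usr/lib/x86_64-linux-gnu foo.cu -lcudadevrt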

CUDA code fails on Pascal cards (GTX 1080)

I tried running an executable which uses separate compilation on a GTX 1080 today (compute capability 6.1, which is not directly supported by CUDA 7.5), and wasn't able to run it: the first CUDA call fails. I have traced it down to cuBLAS, as this simple program (which doesn't even use cuBLAS)
#include <cuda_runtime_api.h>
#include <cstdio>

__global__ void foo()
{
}

int main(int, char**)
{
    void * data = nullptr;
    auto err = cudaMalloc(&data, 256);
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate, and compute_61 instead, it still fails as long as cublas_device.lib is linked.
Analysis of the simpleDevLibCublas example shows that it is built for a set of real architectures (sm_xx), not for virtual architectures (compute_xx); that is why the CUDA 7.5 example does not run on newer cards. Furthermore, the same example in the CUDA 8 RC includes only one additional architecture, sm_60, which is used only by the P100. However, that example does run on compute capability 6.1 cards such as the GTX 1080 as well. Support for the sm_61 architecture is not included in the device-side cuBLAS library even in the CUDA 8 RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even when linking cublas_device, but it will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61, --gpu-architecture=compute_61, or any --gpu-architecture=compute_xx for that matter.
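Concretely, the failing build from the question should succeed once the virtual architecture is replaced by the real sm_60 architecture (a sketch reusing the same commands):
nvcc -dc --gpu-architecture=sm_60 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=sm_60 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib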

Kernel seems not to execute

I'm a beginner when it comes to CUDA programming, and this situation doesn't look complex, yet it doesn't work.
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

__global__ void add(int *t)
{
    t[2] = t[0] + t[1];
}

int main(int argc, char **argv)
{
    int sum_cpu[3], *sum_gpu;
    sum_cpu[0] = 1;
    sum_cpu[1] = 2;
    sum_cpu[2] = 0;

    cudaMalloc((void**)&sum_gpu, 3 * sizeof(int));
    cudaMemcpy(sum_gpu, sum_cpu, 3 * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, 1>>>(sum_gpu);

    cudaMemcpy(sum_cpu, sum_gpu, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << sum_cpu[2];

    cudaFree(sum_gpu);
    return 0;
}
I'm compiling it like this
nvcc main.cu
It compiles, but the printed result is 0. I tried printing from within the kernel and it won't print, so I assume it doesn't execute. Can you explain why?
I checked your code and everything is fine. It seems to me that you are compiling it incorrectly (assuming you installed the CUDA SDK properly). Maybe you are missing some flags; that is a bit complicated in the beginning, I think. Just check which compute capability your GPU has.
As a best practice I use a Makefile for each of my CUDA projects. It is very easy to use once you have set up your paths correctly. A simplified version looks like this:
NAME=base

# Compilers
NVCC = nvcc
CC = gcc
LINK = nvcc

CUDA_INCLUDE=/opt/cuda
CUDA_LIBS= -lcuda -lcudart
SDK_INCLUDE=/opt/cuda/include

# Flags
COMMONFLAGS =-O2 -m64
NVCCFLAGS =-gencode arch=compute_20,code=sm_20 -m64 -O2
CXXFLAGS =
CFLAGS =
INCLUDES = -I$(CUDA_INCLUDE)
LIBS = $(CUDA_LIBS)
ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(COMMONFLAGS))

OBJS = cuda_base.o

# Build rules
.DEFAULT: all

all: $(OBJS)
	$(LINK) -o $(NAME) $(LIBS) $(OBJS)

%.o: %.cu
	$(NVCC) -c $(ALL_CCFLAGS) $(INCLUDES) $<

%.o: %.c
	$(NVCC) -ccbin $(CC) -c $(ALL_CCFLAGS) $(INCLUDES) $<

%.o: %.cpp
	$(NVCC) -ccbin $(CXX) -c $(ALL_CCFLAGS) $(INCLUDES) $<

clean:
	rm $(OBJS) $(NAME)
Explanation
I am using Arch Linux x64
the code is stored in a file called cuda_base.cu
the path to my CUDA SDK is /opt/cuda (maybe you have a different path)
most important: which compute capability does your card have? Mine is a GTX 580 with a maximum compute capability of 2.0, so I have to pass the NVCC flag arch=compute_20,code=sm_20, which stands for compute capability 2.0.
The Makefile needs to be stored next to cuda_base.cu. I just copied & pasted your code into that file, then typed in the shell:
$ make
nvcc -c -gencode arch=compute_20,code=sm_20 -m64 -O2 -Xcompiler -O2 -Xcompiler -m64 -I/opt/cuda cuda_base.cu
nvcc -o base -lcuda -lcudart cuda_base.o
$ ./base
3
and got your result.
A friend of mine and I created a base template for writing CUDA code. You can find it here if you like.
Hope this helps ;-)
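If you are not sure which compute capability your card has, a small query program can tell you; here is a minimal sketch using the CUDA runtime API (error handling omitted for brevity):
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                 // number of CUDA-capable devices
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);      // fill in the properties of device i
        // prop.major and prop.minor form the compute capability, e.g. 2.0
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}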
I've had the exact same problems. I've tried the vector sum example from 'CUDA by example', Sanders & Kandrot. I typed in the code, added the vectors together, out came zeros.
CUDA doesn't print error messages to the console; it only returns error codes from functions like cudaMalloc and cudaMemcpy. In my desire to get a working example, I didn't check the error codes. A basic mistake. So, when I ran the version which loads up when I start a new CUDA project in Visual Studio, and which does do error checking, bingo! An error. The error message was 'invalid device function'.
Checking out the compute capability of my card, using the program in the book or equivalent, indicated that it was...
... wait for it...
1.1
So, I changed the compile options. In Visual Studio 2013: Project -> Properties -> Configuration Properties -> CUDA C/C++ -> Device -> Code Generation.
I changed the item from compute_20,sm_20 to compute_11,sm_11. This indicates that the compute capability is 1.1 rather than the assumed 2.0.
Now, the rebuilt code works as expected.
I hope that is useful.
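As a takeaway from the above, checking the return code of every CUDA call would have surfaced the problem immediately. A minimal sketch of such a check (the CUDA_CHECK macro name is just an illustration, not part of any library):
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call so failures are reported as soon as they happen.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Example usage with the question's code:
//   CUDA_CHECK(cudaMalloc((void**)&sum_gpu, 3 * sizeof(int)));
//   add<<<1, 1>>>(sum_gpu);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors such as 'invalid device function'
//   CUDA_CHECK(cudaDeviceSynchronize());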

FFTW 3.3 compile error using NVCC on Linux

Hello everyone,
I am trying to use NVCC to compile the following code, which uses the FFTW 3.3 library:
#include <stdio.h>
#include <fftw3.h>

int main() {
    fftwf_complex a;
    a[0] = 1;
    a[1] = -1;
    printf("a = %f %f, Testing FFTW with NVCC\n", a[0], a[1]);
    return 0;
}
When I compile using gcc it works fine:
cc main.cpp -o main.out -lfftw3 -lm
main.out
a = 1.000000 -1.000000, Testing FFTW with NVCC
However, when I try to compile the same code as a .cu file, using nvcc instead of gcc, I get a long list of compile errors:
nvcc main.cu -o main.out -lfftw3 -lm
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
...
Removing the two libraries -lfftw3 -lm instead results in an undefined symbol error for fftwf_complex.
Can anyone figure out what's going on?
This is a known problem in FFTW 3.3, whereby the FFTW headers misidentify that they are being compiled with a gcc version >= 4.6, which has 128-bit floating point support. It has been reported to affect compilation with icc, and it looks like nvcc-steered compilation has the same problem.
The recommended workaround is to upgrade to FFTW 3.3.2.
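If upgrading is not an option, a general workaround (not from the original answer, just a common pattern) is to keep the FFTW-using host code in a plain .cpp file compiled by gcc and let nvcc handle only the .cu files containing device code, then link the objects together. The file names below are hypothetical:
# fft_host.cpp includes fftw3.h; kernels.cu contains the CUDA kernels
g++ -c fft_host.cpp -o fft_host.o
nvcc -c kernels.cu -o kernels.o
nvcc fft_host.o kernels.o -o main.out -lfftw3 -lm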

eCos with stm32f4discovery Cortex-M4 in Ubuntu 12.04

I wrote a simple program for eCos on the stm32f4discovery Cortex-M4, following the steps below.
$ecosconfig new stm32f4discovery
$configtool
#include <stdio.h>

int main() {
    printf("hello ecos!\r\n");
    return 0;
}
$arm-none-eabi-gcc -o hello.elf hello.c -Lecos_install/lib -I ecos_install/include -mcpu=cortex-m4 -mthumb -g -O2 -ffunction-sections -fdata-sections -Ttarget.ld -nostdlib
$arm-none-eabi-objcopy -O binary -R .sram hello.elf hello.bin
The build actually succeeds, but I don't know how to see the "hello ecos!" output.
I guess I need to set up the baud rate and tty, so I used minicom to do this. Unfortunately, I failed.
I use the stlink utility to debug STM32F4 apps. After you compile and start that utility, you can connect to the STM32 target with gdb:
(gdb) tar ext :4242
(gdb) load hello.elf
Then you should be able to debug your app.
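For reference, a typical session might look like this (a sketch; it assumes the st-util GDB server from the stlink project is on your PATH and listening on its default port 4242):
# terminal 1: start the ST-Link GDB server
$ st-util
# terminal 2: attach GDB, flash the ELF, and run it
$ arm-none-eabi-gdb hello.elf
(gdb) tar ext :4242
(gdb) load
(gdb) continue
The program's printf output goes to the eCos diagnostic channel, which is typically a serial port; check your eCos configuration for which USART and baud rate to open in minicom.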