To use the unified memory feature in CUDA 6, the following requirements must be met:
a GPU with SM architecture 3.0 or higher (Kepler class or newer)
a 64-bit host application and operating system, except on Android
Linux or Windows
My setup is:
System: Ubuntu 13.10 (64-bit)
GPU: GTX 770
CUDA: 6.0
Driver Version: 331.49
The sample code is taken from the programming guide, page 210.
#include <stdio.h>

__device__ __managed__ int ret[1000];

__global__ void AplusB(int a, int b) {
    ret[threadIdx.x] = a + b + threadIdx.x;
}

int main() {
    AplusB<<< 1, 1000 >>>(10, 100);
    cudaDeviceSynchronize();
    for (int i = 0; i < 1000; i++)
        printf("%d: A+B = %d\n", i, ret[i]);
    return 0;
}
The nvcc compile options I used are:
nvcc -m64 -Xptxas=-Werror -arch=compute_30 -code=sm_30 -o UM UnifiedMem.cu
This code compiles fine, but during execution it produces a segmentation fault at the printf(). It feels like the unified memory feature didn't take effect: the address of the variable ret is still a GPU address, but printf is called on the CPU, so the CPU is trying to access data that is not allocated on its side and segfaults. Can anybody help me? What is wrong here?
Though I am not certain (and I can't check it myself right now), I think this is because Ubuntu 13.10 ships gcc 4.8.1, which I believe is not yet supported even in the newest CUDA Toolkit 6.0. Try compiling your code with gcc 4.7.3 as the host compiler (the same one included by default in the officially supported Ubuntu 13.04). To do that, install the gcc-4.7 package and point nvcc at /usr/bin/gcc-4.7 as the host compiler. For C++ support, I believe you need g++-4.7 as well.
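If it helps, one way to point nvcc at the alternate host compiler is the -ccbin option; a sketch based on the compile line from your question (adjust the path to your install):
nvcc -ccbin /usr/bin/gcc-4.7 -m64 -Xptxas=-Werror -arch=compute_30 -code=sm_30 -o UM UnifiedMem.cu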
If you need a simple step-by-step guide, you might follow http://n00bsys0p.co.uk/blog/2014/01/23/nvidia-cuda-55ubuntu-1310-saucy-salamander. It's for CUDA Toolkit 5.5, but I think it should be relevant for recent versions as well.
Related
I'm currently trying to compile Darknet with the latest CUDA toolkit, version 11.1. My GPU is a GeForce 940M, which has compute capability 5.0. However, while rebuilding Darknet using the latest CUDA toolkit, it said
nvcc fatal : Unsupported GPU architecture 'compute_30'
compute_30 targets compute capability 3.0, so how can this fail when my GPU supports compute capability 5.0?
Is it possible that my build detected my Intel graphics card instead of my Nvidia GPU? If that's the case, is it possible to change which device is detected?
Support for compute_30 has been removed in CUDA versions after 10.2. So if you are using nvcc, make sure to use this flag to target the correct architecture in the build system for darknet:
-gencode=arch=compute_50,code=sm_50
You may also need to use this one to avoid warnings about deprecated architectures:
-Wno-deprecated-gpu-targets
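For example, in darknet's Makefile the ARCH variable could be set along these lines for a cc 5.0 device such as the 940M (a sketch; the exact location of the line varies between darknet versions):
ARCH= -gencode arch=compute_50,code=sm_50 -Wno-deprecated-gpu-targets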
I added the following:
# read the existing Makefile (this runs as a Python cell in the notebook)
makefiletemp = open('Makefile','r')
list_of_lines = makefiletemp.readlines()
makefiletemp.close()
# overwrite the 16th line with a copy of the 15th, then replace the 17th line
# with the desired ARCH setting (list indices are 0-based)
list_of_lines[15] = list_of_lines[14]
list_of_lines[16] = "ARCH= -gencode arch=compute_35,code=sm_35 \\\n"
# write the modified lines back out
makefiletemp = open('Makefile','w')
makefiletemp.writelines(list_of_lines)
makefiletemp.close()
right before the
#Compile Darknet
!make
command. That seemed to work!
I am trying to compile and run the following code on an NVIDIA P100. I'm running CentOS 6.9, driver version 396.37, and CUDA 9.2. It appears that these driver/CUDA versions are compatible.
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(int argc, char *argv[])
{
    // Declare variables
    int *dimA = NULL; //{2,3};
    cudaMallocManaged(&dimA, 2 * sizeof(float));
    dimA[0] = 2;
    dimA[1] = 3;
    cudaDeviceSynchronize();
    printf("The End\n");
    return 0;
}
It fails with a segmentation fault. When I compile with nvcc -g -G src/get_p100_to_work.cu and load the resulting core file (cuda-gdb ./a.out core.277512), I get
Reading symbols from ./a.out...done.
[New LWP 277512]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./a.out'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000040317d in main (argc=1, argv=0x7fff585da548) at src/get_p100_to_work.cu:71
71 dimA[0] = 2;
(cuda-gdb) bt full
#0 0x000000000040317d in main (argc=1, argv=0x7fff585da548) at src/get_p100_to_work.cu:71
dimA = 0x0
(cuda-gdb)
When I run this code on an NVIDIA K40, it runs without error.
QUESTION:
How do I get my code to run on the P100? It seems from this tutorial that this code should run.
Previously, I had cloned an image of a GPU node with two K40s in it and then put that image on a node with two P100s. I suspect that when the driver was installed on the K40 node, a configuration specific to the graphics cards in that machine was applied (which makes sense), and that configuration was not compatible with the P100s. Since the driver on the P100 machine was effectively broken, this would explain why my code failed so cataclysmically.
Solution: I ended up having to reinstall the driver, and now it works.
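As an aside, the backtrace shows dimA = 0x0, which means the allocation itself failed; checking the return value of cudaMallocManaged would have reported a CUDA error instead of a bare segmentation fault. A minimal sketch of that check (an illustration, not the original code):
cudaError_t err = cudaMallocManaged(&dimA, 2 * sizeof(float));
if (err != cudaSuccess) {
    // a broken or mismatched driver install is typically reported here
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
}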
I have a relatively simple CUDA kernel and I immediately call the kernel in the main method of my program in the following way:
#include <stdio.h>

__global__ void block() {
    for (int i = 0; i < 20; i++) {
        printf("a");
    }
}

int main(int argc, char** argv) {
    block<<<1, 1>>>();
    cudaError_t cudaerr = cudaDeviceSynchronize();
    printf("Kernel executed!\n");
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
}
This program is compiled and launched using Visual Studio 2015, and the project being executed was generated with CMake using the following CMakeLists.txt file:
project (Comparison)
cmake_minimum_required (VERSION 2.6)
find_package(CUDA REQUIRED)
set(
CUDA_NVCC_FLAGS
${CUDA_NVCC_FLAGS};
-arch=compute_30 -code=sm_30 -g -G
)
cuda_add_executable(Comparison kernel.cu)
I would expect the output of this program to print 20 a's to the console and then end with printing "Kernel executed!". However, the a's are never printed to the console and the line "Kernel executed!" shows up immediately, even if I replace the for loop with a while(true) loop.
Even when running the code with the Nsight debugger attached and a breakpoint set in the kernel's for loop, nothing happens, leading me to believe that the kernel is never actually launched. Does anyone know how to make this kernel behave as expected?
The kernel was not running correctly when compiled with the given CMakeLists.txt file because of these flags:
-arch=compute_30 -code=sm_30
combined with the GPU that was being used (GTX 970, a cc 5.2 GPU).
Those flags specify the generation of cc 3.0 SASS code only, and such code is not compatible with a cc 5.2 device. The fix would be to modify the flags to something like:
-arch=sm_30
or
-arch=sm_52
or
-arch=compute_52 -code=sm_52
I would recommend the first or second approach, as it will include PTX support for future devices.
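In the CMakeLists.txt from the question, the second option might look something like:
set(
  CUDA_NVCC_FLAGS
  ${CUDA_NVCC_FLAGS};
  -arch=sm_52 -g -G
  )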
The kernel error was not evident because the error checking after the kernel launch was incomplete. Refer to the canonical question/answer on CUDA error checking.
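For illustration, a more complete check around the launch from the question might look like this (a sketch, not the only way to do it):
block<<<1, 1>>>();
cudaError_t cudaerr = cudaGetLastError();   // reports launch/configuration errors, e.g.
if (cudaerr != cudaSuccess)                 // "no kernel image is available for execution on the device"
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(cudaerr));
cudaerr = cudaDeviceSynchronize();          // reports errors raised while the kernel runs
if (cudaerr != cudaSuccess)
    printf("kernel execution failed with error \"%s\".\n", cudaGetErrorString(cudaerr));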
In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I've implemented optimized kernels that utilize Dynamic Parallelism (DP). It seems that the nvcc compiler does not support DP when anything less than "compute_35,sm_35" is specified (I'm getting compiler/linker errors). My question is: what is the best way to support multiple kernel versions in such a case? Having multiple DLLs and choosing between them at runtime would work, but I was wondering if there is a better way.
UPDATE: I'm successfully using #if __CUDA_ARCH__ >= 350 for other things (like __ldg() etc.), but it does not work in the DP case, as I have to link with cudadevrt.lib, which produces the following error:
1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib
I believe this issue has been addressed now in CUDA 6.
In particular, the compile problem where specifying the -lcudadevrt library threw a link error for code that does not require dynamic parallelism has been eliminated.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
    printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
    kernel1<<<1,1>>>();
#else
    printf("Hello from non-DP Kernel\n");
#endif
}

int main(){
    kernel2<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
I have a very simple CUDA program. When compiled with the -arch=sm_11 option, it works correctly as expected. However, when compiled with -arch=sm_12, the results are unexpected.
Here is the kernel code :
__global__ void dev_test(int *test) {
    *test = 100;
}
I invoke the kernel as below:
int *dev_int, val;
val = 0;
cudaMalloc((void **)&dev_int, sizeof(int));
cudaMemset((void *)dev_int, 0, sizeof(int));
cudaMemcpy(dev_int, &val, sizeof(int), cudaMemcpyHostToDevice);
dev_test <<< 1, 1>>> (dev_int);
int *host_int = (int*)malloc(sizeof(int));
cudaMemcpy(host_int, dev_int, sizeof(int), cudaMemcpyDeviceToHost);
printf("copied back from device %d\n",*host_int);
When compiled with -arch=sm_11, the print statement correctly prints 100.
However, when compiled with -arch=sm_12, it prints 0, i.e. the change made inside the kernel is not taking effect. I am guessing this is due to some incompatibility between my CUDA version and the NVIDIA drivers.
CUDA version - 3.0
NVRM version: NVIDIA UNIX x86_64 Kernel Module 195.36.24 Thu Apr 22 19:10:14 PDT 2010
GCC version: gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
Any help is highly appreciated.
My problem finally got resolved. I'm not sure which change truly fixed it: I upgraded to CUDA 4.1 and upgraded my NVIDIA driver, and the combination of the two solved the problem.