CUDA invalid device function

I get a thrust::system::system_error: invalid device function while trying to create a device vector with thrust::device_vector<int> labels_d(width*height);
In my CMakeFile I've written
SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=compute_20)
And also tried different settings there.
So I guess it has something to do with my GPU (a Quadro FX 580) and CUDA; maybe a pointer to my device is wrong, or something along those lines...
Does anybody have a clue on what to change to make it work?

I've managed to find out that my GPU simply is too old for arch=compute_20, so I have to use arch=compute_11 instead.
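For reference, the corresponding CMake line might look something like the following; this is just a sketch, the exact value depends on your toolkit version, and compute_1x targets were removed from nvcc in CUDA 7.0 and later:

# Hypothetical CMake line targeting an sm_11 device such as the Quadro FX 580
SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_11,code=sm_11)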


Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I have encountered a situation where I had to print inside the kernel, and I noticed that std::cout inside the kernel does not work whereas printf does. For example, consider the following SYCL code.
This works:
void print(float* A, size_t N) {
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler) {
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx) {
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying: Accessing non-const global variable is not allowed within SYCL device code.
A similar thing happens with CUDA kernels.
This got me thinking: what is the difference between printf and std::cout that causes this behavior?
Also, suppose I wanted to implement a custom print function to be called from the GPU; how should I do it?
TIA
This got me thinking: what is the difference between printf and std::cout that causes this behavior?
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists as CUDA C at all). That function uses a hardware mechanism on NVIDIA GPUs: a buffer for kernel threads to print into, which gets sent back to the host side, where the CUDA driver forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of compiler-assisted replacement/hijacking, and its code is simply irrelevant on the GPU.
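As a minimal, self-contained illustration of the in-kernel printf() path described above (file and kernel names here are made up):

// hello_printf.cu -- hypothetical example; build with: nvcc hello_printf.cu
#include <cstdio>

__global__ void hello_kernel()
{
    // Each thread writes into the device-side printf buffer; the driver
    // forwards the buffered output to the host process's stdout.
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello_kernel<<<1, 4>>>();
    cudaDeviceSynchronize();   // make sure the printf buffer is flushed before exit
    return 0;
}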
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But I decided I don't really like it, and its compilation is rather expensive, so instead I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals, the only way to do this is to build on top of printf() calls, rather than on C standard library or host-side system calls. You essentially need to layer your entire stream mechanism over that low-level primitive I/O facility. It is far from trivial.
In SYCL you cannot use std::cout for output in code not running on the host, for reasons similar to those given in the answer about CUDA code.
This means that if you are running kernel code on the "device" (e.g. a GPU), you need to use the stream class. There is more information about this in the SYCL developer guide, in the section called Logging.
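A rough sketch of what using sycl::stream could look like (SYCL 2020 style; the function and variable names here are invented for illustration):

#include <sycl/sycl.hpp>

void print_values(float* A, size_t N) {
    sycl::queue Queue;
    sycl::buffer<float, 1> Buffer{A, sycl::range<1>{N}};
    Queue.submit([&](sycl::handler& Handler) {
        auto accessor = Buffer.get_access<sycl::access::mode::read>(Handler);
        // A stream takes a total buffer size, a max statement width, and the handler.
        sycl::stream Out(1024, 256, Handler);
        Handler.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
            Out << accessor[idx[0]] << "\n";
        });
    });
    Queue.wait();
}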
There is no __device__ version of std::cout, so only printf can be used in device code.

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in an upper limit on that size, but that information is not available to the device (unless passed explicitly). Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
and a similar function can be written for %total_smem_size (a sketch is given below). This function makes the compiler add an explicit PTX instruction, just like asm can be used in host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
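For completeness, here is a sketch of that companion function, plus a hypothetical kernel showing how one might use both helpers (the kernel name is made up for illustration):

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

// Hypothetical kernel using both helpers:
__global__ void report_smem_sizes()
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("dynamic: %u bytes, total (static + dynamic): %u bytes\n",
               dynamic_smem_size(), total_smem_size());
    }
}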

"Unexpected address space" compilation error while using shared memory in PTX

I have written a trivial kernel in which I declare my shared memory array as
extern __shared__ float As[100];
In my kernel launch I specify the number of bytes of shared memory. I get the error "Unexpected address space" while compiling the kernel (to PTX). I am using a fairly new version of LLVM from svn (3.3 in progress). Any ideas what I am doing wrong here? The problem seems to be with the extern keyword, but then how else am I going to specify the shared memory?
Should I use a different LLVM build?
Config: CUDA 5.0, Nvidia Tesla C1060
Well, it turns out that the extern keyword is not really required in this case, as per Gert-Jan from the Nvidia forum. I am not sure what his id is on SO.
His reply:
"If you know how many elements your shared memory array has (e.g. 100 elements), you should not use the extern keyword, and you don't have to specify the number of bytes of shared memory in the kernel launch (the compiler can figure it out by itself). Only if you don't know how many elements you will need do you have to specify this in the kernel launch, and in your kernel you have to write "extern __shared__ float As[]"."
Hope this helps other users.
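To illustrate the two alternatives described in that reply (kernel names and the element count below are only for illustration):

// Option 1: element count known at compile time -- no extern, and no
// shared-memory byte count needed in the launch configuration.
__global__ void kernel_static()
{
    __shared__ float As[100];
    As[threadIdx.x % 100] = (float) threadIdx.x;
}

// Option 2: element count only known at launch time -- unsized extern
// declaration, byte count passed as the third launch parameter.
__global__ void kernel_dynamic()
{
    extern __shared__ float As[];
    As[threadIdx.x] = (float) threadIdx.x;
}

// kernel_static<<<grid, block>>>();
// kernel_dynamic<<<grid, block, 100 * sizeof(float)>>>();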
I am not sure if CUDA-C/C++ supports this but perhaps try to set the address space attribute as a work-around:
__attribute__((address_space(3)))
extern __shared__ float As[100];
That should force LLVM to put it in the shared address space.
Good luck!

Write a CUDA program that compiles for both sm_1x and sm_2x

My problem is very similar to the one in this link, but I am not able to fix it.
I have a CUDA program using a CUDA layered texture. This feature is only available on the Fermi architecture (compute capability greater than or equal to 2.0). If the GPU is not Fermi, I use a 3D texture as a substitute for the layered texture. I use __CUDA_ARCH__ in my code when declaring the texture reference (the texture reference needs to be global), like this:
#if __CUDA_ARCH__ >= 200
texture<float, cudaTextureType2DLayered> depthmapsTex;
#else
texture<float, cudaTextureType3D> depthmapsTex;
#endif
The problem I have is that it seems __CUDA_ARCH__ is not defined.
The things I have tried:
1) __CUDA_ARCH__ works correctly within CUDA kernels. I know from the NVCC documentation that __CUDA_ARCH__ does not work within host code. I have to define the texture reference as a global variable; does that count as host code? The extension of the file being compiled is .cu.
2) I have a program that works correctly using the layered texture. Then I add the __CUDA_ARCH__ macro in two ways:
#ifdef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
and
#ifndef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
I found that neither of them works; both give the same error: error: identifier "depthmapsTex" is undefined. It looks as if the macro __CUDA_ARCH__ is both defined and not defined at the same time. I suspect this relates to the fact that the compilation has two passes and only one of the passes can see __CUDA_ARCH__, but I am not sure exactly what is happening.
I use CMake + Visual Studio 10 to set up the project and compile the code. I suspect there might be something wrong here as well.
I am not sure if I have provided enough information. Any help is appreciated. Thank you!
Edit:
I tried to find an example that uses __CUDA_ARCH__ in the Nvidia CUDA SDK 5.0. The following code is extracted from lines 20 to 24 of the file GPUHistogram.h in the grabcutNPP project.
#if __CUDA_ARCH__<300
#define PARALLEL_HISTS 64
#else
#define PARALLEL_HISTS 8
#endif
And from lines 216 to 219, it uses the macro PARALLEL_HISTS:
int gpuHistogramTempSize(int n_bins)
{
    return n_bins * PARALLEL_HISTS * sizeof(int);
}
But I found there is a problem here: PARALLEL_HISTS is not correctly defined. If I change the first clause to #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 300, I find that __CUDA_ARCH__ is not defined. Does the CUDA SDK example use __CUDA_ARCH__ in the wrong way?
I am not sure I understand the exact problem, which may well have an elegant solution. Here is an inelegant brute-force approach I have used in the past. Create two kernels with identical signatures but different names (e.g. foo_sm10() and foo_sm20()), in two separate .cu files. Compile one file for sm_10 and the other file for sm_20. Move common code that is independent of compute capability into a header file, and include it from both of the previously mentioned .cu files. In the host code, create a function pointer to invoke the architecture-dependent kernels. Initialize the function pointer to the appropriate architecture-dependent kernel based on the compute capability detected at runtime.
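A rough sketch of that approach; every file, kernel, and function name below is invented for illustration:

// foo_sm10.cu (compiled with -arch=sm_10) defines:  __global__ void foo_sm10(float*);
// foo_sm20.cu (compiled with -arch=sm_20) defines:  __global__ void foo_sm20(float*);
// Both include a common header with the code that does not depend on compute capability.

// Host-side dispatch (e.g. in main.cu):
__global__ void foo_sm10(float* data);   // declarations of the per-arch kernels
__global__ void foo_sm20(float* data);

typedef void (*foo_kernel_t)(float*);

foo_kernel_t select_foo()
{
    int dev = 0;
    cudaDeviceProp props;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&props, dev);
    // Fermi and newer (compute capability >= 2.0) get the sm_20 kernel.
    return (props.major >= 2) ? foo_sm20 : foo_sm10;
}

// usage:  select_foo()<<<grid, block>>>(devPtr);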
If you want to figure out the compute capability of your GPU, you could try something like:
int devID;
cudaDeviceProp props;
CUDA_SAFE_CALL( cudaGetDevice(&devID) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&props, devID) );
float cc = props.major + props.minor * 0.1f;
printf("\n:: CC: %.1f", cc);
But I have no idea how to solve your problem.

CUDA PTX code %envreg<32> special registers

I tried to run PTX assembly code generated from a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure):
1) Load .cl kernel
2) JIT compile it
3) Get the compiled PTX code and save it.
So far so good.
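For context, steps 2 and 3 typically look something like the following with the OpenCL host API; variable names are illustrative, error checking is omitted, and a single-device program is assumed. On NVIDIA's implementation, the "binary" returned is PTX text:

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* 'program' is assumed to have been created from the .cl source and built
   (JIT-compiled) already -- that is step 2 above. */
void save_ptx(cl_program program, const char* filename)
{
    size_t binary_size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(binary_size), &binary_size, NULL);

    unsigned char* ptx = (unsigned char*) malloc(binary_size + 1);
    unsigned char* binaries[] = { ptx };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(binaries), binaries, NULL);
    ptx[binary_size] = '\0';   /* on NVIDIA, this "binary" is PTX text */

    FILE* f = fopen(filename, "w");
    fputs((const char*) ptx, f);
    fclose(f);
    free(ptx);
}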
I noticed some special registers inside the PTX assembly: %envreg3, %envreg6, etc. The problem is that these registers are not set (according to the PTX ISA, they are set by the driver before the kernel launch) when I try to execute the code with the driver API. So the code falls into an infinite loop and fails to run correctly. But if I manually set the values (more exactly, I replace %envreg6 with the block size inside the PTX), the code executes and I get the correct results (correct compared with the CPU results).
Does anyone know how we can set values for these registers, or maybe whether I am missing something, e.g. a flag on cuLaunchKernel that sets values for these registers?
You are trying to compile an OpenCL kernel and run it using the CUDA driver API. The NVIDIA driver/compiler interface is different between OpenCL and CUDA, so what you want to do is not supported and fundamentally cannot work.
Presumably, the only workaround would be the one you found: to patch the PTX code. But I'm afraid this might not work in the general case.
Edit:
Specifically, OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also, in OpenCL indices do not necessarily start from (0, 0, 0); the user can specify offsets which the driver must pass to the kernel. Therefore, the registers initialized for OpenCL and CUDA C launches are different.