write a cuda program to compile both sm_1x and sm_2x - cuda

My problem is very similar to this link, but I am not able to fix it.
I have a CUDA program using cuda layered texture. This feature is only available with Fermi architecture (with compute capability more than or equal to 2.0). If the GPU is not Fermi, I use 3d texture as substitution for layered texture. I use __CUDA_ARCH__ in my code when declaring the texture reference (texture reference needs to be global) as this:
#if __CUDA_ARCH__ >= 200
texture<float, cudaTextureType2DLayered> depthmapsTex;
#else
texture<float, cudaTextureType3D> depthmapsTex;
#endif
The problem I have is that it seems __CUDA_ARCH__ is not defined.
The things I have tried:
1) __CUDA_ARCH__ is able to work correctly within cuda kernel. I know from the NVCC document that __CUDA_ARCH__ is not able to work correctly within host code. I have to define the texture reference as global variable. Does it belong to host code? The extension of the file being compiled is .cu.
2) I have a program that works correctly using layered texture. Then I add __CUDA_ARCH__ macro in two ways:
#ifdef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
and
#ifndef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
I found neither of them work. Both have the same error. error : identifier "depthmapsTex" is undefined. It looks as if the MACRO __CUDA_ARCH__ is defined and not defined at the same time. I suspect this relates to the fact that the compilation has two stages, and only one of the stage can see __CUDA_ARCH__, but I am not sure what has happened exactly.
I use cmake + visual studio 10 to set up the project and compile the code. I suspect if there is anything wrong here.
I am not sure if I have provided enough information. Any help is appreciated. Thank you!
Edit:
I tried to find any example that uses __CUDA_ARCH__ in Nvidia CUDA SDK 5.0. The following code is extracted from line 20 to line 24 in file GPUHistogram.h in the project grabcutNPP.
#if __CUDA_ARCH__<300
#define PARALLEL_HISTS 64
#else
#define PARALLEL_HISTS 8
#endif
And from line 216 to line 219, it uses the MACRO PARALLEL_HISTS:
int gpuHistogramTempSize(int n_bins)
{
return n_bins * PARALLEL_HISTS * sizeof(int);
}
But I found there is a problem here. PARALLEL_HISTS is not correctly defined. If I change the first clause to #if defined(__CUDA_ARCH__)&& __CUDA_ARCH__<300, I found the CUDA_ARCH is not defined. Does the CUDA SDK example use CUDA_ARCH in the wrong way?

I am not sure I understand the exact problem which may well have an elegant solution. Here is an inelegant brute-force approach I have used in the past. Create two kernels with identical signatures, but different names (e.g. foo_sm10(), foo_sm20(), in two separate .cu files. Compile one file for sm_10, and the other file for sm_20. Move common code that is independent of compute capability into a header file, and include it from both of the previously mentioned .cu files. In the host code, create a function pointer to invoke the architecture-dependent kernels. Initialize the function pointer to the approriate architecture-dependent kernel based on the compute capability detected at runtime.

If you want to figure out the compute capability of your GPU, you could try something like:
int devID;
cudaDeviceProp props;
CUDA_SAFE_CALL( cudaGetDevice(&devID) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&props, devID) );
float cc;
cc = props.major+props.minor*0.1;
printf("\n:: CC: %.1f",cc);
But I have no idea how to solve your problem.

Related

CUDA invalid device function

I get an thrust::system::system_error invalid device function while trying to access a device vector with thrust::device_vector< int > labels_d(width*height);
In my CMakeFile I've written
SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};-gencode arch=compute_20,code=compute_20)
And also tried different settings there.
So I guess it has something to do with my GPU (a Quadro FX 580) and CUDA maybe a pointer to my device is wrong or something...
Does anybody have a clue on what to change to make it work?
I've managed to find out that my GPU simply is too old for arch=compute_20, and so I have to use arch=compute_11.

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray), and gםt the error message: "unsupported type in argument specification cuDoubleComplex".
how do I set MATLAB to know this type?
The short answer, you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need either modify you code to use double2 in place of cuDoubleComplex, or supply Matlab with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.

The behavior of __CUDA_ARCH__ macro

In the host code, it seems that the __CUDA_ARCH__ macro wont generate different code path, instead, it will generate code for exact the code path for the current device.
However, if __CUDA_ARCH__ were within device code, it will generate different code path for different devices specified in compiliation options (/arch).
Can anyone confirm this is correct?
__CUDA_ARCH__ when used in device code will carry a number defined to it that reflects the code architecture currently being compiled.
It is not intended to be used in host code. From the nvcc manual:
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
Usage of __CUDA_ARCH__ in host code is therefore undefined (at least by CUDA). As pointed out by #tera in the comments, since the macro is undefined in host code, it could be used to differentiate host/device paths for example, in a __host__ __device__ function definition.
#ifndef __CUDA_ARCH__
//host code here
#else
//device code here
#endif

"Unexpected address space" compilation error while using shared memory in PTX

I have written a trivial kernel in which I declare my shared memory array as
extern __shared__ float As[100];
In my kernel launch I specify the number_of_bytes of shared memory. I get the error "Unexpected address space" while compiling the kernel(to PTX). I am using fairly new version of LLVM from svn(3.3 in progress). Any ideas what I am doing wrong here ? the problem seems to be with extern keyword, but then how else am I gonna specify it?(Shared memory).
Should I use a different LLVM build?
Config CUDA 5.0 , Nvidia Tesla C1060
Well, it runs out that extern keyword is not really required in this case as per Gert-Jan from Nvidia forum. I am not sure what his id is on SO.
His reply --
"If you know how many elements your shared memory array has (e.g. 100 elements), you should not use the extern keyword, and you don't have to specify the number of bytes of shared memory in the kernel launch (the compiler can figure it out by himself). Only if you don't know how many elements you will need, you have to specify this in the kernel launch, and in your kernel you have to write "extern shared float *As"."
Hope this help other users.
I am not sure if CUDA-C/C++ supports this but perhaps try to set the address space attribute as a work-around:
__attribute__((address_space(3)))
extern __shared__ float As[100];
That should force llvm to put it in shared address space....
Good luck!

Visual Studio + Nsight : __syncthreads() undefined [duplicate]

At the moment CUDA already recognizes a key CUDA C/C++ function such as cudaMalloc, cudaFree, cudaEventCreate, etc.
It also recognizes certain types like dim3 and cudaEvent_t.
However, it doesn't recognize other functions and types such as the texture template, the __syncthreads functions, or the atomicCAS function.
Everything compiles just fine, but I'm tired of seeing red underlinings all over the place and I want to the see the example parameters displayed when you type in any recognizable function.
How do I get VS to catch these functions?
You could create a dummy #include file of the following form:
#pragma once
#ifdef __INTELLISENSE__
void __syncthreads();
...
#endif
This should hide the fake prototypes from the CUDA and Visual C++ compilers, but still make them visible to IntelliSense.
Source for __INTELLISENSE__ macro: http://blogs.msdn.com/b/vcblog/archive/2011/03/29/10146895.aspx
You need to add CUDA-specific keywords like __syncthreads to the usertype.dat file for visual studio. An example usertype.dat file is included with the NVIDIA CUDA SDK. You also need to make sure that visual studio recognizes .cu files as c/c++ files as described in this post:
Note however that where that post uses $(CUDA_INC_PATH), with recent versions of CUDA you should use $(CUDA_PATH)/include.
Also, I would recommend Visual Assist X -- not free, but worth the money -- to improve intellisense. It works well with CUDA if you follow these instructions:
http://www.wholetomato.com/forum/topic.asp?TOPIC_ID=5481
http://forums.nvidia.com/index.php?showtopic=53690