Complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB, with several cuDoubleComplex pointers. I invoked the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB recognize this type?

The short answer: you can't.
The list of supported types for kernels is shown here, and those are the only types your kernel code can contain if it is to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration which maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded in MATLAB as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
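The mapping works because cuDoubleComplex is a typedef of double2 in the CUDA headers: two consecutive doubles, real part first. The following host-only sketch illustrates that layout using std::complex<double>, which has the same interleaved arrangement. Double2 and as_double2 are hypothetical stand-ins introduced here so the example builds without the CUDA toolkit; they are not part of CUDA or MATLAB.

```cpp
#include <complex>
#include <cstring>

// Hypothetical stand-in for CUDA's double2 (x = real part, y = imaginary part).
struct Double2 {
    double x;
    double y;
};

// Copy one interleaved complex value into the two-double struct.
// memcpy is used instead of a pointer cast to stay clear of aliasing issues.
inline Double2 as_double2(const std::complex<double>& z) {
    static_assert(sizeof(Double2) == sizeof(std::complex<double>),
                  "both types must be exactly two doubles wide");
    Double2 out;
    std::memcpy(&out, &z, sizeof(out));
    return out;
}
```

Under this assumption, a kernel written against double2 reads the real part from .x and the imaginary part from .y of each element.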

Related

How to find the CUDA __device__ definition of a function?

I have a specific function I am trying to find the source definition for, specifically what the nvcc compiler is using. This question is phrased to apply to any function (or symbol I suppose), which is used in a __device__ function. Given:
__device__ void Foo(){
int x = round( 0.0f );
}
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Normally I use Visual Studio's F1 "Go to Definition", or search for "round" in project files, etc. I also search the CUDA Toolkit documentation and CUDA MATH API. In this case, I find the VS cmath definition. But how do I determine which definition the nvcc compiler uses?
What is the standard/canonical/recommended way to find the definition for "round( float )" used by the nvcc compiler to generate device code?
Disassembly. Most built-in functions exist as stubs in headers that are expanded into inline assembly sequences during the device compiler's code generation pass. There is no source code to view.

Unable to call CUDA half precision functions from the host

I am trying to do some FP16 work that will have both a CPU and a GPU backend. I researched my options and decided to use CUDA's half precision conversion functions and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both host and device code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>
int main() {
const float a = 32.12314f;
__half2 test = __float2half2_rn(a);
__half test2 = __float2half(a);
return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical Linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2.

How to redefine malloc/free in CUDA?

I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
I searched for this error, and it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names. But don't try to redefine anything from the standard library in device or host code. It will break things.
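As a rough sketch of the namespace workaround, the allocator below lives in its own namespace and uses its own function names, so nothing from the standard library is redefined. Everything here is an assumption made for illustration: the namespace myheap, the Pool struct, the HOST_DEVICE macro, and the bump-allocation strategy are all hypothetical, and the original theHeap implementation is not shown in the question. The macro expands to nothing under a plain C++ compiler so the sketch also builds without nvcc.

```cpp
#include <cstddef>

// Expands to CUDA qualifiers under nvcc and to nothing otherwise.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

namespace myheap {  // hypothetical private namespace

// Minimal bump allocator over caller-provided storage.
// Not thread-safe: concurrent device threads would need an atomic update of `used`.
struct Pool {
    unsigned char* base;   // start of the backing buffer
    std::size_t capacity;  // total bytes available
    std::size_t used;      // bytes handed out so far
};

HOST_DEVICE inline void* alloc(Pool& pool, std::size_t bytes) {
    // Round the request up to a multiple of 16 bytes.
    std::size_t aligned = (bytes + 15) & ~static_cast<std::size_t>(15);
    if (aligned < bytes || aligned > pool.capacity - pool.used) {
        return nullptr;  // overflow or out of space
    }
    void* p = pool.base + pool.used;
    pool.used += aligned;
    return p;
}

// A bump allocator cannot release individual blocks, so this is a no-op.
HOST_DEVICE inline void dealloc(Pool&, void*) {}

}  // namespace myheap
```

Call sites then use myheap::alloc(pool, n) and myheap::dealloc(pool, p) explicitly, leaving the toolchain's own malloc and free untouched.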

OpenCL application with 3 different kernels

I just started with OpenCL and I want to port an app I have in CUDA. The problem I'm facing now is the kernel stuff.
In CUDA I have all my kernel functions in the same file. OpenCL, in contrast, requires me to read the file containing the kernel source code and then do some other setup.
My question is: can I have one single file with all my kernel functions and then build the program in OpenCL, or do I need one file for each of my kernel functions?
It would be nice if you give a little example.
The only difference between OpenCL and CUDA (in this specific regard) is that CUDA allows you to mix device and host code in the same source file, while OpenCL requires you to load the program source as an external string and compile it at runtime.
But nevertheless it is absolutely no problem to put many kernel functions into a single OpenCL program, or even into a single OpenCL program source string. The kernels (say the C API kernel objects) themselves are then extracted from the program object using their respective function names.
pseudocode simplifying OpenCL's ugly C interface:
single OpenCL file:
__kernel void a(...) {}
__kernel void b(...) {}
C file:
source = read_cl_file();
program = clCreateProgramWithSource(source);
clBuildProgram(program);
kernel_a = clCreateKernel(program, "a");
kernel_b = clCreateKernel(program, "b");

Write a CUDA program to compile for both sm_1x and sm_2x

My problem is very similar to this link, but I am not able to fix it.
I have a CUDA program using cuda layered texture. This feature is only available with Fermi architecture (with compute capability more than or equal to 2.0). If the GPU is not Fermi, I use 3d texture as substitution for layered texture. I use __CUDA_ARCH__ in my code when declaring the texture reference (texture reference needs to be global) as this:
#if __CUDA_ARCH__ >= 200
texture<float, cudaTextureType2DLayered> depthmapsTex;
#else
texture<float, cudaTextureType3D> depthmapsTex;
#endif
The problem I have is that it seems __CUDA_ARCH__ is not defined.
The things I have tried:
1) __CUDA_ARCH__ is able to work correctly within cuda kernel. I know from the NVCC document that __CUDA_ARCH__ is not able to work correctly within host code. I have to define the texture reference as global variable. Does it belong to host code? The extension of the file being compiled is .cu.
2) I have a program that works correctly using layered texture. Then I add __CUDA_ARCH__ macro in two ways:
#ifdef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
and
#ifndef __CUDA_ARCH__
texture<float, cudaTextureType2DLayered> depthmapsTex;
#endif
Neither of them works. Both give the same error: identifier "depthmapsTex" is undefined. It looks as if the macro __CUDA_ARCH__ is defined and not defined at the same time. I suspect this relates to the fact that the compilation has two stages, and only one of the stages can see __CUDA_ARCH__, but I am not sure what exactly is happening.
I use cmake + Visual Studio 10 to set up the project and compile the code. I suspect something may be wrong here.
I am not sure if I have provided enough information. Any help is appreciated. Thank you!
Edit:
I tried to find any example that uses __CUDA_ARCH__ in Nvidia CUDA SDK 5.0. The following code is extracted from line 20 to line 24 in file GPUHistogram.h in the project grabcutNPP.
#if __CUDA_ARCH__<300
#define PARALLEL_HISTS 64
#else
#define PARALLEL_HISTS 8
#endif
And from line 216 to line 219, it uses the MACRO PARALLEL_HISTS:
int gpuHistogramTempSize(int n_bins)
{
return n_bins * PARALLEL_HISTS * sizeof(int);
}
But I found there is a problem here. PARALLEL_HISTS is not correctly defined. If I change the first clause to #if defined(__CUDA_ARCH__) && __CUDA_ARCH__<300, I find that __CUDA_ARCH__ is not defined. Does the CUDA SDK example use __CUDA_ARCH__ in the wrong way?
I am not sure I understand the exact problem, which may well have an elegant solution. Here is an inelegant brute-force approach I have used in the past. Create two kernels with identical signatures but different names (e.g. foo_sm10(), foo_sm20()), in two separate .cu files. Compile one file for sm_10, and the other file for sm_20. Move common code that is independent of compute capability into a header file, and include it from both of the previously mentioned .cu files. In the host code, create a function pointer to invoke the architecture-dependent kernels. Initialize the function pointer to the appropriate architecture-dependent kernel based on the compute capability detected at runtime.
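The host-side dispatch described above might look roughly like the following sketch. It is host-only C++ so it can be read without the CUDA toolkit. The wrappers launch_foo_sm10 and launch_foo_sm20, the LaunchFn alias, and select_launcher are hypothetical names; in a real build each wrapper would live in its own .cu file and launch the corresponding kernel, and the major version would come from the device properties queried at runtime.

```cpp
// Signature shared by both architecture-specific launch wrappers.
using LaunchFn = const char* (*)();

// Placeholder bodies: each returns a label instead of launching a kernel.
const char* launch_foo_sm10() { return "sm_10 path (3D texture)"; }
const char* launch_foo_sm20() { return "sm_20 path (layered texture)"; }

// Choose the wrapper from the device's major compute capability.
// Layered textures need compute capability 2.0 or higher.
LaunchFn select_launcher(int major) {
    return (major >= 2) ? launch_foo_sm20 : launch_foo_sm10;
}
```

The function pointer is set once after device detection, and the rest of the host code calls through it without checking the architecture again.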
If you want to figure out the compute capability of your GPU, you could try something like:
int devID;
cudaDeviceProp props;
CUDA_SAFE_CALL( cudaGetDevice(&devID) );
CUDA_SAFE_CALL( cudaGetDeviceProperties(&props, devID) );
float cc;
cc = props.major+props.minor*0.1;
printf("\n:: CC: %.1f",cc);
But I have no idea how to solve your problem.