OpenCL application with 3 different kernels - cuda

I just started with OpenCL and I want to port an app I have in CUDA. The problem I'm facing now is the kernel stuff.
In CUDA I have all my kernel functions in the same file; OpenCL, by contrast, asks you to read the file with the kernel source code at runtime and then do some other stuff.
My question is: can I have one single file with all my kernel functions and then build the program in OpenCL, or do I have to have one file for each of my kernel functions?
It would be nice if you could give a little example.

The only difference between OpenCL and CUDA (in this specific regard) is that CUDA allows you to mix device and host code in the same source file, while OpenCL requires you to load the program source as an external string and compile it at runtime.
Nevertheless, it is absolutely no problem to put many kernel functions into a single OpenCL program, or even into a single OpenCL program source string. The kernels (that is, the C API kernel objects) are then extracted from the program object using their respective function names.
pseudocode simplifying OpenCL's ugly C interface:
single OpenCL file:
__kernel void a(...) {}
__kernel void b(...) {}
C file:
source = read_cl_file();
program = clCreateProgramWithSource(source);
clBuildProgram(program);
kernel_a = clCreateKernel(program, "a");
kernel_b = clCreateKernel(program, "b");
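For reference, the real C API calls look roughly like this (a minimal sketch with error handling omitted; read_cl_file is the same hypothetical helper as above, and the context and device setup are assumed to have happened already):
const char *source = read_cl_file();   /* load the whole .cl file into one string */
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
cl_kernel kernel_a = clCreateKernel(program, "a", &err);   /* extract kernels by function name */
cl_kernel kernel_b = clCreateKernel(program, "b", &err);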

Related

Nsight Compute profiling of a __device__ function in a kernel

I am trying to use Nsight Compute to profile kernels in my CUDA code. But how do I profile functions inside a kernel? Say, for example, I have 2 functions (__device__ functions) in a kernel (__global__). Nsight Compute only profiles the kernel, but there is no mention of the functions called inside the kernel.
Nsight Compute doesn't provide the ability to profile a __device__ function directly or individually, nor will it present results organized that way.
As indicated in the comments, for a __device__ function defined in the same compilation unit as the kernel you are profiling, that function may not even exist as a separate or identifiable entity; it will usually get inlined by the compiler and then be subject to further optimization with the surrounding code.
You can, however, associate some profiler information with specific lines of code in the Nsight Compute UI "Source" page. This blog points that out with an example.
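To illustrate why, consider a sketch like the following (hypothetical names): f() is typically inlined into kernel(), so no separately profilable entity for f() survives compilation, and its cost is attributed to the kernel's source lines instead.
__device__ float f(float x) { return x * x + 1.0f; }

__global__ void kernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = f(in[i]);   // f() is inlined here; its cost shows up on these lines in the Source page
}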

How to view CUDA library function calls in profiler?

I am using the cuFFT library. How do I modify my code to see the function calls from this library (or any other CUDA library) in the NVIDIA Visual Profiler NVVP? I am using Windows and Visual Studio 2013.
Below is my code. I convert my image and filter to the Fourier domain, then perform point-wise complex matrix multiplication in a custom CUDA kernel I wrote, and then simply perform the inverse DFT on the filtered image's spectrum. The results are accurate, but I am not able to figure out how to view the cuFFT functions in the profiler.
// Execute FFT Plans
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
// Perform complex pointwise muliplication on filter spectrum and image spectrum
pointWise_complex_matrix_mult_kernel<<<grid, block>>>(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
// Execute FFT^-1 Plan
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
At the entry point to the library, the library call is like any other call into a C or C++ library: it is executing on the host. Within that library call, there may be calls to CUDA kernels or other CUDA API functions, since CUFFT is a GPU-enabled CUDA library.
The profilers (at least up through CUDA 7.0 - see note about CUDA 7.5 nvprof below) don't natively support the profiling of host code. They are primarily focused on kernel calls and CUDA API calls. A call into a library like CUFFT by itself is not considered a CUDA API call.
You haven't shown a complete profiler output, but you should see the CUFFT library make CUDA kernel calls; these will show up in the profiler output. The first two CUFFT calls prior to your pointWise_complex_matrix_mult_kernel should have one or more kernel calls each that show up to the left of that kernel, and the last CUFFT call should have one or more kernel calls that show up to the right of that kernel.
One possible way to get specific sections of host code to show up in the profiler is to use the NVTX (NVIDIA Tools Extension) library to annotate your source code, which will cause those annotations to show up in the profiler output. You might want to put an NVTX range event around the library call you wish to see identified in the profiler output.
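A minimal sketch of that approach, applied to the first CUFFT call in your code (the range name is arbitrary, and you need to link against the NVTX library, e.g. with -lnvToolsExt):
#include <nvToolsExt.h>   // NVTX header, ships with the CUDA toolkit

nvtxRangePushA("cufft forward R2C");   // open a named range on the timeline
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
nvtxRangePop();                        // close the range; it shows up as a bar in the profiler output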
Another approach would be to try out the new CPU profiling features in nvprof in CUDA 7.5. You can refer to section 3.4 of the Profiler guide that ships with CUDA 7.5RC.
Finally, ordinary host profilers should be able to profile your CUDA application, including CUFFT library calls, but they won't have any visibility into what is happening on the GPU.
EDIT: Based on discussion in the comments below, your code appears to be similar to the simpleCUFFT sample code. When I compile and profile that code on Win7 x64, VS 2013 Community, and CUDA 7, I get the following output (zoomed in to depict the interesting part of the timeline):
You can see that there are CUFFT kernels being called both before and after the complex pointwise multiply and scale kernel that appears in that code. My suggestion would be to start by doing something similar with the simpleCUFFT sample code rather than your own code, and see if you can duplicate the output above. If so, the problem lies in your code (perhaps your CUFFT calls are failing, perhaps you need to add proper error checking, etc.)
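On the error-checking point: every CUFFT call returns a cufftResult status you can test. A minimal sketch, applied to the first call in your code:
cufftResult r = cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
if (r != CUFFT_SUCCESS)
    fprintf(stderr, "cufftExecR2C failed with status %d\n", (int)r);   /* requires <stdio.h> */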

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB, with several cuDoubleComplex pointers. I activated the kernel with complex double vectors (defined as gpuArray), and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I set up MATLAB to know this type?
The short answer: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration that maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in Matlab as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
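If you take the first route instead, a sketch of the same kernel rewritten with double2 (purely illustrative; double2 has the same memory layout as cuDoubleComplex, with .x holding the real part and .y the imaginary part) might look like:
__global__ void mykernel(double2 *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the launch covers exactly numel(a) threads
    double re = a[i].x;   // real part
    double im = a[i].y;   // imaginary part
    a[i].x = re;          // e.g. conjugate the element, purely illustrative
    a[i].y = -im;
}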

Passing non-POD types as __global__ function parameters in CUDA

I know that non-POD types can't be passed as parameters to CUDA kernel launches in general.
But where can I find an explanation for this? I mean a reliable source, like a book, the CUDA manual, etc.
The entire premise of this question is incorrect. CUDA kernel arguments are not limited to POD types.
You are free to pass any complete type as an argument, either by reference or by value. There is a limit of 256 bytes or 4KB for the total size of the argument list, depending on which architecture you compile for, but that is the only restriction on the size of kernel arguments. When passing an instance of a class to a CUDA kernel, there are a number of simple restrictions which you must follow, including:
- Any pointers in the class instance which the device code will dereference must be valid device pointers
- Any member functions in the class which the device code will call must be valid __device__ functions
- It is illegal to pass a class containing virtual functions, or derived from a virtual base type, as a kernel argument
- Classes which use anonymous unions declared at namespace scope are not supported in device code
All of the features and limitations of C++ support in CUDA kernel code are described in the CUDA Programming Guide, a copy of which ships with every version of the CUDA toolkit. All you need to do is read it.
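As a concrete illustration, here is a minimal sketch (hypothetical names) that passes a non-POD class by value to a kernel and compiles with nvcc:
#include <cstdio>

struct Complexf {                       // non-POD: it has a user-defined constructor
    float re, im;
    __host__ __device__ Complexf(float r, float i) : re(r), im(i) {}
    __device__ float norm2() const { return re * re + im * im; }   // must be __device__ to call on the GPU
};

__global__ void useClass(Complexf z, float *out)
{
    *out = z.norm2();                   // the instance arrived by value in the argument list
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, sizeof(float));
    useClass<<<1, 1>>>(Complexf(3.0f, 4.0f), d_out);   // non-POD argument, passed by value
    float h_out;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out);              // prints 25.000000
    cudaFree(d_out);
    return 0;
}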

CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is OK, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking are done through makefiles with a lot of flags. I think the problem is finding the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program such as this one (to run on a Linux machine):
#include "cuPrintf.cu"   // cuPrintf ships as separate .cu/.cuh sources; the path may differ on your system

__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, it returns no error. Any suggestion on why the kernel doesn't launch ?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
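If you do want to target a specific device explicitly, the -gencode form spells out both architectures, for example (assuming a compute capability 2.0 device):
nvcc -gencode arch=compute_20,code=sm_20 -o test test.cu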
In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and then add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all of these architectures makes compilation slower and the binary larger. So eliminate them one by one to find which compute and sm code generation settings are required for your GPU.
But if you are shipping this to others, it is better to include all of them.