Is it possible to call cuSPARSE routines from kernel functions?

Is it possible to call cuSPARSE routines from the GPU, that is, from inside a kernel using dynamic parallelism?
libcublas_device.a makes it possible to call cuBLAS routines from the GPU, so I assumed a libcusparse_device.a would exist and allow calling cuSPARSE routines from the GPU as well. However, that file does not seem to exist. Is this possible? If yes, how? If not, does NVIDIA plan to deliver such a feature in future releases?
Note: I run CentOS and use a Tesla K20m GPU (CUDA 5.5, compute capability 3.5).

Quoting the cuSPARSE Library documentation for CUDA 6.5 (Release Candidate version):
The cuSPARSE library contains a set of basic linear algebra subroutines used for
handling sparse matrices. It is implemented on top of the NVIDIA® CUDA™ runtime
(which is part of the CUDA Toolkit) and is designed to be called from C and C++.
Accordingly, as of August 2014, you can't call cuSPARSE routines from kernel functions. The answer to your question is then: NO.
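For contrast, this is roughly what the device-side cuBLAS path the question refers to looks like; a minimal sketch assuming the CUDA 5.5-era toolchain, built with nvcc -arch=sm_35 -rdc=true gemm.cu -lcublas_device -lcudadevrt:
#include <cublas_v2.h>

// Kernel that calls cuBLAS from device code; A, B, C are n x n, column-major.
__global__ void deviceGemm(int n, const double *A, const double *B, double *C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);                  // handle created in device code
    const double one = 1.0, zero = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &one, A, n, B, n, &zero, C, n);
    cublasDestroy(handle);
}
No equivalent device-side entry points are shipped for cuSPARSE.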

Related

Reading Shared/Local Memory Store/Load bank conflicts hardware counters for OpenCL executable under Nvidia

It is possible to use nvprof to read the bank-conflict counters for a CUDA executable:
nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe
However, this does not work for code that uses OpenCL rather than CUDA.
Is there any way to extract these counters outside nvprof, from the OpenCL environment, maybe directly from the PTX?
Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler (via clGetProgramInfo with CL_PROGRAM_BINARIES) into a CUDA kernel, run it using cuModuleLoadDataEx, and thus be able to use nvprof?
Is there any CPU simulation backend that allows setting parameters such as bank size, etc.?
An additional option: use a converter from OpenCL to CUDA that covers features missing from CUDA, such as vloadn/vstoren, float16, and various other accessors; #define shims work only for simple kernels (a sketch follows after the answer below). Is there any tool that provides this?
Is there any way to extract these counters outside nvprof, from the OpenCL environment, maybe directly from the PTX?
No. Nor is there in CUDA, nor in compute shaders in OpenGL, DirectX or Vulkan.
Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler (via clGetProgramInfo with CL_PROGRAM_BINARIES) into a CUDA kernel, run it using cuModuleLoadDataEx, and thus be able to use nvprof?
No. OpenCL PTX and CUDA PTX are not the same and can't be used interchangeably.
Is there any CPU simulation backend that allows setting parameters such as bank size, etc.?
Not that I am aware of.
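As for the #define porting approach mentioned in the question, a minimal sketch of what such hypothetical macro shims look like, and why they only stretch to simple kernels (the get_global_id shim below is hard-wired to dimension 0):
// Hypothetical macro shims mapping OpenCL builtins onto CUDA.
#define __kernel extern "C" __global__
#define __global                       /* address-space qualifier: drop it */
#define get_global_id(dim) (blockIdx.x * blockDim.x + threadIdx.x)  /* dim 0 only */

__kernel void scale(__global float *data, float factor)
{
    data[get_global_id(0)] *= factor;
}
Builtins such as vloadn/vstoren, float16, and multi-dimensional indexing have no one-line CUDA equivalent, which is why the approach breaks down beyond simple kernels.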

Difference between @cuda.jit and @jit(target='gpu')

I have a question on working with Python CUDA libraries from Continuum's Accelerate and numba packages. Is using the decorator @jit with target='gpu' the same as @cuda.jit?
No, they are not the same, although the eventual compilation path (into PTX, then into assembler) is. The @jit decorator is the general compiler path, which can be optionally steered onto a CUDA device. The @cuda.jit decorator is effectively the low-level Python CUDA kernel dialect which Continuum Analytics have developed. So you get support for CUDA built-in variables like threadIdx and memory space specifiers like __shared__ in @cuda.jit.
If you want to write a CUDA kernel in Python and compile and run it, use @cuda.jit. Otherwise, if you want to accelerate an existing piece of Python, use @jit with a CUDA target.
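A minimal sketch of the two paths, assuming a numba install with CUDA support (the target='gpu' spelling was era-specific and has since changed, so treat the decorator arguments as version-dependent):
import numpy as np
from numba import cuda, jit

@cuda.jit                        # explicit CUDA kernel dialect
def scale_kernel(arr, factor):
    i = cuda.grid(1)             # built-in thread indexing
    if i < arr.size:
        arr[i] *= factor

@jit(nopython=True)              # general compiler path (CPU here)
def scale_loop(arr, factor):
    for i in range(arr.size):
        arr[i] *= factor

data = np.ones(1024, dtype=np.float32)
scale_kernel[4, 256](data, 2.0)  # kernel launch: [blocks, threads]
scale_loop(data, 0.5)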

How to view CUDA library function calls in profiler?

I am using the cuFFT library. How do I modify my code to see the function calls from this library (or any other CUDA library) in the NVIDIA Visual Profiler (NVVP)? I am using Windows and Visual Studio 2013.
Below is my code. I convert my image and filter to the Fourier domain, then perform point-wise complex matrix multiplication in a custom CUDA kernel I wrote, and then simply perform the inverse DFT on the filtered image's spectrum. The results are accurate, but I am not able to figure out how to view the cuFFT functions in the profiler.
// Execute FFT Plans
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
// Perform complex pointwise multiplication on filter spectrum and image spectrum
pointWise_complex_matrix_mult_kernel<<<grid, block>>>(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
// Execute FFT^-1 Plan
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
At the entry point to the library, the library call is like any other call into a C or C++ library: it is executing on the host. Within that library call, there may be calls to CUDA kernels or other CUDA API functions, for a CUDA GPU-enabled library such as CUFFT.
The profilers (at least up through CUDA 7.0; see the note about CUDA 7.5 nvprof below) don't natively support the profiling of host code. They are primarily focused on kernel calls and CUDA API calls. A call into a library like CUFFT by itself is not considered a CUDA API call.
You haven't shown a complete profiler output, but you should see the CUFFT library make CUDA kernel calls; these will show up in the profiler output. The first two CUFFT calls prior to your pointWise_complex_matrix_mult_kernel should have one or more kernel calls each that show up to the left of that kernel, and the last CUFFT call should have one or more kernel calls that show up to the right of that kernel.
One possible way to get specific sections of host code to show up in the profiler is to use the NVTX (NVIDIA Tools Extension) library to annotate your source code, which will cause those annotations to show up in the profiler output. You might want to put an NVTX range event around the library call you wish to see identified in the profiler output.
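A minimal sketch of such an NVTX annotation, assuming the program links against -lnvToolsExt; the named range then shows up as a labelled span on the NVVP timeline:
#include <cufft.h>
#include <nvToolsExt.h>

void forwardFFT(cufftHandle plan, cufftReal *d_in, cufftComplex *d_spectrum)
{
    nvtxRangePushA("image forward FFT");    // open a named range
    cufftExecR2C(plan, d_in, d_spectrum);   // the host-side library call being bracketed
    nvtxRangePop();                         // close the range
}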
Another approach would be to try out the new CPU profiling features in nvprof in CUDA 7.5. You can refer to section 3.4 of the Profiler guide that ships with CUDA 7.5RC.
Finally, ordinary host profilers should be able to profile your CUDA application, including CUFFT library calls, but they won't have any visibility into what is happening on the GPU.
EDIT: Based on discussion in the comments below, your code appears to be similar to the simpleCUFFT sample code. When I compile and profile that code on Win7 x64, VS 2013 Community, and CUDA 7, I get the following output (zoomed in to depict the interesting part of the timeline; screenshot not reproduced here):
You can see that there are CUFFT kernels being called both before and after the complex pointwise multiply and scale kernel that appears in that code. My suggestion would be to start by doing something similar with the simpleCUFFT sample code rather than your own code, and see if you can duplicate the output above. If so, the problem lies in your code (perhaps your CUFFT calls are failing, perhaps you need to add proper error checking, etc.)
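On the error-checking point, a minimal sketch: every CUFFT entry point returns a cufftResult, which the code in the question discards. A hypothetical CUFFT_CHECK wrapper might look like this:
#include <cstdio>
#include <cufft.h>

#define CUFFT_CHECK(call)                                         \
    do {                                                          \
        cufftResult r_ = (call);                                  \
        if (r_ != CUFFT_SUCCESS)                                  \
            fprintf(stderr, "CUFFT error %d at %s:%d\n",          \
                    (int)r_, __FILE__, __LINE__);                 \
    } while (0)

// Usage: CUFFT_CHECK(cufftExecR2C(fftPlanFwd, (cufftReal *)d_in,
//                                 (cufftComplex *)d_img_Spectrum));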

Linear algebra libraries and dynamic parallelism in CUDA

With the advent of dynamic parallelism on compute capability 3.5 and later CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions?
Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?
CUBLAS library functions can be called from device code.
Thrust algorithms can be called from device code (see the sketch after this list).
Various CURAND functions can be called from device code.
Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used from host code.
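To illustrate the Thrust point above, a minimal sketch using CUDA 7's Thrust: an algorithm invoked from device code with the thrust::seq execution policy runs sequentially in the calling thread:
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

__global__ void sortEachRow(float *rows, int rowLen)
{
    float *row = rows + blockIdx.x * rowLen;           // one row per block
    if (threadIdx.x == 0)
        thrust::sort(thrust::seq, row, row + rowLen);  // in-thread sequential sort
}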

Does CUDA 5 support STL or THRUST inside the device code?

It's mentioned that CUDA 5 allows library calls from kernels.
Does that mean CUDA 5 can use Thrust or the STL inside device code?
CUDA 5 has a device code linker for the first time. It means you can have separate object files of device functions and link against them rather than having to declare them at compilation unit scope. It also adds the ability for kernels to call other kernels (but only on compute 3.5 Kepler devices).
None of this means that C++ standard library templates or Thrust can be used inside kernel code.
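To make the linker point concrete, a minimal sketch of separate compilation with relocatable device code, built with something like nvcc -arch=sm_35 -rdc=true util.cu main.cu:
// ---- util.cu ----
__device__ float scale(float x) { return 2.0f * x; }

// ---- main.cu ----
extern __device__ float scale(float x);   // defined in util.cu, resolved at device link time

__global__ void apply(float *data)
{
    data[threadIdx.x] = scale(data[threadIdx.x]);
}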