Linear algebra libraries and dynamic parallelism in CUDA

With the advent of dynamic parallelism on CUDA architectures of compute capability 3.5 and above, is it possible to call linear algebra libraries from within __device__ functions?
Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?

CUBLAS library functions can be called from device code.
Thrust algorithms can be called from device code.
Various CURAND functions can be called from device code.
Other libraries that are part of the CUDA toolkit at this time (i.e. CUDA 7) -- CUFFT, CUSPARSE, CUSOLVER -- can only be used from host code.
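
As an illustration of the first point, here is a minimal sketch of a device-side cuBLAS call (the kernel name is hypothetical; it assumes a compute capability 3.5+ device and a build along the lines of nvcc -arch=sm_35 -rdc=true gemm.cu -lcublas_device -lcudadevrt):

#include <cublas_v2.h>

// Minimal sketch: a GEMM invoked entirely from device code.
// Typically launched with a single thread, e.g. deviceGemm<<<1,1>>>(...),
// so that only one child GEMM grid is spawned.
__global__ void deviceGemm(const double *A, const double *B, double *C, int n)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
        return;

    const double alpha = 1.0, beta = 0.0;
    // C = alpha*A*B + beta*C; all matrices are n x n, column-major
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
}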

Related

Reading Shared/Local Memory Store/Load bank conflicts hardware counters for OpenCL executable under Nvidia

It is possible to use nvprof to read the bank conflict counters for a CUDA executable:
nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe
However, this does not work for code that uses OpenCL rather than CUDA.
Is there any way to extract these counters outside nvprof from OpenCL environment, maybe directly from ptx?
Alternatively is there any way to convert PTX assembly generated from nvidia OpenCL compiler using clGetProgramInfo with CL_PROGRAM_BINARIES to CUDA kernel and run it using cuModuleLoadDataEx and thus be able to use nvprof?
Is there any CPU simulation backend that allows setting parameters such as bank size, etc.?
Additional option:
Use a converter from OpenCL to CUDA code, including features missing from CUDA such as vloadn/vstoren, float16, and various other accessors. #define tricks work only for simple kernels. Is there any tool that provides this?
Is there any way to extract these counters outside nvprof from OpenCL environment, maybe directly from ptx?
No. Nor is there in CUDA, nor in compute shaders in OpenGL, DirectX or Vulkan.
Alternatively is there any way to convert PTX assembly generated from nvidia OpenCL compiler using clGetProgramInfo with CL_PROGRAM_BINARIES to CUDA kernel and run it using cuModuleLoadDataEx and thus be able to use nvprof?
No. OpenCL PTX and CUDA PTX are not the same and can't be used interchangeably.
Is there any simulation CPU backend that allows to set such parameters as bank size etc?
Not that I am aware of.

Is it possible to call cuSPARSE routines from kernel functions?

Is it possible to call cuSPARSE routines from the GPU, that is, from inside a kernel using dynamic parallelism?
libcublas_device.a makes it possible to call cuBLAS routines from the GPU, so I supposed a libcusparse_device.a would exist and allow calling cuSPARSE routines from the GPU. However, it seems that file does not exist. Is this possible? If yes, how? If not, does NVIDIA plan to deliver such a feature in future GPU generations?
Note: I run Linux (CentOS) and use a Tesla K20m GPU (CUDA 5.5, compute capability 3.5).
Quoting the cuSPARSE Library documentation for CUDA 6.5 (Release Candidate version):
The cuSPARSE library contains a set of basic linear algebra subroutines used for
handling sparse matrices. It is implemented on top of the NVIDIA® CUDA™ runtime
(which is part of the CUDA Toolkit) and is designed to be called from C and C++.
Accordingly, as of August 2014, you can't call cuSPARSE routines from kernel functions. The answer to your question is then: NO.
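
For contrast, here is a minimal host-side sketch of a cuSPARSE call from that era (the wrapper function is hypothetical and error checking is omitted); this, rather than any device-side API, is how the library is meant to be used:

#include <cusparse_v2.h>

// y = alpha*A*x + beta*y for a CSR matrix, using the CUDA 5.5/6.5-era API.
// All array pointers are device pointers allocated and filled by the caller;
// the handle comes from a host-side cusparseCreate().
void csrSpmv(cusparseHandle_t handle, int m, int n, int nnz,
             const double *csrVal, const int *csrRowPtr, const int *csrColInd,
             const double *x, double *y)
{
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    const double alpha = 1.0, beta = 0.0;
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
}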

CUDA How to launch a new kernel call in one kernel function?

I am new to CUDA programming. I have a problem to handle: I am trying to use CUDA to process a set of datasets in parallel, and for each dataset some matrix calculations need to be done.
My design is like this:
Launch N threads to handle the datasets, one per dataset, since they are independent of each other and the method for handling them is the same.
Within each thread from step 1, I want to call a new function, and this function should also work like a kernel since the work is a matrix calculation, e.g. launch M threads to handle the matrix calculation in parallel.
Does anyone know whether it is possible or not?
You can launch a kernel from a thread in another kernel if you use CUDA dynamic parallelism and your GPU supports it. GPUs that support CUDA dynamic parallelism currently are of compute capability 3.5.
You can discover the compute capability of your device from the CUDA deviceQuery sample.
You can learn more about how to use CUDA dynamic parallelism in the corresponding section of the CUDA programming guide.
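
A minimal sketch of the design described in the question (kernel names and the per-dataset work are hypothetical; it requires a compute capability 3.5 device and a build along the lines of nvcc -arch=sm_35 -rdc=true app.cu -lcudadevrt):

// Stand-in for the per-dataset matrix work.
__global__ void childKernel(float *data, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
        data[i] *= 2.0f;
}

// One parent thread per dataset; each thread launches its own child grid.
__global__ void parentKernel(float **datasets, const int *sizes, int numDatasets)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < numDatasets) {
        int threads = 256;
        int blocks = (sizes[d] + threads - 1) / threads;
        // Device-side kernel launch (dynamic parallelism)
        childKernel<<<blocks, threads>>>(datasets[d], sizes[d]);
    }
}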

Does CUDA 5 support STL or THRUST inside the device code?

It's mentioned that CUDA 5 allows library calls from kernels.
Does that mean CUDA 5 can use Thrust or the STL inside device code?
CUDA 5 has a device code linker for the first time. It means you can have separate object files of device functions and link against them, rather than having to define them at compilation unit scope. It also adds the ability for kernels to call other kernels (but only on compute capability 3.5 Kepler devices).
None of this means that C++ standard library templates or Thrust can be used inside kernel code.
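
A minimal sketch of what the new device linker does enable (file and function names are hypothetical); note that neither file uses Thrust or the standard library in device code:

// util.cu: a device function defined in its own compilation unit
__device__ float scale(float x) { return 2.0f * x; }

// main.cu: links against the definition above instead of requiring it
// in the same compilation unit
extern __device__ float scale(float x);

__global__ void kernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = scale(v[i]);
}

// Build with separate compilation enabled:
//   nvcc -rdc=true util.cu main.cu -o app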

Mathematica and CUDA

Is it possible that built-in functions in Mathematica (like Minimize[expr,{x1,x2,...}]) will start to work via CUDA after installing the CUDA module for Mathematica?
I don't believe so, no. Mathematica's CUDALink module currently provides only a handful of GPU-accelerated functions: some basic image processing operations, BLAS-style linear algebra calls, Fourier transforms, and simple parallel reductions (argmin, argmax, and summation). There are also tools for integrating user-written CUDA code and for generating CUDA code symbolically. Outside of that, the rest of Mathematica's core functionality remains CPU-only.
You can see full details of current CUDA and OpenCL support here.