Difference between @cuda.jit and @jit(target='gpu') - cuda

I have a question on working with the Python CUDA libraries from Continuum's Accelerate and numba packages. Is using the decorator @jit with target='gpu' the same as @cuda.jit?

No, they are not the same, although the eventual compilation path through PTX into machine code is. The @jit decorator is the general compiler path, which can optionally be steered onto a CUDA device. The @cuda.jit decorator is effectively the low-level Python CUDA kernel dialect which Continuum Analytics have developed. So in @cuda.jit you get support for CUDA built-in variables like threadIdx and memory-space specifiers like __shared__.
If you want to write a CUDA kernel in Python and compile and run it, use @cuda.jit. Otherwise, if you want to accelerate an existing piece of Python code, use @jit with a CUDA target.
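For illustration, here is a minimal sketch contrasting the two decorators (the target='gpu' option for @jit belonged to the old Accelerate-era toolchain and is not available in current Numba, so a plain CPU @jit function is shown for contrast; the kernel itself is invented for this example):

import numpy as np
from numba import cuda, jit, float32

# CUDA kernel dialect: thread indices and shared memory are available,
# much like threadIdx and __shared__ in CUDA C.
@cuda.jit
def scale_kernel(out, arr, factor):
    tmp = cuda.shared.array(128, float32)   # shared memory tile
    i = cuda.grid(1)                        # blockIdx.x * blockDim.x + threadIdx.x
    if i < arr.shape[0]:
        tmp[cuda.threadIdx.x] = arr[i]
        cuda.syncthreads()
        out[i] = tmp[cuda.threadIdx.x] * factor

# General @jit path: an ordinary Python function compiled by numba,
# with no CUDA built-ins visible.
@jit(nopython=True)
def scale_cpu(arr, factor):
    return arr * factor

arr = np.arange(1024, dtype=np.float32)
out = np.zeros_like(arr)
scale_kernel[8, 128](out, arr, 2.0)         # explicit launch configuration
print(out[:4], scale_cpu(arr, 2.0)[:4])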

Related

Reading Shared/Local Memory Store/Load bank conflicts hardware counters for OpenCL executable under Nvidia

It is possible to use nvprof to access/read the bank conflict counters for a CUDA executable:
nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe
However, it does not work for code that uses OpenCL rather than CUDA.
Is there any way to extract these counters outside nvprof from an OpenCL environment, maybe directly from PTX?
Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler (using clGetProgramInfo with CL_PROGRAM_BINARIES) to a CUDA kernel and run it using cuModuleLoadDataEx, and thus be able to use nvprof?
Is there any CPU simulation backend that allows setting parameters such as bank size?
An additional option: use a converter from OpenCL to CUDA code, including features missing from CUDA such as vloadn/vstoren, float16, and various other accessors. #define works only for simple kernels. Is there any tool that provides this?
Is there any way to extract these counters outside nvprof from an OpenCL environment, maybe directly from PTX?
No. Nor is there one in CUDA, nor in compute shaders in OpenGL, DirectX or Vulkan.
Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler using clGetProgramInfo with CL_PROGRAM_BINARIES to a CUDA kernel and run it using cuModuleLoadDataEx, and thus be able to use nvprof?
No. OpenCL PTX and CUDA PTX are not the same and can't be used interchangeably.
Is there any CPU simulation backend that allows setting parameters such as bank size?
Not that I am aware of.

Register array with numba cuda

In a numba CUDA kernel, I know that we can define local and shared arrays. Also, all the variable assignments in a kernel go to registers for a particular thread. Is it possible to declare a register array using numba CUDA? Something similar to the following, which would be used in a CUDA C kernel?
register float accumulators[32];
It is not possible.
The register keyword is only a hint to the compiler, and it has essentially no effect in CUDA C/C++. The device code compiler decides what to put in registers based on its own heuristics for generating fast code, not on instructions from the programmer.
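The closest thing in numba is a small per-thread local array, as in the hedged sketch below; whether the compiler keeps it in registers or spills it to local memory is entirely its own decision (the kernel is invented for illustration):

import numpy as np
from numba import cuda, float32

@cuda.jit
def accumulate(out, data):
    # Per-thread array: numba has no register-array syntax, and the compiler
    # alone decides whether this lives in registers or in local memory.
    acc = cuda.local.array(32, float32)
    for j in range(32):
        acc[j] = 0.0
    i = cuda.grid(1)
    if i < data.shape[0]:
        for j in range(32):
            acc[j] += data[i]
        out[i] = acc[31]

data = np.ones(256, dtype=np.float32)
out = np.zeros_like(data)
accumulate[2, 128](out, data)
print(out[:4])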

how to get max blocks in thrust in cuda 5.5

The Thrust function below can get the maximum number of blocks for a CUDA launch in CUDA 5.0. It is used by the sparse matrix-vector multiplication (SpMV) code in CUSP, and it is a technique for setting up execution for persistent threads. The first line is the header file.
#include <thrust/detail/backend/cuda/arch.h>
thrust::detail::backend::cuda::arch::max_active_blocks(kernel<float,int,VECTORS_PER_BLOCK,THREADS_PER_VECTOR>, THREADS_PER_BLOCK, (size_t)0)
But the function is not supported in CUDA 5.5. Is this technique no longer supported in CUDA 5.5, or should I use some other function instead?
There was never any supported way to perform this computation in any version of Thrust. Headers inside thrust/detail and identifiers inside a detail namespace are part of Thrust's implementation -- they are not public features. Using them will break your code.
That said, there's some standalone code implementing the occupancy calculator in this repository:
https://github.com/jaredhoberock/cuda_launch_config
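For reference, the calculation such an occupancy helper performs is straightforward: the number of resident blocks per multiprocessor is the smallest of the limits imposed by threads, registers, shared memory, and the hardware block cap. The sketch below uses made-up device limits and ignores allocation granularity, so treat it as an illustration rather than a drop-in replacement:

def max_active_blocks(threads_per_block, regs_per_thread, smem_per_block,
                      max_threads_per_sm=2048, regs_per_sm=65536,
                      smem_per_sm=49152, max_blocks_per_sm=16):
    # Each resource imposes its own ceiling on resident blocks per SM.
    limits = [max_threads_per_sm // threads_per_block, max_blocks_per_sm]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)

# Example: 128-thread blocks using 32 registers per thread and 4 KB of shared memory.
print(max_active_blocks(threads_per_block=128, regs_per_thread=32, smem_per_block=4096))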

Is just-in-time (jit) compilation of a CUDA kernel possible?

Does CUDA support JIT compilation of a CUDA kernel?
I know that OpenCL offers this feature.
I have some variables which do not change during runtime (i.e. they only depend on the input file), therefore I would like to define these values with a macro at kernel compile time (i.e. at runtime).
If I define these values manually at compile time, my register usage drops from 53 to 46, which greatly improves performance.
It became available with the NVRTC library in CUDA 7.0. With this library you can compile your CUDA code at runtime.
http://devblogs.nvidia.com/parallelforall/cuda-7-release-candidate-feature-overview/
But what kind of advantages can you gain? In my view, the advantages of dynamic compilation are not that dramatic.
If it is feasible for you to use Python, you can use the excellent pycuda module to compile your kernels at runtime. Combined with a templating engine such as Mako, you will have a very powerful meta-programming environment that will allow you to dynamically tune your kernels for whatever architecture and specific device properties happen to be available to you (obviously some things will be difficult to make fully dynamic and automatic).
You could also consider just maintaining a few distinct versions of your kernel with different parameters, between which your program could choose at runtime based on whatever input you are feeding to it.
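As a concrete illustration of the pycuda suggestion above, a value known only at runtime can be baked in as a macro before the kernel is compiled; the kernel and the N_TERMS value here are invented purely for illustration:

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

n_terms = 7    # pretend this was read from the input file

# The value becomes a compile-time constant, so the compiler can fold it
# instead of holding it in a register.
source = """
#define N_TERMS %(n_terms)d
__global__ void poly(float *out, const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < N_TERMS; ++k)
        acc += x[i] * k;
    out[i] = acc;
}
""" % {"n_terms": n_terms}

mod = SourceModule(source)          # the CUDA compiler is invoked here, at runtime
poly = mod.get_function("poly")

x = np.linspace(0, 1, 256).astype(np.float32)
out = np.empty_like(x)
poly(drv.Out(out), drv.In(x), block=(256, 1, 1), grid=(1, 1))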

Does CUDA use an interpreter or a compiler?

This is a bit of a silly question, but I'm wondering whether CUDA uses an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of byte code and converted to native code by the underlying driver.