How to access sparse tensor core functionality in CUDA?

Tensor cores can be accessed programmatically through the WMMA interface in CUDA (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma and https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/). Recently, with the Ampere generation of cards, NVIDIA announced tensor core support for operations on structured-sparse matrices, as seen here: https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/
The format presented appears to take in pairs of elements along with their positions within four-element segments (2-bit indices). However, looking at the WMMA documentation, I can't find any mention of this, or of how to access those special tensor core operations. The announcement page for this functionality doesn't illuminate it either, AFAICT.
How do I access sparse tensor core functionality in CUDA?

The blog post in your question links to the following paper:
Accelerating Sparse Deep Neural Networks https://arxiv.org/pdf/2104.08378.pdf
In Section 3.2 it says
It is the application’s responsibility to ensure that the first operand is a matrix
stored in the compressed 2:4 format. cuSPARSELt and other libraries provide APIs for
compression and sparse math operations, while, starting in version 8.0, the TensorRT
SDK performs these functions for 2:4 sparse weights automatically. NVIDIA libraries
require that input dimensions of a sparse matrix multiplication be multiples of 16 and
32 for 16-bit (FP16/BF16) and 8b-integer formats, respectively.
Sparse tensor core operations can also be performed manually via the PTX mma.sp instruction, which is explained in Section 9.7.13.5 of the PTX documentation: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-for-sparse-mma

Related

Using atomic arithmetic operations in CUDA Unified Memory multi-GPU or multi-processor

I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have 4 Tesla K20s, if you need this information, and all of them update parts of those arrays; those updates must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on compute capability 6.x or newer devices (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, e.g. -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards -- these are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
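A minimal sketch of the above, assuming a compute capability 6.0+ device and compilation with something like -arch=sm_60 (the kernel and variable names are illustrative, not from the question's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread atomically increments a counter in managed (unified) memory.
// atomicAdd_system makes the update atomic with respect to the CPU and all
// GPUs in the system, not just the launching GPU.
__global__ void incrementAll(int *counter, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd_system(counter, 1);
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    const int n = 1 << 20;
    incrementAll<<<(n + 255) / 256, 256>>>(counter, n);
    cudaDeviceSynchronize();
    printf("%d\n", *counter);  // one increment per thread: 1048576
    cudaFree(counter);
    return 0;
}
```

In a real multi-GPU setup you would launch such kernels on each device (after cudaSetDevice) against the same managed allocation; the system-wide atomic keeps the concurrent updates correct.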

Do CUDA cores have vector instructions?

According to most NVIDIA documentation, CUDA cores are scalar processors and should only execute scalar operations, which get vectorized across 32-thread SIMT warps.
But OpenCL has vector types, for example uchar8. It has the same size as ulong (64 bit), which can be processed by a single scalar core. If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
If there are 1024 work items in a block (work group), and each work item processes a uchar8, will this effectively process 8192 uchars in parallel?
Edit:
My question was if on CUDA architectures specifically (independently of OpenCL), there are some vector instructions available in "scalar" cores. Because if the core is already capable of handling a 32-bit type, it would be reasonable if it can also handle addition of a 32-bit uchar4 for example, especially since vector operations are often used in computer graphics.
CUDA has "built-in" (i.e. predefined) vector types up to a size of 4 for 4-byte quantities (e.g. int4) and up to a size of 2 for 8-byte quantities (e.g. double2). A CUDA thread has a maximum read/write transaction size of 16 bytes, so these particular size choices tend to line up with that maximum.
These are exposed as typical structures, so you can reference for example .x to access just the first element of a vector type.
Unlike OpenCL, CUDA does not provide built-in operations ("overloads") for basic arithmetic e.g. +, -, etc. for element-wise operations on these vector types. There's no particular reason you couldn't provide such overloads yourself. Likewise, if you wanted a uchar8 you could easily provide a structure definition for such, as well as any desired operator overloads. These could probably be implemented just as you would expect for ordinary C++ code.
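For instance, a hand-rolled uchar8 with an elementwise operator+ might look like the following. This is a sketch; CUDA defines no type of this name, so both the struct and the overload are ours:

```cpp
#include <cassert>

// A user-defined uchar8 with an elementwise add. In CUDA code you would mark
// the operator __host__ __device__ so it is usable inside kernels as well.
struct uchar8 {
    unsigned char s[8];
};

uchar8 operator+(const uchar8& a, const uchar8& b) {
    uchar8 r;
    for (int i = 0; i < 8; ++i)
        r.s[i] = (unsigned char)(a.s[i] + b.s[i]);  // element by element, i.e. serial
    return r;
}
```

As the answer notes below, nothing about this overload is vectorized: the compiler sees eight ordinary byte additions.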
Probably an underlying question is, then, what is the difference in implementation between CUDA and OpenCL in this regard? If I operate on a uchar8, e.g.
uchar8 v1 = {...};
uchar8 v2 = {...};
uchar8 r = v1 + v2;
what will the difference be in terms of machine performance (or low-level code generation) between OpenCL and CUDA?
Probably not much, for a CUDA-capable GPU. A CUDA core (i.e. the underlying ALU) does not have direct native support for such an operation on a uchar8, and furthermore, if you write your own C++ compliant overload, you're probably going to use C++ semantics for this which will inherently be serial:
r.x = v1.x + v2.x;
r.y = v1.y + v2.y;
...
So this will decompose into a sequence of operations performed on the CUDA core (or in the appropriate integer unit within the CUDA SM). Since the NVIDIA GPU hardware doesn't provide any direct support for an 8-way uchar add within a single core/clock/instruction, there's really no way OpenCL (as implemented on a NVIDIA GPU) could be much different. At a low level, the underlying machine code is going to be a sequence of operations, not a single instruction.
As an aside, CUDA (or PTX, or CUDA intrinsics) does provide for a limited amount of vector operations within a single core/thread/instruction. Some examples of this are:
a limited set of "native" "video" SIMD instructions. These instructions are per-thread, so if used, they allow for "native" support of up to 4x32 = 128 (8-bit) operands per warp, although the operands must be properly packed into 32-bit registers. You can access these from C++ directly via a set of built-in intrinsics. (A CUDA warp is a set of 32 threads, and is the fundamental unit of lockstep parallel execution and scheduling on a CUDA capable GPU.)
a vector (SIMD) multiply-accumulate operation, which is not directly translatable to a single particular elementwise operation overload, the so-called int8 dp2a and dp4a instructions. int8 here is somewhat misleading. It does not refer to an int8 vector type but rather a packed arrangement of 4 8-bit integer quantities in a single 32-bit word/register. Again, these are accessible via intrinsics.
16-bit floating point is natively supported via half2 vector type in cc 5.3 and higher GPUs, for certain operations.
The new Volta tensor core is something vaguely like a SIMD-per-thread operation, but it operates (warp-wide) on a set of 16x16 input matrices, producing a 16x16 matrix result.
Even with a smart OpenCL compiler that could map certain vector operations into the various operations "natively" supported by the hardware, it would not be complete coverage. There is no operational support for an 8-wide vector (e.g. uchar8) on a single core/thread, in a single instruction, to pick one example. So some serialization would be necessary. In practice, I don't think the OpenCL compiler from NVIDIA is that smart, so my expectation is that you would find such per-thread vector operations fully serialized, if you studied the machine code.
In CUDA, you could provide your own overload for certain operations and vector types, that could be represented approximately in a single instruction. For example a uchar4 add could be performed "natively" with the __vadd4() intrinsic (perhaps included in your implementation of an operator overload.) Likewise, if you are writing your own operator overload, I don't think it would be difficult to perform a uchar8 elementwise vector add using two __vadd4() instructions.
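A sketch of that last idea: treat a uchar8 as two packed 32-bit words and add them with two __vadd4() calls, each of which performs four independent per-byte additions in a single instruction. The struct and kernel names are invented for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical uchar8 stored as two packed 32-bit words (bytes 0..3 in lo,
// bytes 4..7 in hi). __vadd4 adds the four bytes of each word independently.
struct uchar8 { unsigned int lo, hi; };

__global__ void addU8(const uchar8 *a, const uchar8 *b, uchar8 *r, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r[i].lo = __vadd4(a[i].lo, b[i].lo);  // bytes 0..3 in one instruction
        r[i].hi = __vadd4(a[i].hi, b[i].hi);  // bytes 4..7 in one instruction
    }
}

int main() {
    uchar8 ha{0x04030201u, 0x08070605u}, hb{0x01010101u, 0x01010101u}, hr{};
    uchar8 *a, *b, *r;
    cudaMalloc(&a, sizeof(uchar8));
    cudaMalloc(&b, sizeof(uchar8));
    cudaMalloc(&r, sizeof(uchar8));
    cudaMemcpy(a, &ha, sizeof(uchar8), cudaMemcpyHostToDevice);
    cudaMemcpy(b, &hb, sizeof(uchar8), cudaMemcpyHostToDevice);
    addU8<<<1, 1>>>(a, b, r, 1);
    cudaMemcpy(&hr, r, sizeof(uchar8), cudaMemcpyDeviceToHost);
    printf("%08x %08x\n", hr.lo, hr.hi);  // per-byte sums: 05040302 09080706
    cudaFree(a); cudaFree(b); cudaFree(r);
    return 0;
}
```

Wrapping these two intrinsic calls in an operator+ overload would give the uchar8 syntax from the question while emitting only two instructions per add.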
If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
AFAIK it'll always be on a single core (instructions from a single kernel / work item don't cross cores, except for special instructions like barriers), but it may be more than one instruction. This depends on whether your hardware supports operations on uchar8 natively. If it does not, then the uchar8 will be broken up into as many pieces as required, and each piece will be processed with a separate instruction.
OpenCL is very "generic" in the sense that it supports many different vector type/size combos, but real-world hardware usually only implements some vector type/size combinations. You can query OpenCL devices for "preferred vector size" which should tell you what's the most efficient for that hardware.

CUBLAS library: find max of actual values rather than absolute values

The CUBLAS library of NVIDIA CUDA allows finding the element/index with maximum absolute value (cublasIsamax). Is it possible to find the element/index with the maximum actual value somehow, using the CUBLAS reduction functions?
[I am using CUBLAS version 3.2.]
Edit
Constraint: I cannot alter the state of the production server in any way. This means that I cannot use thrust/cudpp, and that I'm stuck with using an older version of CUBLAS.
I am not sure what "reduction functions" you are referring to.
CUBLAS is basically just a "like-for-like" implementation of BLAS for CUDA devices. It only provides the standard Level 1, 2 and 3 BLAS functions, plus exactly three extensions - geam (scaled matrix addition/transposition), dgmm (diagonalised matrix-matrix dot product) and getrfBatched (batched LU factorisation for many small matrices). None of those functions will find the signed maximum value of a supplied vector or matrix.
NVIDIA ships cudpp and thrust, either of which is probably better suited to this sort of operation. Also, CUBLAS 3.2 is two and a half years old.
As a final comment, I would strongly recommend using either the CUBLAS 4.x or CUBLAS 5.x releases. The API and performance of the code have improved considerably, especially for newer hardware.
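Given the constraint that thrust/cudpp cannot be used, a hand-written reduction is not much code. A simplified (non-optimal) sketch of a signed argmax, where each block reduces its chunk in shared memory and the host finishes the small per-block arrays (all names illustrative, fixed at 256 threads per block):

```cuda
#include <cfloat>
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Per-block signed max: compares actual values (no fabs), carrying the index
// along with the value through a standard shared-memory tree reduction.
__global__ void blockArgMax(const float *x, int n, float *bmax, int *bidx) {
    __shared__ float sval[256];
    __shared__ int   sidx[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sval[tid] = (i < n) ? x[i] : -FLT_MAX;  // pad with -FLT_MAX so it never wins
    sidx[tid] = i;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] > sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) { bmax[blockIdx.x] = sval[0]; bidx[blockIdx.x] = sidx[0]; }
}

int main() {
    const int n = 1000, threads = 256, blocks = (n + threads - 1) / threads;
    float hx[n];
    for (int i = 0; i < n; ++i) hx[i] = -fabsf((float)(i - 700));  // unique max at 700
    float *dx, *dbmax;
    int *dbidx;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dbmax, blocks * sizeof(float));
    cudaMalloc(&dbidx, blocks * sizeof(int));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    blockArgMax<<<blocks, threads>>>(dx, n, dbmax, dbidx);
    float hbmax[blocks];
    int hbidx[blocks];
    cudaMemcpy(hbmax, dbmax, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hbidx, dbidx, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    int best = 0;  // finish the handful of per-block results on the host
    for (int b = 1; b < blocks; ++b)
        if (hbmax[b] > hbmax[best]) best = b;
    printf("max %f at index %d\n", hbmax[best], hbidx[best]);
    cudaFree(dx); cudaFree(dbmax); cudaFree(dbidx);
    return 0;
}
```

This needs no libraries beyond the CUDA runtime, so it should build even against an old toolkit, though a production version would want a second-stage kernel instead of the host loop for large block counts.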

Fermi architecture possible solution to my comparative study?

I am working on a comparative study in which I have to make a comparison of the serial and parallel versions of an algorithm (NSGA-II algorithm to be precise download link here). NSGA-II is a heuristic optimization method and hence depends on the initial random population generated. If the initial populations generated using the CPU and the GPU are different, then I can not make an impartial speedup study.
I possess an NVIDIA Tesla C1060 card, which has compute capability 1.3. As per this answer and this NVIDIA document, we can't expect an sm_13 device to always yield IEEE-754-compliant float (single-precision) values. In other words, this means that on my current device I can't conduct an impartial speedup study of the CUDA program against its serial counterpart.
My question is: Would switching to Fermi architecture solve the problem?
Floating-point operations will yield different results on different architectures, regardless of whether they support IEEE-754 or not, since floating-point arithmetic is not associative. Even switching compilers on x86 will typically give different results. This whitepaper gives some excellent explanations.
Having said that, your real issue is that you have a data dependent algorithm where the operations are dependent on the random numbers you generate. So if you generate the same numbers on the CPU and the GPU then both runs will be following the same paths. Consider using cuRAND, which can generate the same numbers on both the CPU and GPU.
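A sketch of that approach with the cuRAND host API: create a device generator and a host (CPU) generator of the same type with the same seed, so both should produce the same sequence (this assumes the default generator ordering; verify against the cuRAND documentation for your version). Link with -lcurand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand.h>

// The same pseudorandom generator (MRG32k3a here), seeded identically on the
// host and the device, lets the CPU and GPU versions of the algorithm start
// from the same random initial population.
int main() {
    const int n = 8;
    float host_out[n], dev_copy[n];
    float *dev_out;
    cudaMalloc(&dev_out, n * sizeof(float));

    curandGenerator_t gpuGen, cpuGen;
    curandCreateGenerator(&gpuGen, CURAND_RNG_PSEUDO_MRG32K3A);      // runs on GPU
    curandCreateGeneratorHost(&cpuGen, CURAND_RNG_PSEUDO_MRG32K3A);  // runs on CPU
    curandSetPseudoRandomGeneratorSeed(gpuGen, 1234ULL);
    curandSetPseudoRandomGeneratorSeed(cpuGen, 1234ULL);

    curandGenerateUniform(gpuGen, dev_out, n);
    curandGenerateUniform(cpuGen, host_out, n);
    cudaMemcpy(dev_copy, dev_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("%f %f\n", dev_copy[i], host_out[i]);  // pairs should match

    curandDestroyGenerator(gpuGen);
    curandDestroyGenerator(cpuGen);
    cudaFree(dev_out);
    return 0;
}
```

With matching sequences on both sides, the initial populations are identical, so any timing difference reflects the implementations rather than different random starting points.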

How to use CUDA for multiplying sparse matrices (over the GF(2) field) for the block Lanczos algorithm?

I have an academic project which relates to the block Lanczos algorithm (Montgomery's version; download link here). I am having trouble designing the implementation of block Lanczos. Can anyone suggest what approach I should take to multiply the sparse matrices that arise in this algorithm? They can be large, around 1M x 1M. I have a GT 330M CUDA-enabled GPU.
Have you looked at CUSPARSE (included with the CUDA Toolkit) and/or CUSP (open source)?