Boolean operations on CUDA - cuda

My application needs to perform bit-vector operations like OR and XOR on bit-vectors.
e.g suppose array A = 000100101 (a.k.a bit vector)
B = 100101010
A . B = 100101111
Does CUDA support boolean variables? e.g. bool as in C. If yes, how is it stored and operated on? Does it also support bit-vector operations?. I couldn't find the answer in the CUDA Programming Guide.

CUDA supports the standard C++ bool, but in C++ it is only a type guaranteed to support two states, so bit operations shouldn't be used on it. In CUDA, as in C++, you get the standard complement of bitwise operators for integral types (and, or, xor, complement and left and right shift). Ideally you should aim to use a 32 bit type (or a packed 32 bit CUDA vector type) for memory throughput reasons.

Related

Extend the CUDA.#atomic enum to a custom struct

i was wondering weather it is possible to extend the CUDA.#atomic operation to a custom type.
Here is an example of what i am trying to do:
using CUDA
struct Dual
x
y
end
cu0 = CuArray([Dual(1, 2), Dual(2,3)])
cu1 = CuArray([Dual(1, 2), Dual(2,3)])
indexes = CuArray([1, 1])
function my_kernel(dst, src, idx)
index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
#inbounds if index <= length(idx)
CUDA.#atomic dst[idx[index]] = dst[idx[index]] + src[index]
end
return nothing
end
#cuda threads = 100 my_kernel(cu0, cu1, indexes)
The Problem of this code is that the CUDA.#atomic call only supports basic types like
Int, Float or Real.
I need it to work with my own struct.
Would be nice if someone has an idea how this could be possible.
The underlying PTX instruction set for CUDA provides a subset of atomic store, exchange, add/subtract,increment/decrement, min/max, and compare-and-set operations for global and shared memory locations (not all architectures support all operations with all POD types, and there is evidence that not all operations are implemented in hardware on all architectures).
What all these instructions have in common is that they execute only one operation atomically. I am completely unfamiliar with Julia, but if
CUDA.#atomic dst[idx[index]] = dst[idx[index]] + src[index]
means "atomically add src[].x and src[].y to dst[].x and dst[].y" then that isn't possible because that implies two additions on separate memory locations in one atomic operation. If the members of your structure could be packed into a compatible type (a 32 bit or 64 bit unsigned integer, for example), you could perform atomic store, exchange or compare-and-set in CUDA. But not arithmetic.
If you consult this section of the programming guide, you can see an example of a brute force double precision add implementation using compare-and-set in a tight loop. If your structure can be packed into something which can be manipulated with compare-and-set, then it might be possible to roll your own atomic add for a custom type (limited to a maximum of 64 bits).
How you might approach that in Julia is definitely an exercise left to the reader.

Elementwise vector Multiplication in cublas [duplicate]

I need the compute the element wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers ?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old school fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and storage stand point.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there is probably a couple of casts you need to make to compile that, but you get the idea). But that effectively requires compilation of CUDA code within your project, and apparently you don't want that.

CUDA: Is there any api for element wise vector product in cublas? [duplicate]

I need the compute the element wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers ?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old school fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and storage stand point.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there is probably a couple of casts you need to make to compile that, but you get the idea). But that effectively requires compilation of CUDA code within your project, and apparently you don't want that.

How to perform Hadamard product with CUBLAS on complex numbers?

I need the compute the element wise multiplication of two vectors (Hadamard product) of complex numbers with NVidia CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers ?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old school fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and storage stand point.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there is probably a couple of casts you need to make to compile that, but you get the idea). But that effectively requires compilation of CUDA code within your project, and apparently you don't want that.

CUDA cublas<t>gbmv understanding

I recently wanted to use a simple CUDA matrix-vector multiplication. I found a proper function in cublas library: cublas<<>>gbmv. Here is the official documentation
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't provide tons of useful information on this question, mostly references to different version of documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this very function, couldn't you explain me how do I use it?
Maybe cublas library is somewhat extraordinary and everyone uses something more popular, better documented and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of architectural limitations of the day (this was 1979, and the API was defined using names of 8 characters or less to ensure it could widely compile), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO.. where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general - eg, just a dense matrix you can't say anything else about; GB for a general banded matrix, SY for symmetric matrices, etc), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication
y = a op(A)x + b y
where A is a m x n matrix stored in column-major format, x and y
are vectors, and and are scalars.
where op(A) can be A, AT, or AH. So if you just want y = Ax, as is the common case, then a = 1, b = 0. and transa == CUBLAS_OP_N.
incx is the stride between different elements in x; there's lots of situations where this would come in handy, but if x is just a simple 1d array containing the vector, then the stride would be 1.
And that's about all you need for SGEMV.