Negative Speed Gain Using Numba Vectorize target='cuda'

I am trying to test the effectiveness of the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm using the code snippet provided in CUDAcast #10, available here and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the demo in the CUDAcast, the demonstrator gets a 100x speedup by sending the large array computation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
@vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' it is 0.22 seconds. I'm using a DELL Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler tool) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is this reasonable given my hardware and code?

This question is also interesting to me.
I've tried your code and got similar results.
To investigate this, I wrote a CUDA kernel using @cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also include the time spent copying arrays from host to device and from device back to host. In this case the GPU function is slower than the CPU one.
For the case above:
CPU - 0.0033 s;
GPU - 0.0096 s;
Vectorize (target='cuda') - 0.15 s (on my PC).
If the copying time is not counted:
GPU - 0.000245 s
So, what I have learned: (1) Copying from host to device and from device to host is time-consuming; this is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down the calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize the memory copying).
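To make the effect of point (1) concrete, here is a minimal sketch (the array size and the explicit cuda.synchronize() call are illustrative) of timing the CUDA ufunc with the arrays already staged on the device, so the transfers fall outside the measured region:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

@vectorize(["float32(float32, float32)"], target='cuda')
def VectorAdd(a, b):
    return a + b

N = 16*50000
A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)

# Pay the host-to-device copies once, outside the timed region
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)

start = timer()
d_C = VectorAdd(d_A, d_B)  # device arrays in, device array out
cuda.synchronize()         # wait for the kernel to finish before stopping the timer
kernel_time = timer() - start
print("kernel only: %f seconds" % kernel_time)

C = d_C.copy_to_host()     # copy back only when the result is actually needed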
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in that case the Python program's execution time is comparable with a C program's and provides about a 100x speedup. Fortunately, in that case you can do many iterations without data exchange between host and device.
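To illustrate that pattern, here is a minimal sketch (a hypothetical 1D explicit scheme, not the actual program from that test) in which the data stays resident on the device across all time steps and is copied back only once:

import numpy as np
from numba import cuda

@cuda.jit("void(float32[:], float32[:], float32)")
def heat_step(u, u_new, r):
    i = cuda.grid(1)
    if 0 < i < u.size - 1:
        u_new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])

n = 1 << 20
steps = 1000
r = np.float32(0.25)
u = np.zeros(n, dtype=np.float32)
u[n // 2] = 1.0  # initial heat spike

d_u = cuda.to_device(u)      # copy to the device once
d_u_new = cuda.to_device(u)  # second buffer, so the boundary values stay initialized

threads = 128
blocks = (n + threads - 1) // threads
for _ in range(steps):       # no host<->device traffic inside the loop
    heat_step[blocks, threads](d_u, d_u_new, r)
    d_u, d_u_new = d_u_new, d_u  # swap buffers on the device

result = d_u.copy_to_host()  # single copy back at the end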
UPD. Software & hardware used: Win7 64-bit, CPU: Intel Core2 Quad 3 GHz, GPU: NVIDIA GeForce GTX 580.

Related

MKL FFTW is Slower than FFTPACK for small data size

I wrote a matrix-computation C++ library 20 years ago and I would like to boost its performance using the Intel MKL library.
For complex-valued vectors/matrices, my library uses two split arrays: one for the real part and one for the imaginary part.
Here are the timing results:
N=65536, fftw time = 0.005(s), fftpack time = 0.001(s)
N=100000, fftw time = 0.005(s), fftpack time = 0.003(s)
N=131072, fftw time = 0.006(s), fftpack time = 0.004(s)
N=250000, fftw time = 0.013(s), fftpack time = 0.007(s)
N=262144, fftw time = 0.012(s), fftpack time = 0.008(s)
N=524288, fftw time = 0.022(s), fftpack time = 0.018(s)
N=750000, fftw time = 0.037(s), fftpack time = 0.025(s)
N=1048576, fftw time = 0.063(s), fftpack time = 0.059(s)
N=1500000, fftw time = 0.114(s), fftpack time = 0.079(s)
N=2097152, fftw time = 0.126(s), fftpack time = 0.146(s)
N=4194304, fftw time = 0.241(s), fftpack time = 0.35(s)
N=8388608, fftw time = 0.433(s), fftpack time = 0.788(s)
For vectors shorter than about 1,500,000 double values, fftpack is faster than fftw.
Here is the code I use:
Matrix X = randn(M,1); // input vector
// start timer
Matrix Y = MyFFTW(X);
// measure time

// function to compute the FFT
Matrix MyFFTW(Matrix X)
{
    int M = X.rows();
    int N = X.cols();
    Matrix Y(T_COMPLEX, M, N); // complex output to store the FFT results
    // Input data could also be a matrix
    double* in_data = (double*)fftw_malloc(sizeof(double) * M);
    fftw_complex* out_data = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
    fftw_plan fftplan = fftw_plan_dft_r2c_1d(M, in_data, out_data, FFTW_ESTIMATE);
    // one fftplan is used for all the matrix columns
    for (int i = 1; i <= N; i++)
    {
        // copy column number i to in_data used by the fftplan; array indexing is 1-based like Matlab
        memcpy(in_data, X.pr(1,i), M * sizeof(double));
        fftw_execute(fftplan);
        // split out_data into real and imaginary parts
        double* pr = Y.pr(1,i), * pi = Y.pi(1,i);
        int k = (M - 1) / 2, j;
        for (j = 0; j <= k; j++)
        {
            *pr++ = out_data[j][0];
            *pi++ = out_data[j][1];
        }
        if (M % 2 == 0)
        {
            *pr++ = out_data[M/2][0];
            *pi++ = out_data[M/2][1];
        }
        for (j = k; j >= 1; j--)
        {
            *pr++ = out_data[j][0];
            *pi++ = out_data[j][1];
        }
    }
    fftw_destroy_plan(fftplan);
    fftw_free(in_data);
    fftw_free(out_data);
    return Y;
}
Results are obtained on an Intel Core i7 @ 3.2 GHz, using Visual Studio 2019 as the compiler and the latest Intel MKL library.
Compiler flags are:
/fp:fast /DWIN32 /O2 /Ot /Oi /Oy /arch:AVX2 /openmp /MD
Linker libs are:
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Is there a better way to make fftw faster for vectors of small size?
Update:
I tested against Matlab, which uses the MKL fftw for its fft computation:
N=65536, matlab fft time = 0.071233(s)
N=100000, matlab fft time = 0.011437(s)
N=131072, matlab fft time = 0.0074411(s)
N=250000, matlab fft time = 0.015349(s)
N=262144, matlab fft time = 0.0082545(s)
N=524288, matlab fft time = 0.011395(s)
N=750000, matlab fft time = 0.022364(s)
N=1048576, matlab fft time = 0.019683(s)
N=1500000, matlab fft time = 0.033493(s)
N=2097152, matlab fft time = 0.035345(s)
N=4194304, matlab fft time = 0.069539(s)
N=8388608, matlab fft time = 0.1387(s)
Except for the first call to fft with N=65536, Matlab (64-bit) is faster than both my function (Win32) using fftpack (for N > 500000) and the version using MKL fftw.
Thanks
Regarding fftw: AFAIK, there are no specific performance tips from MKL that would help accelerate the small cases. Actually, the overhead of using fftw from MKL is pretty negligible.
Regarding your benchmark: I see you also measure the allocation/deallocation, the creation of the fftw plan, and the memcpy operations, but the only routine in this benchmark that MKL optimizes is fftw_execute.
That could be the problem with this pipeline.
You could enable MKL_VERBOSE mode (for example, by setting the MKL_VERBOSE=1 environment variable) to check the execution time of fftw_execute itself...

"dimension too large" error when broadcasting to sparse matrix in octave

32-bit Octave has a limit on the maximum number of elements in an array. I have recompiled from source (following the script at https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04 ), and now have 64-bit indexing.
Nevertheless, when I attempt to perform elementwise multiplication using a broadcast function, I get the error: "out of memory or dimension too large for Octave's index type".
Is this a bug, or an undocumented feature? If it's a bug, does anyone have a reasonably efficient workaround?
Minimal code to reproduce the problem:
function indexerror();
    % both of these are formed without error
    % a = zeros (2^32, 1, 'int8');
    % b = zeros (1024*1024*1024*3, 1, 'int8');
    % sizemax % returns 9223372036854775806

    nnz = 1000 % number of non-zero elements
    rowmax = 250000
    colmax = 100000

    irow = zeros(1,nnz);
    icol = zeros(1,nnz);
    for ind = 1:nnz
        irow(ind) = round(rowmax/nnz*ind);
        icol(ind) = round(colmax/nnz*ind);
    end
    sparseMat = sparse(irow,icol,1,rowmax,colmax);

    % column vector to be broadcast
    broad = 1:rowmax;
    broad = broad(:);

    % this gives "dimension too large" error
    toobig = bsxfun(@times,sparseMat,broad);

    % so does this
    toobig2 = sparse(repmat(broad,1,size(sparseMat,2)));
    mult = sparse( sparseMat .* toobig2 ); % never made it this far
end
EDIT:
Well, I have an inefficient workaround. It's slower than using bsxfun by a factor of 3 or so (depending on the details), but it's better than having to sort through the error in the libraries. Hope someone finds this useful some day.
% loop over rows, instead of using bsxfun
mult_loop = sparse([],[],[],rowmax,colmax);
for ind = 1:length(broad);
    mult_loop(ind,:) = broad(ind) * sparseMat(ind,:);
end
The unfortunate answer is that yes, this is a bug. Apparently bsxfun and repmat are returning full matrices rather than sparse ones. A bug report has been filed here:
http://savannah.gnu.org/bugs/index.php?47175

Segmentation Fault in Pycuda using NVIDIA's cuSolver Library

I'm trying to make a PyCUDA wrapper, inspired by the scikits-cuda library, for some operations provided in NVIDIA's new cuSolver library. First I need to perform an LU factorization through cusolverDnSgetrf(), but before that I need the 'Workspace' argument; the routine that cuSolver provides to obtain it is cusolverDnSgetrf_bufferSize(). However, when I use it, it just crashes and returns a segmentation fault. What am I doing wrong?
Note: I already have this operation working with scikits-cuda, but the cuSolver library uses this kind of argument a lot, and I want to compare the usage of scikits-cuda with that of my implementation on top of the new library.
import numpy as np
import ctypes
import ctypes.util
import pycuda.autoinit  # creates the CUDA context needed by the gpuarray calls
from pycuda import gpuarray

libcusolver = ctypes.cdll.LoadLibrary('libcusolver.so')

class _types:
    handle = ctypes.c_void_p

libcusolver.cusolverDnCreate.restype = int
libcusolver.cusolverDnCreate.argtypes = [_types.handle]

def cusolverCreate():
    handle = _types.handle()
    libcusolver.cusolverDnCreate(ctypes.byref(handle))
    return handle.value

libcusolver.cusolverDnDestroy.restype = int
libcusolver.cusolverDnDestroy.argtypes = [_types.handle]

def cusolverDestroy(handle):
    libcusolver.cusolverDnDestroy(handle)

libcusolver.cusolverDnSgetrf_bufferSize.restype = int
libcusolver.cusolverDnSgetrf_bufferSize.argtypes = [_types.handle,
                                                    ctypes.c_int,
                                                    ctypes.c_int,
                                                    ctypes.c_void_p,
                                                    ctypes.c_int,
                                                    ctypes.c_void_p]

def cusolverLUFactorization(handle, matrix):
    m, n = matrix.shape
    mtx_gpu = gpuarray.to_gpu(matrix.astype('float32'))
    work = gpuarray.zeros(1, np.float32)
    status = libcusolver.cusolverDnSgetrf_bufferSize(
        handle, m, n,
        int(mtx_gpu.gpudata),
        n, int(work.gpudata))
    print status

x = np.asarray(np.random.rand(3, 3), np.float32)
handle_solver = cusolverCreate()
cusolverLUFactorization(handle_solver, x)
cusolverDestroy(handle_solver)
The last parameter of cusolverDnSgetrf_bufferSize should be a regular host pointer, not a GPU memory pointer. Try modifying the cusolverLUFactorization() function as follows:
def cusolverLUFactorization(handle, matrix):
    m, n = matrix.shape
    mtx_gpu = gpuarray.to_gpu(matrix.astype('float32'))
    work = ctypes.c_int()
    status = libcusolver.cusolverDnSgetrf_bufferSize(
        handle, m, n,
        int(mtx_gpu.gpudata),
        n, ctypes.pointer(work))
    print status
    print work.value
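For completeness, here is a sketch of how the factorization itself could then be called with the queried buffer size. The argument list follows the cusolverDnSgetrf documentation (workspace, pivot array, and info flag all live in device memory); the helper buffers here are illustrative and error checking is omitted:

libcusolver.cusolverDnSgetrf.restype = int
libcusolver.cusolverDnSgetrf.argtypes = [_types.handle,
                                         ctypes.c_int,     # m
                                         ctypes.c_int,     # n
                                         ctypes.c_void_p,  # A (device pointer)
                                         ctypes.c_int,     # lda
                                         ctypes.c_void_p,  # workspace (device pointer)
                                         ctypes.c_void_p,  # devIpiv (device pointer)
                                         ctypes.c_void_p]  # devInfo (device pointer)

# ... after cusolverDnSgetrf_bufferSize has filled 'work':
workspace = gpuarray.zeros(work.value, np.float32)  # device scratch buffer
dev_ipiv = gpuarray.zeros(min(m, n), np.int32)      # pivot indices
dev_info = gpuarray.zeros(1, np.int32)               # status flag
status = libcusolver.cusolverDnSgetrf(
    handle, m, n,
    int(mtx_gpu.gpudata), m,  # lda; the test matrix is square, so m works here
    int(workspace.gpudata),
    int(dev_ipiv.gpudata),
    int(dev_info.gpudata))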

how to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate){
    #pragma acc data copyin(a[0: N * N], b[0: N * N]) copyout(r[0: N * N]) if(accelerate)
    {
        #pragma acc region if(accelerate)
        {
            #pragma acc loop independent vector(32)
            for (int j = 0; j < N; j++)
            {
                #pragma acc loop independent vector(32)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++) {
                        sum += a[i + k * N] * b[k + j * N];
                    }
                    r[i + j * N] = sum;
                }
            }
        }
    }
}
This results in thread blocks of size 32x32 threads and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: Geforce GT650 M, 64-bit Linux
Data sz : 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions on how to get better performance from OpenACC?
Perhaps my choice of directives is not right?
You're getting right at a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.

Optimizing CUDA FDTD Fortran

I am trying to optimize this FDTD code with CUDA Fortran. I have three sets of 3-D cube matrices for input, output, and constants.
attributes(global) subroutine kernel_h(k,num_cells_x,num_cells_y,num_cells_z,Hx,Hy,Hz,Ex,Ey,Ez,Cbdx,Cbdy,Cbdz)
    implicit none
    integer :: idx, idy
    integer, value :: k, num_cells_x, num_cells_y, num_cells_z
    real(kind=8), intent(in),    dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Ex, Ey, Ez
    real(kind=8), intent(inout), dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Hx, Hy, Hz
    real(kind=8), intent(in), constant, dimension(1:num_cells_x,1:num_cells_y,1:num_cells_z) :: Cbdx, Cbdy, Cbdz

    idx = threadIdx%x + ((blockIdx%x-1) * blockDim%x)
    idy = threadIdx%y + ((blockIdx%y-1) * blockDim%y)

    do while (idx < num_cells_x)
        Hz(idx,idy,k) = Hz(idx,idy,k) + ((Ex(idx,idy+1,k)-Ex(idx,idy,k))*Cbdy(idx,idy,k) + (Ey(idx,idy,k)-Ey(idx+1,idy,k))*Cbdx(idx,idy,k))
        Hx(idx,idy,k) = Hx(idx,idy,k) + ((Ey(idx,idy,k+1)-Ey(idx,idy,k))*Cbdz(idx,idy,k) + (Ez(idx,idy,k)-Ez(idx,idy+1,k))*Cbdy(idx,idy,k))
        Hy(idx,idy,k) = Hy(idx,idy,k) + ((Ez(idx+1,idy,k)-Ez(idx,idy,k))*Cbdx(idx,idy,k) + (Ex(idx,idy,k)-Ex(idx,idy,k+1))*Cbdz(idx,idy,k))
        idx = idx + (blockDim%x * gridDim%x)
        idy = idy + (blockDim%y * gridDim%y)
    end do
end subroutine kernel_h
and my kernel launch is:
bdim = dim3(16,16,1)
gdim = dim3((num_cells_x+(bdim%x-1))/bdim%x, (num_cells_y+(bdim%y-1))/bdim%y, 1)
do k = 1, num_cells_z
    call kernel_h<<<gdim,bdim>>>(k,num_cells_x,num_cells_y,num_cells_z,Hx_d,Hy_d,Hz_d,Ex_d,Ey_d,Ez_d,Cbdx_d,Cbdy_d,Cbdz_d)
end do
My questions are: why can't I load more than a 100x100x100 matrix? If I try, I get a kernel launch failure error. And can I improve my code's performance? I think it could be written in a better way.
I would guess that you are accessing out of bounds.
Consider a 10x10x10 volume (x,y,z). In that case you will launch a single block of 16x16 threads. These threads will access a 17x17 slice (since stencil radius is 1) which is clearly going to end up out of bounds. You would need to disable those threads that will access out of bounds and also disable those threads that will reach beyond the boundary to apply their stencil.
Consider looking at the FDTD3D sample in the CUDA SDK. Granted it's in C but it illustrates how to handle this problem and it also shows how to use shared memory to have a much more efficient implementation.