cuBLAS dsyrk slower than dgemm - cuda

I am trying to compute C = A*A' on the GPU using cuBLAS and am finding that the rank-k update cublasDsyrk is running about 5x slower than the general matrix-matrix multiplication routine cublasDgemm.
This is surprising to me; I thought syrk would be faster since it is a more specialized piece of code. Is that an unreasonable expectation? Am I doing this wrong?
Timing the code
Ultimately I'm writing CUDA code to be compiled into MEX files for MATLAB, so apologies for not providing a complete working example (there would be a lot of extraneous code for wrangling with the MATLAB objects).
I know this is probably not the best way, but I'm using clock() to time how long the code takes to run:
// Start of main function
clock_t tic = clock();
clock_t toc;

/* ---- snip ---- */

cudaDeviceSynchronize();
toc = clock();
printf("%8d (%7.3f ms) Allocated memory on GPU for output matrix\n",
       (int)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);

// Compute the upper triangle of C = alpha*A*A' + beta*C
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   M, N, &alpha, A, M, &beta, C, M);

toc = clock();
printf("%8d (%7.3f ms) cublasDsyrk launched\n",
       (int)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);

cudaDeviceSynchronize();
toc = clock();
printf("%8d (%7.3f ms) cublasDsyrk completed\n",
       (int)(toc-tic), 1000*(double)(toc-tic)/CLOCKS_PER_SEC);

/* ----- snip ----- */
Runtimes
The output, running on a [12 x 500,000] random matrix (column-major storage):
911 ( 0.911 ms) Loaded inputs, initialized cuBLAS context
1111 ( 1.111 ms) Allocated memory on GPU for output matrix
1352 ( 1.352 ms) cublasDsyrk launched
85269 ( 85.269 ms) cublasDsyrk completed
85374 ( 85.374 ms) Launched fillLowerTriangle kernel
85399 ( 85.399 ms) kernel completed
85721 ( 85.721 ms) Finished and cleaned up
After replacing the syrk call with
stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, M, N,
                   &alpha, A, M, A, M, &beta, C, M);
the whole thing runs way faster:
664 ( 0.664 ms) Loaded inputs, initialized cuBLAS context
796 ( 0.796 ms) Allocated memory on GPU for output matrix
941 ( 0.941 ms) cublasDgemm launched
16787 ( 16.787 ms) cublasDgemm completed
16837 ( 16.837 ms) Launched fillLowerTriangle kernel
16859 ( 16.859 ms) kernel completed
17263 ( 17.263 ms) Finished and cleaned up
I tried a few other matrix sizes; interestingly, the speed difference is most pronounced when the matrix has few rows. At 100 rows, gemm is only 2x faster, and at 1000 rows it is slightly slower (which is what I would have expected all along).
Other details
I'm using CUDA Toolkit 7.5 and the GPU device is an NVIDIA GRID K520 (Kepler, compute capability 3.0). I'm running on an Amazon EC2 g2.2xlarge instance.

[n x 500,000] for n = 12, 100, 1000 are all very wide matrices. In these corner cases, gemm() and syrk() may not be able to reach their peak performance, at which syrk() would be nearly twice as fast as gemm() (since the result matrix is symmetric, you can save half of the computation).
Another consideration is that CUDA gemm()/syrk() usually divides the matrix into fixed-size sub-matrices as the basic computing unit to achieve high performance. The sub-matrix can be as large as 32x64 for dgemm(), as shown in the following link:
http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
The performance usually drops a lot if your size (12 or 100) is neither much larger than the sub-matrix nor a multiple of it.
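For a fair side-by-side comparison, here is a minimal sketch (my addition, not code from the post) that times the two routines with CUDA events instead of clock(). The helper names compareSyrkGemm and timeMs are made up for illustration, A and C are assumed to already be device buffers matching the question's dimensions, and error checking is omitted.
// Hypothetical timing harness: compare cublasDsyrk vs cublasDgemm for C = A*A'.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

static float timeMs(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

void compareSyrkGemm(cublasHandle_t handle, const double *A, double *C,
                     int M, int N)
{
    const double alpha = 1.0, beta = 0.0;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Rank-k update: only the upper triangle of C is computed.
    cudaEventRecord(start, 0);
    cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                M, N, &alpha, A, M, &beta, C, M);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    printf("dsyrk: %7.3f ms\n", timeMs(start, stop));

    // General multiply: computes the full (symmetric) C.
    cudaEventRecord(start, 0);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, M, N,
                &alpha, A, M, A, M, &beta, C, M);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    printf("dgemm: %7.3f ms\n", timeMs(start, stop));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
Measuring on the GPU with events removes the host-side launch latency that clock()-based timing around cudaDeviceSynchronize() includes.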

Related

Multiple global functions in the same CUDA source file

Can I write two separate __global__ functions that compute different things in the same CUDA source file? Something like this:
__global__ void Ker1(mpz_t *d, mpz_t *c, mpz_t e, mpz_t n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(d[i], c[i], e, n);
}

__global__ void Ker2(mpz_t *d, mpz_t *c, mpz_t dexp, mpz_t n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(c[i], d[i], dexp, n);
}

int main()
{
    /* ... */
    cudaMemcpy(decode_device, decode_buffer, memSize, cudaMemcpyHostToDevice);
    Ker1<<<dimGrid, dimBlock>>>(d_device, c_device, e, n);
    Ker2<<<dimGrid, dimBlock>>>(c_device, d_device, d, n);
    cudaMemcpy(decode_buffer, decode_device, memSize, cudaMemcpyDeviceToHost);
}
If not, how would you do something like this?
It is quite unclear what you're asking, but after three readings I assume it is: "Can I write several kernels in the same source file?".
You can put as many kernel launches as you want in your main function.
An example, from page 9:
...
cudaMemcpy( dev1, host1, size, H2D ) ;
kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
cudaMemcpy( host4, dev4, size, D2H ) ;
...
From: the Streams and Concurrency webinar.
The calls are asynchronous by default, so as soon as the kernel is launched on the GPU, the CPU proceeds with the instructions that follow.
To force synchronization you have to use cudaDeviceSynchronize(), or any memory transfer via cudaMemcpy, which forces synchronization by itself.
Source: the CUDA FAQ.
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaThreadSynchronize()" API call should be used when measuring
performance to ensure that all device operations have completed before
stopping the timer.
CUDA functions that perform memory copies and that control graphics
interoperability are synchronous, and implicitly wait for all kernels
to complete.
By the way, if you don't need to synchronize between kernels, they can be executed concurrently if your GPU has the required compute capability (CC):
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
(again quoted from the CUDA FAQ).
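To make the answer concrete, here is a minimal, self-contained sketch (my own illustration, not code from the question or the webinar) with two __global__ functions defined in one .cu file and launched back to back from main(); the kernel names and data are invented for the example.
#include <cstdio>
#include <cuda_runtime.h>

// Two independent kernels defined in the same source file.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void doubleIt(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

int main()
{
    const int n = 1024;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = 0;
    cudaMalloc((void**)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Both launches go into the default stream, so they execute in order.
    addOne<<<(n + 255) / 256, 256>>>(d, n);
    doubleIt<<<(n + 255) / 256, 256>>>(d, n);

    // The blocking copy back waits for both kernels to finish.
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[10] = %d\n", h[10]); // expect (10 + 1) * 2 = 22

    cudaFree(d);
    return 0;
}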

Does a Shader Unit calculate exponents

http://us.hardware.info/reviews/5419/nvidia-geforce-gtx-titan-z-sli-review-incl-tones-tizair-system
says that "GTX Titan-Z" has 5760 Shader units. Also here is written that "GTX Titan-Z" has 2x GK110 GPU.
CUDA exp() expf() and __expf() mentiones that it is possible to calculate exponent in cuda.
Let's say I have array of 500 000 000 ( five hundred millions ) of doubles. I want to calculate exponents of each of value in array. Who knows what to expect: 5760 shader units will be able to calculate exp, or this task can be done only with two GK110 GPU? Difference in perfomance is drastical, so I need to be sure, that if I rewrite my app with CUDA, then it will not work slower.
In other words, can I make 5760 threads to calculate 500 000 000 exponents?
GTX Titan Z is a dual GPU device. Each of the two GK110 GPUs on the card is attached via a 384-bit memory interface to its own 6 GB of high-speed memory. The theoretical bandwidth of each memory is 336 GB/sec. The particular GK110 variant used in the GTX Titan Z is comprised of fifteen clusters of execution units called SMX. Each SMX in turn is comprised of 192 single-precision floating-point units, 64 double-precision floating point units, and various other units.
Each double-precision unit in GK110 can execute one FMA (fused multiply-add), or one FMUL, or one FADD per clock cycle. At a base clock of 705 MHz, the maximum total number of DP operations that can be executed by each of the GK110 GPUs on Titan Z per second is therefore 705e6 * 15 * 64 = 676.8e9. Assuming all operations are FMAs, that equates to 1.3536 double-precision TFLOPS. Since the card uses two GPUs, the total DP performance of a GTX Titan Z is thus 2.7072 TFLOPS.
Like CPUs, GPUs provide general-purpose computation via various integer and floating-point units. GPUs also provide special function units (called MUFU = multifunction unit on GK110) that can compute rough single-precision approximations to some frequently used functions such as reciprocal, reciprocal square root, sine, cosine, exponential base 2, and logarithm base 2. As far as exponentiation is concerned, the standard single-precision math function exp2f() is the only function that maps more or less directly to a MUFU instruction (MUFU.EX2). Depending on compilation mode, there is a thin wrapper around this hardware instruction since the hardware does not support denormal operands in the special function units.
All other exponentiation in CUDA is performed via software subroutines. The standard single-precision function expf() is a fairly heavy-weight wrapper around the hardware's exp2 capability. The double-precision exp() function is a pure software routine based on minimax polynomial approximation. The complete source code for it is visible in the CUDA header file math_functions_dbl_ptx3.h (in CUDA 6.5, DP exp() code starts at line 1706 in that file). As you can see, the computation involves primarily double-precision floating-point operations, as well as integer and some single-precision floating-point operations. You can also look at the machine code by disassembling a binary executable that calls exp() with cuobjdump --dump-sass.
In terms of performance, in CUDA 6.5 the double precision exp() function has a throughput on the order of 25e9 function calls per second on a Tesla K20 (1.170 DP TFLOPS). Since each call to DP exp() consumes an 8-byte source operand and produces an 8-byte result, this equates to roughly 400 GB/sec of memory bandwidth. Since each GK110 on a Titan Z provides about 15% more performance than the GK110 on a Tesla K20, the throughput and bandwidth requirements increase accordingly. Since the required bandwidth exceeds the theoretical memory bandwidth of the GPU, code that simply applies DP exp() to an array will be completely bound by memory bandwidth.
The number of functional units in the GPU and the number of threads executing has no relationship with the number of array elements that can be processed, but can have an impact on the performance of such processing. The mapping of array elements to threads can be freely chosen by the programmer. The number of array elements that can be processed in one go is a function of the size of the GPU's memory. Note that not all of the raw memory on the device is available for user code as the CUDA software stack needs some memory for its own use, typically around 100 MB or so. An exemplary mapping for applying DP exp() to an array is shown in this code snippet:
__global__ void exp_kernel (const double * __restrict__ src,
                            double * __restrict__ dst, int len)
{
    // grid-stride loop: each thread handles multiple array elements
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        dst[i] = exp (src[i]);
    }
}

#define ARRAY_LENGTH      (500000000)
#define THREADS_PER_BLOCK (256)

int main (void) {
    // ...
    int len = ARRAY_LENGTH;
    dim3 dimBlock(THREADS_PER_BLOCK);
    int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
    if (threadBlocks > 65520) threadBlocks = 65520;  // cap the grid; the stride loop covers the rest
    dim3 dimGrid(threadBlocks);
    double *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, sizeof(d_a[0]) * len);
    cudaMalloc((void**)&d_b, sizeof(d_b[0]) * len);
    // ...
    exp_kernel<<<dimGrid,dimBlock>>>(d_a, d_b, len);
    // ...
}

running FFTW on GPU vs using CUFFT

I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;

    // allocate
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);

    // initialize "in"
    ...

    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    // execute plan
    fftw_execute(p);

    // clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
I'm doing N fft's in a for loop. I know I can execute many plans at once with FFTW, but in my implementation in and out are different every loop. The point is I'm doing the entire FFTW pipeline INSIDE a for loop.
I want to transition to using CUDA to speed this up. I understand that CUDA has its own FFT library CUFFT. The syntax is very similar: From their online documentation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
However, each of these "kernels" (as NVIDIA calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But, what if I want to parallelize my entire for loop? What if I want each of my original N for loops to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?
You cannot call FFTW methods from device code. The FFTW libraries are compiled x86 code and will not run on the GPU.
If the "heavy lifting" in your code is in the FFT operations, and the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine. Once the machine is fully utilized, there is generally no additional benefit to trying to run more things in parallel.
cufft routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans, which are another way to execute multiple transforms "at once".
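To illustrate that last option, here is a rough sketch (my own, not from the answer) of how the question's N independent 1D transforms could be run through a single batched plan with cufftPlanMany(); the signal length LEN and batch count BATCH are placeholder values standing in for the question's sizes, and error checking is omitted.
#include <cufft.h>
#include <cuda_runtime.h>

// Run BATCH independent 1D complex-to-complex FFTs of length LEN with one plan.
#define LEN   1024
#define BATCH 1000

int main(void)
{
    cufftComplex *data = 0;
    cudaMalloc((void**)&data, sizeof(cufftComplex) * LEN * BATCH);
    // ... fill "data" with BATCH signals laid out back to back ...

    cufftHandle plan;
    int n[1] = { LEN };
    // rank-1 transforms, contiguous layout: stride 1, distance LEN between signals
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, LEN,   // input layout (inembed, istride, idist)
                  NULL, 1, LEN,   // output layout (onembed, ostride, odist)
                  CUFFT_C2C, BATCH);

    // One call transforms all BATCH signals in place.
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
This replaces the host-side loop entirely: the plan is created once and the batched execution keeps the GPU busy across all the transforms.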

Timing CUDA kernels

Hi everyone, I'm currently working on timing some of my CUDA code. I was able to time it using events. My kernel ran for 19 ms. Somehow I find this doubtful, because when I ran a sequential implementation of the same thing, it came in at around 5000 ms. I know the code should run faster, but should it be this much faster?
I'm using wrapper functions to call CUDA kernels in my cpp program. Am I supposed to be calling them there or in the .cu file? Thanks!
The obvious way to check if your program is working would be to compare the output to that of your CPU based implementation. If you get the same output, it is working by definition, right? :)
If your program is experimental in such a way that it doesn't really produce any verifiable output then there is a good chance that the compiler has optimized out some (or all) of your code. The compiler will remove code that does not contribute to output data. This can cause, for instance, that the entire contents of a kernel is removed if the final statement that stores the calculated value is commented out.
As to your speedup: 5000 ms / 19 ms is about 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.
Well, if you wrote your CUDA code right, yes, it could be that much faster. Think about it. You moved the code from sequential execution on a single processor to parallel execution on hundreds of processors, depending on your GPU model. My $179 mid range card has 480 cores. Some available now have 1500 cores. It is very possible to get 100x perf jumps with CUDA, particularly if your kernel is much more compute-bound than memory bound.
That said, make sure you are measuring what you think you are measuring. Kernel launches are asynchronous to the host thread whether or not you use explicit streams, so time measurements taken in the host thread will not correctly reflect the kernel time unless you make the host thread wait until the kernel call is complete: call cudaDeviceSynchronize(), or have your host code wait on an event recorded after the kernel. You can also use CUDA events to measure elapsed time on the GPU within a given stream. See section 5.1.2 of the CUDA Best Practices Guide in the NVIDIA GPU Computing SDK 4.2.
In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
enum {
    tid_this = 0,
    tid_that,
    tid_count
};

__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
    #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
    #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
    #define TIMER_TIC
    #define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ void mykernel ( ... ) {
    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the results to the "tid_this" counter. */
    TIMER_TOC( tid_this );
}
You can then read the cuda_timers in the host code.
A few notes:
The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
The timers count the number of clock ticks. To convert to milliseconds, divide the tick count by the device's clock rate in kHz (i.e. ticks per millisecond).
The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
The expression ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
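As a sketch of reading the counters back (my addition, assuming the enum and cuda_timers array above are visible in the same compilation unit; print_cuda_timers is a made-up helper name), the values can be copied to the host with cudaMemcpyFromSymbol and converted to milliseconds using the device clock rate:
#include <cstdio>
#include <cuda_runtime.h>

// Copy the per-kernel tick counters back and print them in milliseconds.
// Each counter is the sum over all blocks, as noted above.
void print_cuda_timers(int device)
{
    float host_timers[tid_count];
    cudaMemcpyFromSymbol(host_timers, cuda_timers,
                         sizeof(float) * tid_count, 0,
                         cudaMemcpyDeviceToHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // prop.clockRate is in kHz, i.e. clock ticks per millisecond.
    for (int tid = 0; tid < tid_count; ++tid)
        printf("timer %d: %.3f ms\n", tid, host_timers[tid] / prop.clockRate);
}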

Time to be used in calculating bandwidth

I am trying to find the effective bandwidth used by my code against the GeForce 8800 GTX maximum of 86 GB/s. I am not sure what time to use, though. Currently I am using the difference between calling the kernel with my instructions and calling the kernel with no instructions. Is this the correct approach? (The formula I use is: effective bandwidth = (bytes read + bytes written) / time.)
Also, I get a really bad kernel call overhead (close to 1 sec). Is there a way to get rid of it?
You can time your kernel fairly precisely with cuda events.
//declare the events
cudaEvent_t start;
cudaEvent_t stop;
float kernel_time;
//create events before you use them
cudaEventCreate(&start);
cudaEventCreate(&stop);
//put events and kernel launches in the stream/queue
cudaEventRecord(start,0);
myKernel <<< config >>>( );
cudaEventRecord(stop,0);
//wait until the stop event is recorded
cudaEventSynchronize(stop);
//and get the elapsed time
cudaEventElapsedTime(&kernel_time,start,stop);
//cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);
Effective bandwidth in GB/s = ( (Br + Bw) / 10^9 ) / Time
Br = number of bytes read by kernel from DRAM
Bw = number of bytes written by kernel in DRAM
Time = time taken by kernel.
For example, suppose you test the effective bandwidth of copying a 2048x2048 matrix of floats (4 bytes each) from one location to another in the GPU's DRAM. The formula would be:
Bandwidth in GB/s = ( (2048 x 2048 x 4 x 2) / 10^9 ) / time-taken-by-kernel
where:
2048x2048 (matrix elements)
4 (each element is 4 bytes)
2 (one read and one write per element)
/10^9 to convert bytes into GB.
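Putting the formula and the event timing together, here is a small sketch (my own illustration, with a made-up copyKernel) of how the effective-bandwidth number for the 2048x2048 copy example might be computed in code:
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical copy kernel: reads each float once and writes it once.
__global__ void copyKernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main(void)
{
    const int n = 2048 * 2048;
    float *d_src = 0, *d_dst = 0;
    cudaMalloc((void**)&d_src, n * sizeof(float));
    cudaMalloc((void**)&d_dst, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copyKernel<<<(n + 255) / 256, 256>>>(d_src, d_dst, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Br + Bw = n floats read + n floats written
    double bytes = 2.0 * n * sizeof(float);
    double gbps  = (bytes / 1e9) / (ms / 1e3);
    printf("time: %.3f ms, effective bandwidth: %.2f GB/s\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}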