Improving CUDA SDK matrixMul with the prefetch intrinsic

Here is part of the CUDA SDK (2.3) matrixMul kernel:
for (int a = aBegin, b = bBegin;
     a <= aEnd;
     a += aStep, b += bStep) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    int XI = wA * ty + tx;
    int XII = wB * ty + tx;
    ////////////////////
    // PREFETCH BLOCK //
    ////////////////////
    AS(ty, tx) = A[a + XI];
    BS(ty, tx) = B[b + XII];
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
}
This version of matrix multiply brings a tile into shared memory and performs the calculation at shared memory bandwidth. I want to improve performance by prefetching the data for the next iteration into the L1 cache. I use the prefetch intrinsic as suggested here and inserted the following commands into the PREFETCH BLOCK above:
long long int k,kk;
k=((long long int)A+aStep); if(k<=aEnd) prefetch_l1(k+XI);
kk=((long long int)B+bStep); if(kk<=aEnd) prefetch_l1(kk+XII);
After testing, the two versions (with and without prefetching) perform very similarly (average of 3 runs):
without prefetching: 6434.866211 (ms)
with prefetching: 6480.041016 (ms)
Question:
I expected to see some speedup from the prefetching, but I'm confused by the results. Does anybody have an explanation for why these two implementations perform so similarly? Maybe I am performing the prefetch incorrectly.
Thank you in advance.
Further information:
GPU: Tesla C2050
CUDA version: 4.0
inline __device__ void prefetch_l1 (unsigned int addr)
{
    asm(" prefetch.global.L1 [ %1 ];": "=r"(addr) : "r"(addr));
}

Prefetching (on any architecture) is only of benefit if:
you have memory bandwidth to spare and
you can initiate the prefetch at the correct time, i.e. sufficiently far ahead of time before the data is actually needed and at a time when spare memory bandwidth is available.
If you can't meet these criteria then prefetching is not going to help, and may even do more harm than good.
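As a purely mechanical aside (it does not change the bandwidth argument above): the helper in the question passes the address as an unsigned int, which can truncate 64-bit addresses, and casting the pointer to an integer before adding the step mixes element and byte arithmetic. A minimal sketch of a pointer-based variant, assuming a 64-bit build; bEnd is a hypothetical bound analogous to aEnd, since the posted snippet only defines aEnd:

// Sketch only: take the full pointer so the complete address reaches the PTX instruction.
__device__ __forceinline__ void prefetch_l1(const void *ptr)
{
    asm volatile("prefetch.global.L1 [%0];" :: "l"(ptr));
}

// Hypothetical use inside the tile loop, right after the current tile's loads;
// indexing through the array keeps the address arithmetic in elements, not bytes:
// if (a + aStep <= aEnd) prefetch_l1(&A[a + aStep + XI]);
// if (b + bStep <= bEnd) prefetch_l1(&B[b + bStep + XII]);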

Related

Why can't unrolling accelerate my matrix transpose?

I am following a tutorial to learn CUDA, and I have learned that unrolling a kernel can accelerate the program. It indeed works when I write a kernel that sums an array.
But when I write a kernel that transposes a matrix following the tutorial, it doesn't work.
The original kernel looks like this:
__global__ void transform_matrix_read_col(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x + blockDim.x * blockIdx.x;
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx] = mat_a[col_idx];
    }
}
and the unrolled kernel:
__global__ void transform_matrix_read_col_unrool(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x + (blockDim.x * blockIdx.x * 4);
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx] = mat_a[col_idx];
        mat_b[row_idx + blockDim.x*1] = mat_a[col_idx + row_num*blockDim.x*1];
        mat_b[row_idx + blockDim.x*2] = mat_a[col_idx + row_num*blockDim.x*2];
        mat_b[row_idx + blockDim.x*3] = mat_a[col_idx + row_num*blockDim.x*3];
    }
}
and the main function:
size_t width = 128 , height = 128,
array_size = width*height,array_bytes = array_size * sizeof(int);
int* matrix_data = nullptr,*output_data = nullptr;
cudaMallocHost(&matrix_data, array_bytes);
cudaMallocHost(&output_data, array_bytes);
util::init_array_int(matrix_data,array_size);//this func will random generate some integer
int* matrix_data_dev = nullptr,* output_matrix_dev = nullptr;
cudaMalloc(&matrix_data_dev, array_bytes);
cudaMemcpy(matrix_data_dev, matrix_data, array_bytes, cudaMemcpyHostToDevice);
cudaMalloc(&output_matrix_dev, array_bytes);
dim3 block(32,16);
dim3 grid((width-1)/block.x+1,(height-1)/block.y+1);
dim3 gridUnrool4((width-1)/(block.x*4)+1,(height-1)/block.y +1);
transform_matrix_read_col<<<grid,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
transform_matrix_read_col_unrool<<<gridUnrool4,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
and the statistics from nsys (run on Linux with an RTX 3090):
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- -------- ------- ------- ---------------------------------------------------------------------------
6.3 3,456 1 3,456.0 3,456 3,456 transform_matrix_read_col_unrool(int*, int*, unsigned long, unsigned long)
5.2 2,880 1 2,880.0 2,880 2,880 transform_matrix_read_col(int*, int*, unsigned long, unsigned long)
We can see that the unrolled version is a lot slower.
But the tutorial says that unrolling actually accelerates the transpose.
So what causes this? And how can I accelerate the matrix transpose?
Unrolling only helps if the computation is compute-bound, so that a higher (useful) instruction throughput can decrease the execution time. Memory-bound code tends not to be much faster once unrolled, because memory-bound instructions are slowed down by contention on the memory controller.
A transposition may not seem memory-bound at first glance because of a low apparent memory throughput, but one needs to care about cache lines. Indeed, when the user code requests a single value from memory, the hardware actually fetches a fairly large cache line, so that (subsequent) contiguous accesses are fast.
Another consideration is that the code can also be latency-bound. Indeed, the inefficient strided accesses can be slow due to memory latency. The memory controller may not be able to fully saturate the RAM in this case (although this is quite unlikely on GPUs, especially given the large cache lines). If so, adding more instructions does not help, because they are typically executed in order, as opposed to modern mainstream CPUs. Using larger blocks and more blocks helps to provide more parallelism to the GPU, which can then perform more concurrent memory accesses and possibly make better use of the memory.
The key with transposition is to make accesses as contiguous as possible and to reuse cache lines. The most critical thing is to operate on small 2D blocks and not on full rows/columns (i.e., not a 1D kernel), to increase cache locality. Moreover, one efficient, well-known solution is to use shared memory: each thread of a CUDA block fetches part of a 2D tile, and the transposition can then be performed through shared memory more efficiently. It is not entirely trivial due to possible shared-memory bank conflicts that can impact performance. Fortunately, quite a few research papers and articles have covered this over the last decades.
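For illustration, a minimal sketch of such a shared-memory tiled transpose (not the tutorial's code; the 32x32 tile and the +1 padding against bank conflicts are my assumptions):

#define TILE 32
// Launch with dim3 block(TILE, TILE) and a grid covering the rows x cols input.
__global__ void transpose_tiled(const int* in, int* out, size_t rows, size_t cols)
{
    __shared__ int tile[TILE][TILE + 1];                     // +1 avoids bank conflicts
    size_t x = blockIdx.x * TILE + threadIdx.x;
    size_t y = blockIdx.y * TILE + threadIdx.y;
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;                     // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}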
The simplest efficient solution to this problem is basically to use cuBLAS, which is heavily optimized. This post may also be useful.
Note that a 128x128 transposition is very small for a GPU. GPUs are designed to handle bigger datasets (or far more expensive computations on such small inputs). If the input array is initially stored on the CPU, then I strongly advise you to do the transposition directly on the CPU, as moving the data to the GPU will likely already be slower than computing the transposition efficiently on the CPU. Indeed, data cannot be moved faster than the main RAM permits, and a 128x128 transposition can be implemented in a way that saturates the main RAM (in fact, it can likely be done directly in the CPU caches, which are significantly faster than main RAM).

What is the difference between __ldg() intrinsic and a normal execution?

I am trying to explore the __ldg intrinsic. I have gone through NVIDIA's documentation for it but didn't get any satisfactory answer about its use and implementation. Moreover, with reference to THIS, I tried using __ldg in a simple 1024x1024 matrix multiplication example.
#include<stdio.h>
#include<stdlib.h>
__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
float pvalue=0;
//find Row and Column corresponding to a data element for each thread
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
//calculate dot product of Row of First Matrix and Column of Second Matrix
for(int i=0;i< N;++i)
{
// I tried with executing this first:
float m=__ldg(&ad[Row * N+i]);
float n=__ldg(&bd[i * N + Col]);
//Then I executed this as a normal execution:
// float m = ad[Row * N+i];
// float n = bd[i * N + Col];
pvalue += m * n;
}
//store dot product at corresponding position in resultant Matrix
cd[Row * N + Col] = pvalue;
}
int main()
{
int N = 1024,i,j; //N == size of square matrix
float *a,*b;
float *ad,*bd,*cd,*c;
//open a file for outputting the result
FILE *f;
f=fopen("Parallel Multiply_ldg.txt","w");
size_t size=sizeof(float)* N * N;
//allocate host side memory
a=(float*)malloc(size);
b=(float*)malloc(size);
c=(float*)malloc(size);
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
a[i*N+j]=2.0; //(float)(i*N+j); //initializing each value with its own index
b[i*N+j]=1.0; //(float)(i*N+j); //random functions can be used alternatively
}
}
//allocate device memory
cudaMalloc(&ad,size);
//printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&bd,size);
//printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&cd,size);
//printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));
//copy value from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);
printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//calculate execution configuration
dim3 blocksize(16,16); //each block contains 16 * 16 (=256) threads
dim3 gridsize(N/16,N/16); //creating just sufficient no of blocks
//GPU timer code
float time;
cudaEvent_t start,stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
cudaDeviceSynchronize();
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop); //time taken in kernel call calculated
cudaEventDestroy(start);
cudaEventDestroy(stop);
//copy back results
cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);
printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//output results in output_file
fprintf(f,"Array A was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",a[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nArray B was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",b[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nMultiplication of A and B gives C----\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",c[i*N+j]); //if correctly computed, then all values must be N
fprintf(f,"\n");
}
printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
printf("\n\nTime taken is %f (ms)\n",time);
fprintf(f,"\n\nTime taken is %f (ms)\n",time);
fclose(f);
cudaThreadExit();
//cudaFree(ad); cudaFree(bd); cudaFree (cd);
free(a);free(b);free(c);
//_getch();
return 1;
}
I commented out the __ldg part in my kernel and executed the normal version, and vice versa.
In both cases it gives me the correct multiplication result. I am confused by the time difference between these executions, because it is huge, almost more than 100X!
In case of __ldg it gives me: Time taken is 0.014432 (ms)
And in case of normal execution without __ldg it gives me : Time taken is 36.858398 (ms)
Is this the right way to use the __ldg intrinsic? What is the significance of __ldg, and what is the proper way to use it? Apparently what I did above in my code is wrong and naive. I am looking for an explanation and an example. Thanks in advance.
From the CUDA C Programming Guide
Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.
...
Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
Read-only cache accesses have a much lower latency than global memory accesses. Because matrix multiplication accesses the same values from memory many times, caching in the read-only cache gives a huge speedup (in memory-bound applications).
NVIDIA GPUs have texture hardware: a pipeline with dedicated logic for working with images.
This texture memory is another type of memory available on the GPU; in particular, constant memory, global memory and the register file have no relation to it.
Kepler GPUs and later add the ability to read global data through this GPU texture pipeline (the read-only data cache).
But let's spell out the difference between the constant cache and the read-only cache.
Constant Cache
Data loaded through the constant cache must be relatively small and must be accessed in such a way that all threads of a warp access the same location at any given time.
Read-only Cache or Texture Memory Cache
This cache can be much larger and can be accessed in a non-uniform pattern.
The read-only cache has a granularity of 32 bytes.
You can use this as a "read-only cache" for your CUDA kernel.
1. Data stored in global memory can be cached in this GPU texture memory.
2. By doing so you promise the compiler that the data is read-only for the
duration of the kernel execution on the GPU.
There are two ways to achieve this.
A. Using an intrinsic function __ldg
Example: output[i] += __ldg(&input[j]);
B. Qualifying pointers to global memory
const float* __restrict__ input
output[idx] += input[idx];
Comparison:
The intrinsic __ldg is a better choice for deep compiler reasons.
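For the kernel in the question, a minimal sketch of option B (the qualifiers are the only change from the original signature; on compute capability 3.5+ the compiler may then route these loads through the read-only cache on its own):

// Sketch: const + __restrict__ mark the inputs as read-only and non-aliased.
__global__ void matrix_mul(const float* __restrict__ ad,
                           const float* __restrict__ bd,
                           float* __restrict__ cd, int N)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    float pvalue = 0;
    for (int i = 0; i < N; ++i)
        pvalue += ad[Row * N + i] * bd[i * N + Col];
    cd[Row * N + Col] = pvalue;
}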

Implementing Max Reduce in Cuda

I've been learning CUDA and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel:
__global__ void max_reduce(const float* const d_array,
                           float* d_max,
                           const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (gid == 0)
        *d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLT_MAX; // 1 (FLT_MAX from float.h)
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block, if gridDim.x*blockDim.x is greater than elements.
Note that in this line, even though we are checking that the thread operating (gid) is less than elements, when we add s to gid for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.
As you already discovered, your last line was not correct. Each block produces its own result, and we must combine them somehow. One method you might consider, if the number of blocks launched is small (more on this later), is to use atomics. Normally we steer people away from using atomics since they are "costly" in terms of execution time. However, the other option we are faced with is saving the block result in global memory, finishing the kernel, and then possibly launching another kernel to combine the individual block results. If I have launched a large number of blocks initially (say more than 1024) then if I follow this methodology I might end up launching two additional kernels. Thus the consideration of atomics. As indicated, there is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf, which provides an atomic max for float.
But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we are launching a fixed number of blocks (say, 128; you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid variable to its initial value. This is actually unnecessary, and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value, like -FLT_MAX.
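A minimal host-side sketch of what that looks like (128 blocks of 256 threads is just the example number discussed above; d_array, d_max and elements are assumed to be allocated and sized already):

#include <cfloat>   // FLT_MAX

// Sketch of the host-side launch: fixed grid, result initialized before the kernel runs.
void launch_max_reduce(const float *d_array, float *d_max, size_t elements)
{
    const int threads = 256, blocks = 128;                  // enough blocks to keep the GPU busy
    float init = -FLT_MAX;
    cudaMemcpy(d_max, &init, sizeof(float), cudaMemcpyHostToDevice);   // initialize the result
    max_reduce<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_max, elements);
    cudaDeviceSynchronize();
}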
Again, we normally steer people away from atomic usage, but in this case it's worth considering if we carefully manage it, and it allows us to save the overhead of additional kernel launches.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Here's one that appears naive but isn't. This won't generalize to other functions like sum(), but it works great for min() and max().
__device__ const float float_min = -3.402e+38;

__global__ void maxKernel(float* d_data)
{
    // compute max over all threads, store max in d_data[0]
    int i = threadIdx.x;
    __shared__ float max_value;
    if (i == 0) max_value = float_min;
    float v = d_data[i];
    __syncthreads();
    while (max_value < v) max_value = v;
    __syncthreads();
    if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0].
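A hypothetical launch, following that description (n is the element count and thread count, d_data the device array, h_max a host float):

maxKernel<<<1, n>>>(d_data);                                         // one block of n threads
cudaMemcpy(&h_max, d_data, sizeof(float), cudaMemcpyDeviceToHost);   // result is in d_data[0]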
I've uploaded the working file here: https://github.com/kenseehart/YAMR

CUDA: writing to shared memory increases kernel execution time a lot

I am trying to reduce an array of 65536 elements (calculate the sum of its elements) with the help of CUDA. The kernel looks like the following (please ignore the *dev_distanceFloats and index arguments for now):
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
    int tid = threadIdx.x;
    float mySum = 0;
    for (int e = 0; e < 256; e++) {
        mySum += d[tid + e];
    }
}
and it is launched as one block with 256 threads:
kernel_calcSum <<<1, 256 >>>(dev_spFloats1, dev_distanceFloats, index);
So far, so good. Each of the 256 threads takes 256 elements from global memory and accumulates its sum in the local variable mySum. Kernel execution time is about 45 milliseconds.
The next step is to introduce shared memory among those 256 threads in the block (to calculate the sum of mySum), so the kernel becomes the following:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    float mySum = 0;
    for (int e = 0; e < 256; e++) {
        mySum += d[tid + e];
    }
    sdata[tid] = mySum;
}
I just added a write to shared memory, but the execution time increases from 45 milliseconds to 258 milliseconds (I am checking this with the NVIDIA Visual Profiler 5.0.0).
I realize that there are 8 bank conflicts for each thread when writing to the sdata variable (I am on a GTX 670, which has compute capability 3.0 with 32 banks). As an experiment, I tried reducing the number of threads to 32 when launching the kernel, but the time is still 258 milliseconds.
Question 1: why does writing to shared memory take so long in my case?
Question 2: is there any tool that shows a detailed "execution plan" of sorts (timings for memory accesses, conflicts, etc.)?
Thanks for your suggestions.
Update:
playing with the kernel, I set sdata to a constant for each thread:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    float mySum = 0;
    for (int e = 0; e < 256; e++) {
        mySum += d[tid + e];
    }
    sdata[tid] = 111;
}
and the timing is back to 48 milliseconds.
So, changing
sdata[tid] = mySum;
to
sdata[tid] = 111;
made the difference.
Is this a compiler optimization (maybe it just removed that line?), or does copying from local memory (a register?) to shared memory take that long for some reason?
Neither of your kernels does anything, because neither writes out results to memory that would still be accessible after the kernel finishes.
In the first case, the compiler is clever enough to notice this and optimize away the whole calculation.
In the second case where shared memory is involved, the compiler does not notice this as the flow of information through shared memory would be harder to track. It thus leaves the calculation in.
Pass in a pointer to global memory (as you already do) and write out results via this pointer.
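A minimal sketch of that fix (reusing the dev_distanceFloats argument as the output is an assumption; the summation itself is unchanged from the question):

__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
    int tid = threadIdx.x;
    float mySum = 0;
    for (int e = 0; e < 256; e++) {
        mySum += d[tid + e];
    }
    dev_distanceFloats[tid] = mySum;   // observable write: the work can no longer be optimized away
}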
Shared memory is not the right thing for this. What you need are warp atomic operations to sum up within a warp, and then to communicate the intermediate results between warps. There is example code demonstrating this shipping with CUDA.
Summing up elements is one of those tasks where massive parallelization won't help much, and the GPU can in fact be outperformed by a CPU.
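One possible reading of that advice, as a minimal sketch: reduce within each warp using shuffles, then combine the per-warp results with a single atomicAdd (requires CUDA 9+ for __shfl_down_sync; the reduction sample in the CUDA toolkit covers this in much more depth):

__global__ void kernel_warpSum(const float *d, float *total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? d[i] : 0.0f;
    // sum within the warp: after the loop, lane 0 holds the warp's total
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(total, v);   // combine the per-warp partial sums
}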

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels myself for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D. I first allocate enough space on the device for X and X_transpose*X, and copy X to the device. My kernel takes two inputs: the X matrix, and the space to write the X_transpose*X result.
Using the profiler, the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size, to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my MATLAB execution (just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get a much better speedup than MATLAB using CUDA.
The NVIDIA Visual Profiler is unable to find any issues with my kernel, so I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running this on a Kepler card, a GTX 670 4GB.
A smaller block size like 16x16 or 8x8 may be faster. These slides also demonstrate that a larger, non-square block/shared-memory size may be faster for particular matrix sizes.
For the shared-memory allocation, add a dummy element on the leading dimension, i.e. use [BLOCK_SIZE][BLOCK_SIZE+1], to avoid bank conflicts.
Try unrolling the inner for loop with #pragma unroll.
On the other hand, you probably won't be much faster than MATLAB's GPU code for a large enough A'*A, since MATLAB's performance bottleneck is the invocation overhead rather than kernel performance.
The cuBLAS routine cublasSgemm() likely has the highest performance for matrix multiplication. You could compare yours against it.
The MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's an open-source project. You may also get some ideas from their code.
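For the cuBLAS comparison suggested above, a minimal sketch of computing C = X'*X with cublasSgemm, assuming X is stored column-major (cuBLAS's convention) with no pitch padding; the names d_X, d_C, rows and cols are placeholders, not from the post. Since the result is symmetric, cublasSsyrk could also be used.

#include <cublas_v2.h>

// Sketch only: C (cols x cols) = X^T (cols x rows) * X (rows x cols)
void xtx_cublas(const float *d_X, float *d_C, int rows, int cols)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                cols, cols, rows,
                &alpha, d_X, rows,
                        d_X, rows,
                &beta,  d_C, cols);
    cublasDestroy(handle);
}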