Matrix multiplication of a matrix and its transpose in CUDA

I am relatively new to CUDA programming, so there are some unsolved issues for which I hope I can get some hints in the right direction.
I want to multiply a 2D array with its transpose, that is, to compute the product A^T * A.
I have already used the cuBLAS Dgemm function, and now I am trying to do the same operation with a tiled algorithm, very similar to the one from the CUDA programming guide.
The thing is that while the initial algorithm runs properly, I want to calculate only the upper triangular matrix of the product, hoping to get a better time for the operation, and I am not sure how to extract the tiles/blocks that hold the respective elements.
So if you could enlighten me on this, or give any hint, I would be grateful, because I have been stuck on that for a while.
This is the code of the kernel
__global__ void ATAKernel(double *Ad, double *Rd, int width, int height) // signature assumed from the variables used below
{
    __shared__ double Ads1[TILE_WIDTH][TILE_WIDTH];
    __shared__ double Ads2[TILE_WIDTH][TILE_WIDTH];

    //block row and column
    //we save in registers for faster access
    int by = blockIdx.y;
    int bx = blockIdx.x;
    int ty = threadIdx.y;
    int tx = threadIdx.x;

    int row = by * TILE_WIDTH + ty;
    int col = bx * TILE_WIDTH + tx;

    double Rvalue = 0;

    if (row >= width || col >= width) return;

    //Each thread block computes one sub-matrix Rsub of the result R
    for (int i = 0; i < (int) ceil((double) height / TILE_WIDTH); ++i)
    {
        Ads1[tx][ty] = Ad[(i * TILE_WIDTH + ty) * width + col];
        Ads2[tx][ty] = Ad[(i * TILE_WIDTH + tx) * width + row];
        __syncthreads();

        for (int j = 0; j < TILE_WIDTH; ++j)
        {
            if ((i * TILE_WIDTH + j) > height) break; //in order not to exceed the matrix's height
            Rvalue += Ads1[j][tx] * Ads2[ty][j];
        }
        __syncthreads();
    }

    Rd[row * width + col] = Rvalue;
}

You may want to use the batched Dgemm API function described here, recursively dividing your output matrix into diagonal blocks and corner blocks. You also want to balance the smallest block size against compute overhead, to avoid launching many tiny kernels. Finally, note that matrix multiplication becomes memory bound at some point, and on a modern GPU that point can be at a fairly large size.
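As a rough, hedged sketch of that block decomposition (the block size, names and column-major layout are my assumptions, and it uses one plain cublasDgemm per block pair rather than the batched call, which you would substitute to cut launch overhead), computing only the upper-triangular blocks of C = A^T * A:

#include <cublas_v2.h>

// d_A: height x width, column-major, lda = height
// d_C: width x width,  column-major, ldc = width
// bs : block size; assumes width % bs == 0
void ataUpperBlocks(cublasHandle_t handle, const double *d_A, double *d_C,
                    int height, int width, int bs)
{
    const double alpha = 1.0, beta = 0.0;
    int nb = width / bs;
    for (int i = 0; i < nb; ++i) {
        for (int j = i; j < nb; ++j) {          // only j >= i: upper triangle of blocks
            // block C(i,j) = A_i^T (bs x height) * A_j (height x bs)
            cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                        bs, bs, height,
                        &alpha,
                        d_A + (size_t)i * bs * height, height,
                        d_A + (size_t)j * bs * height, height,
                        &beta,
                        d_C + (size_t)i * bs + (size_t)j * bs * width, width);
        }
    }
}

Since every block product has identical dimensions, the same block pairs can be grouped into a single cublasDgemmBatched call once the pointer arrays are built.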

Related

Kmeans clustering acceleration in GPU(CUDA)

I am a fairly new CUDA user. I'm practicing on my first CUDA application, where I try to accelerate the k-means algorithm using a GPU (GTX 670).
Briefly, each thread works on a single point, which is compared to all cluster centers, and the point is assigned to the center with the minimum distance (the kernel code can be seen below with comments).
According to Nsight Visual Studio, I have an occupancy of 99.61% (1024 blocks, 1024 threads per block), 99.34% streaming multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4 GFLOPs single-precision MUL and 55.2 GFLOPs single-precision ADD (it takes about 14.5 ms to complete the k-means kernel with the given parameters).
According to Wikipedia, the GTX 670's peak performance is 2460 GFLOPs, and I am nowhere close to it. In addition, some papers claim they can achieve more than half of the peak performance. I cannot see how I can optimize this kernel code any further. Is there any optimization that I can apply to the kernel? Any suggestion or help is appreciated, and I can give any additional information on demand.
Complete Code
Thanks in advance.
#define SIZE 1024*1024          //number of points
#define CENTERS 32              //number of cluster centroids
#define DIM 8                   //dimension of each point and center
#define cudaTHREADSIZE 1024     //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel

__global__ void kMeans(float *dp, float *dc, int *tag, int *membershipChangedPerBlock)
{
    //TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS, BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ unsigned char membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;

    //move centers to shared memory, because every thread will read them (roughly +10% performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];
        threadID += blockDim.x;
    }
    __syncthreads();

    while(tid < SIZE){
        int index, prevIndex;
        float dist, min_dist;

        index = 0; //all initial point indices (centroid numbers) are assigned to 0
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        //Euclidean distance for center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        //Euclidean distance for the other centers, with distance comparison
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);
            }
            //compare distances; if a shorter one is found, change index to that centroid number
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }

        if (tag[tid] != index) {
            //if a point's cluster membership changes, flag it as changed in order to compute total membership changes later on
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads(); //sync before applying sum reduction to membership changes

        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }

        tid += blockDim.x * gridDim.x;
    }
}
My advice is to compare your work with a more experienced GPU developer's work. I found out that a k-means implementation was written by Bryan Catanzaro after watching this video. You can find the source code here:
https://github.com/bryancatanzaro/kmeans
I am also a beginner, but IMHO it is better to use libraries like Thrust. GPU programming is a genuinely complicated business, and it is hard to achieve maximum performance; Thrust will help you with that.
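For example, here is a minimal Thrust sketch (the flag buffer and its layout are my assumptions, not the asker's code) that replaces the hand-written block reduction by summing one membership-change flag per point directly on the device:

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// d_changed: device array with one 0/1 flag per point, n: number of points
int countMembershipChanges(const int *d_changed, int n)
{
    thrust::device_ptr<const int> first(d_changed);
    // sums all flags on the GPU and returns the total to the host
    return thrust::reduce(first, first + n, 0);
}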
Check out RAPIDS cuML (rapids.ai), which replicates the scikit-learn API.
Example from the docs:
from cuml import KMeans
from cuml.cluster import KMeans

import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i: df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c, column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Tricky array arithmetic inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". In order to perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that writes some piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    //Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    //Load main data
    #pragma unroll
    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
...
As far as I understand this code, each thread will calculate its own values of baseX and baseY, and after that all active threads will start to increment the pointers d_Src and d_Dst simultaneously.
So, according to my understanding, this would be correct if the arrays d_Src and d_Dst were in local memory (i.e. each thread had its own copy of these arrays). But these arrays are in global device memory! So what will happen is that all active threads increment the pointers, and the result will be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer.
#include <iostream>
using namespace std;

void foo(float* bar){
    bar++;          // modifies only the local copy of the pointer
}

int main(){
    float* test = 0;
    foo(test);
    cout << test << endl; //will print 0
}
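The same thing happens inside a kernel: pointer parameters are per-thread values (typically held in registers), so advancing them in one thread never affects another thread. A minimal sketch (a hypothetical kernel, not taken from the sample):

__global__ void offsetDemo(const float *src, float *dst)
{
    // each thread owns a private copy of src and dst
    src += threadIdx.x;   // advances only this thread's copy of the pointer
    dst += threadIdx.x;
    *dst = *src;          // the memory pointed to is shared; the pointers are not
}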

Cuda Kernel with reduction - logic errors for dot product of 2 matrices

I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to compute the dot product of two matrices. I am getting the right answer only for matrices of size 2; for any other size I get it wrong.
This is only a test, so I am keeping the matrix size very small, only about 100, so a single block can fit it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b

float dx = hc[0];
float hx = 0;

// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my cuda kernel
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code : http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that, and take a look at why it fails for some other case, let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.
In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But this doesn't make sense and is not what we want at all. What we want is something like:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1){
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
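For completeness, here is a minimal host-side sketch of that launch (the allocation sizes and variable names are my assumptions, reusing ha/hb from your host code); the kernel runs in a single block of n threads and the dot product ends up in dc[0]:

int n = 100;                           // must fit in a single thread block
size_t bytes = n * sizeof(float);

float *da, *db, *dc;
cudaMalloc(&da, bytes);
cudaMalloc(&db, bytes);
cudaMalloc(&dc, bytes);                // note: dc needs n elements, not 1

cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

sum_reduce<<<1, n>>>(da, db, dc, n);   // one block, n threads

float result;
cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);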

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels myself for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D. I first allocate enough space on the device for X and for X_transpose * X. I copy X to the device; my kernel takes two inputs, the X matrix and the space to write the X_transpose * X result.
Using the profiler, the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size, to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, and this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my MATLAB execution (just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than MATLAB using CUDA.
The NVIDIA visual profiler is unable to find any issues with my kernel, so I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {

    //find location in output matrix
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;

    Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
    float Cvalue = 0;

    for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {

        //Get sub-matrix
        Matrix Xsub  = GetSubMatrix(X, m, blockCol);
        Matrix XTsub = GetSubMatrix(X, m, blockRow);

        __shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];

        //Xs[row][col] = GetElement(Xsub, row, col);
        //XTs[row][col] = GetElement(XTsub, col, row);
        Xs[row][col]  = *((float*)((char*)Xsub.data  + row*Xsub.pitch)  + col);
        XTs[col][row] = *((float*)((char*)XTsub.data + row*XTsub.pitch) + col);

        __syncthreads();

        for(int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += Xs[e][row] * XTs[col][e];

        __syncthreads();
    }

    //write the result to the XTX matrix
    //SetElement(XTXsub, row, col, Cvalue);
    ((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
    matrixLocation location;
    unsigned int width;        //width of matrix (# cols)
    unsigned int height;       //height of matrix (# rows)
    unsigned int paddedWidth;  //zero-padded width
    unsigned int paddedHeight; //zero-padded height
    float* data;               //pointer to linear array of data elements
    size_t pitch;              //pitch in bytes, paddedHeight*sizeof(float) for host; the device determines its own pitch
    size_t size;               //total number of elements in the matrix
    size_t paddedSize;         //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running this on a Kepler card, a GTX 670 4GB.
A smaller block size like 16x16 or 8x8 may be faster. These slides also demonstrate that a larger, non-square block/shared-memory size may be faster for particular matrix sizes.
For the shared memory allocation, add a dummy element on the leading dimension, i.e. use [BLOCK_SIZE][BLOCK_SIZE+1], to avoid bank conflicts.
Try to unroll the inner for loop with #pragma unroll, as in the sketch below.
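Here is a minimal, self-contained sketch of those two suggestions (a toy single-tile kernel, not your full XTXKernel): pad the leading dimension of the shared arrays by one element and unroll the inner product loop:

#define BLOCK_SIZE 16

// computes C = A^T * A for a single BLOCK_SIZE x BLOCK_SIZE tile, row-major
// launch: paddedTileDemo<<<1, dim3(BLOCK_SIZE, BLOCK_SIZE)>>>(dA, dC);
__global__ void paddedTileDemo(const float *A, float *C)
{
    __shared__ float Xs [BLOCK_SIZE][BLOCK_SIZE + 1];  // +1 on the leading dimension
    __shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE + 1];  // helps avoid bank conflicts

    int row = threadIdx.y, col = threadIdx.x;
    Xs[row][col]  = A[row * BLOCK_SIZE + col];
    XTs[col][row] = A[row * BLOCK_SIZE + col];
    __syncthreads();

    float Cvalue = 0.0f;
    #pragma unroll
    for (int e = 0; e < BLOCK_SIZE; ++e)
        Cvalue += Xs[e][row] * XTs[col][e];

    C[row * BLOCK_SIZE + col] = Cvalue;
}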
On the other hand, you probably won't be much faster than MATLAB's GPU code for a large enough A'*A, since MATLAB's performance bottleneck there is the invocation overhead rather than the kernel performance.
The cuBLAS routine cublasSgemm() may give the highest performance for matrix multiplication; you could compare yours against it, for example along the lines of the sketch below.
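For reference, a minimal sketch of that comparison (the function name and argument layout are my assumptions; it assumes d_X is stored column-major, h x w with leading dimension h, which differs from the pitched row-major layout used above):

#include <cublas_v2.h>

// computes d_C = X^T * X, where d_X is h x w and d_C is w x w (both column-major)
void xtxCublas(cublasHandle_t handle, const float *d_X, float *d_C, int h, int w)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                w, w, h,                 // C is w x w, inner dimension h
                &alpha, d_X, h,          // op(A) = X^T
                        d_X, h,          // op(B) = X
                &beta,  d_C, w);
}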
The MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases; it's an open-source project, so you may also get some ideas from their code.

thread management in nbody code of cuda-sdk

When I read the n-body code in the CUDA SDK, I went through some lines of the code and found that it is a little bit different from the paper in GPU Gems 3, "Fast N-Body Simulation with CUDA".
My questions are: first, why is blockIdx.x still involved in loading memory from global to shared memory, as written in the following code?
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : //this line
        positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];                    //this line

    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);

    __syncthreads();
}
Isn't it supposed to be like this, according to the paper? I wonder why:
sharedPos[threadIdx.x+blockDim.x*threadIdx.y] =
    multithreadBodies ?
    positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
    positions[WRAP(tile, gridDim.x) * p + threadIdx.x];
Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Isn't it supposed to be a fixed value, or not involved at all, since the sum is only over threadIdx.y?
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; //this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; //this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; //this line

    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x,i).x; //this line
            acc.y += SX_SUM(threadIdx.x,i).y; //this line
            acc.z += SX_SUM(threadIdx.x,i).z; //this line
        }
    }
}
Can anyone explain this to me? Is it some kind of optimization for faster code?
I am an author of this code and the paper. Numbered answers correspond to your numbered questions.
The blockIdx.x offset to the WRAP macro is not mentioned in the paper because this is a micro-optimization. I'm not even sure it is worthwhile any more. The purpose was to ensure that different SMs were accessing different DRAM memory banks rather than all pounding on the same bank at the same time, to ensure we maximize the memory throughput during these loads. Without the blockIdx.x offset, all simultaneously running thread blocks will access the same address at the same time. Since the overall algorithm is compute rather than bandwidth bound, this is definitely a minor optimization. Sadly, it makes the code more confusing.
The sum is across threadIdx.y, as you say, but each thread needs to do a separate sum (each thread computes gravitation for a separate body). Therefore we need to use threadIdx.x to index the right column of the (conceptually 2D) shared memory array.
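To illustrate the indexing in point 2 with something self-contained, here is a generic sketch (not the SDK code; the names and data layout are made up): each column of the conceptually 2D shared array belongs to one body (indexed by threadIdx.x), and the threadIdx.y == 0 row reduces its own column:

__global__ void columnReduce(const float *in, float *out)
{
    extern __shared__ float partial[];   // blockDim.x * blockDim.y floats
    int col = threadIdx.x, row = threadIdx.y;

    // each thread deposits its partial result at (column = body, row = sub-sum)
    partial[col + blockDim.x * row] =
        in[blockIdx.x * blockDim.x * blockDim.y + col + blockDim.x * row];
    __syncthreads();

    // one thread per column sums down that column only
    if (row == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.y; ++i)
            sum += partial[col + blockDim.x * i];
        out[blockIdx.x * blockDim.x + col] = sum;
    }
}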
To Answer SystmD's question in his (not really correct) answer, gridDim.y is only 1 in the (default/common) 1D block case.
1)
The array sharedPos is loaded into the shared memory of each block (i.e. of each tile) before the threads of each block are synchronized (with __syncthreads()). blockIdx.x is the index of the tile, according to the algorithm.
Each thread (indices threadIdx.x, threadIdx.y) loads a part of the shared array sharedPos. blockIdx.x refers to the index of the tile (without multithreading).
2)
acc is the float3 of the body with index blockIdx.x * blockDim.x + threadIdx.x (see the beginning of the integrateBodies function).
I ran into some trouble with multithreadBodies=true during this sum, with q>4 (128 bodies, p=16, q=8, gridx=8) on a GTX 680: some sums were not done over the whole blockDim.y...
I changed the code to avoid that. It works, but I don't really know why...
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x;
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y;
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z;

    __syncthreads();

    for (int i = 0; i < blockDim.y; i++)
    {
        acc.x += SX_SUM(threadIdx.x,i).x;
        acc.y += SX_SUM(threadIdx.x,i).y;
        acc.z += SX_SUM(threadIdx.x,i).z;
    }
}
Another question: in the first loop,
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
}
I don't know why blockIdx.y is used, since grid.y = 1.
3) For faster code, I use asynchronous H2D and D2D memory copies (my code only uses the gravitation kernel).
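A minimal sketch of what those asynchronous copies look like (the buffer names and sizes are my assumptions, and the kernel launch is only indicated in a comment); the host buffer must be pinned (cudaMallocHost / cudaHostAlloc) for the H2D copy to be truly asynchronous:

cudaStream_t stream;
cudaStreamCreate(&stream);

// host-to-device copy of the body positions (h_pos must be pinned memory)
cudaMemcpyAsync(d_pos, h_pos, nBodies * 4 * sizeof(float),
                cudaMemcpyHostToDevice, stream);

// device-to-device copy, e.g. keeping a backup of the previous positions
cudaMemcpyAsync(d_posOld, d_pos, nBodies * 4 * sizeof(float),
                cudaMemcpyDeviceToDevice, stream);

// launch the gravitation kernel on the same stream so it runs after the copies:
// integrateBodies<<<grid, block, sharedBytes, stream>>>(...);

cudaStreamSynchronize(stream);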