K-means clustering acceleration on GPU (CUDA)

I am a fairly new CUDA user. I'm practicing on my first CUDA application, where I try to accelerate the k-means algorithm on a GPU (GTX 670).
Briefly, each thread works on a single point, which is compared to all cluster centers; the point is assigned to the center with the minimum distance (the kernel code can be seen below, with comments).
According to Nsight Visual Studio, I have an occupancy of 99.61% (1024 blocks, 1024 threads per block), 99.34% streaming multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4 GFLOPS single-precision MUL and 55.2 GFLOPS single-precision ADD (the k-means kernel takes about 14.5 ms to complete with the given parameters).
According to Wikipedia, the GTX 670's peak performance is 2460 GFLOPS. I am nowhere close to it. In addition, some papers claim they can achieve more than half of the peak performance. I cannot see how I can optimize this kernel any further. Is there any optimization that I can apply to the kernel? Any suggestion or help is appreciated, and I can give additional information on demand.
Complete Code
Thanks in advance.
#define SIZE 1024*1024 //number of points
#define CENTERS 32 //number of cluster centroids
#define DIM 8 //dimension of each point and center
#define cudaTHREADSIZE 1024 //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel
__global__ void kMeans(float *dp, float *dc, int *tag, int *membershipChangedPerBlock)
{
    //TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS, BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ unsigned char membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;

    membershipChanged[threadIdx.x] = 0;

    //move centers to shared memory, because each and every thread reads them (roughly +10% performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];
        threadID += blockDim.x;
    }
    __syncthreads();

    while(tid < SIZE){
        int index, prevIndex;
        float dist, min_dist;

        index = 0; //all points are initially assigned to centroid 0
        prevIndex = 0;
        dist = 0;
        min_dist = 0;

        //Euclidean distance to center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }

        //Euclidean distance to the other centers, with distance comparison
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);
            }
            //compare distances; if a shorter one is found, change index to that centroid number
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }

        if (tag[tid] != index) { //if a point's cluster membership changes, flag it so the total number of membership changes can be computed later
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;

        __syncthreads(); //sync before applying the sum reduction to the membership changes

        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }

        tid += blockDim.x * gridDim.x;
    }
}

My advice is to compare your work with a more experienced GPU developer's work. I found out that a k-means implementation was written by Bryan Catanzaro after watching this video. You can find the source code here:
https://github.com/bryancatanzaro/kmeans
I am also a beginner, but IMHO it is better to use libraries like Thrust. GPU programming is a really complicated business, and it is hard to achieve maximum performance; Thrust will help you with that.
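As a minimal sketch of what that could look like (my own illustration, not from either answer, and assuming the per-point 0/1 "membership changed" flags have already been written to a device array rather than reduced per block in shared memory as in the question), Thrust can replace the hand-written block reduction with a single call:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sums the per-point 0/1 "membership changed" flags on the GPU.
// 'changed' is assumed to hold one flag per point (hypothetical layout).
int count_membership_changes(const thrust::device_vector<int>& changed)
{
    return thrust::reduce(changed.begin(), changed.end(), 0);
}

This removes both the shared-memory reduction and the extra host-side pass over membershipChangedPerBlock, at the cost of one global flag write per point.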

Check out rapids.ai cuml which replicates scikit api
Example from docs:
from cuml.cluster import KMeans
import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d' % i: df[:, i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c, column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Related

CUDA shared memory bank conflict unexpected timing

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark where a warp (32 threads) accesses 32 integers of 32 bits each in the following two scenarios:
When there is no bank conflict (offset=1)
When there is a bank conflict (offset=32, all threads are accessing bank 0)
Here is a sample of the code (only the kernel):
__global__ void kernel(int offset) {
    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // init shared memory
    if (threadIdx.x == 0) {
        for (int i = 0; i < MEMORY_SIZE; i++)
            shared_memory[i] = i;
    }
    __syncthreads();

    uint32_t index = threadIdx.x * offset;

    // 2048 / 32 = 64
    for (int i = 0; i < 64; i++)
    {
        shared_memory[index] += index * 10;
        index += 32;
        index %= MEMORY_SIZE;
        __syncthreads();
    }
}
I expected the version with offset=32 to run slower than the one with offset=1, since the accesses should be serialized, but I found that they have similar execution times. How is that possible?
You have only one working warp, so the biggest problem for your performance is that each (or most) GPU instruction has to wait for the previous one to finish. This latency hides most of the slowdown from shared memory conflicts. You also do a lot of other work per shared memory access; how many small instructions are there inside cosf? Try simple integer arithmetic instead.
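If you want the bank conflict itself to become the bottleneck, one option is to run many warps per block and keep almost nothing but the shared memory access in the loop. A minimal sketch, under my own assumptions (MEMORY_SIZE, the launch configuration and the sink output parameter are mine, not from the question):

#include <stdint.h>

#define MEMORY_SIZE 2048

__global__ void bank_test(int offset, int iters, uint32_t *sink)
{
    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // every thread initialises a slice of shared memory (no single-thread loop)
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;
    __syncthreads();

    uint32_t index = (threadIdx.x * offset) % MEMORY_SIZE;
    uint32_t acc = 0;

    for (int i = 0; i < iters; i++) {
        acc += shared_memory[index];          // the access under test
        index = (index + 32) % MEMORY_SIZE;   // keeps the bank pattern fixed
    }

    // write something that depends on acc so the compiler cannot remove the loop
    if (acc == 0xFFFFFFFFu)
        *sink = acc;
}

Launched as, say, bank_test<<<80, 1024>>>(offset, 10000, d_sink) and timed with CUDA events, the SM has 32 warps per block to schedule, so shared memory throughput rather than instruction latency limits the loop, and offset=32 (all lanes of a warp hitting bank 0) should measurably lag offset=1.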

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>

extern "C" {

__device__ float get_random_number(curandState* global_state, int thread_id) {
    curandState local_state = global_state[thread_id];
    float num = curand_uniform(&local_state);
    global_state[thread_id] = local_state;
    return num;
}

__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
    int x = threadIdx.x + blockIdx.x * blockDim.x; // column index of cell
    int y = threadIdx.y + blockIdx.y * blockDim.y; // row index of cell

    // make sure this cell is within bounds of grid
    if (x < grid_size && y < grid_size) {
        int thread_id = y * grid_size + x; // thread index
        grid_b[thread_id] = grid_a[thread_id]; // copy current cell
        float num;

        // ignore cell if it is not already populated
        if (grid_a[thread_id] > 0.0) {
            num = get_random_number(global_state, thread_id);
            // agents in this cell die
            if (num < survival_probabilities[thread_id]) {
                grid_b[thread_id] = 0.0; // cell dies
                //printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
            }
        }
    }
}

} // extern "C"
'''

mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between this code, which determines the size of the state that the PyCUDA curandom interface allocates for its internal use, and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is obviously unlikely, particularly at large grid sizes. You either need to
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
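For the second option, here is a minimal sketch of allocating and seeding your own states (my own illustration, not part of this answer; the kernel name and the per-cell seeding scheme are assumptions):

#include <curand_kernel.h>

// 'states' must point to grid_size * grid_size curandState structures
// allocated by the caller (e.g. with cudaMalloc or pycuda mem_alloc).
__global__ void init_states(curandState *states, unsigned long long seed, int grid_size)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < grid_size && y < grid_size) {
        int id = y * grid_size + x;
        curand_init(seed, id, 0, &states[id]); // one independent sequence per cell
    }
}

You would allocate one curandState per cell yourself (obtaining sizeof(curandState) from a small helper), launch init_states once with the same grid and block dimensions as survival_of_the_fittest, and then pass that allocation to the kernel instead of generator.state.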

Matrix Multiplication of matrix and its transpose in Cuda

I am relatively new to CUDA programming, so there are some unsolved issues for which I hope I can get some hints in the right direction.
So the case is that I want to multiply a 2D array with its transpose, and to be precise I want to execute the operation AᵀA.
I have already used the cuBLAS Dgemm function, and now I am trying to do the same operation with a tiled algorithm, very similar to the one from the CUDA guide.
The case is that while the initial algorithm runs properly, I want to calculate only the upper triangular matrix of the product, hoping that I could achieve a better time for the operation, and I am not sure how to extract the tiles/blocks which will contain the respective elements.
So if you could enlighten me on this, or give any hint, I would be grateful, because I have been stuck on that for a while.
This is the code of the kernel
__shared__ double Ads1[TILE_WIDTH][TILE_WIDTH];
__shared__ double Ads2[TILE_WIDTH][TILE_WIDTH];

//block row and column
//we save in registers for faster access
int by = blockIdx.y;
int bx = blockIdx.x;
int ty = threadIdx.y;
int tx = threadIdx.x;

int row = by * TILE_WIDTH + ty;
int col = bx * TILE_WIDTH + tx;
double Rvalue = 0;

if(row >= width || col >= width) return;

//Each thread block computes one sub-matrix Rsub of result R
for (int i=0; i<(int) ceil(((double) height/TILE_WIDTH)); ++i)
{
    Ads1[tx][ty] = Ad[(i * TILE_WIDTH + ty)*width + col];
    Ads2[tx][ty] = Ad[(i * TILE_WIDTH + tx)*width + row];
    __syncthreads();

    for (int j = 0; j < TILE_WIDTH; ++j)
    {
        if ((i*TILE_WIDTH + j) > height ) break; //in order not to exceed the matrix's height
        Rvalue += Ads1[j][tx]*Ads2[ty][j];
    }
    __syncthreads();
}

Rd[row * width + col] = Rvalue;
You may want to use the batched dgemm API function described here, recursively dividing your output matrix into diagonal and corner blocks. You also want to balance the smallest block size against the compute overhead, so that you avoid very small kernel invocations. Finally, note that matrix multiplication becomes memory bound below a certain size, which on modern GPUs can be fairly large.
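As a complementary sketch for the question's original goal (my own illustration, not part of this answer): in a tiled AᵀA kernel you can discard thread blocks that lie entirely below the diagonal, which removes roughly half of the work:

// Placed at the very top of the tiled kernel, before any shared-memory loads.
// The condition depends only on blockIdx, so every thread of a block takes the
// same branch and no __syncthreads() is skipped by only part of a block.
if (blockIdx.y > blockIdx.x) return; // this tile lies entirely below the diagonal

Diagonal tiles (blockIdx.x == blockIdx.y) still compute a full TILE_WIDTH x TILE_WIDTH block, so a few lower-triangular entries next to the diagonal are written as well, but roughly half of the thread blocks exit immediately.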

Tricky array arithmetics inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". In order to perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that writes some piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    //Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    //Load main data
    #pragma unroll
    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
    ...
As far as I understand this code, each thread calculates its own values of baseX and baseY, and after that all active threads increase the pointers d_Src and d_Dst simultaneously.
So, according to my understanding, this would be correct if the arrays d_Src and d_Dst were in local memory (i.e. if each thread had its own copy of these arrays). But these arrays are in global device memory! So what should happen is that all active threads increase the pointers and the result should be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer.
void foo(float* bar){
    bar++;
}

float* test = 0;
foo(test);
cout << test << endl; //will print 0
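The same thing in CUDA terms (a small sketch of my own, not from the sample): kernel parameters are passed by value, so each thread holds a private copy of d_Src and d_Dst, and advancing them affects only that thread:

__global__ void advance_demo(const float *d_Src, float *d_Dst)
{
    // each thread modifies only its own copy of the pointer arguments;
    // the global memory they point to is not changed by the += itself
    d_Src += threadIdx.x;
    d_Dst += threadIdx.x;
    *d_Dst = *d_Src;
}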

Cuda Kernel with reduction - logic errors for dot product of 2 matrices

I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices. But I am getting the right answer only for matrices of size 2. For any other size matrix, I am getting it wrong.
This is only a test, so I am keeping the matrix size very small. Only about 100 elements, so a single block fits it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my cuda kernel
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code : http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that, and take a look at why it fails for some other case, let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.
In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But this doesn't make sense and is not what we want at all. What we want is something like:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (ie. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1){
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
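For completeness, a sketch of the matching host side (my own illustration; da, db and dc are assumed to be device allocations filled as in the linked complete code, and hc is the one-element host buffer from the question):

// launch one block of n threads, then read back the reduced result
sum_reduce<<<1, n>>>(da, db, dc, n);
cudaMemcpy(hc, dc, sizeof(float), cudaMemcpyDeviceToHost);
// hc[0] now contains the dot product of ha and hb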