Tips for optimizing X_transpose*X CUDA kernel - cuda

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.

Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.

Related

why unroll can not accelerate transpose matrix?

I am following a tutorial to learn cuda now and I learn that unroll a kernel function will accelerate the program. And it indeed works when I write a function which used to summarize a array.
But when I write a function used to transpose matrix following tutorial, it dosen't work.
The origin function like below:
__global__ void transform_matrix_read_col(
int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
int ix = threadIdx.x + blockDim.x * blockIdx.x;
int iy = threadIdx.y + blockDim.y * blockIdx.y;
int row_idx = iy*col_num + ix;
int col_idx = ix*row_num + iy;
if(ix < col_num && iy < row_num){
mat_b[row_idx] = mat_a[col_idx];
}
}
and unrool function:
__global__ void transform_matrix_read_col_unrool(
int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
int ix = threadIdx.x +(blockDim.x * blockIdx.x * 4);
int iy = threadIdx.y + blockDim.y * blockIdx.y;
int row_idx = iy*col_num + ix;
int col_idx = ix*row_num + iy;
if(ix < col_num && iy < row_num){
mat_b[row_idx] = mat_a[col_idx];
mat_b[row_idx + blockDim.x*1] = mat_a[col_idx + row_num*blockDim.x*1];
mat_b[row_idx + blockDim.x*2] = mat_a[col_idx + row_num*blockDim.x*2];
mat_b[row_idx + blockDim.x*3] = mat_a[col_idx + row_num*blockDim.x*3];
}
}
and the main function:
size_t width = 128 , height = 128,
array_size = width*height,array_bytes = array_size * sizeof(int);
int* matrix_data = nullptr,*output_data = nullptr;
cudaMallocHost(&matrix_data, array_bytes);
cudaMallocHost(&output_data, array_bytes);
util::init_array_int(matrix_data,array_size);//this func will random generate some integer
int* matrix_data_dev = nullptr,* output_matrix_dev = nullptr;
cudaMalloc(&matrix_data_dev, array_bytes);
cudaMemcpy(matrix_data_dev, matrix_data, array_bytes, cudaMemcpyHostToDevice);
cudaMalloc(&output_matrix_dev, array_bytes);
dim3 block(32,16);
dim3 grid((width-1)/block.x+1,(height-1)/block.y+1);
dim3 gridUnrool4((width-1)/(block.x*4)+1,(height-1)/block.y +1);
transform_matrix_read_col<<<grid,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
transform_matrix_read_col_unrool<<<gridUnrool4,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
and the staticstis of nsys(run on linux with a rtx 3090):
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- -------- ------- ------- ---------------------------------------------------------------------------
6.3 3,456 1 3,456.0 3,456 3,456 transform_matrix_read_col_unrool(int*, int*, unsigned long, unsigned long)
5.2 2,880 1 2,880.0 2,880 2,880 transform_matrix_read_col(int*, int*, unsigned long, unsigned long)
We can see that unrool version slower a lot.
But on the tutorial , it say that unroll will acclerate transpose actually.
So What cause this problem? And how to accelerate transpose matrix ?
Unrolling only help if the computation is compute bound so that a higher (useful) instruction throughput can decrease the execution time. Memory-bound code tends not to be much faster once unrolled because memory-bound instruction are slowed down by the contention of the memory controller.
A transposition may not seem memory-bound at first glance because of a low apparent memory throughput, but one need to care about cache lines. Indeed, when a single value is requested from memory from the user code, the hardware actually request a pretty big cache line for (subsequent) contiguous accesses to be fast.
Another consideration to take into account is that the code can also be latency bound. Indeed, the inefficient strided accesses can be slow due to the memory latency. The memory controller may not be able to fully saturate the RAM in this case (although this is quite unlikely on GPUs, especially regarding the large cache lines). If so, adding more instruction do not help because they are typically executed in an in-order way as opposed to modern mainstream CPUs. Using larger blocks and more blocks helps to provide more parallelism to the GPUs which can then perform more concurrent memory accesses and possibly better use the memory.
The key with the transposition is to make accesses as contiguous as possible and reuse cache lines. The most critical thing is to operate on small 2D blocks and not on full row/lines (ie. not a 1D kernel) to increase the cache locality. Moreover, one efficient well-known solution is to use the shared memory: each threads of a CUDA block fetch a part of a 2D array block and can then perform the transposition in shared memory possibly more efficiently. It is not so easy due to possible shared memory conflicts that can impact performance. Fortunately, there are few research papers and articles talking about that since the last decades.
The simplest efficient solution to this problem is basically to use cuBLAS which is heavily optimized. This post may also be useful.
Note that a 128x128 transposition is very small for a GPU. GPUs are designed to compute bigger datasets (or far more expensive computations on such small input). If the input array is initially stored on the CPU, then I strongly advise you to do that directly on the CPU as moving data on the GPU will likely be already slower than computing the transposition efficiently on the CPU directly. Indeed, data cannot be moved faster than the main RAM permit and a 128x128 transposition can be implemented in a way it saturate the main RAM (in fact, it can be likely done directly in the CPU caches that are significantly faster than the main RAM).

dot_product with CUDA_CUB

__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
// --- Specialize BlockReduce for type float.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
// --- Allocate temporary storage in shared memory
__shared__ typename BlockReduceT::TempStorage temp_storage;
float result;
if(tid < N) result = BlockReduceT(temp_storage).Sum(indata[tid]);
// --- Update block reduction value
if(threadIdx.x == 0) outdata[blockIdx.x] = result;
return;
}
I have tested the reduction sum(as shown in above code snippet) with cuda cub successfully, I want to perform the inner product of two vectors based on this code. But I have some confusions about it:
We need two input vectors for the inner_product, need I to conduct a component-wise multiplication of this two input vectors before the reduction sum on the resulting new vector.
In the code examples of the cuda cub, the dimension of input vectors is equal to the blocknumber*threadnumber. what if we have a very large vector.
Yes, with cub, and assuming your vectors were stored separately (i.e. not interleaved), you would need to do an element-wise multiplication first. On the other hand, thrust transform_reduce could handle it in a single function call.
blocknumber*threadnumber should give you all the range you need. on a cc3.0 or higher GPU, blocknumber (i.e. gridDim.x) can range up to 2^31-1 and threadnumber (i.e. blockDim.x) can range up to 1024. This gives you the possibility to handle 2^40 elements. If each element is 4 bytes, this would constitute (i.e. require) 2^42 bytes. That is about 4TB (or double that if you are considering 2 input vectors), which is much larger than any GPU memory currently. So you will run out of GPU memory space before you run out of grid dimension.
Note that what you are showing is cub::BlockReduce. However if you are doing a vector dot product of two large vectors, you might want to use cub::DeviceReduce instead.

CUDA: writing to shared memory increses kernel time execution a lot

I am trying to reduce 65536 elements array (calculate sum of elements in it) with help of CUDA. Kernel looks like following (please, ignore *dev_distanceFloats and index arguments for now)
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
}
ant it launched as one block with 256 threads:
kernel_calcSum <<<1, 256 >>>(dev_spFloats1, dev_distanceFloats, index);
So far, so good, each of 256 threads takes 256 elements from global memory and calculates it's sum in local variable mySum. Kernel execution time is about 45 milliseconds.
Next step is to introduce shared memory among those 256 threads in block (to calculate sum of mySum), so kernel becomes as following:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = mySum;
}
I just added writing to shared memory, but execution time increases from 45 milliseconds to 258 milliseconds (I am cheking this with help of NVidia Visual Profiler 5.0.0).
I realize that there are 8 bank conflicts for each thread when writing to sdata variable (I am on GTX670 which have capability 3.0 with 32 banks). As an experiment - I tried to reduce of threads to 32 when launching kernel - but time still 258 milliseconds.
Question 1: why writing to shared memory takes so long in my case ?
Question 2: is there any tool, which show in details kinda "execution plan" (timings for memory access, conflicts, etc) ?
Thanks for your suggestions.
Update:
playing with kernel - I set sdata to some constant for each thread:
__global__ void kernel_calcSum(float *d, float *dev_distanceFloats, int index) {
__shared__ float sdata[256];
int tid = threadIdx.x;
float mySum = 0;
for (int e = 0; e < 256; e++) {
mySum += d[tid + e];
}
sdata[tid] = 111;
}
and timings are back to 48 millisec.
So, changing
sdata[tid] = mySum;
to
sdata[tid] = 111;
made this.
Is this compiler optimization (may be it just removed this line?) or by some reason copying from local memory (register?) to shared takes long?
Both of your kernels do not do anything, because they do not write out results to memory that would still be accessible after the kernel finishes.
In the first case, the compiler is clever enough to notice this and optimize away the whole calculation.
In the second case where shared memory is involved, the compiler does not notice this as the flow of information through shared memory would be harder to track. It thus leaves the calculation in.
Pass in a pointer to global memory (as you already do) and write out results via this pointer.
Shared memory is not the right thing for this. What you need are warp atomic operations, to sum up within a warp, then communicate the intermediary results between warps. There's example code demonstrating this shipping with CUDA.
Summing up elements is one of those tasks where massive parallization won't help much and the GPU in fact can be outperformed by a CPU.

CUDA: Thread and Array Allocation

I have read many times about CUDA Thread/Blocks and Array, but still don't understand point: how and when CUDA starts to run multithread for kernel function. when host calling kernel function, or inside kernel function.
For example I have this example, It just simple transpose an array. (so, it just copy value from this array to another array).
__global__
void transpose(float* in, float* out, uint width) {
uint tx = blockIdx.x * blockDim.x + threadIdx.x;
uint ty = blockIdx.y * blockDim.y + threadIdx.y;
out[tx * width + ty] = in[ty * width + tx];
}
int main(int args, char** vargs) {
/*const int HEIGHT = 1024;
const int WIDTH = 1024;
const int SIZE = WIDTH * HEIGHT * sizeof(float);
dim3 bDim(16, 16);
dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
float* M = (float*)malloc(SIZE);
for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
float* Md = NULL;
cudaMalloc((void**)&Md, SIZE);
cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
float* Bd = NULL;
cudaMalloc((void**)&Bd, SIZE); */
transpose<<<gDim, bDim>>>(Md, Bd, WIDTH); // CALLING FUNCTION TRANSPOSE
cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
return 0;
}
(I have commented all lines that not important, just have the line calling function transpose)
I have understand all lines in function main except the line calling function tranpose. Does it true when I say: when we call function transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA will automatically assign each elements of array into one thread (and block), and when we calling "one time" tranpose, CUDA will running gDim * bDim times tranpose on gDim * bDim threads.
This point makes me feel frustrated so much, because it doesn't like multithread in java, when I use :( Please tell me.
Thanks :)
Your understanding is in essence correct.
transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the tripple brackets (<<< >>>) is called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.
The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.
Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets, you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.
So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to to determine what a given thread should do (in particular, which values in your application specific data it should work on).

CUDA efficient division?

I would like to know if there is, by any chance an efficient way of dividing elements of an array. I am running with matrix values 10000x10000 and it a considerable amount of time in comparison with other kernels. Division are expensive operations, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
int row = blockIdx.x * blockDim.x + threadIdx.x;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if((row < N) && (col <= row) ){
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
kernel launched with
int N = 10000;
int threads = 32
int blocks = (N+threads-1)/threads
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
int k = blockIdx.x * blockDim.x + threadIdx.x;
int kmax = N*(N+1)/2
int i,j;
if(k< kmax){
row = (int)(sqrt(0.25+2.0*k)-0.5);
col = k - (row*(row+1))>>1;
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
launched with
int threads =192;
int totalThreadsNeeded = (N*(N+1)/2;
int blocks = ( threads + (totalThreadsNeeded)-1 )/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even if the threadIds are the correct one? what is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So a enormous amount of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using isn't allowing coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on a lower-triangle of a square array. If this is the case, it could be equally done using something like this:
__global__
void division(int N, float* A, int* B)
{
for(int row=blockIdx.x; row<N; row+=gridDim.x) {
for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
int val = max(1,B[row*N+col]);
A[row*N+col] /= (float)val;
}
}
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one dimension grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SM with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel. (a / b == a * 1 / b).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here, just exchange rows and columns for to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.