CUDA sub-matrix

Problem:
I have 4 matrices (64x64) of single-precision numbers and need to do a calculation like:
R = A * sin(B) + C * cos(D)
Idea:
To speed up the calculation, use shared memory. Since each block of threads has (on my GPU) 16 KB of shared memory and a float is 4 bytes, about 4000 floating-point numbers can be stored in shared memory. So each matrix can use about 1000 elements, which is roughly 31 elements per dimension.
So each matrix should be divided into 16 sub-matrices (16x16).
dim3 dimBlock(16, 16, 1)
dim3 dimGrid(4, 4, 1)
kernel:
int Tx = threadIdx.x;
int Ty = threadIdx.y;
int Bx = blockIdx.x;
int By = blockIdx.y;
int idx = Bx * blockDim.x + Tx;
int idy = By * blockDim.y + Ty;
__shared__ float s_A[16*16];
__shared__ float s_B[16*16];
__shared__ float s_C[16*16];
__shared__ float s_D[16*16];
// I am not sure how to write this part
s_A[(Tx * blockDim.x + Ty + By) + Bx] = A[idx * 64 + idy];
s_B[(Tx * blockDim.x + Ty + By) + Bx] = B[idx * 64 + idy];
s_C[(Tx * blockDim.x + Ty + By) + Bx] = C[idx * 64 + idy];
s_D[(Tx * blockDim.x + Ty + By) + Bx] = D[idx * 64 + idy];
R[idx * 64 + idy] = s_A[(Tx * blockDim.x + Ty + By) + Bx] * sin(s_B[(Tx * blockDim.x + Ty + By) + Bx])
                  + s_C[(Tx * blockDim.x + Ty + By) + Bx] * cos(s_D[(Tx * blockDim.x + Ty + By) + Bx]);
How do I divide the original matrices into sub-matrices so that each block has its own 4 sub-matrices and computes on them?

Unless I have misinterpreted your question, you don't need to and shouldn't use shared memory for this operation. Shared memory is useful for sharing and reusing data between threads within the same block, and for facilitating coalesced memory access. Your operation seems to require neither of those things to work correctly. Using shared memory in the way you propose would probably be slower than just reading from global memory directly. Also, because you are only doing element-wise operations, the indexing scheme of your kernel can be greatly simplified -- the fact that A, B, C and D are "matrices" is irrelevant to the calculation as I understand your question.
As a result, a near-optimal version of your kernel could be written as simply as:
__global__ void kernel(const float *A, const float *B, const float *C,
                       const float *D, const int n, float *R)
{
    // Grid-stride loop: each thread covers elements tidx, tidx+stride, ...
    int tidx = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while(tidx < n) {
        R[tidx] = A[tidx] * sinf(B[tidx]) + C[tidx] * cosf(D[tidx]);
        tidx += stride;
    }
}
In this code, you would launch as many blocks as would reach peak throughput of your GPU, and each thread will process more than one input/output value if the size of the array exceeds the size of the optimal 1D grid you have launched. Of course this is pretty academic if you are only processing 4096 elements in total -- that is probably about 2 orders of magnitude too small to get any benefit from using a GPU.
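To make that launch sizing concrete, here is a minimal host-side sketch (my own illustration, not part of the original answer) that sizes the 1D grid from the occupancy API instead of from the problem size; kernel and n refer to the kernel above, while the block size and variable names are assumptions:
// Sketch: pick a grid size that fills the device once; the grid-stride loop
// in the kernel then covers any n.
int blockSize = 256;                        // assumed block size
int device = 0, numSM = 0, blocksPerSM = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, device);
// Maximum number of resident blocks of this kernel per SM at this block size.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);
int numBlocks = blocksPerSM * numSM;        // enough blocks to saturate the GPU
kernel<<<numBlocks, blockSize>>>(A, B, C, D, n, R);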

You have a problem here in that your operation/transfer ratio is of order 1. You might have a hard time actually getting any decent speed from your GPU because of the bandwidth bottleneck between the threads and global memory, with no way to reduce it.
A shared-memory solution is usually best when some data is read repeatedly from global memory. Instead of loading this data repeatedly from the low-bandwidth, high-latency global memory, you load it once from there and do subsequent loads from the higher-bandwidth, lower-latency shared memory. Note that's higher and lower, not high and low: there is still a performance penalty for using shared memory.
In your case, since elements aren't read several times from global memory, storing them in shared memory will only add the bandwidth limitations and latency that come with shared memory usage. So, in effect, this solution would just add the latency of shared-memory access on top of your data loading.
Now, if you have several calculations to perform, and some of these matrices are used in them too, then combining them into one kernel might give you a speed boost, since you might be able to load the data once for the whole thing instead of once per operation. If that's not the case, and you can't increase your operation/transfer ratio, then you'll have a hard time getting decent speeds and might be better off doing these calculations on the CPU.
You might even get decent results from multithreading on the CPU.
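To make the fusion idea concrete, here is a minimal sketch (my own illustration, not code from the question): a hypothetical second result R2 that reuses A and B is computed in the same kernel, so A and B are read from global memory only once for both outputs.
// Hypothetical fused kernel: two results per pass, each input loaded once.
__global__ void fused(const float *A, const float *B, const float *C,
                      const float *D, const int n, float *R1, float *R2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = A[i], b = B[i];      // load once, reuse from registers
        R1[i] = a * sinf(b) + C[i] * cosf(D[i]);
        R2[i] = a * b;                 // assumed second operation on the same inputs
    }
}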

Related

Why can't unrolling accelerate my matrix transpose?

I am following a tutorial to learn CUDA, and I learned that unrolling a kernel function will accelerate the program. It indeed works when I write a function that sums an array.
But when I write a function to transpose a matrix following the tutorial, it doesn't work.
The original function looks like this:
__global__ void transform_matrix_read_col(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x + blockDim.x * blockIdx.x;
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx] = mat_a[col_idx];
    }
}
and the unrolled function:
__global__ void transform_matrix_read_col_unrool(
    int* mat_a , int* mat_b , size_t row_num , size_t col_num
){
    int ix = threadIdx.x + (blockDim.x * blockIdx.x * 4);
    int iy = threadIdx.y + blockDim.y * blockIdx.y;
    int row_idx = iy*col_num + ix;
    int col_idx = ix*row_num + iy;
    if(ix < col_num && iy < row_num){
        mat_b[row_idx]                  = mat_a[col_idx];
        mat_b[row_idx + blockDim.x*1]   = mat_a[col_idx + row_num*blockDim.x*1];
        mat_b[row_idx + blockDim.x*2]   = mat_a[col_idx + row_num*blockDim.x*2];
        mat_b[row_idx + blockDim.x*3]   = mat_a[col_idx + row_num*blockDim.x*3];
    }
}
and the main function:
size_t width = 128 , height = 128,
array_size = width*height,array_bytes = array_size * sizeof(int);
int* matrix_data = nullptr,*output_data = nullptr;
cudaMallocHost(&matrix_data, array_bytes);
cudaMallocHost(&output_data, array_bytes);
util::init_array_int(matrix_data,array_size);//this func will random generate some integer
int* matrix_data_dev = nullptr,* output_matrix_dev = nullptr;
cudaMalloc(&matrix_data_dev, array_bytes);
cudaMemcpy(matrix_data_dev, matrix_data, array_bytes, cudaMemcpyHostToDevice);
cudaMalloc(&output_matrix_dev, array_bytes);
dim3 block(32,16);
dim3 grid((width-1)/block.x+1,(height-1)/block.y+1);
dim3 gridUnrool4((width-1)/(block.x*4)+1,(height-1)/block.y +1);
transform_matrix_read_col<<<grid,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
transform_matrix_read_col_unrool<<<gridUnrool4,block>>>(matrix_data_dev, output_matrix_dev, height, width);
cudaDeviceSynchronize();
and the statistics from nsys (run on Linux with an RTX 3090):
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- -------- ------- ------- ---------------------------------------------------------------------------
6.3 3,456 1 3,456.0 3,456 3,456 transform_matrix_read_col_unrool(int*, int*, unsigned long, unsigned long)
5.2 2,880 1 2,880.0 2,880 2,880 transform_matrix_read_col(int*, int*, unsigned long, unsigned long)
We can see that the unrolled version is quite a lot slower.
But the tutorial says that unrolling actually accelerates the transpose.
So what causes this? And how can I accelerate the matrix transpose?
Unrolling only helps if the computation is compute-bound, so that a higher (useful) instruction throughput can decrease the execution time. Memory-bound code tends not to get much faster once unrolled, because memory-bound instructions are slowed down by contention on the memory controller.
A transposition may not seem memory-bound at first glance because of a low apparent memory throughput, but one needs to care about cache lines. Indeed, when a single value is requested from memory by the user code, the hardware actually fetches a pretty big cache line so that (subsequent) contiguous accesses are fast.
Another consideration is that the code can also be latency-bound. Indeed, the inefficient strided accesses can be slow due to memory latency. The memory controller may not be able to fully saturate the RAM in this case (although this is quite unlikely on GPUs, especially considering the large cache lines). If so, adding more instructions does not help, because they are typically executed in order, as opposed to modern mainstream CPUs. Using larger and more numerous blocks helps to provide more parallelism to the GPU, which can then perform more concurrent memory accesses and possibly make better use of the memory.
The key to the transposition is to make accesses as contiguous as possible and to reuse cache lines. The most critical thing is to operate on small 2D tiles rather than on full rows/columns (i.e. not a 1D kernel) to increase cache locality. Moreover, one efficient, well-known solution is to use shared memory: each thread of a CUDA block fetches part of a 2D tile, and the transposition can then be performed in shared memory, possibly more efficiently. This is not entirely trivial because of shared-memory bank conflicts that can impact performance. Fortunately, there are a few research papers and articles about this from the last decades.
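To make the shared-memory tile idea concrete, here is a minimal sketch (my own illustration, not from the tutorial) for the int matrices of the question; it uses a 32x32 tile with one column of padding to avoid bank conflicts, and the names transpose_tiled, in and out are assumptions:
__global__ void transpose_tiled(const int* in, int* out, int rows, int cols)
{
    // One extra column of padding avoids shared-memory bank conflicts
    // when the tile is read back column-wise.
    __shared__ int tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;   // column in the input
    int y = blockIdx.y * 32 + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

    __syncthreads();

    // Write the tile out transposed: both the load above and the store below
    // have consecutive threads touching consecutive global addresses.
    int tx = blockIdx.y * 32 + threadIdx.x;  // column in the output
    int ty = blockIdx.x * 32 + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}

// Possible launch for the 128x128 case from the question:
//   dim3 block(32, 32);
//   dim3 grid((width + 31) / 32, (height + 31) / 32);
//   transpose_tiled<<<grid, block>>>(matrix_data_dev, output_matrix_dev, height, width);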
The simplest efficient solution to this problem is basically to use cuBLAS which is heavily optimized. This post may also be useful.
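For reference, a minimal sketch of the cuBLAS route using cublasSgeam is shown below; it assumes float matrices (geam does not operate on the int matrices of the question) and linking against -lcublas, and the wrapper name and parameters are my own:
#include <cublas_v2.h>

// Out-of-place transpose of a row-major rows x cols float matrix d_in into d_out,
// using C = alpha*op(A) + beta*op(B). A row-major rows x cols array is the same
// memory as a column-major cols x rows matrix, so op(A) = A^T yields the transpose.
void transpose_with_cublas(const float* d_in, float* d_out, int rows, int cols)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                rows, cols,           // dimensions of the result C (column-major)
                &alpha, d_in, cols,   // A and its leading dimension
                &beta,  d_in, rows,   // B is ignored because beta == 0
                d_out, rows);         // C and its leading dimension
    cublasDestroy(handle);
}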
Note that a 128x128 transposition is very small for a GPU. GPUs are designed for bigger datasets (or far more expensive computations on such small inputs). If the input array is initially stored on the CPU, then I strongly advise you to do the transposition directly on the CPU, as moving the data to the GPU will likely already be slower than computing the transposition efficiently on the CPU. Indeed, data cannot be moved faster than the main RAM permits, and a 128x128 transposition can be implemented so that it saturates the main RAM (in fact, it can likely be done directly in the CPU caches, which are significantly faster than the main RAM).


Why do we need stride in CUDA kernel?

I was wondering: why does one need to use a grid-stride loop, as in the following?
for (int i = index; i < ITERATIONS; i += stride)
{
C[i] = A[i] + B[i];
}
Where we set stride and index to:
index = blockIdx.x * blockDim.x + threadIdx.x;
stride = blockDim.x * gridDim.x;
When calling kernel we have this:
int blockSize = 5;
int ITERATIONS = 20;
int numBlocks = (ITERATIONS + blockSize - 1) / blockSize;
bench<<<numBlocks, blockSize>>>(A, B, C);
So when we launch the kernel we will have blockDim.x = 5 and gridDim.x = 4, and therefore stride will be equal to 20.
My point is that whenever one uses such an approach, stride will always be equal to or bigger than the number of elements in the calculation, so as soon as the increment happens the loop will be over.
And here is the question: why does one need the loop or the stride at all? Why not just run with the index, like this?
index = blockIdx.x * blockDim.x + threadIdx.x;
C[index] = A[index] + B[index];
And another question: how can I know, in this particular case, how many threads are running on my GPU simultaneously before it "jumps" to another portion of a very big array (e.g. 2,000,000 elements)?
"My point is that whenever one uses such an approach, stride will always be equal to or bigger than the number of elements in the calculation, so as soon as the increment happens the loop will be over."
There lies the problem with your understanding. To use that kernel effectively, you only need to run as many blocks as will achieve maximal device wide occupancy for your device, not as many blocks as are required to process all your data. Those fewer blocks then become "resident" and process more than one input/output pair per thread. The grid stride also preserves whatever memory coalescing and cache coherency properties the kernel might have.
By doing this, you eliminate overhead from scheduling and retiring blocks. There can be considerable efficiency gains in simple kernels by doing so. There is no other reason for this design pattern.
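As a small illustration of the "resident blocks" point (my own sketch, not part of the original answer; the extra n parameter, the block count and the grid size are assumptions):
__global__ void bench(const float* A, const float* B, float* C, int n)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Each thread handles elements index, index + stride, index + 2*stride, ...
    for (int i = index; i < n; i += stride)
        C[i] = A[i] + B[i];
}

// Hypothetical launch: the grid is sized for the device, not for n.
// With n = 2,000,000 and, say, 160 resident blocks of 256 threads,
// each thread iterates the loop roughly 2,000,000 / (160 * 256) ≈ 49 times.
int blockSize = 256;
int numBlocks = 160;   // assumed: enough blocks to keep every SM busy
bench<<<numBlocks, blockSize>>>(A, B, C, n);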

Cuda block/grid dimensions: when to use dim3?

I need some clearing up regarding the use of dim3 to set the number of threads in my CUDA kernel.
I have an image in a 1D float array, which I'm copying to the device with:
checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));
Now I need to set the grid and block sizes to launch my kernel:
dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)
I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using
unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<<num_blocks, 512>>>(...)
instead?
Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
And when I'm not using dim3, I'll just use one index?
Thank you very much,
The way you arrange the data in memory is independent of how you configure the threads of your kernel.
Memory is always a 1D contiguous space of bytes. However, the access pattern depends on how you interpret your data and on how you access it, with 1D, 2D or 3D blocks of threads.
dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
The same happens for the blocks and the grid.
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
So, in both cases, dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...), you will always have access to threadIdx.y and threadIdx.z.
Since the thread IDs start at zero, you can still calculate a memory position in row-major order using the y dimension as well:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = img.col * y + x;
because blockIdx.y and threadIdx.y will always be zero.
To sum up, it does not matter whether you use a dim3 structure. However, I would make it clear where the thread configuration is defined, and remember that the 1D, 2D or 3D access pattern depends on how you interpret your data and on how you access it with 1D, 2D or 3D blocks of threads.
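For illustration, a minimal sketch of the two equivalent 1D launches (myKernel's parameter list and the variable names here are assumptions, not taken from the question):
int n = img.row * img.col * 3;               // total number of floats, as in the question
dim3 blockDims(512);                         // blockDims.y and blockDims.z default to 1
dim3 gridDims((n + blockDims.x - 1) / blockDims.x);
myKernel<<<gridDims, blockDims>>>(img_d, n);

unsigned int num_blocks = (n + 511) / 512;   // same grid size, expressed as a plain int
myKernel<<<num_blocks, 512>>>(img_d, n);

// In both cases a single 1D index is enough inside the kernel:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < n) { /* operate on element i */ }
Note that the integer ceiling (n + blockDims.x - 1) / blockDims.x also sidesteps the truncation pitfall of applying ceil() to an integer division, as in the question's gridDims calculation.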

CUDA efficient division?

I would like to know if there is, by any chance, an efficient way of dividing the elements of an array. I am running with 10000x10000 matrices, and this kernel takes a considerable amount of time compared with the other kernels. Division is an expensive operation, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row)){
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
The kernel is launched with:
int N = 10000;
int threads = 32;
int blocks = (N+threads-1)/threads;
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int kmax = N*(N+1)/2;
    int row, col;
    if(k < kmax){
        row = (int)(sqrt(0.25+2.0*k)-0.5);
        col = k - (row*(row+1))>>1;
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
launched with:
int threads = 192;
int totalThreadsNeeded = N*(N+1)/2;
int blocks = (totalThreadsNeeded + threads - 1)/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even though the thread IDs are correct? What is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then, because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So an enormous amount of GPU cycles is being wasted for no particularly good reason. Furthermore, the access pattern you are using doesn't allow coalesced memory access, which further reduces the performance of the threads that are actually doing useful work.
If I understand your problem correctly, the kernel only performs element-wise division on the lower triangle of a square array. If that is the case, it could equally be done using something like this:
__global__
void division(int N, float* A, int* B)
{
    for(int row=blockIdx.x; row<N; row+=gridDim.x) {
        for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
            int val = max(1, B[row*N+col]);
            A[row*N+col] /= (float)val;
        }
    }
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one-dimensional grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In the comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SMs with a full complement of 8 blocks and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you pay for the division for all 32 threads in a warp even if the conditions are true for just one of them. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int-to-float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel (a / b == a * (1 / b)).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
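As a small sketch of that last suggestion, and assuming the reduced accuracy of the fast intrinsic is acceptable for your data, the division statement inside the kernel could be swapped for __fdividef:
// Same guard as before, but using the fast (approximate) single-precision
// division intrinsic instead of the IEEE-compliant '/' operator.
if( B[row*N+col] > 0 )
    A[row*N+col] = __fdividef(A[row*N+col], (float)B[row*N+col]);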
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here; just exchange rows and columns to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.
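Put together, a minimal sketch of the original kernel with the indices swapped (my own assembly of the suggestion above, launched with the same dim3 b and t as in the question):
__global__ void division_coalesced(int N, float* A, int* B){
    // threadIdx.x now walks along a row, so a warp touches contiguous addresses.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row)){
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}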