Cuda Efficient Matrix Addition - cuda

I am new to cuda and learning GPU programming. I want to add two nxm matrices (float* A and float* B) and store the results in float* C in the kernel. The goal is to get the fastest implementation. I have the following questions:
I was wondering how to arange the blocks and grid to get the best performance ( for both small and large n and m)
It is good to assign one thread to each element of matrices. However, for large n and m it is not possible. What is the best option then?
How can matrix padding improve the performance?

1: A simple method would be to store the matrix as a vector/array of floats where the rows are concatenated. Then you could just use a large number of threads per block and the smallest necessary number of blocks. Here is an example how the kernel could look like.
2: You basically can have a infinite number of threads, as long as the size of the matrix doesn't exceed the free memory on your GPU. They won't be executed simultaneously, but the driver will schedule them for you and you don't have to care about it.
A thread per element generally works good, if you want to try another way, have a look at Grid Stride Loops, which is a scalable method to organize your elements in less threads.
3: I don't see how padding would improve the performance as you get more elements to copy and calculate, but I'm no expert for that.

Related

CUDA thread and block organization directions

In CUDA programming, threads and blocks have multiple directions (x, y and z).
Until now, I ignored this and only took into account the x direction (threadIdx.x, blockIdx.x, blockDim.x, etc.).
Apparently, both threads within a block and blocks on the grid are arranged as a cube. However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that? Only using the x direction, am I able to address all threads available to my GPU?
Only using the x direction, am I able to address all threads available to my GPU?
If we are talking about a desire to spin up ~2 trillion threads or less, then there is no particular requirement to use a multidimensional block, or grid. All CUDA GPUs of compute capability 3.0 and higher can launch up to about 2 billion blocks (2^31-1) with 1024 threads each, using a 1-D grid organization.
With methodologies like grid-stride loop it seems rare to me that more than ~2 trillion threads would be needed.
I claim without formal proof that any problem that can be realized in a 1D grid can be realized in a 2D or 3D grid, or vice versa. This is just a mathematical mapping from one realization to another. Furthermore, it should be possible to arrange for important by-products like coalesced access in either realization.
There may be some readability benefits, code complexity benefits, and possibly small performance considerations when realizing in a 1D or multi-dimensional way. The usual case for this that I can think of is when the data to be processed is "inherently" multi-dimensional. In this case, letting the CUDA engine generate 2 or 3 distinct indices for you:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
might be simpler than using a 1D grid index, and computing 2D data indices from those:
int tid = threadIdx.x+blockDim.x*blockIdx.x;
int idx = tid%DATA_WIDTH;
int idy = tid/DATA_WIDTH;
(the integer division operation above is unavoidable in the general case. The modulo operation can be simplified by using the result from the integer division.)
It's arguably an extra line of code and an extra division operation required to get to the same point, when only a 1D grid is created. However I would suggest that even this is small potatoes, and you should use whichever approach seems most reasonable and comfortable to you as a programmer.
If for some reason you desire to spin up more than ~2 Trillion threads, then moving to a multidimensional grid, at least, is unavoidable.
Apparently, both threads within a block and blocks on the grid are arranged as a cube.
To understand how the threadblock thread index is computed in any case, I refer you to the programming guide. It should be evident that one case can be made equivalent to another - each thread gets a unique thread ID no matter how you specify the threadblock dimensions. In my opinion, a threadblock should only be thought of as a "cube" of threads (i.e. 3-dimensional) if you specify the configuration that way:
dim3 block(32,8,4); //for example
However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that?
If you only used a single threadblock dimension to create a thread index in the 32,8,4 case:
int tid = threadIdx.x;
then you certainly would be "addressing" multiple threads (in y, and z) using that approach. That would typically, in my experience, be "broken" code. Therefore a kernel designed to use a multidimensional block or grid may not work correctly if the block or grid is specified as 1 dimensional, and the reverse statement is also true. You can find examples of such problems (thread index calculation not being correct for the grid design) here on the cuda tag.

Why order of dimension makes big difference in performance?

To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is opt to the user, for example, it could mean (width, height) or (rows, cols), which has the meaning reversed.
So I did an experiment with the CUDA sample in the SDK: 3_Imaging/convolutionSeparable, simply exchage .x and .y in the kernel function, and reverse the dimensions of blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same, however, the performance decreases, the original one takes about 26ms, while the modified one takes about 40ms on my machine(SM 3.0).
My question is, what makes the difference? is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, no convolutionColumns
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to the main memory so the
access is as not coalesced as in the original case.
You should think about the GPU main memory in the same way as it was
a "CPU" memory, i.e., prefetching, blocking, sequential accesses...
techniques to applies in order to get performance.
If you want to know more about this topic, it is mandatory to read
this paper. What every programmer should know about memory.
You'll find an example a comparison between row and column major
access to the elements of a matrix there.
To get and idea on how important this is, think that most -if not
all- GPU high performance codes perform a matrix transposition
before any computation in order to achieve a more coalesced memory
access, and still this additional step worths in terms on
performance. (sparse matrix operations, for instance)
Second. This is more subtle, but in some scenarios it has a deep impact on the performance of a kernel; the launching configuration. It is not the same launching 20 blocks of 10 threads as launching 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers,...). The more resources a thread needs the less warps can be mapped on a single SM so the less occupancy... and the -most of the times- less performance.
This not applies to your question, since the number of blocks is equal to the number of threads.
When programming for GPUs you must be aware of the architecture in order to understand how that changes will modify the performance. Of course, I am not familiar with the code so there will be others factors among these two.

CUDA: How to find index of extrema in sub matrices?

I have a large rectangular matrix NxM in GPU memory, stored as 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n and same with M and m. Let us say, the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.
I can hardly imagine to be so self-confident (or arrogant?) to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
In contrast to that, in the second approach, each kernel (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum). So the kernel would be launched for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.

How to compute which SM given thread is running at?

I am CUDA beginner.
So far I learned, that each SM have 8 blocks (of threads). Let's say I have simple job of multiplying elements in array by 2. However, I have less data than threads.
Not a problem, because I could cut off the "tail" of threads to make them idle. But if I understand correctly this would mean some SMs would get 100% of work, and some part (or even none).
So I would like to calculate which SM is running given thread and make computation in such way, that each SM has equal amount of work.
I hope it makes sense in the first place :-) If so, how to compute which SM is running given thread? Or -- index of current SM and total numbers of them? In other words, equivalent on threadDim/threadIdx in SM terms.
Update
For comment it was too long.
Robert, thank you for your answer. While I try to digest all, here is what I do -- I have a "big" array and I simply have to multiply the values *2 and store it to output array (as a warmup; btw. all computations I do, mathematically are correct). So first I run this in 1 block, 1 thread. Fine. Next, I tried to split work in such way that each multiplication is done just once by one thread. As the result my program runs around 6 times slower. I even sense why -- small penalty for fetching the info about GPU, then computing how many blocks and threads I should use, then within each thread instead of single multiplications now I have around 10 extra multiplications just to compute the offset in the array for a thread. On one hand I try to find out how to change that undesired behaviour, on the other I would like to spread the "tail" of threads among SMs evenly.
I rephrase -- maybe I am mistaken, but I would like to solve this. I have 1G small jobs (*2 that's all) -- should I create 1K blocks with 1K threads, or 1M blocks with 1 thread, 1 block with 1M threads, and so on. So far, I read GPU properties, divide, divide, and use blindly maximum values for each dimension of grid/block (or required value, if there is no data to compute).
The code
size is the size of the input and output array. In general:
output_array[i] = input_array[i]*2;
Computing how many blocks/threads I need.
size_t total_threads = props.maxThreadsPerMultiProcessor
* props.multiProcessorCount;
if (size<total_threads)
total_threads = size;
size_t total_blocks = 1+(total_threads-1)/props.maxThreadsPerBlock;
size_t threads_per_block = 1+(total_threads-1)/total_blocks;
Having props.maxGridSize and props.maxThreadsDim I compute in similar manner the dimensions for blocks and threads -- from total_blocks and threads_per_block.
And then the killer part, computing the offset for a thread ("inside" the thread):
size_t offset = threadIdx.z;
size_t dim = blockDim.x;
offset += threadIdx.y*dim;
dim *= blockDim.y;
offset += threadIdx.z*dim;
dim *= blockDim.z;
offset += blockIdx.x*dim;
dim *= gridDim.x;
offset += blockIdx.y*dim;
dim *= gridDim.y;
size_t chunk = 1+(size-1)/dim;
So now I have starting offset for current thread, and the amount of data in array (chunk) for multiplication. I didn't use grimDim.z above, because AFAIK is alway 1, right?
It's an unusual thing to try to do. Given that you are a CUDA beginner, such a question seems to me to be indicative of attempting to solve a problem improperly. What is the problem you are trying to solve? How does it help your problem if you are executing a particular thread on SM X vs. SM Y? If you want maximum performance out of the machine, structure your work in a way such that all thread processors and SMs can be active, and in fact that there is "more than enough work" for all. GPUs depend on oversubscribed resources to hide latency.
As a CUDA beginner, your goals should be:
create enough work, both in blocks and threads
access memory efficiently (this mostly has to do with coalescing - you can read up on that)
There is no benefit to making sure that "each SM has an equal amount of work". If you create enough blocks in your grid, each SM will have an approximately equal amount of work. This is the scheduler's job, you should let the scheduler do it. If you do not create enough blocks, your first objective should be to create or find more work to do, not to come up with a fancy work breakdown per block that will yield no benefit.
Each SM in the Fermi GPU (for example) has 32 thread processors. In order to keep these processors busy even in the presence of inevitable machine stalls due to memory accesses and the like, the machine is designed to hide latency by swapping in another warp of threads (32) when a stall occurs, so that processing can continue. In order to facilitate this, you should try to have a large number of available warps per SM. This is facilitated by having:
many threadblocks in your grid (at least 6 times the number of SMs in the GPU)
multiple warps per threadblock (probably at least 4 to 8 warps, so 128 to 256 threads per block)
Since a (Fermi) SM is always executing 32 threads at a time, if I have fewer threads than 32 times the number of SMs in my GPU at any instant, then my machine is under-utilized. If my entire problem is only composed of, say, 20 threads, then it's simply not well designed to take advantage of any GPU, and breaking those 20 threads up into multiple SMs/threadblocks is not likely to have any appreciable benefit.
EDIT: Since you don't want to post your code, I'll make a few more suggestions or comments.
You tried to modify some code, found that it runs slower, then jumped to (I think) the wrong conclusion.
You should probably familiarize yourself with a simple code example like vector add. It's not multiplying each element, but the structure is close. There's no way performing this vector add using a single thread would actually run faster. I think if you study this example, you'll find a straightforward way to extend it to do array element multiply-by-2.
Nobody computes threads per block the way you have outlined. First of all, threads per block should be a multiple of 32. Secondly, it's customary to pick threads per block as a starting point, and build your other launch parameters from it, not the other way around. For a large problem, just start with 256 or 512 threads per block, and dispense with the calculations for that.
Build your other launch parameters (grid size) based on your chosen threadblock size. Your problem is 1D in nature, so a 1D grid of 1D threadblocks is a good starting point. If this calculation exceeds the machine limit in terms of max blocks in x-dimension, then you can either have each thread loop to process multiple elements or else extend to a 2D grid (of 1D threadblocks).
Your offset calculation is needlessly complex. Refer to the vector add example about how to create a grid of threads with relatively simple offset calculation to process an array.

matrix multiplication in cuda

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.