Moving memory around on device in CUDA

What is the fastest way to move data that is on the device around in CUDA?
What I need to do is basically copy contiguous sub-rows and sub-columns (of which I have the indexes on the device) from row-major matrices into new smaller matrices, but from what I've observed, memory access in CUDA is not particularly efficient, as it seems the cores are optimized to do computation rather than memory stuff.
Now the CPU seems to be pretty good at doing sequential stuff like moving rows of aligned memory from one place to another.
I see three options:
make a kernel that does the memory copying
outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
move the memory to the host, create the new smaller matrix and send it back on the device
Now I could test this on my specific gpu, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A,B but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).

If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of cuBLAS will help, and none of the API striding-copy functions (such as cudaMemcpy2D) will help.
You'll need to write your own kernel to gather the data, before feeding it to cublasXgemm. This should be fairly trivial to do, if you have the locations of the incoming data elements listed in a vector or otherwise listed.
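A gather kernel of the kind described above could look roughly like the following. This is only a sketch under assumptions that are not in the question: the names gather_segments, gather_on_device, starts, segs_per_row and seg_len are made up for illustration, and the layout assumed is that starts[r * segs_per_row + s] holds the first source column of segment s in row r. The host wrapper copies data to and from the device only to make the example self-contained; in the situation described in the question the data would already be on the device. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Gather fixed-length segments of every row of a row-major matrix into a
// dense row-major matrix with segs_per_row * seg_len columns.
__global__ void gather_segments(const float* src, int src_cols, const int* starts,
                                int rows, int segs_per_row, int seg_len, float* dst)
{
    int dst_cols = segs_per_row * seg_len;
    int total = rows * dst_cols;
    // Grid-stride loop: consecutive threads write consecutive dst elements,
    // and reads inside one segment are contiguous as well.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total;
         i += gridDim.x * blockDim.x) {
        int r   = i / dst_cols;   // destination (and source) row
        int c   = i % dst_cols;   // destination column
        int seg = c / seg_len;    // which segment of the row
        int off = c % seg_len;    // offset inside that segment
        dst[i] = src[r * src_cols + starts[r * segs_per_row + seg] + off];
    }
}

// Hypothetical host wrapper used only to demonstrate the kernel.
std::vector<float> gather_on_device(const std::vector<float>& src, int rows, int src_cols,
                                    const std::vector<int>& starts,
                                    int segs_per_row, int seg_len)
{
    assert((int)src.size() == rows * src_cols);
    assert((int)starts.size() == rows * segs_per_row);
    int total = rows * segs_per_row * seg_len;

    float *d_src = nullptr, *d_dst = nullptr;
    int* d_starts = nullptr;
    cudaMalloc(&d_src, src.size() * sizeof(float));
    cudaMalloc(&d_dst, total * sizeof(float));
    cudaMalloc(&d_starts, starts.size() * sizeof(int));
    cudaMemcpy(d_src, src.data(), src.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), starts.size() * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    gather_segments<<<blocks, threads>>>(d_src, src_cols, d_starts, rows,
                                         segs_per_row, seg_len, d_dst);

    std::vector<float> out(total);
    cudaMemcpy(out.data(), d_dst, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_starts);
    return out;
}
```

The gathered matrix is dense and row-major, so it can be handed to a GEMM routine directly. Gathering columns of B works the same way with the roles of rows and columns swapped, although the reads are then strided and will be slower.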

Related

Cuda Efficient Matrix Addition

I am new to cuda and learning GPU programming. I want to add two nxm matrices (float* A and float* B) and store the results in float* C in the kernel. The goal is to get the fastest implementation. I have the following questions:
I was wondering how to arrange the blocks and grid to get the best performance (for both small and large n and m).
It is good to assign one thread to each element of the matrices. However, for large n and m it is not possible. What is the best option then?
How can matrix padding improve the performance?
1: A simple method would be to store the matrix as a vector/array of floats where the rows are concatenated. Then you could just use a large number of threads per block and the smallest necessary number of blocks. Here is an example of what the kernel could look like.
2: You can launch far more threads than the GPU can run at once; the practical limit is that the matrices have to fit in the free memory on your GPU. The threads won't all be executed simultaneously, but the GPU will schedule the blocks for you and you don't have to care about it.
A thread per element generally works well. If you want to try another way, have a look at grid-stride loops, which are a scalable method of processing your elements with fewer threads.
3: I don't see how padding would improve the performance as you get more elements to copy and calculate, but I'm no expert for that.
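The flat-array approach from point 1, combined with the grid-stride loop from point 2, might look like the sketch below. The names add_matrices and add_on_device, the block size of 256 and the cap of 1024 blocks are illustrative assumptions rather than tuned values, and error checking is left out.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// C = A + B for matrices stored as flat arrays of `count` = n * m floats.
// The grid-stride loop lets any grid size cover any number of elements.
__global__ void add_matrices(const float* A, const float* B, float* C, size_t count)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < count; i += stride)
        C[i] = A[i] + B[i];
}

// Hypothetical host wrapper used only to demonstrate the kernel.
std::vector<float> add_on_device(const std::vector<float>& A, const std::vector<float>& B)
{
    assert(A.size() == B.size());
    size_t count = A.size();
    size_t bytes = count * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (int)((count + threads - 1) / threads);
    if (blocks > 1024) blocks = 1024;  // arbitrary cap; the loop handles the rest
    add_matrices<<<blocks, threads>>>(dA, dB, dC, count);

    std::vector<float> C(count);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Because the matrices are treated as one long array, the shape (n and m) does not matter to the kernel, and neighbouring threads always touch neighbouring addresses.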

How to avoid race condition in different blocks in CUDA

I am writing a function in CUDA that divides a set of unsorted points into a 3D grid. Based on the bounds of the point set, I can find the grid coordinate of every point and write the point into an array within the grid cell.
I launch the kernel with as many threads as there are points, dividing them into blocks of the maximum thread count.
Now each thread finds its coordinate and writes the point into the cell, but other threads within the same or a different block can compute the same coordinate at the same time. The code fails here because of a race condition.
I read about atomics, locks and critical sections, but as far as I understand these synchronizations only work within a thread block, which is unlikely to cover my case.
Any suggestions, please?
My initial guess is that I need to sort the points by grid cell, and launch the kernel with each block corresponding to one grid cell.
Atomics can work on the global memory and synchronize between blocks. The only issue here is performance. Depending on how much of the run time is taken up by just performing the writes to memory you may get slower code than just doing it in serial on the CPU. Atomics are slow. Maybe try to rethink the problem.
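To illustrate atomics on global memory working across blocks, here is a minimal sketch of the binning step. Everything about the data layout is an assumption made for the example: each cell gets a fixed number of slots (capacity), cell_count holds one counter per cell, and the names bin_points, bin_on_device and Bins are hypothetical. Error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct Bins {
    std::vector<int> count;  // points that landed in each cell (may exceed capacity)
    std::vector<int> items;  // capacity point indices per cell, -1 = empty slot
};

// One thread per point. atomicAdd on a global counter hands every point a
// unique slot in its cell, no matter which block the thread belongs to.
__global__ void bin_points(const float3* pts, int n, float3 lo, float cell_size,
                           int3 dims, int capacity, int* cell_count, int* cell_items)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = min(max((int)((pts[i].x - lo.x) / cell_size), 0), dims.x - 1);
    int cy = min(max((int)((pts[i].y - lo.y) / cell_size), 0), dims.y - 1);
    int cz = min(max((int)((pts[i].z - lo.z) / cell_size), 0), dims.z - 1);
    int cell = (cz * dims.y + cy) * dims.x + cx;
    int slot = atomicAdd(&cell_count[cell], 1);  // returns the old value
    if (slot < capacity)                         // drop points that do not fit
        cell_items[cell * capacity + slot] = i;
}

// Hypothetical host wrapper used only to demonstrate the kernel.
Bins bin_on_device(const std::vector<float3>& pts, float3 lo, float cell_size,
                   int3 dims, int capacity)
{
    assert(!pts.empty() && cell_size > 0.0f && capacity > 0);
    int n = (int)pts.size();
    int cells = dims.x * dims.y * dims.z;

    float3* d_pts = nullptr;
    int *d_count = nullptr, *d_items = nullptr;
    cudaMalloc(&d_pts, n * sizeof(float3));
    cudaMalloc(&d_count, cells * sizeof(int));
    cudaMalloc(&d_items, cells * capacity * sizeof(int));
    cudaMemcpy(d_pts, pts.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, cells * sizeof(int));
    cudaMemset(d_items, 0xFF, cells * capacity * sizeof(int));  // all bytes 0xFF = -1

    int threads = 256;
    bin_points<<<(n + threads - 1) / threads, threads>>>(d_pts, n, lo, cell_size, dims,
                                                         capacity, d_count, d_items);
    Bins b;
    b.count.resize(cells);
    b.items.resize(cells * capacity);
    cudaMemcpy(b.count.data(), d_count, cells * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.items.data(), d_items, cells * capacity * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_pts);
    cudaFree(d_count);
    cudaFree(d_items);
    return b;
}
```

The order of points inside a cell depends on thread scheduling and is not deterministic. Whether this is fast enough depends on how many points compete for the same cell; heavy contention on a few counters is where atomics hurt most.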

CUDA: How to find index of extrema in sub matrices?

I have a large rectangular matrix NxM in GPU memory, stored as 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n and same with M and m. Let us say, the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.
I can hardly imagine to be so self-confident (or arrogant?) to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question of which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's say the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
In contrast to that, in the second approach, each kernel (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented as an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum.) So the kernel would be launched for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.
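One possible way of combining the two approaches is sketched below: a single launch in which each thread block (rather than each kernel launch) reduces one sub-matrix. This is a variation on the second approach, not something taken from the answer itself. The names submatrix_argmax and argmax_per_submatrix and the block size THREADS are assumptions, the matrix is taken to have N rows and M columns with sub-matrices of n rows and m columns as in the question, ties are resolved towards the lowest index, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 128;  // threads per block, must be a power of two

// Block (blockIdx.x, blockIdx.y) finds the 1-D index of the maximum of the
// n x m sub-matrix in tile column blockIdx.x and tile row blockIdx.y.
__global__ void submatrix_argmax(const float* mat, int M, int n, int m, int* out_index)
{
    __shared__ float best_val[THREADS];
    __shared__ int best_idx[THREADS];

    int row0 = blockIdx.y * n;
    int col0 = blockIdx.x * m;

    // Each thread scans a strided subset of the sub-matrix sequentially.
    float v = 0.0f;
    int idx = -1;  // -1 means "nothing seen yet"
    for (int e = threadIdx.x; e < n * m; e += blockDim.x) {
        int g = (row0 + e / m) * M + (col0 + e % m);
        float x = mat[g];
        if (idx < 0 || x > v) { v = x; idx = g; }
    }
    best_val[threadIdx.x] = v;
    best_idx[threadIdx.x] = idx;
    __syncthreads();

    // Standard shared-memory tree reduction, carrying the index along.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            int o = threadIdx.x + s;
            bool take = best_idx[o] >= 0 &&
                        (best_idx[threadIdx.x] < 0 ||
                         best_val[o] > best_val[threadIdx.x] ||
                         (best_val[o] == best_val[threadIdx.x] &&
                          best_idx[o] < best_idx[threadIdx.x]));
            if (take) {
                best_val[threadIdx.x] = best_val[o];
                best_idx[threadIdx.x] = best_idx[o];
            }
        }
        __syncthreads();  // outside the conditional, reached by all threads
    }
    if (threadIdx.x == 0)
        out_index[blockIdx.y * gridDim.x + blockIdx.x] = best_idx[0];
}

// Hypothetical host wrapper used only to demonstrate the kernel.
std::vector<int> argmax_per_submatrix(const std::vector<float>& mat,
                                      int N, int M, int n, int m)
{
    assert(N % n == 0 && M % m == 0 && (int)mat.size() == N * M);
    int tiles = (N / n) * (M / m);
    float* d_mat = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_mat, mat.size() * sizeof(float));
    cudaMalloc(&d_out, tiles * sizeof(int));
    cudaMemcpy(d_mat, mat.data(), mat.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid(M / m, N / n);
    submatrix_argmax<<<grid, THREADS>>>(d_mat, M, n, m, d_out);

    std::vector<int> out(tiles);
    cudaMemcpy(out.data(), d_out, tiles * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_mat);
    cudaFree(d_out);
    return out;
}
```

Neighbouring threads read neighbouring elements within a sub-matrix row, so accesses are coalesced when m is reasonably large. For very small sub-matrices most threads of a block would sit idle, and the one-thread-per-sub-matrix approach is likely the better fit.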

Kernel design for overlapping data, launch of a separate warp

I have a question regarding a CFD application I am trying to implement according to a paper I found online. This might be somewhat of a beginner question, but here it goes.
The situation is as follows:
The 2D domain gets decomposed into tiles. Each of these tiles is processed by one block of the kernel in question. The calculations are highly suited to parallel execution, as they take into account only a handful of their neighbours (it's a shallow water application). The tiles do overlap. Each tile has 2 extra cells on each side of the domain it's supposed to calculate the result for.
On the left you see 1 block, on the right 4, with the overlap that comes with it. Grey cells are "ghost cells" needed for the calculation. Light green is the domain each block actually writes back to global memory. Needless to say, the whole domain is going to have more than 4 tiles.
The idea per thread is as follows:
(1) copy data from global memory to shared memory
__syncthreads();
(2) perform some calculations
__syncthreads();
(3) perform some more calculations
(4) write back to global memory
For the cells in the green area, the kernel is straightforward: you copy data according to your threadId, and calculate using your neighbours in shared memory. Because of the nature of the data dependency, however, this does not suffice:
(1) has to be run on all cells (grey and green). No dependency.
(2) has to be run on all green cells, and the inner rows/columns of the grey cells. Depends on neighbouring data N,S,E and W.
(3) has to be run on all green cells. Depends on data from step (2) on neighbours N,S,E and W.
So here goes my question:
How does one do this without terribly cluttered code?
All I can think of is a horrible amount of "if" statements to decide whether a thread should perform some of these steps twice, depending on its threadId.
I have considered using overlapping blocks as well (as opposed to just overlapping data), but this leads to another problem: the __syncthreads() calls would have to be in conditional parts of the code.
Taking the kernel apart and having steps (2)/(3) run in different kernels is not really an option either, as they produce intermediate results which can't all be written back to memory because of their number/size.
The author himself writes this (Brodtkorb et al. 2010, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation):
"When launching our kernel, we start by reading from global memory into on-chip shared memory. In addition to the interior cells of our block, we need to use data from two neighbouring cells in each direction to fulfill the data dependencies of the stencil. After having read data into shared memory, we proceed by computing the one dimensional fluxes in the x and y directions, respectively. Using the steps illustrated in Figure 1, fluxes are computed by storing all values that are used by more than one thread in shared memory. We also perform calculations collectively within one block to avoid duplicate computations. However, because we compute the net contribution for each cell, we have to perform more reconstructions and flux calculations than the number of threads, complicating our kernel. This is solved in our code by designating a single warp that performs the additional computations; a strategy that yielded a better performance than dividing the additional computations between several warps."
So, what does he mean by designating a single warp to do these computations, and how does one do so?
You could do something like this:
// work that is done by all threads in a block
__syncthreads(); // may or may not be needed
if (threadIdx.x < 32) {
// work that is done only by the designated single warp
}
Although that's trivially simple, as far as the last question in your post and the highlighted paragraph are concerned, I think it's very likely what they are referring to. I think it fits with what I'm reading there. Furthermore, I don't know of any other way to restrict work to a single warp except by using conditionals. They may also have chosen a single warp to take advantage of warp-synchronous behavior, which gets around the issue of __syncthreads() in conditional code that you mention earlier. (Note that on Volta and later GPUs the threads of a warp are no longer guaranteed to execute in lockstep, so code that relies on this needs explicit __syncwarp() calls.)
You also ask how to do this without terribly cluttered code, given that all you can think of is a horrible amount of "if" statements to decide whether a thread should perform some of these steps twice, depending on its threadId.
Actually, I don't think any sequence of ordinary "if" statements, regardless of how cluttered, could solve the problem you describe.
A typical way to solve the dependency between steps 2 and 3 that you have already mentioned is to separate the work into two (or more) kernels. You indicate that this is "not really an option", but as near as I can tell, what you're looking for is a global sync. Such a concept is not well-defined in CUDA except for the kernel launch/exit points. CUDA does not guarantee execution order among blocks in a grid. If your block calculations in step 3 depend on neighboring block calculations in step 2, then in my opinion, you definitely need a global sync, and your code is going to get ugly if you don't implement it with a kernel launch. Alternative methods such as using global semaphores or global block counters are, in my opinion, fragile and difficult to apply to general cases of widespread data dependence (where every block is dependent on neighbor calculations from the previous step).
If the neighboring calculations depend only on the data from a thin set of neighboring cells (a "halo"), and not on the whole neighboring block, and those cells can be computed independently, then it might be possible to expand your block to include the neighboring cells (i.e. overlap), effectively computing the halo regions twice between neighboring blocks, but you've indicated you've already considered and discarded this idea. However, I personally would want to consider the code in detail before accepting that this has to be rejected based entirely on difficulty with __syncthreads(). In my experience, people who say they can't use __syncthreads() due to conditional code execution haven't accurately considered all the options, at a detailed code level, for making __syncthreads() work, even in the midst of conditional code.
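As an illustration of the overlapping-block idea with every __syncthreads() kept outside conditional code, here is a rough sketch. It is not the scheme from the paper: the "calculations" in steps (2) and (3) are placeholder sums of the four neighbours, boundary cells are simply clamped on load, and the names two_step_stencil, run_two_step, BLOCK and INNER are made up. Each block has BLOCK x BLOCK threads but only writes its inner (BLOCK - 4) x (BLOCK - 4) cells. Error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 16;         // threads per block side, ghost cells included
constexpr int INNER = BLOCK - 4;  // cells per side that a block writes back

__device__ int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

__global__ void two_step_stencil(const float* u, float* out, int W, int H)
{
    __shared__ float s_u[BLOCK][BLOCK];
    __shared__ float s_q[BLOCK][BLOCK];

    int tx = threadIdx.x, ty = threadIdx.y;
    int gx = blockIdx.x * INNER + tx - 2;  // global cell, shifted by 2 ghost cells
    int gy = blockIdx.y * INNER + ty - 2;

    // (1) every thread (grey and green) loads one cell
    s_u[ty][tx] = u[clampi(gy, 0, H - 1) * W + clampi(gx, 0, W - 1)];
    __syncthreads();

    // (2) green cells plus the inner ring of ghost cells; placeholder stencil
    bool ring = tx >= 1 && tx < BLOCK - 1 && ty >= 1 && ty < BLOCK - 1;
    if (ring)
        s_q[ty][tx] = s_u[ty - 1][tx] + s_u[ty + 1][tx] + s_u[ty][tx - 1] + s_u[ty][tx + 1];
    __syncthreads();  // reached by all threads, only the work above is conditional

    // (3) and (4) green cells only; placeholder stencil on the step (2) results
    bool green = tx >= 2 && tx < BLOCK - 2 && ty >= 2 && ty < BLOCK - 2;
    if (green && gx < W && gy < H)
        out[gy * W + gx] = s_q[ty - 1][tx] + s_q[ty + 1][tx] + s_q[ty][tx - 1] + s_q[ty][tx + 1];
}

// Hypothetical host wrapper used only to demonstrate the kernel.
std::vector<float> run_two_step(const std::vector<float>& u, int W, int H)
{
    assert((int)u.size() == W * H);
    size_t bytes = u.size() * sizeof(float);
    float *d_u = nullptr, *d_out = nullptr;
    cudaMalloc(&d_u, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_u, u.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(BLOCK, BLOCK);
    dim3 grid((W + INNER - 1) / INNER, (H + INNER - 1) / INNER);
    two_step_stencil<<<grid, block>>>(d_u, d_out, W, H);

    std::vector<float> out(u.size());
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_u);
    cudaFree(d_out);
    return out;
}
```

The price is that ghost-cell threads idle during step (3) and the ring of step (2) values is computed redundantly by neighbouring blocks. The designated-warp approach from the paper avoids the idle threads at the cost of more index bookkeeping.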

matrix multiplication in cuda

Say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) One thread to calculate each element of the result matrix. So each thread has a loop that multiplies one row by one column.
b) One thread to do each multiplication. Each element of the result matrix requires 50 threads. After the multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be that there are just too many threads and they are waiting for resources most of the time. Is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarithmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the ratio of memory access to work is poor.
The simple approach of one thread doing one dot product has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire little sub-matrices from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There are documents on NVIDIA's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
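The tiling idea can be sketched as follows. This is a simplified version written for this answer, not the SDK source: the names matmul_tiled and matmul_on_device and the tile size of 16 are assumptions, only square n x n matrices are handled, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile side; each block has TILE x TILE threads

// C = A * B for square row-major n x n matrices, one thread per element of C.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile;
        // out-of-range entries are padded with zeros.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();

        // Every value loaded from global memory is reused TILE times here.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n)
        C[row * n + col] = acc;
}

// Hypothetical host wrapper used only to demonstrate the kernel.
std::vector<float> matmul_on_device(const std::vector<float>& A,
                                    const std::vector<float>& B, int n)
{
    assert((int)A.size() == n * n && (int)B.size() == n * n);
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> C((size_t)n * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With a 16 x 16 tile, each element fetched from global memory takes part in 16 multiply-adds, which is where the improvement in arithmetic intensity comes from. For a 50 x 50 problem the whole computation is still tiny for a GPU, so launch and transfer overhead will probably dominate either way.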
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.