I am writing a function in CUDA that divides set of unsorted points in a 3D grid. Based on the bounds of the point set, I can find the coordinate of every point and write it in an array within the grid cell.
I launch kernal with threads equal to number of points by dividing them in different blocks for max thread count.
Now each thread finds its coordinate and write the point in the cell, but other threads within same or different block can also compute same coordinate at same time. The code fails here because of race condition.
I read about atomics, locks and critical section but these synchronizations are used within a thread block only, that is unlikely in my case.
Any suggestions please ?
My initial guess is I need to sort the points based on distance of grid cell size, and launch kernal with each block equal to size of grid cell
Atomics can work on the global memory and synchronize between blocks. The only issue here is performance. Depending on how much of the run time is taken up by just performing the writes to memory you may get slower code than just doing it in serial on the CPU. Atomics are slow. Maybe try to rethink the problem.
Related
What is the fastest way to move data that is on the device around in CUDA?
What I need to do is basically copy continuous sub-rows and sub-columns (of which I have the indexes on the device) from row-major matrices into new smaller matrices, but from what I've observed, memory access in CUDA is not particularly efficient, as it seems the cores are optimized to do computation rather that memory stuff.
Now the CPU seems to be pretty good at doing sequential stuff like moving rows of aligned memory from a place to another.
I see three options:
make a kernel that does the memory copying
outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
move the memory to the host, create the new smaller matrix and send it back on the device
Now I could test this on my specific gpu, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A,B but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).
If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of cublas will help, and none of the api striding-copy functions (such as cudaMemcpy2D) will help.
You'll need to write your own kernel to gather the data, before feeding it to cublasXgemm. This should be fairly trivial to do, if you have the locations of the incoming data elements listed in a vector or otherwise listed.
To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is opt to the user, for example, it could mean (width, height) or (rows, cols), which has the meaning reversed.
So I did an experiment with the CUDA sample in the SDK: 3_Imaging/convolutionSeparable, simply exchage .x and .y in the kernel function, and reverse the dimensions of blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same, however, the performance decreases, the original one takes about 26ms, while the modified one takes about 40ms on my machine(SM 3.0).
My question is, what makes the difference? is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, no convolutionColumns
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to the main memory so the
access is as not coalesced as in the original case.
You should think about the GPU main memory in the same way as it was
a "CPU" memory, i.e., prefetching, blocking, sequential accesses...
techniques to applies in order to get performance.
If you want to know more about this topic, it is mandatory to read
this paper. What every programmer should know about memory.
You'll find an example a comparison between row and column major
access to the elements of a matrix there.
To get and idea on how important this is, think that most -if not
all- GPU high performance codes perform a matrix transposition
before any computation in order to achieve a more coalesced memory
access, and still this additional step worths in terms on
performance. (sparse matrix operations, for instance)
Second. This is more subtle, but in some scenarios it has a deep impact on the performance of a kernel; the launching configuration. It is not the same launching 20 blocks of 10 threads as launching 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers,...). The more resources a thread needs the less warps can be mapped on a single SM so the less occupancy... and the -most of the times- less performance.
This not applies to your question, since the number of blocks is equal to the number of threads.
When programming for GPUs you must be aware of the architecture in order to understand how that changes will modify the performance. Of course, I am not familiar with the code so there will be others factors among these two.
I want to design a CUDA kernel that has thread blocks where warps read their own 1-D arrays. Suppose that a thread block with two warps takes two arrays {1,2,3,4} and {2,4,6,8}. Then each of warps would perform some computations by reading their own arrays. The computation is done based on per-array element basis. This means that the thread block would have redundant computations for the elements 2 and 4 in the arrays.
Here is my question: How I can avoid such redundant computations?
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table on the shared memory dedicated into a thread block may be considered. But I worry about performance degradation due to hash table accesses whenever a warp access elements of an array.
Any idea or comments?
In parallel computation on many-core co-processors, it is desired to perform arithmetic operations on an independent set of data, i.e. to eliminate any sort of dependency on the set of vectors which you provide to threads/warps. In this way, the computations can run in parallel. If you want to keep track of elements that you have been previously computed (in this case, 2 and 4 which are common in the two input arrays), you have to serialize and create branches which in turn diminishes the computing performance.
In conclusion, you need to check if it is possible to eliminate redundancy at the input level by reducing the input vectors to those with different components. If not, skipping redundant computations of repeated components may not necessarily improve the performance, since the computations are performed in batch.
Let us try to understand what happens on the hardware level.
First of all, computation in CUDA happens via warps. A warp is a group of 32 threads which are synchronized to each other. An instruction is executed on all the threads of a corresponding warp at the same time instance. So technically it's not threads but the warps that finally executes on the hardware.
Now, let us suppose that somehow you are able to keep track of which elements does not need computation so you put a condition in the kernel like
...
if(computationNeeded){
compute();
}
else{
...
}
...
Now let us assume that there are 5 threads in a particular warp for which "computationNeeded" is false and the don't need computation. But according to our definition, all threads executes the same instruction. So even if you put these conditions all threads have to execute both if and else blocks.
Exception:
It will be faster if all the threads of a particular block execute either if or else condition. But its exceptional case for almost all the real-world algorithms.
Suggestion
Put a pre-processing step either on CPU or GPU, which eliminate the redundant elements from input data. Also, check if this step is worth.
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table on the shared memory dedicated into a thread block
may be considered
If you want inter-warp communication between ALL the warps of the grid then it is only possible via global memory not shared memory.
i have a question regarding a CFD application i am trying to implement according to a paper i found online. this might be somewhat of a beginner question, but here it goes:
the situation is as follows:
the 2D domain gets decomposed into tiles. Each of these tiles is being processed by a block of the kernel in question. The calculations being executed are highly suited for parallel execution, as they take into account only a handfull of their neighbours (it's a shallow water application). The tiles do overlap. Each tile has 2 extra cells on each side of the domain it's supposed to calculate the result to.
on the left you see 1 block, on the right 4, with the overlap that comes with it. grey are "ghost cells" needed for the calculation. light green is the domain each block actually writed back to global memory. needless to say the whole domain is going to have more than 4 tiles.
The idea per thread goes as following:
(1) copy data from global memory to shared memory
__synchthreads();
(2) perform some calculations
__synchthreads();
(3) perform some more calculations
(4) write back to globabl memory
for the cells in the green area, the Kernel is straight forward, you copy data according to your threadId, and calculate along using your neighbours in shared memory. Because of the nature of the data dependency this does however not suffice:
(1) has to be run on all cells (grey and green). No dependency.
(2) has to be run on all green cells, and the inner rows/columns of the grey cells. Depends on neighbouring data N,S,E and W.
(3) has to be run on all green cells. Depends on data from step (2) on neighbours N,S,E and W.
so here goes my question:
how does one do this without a terribly cluttered code?
all i can think of is a horrible amount of "if" statements to decide whether a thread should perform some of these steps twice, depending on his threadId.
i have considered using overlapping blocks as well (as opposed to just overlapping data), but this leads to another problem: the __synchthreads()-calls would have to be in conditional parts of the code.
Taking the kernel apart and having the steps (2)/(3) run in different kernels is not really an option either, as they produce intermediate results which can't all be written back to memory because of their number/size.
the author himself writes this (Brodtkorb et al. 2010, Efficient Shallow Water Simulations on GPUs:
Implementation, Visualization, Verification, and Validation):
When launching our kernel, we start by reading from global memory into on-chip shared memory. In addition to the interior cells of our block, we need to use data from two neighbouring cells in each direction to fulfill the data
dependencies of the stencil. After having read data into shared memory, we proceed by computing the one dimensional
fluxes in the x and y directions, respectively. Using the steps illustrated in Figure 1, fluxes are computed by storing
all values that are used by more than one thread in shared memory. We also perform calculations collectively within
one block to avoid duplicate computations. However, because we compute the net contribution for each cell, we have
to perform more reconstructions and flux calculations than the number of threads, complicating our kernel. This is
solved in our code by designating a single warp that performs the additional computations; a strategy that yielded a
better performance than dividing the additional computations between several warps.
so, what does he mean by designating a single warp to do these compuations, and how does one do so?
so, what does he mean by designating a single warp to do these compuations, and how does one do so?
You could do something like this:
// work that is done by all threads in a block
__syncthreads(); // may or may not be needed
if (threadIdx.x < 32) {
// work that is done only by the designated single warp
}
Although that's trivially simple, insofar as the last question in your question is considered, and the highlighted paragraph, I think it's very likely what they are referring to. I think it fits with what I'm reading here. Furthermore I don't know of any other way to restrict work to a single warp, except by using conditionals. They may also have chosen a single warp to take advantage of warp-synchronous behavior, which gets around the __syncthreads(); in conditional code issue you mention earlier.
so here goes my question: how does one do this without a terribly cluttered code?
all i can think of is a horrible amount of "if" statements to decide whether a thread should perform some of these steps twice, depending on his threadId.
Actually, I don't think any sequence of ordinary "if" statements, regardless of how cluttered, could solve the problem you describe.
A typical way to solve the dependency between steps 2 and 3 that you have already mentioned is to separate the work into two ( or more) kernels. You indicate that this is "not really an option", but as near as I can tell, what you're looking for is a global sync. Such a concept is not well-defined in CUDA except for the kernel launch/exit points. CUDA does not guarantee execution order among blocks in a grid. If your block calculations in step 3 depend on neighboring block calculations in step 2, then in my opinion, you definitely need a global sync, and your code is going to get ugly if you don't implement it with a kernel launch. Alternative methods such as using global semaphores or global block counters are, in my opinion, fragile and difficult to apply to general cases of widespread data dependence (where every block is dependent on neighbor calculations from the previous step).
If the neighboring calculations depend on only the data from a thin set of neighboring cells ("halo") , and not the whole neighboring block, and those cells can be computed independently, then it might be possible to have your block be expanded to include neighboring cells (i.e. overlap), effectively computing the halo regions twice between neighboring blocks, but you've indicated you've already considered and discarded this idea. However, I personally would want to consider the code in detail before accepting the idea that this would be rejected based entirely on difficulty with __syncthreads(); In my experience, people who say they can't use __syncthreads(); due to conditional code execution haven't accurately considered all the options, at a detail code level, to make __syncthreads(); work, even in the midst of conditional code.
I am CUDA beginner.
So far I learned, that each SM have 8 blocks (of threads). Let's say I have simple job of multiplying elements in array by 2. However, I have less data than threads.
Not a problem, because I could cut off the "tail" of threads to make them idle. But if I understand correctly this would mean some SMs would get 100% of work, and some part (or even none).
So I would like to calculate which SM is running given thread and make computation in such way, that each SM has equal amount of work.
I hope it makes sense in the first place :-) If so, how to compute which SM is running given thread? Or -- index of current SM and total numbers of them? In other words, equivalent on threadDim/threadIdx in SM terms.
Update
For comment it was too long.
Robert, thank you for your answer. While I try to digest all, here is what I do -- I have a "big" array and I simply have to multiply the values *2 and store it to output array (as a warmup; btw. all computations I do, mathematically are correct). So first I run this in 1 block, 1 thread. Fine. Next, I tried to split work in such way that each multiplication is done just once by one thread. As the result my program runs around 6 times slower. I even sense why -- small penalty for fetching the info about GPU, then computing how many blocks and threads I should use, then within each thread instead of single multiplications now I have around 10 extra multiplications just to compute the offset in the array for a thread. On one hand I try to find out how to change that undesired behaviour, on the other I would like to spread the "tail" of threads among SMs evenly.
I rephrase -- maybe I am mistaken, but I would like to solve this. I have 1G small jobs (*2 that's all) -- should I create 1K blocks with 1K threads, or 1M blocks with 1 thread, 1 block with 1M threads, and so on. So far, I read GPU properties, divide, divide, and use blindly maximum values for each dimension of grid/block (or required value, if there is no data to compute).
The code
size is the size of the input and output array. In general:
output_array[i] = input_array[i]*2;
Computing how many blocks/threads I need.
size_t total_threads = props.maxThreadsPerMultiProcessor
* props.multiProcessorCount;
if (size<total_threads)
total_threads = size;
size_t total_blocks = 1+(total_threads-1)/props.maxThreadsPerBlock;
size_t threads_per_block = 1+(total_threads-1)/total_blocks;
Having props.maxGridSize and props.maxThreadsDim I compute in similar manner the dimensions for blocks and threads -- from total_blocks and threads_per_block.
And then the killer part, computing the offset for a thread ("inside" the thread):
size_t offset = threadIdx.z;
size_t dim = blockDim.x;
offset += threadIdx.y*dim;
dim *= blockDim.y;
offset += threadIdx.z*dim;
dim *= blockDim.z;
offset += blockIdx.x*dim;
dim *= gridDim.x;
offset += blockIdx.y*dim;
dim *= gridDim.y;
size_t chunk = 1+(size-1)/dim;
So now I have starting offset for current thread, and the amount of data in array (chunk) for multiplication. I didn't use grimDim.z above, because AFAIK is alway 1, right?
It's an unusual thing to try to do. Given that you are a CUDA beginner, such a question seems to me to be indicative of attempting to solve a problem improperly. What is the problem you are trying to solve? How does it help your problem if you are executing a particular thread on SM X vs. SM Y? If you want maximum performance out of the machine, structure your work in a way such that all thread processors and SMs can be active, and in fact that there is "more than enough work" for all. GPUs depend on oversubscribed resources to hide latency.
As a CUDA beginner, your goals should be:
create enough work, both in blocks and threads
access memory efficiently (this mostly has to do with coalescing - you can read up on that)
There is no benefit to making sure that "each SM has an equal amount of work". If you create enough blocks in your grid, each SM will have an approximately equal amount of work. This is the scheduler's job, you should let the scheduler do it. If you do not create enough blocks, your first objective should be to create or find more work to do, not to come up with a fancy work breakdown per block that will yield no benefit.
Each SM in the Fermi GPU (for example) has 32 thread processors. In order to keep these processors busy even in the presence of inevitable machine stalls due to memory accesses and the like, the machine is designed to hide latency by swapping in another warp of threads (32) when a stall occurs, so that processing can continue. In order to facilitate this, you should try to have a large number of available warps per SM. This is facilitated by having:
many threadblocks in your grid (at least 6 times the number of SMs in the GPU)
multiple warps per threadblock (probably at least 4 to 8 warps, so 128 to 256 threads per block)
Since a (Fermi) SM is always executing 32 threads at a time, if I have fewer threads than 32 times the number of SMs in my GPU at any instant, then my machine is under-utilized. If my entire problem is only composed of, say, 20 threads, then it's simply not well designed to take advantage of any GPU, and breaking those 20 threads up into multiple SMs/threadblocks is not likely to have any appreciable benefit.
EDIT: Since you don't want to post your code, I'll make a few more suggestions or comments.
You tried to modify some code, found that it runs slower, then jumped to (I think) the wrong conclusion.
You should probably familiarize yourself with a simple code example like vector add. It's not multiplying each element, but the structure is close. There's no way performing this vector add using a single thread would actually run faster. I think if you study this example, you'll find a straightforward way to extend it to do array element multiply-by-2.
Nobody computes threads per block the way you have outlined. First of all, threads per block should be a multiple of 32. Secondly, it's customary to pick threads per block as a starting point, and build your other launch parameters from it, not the other way around. For a large problem, just start with 256 or 512 threads per block, and dispense with the calculations for that.
Build your other launch parameters (grid size) based on your chosen threadblock size. Your problem is 1D in nature, so a 1D grid of 1D threadblocks is a good starting point. If this calculation exceeds the machine limit in terms of max blocks in x-dimension, then you can either have each thread loop to process multiple elements or else extend to a 2D grid (of 1D threadblocks).
Your offset calculation is needlessly complex. Refer to the vector add example about how to create a grid of threads with relatively simple offset calculation to process an array.