Why does the order of dimensions make a big difference in performance? - CUDA

To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is up to the user; for example, it could mean (width, height) or (rows, cols), which reverses the meaning.
So I did an experiment with the CUDA SDK sample 3_Imaging/convolutionSeparable: I simply exchanged .x and .y in the kernel function and reversed the dimensions of the blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same, but the performance decreases: the original version takes about 26 ms, while the modified one takes about 40 ms on my machine (SM 3.0).
My question is: what makes the difference? Is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, not convolutionColumns.
EDIT: The change can be found here.

There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to main memory, so the
accesses are not as coalesced as in the original case.
You should think about GPU main memory the same way you would think about
"CPU" memory: prefetching, blocking, sequential accesses... are techniques
to apply in order to get performance.
If you want to know more about this topic, it is mandatory to read the
paper "What every programmer should know about memory". There you'll find
an example comparing row-major and column-major access to the elements of
a matrix.
To get an idea of how important this is, consider that most, if not all,
high-performance GPU codes perform a matrix transposition before any
computation in order to achieve more coalesced memory access, and this
additional step still pays off in performance (sparse matrix operations,
for instance).
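To make the access-pattern point concrete, here is a minimal sketch (not taken from the convolution sample, just an illustration) of the two indexing schemes on a row-major image. In the first kernel, consecutive threads of a warp touch consecutive addresses (coalesced); in the second, consecutive threads are a full row apart (strided):

// Illustrative only: two ways of indexing the same row-major width x height image.
__global__ void coalescedCopy(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive columns
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];     // a warp reads adjacent addresses
}

__global__ void stridedCopy(const float *in, float *out, int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x; // consecutive threads -> consecutive rows
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = in[row * width + col]; // a warp reads addresses width floats apart
}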
Second. This is more subtle, but in some scenarios it has a deep impact on the performance of a kernel: the launch configuration. Launching 20 blocks of 10 threads is not the same as launching 10 blocks of 20 threads. There can be a big difference in the amount of resources a thread needs (shared memory, number of registers, ...). The more resources a thread needs, the fewer warps can be mapped onto a single SM, so the lower the occupancy... and, most of the time, the lower the performance.
This does not apply to your question, though, since the total number of blocks and the number of threads per block stay the same.
When programming for GPUs you must be aware of the architecture in order to understand how such changes will affect performance. Of course, I am not familiar with the code, so there may be other factors at play besides these two.

Related

Is there a way to know what's the extra space that cudaMalloc is going to reserve?

When I use cudaMalloc(100) it reserves more than 100 B (according to some users here, this is due to granularity issues and housekeeping information).
Is it possible to determine how big this space will be based on the number of bytes I need to reserve?
Thank you so much.
EDIT: I'll explain why I need to know.
I want to apply a convolution algorithm over huge images on the GPU. Since there isn't enough memory on the GPU to hold a whole image, I need to split the image into batches of rows and call the kernel several times.
In fact, I need to send 2 images: the OnlyRead matrix and the Results matrix.
I want to calculate a priori the maximum number of rows I can send to the device according to the amount of free memory.
The first cudaMalloc executes successfully, but the problem appears when trying to execute the second cudaMalloc, since the first reservation took more bytes than expected.
What I'm doing now is treating the free memory amount as 10% less than what is reported... but that's just a magic number that came from nowhere.
"Is there a way to know what's the extra space that cudaMalloc is going to reserve?"
Not without violating CUDA's platform guarantees, no. cudaMalloc() returns a pointer to the requested amount of memory. You can't make any assumptions about the amount of memory that happens to be valid after the end of the requested amount - the CUDA allocator already makes use of suballocators, and unlike CPU-based memory allocators, the data structures to track free lists etc. are not interleaved with the allocated memory. So for example, it would be unwise to assume that the CUDA runtime's guarantees about the alignment of the returned pointers mean anything other than that returned pointers will have a certain alignment.
If you study the CUDA runtime's behavior, that will shed light on the behavior of that particular CUDA runtime, but the behavior may change with future releases and break your code.
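That said, you can at least base the batch size on the amount of memory the runtime reports as free, with an explicit safety margin instead of a hard-coded 10%. A minimal host-side sketch, assuming each image row needs one input row plus one result row on the device (rowsPerBatch, bytesPerPixel and safetyFactor are illustrative names, and the margin itself is still a guess):

#include <cuda_runtime.h>

// Returns how many image rows should fit in one batch, given the currently free
// device memory and a safety factor such as 0.9.
int rowsPerBatch(int width, size_t bytesPerPixel, double safetyFactor)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);              // free device memory right now

    size_t usable = (size_t)(freeBytes * safetyFactor);   // leave headroom for allocator overhead
    size_t perRow = (size_t)width * bytesPerPixel * 2;    // one input row + one result row
    return (int)(usable / perRow);
}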

Do different arithmetic operations have different processing times?

Are the basic arithmetic operations the same with respect to processor usage? For example, if I do an addition vs. a division in a loop, will the calculation time for the addition be less than that for the division?
I am not sure if this question belongs here or on Computer Science SE.
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and the operation timings of a very old architecture, the 8086. It is a fairly simple place to start.
Of relevant note, they are measured in cycles, or clocks, and everything moves at the speed of the CPU (they are synchronized to the main clock, or frequency, of the microprocessor).
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles.
Also note that operation speed is affected by which area of memory the operands reside in.
Note that on modern processors you can have multiple instructions executing concurrently (even if the CPU is single-threaded) and some of them are executed out of order; vector instructions muddy the question even more.
For example, an SSE multiplication can multiply several numbers in a single instruction (while still taking multiple cycles).
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
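If you just want to see the effect on your own machine, a rough host-side timing sketch like the one below (illustrative only; the compiler, the CPU and the dependency chain all change the absolute numbers) will usually show a chain of divisions taking noticeably longer than a chain of additions:

#include <chrono>
#include <cstdio>

int main()
{
    const int N = 100 * 1000 * 1000;
    volatile double acc = 1.0;          // volatile keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) acc = acc + 1.000001;     // dependent additions
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) acc = acc / 1.000001;     // dependent divisions
    auto t2 = std::chrono::steady_clock::now();

    long long addMs = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    long long divMs = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    printf("add: %lld ms, div: %lld ms (acc = %f)\n", addMs, divMs, (double)acc);
    return 0;
}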

CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. The 6 intermediate results are stored in shared memory by each thread (6 float arrays). However, I now need to add up the intermediate results to produce 6 sum values. I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library from the host code.
Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?
What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.
This seems unlikely:
I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
I can't imagine how you have enough space to store your data in shared memory but not in global memory.
Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
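For reference, a block-wide sum with CUB looks roughly like the sketch below (a block size of 1024 is assumed, as in your question; the final sum is only valid in thread 0, and you would repeat the call, with a __syncthreads() in between, for each of your 6 values):

#include <cub/cub.cuh>

// One partial sum per block: each thread contributes one float, thread 0 writes the result.
__global__ void blockSumKernel(const float *in, float *blockSums)
{
    typedef cub::BlockReduce<float, 1024> BlockReduce;
    __shared__ typename BlockReduce::TempStorage tempStorage;

    float myVal = in[blockIdx.x * blockDim.x + threadIdx.x]; // e.g. one intermediate result
    float sum = BlockReduce(tempStorage).Sum(myVal);         // block-wide reduction

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = sum;
}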
Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.
Or you could adapt the cuda sample code.
Update
After seeing all the comments, I understand that instead of doing one or a few reductions, you need to do 450x450x6 of them.
In this case there's a simpler solution.
You don't need to implement a relatively complex parallel reduction for each 1500-element vector. Since you already have 450x450x6 vectors to reduce, you can reduce all of these vectors in parallel, each with a traditional serial reduction.
You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.
In each thread, you iterate over the 1500 frames. In each iteration, you first compute the 6 intermediate results, then add them to the running sums. When you finish all the iterations, you write the 6 sums to global memory.
That finishes the kernel design, and no shared memory is needed.
You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply reading all the image cube data once.
In case you don't have enough global memory for the whole cube, you can split it into 4 sub-cubes of [1500][225][225] and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
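A sketch of that kernel layout follows. computeSix() is a stand-in for whatever produces the 6 intermediate results from one data element, and the cube is assumed to be laid out as [frame][y][x]:

#define FRAMES 1500
#define DIM    450

// Placeholder for the real per-element computation that yields 6 floats.
__device__ void computeSix(float v, float r[6])
{
    for (int i = 0; i < 6; ++i) r[i] = v * (i + 1);
}

__global__ void reducePerPixel(const float *cube, float *out /* DIM*DIM*6 floats */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= DIM || y >= DIM) return;

    float sums[6] = {0, 0, 0, 0, 0, 0};
    for (int f = 0; f < FRAMES; ++f) {
        float elem = cube[((size_t)f * DIM + y) * DIM + x];  // one element of the cube
        float r[6];
        computeSix(elem, r);
        for (int i = 0; i < 6; ++i) sums[i] += r[i];         // serial reduction per thread
    }
    for (int i = 0; i < 6; ++i)
        out[((size_t)i * DIM + y) * DIM + x] = sums[i];      // 6 sums per pixel, no shared memory
}

// Launch: dim3 block(16, 16); dim3 grid(29, 29); reducePerPixel<<<grid, block>>>(d_cube, d_out);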
Have a look at this, which explains parallel reduction in CUDA thoroughly.
If I understand it correctly, each thread should sum up "only" 6 floats.
I'm not sure if it is worth doing that by a parallel reduction in general, in the sense that you will experience performance gains.
If you are targeting a Kepler, you may try to use shuffle operations if you properly set the block size so that your intermediate results fit the Streaming Multiprocessor's registers in some way.
As also pointed out by Robert Crovella, your statement about the possibility of storing the intermediate results seems strange as the amount of global memory is certainly larger than the amount of shared memory.

Does CUDA automatically load-balance for you?

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques would you use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Shared scratch space for compacting this block's work items
// (BLOCK_SIZE is a placeholder for the block's thread count)
__shared__ int tmp[BLOCK_SIZE];
__shared__ int tmp_counter;

if (threadIdx.x == 0) tmp_counter = 0;  // reset the counter once per block
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme.
Since they all use the same instruction decoder, the threads within a warp execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently executing code path simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp; it will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels.
The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path. The reason this trade-off is taken so far for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process of computing each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all.
Incidentally, similar patterns (and often matrix multiplication itself) also appear commonly in scientific and engineering applications. This is how General Purpose processing on GPUs (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs, which were already being mass-produced for the gaming industry, could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executing branch is then lost. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
A warp has to be SIMD (single instruction, multiple data) to achieve optimal performance. The warps inside a block can be completely divergent from each other; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need load balance?
I don't think load balancing is the right phrase here. Just make sure that you always have enough threads being executed at all times and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could launch threads with index m = 0..N*0.05, each picking a random point and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from the global memory, I will show you why.
512 threads read 512 values from memory (1) and store them into shared memory (2); then, for step (3), 256 threads read random values from shared memory and store their results in shared memory as well. You do this until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having 256 threads read two values each in the initial step, or 128 threads read 4 values each, etc.
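A hedged sketch of that scheme, for |x0| = 1024 and a single block of 512 threads: f() stands in for the "complicated function", and the random 50% selection is replaced by a deterministic pick of one half at each stage, which keeps the shared-memory accesses race-free.

#include <math.h>

__device__ float f(float v) { return logf(v + 1.0f); }   // placeholder for the complicated function

__global__ void stagedShrink(const float *x0, float *out) // launch as stagedShrink<<<1, 512>>>(...)
{
    __shared__ float s[512];
    int tid = threadIdx.x;

    // (1)+(2): 512 threads read half of the 1024 inputs from global memory,
    // apply f, and stage the results in shared memory.
    s[tid] = f(x0[2 * tid]);
    __syncthreads();

    // (3): keep halving the active set without leaving shared memory.
    for (int active = 256; active >= 1; active >>= 1) {
        if (tid < active)
            s[tid] = f(s[tid + active]);   // each surviving thread picks one remaining value
        __syncthreads();
    }

    // (4): a single thread writes the final value back to global memory.
    if (tid == 0) *out = s[0];
}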

matrix multiplication in CUDA

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So each thread has a loop that multiplies one row by one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took (b). It wasn't ideal; in fact it was slow. Any idea why? My guess would be that there are just too many threads and they are waiting for resources most of the time. Is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarithmic number of adds. That's three memory accesses for a multiply, an add, and a bit more - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory-access-to-work ratio is poor.
The simple one-thread-per-dot-product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now, which is something your (b) approach at least had working for it.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 work to do in the multiplication, but only 3xN^2 elements - so you should be able to find a way to do far more than one computation per two-ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire little sub-matrices from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful one in lots of applications, because getting data - whether over a network, from main memory for a CPU, or via off-chip access for a GPU - often takes much longer than processing it.
There are documents on NVidia's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
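For illustration, a tiled kernel in that style looks roughly like the sketch below (TILE and the assumption that N is a multiple of TILE are simplifications for brevity; the SDK sample handles the general case):

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the A tile and one element of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // TILE multiply-adds per pair of loads: much better arithmetic intensity.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;   // one thread per output element, as in approach (a)
}

// Launch: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE); matMulTiled<<<grid, block>>>(A, B, C, N);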
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.