I am a CUDA beginner.
So far I have learned that each SM can have 8 blocks (of threads). Let's say I have the simple job of multiplying the elements of an array by 2. However, I have less data than threads.
Not a problem, because I could cut off the "tail" of threads and leave them idle. But if I understand correctly, this would mean some SMs would get 100% of the work, while others would get only part of it (or even none).
So I would like to calculate which SM is running a given thread and arrange the computation so that each SM gets an equal amount of work.
I hope this makes sense in the first place :-) If so, how do I compute which SM is running a given thread? Or -- the index of the current SM and the total number of SMs? In other words, the equivalent of blockDim/threadIdx in SM terms.
Update
It was too long for a comment.
Robert, thank you for your answer. While I try to digest it all, here is what I do -- I have a "big" array and I simply have to multiply the values by 2 and store them in an output array (as a warm-up; btw. all the computations I do are mathematically correct). So first I ran this with 1 block, 1 thread. Fine. Next, I tried to split the work in such a way that each multiplication is done just once, by one thread. As a result my program runs around 6 times slower. I even sense why -- a small penalty for fetching the info about the GPU, then computing how many blocks and threads I should use, and then within each thread, instead of a single multiplication, I now have around 10 extra multiplications just to compute the offset in the array for a thread. On one hand I am trying to find out how to change that undesired behaviour; on the other I would like to spread the "tail" of threads among SMs evenly.
I rephrase -- maybe I am mistaken, but I would like to solve this. I have 1G small jobs (*2, that's all) -- should I create 1K blocks with 1K threads, or 1M blocks with 1 thread, or 1 block with 1M threads, and so on? So far, I read the GPU properties, divide, divide, and blindly use the maximum values for each dimension of grid/block (or the required value, if there is not enough data to compute).
The code
size is the size of the input and output array. In general:
output_array[i] = input_array[i]*2;
Computing how many blocks/threads I need.
size_t total_threads = props.maxThreadsPerMultiProcessor
* props.multiProcessorCount;
if (size<total_threads)
total_threads = size;
size_t total_blocks = 1+(total_threads-1)/props.maxThreadsPerBlock;
size_t threads_per_block = 1+(total_threads-1)/total_blocks;
Having props.maxGridSize and props.maxThreadsDim, I compute in a similar manner the dimensions for blocks and threads -- from total_blocks and threads_per_block.
And then the killer part, computing the offset for a thread ("inside" the thread):
size_t offset = threadIdx.x;
size_t dim = blockDim.x;
offset += threadIdx.y*dim;
dim *= blockDim.y;
offset += threadIdx.z*dim;
dim *= blockDim.z;
offset += blockIdx.x*dim;
dim *= gridDim.x;
offset += blockIdx.y*dim;
dim *= gridDim.y;
size_t chunk = 1+(size-1)/dim;
So now I have the starting offset for the current thread, and the amount of data in the array (chunk) for multiplication. I didn't use gridDim.z above because, AFAIK, it is always 1, right?
It's an unusual thing to try to do. Given that you are a CUDA beginner, such a question seems to me to be indicative of attempting to solve a problem improperly. What is the problem you are trying to solve? How does it help your problem if you are executing a particular thread on SM X vs. SM Y? If you want maximum performance out of the machine, structure your work in a way such that all thread processors and SMs can be active, and in fact that there is "more than enough work" for all. GPUs depend on oversubscribed resources to hide latency.
As a CUDA beginner, your goals should be:
create enough work, both in blocks and threads
access memory efficiently (this mostly has to do with coalescing - you can read up on that)
There is no benefit to making sure that "each SM has an equal amount of work". If you create enough blocks in your grid, each SM will have an approximately equal amount of work. This is the scheduler's job, you should let the scheduler do it. If you do not create enough blocks, your first objective should be to create or find more work to do, not to come up with a fancy work breakdown per block that will yield no benefit.
Each SM in the Fermi GPU (for example) has 32 thread processors. In order to keep these processors busy even in the presence of inevitable machine stalls due to memory accesses and the like, the machine is designed to hide latency by swapping in another warp of threads (32) when a stall occurs, so that processing can continue. In order to facilitate this, you should try to have a large number of available warps per SM. This is facilitated by having:
many threadblocks in your grid (at least 6 times the number of SMs in the GPU)
multiple warps per threadblock (probably at least 4 to 8 warps, so 128 to 256 threads per block)
Since a (Fermi) SM is always executing 32 threads at a time, if I have fewer threads than 32 times the number of SMs in my GPU at any instant, then my machine is under-utilized. If my entire problem is only composed of, say, 20 threads, then it's simply not well designed to take advantage of any GPU, and breaking those 20 threads up into multiple SMs/threadblocks is not likely to have any appreciable benefit.
EDIT: In response to your update, I'll make a few more suggestions or comments.
You tried to modify some code, found that it runs slower, then jumped to (I think) the wrong conclusion.
You should probably familiarize yourself with a simple code example like vector add. It's not multiplying each element, but the structure is close. There's no way performing this vector add using a single thread would actually run faster. I think if you study this example, you'll find a straightforward way to extend it to do array element multiply-by-2.
Nobody computes threads per block the way you have outlined. First of all, threads per block should be a multiple of 32. Secondly, it's customary to pick threads per block as a starting point, and build your other launch parameters from it, not the other way around. For a large problem, just start with 256 or 512 threads per block, and dispense with the calculations for that.
Build your other launch parameters (grid size) based on your chosen threadblock size. Your problem is 1D in nature, so a 1D grid of 1D threadblocks is a good starting point. If this calculation exceeds the machine limit in terms of max blocks in x-dimension, then you can either have each thread loop to process multiple elements or else extend to a 2D grid (of 1D threadblocks).
Your offset calculation is needlessly complex. Refer to the vector add example about how to create a grid of threads with relatively simple offset calculation to process an array.
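For illustration, a minimal sketch of that structure applied to the multiply-by-2 case might look like the following (the kernel name and the d_in/d_out device pointers are made up for this example):

__global__ void times_two(const int *in, int *out, size_t size)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // simple 1D offset
    if (idx < size)                                              // idle the "tail" threads
        out[idx] = in[idx] * 2;
}

// Host side: pick the threadblock size first, then derive the grid size from it.
// (For very large sizes on pre-cc3.0 hardware, check props.maxGridSize[0] and
// fall back to a 2D grid or a per-thread loop, as described above.)
int threads_per_block = 256;
int blocks = (int)((size + threads_per_block - 1) / threads_per_block);
times_two<<<blocks, threads_per_block>>>(d_in, d_out, size);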
Related
In CUDA programming, threads and blocks have multiple directions (x, y and z).
Until now, I ignored this and only took into account the x direction (threadIdx.x, blockIdx.x, blockDim.x, etc.).
Apparently, both threads within a block and blocks on the grid are arranged as a cube. However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that? Only using the x direction, am I able to address all threads available to my GPU?
Only using the x direction, am I able to address all threads available to my GPU?
If we are talking about a desire to spin up ~2 trillion threads or less, then there is no particular requirement to use a multidimensional block, or grid. All CUDA GPUs of compute capability 3.0 and higher can launch up to about 2 billion blocks (2^31-1) with 1024 threads each, using a 1-D grid organization.
With methodologies like grid-stride loop it seems rare to me that more than ~2 trillion threads would be needed.
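For reference, a grid-stride loop lets a fixed-size 1D grid cover an arbitrarily large array; a minimal sketch (the kernel name is made up for illustration):

__global__ void scale(float *data, size_t n)
{
    // start at the global thread index, then stride by the total number of threads in the grid
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}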
I claim without formal proof that any problem that can be realized in a 1D grid can be realized in a 2D or 3D grid, or vice versa. This is just a mathematical mapping from one realization to another. Furthermore, it should be possible to arrange for important by-products like coalesced access in either realization.
There may be some readability benefits, code complexity benefits, and possibly small performance considerations when realizing in a 1D or multi-dimensional way. The usual case for this that I can think of is when the data to be processed is "inherently" multi-dimensional. In this case, letting the CUDA engine generate 2 or 3 distinct indices for you:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
might be simpler than using a 1D grid index, and computing 2D data indices from those:
int tid = threadIdx.x+blockDim.x*blockIdx.x;
int idx = tid%DATA_WIDTH;
int idy = tid/DATA_WIDTH;
(the integer division operation above is unavoidable in the general case. The modulo operation can be simplified by using the result from the integer division.)
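For example, once the division result is available, the modulo can be replaced with a multiply and a subtract:

int tid = threadIdx.x + blockDim.x*blockIdx.x;
int idy = tid / DATA_WIDTH;
int idx = tid - idy * DATA_WIDTH;   // same result as tid % DATA_WIDTH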
It's arguably an extra line of code and an extra division operation required to get to the same point, when only a 1D grid is created. However, I would suggest that even this is small potatoes, and you should use whichever approach seems most reasonable and comfortable to you as a programmer.
If for some reason you desire to spin up more than ~2 trillion threads, then moving to a multidimensional grid, at least, is unavoidable.
Apparently, both threads within a block and blocks on the grid are arranged as a cube.
To understand how the threadblock thread index is computed in any case, I refer you to the programming guide. It should be evident that one case can be made equivalent to another - each thread gets a unique thread ID no matter how you specify the threadblock dimensions. In my opinion, a threadblock should only be thought of as a "cube" of threads (i.e. 3-dimensional) if you specify the configuration that way:
dim3 block(32,8,4); //for example
However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that?
If you only used a single threadblock dimension to create a thread index in the 32,8,4 case:
int tid = threadIdx.x;
then you certainly would be "addressing" multiple threads (in y, and z) using that approach. That would typically, in my experience, be "broken" code. Therefore a kernel designed to use a multidimensional block or grid may not work correctly if the block or grid is specified as 1 dimensional, and the reverse statement is also true. You can find examples of such problems (thread index calculation not being correct for the grid design) here on the cuda tag.
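For completeness, the unique linear thread ID within a 3D threadblock (as described in the programming guide) would be computed from all three components:

int tid = threadIdx.x
        + threadIdx.y * blockDim.x
        + threadIdx.z * blockDim.x * blockDim.y;   // unique within the (32,8,4) block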
To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is up to the user; for example, it could mean (width, height) or (rows, cols), which have the order reversed.
So I did an experiment with the CUDA sample in the SDK, 3_Imaging/convolutionSeparable: I simply exchanged .x and .y in the kernel function and reversed the dimensions of the blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same; however, the performance decreases: the original one takes about 26 ms, while the modified one takes about 40 ms on my machine (SM 3.0).
My question is, what makes the difference? Is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, not convolutionColumns.
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to main memory, so the accesses are not as coalesced as in the original case.
You should think about GPU main memory the same way you would think about CPU memory: prefetching, blocking, sequential accesses... the same techniques apply in order to get performance. If you want to know more about this topic, the paper "What every programmer should know about memory" is mandatory reading. You'll find there an example comparing row-major and column-major access to the elements of a matrix.
To get an idea of how important this is, consider that most, if not all, high-performance GPU codes perform a matrix transposition before any computation in order to achieve more coalesced memory access, and this additional step is still worth it in terms of performance (sparse matrix operations, for instance).
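As a rough illustration of the coalescing difference (a sketch, not taken from the convolution sample; matrix, WIDTH, and the index names are made up), consider reading a row-major matrix:

// Coalesced: consecutive threads (consecutive threadIdx.x) read consecutive addresses within a row.
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
float v1 = matrix[row * WIDTH + col];

// Strided: consecutive threads read addresses WIDTH elements apart (down a column),
// so each warp touches many memory segments instead of one.
float v2 = matrix[col * WIDTH + row];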
Second, and this is more subtle, but in some scenarios it has a deep impact on the performance of a kernel: the launch configuration. It is not the same to launch 20 blocks of 10 threads as to launch 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers, ...). The more resources a thread needs, the fewer warps can be mapped onto a single SM, so the lower the occupancy... and, most of the time, the lower the performance.
This does not apply to your case, since the number of blocks and the number of threads per block stay the same.
When programming for GPUs you must be aware of the architecture in order to understand how such changes will modify the performance. Of course, I am not familiar with the code, so there will be other factors besides these two.
First question:
Suppose I need to launch a kernel with 229080 threads on a Tesla C1060 which has compute capability 1.3.
So according to the documentation this machine has 240 cores, with 8 cores on each streaming multiprocessor, for a total of 30 SMs.
I can use up to 1024 threads per SM for a total of 30720 threads running "concurrently".
Now if I define blocks of 256 threads that means I can have 4 blocks for each SM because 1024/256=4. So those 30720 threads can be arranged in 120 blocks across all SMs.
Now for my example of 229080 threads I would need 229080/256=~895 (rounded up) blocks to process all the threads.
Now let's say I want to call a kernel and I must use those 229080 threads, so I have two options. The first is to divide the problem so that I call the kernel ~8 times in a for loop with a grid of 120 blocks and 30720 threads each time (229080/30720). That way I make sure the device stays completely occupied. The other option is to call the kernel with a grid of 895 blocks for the entire 229080 threads, in which case many blocks will remain idle until an SM finishes with the 8 blocks it has.
So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?
Second question
Let's say that within the kernel I'm calling I need to access non-coalesced global memory, so an option is to use shared memory.
I can then use each thread to extract a value from an array in global memory, say global_array, which is of length 229080. Now, as I understand it, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.
The problem here is that for the 229080 threads I need exactly 229080/256=894.84375 blocks because there is a residue of 216 threads. I can round up that number and get 895 blocks and the last block will just use 216 threads.
But since I need to extract the values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses in global_array, how can I circumvent this problem while loading shared memory?
So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?
A single kernel is preferred according to what you describe. Threadblocks queued up but not assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. The overhead of 8 kernel calls will definitely be slower, all other things being equal.
Now, as I understand it, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.
This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:
The __syncthreads() is outside the branching construct, or,
The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)
Note that option 1 above is usually achievable, which makes code simpler to follow and easy to verify that all threads can reach the barrier.
But since I need to extract the values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses in global_array, how can I circumvent this problem while loading shared memory?
Do something like this:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);
if (idx < data_size)
shared[threadIdx.x] = global[idx];
__syncthreads();
This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.
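Wrapped in a complete (hypothetical) kernel, the pattern might look like the sketch below; the block size of 256 and the names process/global/result are assumptions for this example, and the post-barrier work is guarded by the same condition:

__global__ void process(const int *global, int *result, int data_size)
{
    __shared__ int shared[256];                 // assumes 256 threads per block
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < data_size)
        shared[threadIdx.x] = global[idx];      // out-of-range threads skip the load...
    __syncthreads();                            // ...but all threads reach the barrier
    if (idx < data_size)                        // ...and skip the work on missing elements
        result[idx] = 2 * shared[threadIdx.x];
}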
I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel, what do I need to load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques would you use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Shared scratch space for redistributing the work; the block size
// (256 here) is assumed to be known at compile time.
__shared__ int tmp_counter;
__shared__ int tmp[256];
if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA "core" is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA-core-to-instruction-decoder ratio varies between different CUDA compute capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will get no speedup from parallelism at all within that warp; it will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within a warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. Small cores like this are what allow GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path.
The reason this trade-off is made to such a high degree for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process of computing each output value of the resulting matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often matrix multiplication itself) also commonly appear in scientific and engineering applications. This is how general-purpose processing on GPUs (GPGPU) was born: CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs, which were already being mass-produced for the gaming industry, could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a Warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch will then be lost. You can check the CUDA Programming Guide, it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
A warp has to be SIMD (single instruction, multiple data) to achieve optimal performance. The warps inside a block can be completely divergent from each other; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need load balance?
I don't think load balancing is the right word here. Just make sure that you always have enough threads being executed at all times and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute N*0.05 threads, with m = 0..N*0.05, each picking a random element and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
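A rough sketch of that restructuring, with made-up names (apply_all, gather) and logf standing in for the complicated function; the sample indices for step 3 are assumed to be generated separately (host- or device-side):

// Step 2: apply the complicated function to every element of x.
__global__ void apply_all(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = logf(x[i]);              // stand-in for the "complicated function"
}

// Step 3: gather a random subset of y using precomputed indices.
__global__ void gather(const float *y, const int *sample_idx, float *out, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m)
        out[i] = y[sample_idx[i]];      // non-coherent reads, coherent writes
}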
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from global memory; I will show you why.
512 threads read 512 values from global memory (1) and store them into shared memory (2); then for step (3), 256 threads read random values from shared memory and store them back into shared memory. You repeat this until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having 256 threads read two values each in the initial step, or 128 threads read 4 values each, etc...
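Here is a loose sketch of that staged scheme, assuming |x0| = 1024, a single block of 512 threads, and a cheap hash (pick) standing in for a real RNG such as cuRAND; all names are made up for the illustration:

__device__ unsigned pick(unsigned seed, unsigned range)
{
    seed = seed * 1664525u + 1013904223u;     // LCG step, illustration only
    return seed % range;
}

__global__ void staged_log(const float *x0, float *x10, int n)   // n == 1024
{
    __shared__ float buf[512];
    unsigned tid = threadIdx.x;

    // (1)+(2): 512 threads each sample one input from global memory, apply the
    // function, and store the result in shared memory.
    buf[tid] = logf(x0[pick(tid, n)]);
    __syncthreads();

    // (3): halve the number of active threads each stage, always sampling from the
    // previous stage's results, which stay entirely in shared memory.
    for (int active = 256; active >= 1; active >>= 1) {
        float v = 0.0f;
        if (tid < active)
            v = logf(buf[pick(tid + active, 2 * active)]);
        __syncthreads();
        if (tid < active)
            buf[tid] = v;
        __syncthreads();
    }

    // (4): one thread writes the final value back to global memory.
    if (tid == 0)
        *x10 = buf[0];
}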
In my CUDA code, if I increase blocksizeX and blocksizeY it actually takes more time [therefore I run it at 1x1]. Also, a chunk of my execution time (e.g. 7 out of 9 s) is taken by just the call to the kernel. In fact I am quite amazed that even if I comment out the entire kernel the time is almost the same. Any suggestions where and how to optimize?
P.S. I have edited this post with my actual code. I am downsampling an image so that every 4 neighbouring pixels (e.g. pixels 1,2 from row 1 and 1,2 from row 2) give one output pixel. I get an effective bandwidth of 0.5 GB/s compared to the theoretical maximum of 86.4 GB/s. The time I use is the difference between calling the kernel with instructions and calling an empty kernel.
It looks pretty bad to me right now, but I can't figure out what I am doing wrong.
__global__ void streamkernel(int *r_d, int *g_d, int *b_d, int height, int width,
                             int *f_r, int *f_g, int *f_b)
{
    int id = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x
           + blockIdx.y*gridDim.x*blockDim.x*blockDim.y;
    int number = 2*(id%(width/2)) + (id/(width/2))*width*2;

    if (id < height*width/4)
    {
        f_r[id] = (r_d[number]+r_d[number+1]+r_d[number+width]+r_d[number+width+1])/4;
        f_g[id] = (g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1])/4;
        f_b[id] = (b_d[number]+b_d[number+1]+b_d[number+width]+b_d[number+width+1])/4;
    }
}
Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.
The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).
Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously, and maximum performance is reached exactly then. However, there are many factors that limit the number of blocks that can exist in parallel (incomplete list):
Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
...
You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance, even if the occupancy is not 1.0! From this point on you can try to iteratively make changes that reduce the number of executed blocks per MP but increase the occupancy of your kernel, and see if the performance increases.
The NVIDIA occupancy calculator (Excel sheet) will be of great help.
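If a newer CUDA toolkit (6.5 or later) is available, the runtime API can also report this directly; a minimal sketch, assuming a kernel named mykernel and 256 threads per block:

int blocks_per_sm = 0;
int threads_per_block = 256;
size_t dynamic_smem_bytes = 0;   // dynamically allocated shared memory per block, if any
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, mykernel,
                                              threads_per_block, dynamic_smem_bytes);
printf("resident blocks per multiprocessor: %d\n", blocks_per_sm);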