The clock64() device-side function in CUDA gives us some sort of clock ticks value. The documentation says:
when executed in device code, [clock64()] returns the value of a per-multiprocessor counter that is incremented every clock cycle.
A little program I wrote to examine clock64() behavior suggests that you get about the same initial value when you start a kernel at different points in (wall-clock) time (without rebooting the machine or "manually" resetting the device). For my specific case that seems to be about 5,200,000 to 6,400,000 for the first kernel a process starts. Also, the values increase slightly from SM to SM - though it's not clear the counters on different SMs should be related at all; and if they are related, perhaps they should actually be identical.
I also found that with the next kernel launch, the initial clock64() value increases - but then after some more kernel runs jumps down to a much lower value (e.g. 350,000 or so) and gradually climbs again. There doesn't seem to be a consistent pattern to this behavior (that I can detect with a few runs and manual inspection).
So, my questions are:
Does clock64() actually return clock ticks, or something else that's time-based?
In what ways is clock64() SM-specific, and in what ways are the values on different SMs related?
What resets/re-initializes the clock64() value?
Can I initialize the clock64() value(s) myself?
Does clock64() actually return clock ticks, or something else that's time-based?
clock64() reads a per-SM 64-bit counter (it actually returns a signed result, so 63 bits available). The clock source for this counter is the GPU core clock. The core clock frequency is discoverable using the deviceQuery sample code, for example. As an order-of-magnitude estimate, most CUDA GPUs I am familiar with have a clock period that is on the order of 1 nanosecond. If we multiply 2^63 by 1 nanosecond, I compute a counter rollover period of approximately 300 years.
In what ways is clock64() SM-specific, and in what ways are the values on different SMs related?
There is no guarantee that the counter in a particular SM has any defined relationship to a counter in another SM, other than that they will have the same clock period.
What resets/re-initializes the clock64() value?
The counter will be reset at some unspecified point, somewhere between machine power-on and the first point at which you access the counter for that SM. The counter may additionally be reset at any point when a SM is inactive, i.e. has no resident threadblocks. The counter should not be reset during any interval when one or more threadblocks are active on the SM.
Can I initialize the clock64() value(s) myself?
You cannot. You have no direct control over the counter value.
Are the basic arithmetic operations the same with respect to processor usage? For example, if I do an addition vs. a division in a loop, will the calculation time for the addition be less than that for the division?
I am not sure if this question belongs here or on Computer Science SE.
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and timing tables for a massively old architecture, the 8086. It is a fairly simple point to start from.
Of relevant note, the timings are measured in cycles (clocks), and everything moves at the speed of the CPU - operations are synchronized to the processor's main clock.
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles, far more than an addition.
Also note that operation speed is affected by which area of memory the operands reside in.
On modern processors, multiple instructions can execute concurrently (even if the CPU is single-threaded), some of them out of order, and vector instructions muddy the question even more: an SSE multiplication, for instance, can multiply several numbers in a single instruction.
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques you would use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
__shared__ int tmp_counter;
__shared__ int tmp[1024]; // sized for the maximum threads per block

// Reset the counter once per block
if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp: each of those 32 code paths will execute sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads follow the same code path.
The reason the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. This is what allows GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path. The reason this trade-off is made to such a high degree for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often matrix multiplication itself) also appear commonly in scientific and engineering applications. This is why General Purpose processing on GPUs (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought: a way to use hardware designs already being mass-produced for the gaming industry to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a warp, the scheduler takes all divergent branches and processes them one by one. The compute capacity of the threads not in the currently executing branch is lost during that time. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
Because a warp has to execute in SIMD (single instruction, multiple data) fashion to achieve optimal performance. The warps inside a block can diverge from each other freely; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need load balance?
I don't think load balancing is the right term here. Just make sure that you always have enough threads executing at all times and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could launch one thread for each m = 0..N*0.05, each picking a random index and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from the global memory, I will show you why.
512 threads read 512 values from global memory (1) and store them into shared memory (2). Then, for step (3), 256 threads read random values from shared memory and store their results in shared memory as well. You repeat this until you end up with one thread, which writes the final value back to global memory (4).
You could extend this further by at the initial step having 256 threads reading two values, or 128 threads reading 4 values, etc...
I have to find the execution time (in microseconds) of a small block of MIPS code, given that:
it will take a total of 30 cycles
total of 10 MIPS instructions
2.0 GHz CPU
That's all the information I am given to solve this question with (I already added up the total number of cycles, given the assumptions I am supposed to make about how many cycles different kinds of instructions take). I have been playing around with the formulas from the book trying to find the execution time, but I can't get an answer that seems right. What's the process for solving a problem like this? Thanks.
My best guess at interpreting your problem is that on average each instruction takes 3 cycles to complete. Because you were given the total number of cycles I'm not sure that the instruction count even matters.
You have a 2 GHz machine, so that is 2 * 10^9 cycles per second. This equates to each cycle taking 5 * 10^(-10) seconds (twice as fast as a 1 GHz machine, where each cycle takes 1 * 10^(-9) seconds).
We have 30 cycles to complete to run the program so...
30 * (5 * 10^(-10)) = 1.5 * 10^(-8) seconds, or 15 nanoseconds (0.015 microseconds), to execute all 10 instructions in 30 cycles.
The hadoop documentation states:
The right number of reduces seems to be 0.95 or 1.75 multiplied by
(no. of nodes * mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start
transferring map outputs as the maps finish. With 1.75 the faster
nodes will finish their first round of reduces and launch a second
wave of reduces doing a much better job of load balancing.
Are these values pretty constant? What are the results when you choose a value between these numbers, or outside of them?
The values should be what your situation needs them to be. :)
The below is my understanding of the benefit of the values:
The .95 is to allow maximum utilization of the available reducers. If Hadoop defaults to a single reducer, there will be no distribution of the reducing, causing it to take longer than it should. There is a near linear fit (in my limited cases) to the increase in reducers and the reduction in time. If it takes 16 minutes on 1 reducer, it takes 2 minutes on 8 reducers.
The 1.75 is a value that attempts to compensate for the performance differences among the machines in the cluster. It will create more than a single pass of reducers so that the faster machines will take on additional reducers while slower machines do not.
This figure (1.75) is one that will need to be adjusted much more to your hardware than the .95 value. If you have 1 quick machine and 3 slower, maybe you'll only want 1.10. This number will need more experimentation to find the value that fits your hardware configuration. If the number of reducers is too high, the slow machines will be the bottleneck again.
To add to what Nija said above, and also a bit of personal experience:
0.95 makes a bit of sense because you are utilizing the maximum capacity of your cluster, but at the same time, you are accounting for some empty task slots for what happens in case some of your reducers fail. If you're using 1x the number of reduce task slots, your failed reduce has to wait until at least one reducer finishes. If you're using 0.85, or 0.75 of the reduce task slots, you're not utilizing as much of your cluster as you could.
We can say that these numbers are not valid anymore. Now, according to the book "Hadoop: The Definitive Guide" and the Hadoop wiki, the target is for each reducer to run for about five minutes.
Fragment from the book:
Choosing the Number of Reducers
The single reducer default is something of a gotcha for new users to
Hadoop. Almost all real-world jobs should set this to a larger number;
otherwise, the job will be very slow since all the intermediate data
flows through a single reduce task.
Choosing the number of reducers for a job is more of an art than a
science. Increasing the number of reducers makes the reduce phase
shorter, since you get more parallelism. However, if you take this too
far, you can have lots of small files, which is suboptimal. One rule
of thumb is to aim for reducers that each run for five minutes or so,
and which produce at least one HDFS block's worth of output.