I know the execution time of a program = Instruction count x CPI x Clock cycles, but i don't know the CPI of each instruction in mars mips. Therefore, how to determine the CPI of each instruction or each type of instruction in mars mips.
Related
Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the
same instruction for all active threads in a warp just as before, retaining the execution efficiency
of previous architectures
From my understanding, this implies that if there is no divergence within threads of a warp,
(i.e. all threads of a warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without
any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate if someone can clarify this confusion?
From my understanding, this implies that if there is no divergence within threads of a warp, (i.e. all threads of a warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate if someone can clarify this confusion?
You are asking:
Why is the syncwarp needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp
A memory fence is needed in this case, to prevent the compiler from "optimizing" shared memory locations into registers.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken. Regardless of the way it behaves or what results it produces.
The volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
Do all I-type MIPS instructions take the same number of cycles in a multi-cycle datapath? I know R-type have the same number of cycles.
Do all I-type MIPS instructions take the same number of cycles in a multi-cycle datapath?
No.
First, let's look at some of the instructions are included in the MIPS I-Type category: addi and lw. These are both I-Type instructions — with the identical 16-bit immediate and rs and rt fields. They are decoded using the same fields, which is why they are both considered I-Type instructions.
Ok, next, let's look at the multicycle processor. This is not a pipelined processor, though, generally speaking, it will have a cycle for each stage in an equivalent pipelined version.
While we would generally find a pipelined processor's performance superior at the same megahertz, one advantage of a multicycle processor implementation over a pipelined processor implementation is that "stages" that are not needed can be skipped (skipping a cycle is not viable in a pipelined processor because of the instruction execution overlap; whereas the multicycle processor does not overlap execution of instructions).
So, among IF, ID, EX, MEM, WB stages, addi does not require or make use of the MEM stage, and thus it would be silly not to skip that cycle, making addi a 4 cycle instruction.
However, lw, does require the MEM stage (so all the stages) hence it will be 1 cycle longer than addi.
I'm given stages of a clock cycle in a processor.
IF ID EX MEM WB
250ps 350ps 150ps 300ps 200ps
Now I'm being asked what is the total latency of a LW instruction in a pipelined instruction.
Here's what I know:
The clock cycle time in a pipelined version is 350ps because that's the longest instruction.
The clock cycle time in a non-pipelined version is 1250ps because that's the duration of all the instructions added together.
But how does the "latency of a LW instruction" relate to those times?
Ok I'm pretty sure I figured out the answer which is you take the longest duration of the stages which in this case is 350ps and you multiply it by the amount of stages, in this case 5.
So
350 * 5 = 1750ps
Yes, you are correct with your result. Here is the formula:
(Number of Instructions)(Longest Instruction Time with (Unit)) = Latency(Unit)
Im a bit confuzed how many scalar chanels ( i mean "gpu simd width" x "gpu simd cores")
GPU have, for example my own GPU "nvidia geforce gt 610")
it has 48 shader processors (i hoppe each od such processor has separate SIMD
as a processing word), some say also that mosc common (?) gpu simd width is 32
floats/ints - so are my calculations right and it has just 48x32 = 1536 scalar
channels? (i mean when all shader processors are at work 1536 floats can be processed in one step)
The GT610 is a cc 2.1 GPU with a single SM. That SM contains 48 CUDA cores (=shader processors). Each CUDA core is capable of producing one single precision scalar result per clock cycle. Each CUDA core does not have a separate SIMD path to process a SIMD word. It processes one scalar element per clock cycle.
It has 48 scalar channels. 48 floats can be processed in one step, i.e. in one clock cycle.
The SIMT vector width of GT610 is 32, just as it is on all CUDA GPUs -- this is the "warp size". This means when a CUDA instruction is issued, it will be executed across 32 threads per instruction issue.
I'm querying the device capabilities with CUDA and I got the following information:
Device 0: Quadro K5000
Total Memory: 4294639616 bytes
Clock Rate: 705500 kilohertz
Max. Threads per Block: 1024
SM Count: 8
Execution Timeout Enabled: 1
Max. HW Texture Count: 128
TCC driver enabled: 0
CUDA Device Ordinal: 0
What does it mean the device "clock rate" field? Is it the clock for one SP (or cuda core) or something else?
It's the GPU clock, just like your CPU has a clock. Clock rate is a standard electrical engineering term: http://en.wikipedia.org/wiki/Clock_rate
705500kHz = 705.5MHz
Your computer's CPU also has a clock rate. When you buy a computer, the CPU's clock rate usually is advertised and for a long time, performance was directly proportional to the clock rate.
Roughly speaking the clock rate determines how many processing steps given a certain time a processor performs. However processing steps doesn't equal instructions.