Why are these branch lines marked partial? - language-agnostic

I have these lines marked as partial coverage. Why is that?
Full coverage report
What do those coverage numbers mean? How come there are 2 hits on the branch content, while the branch itself is marked 3/4?

According to the gcov report the branch coverage is as follows:
30: 140: if (obj->root)
branch 0 taken 11% (fallthrough)
branch 1 taken 89%
branch 2 taken 0% (fallthrough)
branch 3 taken 100%
Raw upload found in the Codecov Commit's Build tab.
In compiled languages a single source-level condition can expand into multiple machine-level branches; in this case there are 4. Codecov reports every branch that gcov emits, and in the report above only 3 of the 4 branch outcomes were ever exercised (branch 2 was taken 0% of the time), which is why the line is marked partial at 3/4. There is not much other data to go off.
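For illustration, here is a minimal C reproduction of how a single if line can yield four branch outcomes; the struct and the short-circuit condition are my own hypothetical stand-ins, not the original poster's code. A short-circuited test compiles to two conditional branches, each with a taken and a fallthrough outcome:

/* branches.c - hypothetical sketch: one `if` line, four branch outcomes */
#include <stdio.h>

struct node { int root; };

int main(void) {
    struct node n = { 1 };
    struct node *obj = &n;          /* never NULL, so one outcome stays at 0% */
    int count = 0;
    for (int i = 0; i < 10; i++) {
        if (obj && obj->root)       /* two machine-level branches on one line */
            count++;
        n.root = i % 2;             /* alternate the second condition */
    }
    printf("count = %d\n", count);
    return 0;
}

Compiling with gcc --coverage branches.c -o branches, running ./branches, and then running gcov -b branches.c prints per-branch probabilities in the same format as the report above; since one of the four outcomes (obj being NULL) never occurs, that branch is reported as taken 0% and the line shows up as partial.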

Related

winmips64 branch target buffer does 2-cycle stalls

So I have been searching for why, on winmips64, there are 2-cycle stalls each time the branch target buffer mispredicts, but I found nothing. It mispredicts the first time and the last time that bne runs. The first time it reports 2 cycles of branch-taken stall, and on the last, 2 cycles of branch-misprediction stall. Any ideas? (R12 is mentioned in another part of the code.)
lw R4,0(R3)
lw R8,0(R2)
dmul R8,R8,R4
daddi R3,R3,-8
daddi R11,R2,-8
dadd R9,R9,R8
daddi R2,R2,-8
bne R11,R12,loop
I don't know the winmips64 architecture specifically, i.e. how it is or isn't different from other MIPS pipelined implementations.  So, if someone else knows specifics, please correct me if I'm wrong.
That both the branch-taken and mispredict cases cost 2 cycles is consistent with a standard 5-stage pipeline, where the branch decision (taken/not taken) is fully resolved in the EX stage and thus has to flush the instructions in the prior stages, IF and ID, when it's wrong.
The first time the branch executes, it is unknown to the branch predictor, and the processor appears to move forward with an assumption of not taken; thus the message about the branch-taken stall.
The last time the branch executes, it is known to the branch predictor; however, the branch is not taken, which doesn't match the prediction, since the predictor assumes it will keep looping back as it has just been doing. Thus the misprediction message.
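To make that concrete, here is a small C sketch (my own model, not winmips64's actual logic) of a BTB that assumes not-taken for branches it has never seen and taken once the branch has an entry; over a counted loop it stalls exactly on the first and last executions of bne:

#include <stdio.h>
#include <stdbool.h>

#define STALL_CYCLES 2   /* IF and ID are flushed when EX resolves the branch */

int main(void) {
    bool in_btb = false;              /* the branch starts unknown to the BTB */
    int iterations = 5;               /* loop trip count; the value is arbitrary */
    for (int i = 1; i <= iterations; i++) {
        bool taken = (i < iterations);    /* bne falls through on the last pass */
        bool predicted_taken = in_btb;    /* unseen -> assume not taken */
        if (taken != predicted_taken)
            printf("iteration %d: mispredict, %d-cycle stall\n", i, STALL_CYCLES);
        in_btb = true;                    /* branch is in the BTB from now on */
    }
    return 0;
}

Running it reports a 2-cycle stall on iteration 1 (branch taken but predicted not taken) and on the final iteration (fall-through, but the BTB still predicts taken), matching the two stalls described above.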

How many instructions need to be killed on a mispredict in a 6-stage scalar or superscalar MIPS?

I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch mispredict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
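Here is the arithmetic as a short C sketch (my own summary of the worst cases above, assuming no bubbles and branch resolution at the end of X1):

#include <stdio.h>

int main(void) {
    int stages_before_x1 = 4;   /* F, D, I, X0 all hold younger instructions */
    for (int width = 1; width <= 3; width += 2) {          /* scalar, then 3-wide */
        for (int delay_slot = 1; delay_slot >= 0; delay_slot--) {
            /* worst case: the branch is oldest in its group, so (width - 1)
               younger instructions share its X1 slot and must die too; a
               delay slot saves exactly one instruction from being killed */
            int kills = stages_before_x1 * width + (width - 1) - delay_slot;
            printf("width=%d, delay slot=%s: worst case %d killed\n",
                   width, delay_slot ? "yes" : "no", kills);
        }
    }
    return 0;
}

This prints 3 and 4 for the scalar pipeline (with and without a delay slot) and 13 and 14 for the 3-wide version, matching the counts derived above.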
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.

Correlated branch prediction

I have this exercise related to correlated predictors that states the following:
A: BEQZ R1, D
…
D: BEQZ R1, F
…
F: NOT R1, R1
G: JUMP A
The prediction works as follows:
1. fetch the current instruction
2. if it is a branch, determine the current state of the predictor and predict the branch:
   a. the row is determined by the branch address (in this case either A or D)
   b. the column is determined by the current global shift register
   c. use the value in the cell to determine the prediction from the state machine (the current state is saved in the cell)
3. execute the branch and determine the actual decision (Taken: 1, Not Taken: 0):
   a. update the cell based on the current state and the actual decision
   b. update the global shift register (shift left and insert the actual decision bit on the right)
4. go to step 1
This is the solution
Solved exercise
I understood the scheme and know that a 2-bit predictor means fewer errors, but I cannot solve this question and I have trouble seeing how the solution was found; any help would be appreciated.
This is a variation of the Two-level adaptive predictor with global history table briefly described in Agner Fog's microarchitecture paper (page 15).
In this variant, the history register is shared across all branches however the pattern history table is local to a branch1.
The outcome of the last n (n = 2, in your case) branches is remembered (0 = Not taken, 1 = Taken), ordered from left to right in chronological order, forming an n-bit value that is used, along with the branch address2, to index a table of 2-bit saturating counters.
Each counter is incremented if the branch is taken and decremented otherwise (this is the canonical implementation, any 4-state FSA will do).
The meaning of each counter value is:
00b (0) Strongly not taken
01b (1) Weakly not taken
10b (2) Weakly taken
11b (3) Strongly taken
Saturating means that 3+1 (a strongly taken branch is taken again) = 3 and that 0-1 (a strongly not taken branch is again not taken) = 0, while normally arithmetic on registers is modulo 2^n.
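In C, the canonical 2-bit saturating counter update looks like the following sketch (my own illustration of the scheme just described):

#include <stdio.h>

/* 2-bit saturating counter: 0 = strongly not taken .. 3 = strongly taken */
static unsigned update_counter(unsigned state, int taken) {
    if (taken)
        return state < 3 ? state + 1 : 3;   /* 3 + 1 saturates to 3 */
    return state > 0 ? state - 1 : 0;       /* 0 - 1 saturates to 0 */
}

int main(void) {
    unsigned state = 1;                     /* weakly not taken, as in the exercise */
    state = update_counter(state, 1);       /* taken: 1 -> 2 (weakly taken) */
    state = update_counter(state, 1);       /* taken: 2 -> 3 (strongly taken) */
    state = update_counter(state, 1);       /* taken again: stays at 3 */
    printf("final state: %u\n", state);     /* prints 3 */
    return 0;
}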
In your exercise the assumptions are:
The pattern history table is given as a 2D-table with rows corresponding to the full address of the branch and columns to the value of the global history register.
All counters start in the state 01b (Weakly not taken).
The global history register is 00b at reset.
R1 is 0 at the beginning.
Let's see the first iteration only.
First iteration
The instruction is BEQZ R1, D (a branch, obviously), its address is A.
Since R1 is 0, the branch will be taken (towards D).
Indexing into the table with a global history of 00b and address A gives us the counter value 01b (Weakly not taken) thus the prediction is not taken.
Once the CPU has executed the branch and flushed the mispredicted instructions, the table must be updated.
Since the branch was taken, the counter is incremented from 01b to 10b.
Finally, the global history goes from 00b to 01b since the branch is taken (a 1 is shifted in from the right).
Note that the yellow highlighted entries are those read when the corresponding instruction is executed, while the green ones are those updated by the previous prediction.
Thus to see that the counter value has been incremented you have to look at the next row.
Since the branch was taken, the CPU is at D (BEQZ R1, F), this is exactly the same as before, only the global history register has value 01b.
After this instruction is executed the CPU is at F, so R1 becomes 111..11b (the solution just indicates it as 1) and the two above branches are re-executed.
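To check the bookkeeping, here is a small C simulation of the whole scheme (my own sketch: rows A and D, columns indexed by the 2-bit global history, all counters starting at weakly not taken); it reproduces the first iteration above and the ones that follow:

#include <stdio.h>

#define BRANCH_A 0
#define BRANCH_D 1

static unsigned pht[2][4];   /* rows: branch A/D, columns: global history 00..11 */
static unsigned ghr;         /* 2-bit global history register, 00 at reset */

static int predict(int branch) {
    return pht[branch][ghr] >= 2;              /* states 2,3 -> predict taken */
}

static void update(int branch, int taken) {
    unsigned *c = &pht[branch][ghr];
    if (taken) { if (*c < 3) (*c)++; }         /* saturating increment */
    else       { if (*c > 0) (*c)--; }         /* saturating decrement */
    ghr = ((ghr << 1) | (unsigned)taken) & 3u; /* shift the decision in from the right */
}

int main(void) {
    long r1 = 0;                               /* R1 starts at 0 */
    for (int row = 0; row < 2; row++)
        for (int col = 0; col < 4; col++)
            pht[row][col] = 1;                 /* all counters weakly not taken */

    for (int iter = 0; iter < 4; iter++) {     /* run the A..G loop a few times */
        int taken_a = (r1 == 0);               /* A: BEQZ R1, D */
        printf("A: predict %d, actual %d\n", predict(BRANCH_A), taken_a);
        update(BRANCH_A, taken_a);

        int taken_d = (r1 == 0);               /* D: BEQZ R1, F */
        printf("D: predict %d, actual %d\n", predict(BRANCH_D), taken_d);
        update(BRANCH_D, taken_d);

        r1 = ~r1;                              /* F: NOT R1, R1 */
    }
    return 0;
}

Since F flips R1 on every pass, both branches alternate between taken and not taken, which is exactly the kind of pattern the global history register lets the predictor eventually learn.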
1This is a simplification; the table is almost always a cache. It's impractical to have an entry for each possible memory address where a branch can be found.
2Part of the address is used as the index in the cache, once a set has been selected, the address is again compared with the tag in each way of the set.

How does Parity's Aura consensus protocol work?

Here there is a very high-level description with only formulas. I want to understand how it actually works.
I don't actually understand what a step is and what its use is. Does a node always keep updating the step? And when the time comes to create and broadcast a block, does it take the current step value and check whether it should broadcast or not?
What do you mean by "Blocks from more than 1 step into the future are rejected."? Does this mean that if the block time is 5 seconds then the next block's timestamp should be exactly 5 seconds higher?
And also, what happens when the next primary doesn't broadcast? How does the network deal with it? Shouldn't all the following blocks become invalid, since they won't follow a timestamp difference of 5 seconds?
AuRa is the name of Parity's Proof-of-Authority (PoA) consensus engine; the name comes from Authority Round (it used to be AuRo). It's used in the Kovan network.
PoA networks are permissioned, not public, by design. Only strictly defined authority nodes are allowed to seal blocks. This is very useful for test networks or enterprise networks where the native tokens on the blockchain do not hold any value and would therefore be easy to attack in a Proof-of-Work (PoW) or Proof-of-Stake (PoS) environment.
A step is one part of the authority round. Each authority can seal one block in each round. Let's say we have five authorities: 0x0a .. 0x0e. These would be the steps, as defined in the chain specification or in the dynamic validator contract:
Step 1: 0x0a seals a block
Step 2: 0x0b seals a block
Step 3: 0x0c seals a block
Step 4: 0x0d seals a block
Step 5: 0x0e seals a block
After the round is finished, it starts over again.
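For intuition, here is a C sketch of the step arithmetic (my own illustration based on Aura's published description, where the step is derived from the current UNIX time divided by the step duration, and the primary for a step is the step modulo the number of validators; the addresses and the 5-second duration are made up):

#include <stdio.h>
#include <time.h>

int main(void) {
    const char *validators[] = { "0x0a", "0x0b", "0x0c", "0x0d", "0x0e" };
    const long num_validators = 5;
    const long step_duration = 5;            /* seconds per step, e.g. block time */

    long t = (long)time(NULL);               /* current UNIX time */
    long step = t / step_duration;           /* every node derives the same step */
    long primary = step % num_validators;    /* whose turn it is to seal */

    printf("step %ld: %s may seal a block\n", step, validators[primary]);
    /* if the primary stays silent, the next step simply has a different
       primary, leaving a gap (doubled block time) in the chain */
    return 0;
}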
What do you mean by "Blocks from more than 1 step into the future are rejected."?
Now, if node 0x0c tried to seal a block right after 0x0a, this block would be more than 1 step into the future. Block sealing strictly relies on the step order of all authorities.
And also what happens when the next primary doesn't broadcast?
That's no problem, there will be a gap between two blocks, i.e., doubled block time. So if 0x0c notices that 0x0b is not providing a block in the specified time window, it can override this step with its own block and the round goes on. There are certain tolerances on the block timestamps to make sure the network does not stall.
In the screenshot above, you can see that two authorities in the Kovan network are not sealing blocks. The result is an increased block time between these steps.
Disclosure: I work for Parity.

Pipeline Hazards questions

I'm currently studying for an exam tomorrow and need some help understanding the following:
The following program is given:
ADDF R12, R13, R14
ADD R1,R8,R9
MUL R4,R2,R3
MUL R5,R6,R7
ADD R10,R5,R7
ADD R11,R2,R3
Find the potential conflicts that can arise if the architecture has:
a) No pipeline
b) A Pipeline
c) Multiple pipelines
So for (b) I would say the instruction on line 5 creates a data hazard, because it reads the value of R5, which is produced on the previous line as the result of a multiplication, so that instruction is not yet finished.
But what happens if an architecture doesn't have a pipeline? My best guess is that no hazards exist, but I'm not sure.
Also, what happens if it has 2 or more pipelines?
Cheers.
You are correct to suggest that for a) there are no hazards as each instruction must complete before the next starts.
For b):
There is a "Read After Write" dependency between lines 4 and 5.
There are "Read After Read" dependencies between lines 4 and 5 and also between lines 2 and 6.
I suspect that the difference between parts b) and c) is that the question assumes you know ahead of time that the pipeline has a well-defined number of stages. For example, we know that if the pipeline has 3 stages then the RAR dependency between lines 3 and 6 is irrelevant.
In a system with multiple pipelines, however, the processor could fetch, say, 4 instructions per cycle, turning dependencies that were formerly too far apart into potential hazards.
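As a small illustration (my own sketch, not part of the exercise), here is a brute-force scan for RAW dependencies in this program, encoding each instruction as one destination and two source registers; it flags exactly the line-4-to-line-5 dependency on R5:

#include <stdio.h>

struct insn { const char *text; int dst, src1, src2; };

int main(void) {
    /* the exercise's program: destination, source 1, source 2 */
    struct insn prog[] = {
        { "ADDF R12,R13,R14", 12, 13, 14 },
        { "ADD  R1,R8,R9",     1,  8,  9 },
        { "MUL  R4,R2,R3",     4,  2,  3 },
        { "MUL  R5,R6,R7",     5,  6,  7 },
        { "ADD  R10,R5,R7",   10,  5,  7 },
        { "ADD  R11,R2,R3",   11,  2,  3 },
    };
    int n = sizeof prog / sizeof prog[0];

    /* RAW: a later instruction reads a register an earlier one writes;
       a RAR scan would compare sources against sources instead */
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (prog[j].src1 == prog[i].dst || prog[j].src2 == prog[i].dst)
                printf("RAW between lines %d and %d on R%d\n",
                       i + 1, j + 1, prog[i].dst);
    return 0;
}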