Executing an unconditional branch instruction using the modified ID stage
 

  • The unconditional branch instruction does not use the flag values and will always update the PC with the value PC + offset

  • Executing the unconditional branch is easier to understand than executing a conditional branch

    (and therefore I will present it first)

  • I will use the following program to show you the branch delay using the modified ID stage:

         bra Label
         add r2, r1, r2
         add r3, r1, r3
         add r4, r1, r4       
         ...
      

    (The execution does not use data forwarding and I will omit the data forwarding circuits in the diagrams to keep the material simple)

Executing an unconditional branch instruction using the modified ID stage - Example

Start of cycle 1: IF stage is fetching the branch instruction:

 

End of cycle 1: bra Label is fetched into IR(ID)

Because a copy of the PC always travels into the ID stage along with the fetched instruction, the value of the PC is available as a source operand for computing PC + offset.

Start of cycle 2: the ID stage computes PC + offset while the IF stage fetches add r2,r1,r2

When executing an unconditional branch instruction, the MUX always selects PC + offset (= Label) as the new input to the PC
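The selection made by the MUX can be sketched as a small Python function. This is only an illustration, not the actual circuit: the names pc_mux, is_bra, pc_incremented and pc_id are hypothetical, and the numbers in the example calls are made up.

```python
def pc_mux(is_bra: bool, pc_incremented: int, pc_id: int, offset: int) -> int:
    """Select the value that is loaded into the PC at the end of the cycle.

    pc_id is the copy of the PC that was carried into the ID stage together
    with the instruction; pc_incremented is the normal "next instruction" PC.
    """
    branch_target = pc_id + offset      # adder in the modified ID stage
    # An unconditional branch ignores the flags: it always takes the target.
    return branch_target if is_bra else pc_incremented


print(pc_mux(True, 13, 12, 44))    # branch in ID: PC + offset is selected
print(pc_mux(False, 13, 12, 44))   # any other instruction: sequential PC
```

Note that the select signal depends only on the opcode, which is why the unconditional case is the simplest one to follow.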

End of cycle 2: the PC is updated to the address Label, and add r2,r1,r2 is fetched into IR(ID)

The CPU completed the branch in 2 clock cycles. There is a one-instruction branch delay: add r2,r1,r2 was fetched before the PC changed, so it will still be executed.
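The two cycles walked through above can be replayed with a minimal Python sketch of the IF and ID stages. It is a simplified model under stated assumptions: addresses count instructions, the branch sits at address 0, and Label is placed at the hypothetical address 10. The later pipeline stages are left out.

```python
# Hypothetical addresses: the branch sits at 0 and Label is placed at 10.
LABEL = 10
PROGRAM = {
    0: "bra Label",
    1: "add r2,r1,r2",
    2: "add r3,r1,r3",
    3: "add r4,r1,r4",
    LABEL: "(instruction at Label)",
}


def trace(program, label, cycles):
    """Return one (cycle, instruction in IF, instruction in IR(ID)) row per cycle."""
    pc = 0        # address the IF stage fetches from
    ir_id = None  # IR(ID): instruction held by the ID stage
    rows = []
    for cycle in range(1, cycles + 1):
        fetching = program.get(pc, "nop")
        rows.append((cycle, fetching, ir_id))
        # End of the cycle: the ID stage picks the next PC ...
        if ir_id is not None and ir_id.startswith("bra"):
            next_pc = label       # MUX selects PC + offset
        else:
            next_pc = pc + 1
        # ... and the instruction just fetched is latched into IR(ID).
        ir_id = fetching
        pc = next_pc
    return rows


for row in trace(PROGRAM, LABEL, 3):
    print(row)
```

In cycle 2 the branch is in the ID stage while add r2,r1,r2 is being fetched. In cycle 3 the fetch comes from Label, but add r2,r1,r2 is already in IR(ID): that is the one-instruction branch delay. add r3,r1,r3 is never fetched.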

DEMO (using Aaron's pipelined CPU)
 

  • Execute this command on a lab machine:

       /home/cs355001/demo/pipeline/6-speedup-bra     
      

    Program being executed:

         0:   10  62    // mov r1,#62
              18  1     // mov r2,#1
              26  1     // mov r3,#1
              34  1     // mov r4,#1
              42  1     // mov r5,#1
              50  1     // mov r6,#1
              58  1     // mov r7,#1
              0   0     // nop
              0   0     // nop
              0   0     // nop
              0   0     // nop
              0   0     // nop
       12:    192 44    // bra +44
              16  10    // add r2,r1,r2  (R2=R1+R2) <-- Only one instr is executed 
              24  11    // add r3,r1,r3  (R3=R1+R3) 
              32  12    // add r4,r1,r4  (R4=R1+R4)
              40  13    // add r5,r1,r5  (R5=R1+R5)
              48  14    // add r6,r1,r6  (R6=R1+R6)
              56  15    // add r7,r1,r7  (R7=R1+R7)
      
       56:     0   1    // <---- bra target   (56 = 111000 in binary)
               0   2
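The register values to expect from this demo can be estimated with a small assembly-level Python model. It is a sketch under several assumptions, not Aaron's simulator: addresses count instructions (so the bra is at 12), the offset is added to the address of the bra itself, the six adds follow the pattern add rN,r1,rN suggested by the machine code, and exactly one instruction after the bra is executed. The function name run and the tuple encoding are hypothetical.

```python
# Assembly-level model of the demo program (assumed decoding, see above).
PROGRAM = (
    [("mov", "r1", 62)]
    + [("mov", f"r{n}", 1) for n in range(2, 8)]
    + [("nop",)] * 5
    + [("bra", 44)]                                    # at address 12
    + [("add", f"r{n}", "r1", f"r{n}") for n in range(2, 8)]
)


def run(program):
    """Execute the program with a one-instruction branch delay."""
    regs = {f"r{n}": 0 for n in range(1, 8)}
    pc = 0
    branch_target = None   # set while the instruction after a bra executes
    while pc < len(program):
        op, *args = program[pc]
        next_pc = pc + 1
        if branch_target is not None:
            # This instruction is in the delay slot: after it, go to the target.
            next_pc, branch_target = branch_target, None
        if op == "mov":
            regs[args[0]] = args[1]
        elif op == "add":
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "bra":
            branch_target = pc + args[0]
        pc = next_pc       # the model simply stops once pc leaves the listing
    return regs


print(run(PROGRAM))
```

Under these assumptions only r2 changes after the branch (62 + 1 = 63), while r3 through r7 keep the value 1, which matches the "only one instr is executed" remark in the listing.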
      

Demo: delayed branching in a real (SPARC) CPU
 

  • NOTE: the branch delay can be avoided by stalling the IF stage (which prevents the CPU from fetching the instruction that follows the branch)

  • Most CPUs (e.g., ARM) avoid delayed branches

  • However, the SPARC CPU does use delayed branches

  • I can show you the delayed branch on my home machine that has a SPARC CPU:

        ~/.home2
      
        User: cs355001 (my own passwd)
      
        cd /home/cs355001/demo/delay-branch
      
        /home/cs255000/bin/as255s  sparc-DELAY-BRANCH
      
        /home/cs255000/bin/sparc   (load prog and run it)
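The difference between a delayed branch and stalling the IF stage can be sketched with a minimal two-stage Python model. It is only an illustration under assumptions: the stall is modeled as a bubble that replaces the instruction fetched behind the branch, the branch target is written directly as an address ("bra 10"), and the function name and the "bubble" marker are hypothetical.

```python
PROGRAM = {
    0: "bra 10",
    1: "add r2,r1,r2",
    2: "add r3,r1,r3",
    10: "(instruction at target)",
}


def instructions_reaching_id(program, stall_on_branch, max_cycles=5):
    """List what enters the ID stage, cycle by cycle, around one branch."""
    pc, ir_id, seen = 0, None, []
    for _ in range(max_cycles):
        fetched = program.get(pc)          # None past the end of the program
        if ir_id is not None:
            seen.append(ir_id)
        if ir_id is not None and ir_id.startswith("bra"):
            pc = int(ir_id.split()[1])     # ID stage redirects the PC
            # Stall: discard the fetch. Delayed branch: let it through.
            ir_id = "bubble" if stall_on_branch else fetched
        else:
            pc += 1
            ir_id = fetched
    return seen


print(instructions_reaching_id(PROGRAM, stall_on_branch=False))
print(instructions_reaching_id(PROGRAM, stall_on_branch=True))
```

In this model both variants spend one cycle between the branch and its target. With the delayed branch that cycle executes add r2,r1,r2, and with the stall it holds a bubble.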