Advanced Computer Architecture - Pipelining - Tran Ngoc Thinh

Implementation technique in which multiple
instructions are overlapped in execution
• Real-life pipelining examples?
– Laundry
– Factory production lines
– Traffic??

35 trang xuanthi 28/12/2022 980

Download

Bạn đang xem 20 trang mẫu của tài liệu "Advanced Computer Architecture - Pipelining - Tran Ngoc Thinh", để tải tài liệu gốc về máy hãy click vào nút Download ở trên.

File đính kèm:

advanced_computer_architecture_pipelining_tran_ngoc_thinh.pdf

Nội dung text: Advanced Computer Architecture - Pipelining - Tran Ngoc Thinh

3/19/2013 dce 2011 What is pipelining? • Implementation technique in which multiple instructions are overlapped in execution • Real-life pipelining examples? – Laundry – Factory production lines – Traffic?? 3 dce 2011 Instruction Pipelining (1/2) • Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. • An instruction execution pipeline involves a number of steps, where each step completes a part of an instruction. Each step is called a pipeline stage or a pipeline segment. • The stages or steps are connected in a linear fashion: one stage to the next to form the pipeline instructions enter at one end and progress through the stages and exit at the other end. • The time to move an instruction one step down the pipeline is is equal to the machine cycle and is determined by the stage with the longest processing delay. 4 2
3/19/2013 dce 2011 Sequential Laundry 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 T a A s k B O r d C e r D Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take? 7 dce 2011 Pipelined Laundry Start work ASAP 6 PM 7 8 9 10 11 Midnight Time 30 40 40 40 40 20 T a A s k O B r d C e r D Pipelined laundry takes 3.5 hours for 4 loads Speedup = 6/3.5 = 1.7 8 4
3/19/2013 dce 2011 Pipelining Example: Laundry • Pipelined Laundry Observations: – Speedup due to pipelining depends on the number of stages in the pipeline – Pipeline rate limited by slowest pipeline stage • If dryer needs 45 min , time for all stages has to be 45 min to accommodate it • Unbalanced lengths of pipe stages reduces speedup – Time to “fill” pipeline and time to “drain” it reduces speedup – If one load depends on another, we will have to wait (Delay/Stall for Dependencies) 11 dce 2011 CPU Pipelining • 5 stages of a MIPS instruction – Fetch instruction from instruction memory – Read registers while decoding instruction – Execute operation or calculate address, depending on the instruction type – Access an operand from data memory – Write result into a register • We can reduce the cycles to fit the stages. Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr 12 6
3/19/2013 dce 2011 CPU Pipelining: Example • Single-Cycle, non-pipelined execution •Total time for 3 instructions: 24 ns P r o g r a m e x e c u t io n 2 4 6 8 10 12 14 16 18 o r d e r Time (in in structio ns) Instruction Reg ALU Data Reg lw $ 1 , 1 0 0 ($0 ) fetch access Instruction Data 8 ns fetch Reg ALU access Reg lw $ 2 , 2 0 0 ($0 ) Instruction 8 ns fetch . . . lw $ 3 , 3 0 0 ($0 ) 8 n s 15 dce 2011 CPU Pipelining: Example • Single-cycle, pipelined execution – Improve performance by increasing instruction throughput – Total time for 3 instructions = 14 ns – Each instruction adds 2 ns to total execution time – Stage time limited by slowest resource (2 ns) – Assumptions: • Write to register occurs in 1st half of clock • Read from register occurs in 2nd half of clock P ro g ra m e x e c u t io n 2 4 6 8 1 0 1 2 1 4 T me o rd e r i ( in in st ru c tio n s) Instruction Data Reg ALU Reg lw $1, 100($0) fetch access Instruction Data 2 n s Reg ALU Reg lw $2, 200($0) fetch access Instruction Data lw 2 n s Reg ALU Reg $3, 300($0) fetch access 2 n s 2 n s 2 n s 2 n s 2 n s 16 8
3/19/2013 dce 2011 CPU Pipelining Example: (2/2) • If we have 3 consecutive instructions – Non-pipelined needs 8 x 3 = 24 ns – Pipelined needs 14 ns => Speedup = 24 / 14 = 1.7 • If we have1003 consecutive instructions – Add more time for 1000 instruction (i.e. 1003 instruction)on the previous example • Non-pipelined total time= 1000 x 8 + 24 = 8024 ns • Pipelined total time = 1000 x 2 + 14 = 2014 ns => Speedup ~ 3.98~ (8 ns / 2 ns] ~ near perfect speedup => Performance increases for larger number of instructions (throughput) 19 dce 2011 Pipelining MIPS Instruction Set • MIPS was designed with pipelining in mind => Pipelining is easy in MIPS: – All instruction are the same length – Limited instruction format – Memory operands appear only in lw & sw instructions – Operands must be aligned in memory 1. All MIPS instruction are the same length – Fetch instruction in 1st pipeline stage – Decode instructions in 2nd stage – If instruction length varies (e.g. 80x86), pipelining will be more challenging 20 10
3/19/2013 dce 2011 Pipelining MIPS Instruction Set 4. Operands must be aligned in memory – Transfer of more than one data operand can be done in a single stage with no conflicts – Need not worry about single data transfer instruction requiring 2 data memory accesses – Requested data can be transferred between the CPU & memory in a single pipeline stage 23 dce 2011 Instruction Pipelining Review – MIPS In-Order Single-Issue Integer Pipeline – Performance of Pipelines with Stalls – Pipeline Hazards • Structural hazards • Data hazards . Minimizing Data hazard Stalls by Forwarding . Data Hazard Classification . Data Hazards Present in Current MIPS Pipeline • Control hazards . Reducing Branch Stall Cycles . Static Compiler Branch Prediction . Delayed Branch Slot » Canceling Delayed Branch Slot 24 12
3/19/2013 dce 2011 Visualizing Pipelining Figure A.2, Page A-8 Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 I Ifetch Reg DMem Reg n ALU Write s destination IF ID EX MEM WB t register r. Ifetch Reg DMem Reg in first half ALU of WB cycle O r Ifetch Reg DMem Reg ALU d e r Ifetch Reg DMem Reg ALU Read operand registers in second half of ID cycle Operation of ideal integer in-order 5-stage pipeline dce 2011 Pipelining Performance Example • Example: For an unpipelined CPU: – Clock cycle = 1ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations with instruction frequencies of 40%, 20% and 40%, respectively. – If pipelining adds 0.2 ns to the machine clock cycle then the speedup in instruction execution from pipelining is: Non-pipelined Average instruction execution time = Clock cycle x Average CPI = 1 ns x ((40% + 20%) x 4 + 40%x 5) = 1 ns x 4.4 = 4.4 ns In the pipelined implementation five stages are used with an average instruction execution time of: 1 ns + 0.2 ns = 1.2 ns Speedup from pipelining = Instruction time unpipelined Instruction time pipelined = 4.4 ns / 1.2 ns = 3.7 times faster 28 14
3/19/2013 dce 2011 Stalls and performance • Stalls impede progress of a pipeline and result in deviation from 1 instruction executing/clock cycle • Pipelining can be viewed to: – Decrease CPI or clock cycle time for instruction – Let’s see what affect stalls have on CPI • CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction = 1 + Pipeline stall cycles per instruction • Ignoring overhead and assuming stages are balanced: CPI unpipelined Speedup 1 pipeline stall cycles per instruction 31 dce 2011 Even more pipeline performance issues! • This results in: Clock cycle unpipelined Clock cycle pipelined Pipeline depth Clock cycle unpipelined Pipeline depth • Which leads to: Clock cycle pipelined 1 Clock cycle unpipelined Speedup from pipelining 1 Pipeline stall cycles per instruction Clock cycle pipelined 1 Pipeline depth 1 Pipeline stall cycles per instruction • If no stalls, speedup equal to # of pipeline stages in ideal case 32 16
3/19/2013 dce 2011 How is it resolved? Mem Reg DM Reg Load ALU Mem Reg DM Reg Instruction 1 ALU Mem Reg DM Reg Instruction 2 ALU Stall Bubble Bubble Bubble Bubble Bubble Mem Reg DM Reg Instruction 3 ALU Time Pipeline generally stalled by inserting a “bubble” or NOP 35 dce 2011 Or alternatively Clock Number Inst. # 1 2 3 4 5 6 7 8 9 10 LOAD IF ID EX MEM WB Inst. i+1 IF ID EX MEM WB Inst. i+2 IF ID EX MEM WB Inst. i+3 stall IF ID EX MEM WB Inst. i+4 IF ID EX MEM WB Inst. i+5 IF ID EX MEM Inst. i+6 IF ID EX LOAD instruction “steals” an instruction fetch cycle which will cause the pipeline to stall. Thus, no instruction completes on clock cycle 8 36 18
3/19/2013 dce 2011 Data Hazards • Data hazards occur when the pipeline changes the order of read/write accesses to instruction operands in such a way that the resulting access order differs from the original sequential instruction operand access order of the unpipelined machine resulting in incorrect execution. • Data hazards may require one or more instructions to be stalled to ensure correct execution. • Example: ADD R1, R2, R3 SUB R4, R1, R5 AND R6, R1, R7 OR R8,R1,R9 XOR R10, R1, R11 – All the instructions after ADD use the result of the ADD instruction – SUB, AND instructions need to be stalled for correct execution. 39 dce 2011 Data Hazard on R1 Time (clock cycles) IF ID/RF EX MEM WB I Ifetch Reg DMem Reg add r1,r2,r3 ALU n s Ifetch Reg DMem Reg t sub r4,r1,r3 ALU r. Ifetch Reg DMem Reg O and r6,r1,r7 ALU r d Ifetch Reg DMem Reg e or r8,r1,r9 ALU r Ifetch Reg DMem Reg xor r10,r1,r11 ALU 20
3/19/2013 dce 2011 Forwarding to Avoid Data Hazard Time (clock cycles) I n Ifetch Reg DMem Reg add r1,r2,r3 ALU s t r. Ifetch Reg DMem Reg sub r4,r1,r3 ALU O r Ifetch Reg DMem Reg d and r6,r1,r7 ALU e r Ifetch Reg DMem Reg or r8,r1,r9 ALU Ifetch Reg DMem Reg xor r10,r1,r11 ALU dce 2011 Forwarding to Avoid LW-SW Data Hazard Time (clock cycles) Ifetch Reg DMem Reg I add r1,r2,r3 ALU n s Ifetch Reg DMem Reg t lw r4, 0(r1) ALU r. Ifetch Reg DMem Reg O sw r4,12(r1) ALU r d Ifetch Reg DMem Reg ALU e or r8,r6,r9 r Ifetch Reg DMem Reg xor r10,r9,r11 ALU 22
3/19/2013 dce 2011 Read after write (RAW) hazards • With RAW hazard, instruction j tries to read a source operand before instruction i writes it. • Thus, j would incorrectly receive an old or incorrect value • Graphically/Example: j i i: ADD R1, R2, R3 Instruction j is a Instruction i is a j: SUB R4, R1, R6 read instruction write instruction issued after i issued before j • Can use stalling or forwarding to resolve this hazard 47 dce 2011 Write after write (WAW) hazards • With WAW hazard, instruction j tries to write an operand before instruction i writes it. • The writes are performed in wrong order leaving the value written by earlier instruction • Graphically/Example: j i i: SUB R4, R1, R3 Instruction j is a Instruction i is a j: ADD R1, R2, R3 write instruction write instruction issued after i issued before j 48 24
3/19/2013 dce 2011 Data Hazard Even with Forwarding Time (clock cycles) Ifetch Reg DMem Reg I lw r1, 0(r2) ALU n s Ifetch Reg DMem Reg t sub r4,r1,r6 ALU r. Ifetch Reg DMem Reg O and r6,r1,r7 ALU r d e Ifetch Reg DMem Reg or r8,r1,r9 ALU r dce 2011 Data Hazard Even with Forwarding Time (clock cycles) Ifetch Reg DMem Reg I lw r1, 0(r2) ALU n s Ifetch Reg DMem Reg t sub r4,r1,r6 Bubble ALU r. Ifetch Bubble Reg DMem Reg O and r6,r1,r7 ALU r d Bubble Ifetch Reg DMem e or r8,r1,r9 ALU r 26
3/19/2013 dce 2011 Some example situations Situation Example Action No Dependence LW R1, 45(R2) No hazard possible because no ADD R5, R6, R7 dependence exists on R1 in the SUB R8, R6, R7 immediately following three instructions. OR R9, R6, R7 Dependence LW R1, 45(R2) Comparators detect the use of R1 in the ADD R5, R1, R7 ADD and stall the ADD (and SUB and OR) requiring stall SUB R8, R6, R7 before the ADD begins EX OR R9, R6, R7 Dependence LW R1, 45(R2) Comparators detect the use of R1 in SUB ADD R5, R6, R7 and forward the result of LOAD to the ALU overcome by SUB R8, R1, R7 in time for SUB to begin with EX forwarding OR R9, R6, R7 Dependence with LW R1, 45(R2) No action is required because the read of ADD R5, R6, R7 R1 by OR occurs in the second half of the accesses in order SUB R8, R6, R7 ID phase, while the write of the loaded data occurred in the first half. OR R9, R1, R7 55 dce 2011 Static Compiler Instruction Scheduling (Re-Ordering) for Data Hazard Stall Reduction • Many types of stalls resulting from data hazards are very frequent. For example: A = B + C produces a stall when loading the second data value (B). • Rather than allow the pipeline to stall, the compiler could sometimes schedule the pipeline to avoid stalls. • Compiler pipeline or instruction scheduling involves rearranging the code sequence (instruction reordering) to eliminate or reduce the number of stall cycles. Static = At compilation time by the compiler Dynamic = At run time by hardware in the CPU 56 28
3/19/2013 dce 2011 Control Hazards • When a conditional branch is executed it may change the PC and, without any special measures, leads to stalling the pipeline for a number of cycles until the branch condition is known (branch is resolved). – Otherwise the PC may not be correct when needed in IF • In current MIPS pipeline, the conditional branch is resolved in stage 4 (MEM stage) resulting in three stall cycles as shown below: Branch instruction IF ID EX MEM WB Branch successor stall stall stall IF ID EX MEM WB Branch successor + 1 IF ID EX MEM WB Branch successor + 2 3 stall cycles IF ID EX MEM Branch successor + 3 IF ID EX Branch successor + 4 IF ID Branch successor + 5 IF Assuming we stall or flush the pipeline on a branch instruction: Three clock cycles are wasted for every branch for current MIPS pipeline Branch Penalty = stage number where branch is resolved - 1 here Branch Penalty = 4 - 1 = 3 Cycles 59 dce Control Hazard on Branches 2011 Three Stage Stall Ifetch Reg DMem Reg 10: beq r1,r3,36 ALU Ifetch Reg DMem Reg 14: and r2,r3,r5 ALU Ifetch Reg DMem Reg 18: or r6,r1,r7 ALU Ifetch Reg DMem Reg 22: add r8,r1,r9 ALU Ifetch Reg DMem Reg 36: xor r10,r1,r11 ALU 30
3/19/2013 dce 2011 Pipelined MIPS Datapath Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc Access Back Next PC Next MUX Modified MIPS SEQ PC Adder Adder Pipeline: Conditional Zero? Branche 4 RS1 MEM/WB Address Memory EX/MEM Completed in Reg Reg File RS2 ID/EX ALU IF/ID Memory MUX ID Stage Data MUX Branch resolved in Sign Extend stage 2 (ID) Imm Branch Penalty = 2 WB Data WB - 1 = 1 RD RD RD • Interplay of instruction set design and cycle time. dce 2011 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken – Execute successor instructions in sequence – “Squash” instructions in pipeline if branch actually taken – Advantage of late pipeline state update – 47% MIPS branches not taken on average – PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken – 53% MIPS branches taken on average – But haven’t calculated branch target address in MIPS • MIPS still incurs 1 cycle branch penalty • Other machines: branch target known before outcome – What happens when hit not-taken branch? 32
3/19/2013 dce 2011 Delayed Branch • Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled • Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot – Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches – Growth in available transistors has made dynamic approaches relatively cheaper dce 2011 Evaluating Branch Alternatives Pipeline speedup = Pipeline depth 1 +Branch frequency Branch penalty Assume: 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken Scheduling Branch CPI speedup v.speedup v. scheme penalty unpipelinedstall Stall pipeline 3 1.60 3.1 1.0 Predict not taken1x0.04+3x0.10 1.34 3.7 1.19 Predict taken 1x0.14+2x0.061.26 4.0 1.29 Delayed branch 0.5 1.10 4.5 1.45 34