Advanced Computer Architecture - Superscalar and VLIW Processors - Tran Ngoc Thinh

Outline
• What is a Superscalar Architecture?
• Features of Superscalar Architectures
• Data Dependencies
• Policies for Parallel Instruction Execution
• Register Renaming
• VLIW Processors


5/29/2013 dce 2011

What is a Superscalar Architecture?
• A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently.
• Pipelining allows several instructions to be executed at the same time, but they have to be in different pipeline stages at a given moment.
• Superscalar architectures include all features of pipelining but, in addition, several instructions can be executing simultaneously in the same pipeline stage.

Superscalar Architectures
• In the example, one floating-point operation and two integer operations can be issued and executed simultaneously; each unit is pipelined and can execute several operations in different pipeline stages.

Limitations on Parallel Execution
• The situations that prevent instructions from being executed in parallel on a superscalar architecture are very similar to those that prevent efficient execution on any pipelined architecture.
• The consequences of these situations are more severe on superscalar architectures than on simple pipelines: the potential parallelism in superscalars is greater, so a greater opportunity is lost.

True Data Dependency
• A true data dependency exists when the output of one instruction is required as an input to a subsequent instruction:
    MUL R4,R3,R1    R4 ← R3 * R1
    - - - - - - - - - - - - - -
    ADD R2,R4,R5    R2 ← R4 + R5
• True data dependencies are intrinsic features of the user's program. They cannot be eliminated by compiler or hardware techniques.
• True data dependencies have to be detected and handled: the addition above cannot be executed before the result of the multiplication is available.
  – The simplest solution is to stall the adder until the multiplier has finished.
  – To avoid stalling the adder, the compiler or hardware can find other instructions for the adder to execute until the result of the multiplication is available.

Output Dependency
• An output dependency exists if two instructions write into the same location; if the second instruction writes before the first one, an error occurs:
    MUL R4,R3,R1    R4 ← R3 * R1
    - - - - - - - - - - - - - -
    ADD R4,R2,R5    R4 ← R2 + R5
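
The dependency definitions can be checked mechanically by comparing the destination and source registers of two instructions. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `dependencies` function are hypothetical names chosen for illustration. It also reports the antidependency case, which these slides introduce later under out-of-order issue.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical three-register instruction: dst <- srcs[0] op srcs[1]."""
    op: str
    dst: str
    srcs: tuple


def dependencies(first, second):
    """Return the kinds of dependency that `second` has on the earlier `first`."""
    kinds = set()
    if first.dst in second.srcs:
        kinds.add("true")    # second reads what first writes
    if first.dst == second.dst:
        kinds.add("output")  # both write the same register
    if second.dst in first.srcs:
        kinds.add("anti")    # second overwrites a register first still reads
    return kinds


mul = Instr("MUL", "R4", ("R3", "R1"))
add_true = Instr("ADD", "R2", ("R4", "R5"))
add_out = Instr("ADD", "R4", ("R2", "R5"))
add_anti = Instr("ADD", "R3", ("R2", "R5"))

print(dependencies(mul, add_true))  # {'true'}
print(dependencies(mul, add_out))   # {'output'}
print(dependencies(mul, add_anti))  # {'anti'}
```

The sketch only looks at register names; a real processor also has to track dependencies through memory locations.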

Policies for Parallel Instruction Execution
• The ability of a superscalar processor to execute instructions in parallel is determined by:
  1. the number and nature of parallel pipelines (this determines the number and nature of instructions that can be fetched and executed at the same time);
  2. the mechanism that the processor uses to find independent instructions (instructions that can be executed in parallel).
• The policies used for instruction execution are characterized by the following two factors:
  1. the order in which instructions are issued for execution;
  2. the order in which instructions are completed (they write results into registers and memory locations).
• The simplest policy is to execute and complete instructions in their sequential order. This, however, gives little chance of finding instructions that can be executed in parallel.
• To improve parallelism, the processor has to look ahead and try to find independent instructions to execute in parallel. Instructions will then be executed in an order different from the strictly sequential one, with the restriction that the result must be correct.
• Execution policies:
  1. In-order issue with in-order completion.
  2. In-order issue with out-of-order completion.
  3. Out-of-order issue with out-of-order completion.

In-Order Issue with In-Order Completion
• The processor detects and handles (by stalling) true data dependencies and resource conflicts.
• As instructions are issued and completed in their strict order, the resulting parallelism depends heavily on the way the program is written or compiled.
  – If I3 and I6 switch position, the pairs I6-I4 and I5-I3 can be executed in parallel (see the sequence below).
• We are interested in techniques which are not compiler based but allow the hardware alone to detect instructions which can be executed in parallel and to issue them.
• If the compiler generates this sequence:
    I1: ADDF R12,R13,R14    R12 ← R13 + R14 (floating point)
    I2: ADD R1,R8,R9        R1 ← R8 + R9
    I6: ADD R11,R2,R3       R11 ← R2 + R3
    I4: MUL R5,R6,R7        R5 ← R6 * R7
    I5: ADD R10,R5,R7       R10 ← R5 + R7
    I3: MUL R4,R2,R3        R4 ← R2 * R3
• I6-I4 and I5-I3 could be executed in parallel.
• The sequence needs only 6 cycles instead of 8.
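
Why the reordered sequence pairs up better can be illustrated with a small check. This is a rough sketch under assumptions that are not stated in the extracted slides: instructions are fetched two at a time, and the machine has one floating-point adder, one integer adder and one integer multiplier, so two instructions needing the same unit conflict. The original order I1..I6 is inferred from the remark that I3 and I6 switch position. The names `Instr`, `can_pair` and `pairable` are hypothetical. The sketch does not model latencies, so it does not reproduce the 6 and 8 cycle counts.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    name: str
    unit: str   # functional unit the instruction needs (assumed machine model)
    dst: str
    srcs: tuple


def can_pair(a, b):
    """True if b can be issued in the same cycle as the earlier instruction a."""
    resource_conflict = a.unit == b.unit  # assumed: only one unit of each kind
    true_dep = a.dst in b.srcs            # b needs a's result
    output_dep = a.dst == b.dst           # both write the same register
    return not (resource_conflict or true_dep or output_dep)


def pairable(seq):
    """Check each fetched pair (instructions 0-1, 2-3, 4-5, ...)."""
    return [(a.name, b.name, can_pair(a, b)) for a, b in zip(seq[0::2], seq[1::2])]


I1 = Instr("I1", "fadd", "R12", ("R13", "R14"))
I2 = Instr("I2", "add", "R1", ("R8", "R9"))
I3 = Instr("I3", "mul", "R4", ("R2", "R3"))
I4 = Instr("I4", "mul", "R5", ("R6", "R7"))
I5 = Instr("I5", "add", "R10", ("R5", "R7"))
I6 = Instr("I6", "add", "R11", ("R2", "R3"))

original = [I1, I2, I3, I4, I5, I6]
reordered = [I1, I2, I6, I4, I5, I3]

print(pairable(original))   # I3-I4 and I5-I6 cannot pair
print(pairable(reordered))  # every pair can be issued together
```

In the original order, I3-I4 both need the multiplier and I5-I6 both need the adder, so only I1-I2 can go together; after the swap each pair uses two different units and has no dependency inside the pair.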

Out-of-Order Issue with Out-of-Order Completion
• We consider the instruction sequence above.
• I6 can now be issued before I5 and in parallel with I4; the sequence takes only 6 cycles (compared to 8 with in-order issue and in-order completion).
• With out-of-order issue and out-of-order completion, the processor has to take care of true data dependencies as well as both output dependencies and antidependencies.
• Output dependency can be violated (the addition completes before the multiplication):
    MUL R4,R3,R1    R4 ← R3 * R1
    - - - - - - - - - - - - - -
    ADD R4,R2,R5    R4 ← R2 + R5
• Antidependency can be violated (the operand in R3 is used after it has been overwritten):
    MUL R4,R3,R1    R4 ← R3 * R1
    - - - - - - - - - - - - - -
    ADD R3,R2,R5    R3 ← R2 + R5
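
Output dependencies and antidependencies come from reusing register names, not from the flow of data, which is why the register renaming listed in the outline can remove them. The following is a minimal sketch of the idea, assuming an unlimited pool of physical registers named `P0`, `P1`, ...; the function `rename` and the fourth instruction (`ADD R6,R4,R3`) are hypothetical additions for illustration and do not come from the slides.

```python
import itertools


def rename(program, arch_regs):
    """Give every written value a fresh physical register.

    program: list of (op, dst, src1, src2) tuples in program order.
    arch_regs: architectural register names that may appear in the program.
    """
    mapping = {r: r for r in arch_regs}       # architectural -> current name
    fresh = (f"P{i}" for i in itertools.count())
    renamed = []
    for op, dst, src1, src2 in program:
        s1, s2 = mapping[src1], mapping[src2]  # read current names first
        new_dst = next(fresh)                  # then allocate the destination
        mapping[dst] = new_dst
        renamed.append((op, new_dst, s1, s2))
    return renamed


program = [
    ("MUL", "R4", "R3", "R1"),
    ("ADD", "R3", "R2", "R5"),  # antidependency on R3 with the MUL
    ("ADD", "R4", "R2", "R5"),  # output dependency on R4 with the MUL
    ("ADD", "R6", "R4", "R3"),  # hypothetical: true dependencies on both ADDs
]

for instr in rename(program, [f"R{i}" for i in range(1, 8)]):
    print(instr)
# ('MUL', 'P0', 'R3', 'R1')
# ('ADD', 'P1', 'R2', 'R5')
# ('ADD', 'P2', 'R2', 'R5')
# ('ADD', 'P3', 'P2', 'P1')
```

After renaming, no two instructions write the same register and no instruction overwrites a register that an earlier one still reads, while the last instruction still reads the results of the two additions, so the true dependencies are preserved.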

Some Architectures
PowerPC 604
• six independent execution units:
  – Branch execution unit, Load/Store unit
  – 3 Integer units, Floating-point unit
• in-order issue
PowerPC 620
• provides, in addition to the 604, out-of-order issue
Pentium
• three independent execution units: 2 Integer units, Floating-point unit
• in-order issue
Pentium II
• provides, in addition to the Pentium, out-of-order issue
• five instructions can be issued in one cycle

What is Good and what is Bad with Superscalars?
Good
• The hardware solves everything:
  – Hardware detects potential parallelism between instructions;
  – Hardware tries to issue as many instructions as possible in parallel;
  – Hardware performs register renaming.
• Binary compatibility
  – If functional units are added in a new version of the architecture, or some other improvements have been made to the architecture (without changing the instruction set), old programs can benefit from the additional potential for parallelism.
  – Why? Because the new hardware will issue the old instruction sequence in a more efficient way.

VLIW Processors
• Detection of parallelism and packaging of operations into instructions is done off-line, by the compiler.

Advantages and Problems with VLIW Processors
Advantages
• Simpler hardware:
  – the number of functional units (FUs) can be increased without needing additional sophisticated hardware to detect parallelism, as in superscalars.
  – Power consumption can be reduced.
• Good compilers can detect parallelism based on global analysis of the whole program (no window of execution problem).

[Figure: successive instructions plotted against time in base cycles]
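
The compiler-side packaging step can be sketched as a greedy scheduler that places each operation into the earliest long instruction word that respects its dependencies and still has a free slot. This is an illustrative sketch only, not the algorithm of any particular VLIW compiler: the function `pack_vliw`, the slot layout (one floating-point adder, one integer adder, one multiplier) and the one-cycle latency for every operation are all assumptions, and any dependency (true, output or anti) is conservatively forced into a later word.

```python
def pack_vliw(ops, slots):
    """Greedily pack operations into long instruction words.

    ops: list of (name, unit, dst, srcs) in program order.
    slots: dict mapping unit kind -> number of slots of that kind per word.
    Assumes latency 1: a dependent operation goes in a later word
    than the operation it depends on.
    """
    words = []    # each word: dict unit -> list of operation names
    placed = {}   # index of operation -> index of the word holding it
    for i, (name, unit, dst, srcs) in enumerate(ops):
        earliest = 0
        for j in range(i):
            _, _, prev_dst, prev_srcs = ops[j]
            depends = prev_dst in srcs or prev_dst == dst or dst in prev_srcs
            if depends:
                earliest = max(earliest, placed[j] + 1)
        w = earliest
        while True:
            if w == len(words):
                words.append({u: [] for u in slots})  # open a new word
            if len(words[w][unit]) < slots[unit]:
                words[w][unit].append(name)
                placed[i] = w
                break
            w += 1
    return words


ops = [
    ("I1", "fadd", "R12", ("R13", "R14")),
    ("I2", "add", "R1", ("R8", "R9")),
    ("I3", "mul", "R4", ("R2", "R3")),
    ("I4", "mul", "R5", ("R6", "R7")),
    ("I5", "add", "R10", ("R5", "R7")),
    ("I6", "add", "R11", ("R2", "R3")),
]

for word in pack_vliw(ops, {"fadd": 1, "add": 1, "mul": 1}):
    print(word)
# {'fadd': ['I1'], 'add': ['I2'], 'mul': ['I3']}
# {'fadd': [], 'add': ['I6'], 'mul': ['I4']}
# {'fadd': [], 'add': ['I5'], 'mul': []}
```

Because the packer sees the whole operation list before anything executes, it can move I6 ahead of I5 without any run-time hardware, which is the point the slide makes about off-line detection of parallelism.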