Advanced Computer Architecture - Introduction - Tran Ngoc Thinh
Introduction – Brief history of computers – Basic concepts of computer architecture. Instruction Set Principles – Classifying Instruction Set Architectures – Addressing Modes, Type and Size of Operands – Operations in the Instruction Set, Instructions for Control Flow, Instruction Format – The Role of Compilers
dce 2010

Administrative Issues (cont.)
• Grades
  – 10% homework
  – 20% presentations
  – 20% midterm exam
  – 50% final exam

Administrative Issues (cont.)
• Personnel
  – Instructor: Dr. Tran Ngoc Thinh
    • Email: tnthinh@cse.hcmut.edu.vn
    • Phone: 8647256 (5843)
    • Office: A3 building
    • Office hours: Thursdays, 09:00-11:00
  – TA: Mr. Tran Huy Vu
    • Email: vutran@cse.hcmut.edu.vn
    • Phone: 8647256 (5843)
    • Office: A3 building
    • Office hours:
Course Coverage
• Memory Hierarchy Design
  – Memory hierarchy
  – Cache memories
  – Virtual memory
  – Memory management
• Superscalar Architectures
  – Instruction-level parallelism and machine parallelism
  – Hardware techniques for performance enhancement
  – Limitations of the superscalar approach
• Vector Processors

Course Requirements
• Computer Organization & Architecture
  – Combinational/sequential logic, processor, memory, assembly language
• Data Structures / Algorithms
  – Complexity analysis, efficient implementations
• Operating Systems
  – Task scheduling, management of processors, memory, input/output devices
Levels of Abstraction
[Figure: stack of abstraction layers labeled Applications, Operating System, Compiler, Firmware, Instruction Set Architecture, Instruction Set Processor, I/O System, Datapath & Control, Digital Design, Circuit Design, Layout]
• S/W and H/W consist of hierarchical layers of abstraction; each layer hides the details of the lower layers from the layer above it
• The instruction set architecture abstracts the H/W and S/W interface and allows many implementations of varying cost and performance to run the same S/W

The Task of Computer Designer
• Determine what attributes are important for a new machine
• Design a machine to maximize cost-performance
• What are these tasks?
  – Instruction set design
  – Functional organization
  – Logic design
  – Implementation
    • IC design, packaging, power, cooling, ...
The Processor Chip
[Figure]

Intel 4004 Die Photo
• Introduced in 1971
  – First microprocessor
• 2,250 transistors
• 12 mm²
• 108 kHz
Pentium Die Photo
• 3,100,000 transistors
• 296 mm²
• 60 MHz
• Introduced in 1993
  – 1st superscalar implementation of IA32

Pentium III
• 9,500,000 transistors
• 125 mm²
• 450 MHz
• Introduced in 1999
Price Trends (Pentium III)
[Figure]

Price Trends (DRAM memory)
[Figure]
Limiting Force: Power Density
[Figure]

Crossroads: Conventional Wisdom in Comp. Arch
• Old Conventional Wisdom (CW): Power is free, transistors are expensive
• New CW: "Power wall": power is expensive, transistors are free (can put more on a chip than can afford to turn on)
• Old CW: ILP can be increased sufficiently via compilers and innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": law of diminishing returns on more HW for ILP
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory is slow, multiplies are fast (200 clock cycles to DRAM memory, 4 clocks for a multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  – Uniprocessor performance now 2X / 5(?) yrs
  – Sea change in chip design: multiple "cores" (2X processors per chip / ~2 years)
    • More power efficient to use a large number of simpler processors rather than a small number of complex processors
The End of the Uniprocessor Era
[Figure]
Single biggest change in the history of computing systems

The End of the Uniprocessor Era
• Multiprocessors imminent in 1970s, '80s, '90s, ...
• "today's processors are nearing an impasse as technologies approach the speed of light" – David Mitchell, The Transputer: The Time Is Now (1989)
• Custom multiprocessors strove to lead uniprocessors
  – Procrastination rewarded: 2X sequential perf. / 1.5 years
• "We are dedicating all of our future product development to multicore designs. This is a sea change in computing" – Paul Otellini, President, Intel (2004)
• Difference is that all microprocessor companies switch to multicore (AMD, Intel, IBM, Sun; all new Apples 2-4 CPUs)
  – Procrastination penalized: 2X sequential perf. / 5 yrs
  – Biggest programming challenge: going from 1 to 2 CPUs
Computer Design Cycle
[Figure: design cycle, step 1 (Performance): Evaluate Existing Systems for Bottlenecks, using Benchmarks; other labels: Technology and cost]
The computer design is evaluated for bottlenecks using selected benchmarks in order to achieve optimum performance.

Performance (Metric)
• Time/Latency: the wall-clock or CPU elapsed time.
• Throughput: the number of results per second.
Other measures such as MIPS, MFLOPS, clock frequency (MHz), and cache size are not meaningful measures of performance on their own.
Technology Trends: Computer Generations
• Vacuum tube, 1946-1957 (1st Gen.)
• Transistor, 1958-1964 (2nd Gen.)
• Small scale integration, 1965-1968
  – Up to 100 devices/chip
• Medium scale integration, 1969-1971 (3rd Gen.)
  – 100-3,000 devices/chip
• Large scale integration, 1972-1977
  – 3,000-100,000 devices/chip
• Very large scale integration, 1978 on (4th Gen.)
  – 100,000-100,000,000 devices/chip
• Ultra large scale integration
  – Over 100,000,000 devices/chip

Computer Design Cycle
[Figure: design cycle with stages 1: Performance, 2: Technology, 3: Cost; this step: Implement Next Generation System, subject to Implementation Complexity]
Systems are implemented using the latest technology to obtain a cost-effective, high-performance solution; implementation complexity is given due consideration.
Price vs. Cost
[Figure: stacked bar chart, 0% to 100%, for Mini, W/S, and PC; segments: Average Discount, Gross Margin, Direct Costs, Component Costs]
• List Price:
  – Amount for which the finished good is sold
  – Includes an average discount of 15% to 35%, taken as volume discounts and/or retailer markup

Cost-effective IC Design: Price-Performance Design
• Yield: percentage of manufactured components surviving testing
• Volume: higher manufacturing volume decreases the list price and improves purchasing efficiency
• Feature Size: the minimum size of a transistor or wire in either the x or y direction
Cost of Integrated Circuits
• Die: the square area of the wafer containing the integrated circuit
  – When dies are fitted on the wafer, the small wafer area around the periphery goes to waste
• Cost of a die: determined from the cost of a wafer, the number of dies that fit on a wafer, and the percentage of dies that work, i.e., the die yield

Cost of Integrated Circuits
• The cost of an integrated circuit is the ratio of the total cost (the sum of the die cost, the die testing cost, the packaging cost, and the final chip testing cost) to the final test yield:

$$\text{Cost of IC} = \frac{\text{die cost} + \text{die testing cost} + \text{packaging cost} + \text{final testing cost}}{\text{final test yield}}$$

• The cost of a die is the ratio of the cost of the wafer to the product of the dies per wafer and the die yield:

$$\text{Die cost} = \frac{\text{cost of wafer}}{\text{dies per wafer} \times \text{die yield}}$$
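The two cost formulas above can be sketched as small functions. This is a minimal illustration only: the function names and all dollar figures and yields below are hypothetical and do not come from the slides.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = cost of wafer / (dies per wafer x die yield)."""
    return wafer_cost / (dies_per_wafer * die_yield)


def ic_cost(die, die_testing, packaging, final_testing, final_test_yield):
    """IC cost = sum of the four cost terms / final test yield."""
    return (die + die_testing + packaging + final_testing) / final_test_yield


# Hypothetical numbers: a $5000 wafer holding 100 dies, half of which work.
example_die = die_cost(wafer_cost=5000.0, dies_per_wafer=100, die_yield=0.5)

# Hypothetical testing and packaging costs, with 90% final test yield.
example_ic = ic_cost(example_die, die_testing=10.0, packaging=20.0,
                     final_testing=5.0, final_test_yield=0.9)

print(f"die cost: ${example_die:.2f}, IC cost: ${example_ic:.2f}")
```

Both yields appear in a denominator, so halving either yield doubles the corresponding cost.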
Volume vs. Cost
• Rule of thumb on applying the learning curve to manufacturing: "When volume doubles, costs reduce 10%"
  A DEC View of Computer Engineering by C. G. Bell, J. C. Mudge, and J. E. McNamara, Digital Press, Bedford, MA, 1978.

         1990        1992        1994        1997
  PC     23,880,898  33,547,589  44,006,000  65,480,000
  WS     407,624     584,544     679,320     978,585
  Ratio  59          57          65          67

• 2^x = 65 => x ≈ 6.0
• Since each doubling of volume reduces cost by 10%, cost falls to (0.9)^6.0 = 0.53 of the original. PC costs are 47% less than workstation costs for the whole market.

High Margins on High-End Machines
• R&D is considered at a return on investment (ROI) of 10%
  – Every $1 of R&D must generate $7 to $13 in sales
• High-end machines need more $ for R&D
• Fewer high-end machines are sold
  – Fewer units over which to amortize R&D
  – Much higher margins
• Cost of 1 MB of memory (January 1994):
  – PC: $40 (Mac Quadra)
  – WS: $42 (SS-10)
  – Mainframe: $1920 (IBM 3090)
  – Supercomputer: $600 (M90 DRAM), $1375 (C90 15 ns SRAM)
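The learning-curve arithmetic on the Volume vs. Cost slide can be checked with a short sketch. The function name and its default parameter are illustrative choices; only the 10% rule and the volume ratio of 65 come from the slide.

```python
import math


def cost_factor(volume_ratio, reduction_per_doubling=0.10):
    """Fraction of the original cost left after volume grows by volume_ratio,
    assuming each doubling of volume cuts cost by reduction_per_doubling."""
    doublings = math.log2(volume_ratio)  # solves 2**x == volume_ratio
    return (1.0 - reduction_per_doubling) ** doublings


# PC volume is about 65 times workstation volume (1994 column of the table).
doublings = math.log2(65)
factor = cost_factor(65)
print(f"doublings: {doublings:.1f}, remaining cost fraction: {factor:.2f}")
```

With a ratio of 65 this gives about six doublings and a remaining cost fraction of about 0.53, matching the figures on the slide.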
Does Anybody Really Know What Time it is?
UNIX time command output: 90.7u 12.9s 2:39 65%
• User CPU time (time spent in the program): 90.7 s
• System CPU time (time spent in the OS): 12.9 s
• Elapsed time (response time): 2 min 39 s = 159 s
• (90.7 + 12.9) / 159 × 100 = 65%: the percentage of elapsed time that is CPU time. The remaining 35% of the time was spent in I/O or running other programs.

Time
• CPU time: time the CPU is computing, not including time spent waiting for I/O or running other programs
• User CPU time: CPU time spent in the program
• System CPU time: CPU time spent in the operating system performing tasks requested by the program
• CPU time = User CPU time + System CPU time
• Goal: decrease execution time
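The breakdown of the time output can be reproduced with a few lines of arithmetic. This sketch only redoes the slide's calculation; the helper names are hypothetical, and the parser handles just the "minutes:seconds" form shown on the slide.

```python
def elapsed_seconds(text):
    """Convert an elapsed time written as 'minutes:seconds' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_fraction(user_s, system_s, elapsed_s):
    """Share of elapsed time that was CPU time (user + system)."""
    return (user_s + system_s) / elapsed_s


elapsed = elapsed_seconds("2:39")             # 159 seconds
share = cpu_fraction(90.7, 12.9, elapsed)     # about 0.65
print(f"CPU time: {share:.0%} of elapsed, other: {1 - share:.0%}")
```

The share not spent on the CPU is the complement of the CPU share, which is where the time spent on I/O or other programs comes from.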
Example
• Time of Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput of Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
• Boeing is 1.6 times ("60%") faster in terms of throughput
• Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job.

Computer Performance Measures: Program Execution Time (1/2)
• For a specific program compiled to run on a specific machine (CPU) "A", the following parameters are provided:
  – The total instruction count of the program
  – The average number of cycles per instruction (average CPI)
  – The clock cycle of machine "A"
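The latency and throughput comparison in the Concorde example comes down to two ratios taken in opposite directions. The sketch below uses only the speeds and passenger-miles-per-hour (pmph) figures quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
concorde_mph, boeing_mph = 1350, 610
concorde_pmph, boeing_pmph = 178_200, 286_700

# Latency view: the faster plane finishes a single trip sooner.
speed_ratio = concorde_mph / boeing_mph          # Concorde over Boeing

# Throughput view: the larger plane moves more passenger-miles per hour.
throughput_ratio = boeing_pmph / concorde_pmph   # Boeing over Concorde

print(f"Concorde is {speed_ratio:.1f}x faster per trip")
print(f"Boeing has {throughput_ratio:.1f}x the throughput")
```

Which machine is "faster" therefore depends on whether the metric is time for one job or work completed per unit time.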
Example
For a given program:
• Execution time on machine A: Execution_A = 1 second
• Execution time on machine B: Execution_B = 10 seconds

$$\text{Speedup} = \frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A} = \frac{10}{1} = 10$$

The performance of machine A is 10 times the performance of machine B when running this program; equivalently, machine A is said to be 10 times faster than machine B when running this program.
The two CPUs may target different ISAs provided the program is written in a high-level language (HLL).

CPU Execution Time: The CPU Equation
• A program is comprised of a number of instructions executed, I
  – Measured in: instructions/program
• The average instruction executed takes a number of cycles per instruction (CPI) to complete
  – Measured in: cycles/instruction
• The CPU has a fixed clock cycle time C = 1/clock rate
  – Measured in: seconds/cycle
• CPU execution time is the product of the above three parameters:

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

$$T = I \times \text{CPI} \times C$$

where T is the execution time per program in seconds, I is the number of instructions executed, CPI is the average CPI for the program, and C is the CPU clock cycle time.
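The CPU equation and the speedup ratio translate directly into code. This is a minimal sketch: the function names are hypothetical, and the instruction count, CPI, and cycle time in the first example are made-up values, while the 1 s and 10 s execution times are the ones from the slide.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """T = I x CPI x C, with C given in seconds per cycle."""
    return instruction_count * cpi * cycle_time_s


def speedup(time_slower, time_faster):
    """Ratio of execution times, which is the inverse ratio of performance."""
    return time_slower / time_faster


# Hypothetical program: 1,000,000 instructions, CPI of 2, 10 ns cycle (100 MHz).
t = cpu_time(1_000_000, 2.0, 1e-8)
print(f"CPU time: {t:.3f} s")

# Machines from the slide: A takes 1 s, B takes 10 s.
print(f"A is {speedup(10.0, 1.0):.0f} times faster than B")
```

Because the three factors multiply, a change that lowers one of them (for example, the instruction count) only helps if it does not raise the others by more.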
Performance Comparison: Example
• From the previous example: a program is running on a specific machine with the following parameters:
  – Total executed instruction count, I: 10,000,000 instructions
  – Average CPI for the program: 2.5 cycles/instruction
  – CPU clock rate: 200 MHz
• Using the same program with these changes:
  – A new compiler is used: new instruction count 9,500,000, new CPI 3.0
  – Faster CPU implementation: new clock rate = 300 MHz
• What is the speedup with the changes?

$$\text{Speedup} = \frac{\text{Old Execution Time}}{\text{New Execution Time}} = \frac{I_{old} \times \text{CPI}_{old} \times \text{Clock cycle}_{old}}{I_{new} \times \text{CPI}_{new} \times \text{Clock cycle}_{new}}$$

$$\text{Speedup} = \frac{10{,}000{,}000 \times 2.5 \times 5 \times 10^{-9}}{9{,}500{,}000 \times 3 \times 3.33 \times 10^{-9}} = \frac{0.125}{0.095} = 1.32$$

or 32% faster after the changes.

Instruction Types & CPI
• Given a program with n types or classes of instructions executed on a given CPU with the following characteristics:
  – C_i = count of instructions of type i, for i = 1, 2, ..., n
  – CPI_i = cycles per instruction for type i
Then:

$$\text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count } I}$$

where:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{CPI}_i \times C_i \qquad \text{Instruction count } I = \sum_{i=1}^{n} C_i$$
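Both slides above can be checked numerically. In this sketch the function names are hypothetical. The first part uses the instruction counts, CPIs, and clock rates from the comparison example; the per-class counts in the second part are made-up values used only to exercise the CPI formula.

```python
def execution_time(instruction_count, cpi, clock_rate_hz):
    """T = I x CPI / clock rate (clock cycle time = 1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz


def average_cpi(counts, cpis):
    """CPI = sum(CPI_i * C_i) / sum(C_i) over instruction classes."""
    total_cycles = sum(counts[name] * cpis[name] for name in counts)
    return total_cycles / sum(counts.values())


old_time = execution_time(10_000_000, 2.5, 200e6)  # 0.125 s
new_time = execution_time(9_500_000, 3.0, 300e6)   # 0.095 s
speedup = old_time / new_time
print(f"speedup: {speedup:.2f}")

# Hypothetical class counts (in millions of instructions) and per-class CPIs.
counts = {"ALU": 5, "Load": 2}
cpis = {"ALU": 1, "Load": 5}
print(f"average CPI: {average_cpi(counts, cpis):.2f}")
```

Note that the new compiler raises the CPI from 2.5 to 3.0, yet the program still runs faster because the instruction count drops and the clock rate rises.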
Instruction Type Frequency & CPI: A RISC Example
Program profile (executed instruction mix) for the base machine (Reg / Reg), typical mix:

  Op      Freq, F_i  CPI_i  CPI_i × F_i  % Time
  ALU     50%        1      0.5          23% = 0.5/2.2
  Load    20%        5      1.0          45% = 1.0/2.2
  Store   10%        3      0.3          14% = 0.3/2.2
  Branch  20%        2      0.4          18% = 0.4/2.2
  Sum                       2.2

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$$

Performance Terminology
"X is n% faster than Y" means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$

$$n = \frac{100\,(\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)} = \frac{100\,(\text{ExTime}(Y) - \text{ExTime}(X))}{\text{ExTime}(X)}$$

Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?

$$n = \frac{100\,(15 - 10)}{10} = 50\%$$
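The RISC mix table and the "n% faster" definition can both be reproduced in a few lines. The frequencies and per-class CPIs below are taken from the table; the function and variable names are illustrative.

```python
# (frequency F_i, cycles CPI_i) for each instruction class in the table.
mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}


def weighted_cpi(mix):
    """CPI = sum over classes of CPI_i * F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each class: CPI_i * F_i / CPI."""
    total = weighted_cpi(mix)
    return {op: freq * cycles / total for op, (freq, cycles) in mix.items()}


def percent_faster(ex_time_y, ex_time_x):
    """n such that X is n% faster than Y."""
    return 100 * (ex_time_y - ex_time_x) / ex_time_x


print(f"CPI: {weighted_cpi(mix):.1f}")
for op, share in time_share(mix).items():
    print(f"{op}: {share:.0%} of time")
print(f"X is {percent_faster(15, 10):.0f}% faster than Y")
```

Loads make up only 20% of the instructions but about 45% of the time, which is why the time column, not the frequency column, indicates where an improvement pays off.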
Example of Amdahl's Law
• Floating point instructions are improved to run 2X faster, but only 10% of actual instructions are FP

$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left(0.9 + \frac{0.1}{2}\right) = 0.95 \times \text{ExTime}_{old}$$

$$\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$

Performance Enhancement Calculations: Amdahl's Law
• The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used
• Amdahl's Law: performance improvement or speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\text{Execution Time with } E} = \frac{\text{Performance with } E}{\text{Performance without } E}$$

  – Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected; then:

$$\text{Execution Time with } E = \left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E$$

  Hence the speedup is given by:

$$\text{Speedup}(E) = \frac{\text{Execution Time without } E}{\left((1 - F) + \frac{F}{S}\right) \times \text{Execution Time without } E} = \frac{1}{(1 - F) + \frac{F}{S}}$$
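Amdahl's Law as stated above is a one-line function. The sketch below (the function name is hypothetical) reproduces the floating point example with F = 0.1 and S = 2, and also shows the upper bound reached when the enhanced part becomes arbitrarily fast.

```python
def amdahl_speedup(fraction, factor):
    """Speedup(E) = 1 / ((1 - F) + F / S)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction F must be between 0 and 1")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Floating point example from the slide: F = 10%, S = 2.
fp_speedup = amdahl_speedup(0.10, 2.0)
print(f"overall speedup: {fp_speedup:.3f}")

# Even an enormous S cannot push the speedup past 1 / (1 - F).
bound = amdahl_speedup(0.10, 1e12)
print(f"upper bound for F = 10%: {bound:.3f}")
```

The bound of 1 / (1 - F) is the quantitative form of the slide's point that the gain is limited by how much the improved feature is used.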
An Alternative Solution Using CPU Equation

  Op      Freq  Cycles  CPI(i)  % Time
  ALU     50%   1       0.5     23%
  Load    20%   5       1.0     45%
  Store   10%   3       0.3     14%
  Branch  20%   2       0.4     18%

• If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement?
  – Old CPI = 2.2
  – New CPI = 0.5 × 1 + 0.2 × 2 + 0.1 × 3 + 0.2 × 2 = 1.6

$$\text{Speedup}(E) = \frac{\text{Original Execution Time}}{\text{New Execution Time}} = \frac{\text{Instruction count} \times \text{old CPI} \times \text{clock cycle}}{\text{Instruction count} \times \text{new CPI} \times \text{clock cycle}} = \frac{\text{old CPI}}{\text{new CPI}} = \frac{2.2}{1.6} = 1.37$$

This is the same speedup obtained from Amdahl's Law in the first solution.

Extending Amdahl's Law To Multiple Enhancements
• Suppose that enhancement E_i accelerates a fraction F_i of the execution time by a factor S_i and the remainder of the time is unaffected; then:

$$\text{Speedup} = \frac{\text{Original Execution Time}}{\left(\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}\right) \times \text{Original Execution Time}} = \frac{1}{\left(1 - \sum_i F_i\right) + \sum_i \frac{F_i}{S_i}}$$

Note: all fractions F_i refer to the original execution time before the enhancements are applied.
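The multi-enhancement form of Amdahl's Law can be sketched as follows, and it can be cross-checked against the load example: loads account for 1.0/2.2 of the original time and are sped up by 5/2. The function name is hypothetical, and the three-enhancement figures at the end are made-up values.

```python
def amdahl_multi(fractions, factors):
    """Speedup = 1 / ((1 - sum(F_i)) + sum(F_i / S_i)).

    Every F_i is a fraction of the ORIGINAL execution time, and the
    enhanced portions are assumed not to overlap.
    """
    affected = sum(fractions)
    if affected > 1.0:
        raise ValueError("fractions must not sum to more than 1")
    enhanced = sum(f / s for f, s in zip(fractions, factors))
    return 1.0 / ((1.0 - affected) + enhanced)


# Load example: fraction of time 1.0/2.2, CPI improved from 5 to 2.
load_speedup = amdahl_multi([1.0 / 2.2], [5 / 2])
print(f"load enhancement: {load_speedup:.3f} (old CPI / new CPI = {2.2 / 1.6:.3f})")

# Hypothetical case with three independent enhancements.
print(f"three enhancements: {amdahl_multi([0.20, 0.15, 0.10], [10, 15, 30]):.2f}")
```

With a single enhancement the function reduces to the basic Amdahl formula, so both routes to the load result agree.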
Computer Performance Measures: MIPS Rating (1/3)
• For a specific program running on a specific CPU, the MIPS rating is a measure of how many millions of instructions are executed per second:

$$\text{MIPS Rating} = \frac{\text{Instruction count}}{\text{Execution Time} \times 10^6} = \frac{\text{Instruction count}}{\text{CPU clocks} \times \text{Cycle time} \times 10^6} = \frac{\text{Instruction count} \times \text{Clock rate}}{\text{Instruction count} \times \text{CPI} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

• Major problem with the MIPS rating: as shown above, the MIPS rating does not account for the count of instructions executed (I)
  – A higher MIPS rating in many cases may not mean higher performance or better execution time, e.g., due to compiler design variations

Computer Performance Measures: MIPS Rating (2/3)
• In addition, the MIPS rating:
  – Does not account for the instruction set architecture (ISA) used
    • Thus it cannot be used to compare computers/CPUs with different instruction sets
  – Is easy to abuse: the program used to get the MIPS rating is often omitted
    • Often the peak MIPS rating is provided for a given CPU; it is obtained using a program comprised entirely of instructions with the lowest CPI for the given CPU design, which does not represent real programs
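The two ends of the MIPS derivation can be computed separately and compared. This sketch uses hypothetical function names and a made-up program (10 million instructions, CPI of 2.5, 200 MHz clock) to show that the instruction count cancels out of the rating.

```python
def mips_from_time(instruction_count, execution_time_s):
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_cpi(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical program and machine.
instructions, cpi, clock_rate = 10_000_000, 2.5, 200e6
exec_time = instructions * cpi / clock_rate  # T = I x CPI / clock rate

print(f"from time: {mips_from_time(instructions, exec_time):.1f} MIPS")
print(f"from CPI:  {mips_from_cpi(clock_rate, cpi):.1f} MIPS")

# Doubling the instruction count at the same CPI doubles the execution
# time but leaves the MIPS rating unchanged.
print(f"2x instructions: {mips_from_time(2 * instructions, 2 * exec_time):.1f} MIPS")
```

The last line is the weakness in miniature: the program takes twice as long, yet the rating does not move.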
A MIPS Example (2)

$$\text{CPI}_1 = \frac{[(5 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6 \text{ cycles}}{(5 + 1 + 1) \times 10^6} = \frac{10}{7} = 1.43 \qquad \text{MIPS}_1 = \frac{100 \text{ MHz}}{1.43} = 69.9$$

$$\text{CPI}_2 = \frac{[(10 \times 1) + (1 \times 2) + (1 \times 3)] \times 10^6 \text{ cycles}}{(10 + 1 + 1) \times 10^6} = \frac{15}{12} = 1.25 \qquad \text{MIPS}_2 = \frac{100 \text{ MHz}}{1.25} = 80.0$$

So, compiler 2 has a higher MIPS rating and should be faster?

A MIPS Example (3)
• Now let's compare CPU time (note: important formula!):

$$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

$$\text{CPU Time}_1 = \frac{7 \times 10^6 \times 1.43}{100 \times 10^6} = 0.10 \text{ seconds}$$

$$\text{CPU Time}_2 = \frac{12 \times 10^6 \times 1.25}{100 \times 10^6} = 0.15 \text{ seconds}$$

Therefore program 1 is faster despite a lower MIPS rating!
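The two-compiler comparison can be replayed in code. The per-class instruction counts (in millions) and cycle counts are read off the formulas above; the class labels A, B, C and the function name are hypothetical. Without intermediate rounding, MIPS for compiler 1 comes out as 70.0 rather than the slide's 69.9, which uses the rounded CPI of 1.43.

```python
# Cycles per instruction for each class, as used in the CPI formulas.
CYCLES = {"A": 1, "B": 2, "C": 3}


def summarize(counts_millions, clock_mhz=100):
    """Return (CPI, MIPS rating, CPU time in seconds) for one compiled program."""
    instructions = sum(counts_millions.values()) * 1e6
    cycles = sum(n * CYCLES[cls] for cls, n in counts_millions.items()) * 1e6
    cpi = cycles / instructions
    mips = clock_mhz / cpi                 # clock rate / (CPI x 10^6)
    cpu_time = cycles / (clock_mhz * 1e6)  # I x CPI / clock rate
    return cpi, mips, cpu_time


cpi1, mips1, time1 = summarize({"A": 5, "B": 1, "C": 1})
cpi2, mips2, time2 = summarize({"A": 10, "B": 1, "C": 1})

print(f"compiler 1: CPI {cpi1:.2f}, {mips1:.1f} MIPS, {time1:.2f} s")
print(f"compiler 2: CPI {cpi2:.2f}, {mips2:.1f} MIPS, {time2:.2f} s")
```

Compiler 2 earns the higher MIPS rating by executing many extra one-cycle instructions, which lowers its CPI while raising its total cycle count and hence its CPU time.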
CPU Benchmark Suites
• Performance comparison: the execution time of the same workload running on two machines without running the actual programs
• Benchmarks: programs specifically chosen to measure performance
• Five levels of programs, in decreasing order of accuracy:
  – Real applications
  – Modified applications
  – Kernels
  – Toy benchmarks
  – Synthetic benchmarks

SPEC: System Performance Evaluation Cooperative
• SPECCPU: popular desktop benchmark suite
  – CPU only, split between integer and floating point programs
• First round, 1989: 10 programs yielding a single number
  – SPECmarks
• Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
• Third round, 1995
  – New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point programs)
  – "Benchmarks useful for 3 years"
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
• SPECint2000 has 12 integer programs, SPECfp2000 has 14 floating point programs
• SPECCPU2006 to be announced Spring 2006
• SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks