Advanced Computer Architecture - Memory Hierarchy Design (part 2) - Tran Ngoc Thinh
4/25/2013 · dce 2011

Unified vs. Separate Level 1 Cache
• Unified Level 1 Cache (Princeton Memory Architecture): A single level 1 (L1) cache is used for both instructions and data.
• Separate instruction/data Level 1 caches (Harvard Memory Architecture): The level 1 (L1) cache is split into two caches, one for instructions (instruction cache, L1 I-cache) and the other for data (data cache, L1 D-cache).
[Figure: two processor diagrams, each with control, datapath, and registers. Left: Unified Level 1 Cache (Princeton Memory Architecture). Right: Separate (Split) Level 1 Caches, L1 I-cache and L1 D-cache (Harvard Memory Architecture), labeled "Most Common".]

Memory Hierarchy Performance (1/2)
• The Average Memory Access Time (AMAT): The number of cycles required to complete an average memory access request by the CPU.
• Memory stall cycles per memory access: The number of stall cycles added to CPU execution cycles for one memory access.
  Memory stall cycles per average memory access = (AMAT - 1)
• For ideal memory: AMAT = 1 cycle, which results in zero memory stall cycles.

Cache Performance: Single Level L1, Princeton (Unified) Memory Architecture (2/2)
Memory stall cycles per memory access = Miss rate x Miss penalty
Memory accesses per instruction = (1 + fraction of loads/stores)
Miss penalty = M = the number of stall cycles resulting from missing in cache = Main memory access time - 1
Thus for a unified L1 cache with no stalls on a cache hit:
  CPI = CPIexecution + (1 + fraction of loads/stores) x (1 - H1) x M
  AMAT = 1 + Miss rate x Miss penalty
  AMAT = 1 + (1 - H1) x M
Here H1 is the L1 hit rate, so (1 - H1) is the miss rate.

Cache Performance Example (1/2)
• Suppose a CPU executes at a clock rate of 200 MHz (5 ns per cycle) with a single level of cache.
• CPIexecution = 1.1
• Instruction mix: 50% arith/logic, 30% load/store, 20% control
• Assume a cache miss rate of 1.5% and a miss penalty of M = 50 cycles.
CPI = CPIexecution + Mem stalls per instruction
Mem stalls per instruction = Mem accesses per instruction x Memory stall cycles per access
                           = Mem accesses per instruction x Miss rate x Miss penalty
Mem accesses per instruction = 1 (instruction fetch) + 0.3 (load/store) = 1.3
Mem stalls per memory access = (1 - H1) x M = 0.015 x 50 = 0.75 cycles
AMAT = 1 + 0.75 = 1.75 cycles
Mem stalls per instruction = 1.3 x 0.015 x 50 = 0.975
CPI = 1.1 + 0.975 = 2.075
The CPU with ideal memory (no misses) is 2.075/1.1 = 1.89 times faster.
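
The unified-cache arithmetic above can be checked with a short script. This is only a sketch: the function name `unified_cache_cpi` and its parameter names are chosen here for illustration and do not come from the slides.

```python
def unified_cache_cpi(cpi_execution, load_store_fraction, miss_rate, miss_penalty):
    """Return (AMAT in cycles, CPI) for one unified L1 cache with no stalls on a hit."""
    # Every instruction is fetched once; loads/stores add one data access each.
    accesses_per_instruction = 1 + load_store_fraction
    # Memory stall cycles per memory access = miss rate x miss penalty.
    stalls_per_access = miss_rate * miss_penalty
    amat = 1 + stalls_per_access
    cpi = cpi_execution + accesses_per_instruction * stalls_per_access
    return amat, cpi


# Numbers from the example: CPIexecution = 1.1, 30% load/store,
# 1.5% miss rate, M = 50 cycles.
amat, cpi = unified_cache_cpi(1.1, 0.3, 0.015, 50)
print(f"AMAT = {amat:.2f} cycles")
print(f"CPI = {cpi:.3f}")
print(f"Ideal-memory speedup = {cpi / 1.1:.2f}x")
```

Varying `miss_rate` or `miss_penalty` in the call shows how quickly memory stalls come to dominate a CPIexecution of 1.1.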

Cache Performance Example (1/2)
• Suppose a CPU uses separate level one (L1) caches for instructions and data (Harvard memory architecture) with different miss rates for instruction and data access:
  – A cache hit incurs no stall cycles, while a cache miss incurs 200 stall cycles for both memory reads and writes.
  – CPIexecution = 1.1
  – Instruction mix: 50% arith/logic, 30% load/store, 20% control
  – Assume a cache miss rate of 0.5% for instruction fetch and a cache data miss rate of 6%.
  – Find the resulting CPI using this cache. How much faster is the CPU with ideal memory?

Cache Performance Example (2/2)
CPI = CPIexecution + Mem stalls per instruction
Mem stall cycles per instruction = Instruction fetch miss rate x M + Data memory accesses per instruction x Data miss rate x M
Mem stall cycles per instruction = 0.5/100 x 200 + 6/100 x 0.3 x 200 = 1 + 3.6 = 4.6
Mem stall cycles per access = 4.6 / 1.3 ≈ 3.5 cycles
AMAT = 1 + 3.5 = 4.5 cycles
CPI = CPIexecution + Mem stalls per instruction = 1.1 + 4.6 = 5.7
The CPU with ideal cache (no misses) is 5.7/1.1 = 5.18 times faster.
With no cache at all, the CPI would have been 1.1 + 1.3 x 200 = 261.1!
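
The split-cache calculation differs from the unified case only in that instruction fetches and data accesses have their own miss rates. A minimal sketch follows; `split_cache_cpi` and its parameter names are illustrative, not from the slides.

```python
def split_cache_cpi(cpi_execution, load_store_fraction,
                    inst_miss_rate, data_miss_rate, miss_penalty):
    """Return (stall cycles per instruction, AMAT, CPI) for split L1 I/D caches."""
    # One instruction fetch per instruction, each missing at inst_miss_rate.
    inst_stalls = inst_miss_rate * miss_penalty
    # load_store_fraction data accesses per instruction, missing at data_miss_rate.
    data_stalls = load_store_fraction * data_miss_rate * miss_penalty
    stalls_per_instruction = inst_stalls + data_stalls
    accesses_per_instruction = 1 + load_store_fraction
    amat = 1 + stalls_per_instruction / accesses_per_instruction
    cpi = cpi_execution + stalls_per_instruction
    return stalls_per_instruction, amat, cpi


stalls, amat, cpi = split_cache_cpi(1.1, 0.3, 0.005, 0.06, 200)
# Without any cache, every access pays the full 200-cycle penalty.
cpi_no_cache = 1.1 + (1 + 0.3) * 200

print(f"Stall cycles per instruction = {stalls:.1f}")
print(f"AMAT = {amat:.2f} cycles")
print(f"CPI = {cpi:.1f}, speedup with ideal memory = {cpi / 1.1:.2f}x")
print(f"CPI with no cache = {cpi_no_cache:.1f}")
```

The unrounded AMAT is about 4.54 cycles; the slide rounds the stall cycles per access to 3.5 before adding the hit cycle.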

VM Benefit
• VM provides the following benefits:
  – Allows multiple programs to share the same physical memory
  – Allows programmers to write code as though they have a very large amount of main memory
  – Automatically handles bringing in data from disk

Virtual Memory Basics
• Programs reference “virtual” addresses in a non-existent memory
  – These are then translated into real “physical” addresses
  – Virtual address space may be bigger than physical address space
• Divide physical memory into blocks, called pages
  – Anywhere from 512 bytes to 16 MB (4 KB typical)
• Virtual-to-physical translation by indexed table lookup
  – Add another cache for recent translations (the TLB, translation lookaside buffer)
• Invisible to the programmer
  – Looks to your application like you have a lot of memory!
  – Anyone remember overlays?
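
The translation step can be sketched in a few lines: split the virtual address into a page number and an offset, look the page number up in a table, and keep recent translations in a small cache. The class `Translator`, the sample page table contents, and the unbounded dictionary standing in for the TLB are all assumptions made for illustration. A real TLB is a fixed-size hardware structure with its own replacement policy.

```python
PAGE_SIZE = 4096  # 4 KB pages, the typical size


class Translator:
    """Toy virtual-to-physical translator with a dictionary acting as the TLB."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # cache of recent translations (unbounded here)
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_address):
        # The page number indexes the table; the offset is carried over unchanged.
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[vpn]
        else:
            self.tlb_misses += 1
            frame = self.page_table[vpn]  # indexed table lookup
            self.tlb[vpn] = frame
        return frame * PAGE_SIZE + offset


# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
mmu = Translator({0: 5, 1: 2})
first = mmu.translate(0x0010)   # TLB miss, walks the table
second = mmu.translate(0x1004)  # TLB miss, different page
third = mmu.translate(0x0020)   # TLB hit, same page as the first access
print(first, second, third, mmu.tlb_hits, mmu.tlb_misses)
```

Only the page number is translated, which is why larger pages mean fewer table entries and better TLB coverage for the same number of entries.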

Example of virtual memory
• Relieves problem of making a program that was too large to fit in physical memory
  – well... fit!
• Allows program to run in any location in physical memory
  – (called relocation)
  – Really useful as you might want to run the same program on lots of machines
[Figure: virtual memory pages mapped to physical main memory and disk]
  Virtual address   Page   Location
  0                 A      physical main memory, address 16K
  4K                B      physical main memory, address 24K
  8K                C      physical main memory, address 4K
  12K               D      disk
The logical program is in contiguous VA space; here it consists of 4 pages: A, B, C, D. Three of the pages are in main memory and one is located on the disk.

Cache terms vs. VM terms
So, some definitions/“analogies”:
  – A “page” or “segment” of memory is analogous to a “block” in a cache
  – A “page fault” or “address fault” is analogous to a cache miss
    so, if we go to main (“real”/physical) memory and our data isn’t there, we need to get it from disk
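
The four-page example can be replayed in code to show both relocation and a page fault. The mapping below is read from the figure (A at 16K, B at 24K, C at 4K, D on disk). The names `PAGE_TABLE`, `PageFault`, and `translate` are hypothetical, and the sketch stops at raising the fault rather than modeling the OS bringing the page in from disk.

```python
PAGE_SIZE = 4 * 1024  # 4K pages, as in the figure

# Virtual page number -> (page name, physical base address or None if on disk).
PAGE_TABLE = {
    0: ("A", 16 * 1024),
    1: ("B", 24 * 1024),
    2: ("C", 4 * 1024),
    3: ("D", None),  # not resident in main memory
}


class PageFault(Exception):
    """Raised when the referenced page is not in main memory (the VM 'miss')."""


def translate(virtual_address):
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    name, physical_base = PAGE_TABLE[vpn]
    if physical_base is None:
        raise PageFault(f"page {name} must be fetched from disk")
    return physical_base + offset


print(translate(0))               # page A, relocated to 16K
print(translate(4 * 1024 + 100))  # page B, offset 100
try:
    translate(12 * 1024)          # page D lives on disk
except PageFault as fault:
    print("page fault:", fault)
```

The program sees addresses 0 through 16K as contiguous even though the resident pages are scattered through physical memory.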

Cache vs. VM comparisons (2/2)
• Replacement policy:
  – Replacement on cache misses is primarily controlled by hardware
  – Replacement with VM (i.e., which page do I replace?) is usually controlled by the OS
    • Because of the bigger miss penalty, we want to make the right choice
• Sizes:
  – Size of processor address determines size of VM
  – Cache size is independent of processor address size

Virtual Memory
• Timing’s tough with virtual memory:
  – AMAT = Tmem + (1 - h) * Tdisk
  –      = 100 ns + (1 - h) * 25,000,000 ns
• h (hit rate) has to be incredibly (almost unattainably) close to perfect to work
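
To see how close to perfect h has to be, the AMAT formula can be evaluated directly and then inverted for a target access time. The function names and the 110 ns target (a 10% slowdown over main memory) are assumptions picked for illustration; the 100 ns and 25,000,000 ns figures are the ones given above.

```python
T_MEM_NS = 100            # main memory access time
T_DISK_NS = 25_000_000    # disk access time on a page fault


def vm_amat_ns(hit_rate):
    """AMAT = Tmem + (1 - h) * Tdisk, in nanoseconds."""
    return T_MEM_NS + (1 - hit_rate) * T_DISK_NS


def required_hit_rate(target_amat_ns):
    """Solve the AMAT formula for h given a target average access time."""
    return 1 - (target_amat_ns - T_MEM_NS) / T_DISK_NS


amat_999 = vm_amat_ns(0.999)         # even a 99.9% hit rate is dominated by disk
h_needed = required_hit_rate(110)    # hit rate for at most a 10% slowdown
print(f"AMAT at h = 0.999: {amat_999:.0f} ns")
print(f"h needed for AMAT = 110 ns: {h_needed:.7f}")
```

Under these numbers a 99.9% hit rate still yields an average access time of roughly 25,000 ns, while staying within 10% of main memory speed allows about one page fault per 2.5 million accesses.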