Lec04-Cache

• Goal: the illusion of a large, fast, cheap memory. Let programs
  address a memory space that scales to the disk size, at a speed
  that is usually as fast as register access.
• Solution: put smaller, faster “cache” memories between the CPU and
  DRAM, creating a “memory hierarchy”.

Since 1980, CPU Has Outpaced DRAM

• A four-issue 2 GHz superscalar accessing 100 ns DRAM could execute
  800 instructions during the time for one memory access!

[Figure: Performance (1/latency) vs. year, 1980–2000. CPU performance
improves 60% per year (2x in 1.5 years); DRAM improves 9% per year
(2x in 10 years); the gap grew 50% per year.]

Processor-DRAM Performance Gap Impact

• To illustrate the performance impact, assume a single-issue
  pipelined CPU with CPI = 1 using non-ideal memory.
• Ignoring other factors, the minimum cost of a full memory access,
  in number of wasted CPU cycles (or instructions):

  Year   CPU speed   CPU cycle   Memory access   Minimum CPU memory stall cycles
         (MHz)       (ns)        (ns)            or instructions wasted
  1986       8        125          190           190/125  - 1 = 0.5
  1989      33         30          165           165/30   - 1 = 4.5
  1992      60         16.6        120           120/16.6 - 1 = 6.2
  1996     200          5          110           110/5    - 1 = 21
  1998     300          3.33       100           100/3.33 - 1 = 29
  2000    1000          1           90           90/1     - 1 = 89
  2002    2000          0.5         80           80/0.5   - 1 = 159
  2004    3000          0.333       60           60/0.333 - 1 = 179
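The pattern in the last column is simply stall cycles = memory access
time / CPU cycle time - 1, where the -1 credits the single cycle an
ideal memory access would take. A minimal C sketch of that arithmetic,
using a few rows from the table (struct layout and names are
illustrative):

    #include <stdio.h>

    /* Sketch of the stall-cycle arithmetic behind the table above:
     * wasted cycles = access_time / cycle_time - 1.
     * Rows are taken from the table; names are illustrative. */
    int main(void) {
        struct { int year; double cycle_ns, access_ns; } rows[] = {
            {1986, 125.0, 190.0}, {1996, 5.0, 110.0},
            {2000, 1.0, 90.0},    {2004, 0.333, 60.0},
        };
        for (int i = 0; i < 4; i++) {
            double stall = rows[i].access_ns / rows[i].cycle_ns - 1.0;
            printf("%d: %.1f wasted cycles per access\n",
                   rows[i].year, stall);
        }
        return 0;
    }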
Common Predictable Patterns

Two predictable properties of memory references:
• Temporal locality: if a location is referenced, it is likely to be
  referenced again in the near future (e.g., loops, reuse).
• Spatial locality: if a location is referenced, it is likely that
  locations near it will be referenced in the near future (e.g.,
  straight-line code, array access).

Caches

Caches exploit both types of predictability:
– Exploit temporal locality by remembering the contents of recently
  accessed locations.
– Exploit spatial locality by fetching blocks of data around recently
  accessed locations.
Simple View of a Cache

• Example:

      for (i = 0; i < 10; i++)
          S = S + A[i];

• No cache: at least 12 accesses to main memory (10 reads of A[i],
  plus a read and a write of S).
• With a cache: if A[i] and S fit in a single block (e.g., 32 bytes),
  1 access to load the block into the cache and 1 access to write the
  block back to main memory.
• Access to S: temporal locality.
• Access to A[i]: spatial locality.

Replacement

[Figure: 32 memory blocks (numbered 0–31) mapped onto a 4-line cache;
the CPU requests a block that is not currently in the cache.]

• The cache cannot hold all blocks.
• Replace a resident block with the one the CPU currently needs.
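A runnable version of this example, with comments marking where each
kind of locality appears (the array size and contents are
illustrative):

    #include <stdio.h>

    /* Runnable form of the slide's loop; values are illustrative. */
    int main(void) {
        int A[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int S = 0;
        for (int i = 0; i < 10; i++)
            S = S + A[i];  /* S: temporal locality (reused every
                              iteration); A[i]: spatial locality
                              (consecutive addresses, likely within
                              one cache block). */
        printf("S = %d\n", S);
        return 0;
    }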
Direct-Mapped Cache

[Figure: The block address is split into Tag (t bits), Index (k bits),
and Offset (b bits). The index selects one of 2^k cache lines, each
holding a valid bit, a tag, and a data block; a tag match on a valid
line signals HIT, and the offset selects the word or byte.]

• Address: N bits (2^N words).
• The cache has 2^k lines (blocks).
• Each line has 2^b words.
• Block M is mapped to line M % 2^k.
• Need t = N - k - b tag bits to identify the memory block.
• Advantage: simple.
• Disadvantage: high miss rate. What happens if the CPU alternates
  between blocks N0 and N1 with N0 % 2^k = N1 % 2^k ?
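A minimal sketch of the field extraction, assuming a byte-addressed
32-bit machine; the function name and the example parameters
(k = 12, b = 4) are our own:

    #include <stdint.h>
    #include <stdio.h>

    /* Split an address into tag / index / offset as described above,
     * assuming N = 32 bits, 2^k lines, and 2^b bytes per line. */
    static void split_address(uint32_t addr, int k, int b) {
        uint32_t offset = addr & ((1u << b) - 1);        /* low b bits  */
        uint32_t index  = (addr >> b) & ((1u << k) - 1); /* next k bits */
        uint32_t tag    = addr >> (k + b);               /* t = N-k-b bits */
        printf("addr=0x%08x -> tag=0x%x index=0x%x offset=0x%x\n",
               (unsigned)addr, (unsigned)tag,
               (unsigned)index, (unsigned)offset);
    }

    int main(void) {
        /* These two addresses have the same index but different tags:
         * in a direct-mapped cache they evict each other repeatedly. */
        split_address(0x00001234u, 12, 4);
        split_address(0x00011234u, 12, 4);
        return 0;
    }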
64KB Direct-Mapped Cache Example

• 4K = 4096 blocks; each block = four words = 16 bytes.
• 32-bit address (bit positions 31..0): tag field = bits 31-16
  (16 bits), index field = bits 15-4 (12 bits), block offset = bits
  3-0 (4 bits: 2-bit word select + 2-bit byte offset).
• Block address = 28 bits: tag = 16 bits, index = 12 bits.
• Can cache up to 2^32 bytes = 4 GB of memory.
• Each of the 4K entries holds a valid bit, a 16-bit tag, and 128
  bits of data; a tag comparison answers hit or miss, and a 4-to-1
  mux over the four 32-bit words selects the requested word.
• Mapping function: cache block frame number = (block address) MOD
  4096, i.e., the index field (the 12 low bits of the block address).
• Larger cache blocks take better advantage of spatial locality and
  thus may result in a lower miss rate.

Fully Associative Cache

[Figure: Every cache line holds a valid bit, a t-bit tag, and a data
block. The address tag is compared against all line tags in parallel;
any match signals HIT, and the block offset selects the word or byte.]
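A minimal hit/miss sketch of this 64KB organization (4096 lines,
16-byte blocks, 16-bit tags); only the tag/valid logic is modeled,
and the structure and example addresses are our own:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hit/miss logic for the 64KB direct-mapped cache above; data
     * storage and write policy are omitted. */
    #define LINES 4096

    struct line { bool valid; uint16_t tag; };
    static struct line cache[LINES];

    static bool access_cache(uint32_t addr) {
        uint32_t index = (addr >> 4) & 0xFFFu;    /* bits 15..4  */
        uint16_t tag   = (uint16_t)(addr >> 16);  /* bits 31..16 */
        if (cache[index].valid && cache[index].tag == tag)
            return true;                          /* hit          */
        cache[index].valid = true;                /* miss: refill */
        cache[index].tag   = tag;
        return false;
    }

    int main(void) {
        /* 0x01238 hits (same 16-byte block as 0x01234); 0x11234 maps
         * to the same line with a different tag, so the final access
         * to 0x01234 misses again: a conflict miss. */
        uint32_t addrs[] = {0x01234, 0x01238, 0x11234, 0x01234};
        for (int i = 0; i < 4; i++)
            printf("0x%05x -> %s\n", (unsigned)addrs[i],
                   access_cache(addrs[i]) ? "hit" : "miss");
        return 0;
    }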
W-Way Set-Associative Cache

• A balance between the direct-mapped cache and the fully associative
  cache.
• The cache has 2^k sets.
• Each set has 2^w lines.
• Block M is mapped to one of the 2^w lines in set M % 2^k.
• Tag bits: t = N - k - b.
• Currently the most widely used organization (Intel, AMD, ...).

4K Four-Way Set-Associative Cache: MIPS Implementation Example

• 1024 block frames, each block = one word; 4-way set associative, so
  1024 / 4 = 256 sets.
• 32-bit address (bit positions 31..0): tag field = bits 31-10
  (22 bits), index field = bits 9-2 (8 bits), block offset = bits 1-0
  (2 bits).
• Block address = 30 bits: tag = 22 bits, index = 8 bits.
• Can cache up to 2^32 bytes = 4 GB of memory.
• Four (V, Tag, Data) arrays are read in parallel; four tag
  comparators and a 4-to-1 multiplexor produce Hit and Data.
• A set-associative cache requires parallel tag matching and more
  complex hit logic, which may increase hit time.
• Mapping function: cache set number = index = (block address) MOD 256.
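A sketch of the set-associative hit logic: the tag is compared against
every way of the selected set (in hardware, in parallel; here, a
loop). Parameters follow the 4-way example above; the names and the
preloaded line in main are our own:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Lookup for a 4-way set-associative cache with 256 sets and
     * one-word blocks, as in the MIPS example above. */
    #define SETS 256
    #define WAYS 4

    struct line { bool valid; uint32_t tag; };
    static struct line cache[SETS][WAYS];

    static bool lookup(uint32_t addr) {
        uint32_t index = (addr >> 2) & 0xFFu;  /* bits 9..2   */
        uint32_t tag   = addr >> 10;           /* bits 31..10 */
        for (int w = 0; w < WAYS; w++)         /* parallel comparators in HW */
            if (cache[index][w].valid && cache[index][w].tag == tag)
                return true;
        return false;
    }

    int main(void) {
        cache[4][0] = (struct line){true, 0x1};  /* preload set 4, tag 1 */
        printf("%s\n", lookup((0x1u << 10) | (4u << 2)) ? "hit" : "miss");
        printf("%s\n", lookup((0x2u << 10) | (4u << 2)) ? "hit" : "miss");
        return 0;
    }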
Block Size and Spatial Locality

• A block is the unit of transfer between the cache and memory.
• [Figure: a 4-word block (b = 2): Tag | Word0 | Word1 | Word2 | Word3.]
• The CPU address is split into a block address (32 - b bits) and an
  offset (b bits); 2^b = block size, a.k.a. line size (in bytes).
• Larger block sizes have distinct hardware advantages:
  – less tag overhead
  – exploit fast burst transfers from DRAM
  – exploit fast burst transfers over wide busses
• What are the disadvantages of increasing block size? Fewer blocks
  => more conflicts. Can waste bandwidth.

Q3: Which Block Should Be Replaced on a Miss?

• Easy for direct-mapped caches.
• Set-associative or fully associative:
  – Random
  – Least Recently Used (LRU); a sketch follows below
    • LRU cache state must be updated on every access
    • a true implementation is only feasible for small sets (2-way,
      4-way); a pseudo-LRU binary tree is often used for 4-8 ways
  – First In, First Out (FIFO), a.k.a. Round-Robin
    • used in highly associative caches
• Replacement policy has a second-order effect, since replacement
  only happens on misses.
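A minimal sketch of true LRU for a single 4-way set, under our own
naming: each way carries an age counter updated on every access, and
the oldest valid way is the victim on a miss (real hardware tracks
pseudo-LRU bits rather than full counters):

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    /* One way of a set: valid bit, tag, and an age counter for LRU. */
    struct way { int valid; uint32_t tag; unsigned age; };

    /* LRU state is updated on every access: all ways age, except the
     * one just used. */
    static void touch(struct way set[WAYS], int used) {
        for (int w = 0; w < WAYS; w++)
            set[w].age++;
        set[used].age = 0;
    }

    /* Victim selection: an empty way if one exists, else the oldest. */
    static int victim(const struct way set[WAYS]) {
        int v = 0;
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) return w;
        for (int w = 1; w < WAYS; w++)
            if (set[w].age > set[v].age) v = w;
        return v;
    }

    int main(void) {
        struct way set[WAYS] = {0};
        uint32_t refs[] = {1, 2, 3, 4, 1, 5};  /* tags; 5 should evict 2,
                                                  the least recently used */
        for (int i = 0; i < 6; i++) {
            int hit = -1;
            for (int w = 0; w < WAYS; w++)
                if (set[w].valid && set[w].tag == refs[i]) hit = w;
            if (hit < 0) {                     /* miss: replace the victim */
                hit = victim(set);
                printf("miss on tag %u -> fills way %d\n",
                       (unsigned)refs[i], hit);
                set[hit] = (struct way){1, refs[i], 0};
            }
            touch(set, hit);
        }
        return 0;
    }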
Reading Assignment 2

• Cache performance
  – Replacement policies (algorithms)
  – Optimization (miss rate, miss penalty, ...)
• References
  – Hennessy & Patterson, Computer Architecture: A Quantitative Approach
  – www2.lns.mit.edu/~avinatan/research/cache.pdf
  – More on the Internet