How does a CPU hide slow memory? · From Transistor to Browser

Chapter 8
How does a CPU hide slow memory?

The browser
The network
Software and languages
The CPU
Memory
Arithmetic
Logic gates
The switch

Figure 8.1: Where we are in the climb from transistor to browser: still on the processor. The pipeline made the core quick; this chapter asks how it copes with a memory far slower than itself.

The pipeline of Chapter 7 keeps the core busy, finishing an instruction almost every cycle. But every instruction is fetched from memory, and memory is slow. A core can do an addition in about a cycle; a read from main memory can cost a few hundred. Left alone, the fast machine we built would spend almost all of its life waiting for the next word to arrive. This chapter is about the trick that hides the wait, and it is arguably the largest performance idea in all of hardware.

8.1 How far away memory is

The gap is easier to feel if we stretch time. Suppose one clock cycle lasted one whole second. Then reaching the nearest, smallest, fastest memory takes a few seconds; reaching main memory takes minutes; reaching the disk takes days; and reaching another machine across the network takes over a year. Figure 8.2 lays the distances out on that human scale.

latency-ladder — Figure 8.2: The memory hierarchy, each level trading size for distance. On the far right, every distance is scaled to human time at one second a cycle, and named for the place it would carry you. The registers and caches are seconds away; everything below is minutes, days, or years.

No single memory can be both large and close, so machines use several at once: a small, fast one right beside the core, backed by larger, slower ones further out. The small fast pools are caches, built from the same cross-coupled latches as the registers of Chapter 4. The whole scheme works only because programs are predictable, in two ways. A word just used is likely to be used again soon (temporal locality), and the words next to it are likely to be used soon after (spatial locality). Figure 8.3 shows how the second of those leads a cache to move data in whole lines, not single words.

cache-line — Figure 8.3: A cache moves a whole line at a time. A miss on any one word loads its eight-word line into a slot, so the next seven words of that line are then hits. One slow trip buys seven fast ones.

8.2 Keeping the hot data near

The model here is the simplest cache that works, a direct-mapped one. Each line of memory maps to exactly one slot in the cache, chosen by its number. A load hits if that slot already holds the line it wants, and otherwise misses, fetches the line, and evicts whatever was there.

Example 8-1. A direct-mapped cache

class Cache:
    """A direct-mapped cache: every memory line maps to exactly one slot."""

    def __init__(self, slots: int = SLOTS, line: int = LINE):
        self.line = line
        self.blocks = [None] * slots   # the line number in each slot, or None

    def load(self, address: int) -> bool:
        """Look up an address. Return True on a hit; on a miss, load its line into
        the one slot it maps to, evicting whatever was there, and return False."""
        block = address // self.line1
        slot = block % len(self.blocks)
        hit = self.blocks[slot] == block2
        self.blocks[slot] = block3
        return hit

the address divided by the line size names the line it belongs to
a hit is when the one slot that line maps to already holds it
on a miss the line is loaded into that slot, evicting the old one: this is why two lines that map to the same slot keep pushing each other out

8.3 The hit rate is everything

A cache only helps when it hits, so the number that matters is the hit rate. The arithmetic is brutal. An average access costs the fast hit time always, plus the slow miss penalty on the fraction that miss. When the penalty is a hundred times the hit, even a small miss rate dominates the average.

Example 8-2. The average cost of an access

def average_access(hit_rate: float) -> float:
    """The cost of an average access: the hit time always, plus the miss penalty
    on the fraction that miss. A small miss rate on a large penalty still hurts."""
    return HIT_TIME + (1 - hit_rate) * MISS_PENALTY1

the hit time is paid on every access; the miss penalty only on the misses, but it is so large that a few percent of them decide the average

The cache does not make memory fast; it makes the slow trip rare. A cache makes one core wait less, and the next chapter takes the other road to speed: stop using one core, and use many.

Exercises

Double the line size in the model. What happens to the scan’s hit rate, and why?
Give the cache more slots. Which of the three patterns improves, and which does not?
At a miss penalty of a hundred cycles, what hit rate does a program need to average under two cycles per access?

Previous Next

Chapter 8How does a CPU hide slow memory?

8.1 How far away memory is

8.2 Keeping the hot data near

8.3 The hit rate is everything

Exercises

Chapter 8
How does a CPU hide slow memory?