Chapter 8
How does a CPU hide slow memory?

  1. The browser
  2. The network
  3. Software and languages
  4. The CPU
  5. Memory
  6. Arithmetic
  7. Logic gates
  8. The switch
Figure 8.1: Where we are in the climb from transistor to browser: still on the processor. The pipeline made the core quick; this chapter asks how it copes with a memory far slower than itself.

The pipeline of Chapter 7 keeps the core busy, finishing an instruction almost every cycle. But every instruction is fetched from memory, and memory is slow. A core can do an addition in about a cycle; a read from main memory can cost a few hundred. Left alone, the fast machine we built would spend almost all of its life waiting for the next word to arrive. This chapter is about the trick that hides the wait, and it is arguably the largest performance idea in all of hardware.

8.1 How far away memory is

The gap is easier to feel if we stretch time. Suppose one clock cycle lasted one whole second. Then reaching the nearest, smallest, fastest memory takes a few seconds; reaching main memory takes minutes; reaching the disk takes days; and reaching another machine across the network takes over a year. Figure 8.2 lays the distances out on that human scale.

latency-ladderlatency-ladder
Figure 8.2: The memory hierarchy, each level trading size for distance. On the far right, every distance is scaled to human time at one second a cycle, and named for the place it would carry you. The registers and caches are seconds away; everything below is minutes, days, or years.

The interactive version needs the in-browser runtime.

No single memory can be both large and close, so machines use several at once: a small, fast one right beside the core, backed by larger, slower ones further out. The small fast pools are caches, built from the same cross-coupled latches as the registers of Chapter 4. The whole scheme works only because programs are predictable, in two ways. A word just used is likely to be used again soon (temporal locality), and the words next to it are likely to be used soon after (spatial locality). Figure 8.3 shows how the second of those leads a cache to move data in whole lines, not single words.

cache-linecache-line
Figure 8.3: A cache moves a whole line at a time. A miss on any one word loads its eight-word line into a slot, so the next seven words of that line are then hits. One slow trip buys seven fast ones.

8.2 Keeping the hot data near

The model here is the simplest cache that works, a direct-mapped one. Each line of memory maps to exactly one slot in the cache, chosen by its number. A load hits if that slot already holds the line it wants, and otherwise misses, fetches the line, and evicts whatever was there.

Example 8-1. A direct-mapped cache

class Cache:
    """A direct-mapped cache: every memory line maps to exactly one slot."""

    def __init__(self, slots: int = SLOTS, line: int = LINE):
        self.line = line
        self.blocks = [None] * slots   # the line number in each slot, or None

    def load(self, address: int) -> bool:
        """Look up an address. Return True on a hit; on a miss, load its line into
        the one slot it maps to, evicting whatever was there, and return False."""
        block = address // self.line1
        slot = block % len(self.blocks)
        hit = self.blocks[slot] == block2
        self.blocks[slot] = block3
        return hit

8.3 The hit rate is everything

A cache only helps when it hits, so the number that matters is the hit rate. The arithmetic is brutal. An average access costs the fast hit time always, plus the slow miss penalty on the fraction that miss. When the penalty is a hundred times the hit, even a small miss rate dominates the average.

Example 8-2. The average cost of an access

def average_access(hit_rate: float) -> float:
    """The cost of an average access: the hit time always, plus the miss penalty
    on the fraction that miss. A small miss rate on a large penalty still hurts."""
    return HIT_TIME + (1 - hit_rate) * MISS_PENALTY1

The cache does not make memory fast; it makes the slow trip rare. A cache makes one core wait less, and the next chapter takes the other road to speed: stop using one core, and use many.

Exercises

PreviousNext
✎ Suggest an edit