Chapter 9
What if one core is not enough?

  1. The browser
  2. The network
  3. Software and languages
  4. The CPU
  5. Memory
  6. Arithmetic
  7. Logic gates
  8. The switch
Figure 9.1: Where we are in the climb from transistor to browser: still on the processor, the last way to make it faster. This chapter stops speeding up one core and starts using many.

A pipeline and a cache each make one core do more, but only up to a point: the clock can go only so fast, and the cache can hide only so much. So the last two decades of speed have come less from a quicker core and more from having more than one. A multicore chip is several complete processors of Chapter 5 on one die, each with its own pipeline and caches, sharing the larger memory below and running a protocol so they never disagree about who holds the freshest copy of a word, a protocol this book leaves outside its walls. This chapter models what more than one worker buys, and where the idea hits its own wall.

9.1 Many hands on one job

The easiest work to share is the same operation over many items: square a thousand numbers, brighten a million pixels. Hand each worker a slice and let them run at the same time. The answer is exactly the one a single worker would produce, item by item; what changes is the wall-clock, which falls to the size of the largest slice. Figure 9.2 shows the split.

data-paralleldata-parallel
Figure 9.2: A data-parallel map. Sixteen items handed out in four slices to four workers running at once. Each does four steps; because they run together, the job finishes in four, not sixteen.

Example 9-1. Splitting a map across workers

def parallel_map(operation, data: list, workers: int) -> tuple:
    """Apply `operation` to every item, the items split across workers that run at
    the same time. Return the result and the step count: the size of the largest
    chunk, since that worker is the last to finish."""
    chunks = _chunks(data, workers)1
    result = [operation(item) for chunk in chunks for item in chunk]
    steps = max((len(chunk) for chunk in chunks), default=0)2
    return result, steps

This is the shape of the fastest hardware there is. Inside a single core, one SIMD instruction applies the same operation to a whole row of numbers at once. A graphics processor takes it to the extreme: thousands of small, simple versions of the arithmetic unit of Chapter 3 marching in lockstep, hopeless at a branchy program but unbeatable at doing the same sum to a million values, which is why the same chips that draw games also train neural networks.

9.2 The ceiling

Doubling the workers looks like it should double the speed forever, but it never does, because no real job is all parallel. Some part of it must run in order, and that serial part is not shared out among the workers. Amdahl’s law makes this exact: if a fraction of the work is serial, the speedup can never exceed one over that fraction, no matter how many workers pile on.

Example 9-2. Amdahl’s law

def speedup(workers: int, serial_fraction: float) -> float:
    """Amdahl's law: how much faster a job runs on `workers` workers when
    `serial_fraction` of it must run in order. The parallel rest is shared among
    the workers; the serial part is not, so it is never divided away."""
    return 1 / (serial_fraction + (1 - serial_fraction) / workers)1

Figure 9.3 draws that ceiling for a job that is one-tenth serial: the speedup climbs with workers but presses up against ten and never passes it.

amdahlamdahl
Figure 9.3: Amdahl’s ceiling for a job one-tenth serial. Each bar is the speedup at a worker count; they rise towards the dashed ceiling of ten but, even at 256 workers, fall short of it.

The interactive version needs the in-browser runtime.

Whatever the shape of the machine, one core or a thousand, a wide vector unit or a graphics processor, it bottoms out in the same place: the gates of Chapter 2 and the switch of Chapter 1. One last question about the hardware remains, how a single chip can be turned into any circuit at all, and the next chapter answers it.

Exercises

PreviousNext
✎ Suggest an edit