What if one core is not enough? · From Transistor to Browser

Chapter 9
What if one core is not enough?

The browser
The network
Software and languages
The CPU
Memory
Arithmetic
Logic gates
The switch

Figure 9.1: Where we are in the climb from transistor to browser: still on the processor, the last way to make it faster. This chapter stops speeding up one core and starts using many.

A pipeline and a cache each make one core do more, but only up to a point: the clock can go only so fast, and the cache can hide only so much. So the last two decades of speed have come less from a quicker core and more from having more than one. A multicore chip is several complete processors of Chapter 5 on one die, each with its own pipeline and caches, sharing the larger memory below and running a protocol so they never disagree about who holds the freshest copy of a word, a protocol this book leaves outside its walls. This chapter models what more than one worker buys, and where the idea hits its own wall.

9.1 Many hands on one job

The easiest work to share is the same operation over many items: square a thousand numbers, brighten a million pixels. Hand each worker a slice and let them run at the same time. The answer is exactly the one a single worker would produce, item by item; what changes is the wall-clock, which falls to the size of the largest slice. Figure 9.2 shows the split.

Figure 9.2: A data-parallel map. Sixteen items handed out in four slices to four workers running at once. Each does four steps; because they run together, the job finishes in four, not sixteen.

Example 9-1. Splitting a map across workers

def parallel_map(operation, data: list, workers: int) -> tuple:
    """Apply `operation` to every item, the items split across workers that run at
    the same time. Return the result and the step count: the size of the largest
    chunk, since that worker is the last to finish."""
    chunks = _chunks(data, workers)1
    result = [operation(item) for chunk in chunks for item in chunk]
    steps = max((len(chunk) for chunk in chunks), default=0)2
    return result, steps

the items are cut into one near-equal chunk per worker
the wall-clock is the largest chunk, since that worker is the last to finish; the answer is the same as the plain left-to-right map

This is the shape of the fastest hardware there is. Inside a single core, one SIMD instruction applies the same operation to a whole row of numbers at once. A graphics processor takes it to the extreme: thousands of small, simple versions of the arithmetic unit of Chapter 3 marching in lockstep, hopeless at a branchy program but unbeatable at doing the same sum to a million values, which is why the same chips that draw games also train neural networks.

9.2 The ceiling

Doubling the workers looks like it should double the speed forever, but it never does, because no real job is all parallel. Some part of it must run in order, and that serial part is not shared out among the workers. Amdahl’s law makes this exact: if a fraction of the work is serial, the speedup can never exceed one over that fraction, no matter how many workers pile on.

Example 9-2. Amdahl’s law

def speedup(workers: int, serial_fraction: float) -> float:
    """Amdahl's law: how much faster a job runs on `workers` workers when
    `serial_fraction` of it must run in order. The parallel rest is shared among
    the workers; the serial part is not, so it is never divided away."""
    return 1 / (serial_fraction + (1 - serial_fraction) / workers)1

the parallel rest is divided among the workers, but the serial fraction is added back undivided, so it sets a floor on the time and a ceiling on the speed

Figure 9.3 draws that ceiling for a job that is one-tenth serial: the speedup climbs with workers but presses up against ten and never passes it.

amdahl — Figure 9.3: Amdahl’s ceiling for a job one-tenth serial. Each bar is the speedup at a worker count; they rise towards the dashed ceiling of ten but, even at 256 workers, fall short of it.

Whatever the shape of the machine, one core or a thousand, a wide vector unit or a graphics processor, it bottoms out in the same place: the gates of Chapter 2 and the switch of Chapter 1. One last question about the hardware remains, how a single chip can be turned into any circuit at all, and the next chapter answers it.

Exercises

Map an operation over a list whose length is not a multiple of the worker count. Why is the speedup then less than the worker count?
A job is one-quarter serial. What is the most any number of workers can speed it up?
How many workers does the one-tenth-serial job need to reach nine times faster? Why are the ones after that barely worth adding?

Previous Next

Chapter 9What if one core is not enough?

9.1 Many hands on one job

9.2 The ceiling

Exercises

Chapter 9
What if one core is not enough?