Chapter 9
What if one core is not enough?
- The browser
- The network
- Software and languages
- The CPU
- Memory
- Arithmetic
- Logic gates
- The switch
A pipeline and a cache each make one core do more, but only up to a point: the clock can go only so fast, and the cache can hide only so much. So the last two decades of speed have come less from a quicker core and more from having more than one. A multicore chip is several complete processors of Chapter 5 on one die, each with its own pipeline and caches, sharing the larger memory below and running a protocol so they never disagree about who holds the freshest copy of a word, a protocol this book leaves outside its walls. This chapter models what more than one worker buys, and where the idea hits its own wall.
9.1 Many hands on one job
The easiest work to share is the same operation over many items: square a thousand numbers, brighten a million pixels. Hand each worker a slice and let them run at the same time. The answer is exactly the one a single worker would produce, item by item; what changes is the wall-clock, which falls to the size of the largest slice. Figure 9.2 shows the split.
Example 9-1. Splitting a map across workers
def parallel_map(operation, data: list, workers: int) -> tuple: """Apply `operation` to every item, the items split across workers that run at the same time. Return the result and the step count: the size of the largest chunk, since that worker is the last to finish.""" chunks = _chunks(data, workers)1 result = [operation(item) for chunk in chunks for item in chunk] steps = max((len(chunk) for chunk in chunks), default=0)2 return result, steps
- the items are cut into one near-equal chunk per worker
- the wall-clock is the largest chunk, since that worker is the last to finish; the answer is the same as the plain left-to-right map
This is the shape of the fastest hardware there is. Inside a single core, one SIMD instruction applies the same operation to a whole row of numbers at once. A graphics processor takes it to the extreme: thousands of small, simple versions of the arithmetic unit of Chapter 3 marching in lockstep, hopeless at a branchy program but unbeatable at doing the same sum to a million values, which is why the same chips that draw games also train neural networks.
9.2 The ceiling
Doubling the workers looks like it should double the speed forever, but it never does, because no real job is all parallel. Some part of it must run in order, and that serial part is not shared out among the workers. Amdahl’s law makes this exact: if a fraction of the work is serial, the speedup can never exceed one over that fraction, no matter how many workers pile on.
Example 9-2. Amdahl’s law
def speedup(workers: int, serial_fraction: float) -> float: """Amdahl's law: how much faster a job runs on `workers` workers when `serial_fraction` of it must run in order. The parallel rest is shared among the workers; the serial part is not, so it is never divided away.""" return 1 / (serial_fraction + (1 - serial_fraction) / workers)1
- the parallel rest is divided among the workers, but the serial fraction is added back undivided, so it sets a floor on the time and a ceiling on the speed
Figure 9.3 draws that ceiling for a job that is one-tenth serial: the speedup climbs with workers but presses up against ten and never passes it.
Whatever the shape of the machine, one core or a thousand, a wide vector unit or a graphics processor, it bottoms out in the same place: the gates of Chapter 2 and the switch of Chapter 1. One last question about the hardware remains, how a single chip can be turned into any circuit at all, and the next chapter answers it.
Exercises
- Map an operation over a list whose length is not a multiple of the worker count. Why is the speedup then less than the worker count?
- A job is one-quarter serial. What is the most any number of workers can speed it up?
- How many workers does the one-tenth-serial job need to reach nine times faster? Why are the ones after that barely worth adding?