OSTEP Chapter 13: The Abstraction of Address Spaces

Chapter 13 of OSTEP provides a primer on how and why modern operating systems abstract physical memory.

This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.


Multiprogramming and Time Sharing

In the early days of computing, machines didn't provide much of a memory abstraction to users. The operating system was essentially a library of routines sitting at the bottom of physical memory, while a single running process occupied the rest of the available space. 

I am not that old, but I got to experience this in 2002 programming sensor nodes with TinyOS. These "motes" ran on extremely constrained hardware with just 64KB of memory, so there was no complex virtualization. The entire OS and the application were compiled together as a single process, and all memory was statically preallocated. Ah, the joy of debugging a physically distributed multi-node deployment using three LEDs per node, with everyone shouting "red fast-blinking, yellow on". Here’s a snapshot from two mote deployments we ran at UB. Most of my pre-2010 work was on sensor networks. I realize I never wrote about that period. Maybe there’s a reason.

The shift away from the simple early computer systems began with multiprogramming. The OS switched among ready processes to increase CPU utilization on expensive machines. Later, the demand for interactivity gave rise to time sharing. Initially, implementing time sharing involved saving a process's entire memory state to disk and loading another, but this had terrible overhead. Thus, operating systems began keeping multiple processes resident in memory, enabling fast context switches.



The Address Space and Virtualization

To solve the protection problems caused by sharing memory, the OS introduced the address space. This abstraction is a running program's view of memory, containing three components. 

  • Program Code: The static instructions, placed at the top of the address space (the low addresses).
  • Heap: Dynamically allocated, user-managed memory (e.g., via malloc), sitting just below the code and growing downward, toward higher addresses.
  • Stack: Used for local variables and function calls, placed at the bottom of the address space and growing upward, toward lower addresses.

The crux of virtual memory is how the OS, with hardware help, translates the virtual addresses a program generates into physical addresses. The OS aims for three primary goals:

  • Transparency: The program should behave as if it has its own private physical memory, unaware that memory is being virtualized.
  • Efficiency: The OS must minimize the time and space overhead of translation.
  • Protection: The OS must isolate processes, ensuring a program cannot affect the memory of the OS or other applications.
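A minimal sketch of how translation can serve all three goals is base-and-bounds relocation, a simple scheme OSTEP develops in later chapters. The class and register values below are illustrative assumptions:

```python
# Sketch of base-and-bounds address translation (illustrative, not a real MMU).
class Relocation:
    def __init__(self, base, bounds):
        self.base = base      # where this address space sits in physical memory
        self.bounds = bounds  # size of the address space

    def translate(self, vaddr):
        # Protection: any access outside [0, bounds) is rejected.
        if not 0 <= vaddr < self.bounds:
            raise MemoryError(f"fault: virtual address {vaddr:#x} out of bounds")
        # Transparency: the program only ever sees vaddr; relocation is invisible.
        return self.base + vaddr

mmu = Relocation(base=0x10000, bounds=0x4000)
print(hex(mmu.translate(0x100)))   # virtual 0x100 -> physical 0x10100
```

Efficiency is why real systems do this in hardware: a single add and compare per access, rather than a software check like this one.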


GPUs and Beyond

So, do these classic principles still apply to modern hardware like GPUs? Yes, but the scale and complexity are much greater. Early CPUs handled a few processes; GPUs handle thousands of threads. Yet the foundational abstractions (virtual address spaces, hardware-assisted translation, and memory isolation) mostly stay the same (for now). CUDA Unified Memory pushes things further by presenting a single address space spanning system RAM and GPU VRAM.

A major challenge in modern machine learning is the so-called memory wall. AI accelerators perform computations so fast that they spend most of their time waiting for data to arrive. Conventional system RAM cannot deliver the required bandwidth, which forced the adoption of High Bandwidth Memory (HBM): stacks of DRAM dies placed right next to the compute cores on the same package, connected with ultra-fine wiring capable of moving terabytes of data per second.

The memory wall is so severe that the main bottleneck in scaling LLMs today is memory, not compute. Generating text relies on the attention mechanism, which keeps key-value (KV) representations of all previous tokens to avoid recomputing context. When we hand an LLM a huge codebase, the GPU's ultra-fast but limited HBM is quickly exhausted. To avoid running out of memory, engineers turn to tiered designs: hot data stays in HBM while overflow spills to cheaper CXL-attached system RAM. KV-cache offloading is another key technique: it moves inactive users' KV caches to host CPU memory or fast NVMe SSDs to free GPU space.
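A back-of-the-envelope calculation shows why the KV cache exhausts HBM so quickly. The model dimensions below (layers, heads, head size, fp16 storage) are assumed numbers for illustration, not any particular model's published configuration:

```python
# Back-of-the-envelope KV-cache sizing; all model dimensions are assumptions.
def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Each token stores one key and one value vector per layer (factor of 2),
    # each of size kv_heads * head_dim, at dtype_bytes per element (2 for fp16).
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token

ctx = 128 * 1024  # a "huge codebase" worth of context
print(f"per token: {kv_cache_bytes(1) // 1024} KiB")        # 128 KiB
print(f"for 128K context: {kv_cache_bytes(ctx) / 2**30:.0f} GiB")  # 16 GiB
```

Under these assumptions, a single 128K-token request consumes 16 GiB of KV cache, before counting the model weights themselves, which is exactly the pressure that tiering and offloading are designed to relieve.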
