OSTEP Chapter 9: Proportional Share Scheduling

The Crux: Fairness Over Speed. Unlike the schedulers we explored in Chapter 8 (such as Shortest Job First or Multi-Level Feedback Queues), which optimize for turnaround time or response time, the proportional-share schedulers introduced in this chapter aim to guarantee that each job receives a specific percentage of CPU time.

(This is part of our series going through OSTEP book chapters.)


Basic Concept: Tickets

Lottery Scheduling serves as the foundational example of proportional-share schedulers. It uses a randomized mechanism to achieve fairness probabilistically. The central concept of Lottery Scheduling is the ticket. Tickets represent the share of the resource a process should receive.

The scheduler holds a lottery every time slice. If Job A has 75 tickets and Job B has 25 (100 total), the scheduler picks a random number between 0 and 99. Statistically, Job A will win 75% of the time. The implementation is incredibly simple. It requires a random number generator, a list of processes, and a loop that walks the process list summing ticket values until the running counter exceeds the randomly picked winning number.
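That decision loop fits in a few lines. Here is a minimal Python sketch (the job names and ticket counts are just the example from above):

```python
import random

def lottery_pick(jobs):
    """Pick the next job to run; jobs maps job name -> ticket count."""
    total = sum(jobs.values())
    winner = random.randrange(total)  # winning ticket in [0, total)
    counter = 0
    for name, tickets in jobs.items():
        counter += tickets            # walk the list, summing tickets
        if counter > winner:          # first job to exceed the winner runs
            return name

# Over many lotteries, A should win roughly 75% of the time.
jobs = {"A": 75, "B": 25}
wins = sum(lottery_pick(jobs) == "A" for _ in range(10_000))
```

Note that the scheduler needs no per-job history at all: one random draw and one pass over the ticket list decide the next job.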


Advanced Ticket Mechanisms

1.  Ticket Currency: Users can allocate tickets among their own jobs in local currency (e.g., 500 "Alice-tickets"), which the system converts to global currency. This delegates the "fairness" decision to the user.

2.  Ticket Transfer: A client can temporarily hand its tickets to a server process to maximize performance while a specific request is being handled.

3.  Ticket Inflation: In trusted environments, a process can unilaterally boost its ticket count to reflect a higher need for CPU. In competitive settings this is unsafe, since a greedy process could grant itself excessive tickets and monopolize the machine. In practice, modern systems prevent this with control groups (cgroups), which act as an external regulator that assigns fixed resource weights so untrusted processes cannot simply print more tickets to override the scheduler.
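Ticket currency (mechanism 1) is just a proportional conversion. A small sketch, with Alice's allocations as made-up numbers:

```python
def to_global(user_global_tickets, local_allocations):
    """Convert a user's local ticket grants into the global currency.

    user_global_tickets: the user's total share in global tickets.
    local_allocations: job name -> local tickets the user handed out.
    """
    local_total = sum(local_allocations.values())
    return {job: user_global_tickets * local / local_total
            for job, local in local_allocations.items()}

# Alice holds 100 global tickets and splits 1000 "Alice-tickets" 9:1.
alice = to_global(100, {"A1": 900, "A2": 100})  # -> {"A1": 90.0, "A2": 10.0}
```

The global lottery then operates only on converted tickets, so however a user slices their local currency, they can never exceed their global share.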


Lottery Scheduling depends on randomness to decide which job runs next. This randomness helps avoid the tricky corner cases that can trip up deterministic algorithms, like LRU on cyclic workloads, and keeps the scheduler simple with minimal state to track. However, fairness is only achieved over time. In the short term, a job might get unlucky and lose more often than its ticket share would suggest. Measurements in the chapter show that fairness is poor for short jobs and only approaches perfect fairness as the total runtime increases.


Lottery vs. Stride vs. CFS scheduling

Stride Scheduling emerged to address the probabilistic quirks of Lottery Scheduling. It assigns each process a stride, inversely proportional to its tickets, and maintains a pass value that advances by the stride each time the process runs, tracking its virtual progress. At each decision point, the scheduler selects the process with the lowest pass value.

This guarantees exact fairness each cycle, but it introduces challenges with global state. When a new process arrives, assigning it a fair initial pass value is tricky: set it too low, and it can dominate the CPU; too high, and it risks starvation. In contrast, Lottery Scheduling handles new arrivals seamlessly, since it requires no global state.
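A deterministic sketch of the stride loop in Python (the large stride constant and the 100:50:50 ticket split are illustrative; a heap stands in for the scheduler's queue):

```python
import heapq

BIG = 10_000  # large constant; stride = BIG / tickets

def stride_schedule(tickets, slices):
    """Run `slices` scheduling decisions; return how often each job ran."""
    # Queue entries: (pass_value, insertion_order, name); order breaks ties.
    queue = [(0, i, name) for i, name in enumerate(tickets)]
    heapq.heapify(queue)
    runs = {name: 0 for name in tickets}
    for _ in range(slices):
        passv, i, name = heapq.heappop(queue)   # lowest pass runs next
        runs[name] += 1
        # Charge the job by advancing its pass by its stride.
        heapq.heappush(queue, (passv + BIG // tickets[name], i, name))
    return runs

# 100:50:50 tickets yield an exact 2:1:1 split after every full cycle.
stride_schedule({"A": 100, "B": 50, "C": 50}, 200)
```

Unlike the lottery, the split here is exact after every cycle, but each pass value is global state that must be handled carefully when jobs come and go.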



The Linux Completely Fair Scheduler (CFS) builds on these earlier proportional schedulers but removes randomness by using Virtual Runtime (vruntime) to track each process’s CPU usage. At every scheduling decision, CFS selects the job with the smallest vruntime, ensuring a fair distribution of CPU time. To prevent excessive context-switching overhead when there are many tasks (each receiving only a tiny slice of CPU time), CFS enforces a min_granularity. This ensures every process runs for at least a minimum time slice, and it balances fairness with efficient CPU utilization.
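For equal-priority tasks, the slice calculation is just an equal split with a floor. A tiny sketch, using the illustrative 48 ms sched_latency and 6 ms min_granularity values from the chapter:

```python
def time_slice(n_tasks, sched_latency_ms=48, min_granularity_ms=6):
    """Equal split of sched_latency, but never below min_granularity."""
    return max(sched_latency_ms / n_tasks, min_granularity_ms)

time_slice(4)    # 12 ms each: 4 tasks share the 48 ms target latency
time_slice(16)   # clamped to 6 ms; the full period stretches beyond 48 ms
```

When the clamp kicks in, CFS sacrifices a little short-term fairness (the period exceeds sched_latency) to avoid drowning in context switches.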



To prioritize specific processes, CFS uses the classic UNIX "nice" level, which allows users to assign values between -20 (highest priority) and +19 (lowest priority). CFS maps these values to geometric weights; a process with a higher priority (lower nice value) is assigned a larger weight. This weight directly alters the rate at which vruntime accumulates: high-priority processes add to their vruntime much more slowly than low-priority ones. When determining exactly how long a process should run within the target scheduling latency (sched_latency), instead of simply dividing the target latency equally among all tasks (e.g., 48ms/n), CFS calculates the time slice for a specific process k as a fraction of the total weight of all currently running processes.

Consequently, a high-priority job can run for a longer physical time while only "charging" a small amount of virtual time, allowing it to claim a larger proportional share of the CPU compared to "nicer" low-priority tasks.
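The two formulas at work are time_slice_k = (weight_k / Σ weights) × sched_latency and vruntime_i += runtime_i × weight_0 / weight_i, where weight_0 = 1024 is the nice-0 weight. A sketch using a few entries from the kernel's nice-to-weight table (the 48 ms sched_latency is illustrative):

```python
WEIGHT = {-5: 3121, 0: 1024, 5: 335}  # subset of the kernel's table
SCHED_LATENCY = 48  # ms

def slice_ms(my_nice, all_nices):
    """Weighted share of the target latency for one runnable task."""
    total = sum(WEIGHT[n] for n in all_nices)
    return SCHED_LATENCY * WEIGHT[my_nice] / total

def vruntime_delta(nice, runtime_ms):
    """Virtual time charged for runtime_ms of real CPU time."""
    return runtime_ms * WEIGHT[0] / WEIGHT[nice]

# A (nice -5) vs B (nice 0): A gets ~36 ms and B ~12 ms of each 48 ms
# period, yet both accrue the same vruntime, so they alternate fairly.
a_slice = slice_ms(-5, [-5, 0])
```

Running each function through its slice shows the trick: the nice -5 task runs three times longer in wall-clock time but is charged the same virtual time as the nice 0 task.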

Finally, because modern systems handle thousands of processes, CFS replaces the simple lists of Lottery Scheduling with Red-Black Trees, giving O(log n) efficiency for insertion and selection.


Challenges in Proportional Sharing

The I/O Problem. Proportional schedulers face a challenge when jobs sleep, such as waiting for I/O. In a straightforward model, a sleeping job lags behind, and when it resumes, it can monopolize the CPU to catch up, potentially starving other processes. CFS addresses this by resetting the waking job’s vruntime to the minimum value in the tree. This ensures no process starves, but it can penalize the interactive job, leading to slower response times.
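The wakeup fix amounts to a one-line clamp. A hypothetical sketch (real CFS also subtracts a small bonus from min_vruntime, which is omitted here):

```python
def on_wakeup(task_vruntime, min_vruntime_in_tree):
    """Don't let a long sleeper bank a huge vruntime deficit:
    bring it forward to the front of the tree, but no further."""
    return max(task_vruntime, min_vruntime_in_tree)

# A task that slept for 10 s would otherwise run unopposed to "catch up".
on_wakeup(task_vruntime=50, min_vruntime_in_tree=900)   # -> 900
```

The max() matters in the other direction too: a task that merely blinked keeps its own (larger) vruntime instead of jumping ahead of its peers.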

The Ticket Assignment Problem. Assigning tickets is still an open challenge. In general-purpose computing, such as browsers or editors, it is unclear how many tickets each application deserves, making fairness difficult to enforce. The situation is clearer in virtualization and cloud computing, where ticket allocation aligns naturally with resource usage: if a client pays for 25% of a server, it can be assigned 25% of the tickets, providing a clear and effective proportional share.




I’ve been experimenting with Gemini Pro and NotebookLM. What a time to be alive! The fancy slides above all came from these tools, and for the final summary infographic, it produced a solid visual mind map of everything. Decades ago, as a secondary school student, I created similar visual mind maps as mnemonic devices for exams. The Turkish education system relied heavily on memorization, but I needed to understand concepts first to be able to memorize them. So I connected ideas, contextualized them through these visual mind maps, and it worked wonders. I even sold these mind maps to friends before exams. Looking back, those experiences were formative for my later blogging and explanation efforts. Fun times.
