OSTEP Chapter 8
The crux of this chapter is how to schedule tasks without perfect knowledge. As you may remember from the previous chapter, the core tension in CPU scheduling is between two conflicting goals:
- Minimizing Turnaround Time: Usually achieved by running shorter jobs first (SJF).
- Minimizing Response Time: Usually achieved by Round Robin scheduling (RR). Essential for interactive users.
Unfortunately, the OS does not have a crystal ball: it doesn't know whether a process is a short interactive job or a massive number-crunching batch job. The Multi-Level Feedback Queue (MLFQ) solves this by learning from each job's history: it assumes that a job that has been CPU-intensive in the past will likely remain CPU-intensive in the future. As we'll see below, it also gives jobs a chance to redeem themselves through a periodic priority boost.
I really enjoyed this chapter. MLFQ, invented by Corbató in 1962, is a brilliant scheduling algorithm. This elegant solution served as the base scheduler for many systems, including BSD UNIX derivatives, Solaris, and Windows NT and its successors.
(This is part of our series going through OSTEP book chapters.)
How MLFQ Works: The Basic Rules
The chapter constructs the MLFQ algorithm iteratively, starting with a basic structure involving distinct queues, each with a different priority level.
- Rule 1: If Priority(A) > Priority(B), A runs.
- Rule 2: If Priority(A) = Priority(B), they run in Round-Robin.
But how does a job get its priority?
- Rule 3: New jobs start at the highest priority.
- Rule 4 (Initial Version): If a job uses up its time allotment, it moves down a queue. If it gives up the CPU (e.g., for I/O) before the time is up, it stays at the same priority.
This setup cleverly approximates Shortest Job First. Because the scheduler assumes every new job is short (giving it high priority), true short jobs finish quickly. Long jobs eventually exhaust their time slices and sink to the bottom queues, where they run only when the system isn't busy with interactive tasks.
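To make Rules 1-4 concrete, here is a minimal simulation sketch of the initial MLFQ rules. The queue count, quantum length, and the `Job`/`run_mlfq` names are all my illustrative choices, not from the book; I/O is simplified to "the job yields early and is immediately ready again."

```python
from collections import deque

NUM_QUEUES = 3   # queue 0 is the highest priority (assumed count)
QUANTUM = 10     # time slice in ms, same for every level in this sketch

class Job:
    def __init__(self, name, remaining, io_every=None):
        self.name = name
        self.remaining = remaining   # total CPU time still needed
        self.io_every = io_every     # yield for I/O after this much CPU (None = CPU-bound)

def run_mlfq(jobs):
    queues = [deque() for _ in range(NUM_QUEUES)]
    for job in jobs:
        queues[0].append(job)        # Rule 3: new jobs enter the top queue
    trace = []
    while any(queues):
        # Rules 1 & 2: pick the highest non-empty queue, round-robin within it
        level = next(i for i, q in enumerate(queues) if q)
        job = queues[level].popleft()
        # Run until the quantum expires, the job does I/O, or it finishes
        slice_used = min(QUANTUM, job.remaining)
        if job.io_every is not None:
            slice_used = min(slice_used, job.io_every)
        job.remaining -= slice_used
        trace.append((job.name, level, slice_used))
        if job.remaining == 0:
            continue                              # job is done
        if slice_used < QUANTUM:
            queues[level].append(job)             # yielded early: keep priority
        else:
            demoted = min(level + 1, NUM_QUEUES - 1)
            queues[demoted].append(job)           # Rule 4: used whole slice, move down
    return trace
```

Running `run_mlfq([Job("interactive", 30, io_every=5), Job("batch", 50)])` shows the SJF-like behavior described above: the I/O-heavy job keeps yielding early and stays at the top queue, while the CPU-bound job sinks to the bottom.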
Patching the initial MLFQ rules
However, this basic version has fatal flaws.
- If too many interactive jobs flood the system, low-priority background jobs might starve.
- A clever user could rewrite a program to yield the CPU (e.g., by issuing an I/O to a dummy file) just before its time slice ends. Under the initial rules this resets the job's allotment, allowing it to monopolize the CPU at the highest priority.
- A job that starts CPU-intensive but becomes interactive later (like a compiler finishing and waiting for input) would be stuck at the bottom priority.
To fix these issues, the chapter introduces two crucial modifications.
The Priority Boost: To prevent low-priority jobs from starving, the scheduler employs Rule 5: After a set time period (S), all jobs are moved back to the topmost queue. This "boost" ensures that CPU-bound jobs get at least some processing time and allows jobs that have become interactive to return to a high-priority state.
Better Accounting: To stop users from gaming the system, the chapter rewrites Rule 4 to change how time is tracked. Rule 4 (revised): Instead of resetting the allotment every time a job yields the CPU, the scheduler tracks the total time a job uses at a given priority level. Once the allotment is used up (regardless of how many times the job has yielded the CPU), the job is demoted.
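The two patches can be sketched in isolation. The allotment, boost period, and helper names below are assumed values for illustration, not anything the chapter prescribes:

```python
ALLOTMENT_MS = 20     # total CPU a job may use per level before demotion (assumed)
NUM_QUEUES = 3

class Job:
    def __init__(self, name):
        self.name = name
        self.level = 0    # 0 is highest priority
        self.used = 0     # CPU time consumed so far at the current level

def charge(job, ran_ms):
    """Revised Rule 4: charge every millisecond against the allotment,
    no matter how often the job yields for I/O in between."""
    job.used += ran_ms
    if job.used >= ALLOTMENT_MS:
        job.level = min(job.level + 1, NUM_QUEUES - 1)
        job.used = 0

def boost(jobs):
    """Rule 5: every S ms, move all jobs back to the topmost queue."""
    for job in jobs:
        job.level = 0
        job.used = 0
```

With this accounting, a job that yields after 9 ms of every 10 ms slice still gets demoted once its cumulative usage crosses the allotment, and the periodic `boost` rescues anything stuck in the bottom queues.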
Tuning MLFQ
The remaining piece of the puzzle is parameterization. An MLFQ requires choosing the number of queues, the time slice length for each, and the frequency of the priority boost. There are no easy answers to these questions, and finding a satisfactory balance often requires deep experience with specific workloads. For example, most implementations employ varying time-slice lengths, assigning short slices (e.g., 10 ms) to high-priority queues for responsiveness and longer slices (e.g., 100s of ms) to low-priority queues for efficiency. Furthermore, the priority boost interval is often referred to as a "voodoo constant" because it requires magic to set correctly; if the value is too high, jobs starve, but if it is too low, interactive performance suffers.
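As a concrete (and entirely illustrative) example of such a parameterization, one might bundle the knobs like this; none of these numbers come from a real kernel:

```python
from dataclasses import dataclass

@dataclass
class MLFQConfig:
    # All values below are illustrative assumptions, not from any real scheduler.
    num_queues: int = 5
    time_slice_ms: tuple = (10, 20, 50, 100, 200)  # highest-priority queue first
    boost_period_ms: int = 1000                    # the "voodoo constant" S

cfg = MLFQConfig()
```

The shape mirrors the heuristic in the text: slices grow as priority drops, trading responsiveness at the top for lower context-switch overhead at the bottom.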
MLFQ is a milestone in operating systems design. It delivers strong performance for interactive jobs without prior knowledge of job length, while remaining fair to long-running tasks. As noted earlier, it became the base scheduler for many operating systems, with several variants refining the core idea. One notable variant is the decay-usage approach used in FreeBSD 4.3. Instead of using fixed priority tables (as in Solaris), it computes priority using a mathematical function of recent CPU usage. Running increases a job’s usage counter and lowers its priority, while the passage of time decays this counter. Decay plays the same role as periodic priority boosts. As usage fades, priority rises, ensuring long-running jobs eventually run and allowing jobs that shift from CPU-bound to interactive to regain high priority.
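The decay-usage idea can be sketched in a few lines. This is a hedged approximation of the mechanism described above, not FreeBSD's actual formula; the decay factor, divisor, and "lower number = higher priority" convention are assumptions:

```python
DECAY = 0.5        # fraction of the usage counter kept at each decay tick (assumed)
BASE_PRIORITY = 0  # lower number = better priority (assumed convention)

def on_run(usage, ran_ms):
    """Running raises the usage counter, which worsens priority."""
    return usage + ran_ms

def on_decay(usage):
    """Periodic decay: plays the same role as MLFQ's priority boost."""
    return usage * DECAY

def priority(usage):
    """Priority is a function of recent CPU usage: more usage, worse priority."""
    return BASE_PRIORITY + usage / 4   # divisor is an illustrative choice
```

After a burst of CPU use, a job's priority number is high (bad); once it stops running, repeated decay ticks shrink the counter and its priority drifts back toward the base, which is exactly how a formerly CPU-bound job regains interactivity.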
TLA+ model
I used Gemini to write a TLA+ model of the MLFQ algorithm here. To run this MLFQ TLA+ model in Spectacle for visualization, click this link; it will open the model in your browser, with no installation or plugin required. What you will see is the initial state. Click any enabled action to take it; you can go back and forward in the right pane to explore the execution. You can also share a URL with anyone to point to an interesting state or trace, just like I did here.
Comments