Always Measure One Level Deeper

This is a great paper (CACM 2018) by John Ousterhout. Ousterhout is well known for his work on the log-structured file system, Tcl/Tk, Raft, and the Magic VLSI CAD tool. His book, A Philosophy of Software Design, is great, and he has a lot of wisdom about life in general that he shares in his Stanford CS classes.

The paper is written very well, so I lift paragraphs from it verbatim to summarize its main points. There are many war stories in the text. Please do read it; they are fascinating, you can likely see how they apply to your own work, and they may save you from making a mistake.

At the end, I chime in with my reflections and link to other relevant work. 

Key Insights

In academic research a thorough performance evaluation is considered essential for many publications to prove the value of a new idea. In industry, performance evaluation is necessary to maintain a high level of performance across the lifetime of a product.

A good performance evaluation provides a deep understanding of a system’s behavior, quantifying not only the overall behavior but also its internal mechanisms and policies. It explains why a system behaves the way it does, what limits that behavior, and what problems must be addressed in order to improve the system. Done well, performance evaluation exposes interesting system properties that were not obvious previously. It not only improves the quality of the system being measured but the developer’s intuition, resulting in better systems in the future.

Performance measurements often go wrong, reporting surface-level results that are more marketing than science. Performance measurement is less straightforward than it might seem; it is easy to believe results that are incorrect or misleading and overlook important system behaviors. 

The key to good performance measurement is to make many more measurements besides the ones you think will be important; it is crucial to understand not just the system’s performance but also why it performs that way.

Performance measurement done well results in new discoveries about the system being measured and new intuition about system behavior for the person doing the measuring.

Most Common Mistakes

Mistake 1: Trusting the numbers. Engineers are easily fooled during performance measurements because measurement bugs are not obvious. Performance measurements should be considered guilty until proven innocent. I have been involved in dozens of performance-measurement projects and cannot recall a single one in which the first results were correct.

Mistake 2: Guessing instead of measuring. The second common mistake is to draw conclusions about a system’s performance based on educated guesses, without measurements to back them up. Educated guesses are often correct and play an important role in guiding performance measurement; see Rule 3 (Use your intuition to ask questions, not answer them). However, engineers’ intuition about performance is not reliable. When my students and I designed our first log-structured file system, we were fairly certain that reference patterns exhibiting locality would result in better performance than those without locality. Fortunately, we decided to measure, to be sure. To our surprise, the workloads with locality behaved worse than those without. It took considerable analysis to understand this behavior. The reasons were subtle, but they exposed important properties of the system and led us to a new policy for garbage collection that improved the system’s performance significantly. If we had trusted our initial guess, we would have missed an important opportunity for performance improvement.

Mistake 3: Superficial measurements. Most performance measurements I see are superficial, measuring only the outermost visible behavior of a system (such as the overall running time of an application or the average latency of requests made to a server). These measurements are essential, as they represent the bottom line by which a system is likely to be judged, but they are not sufficient. They leave many questions unanswered (such as “What are the limits that keep the system from performing better?” and “Which of the improvements had the greatest impact on performance?”). In order to get a deep understanding of system performance, the internal behavior of a system must be measured, in addition to its top-level performance.

Mistake 4: Confirmation bias. Confirmation bias causes people to select and interpret data in a way that supports their hypotheses. For example, confirmation bias affects your level of trust. When you see a result that supports your hypothesis, you are more likely to accept the result without question. In contrast, if a measurement suggests your new approach is not performing well, you are more likely to dig deeper to understand exactly what is happening and perhaps find a way to fix the problem. This means that an error in a positive result is less likely to be detected than is an error in a negative result. Confirmation bias also affects how you present information. You are more likely to include results that support your hypothesis and downplay or omit results that are negative. For example, I frequently see claims in papers of the form: “XXX is up to 3.5x faster than YYY.” Such claims cherry-pick the best result to report and are misleading because they do not indicate what performance can be expected in the common case. Statements like this belong in late-night TV commercials, not scientific papers. (Mic drop!)
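
To make the cherry-picking concrete, here is a minimal Python sketch (the workload names and speedup numbers are invented purely for illustration) contrasting the "up to" number with the distribution a reader actually needs:

```python
import statistics

# Hypothetical speedups of "XXX" over "YYY" across a suite of workloads.
# These numbers are made up solely to illustrate the reporting problem.
speedups = {"scan": 1.1, "point-lookup": 0.9, "bulk-load": 3.5,
            "mixed": 1.2, "update-heavy": 1.0}

best = max(speedups.values())
median = statistics.median(speedups.values())

print(f"'up to' claim:  {best:.1f}x faster")   # the late-night-TV number
print(f"median speedup: {median:.1f}x")        # closer to the common case
for name, s in sorted(speedups.items()):
    print(f"  {name:>13}: {s:.1f}x")           # the full per-workload picture
```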

Mistake 5: Haste. The last mistake in performance evaluation is not allowing enough time. Engineers usually underestimate how long it takes to measure performance accurately, so they often carry out evaluations in a rush. When this happens, they will make mistakes and take shortcuts, leading to all the other mistakes.


Keys to High-Quality Performance Analysis

Rule 1: Allow lots of time. 

Rule 2: Never trust a number generated by a computer. The way to validate a measurement is to find different ways to measure the same thing: take different measurements at the same level; measure the system's behavior at a lower level to break down the factors that determine performance; run simulations and compare their results to measurements of the real implementation.
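
As a minimal sketch of the "measure the same thing in different ways" idea, the snippet below (with made-up phases standing in for a real benchmark) cross-checks the overall wall-clock time of a run against the sum of its per-phase times and flags unaccounted time:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical phases of a benchmark run; real code would call into the system.
def load_data():   time.sleep(0.05)
def run_queries(): time.sleep(0.20)
def verify():      time.sleep(0.02)

overall_start = time.perf_counter()
phase_times = {}
for phase in (load_data, run_queries, verify):
    _, phase_times[phase.__name__] = timed(phase)
overall = time.perf_counter() - overall_start

# Cross-check: the phases should account for (nearly) all of the wall-clock time.
accounted = sum(phase_times.values())
if abs(overall - accounted) / overall > 0.05:
    print(f"WARNING: {overall - accounted:.3f}s unaccounted for; "
          "either the measurement or the model of the system is wrong")
print(phase_times, f"overall={overall:.3f}s")
```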

Rule 3: Use your intuition to ask questions, not to answer them. Intuition is a wonderful thing. As you accumulate knowledge and experience in an area, you will start having gut-level feelings about a system’s behavior and how to handle certain problems. If used properly, such intuition can save significant time and effort. However, it is easy to become over-confident and assume your intuition is infallible. Curmudgeons make good performance evaluators because they trust nothing and enjoy finding problems.

Rule 4: Always measure one level deeper. If you want to understand the performance of a system at a particular level, you must measure not just that level but also the next level deeper. That is, measure the underlying factors that contribute to the performance at the higher level. If you are measuring overall latency for remote procedure calls, you could measure deeper by breaking down that latency, determining how much time is spent in the client machine, how much time is spent in the network, and how much time is spent on the server. You could also measure where time is spent on the client and server. Measuring deeper is the single most important ingredient for high-quality performance measurement. Focusing on this one rule will prevent most of the mistakes anyone could potentially make.
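
Here is a minimal sketch of such a breakdown for an RPC, assuming a hypothetical client stub and a hypothetical server_micros field that an instrumented server returns alongside its reply; whatever remains after subtracting server time is attributed to network transit plus queueing:

```python
import time

def call_with_breakdown(rpc_stub, request):
    """Measure one RPC and split its latency into server time and 'the rest'.

    Assumes the server has been instrumented to report its own processing
    time in the reply (the hypothetical `server_micros` field); the remainder
    is network transit plus queueing and stub overhead on both sides.
    """
    t0 = time.perf_counter()
    reply = rpc_stub(request)                   # hypothetical stub
    total_us = (time.perf_counter() - t0) * 1e6

    server_us = reply["server_micros"]          # reported by the server
    other_us = total_us - server_us             # network + queueing + stubs
    return {"total_us": total_us, "server_us": server_us, "other_us": other_us}

# Example with a fake stub that "processes" the request for ~200 microseconds.
def fake_stub(request):
    start = time.perf_counter()
    time.sleep(0.0002)
    return {"server_micros": (time.perf_counter() - start) * 1e6}

print(call_with_breakdown(fake_stub, {"op": "read"}))
```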


Measurement Infrastructure

Making good performance measurements takes time, so it is worth creating infrastructure to help you work more efficiently. The infrastructure will easily pay for itself by the time the measurement project is finished. Furthermore, performance measurements tend to be run repeatedly, making infrastructure even more valuable. In a cloud service provider, for example, measurements must be made continuously in order to maintain contractual service levels. 

Also, create a dashboard. It can be as fancy as an interactive webpage or as simple as a text file, but a dashboard is essential for any nontrivial measurement effort.
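
As a sketch of how little infrastructure a dashboard needs, the snippet below (the file name, the record_run helper, and the fields are my own invention, not from the paper) appends one row per benchmark run to a CSV file that can be plotted or diffed later:

```python
import csv, datetime, pathlib, statistics

DASHBOARD = pathlib.Path("perf_dashboard.csv")   # hypothetical file name

def record_run(workload, latencies_ms, commit="unknown"):
    """Append one benchmark run to a simple text-file dashboard."""
    row = {
        "date": datetime.date.today().isoformat(),
        "commit": commit,
        "workload": workload,
        "p50_ms": round(statistics.median(latencies_ms), 3),
        "p99_ms": round(sorted(latencies_ms)[int(0.99 * len(latencies_ms))], 3),
    }
    new_file = not DASHBOARD.exists()
    with DASHBOARD.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if new_file:
            writer.writeheader()
        writer.writerow(row)

record_run("point-lookup", latencies_ms=[0.4, 0.5, 0.5, 0.6, 2.1], commit="abc123")
```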


Discussion

The paper does not explicitly talk about trying representative workloads (although in a couple of places it mentions why measuring under different workloads is important). Improving performance for the workloads your users/customers care about is what matters, so it is essential to benchmark with representative workloads rather than the workloads your system happens to be good at supporting. The latter amounts to drawing the target after you have fired your shots, to make yourself look competent.
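
One way to make "representative" concrete is to control the access distribution of the benchmark. The sketch below (illustrative only; the system under test is left hypothetical) generates a uniform and a Zipf-like skewed key stream; driving the system with both, and with real user traces when you have them, shows whether your results depend on an access pattern your users never produce:

```python
import random

def uniform_keys(n_keys, n_ops):
    """Every key equally likely: no hot set, no locality."""
    return [random.randrange(n_keys) for _ in range(n_ops)]

def zipf_like_keys(n_keys, n_ops, s=1.1):
    """Skewed access: a few hot keys get most of the traffic,
    which is what many production key-value workloads look like."""
    weights = [1 / (rank ** s) for rank in range(1, n_keys + 1)]
    return random.choices(range(n_keys), weights=weights, k=n_ops)

# Drive the same (hypothetical) system with both streams and compare the
# results; if they diverge sharply, a benchmark using only one of them
# says little about what your users will actually experience.
```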

(I could talk about database benchmarking here, but oh please, I don't want to get into that! The subtitle of this article seems to address that topic: "Performance measurements often go wrong, reporting surface-level results that are more marketing than science".)

Another thing I feel the article should mention explicitly (again, it is there implicitly) is the importance of not just measuring components in isolation, but also evaluating their interactions and behavior as part of the overall deployed system. This relates to the emergent complexity that appears when components are integrated. Measuring components in complete isolation, without considering their integration and interactions within the deployed system, can lead to misleading conclusions about real-world performance. Properties and behaviors may emerge at the system level that are not obvious from studying components individually. It is important to measure composed behavior and interactions within the system and its environment; this allows identifying bottlenecks, contention points, and other systemic issues stemming from component interplay. A great example of this is metastable failures. Read this OSDI 2022 paper summary to learn more.
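
Here is a minimal sketch of the isolation-versus-composition point, using a toy operation that holds a shared lock: measured alone it looks fast, but measured while other threads compete for the same resource, the contention that would appear in the deployed system becomes visible (the numbers are illustrative, not from any real system):

```python
import statistics, threading, time

lock = threading.Lock()          # stand-in for a shared resource

def component_op():
    """The operation under test; it briefly holds the shared resource."""
    with lock:
        time.sleep(0.0005)

def measure(n=200):
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        component_op()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples), max(samples)

# 1) In isolation.
print("isolated   (p50 ms, max ms):", measure())

# 2) Composed: other parts of the "system" compete for the same resource.
stop = threading.Event()
def background():
    while not stop.is_set():
        with lock:
            time.sleep(0.0005)
workers = [threading.Thread(target=background) for _ in range(4)]
for w in workers: w.start()
print("under load (p50 ms, max ms):", measure())
stop.set()
for w in workers: w.join()
```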


The principles and practices outlined for high-quality performance evaluation are broadly applicable across various fields beyond just computer systems. I think physics is a great example.

"Science is a way of trying not to fool yourself. The first principle is that you must not fool yourself, and you are the easiest person to fool." -- Richard Feynman

Five years ago I read the book "Big Science: Ernest Lawrence and the Invention that Launched the Military-Industrial Complex". All those decades of work on building particle accelerators were in the service of doing better and more precise measurements. Always measuring one level deeper paid off well for physics. Since the 1930s, the scale of scientific endeavor has grown exponentially. Increasingly, we have come to need big teams, big equipment, and big funding for measurements and inventions.


Other links

If you are interested in learning more about measurements for distributed systems, Marc Brooker talks about this topic frequently on his blog. His recent Twitter thread on measurement was also very interesting.

I worked with Brooker for two years and learned a lot from him just by observing how he thinks about performance and measuring performance. He would singlehandedly get benchmarks and performance measurement started for products. He is also very much into writing simulations (even before the implementation) to get approximate numbers. I think he measures up well against the rules and principles in this paper. I would actually love to get a blog post from him with his reactions/takes on this paper. @MBrooker, please? Would you indulge?

Another thing I learned from him is this: it is important to share not only the numbers that make you look good, but also the numbers that make you look bad. System design is all about tradeoffs, after all, and by showing both, you communicate more explicitly which tradeoffs you are making and which bets you are willing to take.

Dan Luu is another person to follow on this topic.

On my blog, you can also find measurement examples from the papers I have covered.

"If it disagrees with experiment, it's wrong. In that simple statement is the key to science. It doesn't make a difference how beautiful your guess is, it doesn't matter how smart you are, who made the guess, or what his name is. If it disagrees with experiment, it's wrong. That's all there is to it." -- Richard Feynman

