Tuesday, May 28, 2019

Paper summary. Cloud programming simplified: A Berkeley view on Serverless Computing

This position paper by the UC Berkeley RISE lab is about serverless computing, its shortcomings, and its potential. It is easy reading, and is still useful even if you have a pretty good understanding of serverless computing, thanks to some insights and forecasts in the paper. As you will read below, the paper provides a very strong endorsement for serverless computing.

Instead of explaining the paper in my terms, I quote some of my highlights from the paper below, and at the end, in the MAD questions section, I discuss some of my thoughts on serverless computing.

Introduction

We believe the main reason for the success of low-level virtual machines was that in the early days of cloud computing users wanted to recreate the same computing environment in the cloud that they had on their local computers to simplify porting their workloads to the cloud.

To set up your own environment in the cloud (using virtual machines), you need to address these 8 issues.

  1. Redundancy for availability, so that a single machine failure doesn't take down the service. 
  2. Geographic distribution of redundant copies to preserve the service in case of disaster.
  3. Load balancing and request routing to efficiently utilize resources.
  4. Autoscaling in response to changes in load to scale up or down the system.
  5. Monitoring to make sure the service is still running well.
  6. Logging to record messages needed for debugging or performance tuning. 
  7. System upgrades, including security patching.
  8. Migration to new instances as they become available.

Compared to what it takes to set up the servers with the proper environment to run the code, the code to accomplish application logic might be dozens of lines of JavaScript.

In our definition, for a service to be considered serverless, it must scale automatically with no need for explicit provisioning, and be billed based on usage. Cloud functions are the general purpose element in serverless computing today, and lead the way to a simplified and general purpose programming model for the cloud.

While we are unsure which solutions will win, we believe all issues will all be addressed eventually, thereby enabling serverless computing to become the face of cloud computing.

Emergence of Serverless Computing


Serverless programming provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. Automated memory management relieves programmers from managing memory resources, whereas serverless computing relieves programmers from managing server resources.

There are three critical distinctions between serverless and serverful computing:

  1. Decoupled computation and storage. The storage and computation scale separately and are provisioned and priced independently. In general, the storage is provided by a separate cloud service and the computation is stateless.
  2. Executing code without managing resource allocation. Instead of requesting resources, the user provides a piece of code and the cloud automatically provisions resources to execute that code.
  3. Paying in proportion to resources used instead of for resources allocated. Billing is by some dimension associated with the execution, such as execution time, rather than by a dimension of the base cloud platform, such as size and number of VMs allocated.

We believe serverless computing represents significant innovation over platform as a service (PaaS) and other previous models. Among these factors, the autoscaling offered by AWS Lambda marked a striking departure from what came before. It tracked load with much greater fidelity than serverful autoscaling techniques, responding quickly to scale up when needed and scaling all the way down to zero resources, and zero cost, in the absence of demand. It charged in a much more fine-grained way, providing a minimum billing increment of 100 ms at a time when other autoscaling services charged by the hour.
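
\\ Murat's note: To make the billing difference concrete, here is a minimal sketch comparing the two models. The prices and function names are made up for illustration (they are not from the paper or from any provider's price list); only the 100 ms increment and the by-the-hour contrast come from the text above.

```python
import math

# Hypothetical prices, chosen only for illustration.
PRICE_PER_100MS = 0.000002   # serverless: dollars per 100 ms increment
PRICE_PER_VM_HOUR = 0.10     # serverful: dollars per allocated VM-hour


def serverless_cost(invocation_ms):
    """Bill each invocation, rounded up to the next 100 ms increment."""
    increments = sum(math.ceil(ms / 100) for ms in invocation_ms)
    return increments * PRICE_PER_100MS


def serverful_cost(hours_allocated, num_vms=1):
    """Bill for allocated VM-hours, no matter how busy the VMs were."""
    return math.ceil(hours_allocated) * num_vms * PRICE_PER_VM_HOUR


if __name__ == "__main__":
    # 1,000 short requests of 120 ms each, spread over one hour.
    requests = [120] * 1000
    print("serverless:", serverless_cost(requests))
    print("serverful: ", serverful_cost(1.0))
    # With no requests at all, the serverless bill is zero.
    print("idle serverless:", serverless_cost([]))
```

The point of the sketch is the shape of the two functions: one depends only on what was executed, the other only on what was allocated.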


Cloud functions, or functions as a service (FaaS), provide general compute and are complemented by an ecosystem of specialized Backend as a Service (BaaS) offerings such as object storage, databases, or messaging.


Unlike serverless computing, Kubernetes is a technology that simplifies management of serverful computing. Kubernetes can provide short-lived computing environments, like serverless computing, and has far fewer limitations, e.g., on hardware resources, execution time, and network communication. It can also deploy software originally developed for on-premise use completely on the public cloud with little modification. Serverless computing, on the other hand, introduces a paradigm shift that allows fully offloading operational responsibilities to the provider, and makes possible fine-grained multi-tenant multiplexing.

Recent surveys found that about 24% of serverless users were new to cloud computing and 30% of existing serverful cloud customers also used serverless computing.

\\ Murat's note: While 24% is an impressive number, what is the control here? Maybe traditional cloud computing is also getting new users at that rate?


\\ Murat's note: Chat bots are a very popular use case of serverless, even more than IoT in total. They are sneaking under the radar, but are worth watching for their future ubiquitous applications.

Limitations of today's serverless platforms

In this section, we present an overview of five research projects and discuss the obstacles that prevent existing serverless computing platforms from achieving state-of-the-art performance, i.e., matching the performance of serverful clouds for the same workloads.


Serverless SQLite: Databases. A strawman solution would be to run common transactional databases, such as PostgreSQL, Oracle, or MySQL inside cloud functions. However, that immediately runs into a number of challenges. First, serverless computing has no built-in persistent storage, so we need to leverage some remote persistent store, which introduces large latency.  Second, these databases assume connection-oriented protocols, e.g., databases are running as servers accepting connections from clients. This assumption conflicts with existing cloud functions that are running behind network address translators, and thus don't support incoming connections. Finally, while many high performance databases rely on shared memory, cloud functions run in isolation so cannot share memory. While shared-nothing distributed databases do not require shared memory, they expect nodes to remain online and be directly addressable.

Lack of fine-grained coordination. Applications are left with no choice but to either (1) manage a VM-based system that provides notifications, as in ElastiCache and SAND, or (2) implement their own notification mechanism, such as in ExCamera, that enables cloud functions to communicate with each other via a long-running VM-based rendezvous server. This limitation also suggests that new variants of serverless computing may be worth exploring, for example naming function instances and allowing direct addressability for access to their internal state (e.g., Actors as a Service).

Networking challenges. There may be several ways to address this challenge:

  1. Provide cloud functions with a larger number of cores, similar to VM instances, so multiple tasks can combine and share data among them before sending over the network or after receiving it.
  2. Allow the developer to explicitly place the cloud functions on the same VM instance. Offer distributed communication primitives that applications can use out-of-the-box so that cloud providers can allocate cloud functions to the same VM instance.
  3. Let applications provide a computation graph, enabling the cloud provider to co-locate the cloud functions to minimize communication overhead. 

Summary and predictions

By providing a simplified programming environment, serverless computing makes the cloud much easier to use, thereby attracting more people who can and will use it. [This is] a maturation akin to the move from assembly language to high-level languages more than four decades ago.

We predict that serverless use will skyrocket.

The first step is Serverless Ephemeral Storage, which must provide low latency and high IOPS at reasonable cost, but need not provide economical long term storage. A second class of applications would benefit from Serverless Durable Storage, which does demand long term storage. New non-volatile memory technologies may help with such storage systems. Other applications would benefit from a low latency signaling service and support for popular communication primitives.

Two challenges for the future of serverless computing are improved security and accommodating cost-performance advances that are likely to come from special purpose processors.

The future of serverful computing will be to facilitate BaaS. Applications that prove to be difficult to write on top of serverless computing, such as OLTP databases or communication primitives such as queues, will likely be offered as part of a richer set of services from all cloud providers.

MAD questions

1.  Is a very strong endorsement for serverless warranted?
The paper gives very strong endorsements for serverless:
We predict that serverless use will skyrocket.
While we are unsure which solutions will win, we believe all issues will all be addressed eventually, thereby enabling serverless computing to become the face of cloud computing.
Remember, when we read papers, we should fight vigorously with the claims, and play the devil's advocate. So let's challenge this claim. What could be the reasons this claim may not hold?

First of all, we need to quantify and limit the claim. What does skyrocket mean? What does it mean for serverless to become the face of cloud computing? And finally what does serverless mean? Is this claim true of today's cloud functions? If we don't have a stable definition of serverless, this claim is prone to the No True Scotsman fallacy. If serverless use does not skyrocket, it will be because we don't have "true" serverless yet.

Ok, assuming that the claim is quantified, what may be some reasons it could fail?

Serverless improves greatly on ease of use, and that alone may warrant a lot of use for serverless. But ease-of-use is not necessarily exclusive to serverless. BaaS managed services, like distributed databases, can get even easier to use. And some even support stored procedures, which helps meet some of the serverless needs.

When comparing with PaaS, the paper said that serverless differentiates itself through its very quick autoscaling. But this may not be such a strong differentiator for customers. Most customers may not have very bursty workloads that require quick and extreme scaling.

Another contender for the serverless lunch may be software as a service (SaaS), like Instagram, iCloud, etc. SaaS can be even simpler to use than serverless, and may be programmed with visual workflows using mouse clicks. SaaS stealing users from serverless would work only if SaaS services play well with each other, so customers can pipe output from one as input to others.

2. Could serverless ever work for stateful services?
It is easy to make FaaS serverless because it is stateless. But FaaS scalability is limited by the BaaS scalability it depends on. It is easy to scale storage, because it is also stateless. But, the story becomes murkier when it comes to scalability of stateful services. At the limits, this is likely to be impossible: You can't have extreme scalability and extreme state (requiring incessant coordination). But outside the extremes, with good engineering we can get quick scalability for stateful services.

3. "Berkeley view" papers
If you are into this stuff, here are two other Berkeley view papers.

A Berkeley view of systems challenges for AI

Above the Clouds: A Berkeley View of Cloud Computing

Also there was a recent CIDR paper by another group of UC Berkeley researchers on serverless computing titled: "Serverless Computing: One Step Forward, Two Steps Back", which I had covered before. This paper is worth reading for another perspective on serverless.

Friday, May 24, 2019

Paper summary. Scalable Consistency in Scatter

Here is the pdf for the paper. It is by Lisa Glendenning, Ivan Beschastnikh, Arvind Krishnamurthy, and Thomas Anderson of the Department of Computer Science & Engineering, University of Washington.

This paper is about peer-to-peer (P2P) systems. But the paper is from 2011, way after the P2P hype had died. This makes the paper more interesting, because it had the opportunity to consider things in hindsight. The P2P corpse was cold, and Dynamo had looted the distributed hash tables (DHT) idea from P2P and applied it in the context of datacenter computing. In return, this work liberates the Paxos coordination idea from the datacenter world and employs it in the P2P world. It replaces each node (or virtual node) in a P2P overlay ring with a Paxos group that consists of a number of nodes.


Ok, what problem do Paxos groups solve in the P2P systems? In the presence of high churn, DHTs in P2P systems suffer from inconsistent routing state and inconsistent name space partitioning issues (see Figure 1). By leveraging the Paxos group abstraction as a stable base to build these coordination operations (split, merge, migrate, repartition), Scatter achieves linearizable consistency even under adverse circumstances.

Group coordination 

Scatter supports the following multi-group operations:

  • split: partition the state of an existing group into two groups
  • merge: create a new group from the union of the state of two neighboring groups
  • migrate: move members from one group to a different group
  • repartition: change the key-space partitioning between two adjacent groups

Each multi-group operation in Scatter is structured as a distributed transaction. The paper calls this design pattern nested consensus, and says: "We believe that this general idea of structuring protocols as communication between replicated participants, rather than between individual nodes, can be applied more generally to the construction of scalable, consistent distributed systems."


Nested consensus uses a two-tiered approach. At the top tier, groups execute a two-phase commit protocol (2PC), while within each group Paxos is used for agreeing on the actions that the group takes. Provided that a majority of nodes in each group remain alive and connected, the 2PC protocol will be non-blocking and terminate. (This is the same argument Spanner uses as it employs 2PC over Paxos groups.) For individual links in the overlay to remain highly available, Scatter maintains an additional invariant: a group can always reach its adjacent groups. To maintain this connectivity, Scatter enforces that every adjacent group of a group A has up-to-date knowledge of the membership of A.

Multi-group operations are coordinated by whichever group decides to initiate the transaction as a result of some local policy. The group initiating a transaction is called the coordinator group and the other groups involved are called the participant groups. This is the overall structure of nested consensus:

  1. The coordinator group replicates the decision to initiate the transaction.
  2. The coordinator group broadcasts a transaction prepare message to the nodes of the participant groups.
  3. Upon receiving the prepare message, a participant group decides whether or not to commit the proposed transaction and replicates its vote.
  4. A participant group broadcasts a commit or abort message to the nodes of the coordinator group.
  5. When the votes of all participant groups are known, the coordinator group replicates whether or not the transaction was committed.
  6. The coordinator group broadcasts the outcome of the transaction to all participant groups.
  7. Participant groups replicate the transaction outcome.
  8. When a group learns that a transaction has been committed then it executes the steps of the proposed transaction, the particulars of which depend on the multi-group operation.
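
The eight steps above can be sketched in a few lines. This is only an illustration of the control flow, not the paper's implementation: the `Group` class, the `will_commit` flag, and `run_transaction` are names I made up, and `replicate` stands in for a full Paxos round inside the group (here it just appends to a list).

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A Paxos group, modeled as a name plus a replicated log.

    Assumption: replicate() stands in for a Paxos round within the group.
    """
    name: str
    log: list = field(default_factory=list)
    will_commit: bool = True  # hypothetical local voting policy

    def replicate(self, entry):
        self.log.append(entry)


def run_transaction(coordinator, participants, txn):
    # Step 1: the coordinator group replicates the decision to initiate.
    coordinator.replicate(("init", txn))
    # Steps 2-4: each participant gets the prepare message, decides,
    # replicates its vote, and reports it back to the coordinator.
    votes = []
    for p in participants:
        vote = p.will_commit
        p.replicate(("vote", txn, vote))
        votes.append(vote)
    # Step 5: once all votes are known, the coordinator replicates the outcome.
    committed = all(votes)
    coordinator.replicate(("outcome", txn, committed))
    # Steps 6-7: the outcome is broadcast and participants replicate it.
    for p in participants:
        p.replicate(("outcome", txn, committed))
    # Step 8: on commit, every group executes the steps of the operation.
    if committed:
        for g in [coordinator, *participants]:
            g.replicate(("apply", txn))
    return committed


if __name__ == "__main__":
    g1, g2, g3 = Group("G1"), Group("G2"), Group("G3")
    print(run_transaction(g2, [g1, g3], "split G2"))
    print(g1.log)
```

The sketch ignores message loss, timeouts, and concurrent transactions; in the real protocol each `replicate` call survives node failures because a majority of the group holds the entry.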


Figure 5 shows an example of this template for group-split operation. After each group has learned and replicated the outcome (committed) of the split operation at time t3, the following updates are executed by the respective group: (1) G1 updates its successor pointer to G2a, (2) G3 updates its predecessor pointer to G2b, and (3) G2 executes a replicated state machine reconfiguration to instantiate the two new groups which partition between them G2's original key-range and set of member nodes.


The storage service (discussed next) continues to process client requests during the execution of group transactions except for a brief period of unavailability for any reconfiguration required by a committed transaction. Also, groups continue to serve lookup requests during transactions provided that the lookups are serialized with respect to the transaction commit.

Storage service

To improve throughput for put and get operations on keys, Scatter divides the key range assigned to the Paxos group into sub-ranges and assigns these sub-ranges to nodes within the Paxos group. Each key is only assigned to one primary and is serialized by that primary. The group leader replicates information regarding the assignment of keys to primaries using Paxos, as it does with the state for multi-group operations. Once an operation is routed to the correct group for a given key, then any node in the group will forward the operation to the appropriate primary. The primaries can run Paxos on the keys assigned to themselves concurrently with each other because this does not result in a conflict: it is OK to have different keys updated at the same time, since linearizability is a per key property.
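
Here is a minimal sketch of the routing step inside one group. The `PaxosGroup` class and the equal-width split of the key range are my assumptions for illustration; the paper only says that sub-ranges are assigned to nodes and that the assignment itself is replicated through Paxos.

```python
import bisect


class PaxosGroup:
    """Hypothetical model of one group's key range split among primaries."""

    def __init__(self, lo, hi, nodes):
        # The group owns keys in [lo, hi). Assumption: equal-width sub-ranges.
        self.lo, self.hi, self.nodes = lo, hi, list(nodes)
        width = (hi - lo) / len(self.nodes)
        # starts[i] is the first key of the sub-range owned by nodes[i].
        self.starts = [lo + i * width for i in range(len(self.nodes))]

    def primary_for(self, key):
        """Return the node that serializes operations on this key."""
        if not (self.lo <= key < self.hi):
            raise KeyError(f"key {key} is not in this group's range")
        return self.nodes[bisect.bisect_right(self.starts, key) - 1]


if __name__ == "__main__":
    group = PaxosGroup(0, 300, ["a", "b", "c"])
    for key in (0, 99, 100, 250):
        print(key, "->", group.primary_for(key))
```

Since every key maps to exactly one primary, two primaries never order operations on the same key, which is why they can run concurrently.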

Scatter provides linearizable storage within a given key and does not attempt to linearize multi-key application transactions. A read is served by the primary within the Paxos group that is responsible for that key. The primary holds a leader lease with the rest of the nodes. It is possible to provide weaker consistency reads, as is the default in ZooKeeper, by reading from any one node in the group.


Figure 7 plots the probability of group failure for different group sizes for two node churn rates with node lifetimes drawn from heavy-tailed Pareto distributions observed in typical peer-to-peer systems. The plot indicates that a modest group size of 8-12 prevents group failure with high probability. The prototype implementation in the paper demonstrates that even with these very short node lifetimes, it is possible to build a scalable and consistent system with practical performance. This was surprising to me.

Evaluation

They evaluate Scatter in a variety of configurations, for both micro-benchmarks and for a Twitter-style application. Compared to OpenDHT, Scatter provides equivalent performance with much better availability, consistency (i.e. linearizability), and adaptability even in very challenging environments. For example, if average node lifetimes are as short as 180 seconds, therefore triggering very frequent reconfigurations to maintain data durability, Scatter is able to maintain overall consistency and data availability, serving its reads in an average of 1.3 seconds in a typical wide area setting.






This is good performance, but to put things in the context of datacenter computing, the evaluation is done with "small data". When you have many gigabytes (if not terabytes) of data assigned to each node, just copying that data at line speed may take longer than the lifetime of the nodes under the churn rates of a P2P environment.

The paper also compares Scatter against statically partitioned ZooKeeper groups. Here, the key-space partitioning was derived based on historical workload characteristics, but the inability to adapt to dynamic hotspots in the access pattern limits the scalability of the ZooKeeper-based groups deployment. Further, the variability in the throughput also increases with the number of ZooKeeper instances used in the experiment.


In contrast, Scatter's throughput scales linearly with the number of nodes, with only a small amount of variability due to uneven group sizes and temporary load skews. This is because Scatter uses ring and group operations to adapt to change in access patterns. Based on the load balancing policy in Scatter, the groups repartition their keyspaces proportionally to their respective loads whenever a group's load is a factor of 1.6 or above that of its neighboring group. As this check is performed locally between adjacent groups, it does not require global load monitoring, but it might require multiple iterations of the load-balancing operation to disperse hotspots.
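
The local load-balancing check can be sketched as follows. The 1.6 threshold is from the paper; everything else is my simplification: groups are reduced to a single load number, a repartition is modeled as the two neighbors ending up with equal load, and the function names are made up.

```python
REPARTITION_FACTOR = 1.6  # threshold quoted above


def needs_repartition(load_a, load_b, factor=REPARTITION_FACTOR):
    """True if one neighbor's load is at least `factor` times the other's."""
    lo, hi = sorted((load_a, load_b))
    return hi > 0 and hi >= factor * lo


def balance_ring(loads, factor=REPARTITION_FACTOR, max_rounds=1000):
    """Run local checks around the ring until no adjacent pair triggers.

    Returns the final loads and the number of passes that changed something.
    """
    loads = list(loads)
    n = len(loads)
    for rounds in range(max_rounds):
        changed = False
        for i in range(n):
            j = (i + 1) % n  # successor group on the ring
            if needs_repartition(loads[i], loads[j], factor):
                # Assumption: repartitioning evens out the two neighbors.
                loads[i] = loads[j] = (loads[i] + loads[j]) / 2
                changed = True
        if not changed:
            return loads, rounds
    return loads, max_rounds


if __name__ == "__main__":
    # One hot group next to three idle ones.
    print(balance_ring([100, 0, 0, 0]))
```

Even in this toy version, a single hotspot needs more than one pass around the ring before every adjacent pair is within the threshold, which matches the "multiple iterations" caveat above.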

Hat tip to @DharmaShukla for recommending the paper to me. The paper has inspired some design decisions in Cosmos DB.

MAD questions

1. What could be some alternative designs to solve this problem?
Instead of arranging the Paxos groups in a ring, why not have a vertical-Paxos group overseeing the Paxos groups? The vPaxos box would be assigning key ranges to Paxos groups, coordinating the group operations (split, merge, load-balance) and maintaining the configuration information of the Paxos groups. This would allow adapting to changes in workload and reconfiguring in reaction to node availability in a much faster manner than that of the P2P ring, where load-balancing is done by adjacent groups dispersing load to each other in multiple iterations.

Another problem with Scatter is that it lacks WAN locality optimization. A client may need to go across the globe to contact a Paxos group responsible for keys that it interacts with the most. WPaxos can learn and adapt to these patterns. So, while we are at it, why not replace the vanilla Paxos in the Paxos group with WPaxos to achieve client access locality adaptation in an orthogonal way? Then the final setup becomes vPaxos overseeing groups of WPaxos deployments.

2. Would it ever be possible to replace datacenters with P2P technologies?
The paper in the introduction seems fairly optimistic: "Our interest is in building a storage layer for a very large scale P2P system we are designing for hosting planetary scale social networking applications. Purchasing, installing, powering up, and maintaining a very large scale set of nodes across many geographically distributed data centers is an expensive proposition; it is only feasible on an ongoing basis for those applications that can generate revenue. In much the same way that Linux offers a free alternative to commercial operating systems for researchers and developers interested in tinkering, we ask: what is the Linux analogue with respect to cloud computing?"

I am not very optimistic...

3. Why don't we invest in better visualizations/figures for writing papers?
This paper had beautiful figures for explaining concepts. Check Figure 4 below: it shows two groups considering different operations concurrently, visualized with thought bubbles. These figures go a long way. It is a shame we don't invest any effort in standardizing and teaching good illustration techniques to support exposition. Using colors is even discouraged, because they look faded/blended when printed in black and white. For God's sake, it is 2019, and we should level up our illustration game.

What are some other examples of papers with beautiful figures illustrating concepts? Please let me know. They are a treat to read.

Tuesday, May 14, 2019

Book Notes. Steal Like an Artist: 10 Things Nobody Told You About Being Creative

This book is by Austin Kleon, 2012. I had also written about his other book "Show Your Work! 10 Ways to Share Your Creativity and Get Discovered."

Here are the 10 things nobody told you about being creative:
  1. Steal like an artist.
  2. Don’t wait until you know who you are to get started.
  3. Write the book you want to read.
  4. Use your hands.
  5. Side projects and hobbies are important.
  6. The secret: do good work and share it with people.
  7. Geography is no longer our master.
  8. Be nice. (The world is a small town.)
  9. Be boring. (It’s the only way to get work done.)
  10. Creativity is subtraction.
Kleon gave a short TEDx talk about the idea behind this book.

The title is an homage to a quote attributed to Picasso: “Good artists borrow, great artists steal.” Picasso also said: "Art is theft." It’s not just where you take things from, it's where you take them to. Here are some parts I highlighted under Section 1: "steal like an artist."

Every artist gets asked the question, "Where do you get your ideas?" The honest artist answers, "I steal them."
             
Every new idea is just a mashup or a remix of one or more previous ideas.

You have a mother and you have a father. You possess features from both of them, but the sum of you is bigger than their parts.
             
You are, in fact, a mashup of what you choose to let into your life. You are the sum of your influences. The German writer Goethe said, "We are shaped and fashioned by what we love."

Your job is to collect good ideas. The more good ideas you collect, the more you can choose from to be influenced by.
             
Carry a notebook and a pen with you wherever you go. Get used to pulling it out and jotting down your thoughts and observations. Copy your favorite passages out of books. Record overheard conversations. Doodle when you're on the phone.

You might be scared to start. That's natural. There's this very real thing that runs rampant in educated people. It’s called "impostor syndrome."
             
Ask anybody doing truly creative work, and they'll tell you the truth: They don't know where the good stuff comes from. They just show up to do their thing. Every day.
             
Don't just steal the style, steal the thinking behind the style. You don't want to look like your heroes, you want to see like your heroes.

As with Kleon's other books, the book has beautiful artwork.


Wednesday, May 8, 2019

Book Notes. Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration

This book is by Ed Catmull, cofounder of Pixar, with Amy Wallace, 2014. The book is about the cultivation and management of creativity:
If Pixar is ever successful, will we do something stupid, too? Can paying careful attention to the missteps of others help us be more alert to our own? Or is there something about becoming a leader that makes you blind to the things that threaten the well-being of your enterprise? 
I would devote myself to learning how to build not just a successful company but a sustainable creative culture. As I turned my attention from solving technical problems to engaging with the philosophy of sound management, I was excited once again.
While reading the book, I was impressed by how many questions Ed kept asking. I thought I was asking a lot of questions, but Ed is really really into asking questions and using them to achieve focus.

Here are some parts I highlighted from the book.

From childhood to PhD

Growing up in the 1950s, I had yearned to be a Disney animator but had no idea how to go about it.

In graduate school, I’d quietly set a goal of making the first computer-animated feature film.
             
Walt Disney was one of my two boyhood idols. The other was Albert Einstein.

Disney’s animators were at the forefront of applied technology; instead of merely using existing methods, they were inventing ones of their own.

Every time some technological breakthrough occurred, Walt Disney incorporated it and then talked about it on his show in a way that highlighted the relationship between technology and art.

That night’s episode was called “Where Do the Stories Come From?” and Disney kicked it off by praising his animators’ knack for turning everyday occurrences into cartoons.
             
An artist was drawing Donald Duck, giving him a jaunty costume and a bouquet of flowers and a box of candy with which to woo Daisy. Then, as the artist’s pencil moved around the page, Donald came to life, putting up his dukes to square off with the pencil lead, then raising his chin to allow the artist to give him a bow tie.

Whether it’s a T-Rex or a slinky dog or a desk lamp, if viewers sense not just movement but intention--or, put another way, emotion--then the animator has done his or her job.

I remember the optimistic energy--an eagerness to move forward that was enabled and supported by a wealth of emerging technologies. It was boom time in America, with manufacturing and home construction at an all-time high.

The first organ transplants were performed in 1954; the first polio vaccine came a year later; in 1956, the term artificial intelligence entered the lexicon.

Then, when I was twelve, the Soviets launched the first artificial satellite--Sputnik 1--into earth’s orbit.

The United States government’s response to being bested was to create something called ARPA,

Looking back, I still admire that enlightened reaction to a serious threat: We’ll just have to get smarter.
             
ARPA would have a profound effect on America, leading directly to the computer revolution and the Internet, among countless other innovations.
             
I was a quiet, focused student in high school. An art teacher once told my parents I would often become so lost in my work that I wouldn’t hear the bell ring at the end of class;
             
Throughout my life, people have always smiled when I told them I switched from art to physics because it seems, to them, like such an incongruous leap. But my decision to pursue physics, and not art, would lead me, indirectly, to my true calling.
             
Four years later, in 1969, I graduated from the University of Utah with two degrees, one in physics and the other in the emerging field of computer science.
             
But soon after I matriculated, also at the U of U, I met a man who would encourage me to change course: one of the pioneers of interactive computer graphics, Ivan Sutherland.
             
Sutherland and Dave Evans, who was chair of the university’s computer science department, were magnets for bright students with diverse interests, and they led us with a light touch.
             
The result was a collaborative, supportive community so inspiring that I would later seek to replicate it at Pixar.

One of my classmates, Jim Clark, would go on to found Silicon Graphics and Netscape. Another, John Warnock, would co-found Adobe, known for Photoshop and the PDF file format, among other things. Still another, Alan Kay, would lead on a number of fronts, from object-oriented programming to “windowing” graphical user interfaces.
             
Not only did I often sleep on the floor of the computer rooms to maximize time on the computer, but so did many of my fellow graduate students.

Making pictures with a computer spoke to both sides of my brain.

In the spring of 1972, I spent ten weeks making my first short animated film—a digitized model of my left hand.

Professor Sutherland used to say that he loved his graduate students at Utah because we didn’t know what was impossible.

My dissertation, “A Subdivision Algorithm for Computer Display of Curved Surfaces,” offered a solution to that problem.

“Texture mapping,” as I called it, was like having stretchable wrapping paper that you could apply to a curved surface so that it fit snugly.

At the U of U, we were inventing a new language. One of us would contribute a verb, another a noun, then a third person would figure out ways to string the elements together to actually say something.
             
Today, there is a Z-buffer in every game and PC chip manufactured on earth.
       

After college      

In the next decade, I would learn much about what managers should and shouldn’t do, about vision and delusion, about confidence and arrogance, about what encourages creativity and what snuffs it out.

I’ve made a policy of trying to hire people who are smarter than I am.

Alvy and I decided to do the opposite--to share our work with the outside world.

It’s hard to imagine now, but in 1976, the idea of incorporating high technology into Hollywood filmmaking wasn’t just a low priority; it wasn’t even on the radar. But one man was about to change that, with a movie called Star Wars.

In the intervening years, George has said that he hired me because of my honesty, my “clarity of vision,” and my steadfast belief in what computers could do.

A research lab is not a university, and the structure didn’t scale well. At Lucasfilm, then, I decided to hire managers to run the graphics, video, and audio groups; they would then report to me.

For all the care you put into artistry, visual polish frequently doesn’t matter if you are getting the story right.

To this day, I am thankful that the deal went south. Because it paved the way for Steve Jobs.

Alan [Kay] had been at the U of U with me and at Xerox PARC with Alvy, and he told Steve that he should visit us if he wanted to see the cutting edge in computer graphics.

I remember his assertiveness. There was no small talk. Instead, there were questions. Lots of questions. What do you want? Steve asked. Where are you heading? What are your long-term goals? He used the phrase “insanely great products” to explain what he believed in. Clearly, he was the sort of person who didn’t let presentations happen to him, and it wasn’t long before he was talking about making a deal.

As he spoke, it became clear to us that his goal was not to build an animation studio; his goal was to build the next generation of home computers to compete with Apple. This wasn’t merely a deviation from our vision, it was the total abandonment of it, so we politely declined. We returned to the task of trying to find a buyer.

At one point in this period, I met with Steve and gently asked him how things got resolved when people disagree with him. He seemed unaware that what I was really asking him was how things would get resolved if we worked together and I disagreed with him, for he gave a more general answer. He said, “When I don’t see eye to eye with somebody, I just take the time to explain it better, so they understand the way it should be.”

In the end, Steve paid \$5 million to spin Pixar off of Lucasfilm—and then, after the sale, he agreed to pay another \$5 million to fund the company, with 70 percent of the stock going to Steve and 30 percent to the employees.
                             
His method for taking the measure of a room was saying something definitive and outrageous—“These charts are bullshit!” or “This deal is crap!”—and watching people react. If you were brave enough to come back at him, he often respected it--poking at you, then registering your response, was his way of deducing what you thought and whether you had the guts to champion it.

Every few weeks, I’d head down to Steve’s office in Redwood City to brief him on our progress. I didn’t relish the meetings, to be honest, because they were often frustrating.

At Pixar’s lowest point, as we floundered and failed to make a profit, Steve had sunk \$54 million of his own money into the company—a significant chunk of his net worth, and more money than any venture capital firm would have considered investing, given the sorry state of our balance sheet.
             
After trying everything we could to sell our Pixar Image Computer, we were finally facing the fact that hardware could not keep us going.

There is nothing quite like ignorance combined with a driving need to succeed to force rapid learning.


We began to focus our energies on the creative side. We started making animated commercials for Trident gum and Tropicana orange juice and almost immediately won awards for the creative content while continuing to hone our technical and storytelling skills.

In 1991, we laid off more than a third of our employees.

Three times between 1987 and 1991, a fed-up Steve Jobs tried to sell Pixar. And yet, despite his frustrations, he could never quite bring himself to part with us. When Microsoft offered \$90 million for us, he walked away. Steve wanted \$120 million, and felt their offer was not just insulting but proof that they weren’t worthy of us.
             
How would we resolve conflicts? And his answer, which I found comically egotistical at the time, was that he simply would continue to explain why he was right until I understood. The irony was that this soon became the technique I used with Steve. When we disagreed, I would state my case, but since Steve could think much faster than I could, he would often shoot down my arguments. So I’d wait a week, marshal my thoughts, and then come back and explain it again. He might dismiss my points again, but I would keep coming back until one of three things happened: (1) He would say “Oh, okay, I get it” and give me what I needed; (2) I’d see that he was right and stop lobbying; or (3) our debate would be inconclusive, in which case I’d just go ahead and do what I had proposed in the first place. Each outcome was equally likely, but when this third option occurred, Steve never questioned me. For all his insistence, he respected passion. If I believed in something that strongly, he seemed to feel, it couldn’t be all wrong.
             
Katzenberg wanted Pixar to make a feature film, and he wanted Disney to own and distribute it.
             
Steve took the reins, rejecting Jeffrey’s logic that since Disney was investing in Pixar’s first movie, it deserved to own our technology as well. “You’re giving us money to make the film,” Steve said, “not to buy our trade secrets.” What Disney brought to the table was its marketing and distribution muscle; what we brought were our technical innovations, and they were not for sale. Steve made this a deal breaker and stuck to his guns until, ultimately, Jeffrey agreed.

Given the millions of dollars at stake and the realization that we’d never get another chance if we blew it, we had to figure it out fast. Luckily, John already had an idea. Toy Story would be about a group of toys and a boy—Andy—who loves them. The twist was that it would be told from the toys’ point of view.

On November 19, 1993, we went to Disney to unveil the new, edgier Woody in a series of story reels—a mock-up of the film, like a comic book version with temporary voices, music, and drawings of the story. That day will forever be known at Pixar as “Black Friday” because Disney’s completely reasonable reaction was to shut down the production until an acceptable script was written.

With our first feature film suddenly on life support, John quickly summoned Andrew, Pete, and Joe. For the next several months, they spent almost every waking minute together, working to rediscover the heart of the movie, the thing that John had first envisioned: a toy cowboy who wanted to be loved. They also learned an important lesson--to trust their own storytelling instincts.

In 1991, two of the year’s biggest blockbusters—Beauty and the Beast and Terminator 2—had relied heavily on technology that had been developed at Pixar, and people in Hollywood were starting to pay attention. By 1993, when Jurassic Park was released, computer-generated special effects would no longer be considered some nerdy sideline experiment;

And a few months later, as if on cue, Eisner called, saying that he wanted to renegotiate the deal and keep us as a partner. He accepted Steve’s offer of a 50/50 split. I was amazed; Steve had called this exactly right. His clarity and execution were stunning.
             
For the first time since our founding, our jobs were safe.

Pixar as a company

The point is, we value self-expression.
             
What makes Pixar special is that we acknowledge we will always have problems, many of them hidden from our view; that we work hard to uncover these problems, even if doing so means making ourselves uncomfortable; and that, when we come across a problem, we marshal all of our energies to solve it.

In the coming pages, I will discuss many of the steps we follow at Pixar, but the most compelling mechanisms to me are those that deal with uncertainty, instability, lack of candor, and the things we cannot see. I believe the best managers acknowledge and make room for what they do not know—not just because humility is a virtue but because until one adopts that mindset, the most striking breakthroughs cannot occur. I believe that managers must loosen the controls, not tighten them. They must accept risk; they must trust the people they work with and strive to clear the path for them; and always, they must pay attention to and engage with anything that creates fear.
             
Only when we admit what we don’t know can we ever hope to learn it.
             
When it comes to creative inspiration, job titles and hierarchy are meaningless.

Every person there, no matter their job title, felt free to speak up. This was not only what we wanted, it was a fundamental Pixar belief: Unhindered communication was key, no matter what your position. At our long, skinny table, comfortable in our middle seats, we had utterly failed to recognize that we were behaving contrary to that basic tenet.
             
I discovered we’d completely missed a serious, ongoing rift between our creative and production departments. In short, production managers told me that working on Toy Story had been a nightmare. They felt disrespected and marginalized—like second-class citizens. And while they were gratified by Toy Story’s success, they were very reluctant to sign on to work on another film at Pixar. I was floored. How had we missed this?
             
For me, this discovery was bracing. Being on the lookout for problems, I realized, was not the same as seeing problems. This would be the idea—the challenge—around which I would build my new sense of purpose.
             
Because making a movie involves hundreds of people, a chain of command is essential. But in this case, we had made the mistake of confusing the communication structure with the organizational structure.

Going forward, anyone should be able to talk to anyone else, at any level, at any time, without fear of reprimand. Communication would no longer have to go through hierarchical channels.
             
The first principle was “Story Is King,” by which we meant that we would let nothing--not the technology, not the merchandising possibilities--get in the way of our story.

The other principle we depended on was “Trust the Process.”
             
While Woody would choose Andy in the end, he would make that choice with the awareness that doing so guaranteed future sadness.
             
For the next six months, our employees rarely saw their families. We worked deep into the night, seven days a week. Despite two hit movies, we were conscious of the need to prove ourselves, and everyone gave everything they had. With several months still to go, the staff was exhausted and starting to fray.

I had expected the road to be rough, but I had to admit that we were coming apart. By the time the film was complete, a full third of the staff would have some kind of repetitive stress injury.
             
Critics raved that Toy Story 2 was one of the only sequels ever to outshine the original.

Though I was immensely proud of what we had accomplished, I vowed that we would never make a film that way again. It was management’s job to take the long view, to intervene and protect our people from their willingness to pursue excellence at all costs. Not to do so would be irresponsible.

Good idea or Good team?

If you give a good idea to a mediocre team, they will screw it up. If you give a mediocre idea to a brilliant team, they will either fix it or throw it away and come up with something better.

Getting the team right is the necessary precursor to getting the ideas right.
             
Getting the right people and the right chemistry is more important than getting the right idea.
             
Ideas come from people. Therefore, people are more important than ideas.
             
Why are we confused about this? Because too many of us think of ideas as being singular, as if they float in the ether, fully formed and independent of the people who wrestle with them.
             
Find, develop, and support good people, and they in turn will find, develop, and own good ideas.
                             
We should trust in people, I told them, not processes. The error we’d made was forgetting that “the process” has no agenda and doesn’t have taste.

Once you’re aware of the suitcase/handle problem, you’ll see it everywhere. People glom onto words and stories that are often just stand-ins for real action and meaning.

Around this time, John coined a new phrase: “Quality is the best business plan.”
             
That didn’t mean that we wouldn’t make mistakes. Mistakes are part of creativity. But when we did, we would strive to face them without defensiveness and with a willingness to change.

Braintrust                

What is the nature of honesty? If everyone agrees about its importance, why do we find it hard to be frank? How do we think about our own failures and fears? Is there a way to make our managers more comfortable with unexpected results—the inevitable surprises that arise, no matter how well you’ve planned? How can we address the imperative many managers feel to overcontrol the process? With what we have learned so far, can we finally get the process right? Where are we still deluded?
             
Candor is forthrightness or frankness--not so different from honesty, really. And yet, in common usage, the word communicates not just truth-telling but a lack of reserve.

A hallmark of a healthy creative culture is that its people feel free to share ideas, opinions, and criticisms. Lack of candor, if unchecked, ultimately leads to dysfunctional environments.
             
The Braintrust, which meets every few months or so to assess each movie we’re making, is our primary delivery system for straight talk.
             
Its premise is simple: Put smart, passionate people in a room together, charge them with identifying and solving problems, and encourage them to be candid with one another.
             
The Braintrust is one of the most important traditions at Pixar.
             
The passion expressed in a Braintrust meeting was never taken personally because everyone knew it was directed at solving problems.
             
And largely because of that trust and mutual respect, its problem-solving powers were immense.
             
Candor could not be more crucial to our creative process. Why? Because early on, all of our movies suck. That’s a blunt assessment, I know, but I make a point of repeating it often, and I choose that phrasing because saying it in a softer way fails to convey how bad the first versions of our films really are. I’m not trying to be modest or self-effacing by saying this. Pixar films are not good at first, and our job is to make them so—to go, as I say, “from suck to not-suck.” This idea—that all the movies we now think of as brilliant were, at one time, terrible—is a hard concept for many to grasp. But think about how easy it would be for a movie about talking toys to feel derivative, sappy, or overtly merchandise-driven. Think about how off-putting a movie about rats preparing food could be, or how risky it must’ve seemed to start WALL-E with 39 dialogue-free minutes. We dare to attempt these stories, but we don’t get them right on the first pass. And this is as it should be. Creativity has to start somewhere, and we are true believers in the power of bracing, candid feedback and the iterative process—reworking, reworking, and reworking again, until a flawed story finds its throughline or a hollow character finds its soul.
             
(It takes about twelve thousand storyboard drawings to make one 90-minute reel, and because of the iterative nature of the process I’m describing, story teams commonly create ten times that number by the time their work is done.)

People who take on complicated creative projects become lost at some point in the process. It is the nature of things—in order to create, you must internalize and almost become the project for a while, and that near-fusing with the project is an essential part of its emergence. But it is also confusing. Where once a movie’s writer/director had perspective, he or she loses it. Where once he or she could see a forest, now there are only trees.
             
You may be thinking, How is the Braintrust different from any other feedback mechanism?
             
The first is that the Braintrust is made up of people with a deep understanding of storytelling and, usually, people who have been through the process themselves.

The second difference is that the Braintrust has no authority. This is crucial: The director does not have to follow any of the specific suggestions given. After a Braintrust meeting, it is up to him or her to figure out how to address the feedback.
             
By removing from the Braintrust the power to mandate solutions, we affect the dynamics of the group in ways I believe are essential.
             
While problems in a film are fairly easy to identify, the sources of those problems are often extraordinarily difficult to assess.
             
The Braintrust’s notes, then, are intended to bring the true causes of problems to the surface—not to demand a specific remedy.
             
I like to think of the Braintrust as Pixar’s version of peer review, a forum that ensures we raise our game—not by being prescriptive but by offering candor and deep analysis.

The film itself—not the filmmaker—is under the microscope.
             
The feedback usually begins with John. While everyone has an equal voice in a Braintrust meeting, John sets the tone, calling out the sequences he liked best, identifying some themes and ideas he thinks need to be improved. That’s all it takes to launch the back-and-forth. Everybody jumps in with observations about the film’s strengths and weaknesses.
             
Andrew felt there was a similarly impactful opportunity here that was being missed--and, thus, was keeping the film from working--and he said so candidly. “Pete, this movie is about the inevitability of change,” he said. “And of growing up.” [Inside Out]

And it was Brad Bird who pointed that out to Andrew in a Braintrust meeting. “You’ve denied your audience the moment they’ve been waiting for,” he said, “the moment where EVE throws away all her programming and goes all out to save WALL-E. Give it to them. The audience wants it.” As soon as Brad said that, it was like: Bing! After the meeting, Andrew went off and wrote an entirely new ending in which EVE saves WALL-E, and at the next screening, there wasn’t a dry eye in the house.

“Sometimes the Braintrust will know something’s wrong, but they will identify the wrong symptom,” he told me.

Instead of saying, ‘The writing in this scene isn’t good enough,’ you say, ‘Don’t you want people to walk out of the theater and be quoting those lines?’ It’s more of a challenge. ‘Isn’t this what you want? I want that too!’

Fail early, Fail fast

Left to their own devices, most people don’t want to fail. But Andrew Stanton isn’t most people. As I’ve mentioned, he’s known around Pixar for repeating the phrases “fail early and fail fast” and “be wrong as fast as you can.” He thinks of failure like learning to ride a bike; it isn’t conceivable that you would learn to do this without making mistakes—without toppling over a few times. “Get a bike that’s as low to the ground as you can find, put on elbow and knee pads so you’re not afraid of falling, and go,” he says.

In a fear-based, failure-averse culture, people will consciously or unconsciously avoid risk.
             
Their work will be derivative, not innovative. But if you can foster a positive understanding of failure, the opposite will happen.
             
I have found that people who pour their energy into thinking about an approach and insisting that it is too early to act are wrong just as often as people who dive in and work quickly.

The overplanners just take longer to be wrong (and, when things inevitably go awry, are more crushed by the feeling that they have failed). There’s a corollary to this, as well: The more time you spend mapping out an approach, the more likely you are to get attached to it. The nonworking idea gets worn into your brain, like a rut in the mud. It can be difficult to get free of it and head in a different direction. Which, more often than not, is exactly what you must do.
             
To be a truly creative company, you must start things that might fail.

Fear can be created quickly; trust can’t. Leaders must demonstrate their trustworthiness, over time, through their actions—and the best way to do that is by responding well to failure. The Braintrust and various groups within Pixar have gone through difficult times together, solved problems together, and that is how they’ve built up trust in each other. Be patient. Be authentic. And be consistent. The trust will come.

Your employees are smart; that’s why you hired them. So treat them that way. They know when you deliver a message that has been heavily massaged. When managers explain what their plan is without giving the reasons for it, people wonder what the “real” agenda is. There may be no hidden agenda, but you’ve succeeded in implying that there is one. Discussing the thought processes behind solutions aims the focus on the solutions, not on second-guessing. When we are honest, people know it.
             
Management’s job is not to prevent risk but to build the ability to recover.

Protecting the new, the original

Originality is fragile. And, in its first moments, it’s often far from pretty. This is why I call early mock-ups of our films “ugly babies.” They are not beautiful, miniature versions of the adults they will grow up to be. They are truly ugly: awkward and unformed, vulnerable and incomplete. They need nurturing—in the form of time and patience—in order to grow.

(This reminds me of what I wrote here.)

The Ugly Baby idea is not easy to accept. Having seen and enjoyed Pixar movies, many people assume that they popped into the world already striking, resonant, and meaningful—fully grown, if you will. In fact, getting them to that point involved months, if not years, of work.
             
When Andrew finished his pitch, those of us in attendance were silent for a moment. Then, John Lasseter spoke for all of us when he said, “You had me at the word fish.”

“To view lack of conflict as optimum is like saying a sunny day is optimum. A sunny day is when the sun wins out over the rain. There’s no conflict. You have a clear winner. But if every day is sunny and it doesn’t rain, things don’t grow. And if it’s sunny all the time—if, in fact, we don’t ever even have night—all kinds of things don’t happen and the planet dries up. The key is to view conflict as essential, because that’s how we know the best ideas will be tested and survive. You know, it can’t only be sunlight.”
             
For many years, I was on a committee that read and selected papers to be published at SIGGRAPH, the annual computer graphics conference I mentioned in chapter 2. These papers were supposed to present ideas that advanced the field. The committee was composed of many of the field’s most prominent players, all of whom I knew; it was a group that took the task of selecting papers very seriously. At each of the meetings, I was struck that there seemed to be two kinds of reviewers: some who would look for flaws in the papers, and then pounce to kill them; and others who started from a place of seeking and promoting good ideas. When the “idea protectors” saw flaws, they pointed them out gently, in the spirit of improving the paper—not eviscerating it. Interestingly, the “paper killers” were not aware that they were serving some other agenda (which was often, in my estimation, to show their colleagues how high their standards were). Both groups thought they were protecting the proceedings, but only one group understood that by looking for something new and surprising, they were offering the most valuable kind of protection. Negative feedback may be fun, but it is far less brave than endorsing something unproven and providing room for it to grow.
             
I suppose I could simply have mandated that our production managers add the cost of adding interns to their budgets. But that would have made this new idea the enemy—something to resent. Instead, I decided to make the interns a corporate expense—they would essentially be available, at no extra cost, to any department who wanted to take them on. The first year, Pixar hired eight interns who were placed in the animation and technical departments. They were so eager and hard-working and they learned so fast that every one of them, by the end, was doing real production work. Seven of them ultimately returned, after graduation, to work for us in a full-time capacity. Every year since then, the program has grown a little more, and every year more managers have found themselves won over by their young charges. It wasn’t just that the interns lightened the workload by taking on projects. Teaching them Pixar’s ways made our people examine how they did things, which led to improvements for all. A few years in, it became clear that we didn’t need to fund interns out of the corporate coffers anymore; as the program proved its worth, people became willing to absorb the costs into their budgets. In other words, the intern program needed protection to establish itself at first, but then grew out of that need. Last year, we had ten thousand applications for a hundred spots.

Whether it’s the kernel of a movie idea or a fledgling internship program, the new needs protection. Business-as-usual does not. Managers do not need to work hard to protect established ideas or ways of doing business. The system is tilted to favor the incumbent. The challenger needs support to find its footing. And protection of the new—of the future, not the past—must be a conscious effort.

“In many ways, the work of a critic is easy,” Ego [from Ratatouille] says. “We risk very little yet enjoy a position over those who offer up their work and their selves to our judgment. We thrive on negative criticism, which is fun to write and to read. But the bitter truth we critics must face is that in the grand scheme of things, the average piece of junk is probably more meaningful than our criticism designating it so. But there are times when a critic truly risks something, and that is in the discovery and defense of the new. The world is often unkind to new talent, new creations. The new needs friends.”

People want to hang on to things that work--stories that work, methods that work, strategies that work. You figure something out, it works, so you keep doing it—this is what an organization that is committed to learning does. And as we become successful, our approaches are reinforced, and we become even more resistant to change.

Up had to go through these changes--changes that unfolded over not months but years--to find its heart. Which meant that the people working on Up had to be able to roll with that evolution without panicking, shutting down, or growing discouraged. It helped that Pete understood what they were feeling.
             
“It wasn’t until I finished directing Monsters, Inc. that I realized failure is a healthy part of the process,” he told me. “Throughout the making of that film, I took it personally—I believed my mistakes were personal shortcomings, and if I were only a better director I wouldn’t make them.” To this day, he says, “I tend to flood and freeze up if I’m feeling overwhelmed. When this happens, it’s usually because I feel like the world is crashing down and all is lost. One trick I’ve learned is to force myself to make a list of what’s actually wrong. Usually, soon into making the list, I find I can group most of the issues into two or three larger all-encompassing problems. So it’s really not all that bad. Having a finite list of problems is much better than having an illogical feeling that everything is wrong.”
             
This could just be my Lutheran, Scandinavian upbringing, but I believe life should not be easy. We’re meant to push ourselves and try new things—which will definitely make us feel uncomfortable.

Status Quo

“Better the devil you know than the devil you don’t.” For many, these are words to live by. Politicians master whatever system it took to get elected, and afterward there is little incentive to change it.

Which brings us to one of my core management beliefs: If you don’t try to uncover what is unseen and understand its nature, you will be ill prepared to lead.

That couldn’t have happened if the producer of the movie--and the company’s leadership in general--hadn’t been open to a new viewpoint that challenged the status quo. That kind of openness is only possible in a culture that acknowledges its own blind spots. It’s only possible when managers understand that others see problems they don’t—and that they also see solutions.
             
You might say I’m an advocate for humility in leaders. But to be truly humble, those leaders must first understand how many of the factors that shape their lives and businesses are—and will always be—out of sight.

I think we’re out of the woods now, but it took a while. And all because a flawed mental model, constructed in response to a single event, had taken hold. Once a model of how we should work gets in our head, it is difficult to change.
