Posts

Showing posts with the label formal methods

Cedar: A New Language for Expressive, Fast, Safe, and Analyzable Authorization

Image
This paper (2024) presents Cedar, AWS's new authorization policy language. By providing a clear declarative way to manage access control policies, Cedar addresses the common limitations of embedding authorization logic directly into application code: problems with correctness, auditing, and maintainence. Cedar introduces a domain-specific language (DSL) to express policies that are separate from application code, and improves readability and manageability. In that sense, this is like aspect-oriented programming but for authorization policy. The language is designed with four main objectives: expressiveness, performance, safety, and analyzability. Cedar balances these goals by supporting role-based (RBAC), attribute-based (ABAC), and relationship-based (ReBAC) access control models. Policy Structure and Evaluation Cedar policies consist of three primary components: effect, scope, and conditions. The effect can either be "permit" or "forbid", defining whether acc...

Smart Casual Verification of the Confidential Consortium Framework

Image
This paper (NSDI'25) applies lightweight formal methods (hence the pun "smart casual" in contrast to formal attire) to the Confidential Consortium Framework (CCF). CCF is an open-source platform for trustworthy cloud applications, used in Microsoft's Azure Confidential Ledger service. The authors combine formal specification, model checking, and automated testing to validate CCF's distributed protocols, specifically its custom consensus protocol. Background CCF uses trusted execution environments (TEEs) and state machine replication. Its consensus protocol is based on Raft but has major modifications: signature transactions (Merkle tree root over the whole log  signed by the leader), optimistic acknowledgments, and partition leader step-down. CCF also avoids RPCs, using a uni-directional messaging layer. These changes add complexity, requiring formal verification. At the time of writing, CCF's implementation spans 63 kLoC in C++. The authors define five requi...

Model Checking Guided Testing for Distributed Systems

Image
This paper appeared in Eurosys 2023. It shows how to generate test-cases from a TLA+ model of a distributed protocol and apply it to the Java implementation to check for bugs in the implementation. They applied the technique to Raft, XRaft, and Zab protocols, and presented the bugs they find. Background on the problem Bridging the gap between model (i.e., abstract protocol) to the implementation (i.e., concrete code) is an important problem. TLA+ gives validates the correctness of your abstract/model protocol, but does not say anything about whether your implementation of the protocol is correct.  This is still very useful as Lamport likes to argue: if you didn't get the abstract design right, it doesn't matter if your implementation is true to the abstract, you will still have a flawed implementation. While having a TLA+ model checked abstract does not guarantee that your implementation of the abstract is correct, it helps you avoid the big/intricate protocol level problems a...

Automated Validation of State-Based Client-Centric Isolation with TLA+ (2021)

Image
This work provides a TLA+ formalization of t he state-based isolation model introduced in the PODC 2017 paper. The TLA+ models provided here will be very useful for you if you are a distributed database developer. Using these, you can show your database protocols satisfy the desired isolation levels, and as you extend the design, you can keep checking that the guarantees still hold. I have been suffering from this problem recently. I wrote a TLA+ model for a distributed database design. I thought I would be able to write serializability predicates using TLA+ formulas relatively easily. When I tried doing so, I realized what a mess I am in. I made a couple attempts and found that operation-centric specification of database isolation is hard to define and check. I decided approach the problem sideways and wrote some BaitInvariants to show what behaviors are out-ruled, only to find that what I thought would always be rejected was actually valid serialization behavior in some scenarios. ...

Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3 (SOSP21)

Image
This paper comes from my colleagues at AWS S3 Automated Reasoning Group, detailing their experience applying lightweight formal methods to a new class of storage node developed for S3 storage backend. Lightweight formal methods emphasize automation and usability. In this case, the approach involves three prongs: developing executable reference models as specifications, checking implementation conformance to those models, and building infrastructure to ensure the models remain accurate in the future. ShardStore ShardStore is a new append-only key-value storage node developed for AWS S3 backend. It is over 40K lines of Rust code. Shardstore is a log-structured merge tree (LSM tree) but with shard data stored outside the tree to reduce write amplification. ShardStore employs soft updates for avoiding the cost of redirecting writes through a write-ahead log while still being crash consistent. A soft updates implementation ensures that only writes whose dependencies are persisted are sent...

Silent data corruptions at scale

Image
This paper from Facebook (Arxiv Feb 2021) is referred in the Google fail-silent Corruption Execution Errors (CEEs) paper as the most related work. Both papers discuss the same phenomenon, and say that we need to update our belief about quality-tested CPUs not having logic errors, and that if they had an error it would be a fail-stop or at least fail-noisy hardware errors triggering machine checks. This paper provides an account of how Facebook have observed CEEs over several years. After running a wide range of silent error test scenarios across 100K  machines, they found that 100s of CPUs are identified as having these errors, showing that CEEs are a systemic issue across generations. This paper, as the Google paper, does not name specific vendor or chipset types. Also the ~1/1000 ratio reported here matches the ~1/1000 mercurial core ratio that the Google paper reports. The paper claims that silent data corruptions can occur due to device characteristics and are repeatable at s...

Building Distributed Systems With Stateright

Image
Stateright is a model checker for distributed systems. It is provided as a Rust library, and it allows you to verify systems implemented in Rust. It is openly available on GitHub and the developer, Jon Nadal, is looking for contributors and new users.   On Tuesday Jon gave a presentation to us on Zoom. He made his presentation slides available here. We have also recorded Jon's presentation and the Q&A and demo sessions in entirety.    The highlights of Stateright are: great visualization support, time travel debugger: which helps you go back/forth and choose to explore another branch from a given point of the current execution (in the Figure below, the Next Steps heading provide possible next steps to choose from), an actor-based model, an embedded linearizability tester, and extensive docs and Rust book for introducing the concepts. The model trait has state, init_states, actions, next_state, and properties. Similarly there is an actor trait you can implement, and...

Compositional Programming and Testing of Dynamic Distributed Systems

Image
This paper is authored by Ankush Desai, Amar Phanishayee, Shaz Qadeer, and Sanjit A. Seshia, and it appeared at OOPSLA 2018. The paper describes the ModP framework extension over the P language . P is a language developed for safe event-driven programming. P models processes as state machines: the state machines communicate via message passing, and events cause state machines to transition between states.  The killer feature of P is unifying modeling with programming! P enables the programmers to write specifications and enables systematic testing both via random testing and exhaustive symbolic execution testing. P is used in event driven systems such as device drivers and robotics . The P GitHub page  states that P is used in Microsoft (P# and Coyote ) and also used extensively at AWS for model-based testing of complex distributed systems.   Compositional testing of distributed systems A problem with systematic testing is that monolithic testing of large systems fa...

Armada: Low-Effort Verification of High-Performance Concurrent Programs

Image
This paper appeared in PLDI 2020 and is authored by Jacob R. Lorch, Yixuan Chen, Manos Kapritsos, Bryan Parno, Shaz Qadeer, Upamanyu Sharma, James R. Wilcox, and Xueyuan Zhao. The paper presents the Armada framework for code-level verification of concurrent (multithreaded rather than distributed) programs. Armada is roughly a stepwise-refinement extended version of Dafny . So that is where I will start explaining. Dafny Dafny is an open source iterative compiled language for writing verified code. The code is verified with respect to the specifications the developer writes by adding high-level annotations alongside the code, such as ensures P and invariant P. At https://rise4fun.com/Dafny/tutorial you will find an interactive tutorial, that will get you up and running proving programs with Dafny. It is really fun, you should try this out now!  Dafny leverages Z3 for discharging proof obligations. Ok, so what is Z3? Z3 is a popular open source Satisfiability modulo theories (SMT)...

Modeling Streamlet (take two)

Image
This post is a sequel to my last two posts on Streamlet. The first one explained the protocol . The second one presented a first draft modeling of the protocol. In this post, I present two improvements to the first draft modeling.  In the first, I will make the epochs monotonic at the process level. In the second one, I give each node its own inbox, instead of using a shared whiteboard message set. Again, I will introduce the model dev-log style, using the notes I took while developing it. Adding locally monotonic epochs Why don't I just sequentially increase e in the process main loop. Instead of using while true, with e \in E, the process will go through the epochs in increasing order. This still doesn't constrain the processes to go in lock-step through the epochs. Each process can go through epochs in its own pace independent of the others. The only requirement is if a process participated at an epoch (either as a proposer or voter), it will not participate at a lower epoch...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems