Posts

Showing posts with the label automated reasoning

Neurosymbolic AI: The 3rd Wave

The paper (arXiv 2020, also AI Review 2023) opens with a discussion of recent high-profile AI debates: the Montréal AI Debate and the AAAI 2020 fireside chat with Kahneman, Hinton, LeCun, and Bengio. A consensus seems to be emerging: for AI to be robust and trustworthy, it must combine learning with reasoning. Kahneman's "System 1 vs. System 2" dual framing of cognition maps well onto deep learning and symbolic reasoning, and AI needs both. Neurosymbolic AI promises to combine data-driven learning with structured reasoning, and to provide modularity, interpretability, and measurable explanations. The paper moves from philosophical context to representation, then to system design and technical challenges in neurosymbolic AI. In the section "Neurons and Symbols: Context and Current Debate," the paper lays out the historic divide between symbolic AI and neural AI. The symbolic approach supports logic, reasoning, and explanation; the neural approach excels at perception and learning from data. Symbolic...

Neurosymbolic AI: Why, What, and How

The paper (2023) argues for integrating two historically divergent traditions in artificial intelligence (neural networks and symbolic reasoning) into a unified paradigm called Neurosymbolic AI. It contends that the path to capable, explainable, and trustworthy artificial intelligence lies in marrying perception-driven neural systems with structure-aware symbolic models. The authors lean on Daniel Kahneman's story of two systems in the mind (Thinking, Fast and Slow). Neural networks are the fast ones: pattern-hungry, intuitive, good with unstructured mess. Symbolic methods are the slow ones: careful, logical, good with rules and plans. Neural networks, especially in their modern incarnation as large language models (LLMs), excel at pattern recognition but fall short in tasks demanding multi-step reasoning, abstraction, constraint satisfaction, or explanation. Conversely, symbolic systems offer interpretability, formal correctness, and composability, but tend to be brittle (not in...

Multi-Grained Specifications for Distributed System Model Checking and Verification

This EuroSys 2025 paper wrestles with the messy interface between formal specification and implementation reality in distributed systems. The case study is ZooKeeper. The trouble with verifying something big like ZooKeeper is that the spec and the code don’t match. Spec wants to be succinct and abstract; code has to be performant and dirty.  For instance, a spec might say, “this happens atomically.” But the actual system says, “yeah, buddy, right.” Take the case of FollowerProcessNEWLEADER: the spec bundles updating the epoch and accepting the leader’s history into a single atomic step. But in the real system, those steps are split across threads and separated by queuing, I/O, and asynchronous execution. Modeling them as atomic would miss real, observable intermediate states, and real bugs. To bridge this model-code gap, the authors use modularization and targeted abstraction. Don’t write one spec, write multi-grained specs. Different parts of the system are modeled at different ...

Cedar: A New Language for Expressive, Fast, Safe, and Analyzable Authorization

This paper (2024) presents Cedar, AWS's new authorization policy language. By providing a clear, declarative way to manage access control policies, Cedar addresses the common limitations of embedding authorization logic directly into application code: problems with correctness, auditing, and maintenance. Cedar introduces a domain-specific language (DSL) to express policies separately from application code, which improves readability and manageability. In that sense, this is like aspect-oriented programming, but for authorization policy. The language is designed with four main objectives: expressiveness, performance, safety, and analyzability. Cedar balances these goals by supporting role-based (RBAC), attribute-based (ABAC), and relationship-based (ReBAC) access control models. As for policy structure and evaluation: Cedar policies consist of three primary components: effect, scope, and conditions. The effect can either be "permit" or "forbid", defining whether acc...
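To illustrate the policy shape described above, here is a small Cedar-style policy sketch. The entity and attribute names (User, Action, Album, visibility) are hypothetical illustrations, not taken from the paper:

```cedar
// Effect "permit", scope constraining principal/action/resource,
// and a when-condition adding an attribute-based (ABAC) check.
permit (
  principal == User::"alice",
  action == Action::"viewPhoto",
  resource in Album::"vacation"
)
when { resource.visibility == "public" };
```

Because policies live outside the application code, adding or auditing a rule like this does not require touching (or redeploying) the application itself.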

Smart Casual Verification of the Confidential Consortium Framework

This paper (NSDI'25) applies lightweight formal methods (hence the pun "smart casual," in contrast to formal attire) to the Confidential Consortium Framework (CCF). CCF is an open-source platform for trustworthy cloud applications, used in Microsoft's Azure Confidential Ledger service. The authors combine formal specification, model checking, and automated testing to validate CCF's distributed protocols, specifically its custom consensus protocol. Background: CCF uses trusted execution environments (TEEs) and state machine replication. Its consensus protocol is based on Raft but has major modifications: signature transactions (a Merkle tree root over the whole log, signed by the leader), optimistic acknowledgments, and partition leader step-down. CCF also avoids RPCs, using a uni-directional messaging layer. These changes add complexity, motivating formal verification. At the time of writing, CCF's implementation spans 63 kLoC in C++. The authors define five requi...

Design and Analysis of a Logless Dynamic Reconfiguration Protocol

This paper appeared in OPODIS'21 and describes dynamic reconfiguration in MongoDB. So, what is dynamic reconfiguration? The core Raft protocol implements state machine replication (SMR) using a static set of servers. (Please read this to learn how MongoDB adopted Raft for a pull-based SMR.) To ensure availability in the presence of faults, SMR systems must be able to dynamically (and safely) replace failed nodes with healthy ones. This is known as dynamic reconfiguration. MongoDB logless reconfiguration: Since its inception, the MongoDB replication system has provided a custom, ad hoc, legacy protocol for dynamic reconfiguration of replicas. This legacy protocol managed configurations in a logless fashion, i.e., each server only stored its latest configuration. It decoupled reconfiguration processing from the main database operation log. The legacy protocol, however, was known to be unsafe in certain cases. Revising that legacy protocol, this paper presents a redesigned saf...
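A toy sketch of the logless idea (my illustration, not the paper's code): each node stores only its latest configuration, and configurations are totally ordered, so a node can always adopt the newest configuration it hears about without keeping a log of past ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    term: int           # election term in which the config was written
    version: int        # monotonically increasing config version
    members: frozenset  # replica set membership


def newer(a: Config, b: Config) -> bool:
    """Configs are ordered lexicographically by (term, version)."""
    return (a.term, a.version) > (b.term, b.version)


def maybe_adopt(current: Config, seen: Config) -> Config:
    """Gossip-style propagation: keep only the newest config, no log needed."""
    return seen if newer(seen, current) else current
```

The real protocol layers safety conditions on top of this (e.g., quorum checks before a new config can take effect), which is exactly what the paper designs and verifies.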

Beyond the Code: TLA+ and the Art of Abstraction

I have been teaching a TLA+ miniseries inside AWS. I just finished the 10th week, with a one-hour seminar each week. I wanted to pen down my learnings from this experience. The art of abstraction: Let's start with abstraction. Abstraction is a powerful tool for avoiding distraction. The etymology of the word abstract comes from the Latin for "cut and draw away." With abstraction, you slice out the protocol from a complex system, omit unnecessary details, and simplify a complex system into a useful model. For example, if you are interested in the consistency model of your distributed system, you can abstract away the mechanics of communication in the system when that is an unnecessary distraction. In his 2019 talk, Leslie Lamport said: Abstraction, abstraction, abstraction! That's how you win a Turing Award. Inside Amazon, abstraction is also very valuable. We see skilled engineers look at large systems, slice out a protocol/subsystem, cut to the essence of it, and dive deep to make th...
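As one concrete instance of abstracting away the mechanics of communication, a common TLA+ idiom (my generic sketch, not from the post) models the network as a bag of in-flight messages rather than as ordered channels:

```tla
---------------- MODULE AbstractComm ----------------
\* Model the network as a set of in-flight messages.
\* Ordering, duplication, and channel mechanics are abstracted away.
VARIABLE msgs

Init == msgs = {}

\* Sending adds a message to the set.
Send(m) == msgs' = msgs \cup {m}

\* A message can be received whenever it is in the set;
\* leaving it in the set models possible re-delivery.
Recv(m) == /\ m \in msgs
           /\ UNCHANGED msgs
=====================================================
```

This two-line network model is usually all a consistency proof needs; FIFO queues, retransmission, and timeouts would only be distractions at this level.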

Model Checking Guided Testing for Distributed Systems

This paper appeared in EuroSys 2023. It shows how to generate test cases from a TLA+ model of a distributed protocol and apply them to the Java implementation to check for bugs in the implementation. The authors applied the technique to the Raft, XRaft, and Zab protocols, and presented the bugs they found. Background on the problem: Bridging the gap between the model (i.e., the abstract protocol) and the implementation (i.e., the concrete code) is an important problem. TLA+ validates the correctness of your abstract/model protocol, but does not say anything about whether your implementation of the protocol is correct. This is still very useful, as Lamport likes to argue: if you didn't get the abstract design right, it doesn't matter whether your implementation is true to the abstract; you will still have a flawed implementation. While having a TLA+ model-checked abstract does not guarantee that your implementation of the abstract is correct, it helps you avoid the big/intricate protocol-level problems a...

Popular posts from this blog

Hints for Distributed Systems Design

My Time at MIT

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Foundational distributed systems papers

Advice to the young

Learning about distributed systems: where to start?

Distributed Transactions at Scale in Amazon DynamoDB

Making database systems usable

Looming Liability Machines (LLMs)

Analyzing Metastable Failures in Distributed Systems