SRDS Day 1

This week, I was at the 43rd International Symposium on Reliable Distributed Systems (SRDS 2024) in Charlotte, NC. The conference was hosted at UNC Charlotte, which has a large and beautiful campus.

I was the Program Committee co-chair for SRDS'24 along with Silvia Bonomi. Marco Vieira and Bojan Cukic were the general co-chairs. A lot of work goes into organizing a conference. Silvia and I recruited the reviewers (27 reviewers, plus external reviewers), oversaw the reviewing process (using the HotCRP conference review management system), and put together the conference program from the accepted papers.

We received 87 submissions, and 2 papers were desk-rejected. Each paper received 3 reviews, for a total of 261 reviews. In the end, 27 papers were accepted (an acceptance rate of ~30%), and 1 was withdrawn after notification.

Marco and Bojan took care of everything else, including all the local arrangements and website management, which are a lot more involved than you would guess. There is always something that can go wrong at any step of the process, and you have to stay on top of everything. This behind-the-scenes logistics work, often unnoticed by attendees, is crucial to the success of any conference.

I am glad this is over. And I am also happy that we played our role and carried our fair share of the load in advancing reliable distributed systems research through this conference.

Ok, now the fun part. I took some notes during the presentations, and below I give brief overviews of some of the interesting work presented at the conference.


Cristina Nita-Rotaru's Keynote 

Cristina is a Professor of Computer Science at Northeastern University. She is an expert on Network and Distributed Systems Security, and instead of listing her numerous awards I will direct you to the bio/abstract link.

Her talk was titled "Using Formal Methods to Understand the Security Risks of Distributed Protocols". If you read this blog, you know that I am a big proponent of using formal methods for establishing/validating the correctness of distributed systems and protocols. I had been hesitant about their application to security, as I didn't have much experience in that domain. Cristina's talk addressed exactly that topic, and showed that formal methods are hugely beneficial for finding and understanding security risks of distributed/networked protocols. She made the talk accessible, and I was able to follow how the formal modeling approach transfers to the security domain as well.

Cristina argued that adversarial testing using a Byzantine adversary (for example, adversarial testing of congestion control) is better than the straightforward application of fuzzing, which is what is most commonly seen at security/networking conferences. She then introduced their approach to adversarial testing: use an abstract model to generate abstract strategies, map the abstract strategies to concrete strategies, and execute the concrete strategies.
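To make that pipeline a bit more concrete, here is a minimal Python sketch of the three stages as I understood them from the talk. This is my own toy illustration, not Cristina's tooling; the strategy and packet representations are invented for the example.

```python
# Toy sketch of the adversarial-testing pipeline: generate abstract strategies
# over model-level actions, map each one to a packet-level predicate, and
# "execute" it against a captured trace.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AbstractStrategy:
    """An attack expressed over model-level actions, e.g. 'drop every ACK'."""
    name: str
    model_actions: List[str]

def generate_abstract_strategies(model_actions: List[str]) -> List[AbstractStrategy]:
    # Placeholder for model-based search (e.g., counterexample traces from a
    # model checker); here we just enumerate single-action drop attacks.
    return [AbstractStrategy(f"drop_{a}", [a]) for a in model_actions]

def concretize(strategy: AbstractStrategy) -> Callable[[dict], bool]:
    # Map abstract actions to a packet-level predicate the harness understands:
    # "should this packet be dropped?"
    targets = set(strategy.model_actions)
    return lambda packet: packet.get("type") in targets

def execute(drop: Callable[[dict], bool], trace: List[dict]) -> List[dict]:
    # Run the concrete strategy against a packet trace: drop matching packets.
    return [pkt for pkt in trace if not drop(pkt)]

if __name__ == "__main__":
    trace = [{"type": "SYN"}, {"type": "ACK"}, {"type": "DATA"}]
    for s in generate_abstract_strategies(["ACK", "SYN"]):
        delivered = execute(concretize(s), trace)
        print(s.name, "->", [p["type"] for p in delivered])
```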

Formal methods play a crucial role in this process. They help clarify system specifications, make implicit assumptions explicit, and identify flaws through counterexamples. Natural language specifications are often informal, inconsistent, and incomplete, and reference implementations may differ from the textual descriptions. So code is not enough; we need the specification.

Cristina's talk covered four main areas where formal methods are applied: attack synthesis, patch verification, ambiguity resolution, and security definition.


Attack Synthesis. Cristina recounted the history of attacks on protocols like TCP, most of which were discovered manually. Her approach is to synthesize attacks instead. This involves formally modeling the protocol, specifying the properties the protocol must meet, and then defining the attacker as a process that, when composed with the target system, creates a property violation. To synthesize attacks, they created the KORG framework and applied the technique to protocols such as TCP, DCCP, and SCTP (used in applications like Zoom and WhatsApp).
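Here is a hedged toy sketch of the attack-synthesis idea: model the target as a small transition system, state a property, and search for an attacker whose composition with the target violates it. This is my own simplified re-creation of the idea; the real KORG framework is far more sophisticated, and the states and messages below are invented for illustration.

```python
# Toy attack synthesis: brute-force over attacker action sequences, keeping
# those whose composition with the target system violates the property.

from itertools import product

ATTACKER_ALPHABET = ["spoof_rst", "spoof_synack", "noop"]

def step(state, msg):
    # Target system: a highly simplified TCP-like endpoint.
    if state == "established" and msg == "spoof_rst":
        return "closed"          # a spoofed reset tears the connection down
    if state == "syn_sent" and msg == "spoof_synack":
        return "established"     # connection opened by the attacker, not the peer
    return state

def violates_property(states):
    # Property: an established connection is never torn down by injected traffic.
    return "established" in states and states[-1] == "closed"

def synthesize_attacks(initial_state="established", budget=2):
    attacks = []
    for actions in product(ATTACKER_ALPHABET, repeat=budget):
        states = [initial_state]
        for a in actions:
            states.append(step(states[-1], a))
        if violates_property(states):
            attacks.append(actions)
    return attacks

# Prints every injected-message sequence that tears the connection down.
print(synthesize_attacks())
```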

Patch Verification. This part focused on the importance of verifying patches for known vulnerabilities. Patches, often referred to as "mitigations," may themselves introduce new vulnerabilities or fail to fully address the original issue. Cristina's team developed methods to verify that patches not only fix the intended problem but also don't introduce new bugs. This process involves rigorous testing and formal verification of the patched protocol implementation.
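The acceptance criterion can be captured in a few lines. Below is a minimal sketch (my framing, not the team's tooling) of the patch-verification loop: a patch is accepted only if attack synthesis finds no violation against the patched model and the patched model still satisfies the original specification. The `synthesize_attacks` and `check_properties` callbacks are stand-ins for real checkers.

```python
# Minimal patch-verification loop: re-run attack synthesis on the patched
# model and re-check the original correctness properties.

def verify_patch(patched_model, synthesize_attacks, check_properties):
    """synthesize_attacks(model) -> list of attacks violating some property;
    check_properties(model)   -> True if the functional spec still holds."""
    attacks = synthesize_attacks(patched_model)
    if attacks:
        return f"rejected: patch still admits attacks {attacks}"
    if not check_properties(patched_model):
        return "rejected: patch breaks the original specification"
    return "accepted"

# Usage with stand-in checkers:
print(verify_patch("patched-tcp-model",
                   synthesize_attacks=lambda m: [],
                   check_properties=lambda m: True))
```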

Ambiguity Resolution. Protocol specifications are often written in natural language, so they can contain ambiguities that lead to security vulnerabilities. Cristina discussed how through their work they successfully got an errata approved for the SCTP RFC, clarifying the ambiguous sections and potentially preventing future implementation errors.

Security Definition. The final section shifted the focus to the decentralized finance (DeFi) space, specifically discussing the gossipsub protocol used in Ethereum 2.0 and Filecoin. This protocol aims to handle communication in the presence of Byzantine (malicious) attackers. Cristina explained the peer-scoring mechanism used in gossipsub to identify and mitigate the impact of malicious nodes. While this scoring system was previously validated through testing, the talk described a more rigorous approach using formal methods. The team modeled gossipsub in ACL2, a theorem prover, to formally analyze its properties. This analysis revealed potential issues in Ethereum 2.0's implementation, showing that certain weight combinations in the scoring function would never converge to a stable state.
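As a toy illustration of the convergence question (this is not the authors' ACL2 model, and the weights and decay factors below are made-up numbers), one can simulate a gossipsub-style score as a weighted sum of decaying counters and ask whether it settles to a fixed point:

```python
# Does a peer's score settle to a fixed point under a given combination of
# weights and decay factors, or does it drift forever?

def simulate_score(weights, decays, increments, heartbeats=10_000):
    """Each heartbeat: counter <- counter * decay + increment;
    the score is the weighted sum of the counters."""
    counters = [0.0] * len(weights)
    prev_score = None
    for _ in range(heartbeats):
        counters = [c * d + inc for c, d, inc in zip(counters, decays, increments)]
        score = sum(w * c for w, c in zip(weights, counters))
        if prev_score is not None and abs(score - prev_score) < 1e-9:
            return ("converged", round(score, 3))
        prev_score = score
    return ("did not converge", round(prev_score, 3))

# With all decay factors < 1, each counter approaches increment / (1 - decay),
# so the score converges; a decay factor of 1.0 on a penalized counter never does.
print(simulate_score(weights=[1.0, -10.0], decays=[0.9, 0.99], increments=[1.0, 0.1]))
print(simulate_score(weights=[1.0, -10.0], decays=[0.9, 1.0],  increments=[1.0, 0.1]))
```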


MSF-Model: Queuing-Based Analysis and Prediction of Metastable Failures in Replicated Storage Systems

The authors of this work are F. Habibi, T. Lorido-Botran, A. Showail, D. Sturman, F. Nawab.

Metastable failures occur when a triggering event pushes the system into a bad state, which then persists even after the initial trigger is removed. This persistence is due to sustaining effects within the system. A common scenario leading to metastability is request retries, which account for over 50% of metastability events. These failures are particularly challenging to recover from without shedding load.

This work targets the retry-request behavior using a queueing-theory modeling approach. The model employs an M/M/1 queueing system, where requests arrive according to a Poisson process with rate λ and service times are exponentially distributed with rate μ. To analyze such a system, the authors use a two-dimensional stationary Markov chain. They conduct Monte Carlo simulations and employ a distance metric to evaluate the system's behavior. An interesting observation is that the system doesn't always converge to a stable state.
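To get a feel for the retry-driven sustaining effect, here is a much-simplified toy simulation I wrote (not the paper's MSF-Model; the parameters are arbitrary). It runs a single queue in coarse time steps, applies a temporary arrival spike as the trigger, and lets requests that wait past a timeout be retried while their stale copies still consume server capacity. The queue stays overloaded long after the spike ends:

```python
# Toy metastability demo: a temporary arrival spike plus client retries keeps
# the queue overloaded even after the trigger is removed.

import random

def simulate(lam=0.7, mu=1.0, timeout=20, spike=(200, 230, 5.0), steps=1200, seed=1):
    random.seed(seed)
    queue = []          # arrival-step stamp of each queued request (retries get fresh stamps)
    depth = []
    for t in range(steps):
        rate = spike[2] if spike[0] <= t < spike[1] else lam   # the trigger: a load spike
        arrivals = sum(1 for _ in range(50) if random.random() < rate / 50)
        queue.extend([t] * arrivals)
        # Sustaining effect: a request that has waited past the timeout is retried
        # by its client, but the stale copy still sits in the queue and still
        # consumes server capacity (duplicated work).
        retries = [t for a in queue if t - a == timeout + 1]
        queue.extend(retries)
        served = sum(1 for _ in range(50) if random.random() < mu / 50)
        queue = queue[served:]
        depth.append(len(queue))
    return depth

d = simulate()
print("queue depth before spike:", d[150],
      "| during spike:", d[225],
      "| long after the trigger:", d[1000])
```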


HAPPA: A Modular Platform for HPC Application Resilience Analysis with LLMs Embedded

The authors are H. Jiang, J. Zhu, B. Fang, K. Barker, C. Chen, R. Jin, Q. Guan.

Soft errors are a concern in High-Performance Computing (HPC) applications as they can lead to silent data corruption. A common example of a soft error is a bit-flip in memory, which can have severe consequences for correctness and accuracy. Traditionally, analyzing the impact of such errors involves fault-injection techniques, which are often time-consuming and resource-intensive. The HAPPA platform introduces a novel approach: static prediction of application resilience without the need for fault-injection.
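For intuition on what the traditional fault-injection approach looks like (this toy sketch is mine, not part of HAPPA), the snippet below flips a single bit of a double inside a tiny kernel. Which bit gets hit largely determines whether the result is silently slightly wrong, badly wrong, or blows up:

```python
# Toy fault injection: flip one bit of an IEEE-754 double and observe how the
# error propagates through a small computation.

import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip bit `bit` (0..63) of the IEEE-754 double representation of x."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def toy_kernel(values, corrupt_index=None, bit=None):
    """Sum of squares, optionally corrupting one input value."""
    total = 0.0
    for i, v in enumerate(values):
        if i == corrupt_index:
            v = flip_bit(v, bit)
        total += v * v
    return total

data = [1.0] * 8
golden = toy_kernel(data)
for bit in (0, 30, 52, 62):   # low mantissa, high mantissa, exponent bits
    faulty = toy_kernel(data, corrupt_index=3, bit=bit)
    print(f"bit {bit:2d}: result {faulty!r}, error {abs(faulty - golden):.3g}")
```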

The authors use the LULESH (Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics) code as a case study for their analysis. Their method leverages transformer-based pre-trained models to understand the contextual influences on code resilience. This allows studying how different code structures and patterns might respond to soft errors. This was interesting work; the findings in the photo should give you some idea of how code context changes the effects of a bit flip.


Tolerating Compound Threats in Critical Infrastructure Control Systems

The authors are S. Bommareddy, M. Khan, H. Nadeem, B. Gilby, I. Chiu, J. van de Lindt, O. Nofal, M. Panteli, L. Wells II, Y. Amir, A. Babay.

This work addresses an emerging and potentially devastating threat to critical infrastructure: compound threats, such as a cyberattack occurring in the aftermath of a natural hazard. This scenario combines two serious challenges: crash failures (potentially caused by natural disasters) and Byzantine failures (caused by malicious attacks).

The goal of this research is to analyze the system design implications for power grid systems under these compound threat scenarios. The researchers argue that current Supervisory Control and Data Acquisition (SCADA) systems are not adequately equipped to handle these complex, multi-faceted threats. They support this with analysis results from Hawaii and Florida grid control centers, studying how some of the control centers and their backups are situated in hurricane paths. One key finding of the study is the importance of system reconfiguration in enhancing resilience against compound threats. The ability to dynamically adjust the system's structure or operation in response to both physical damage and cyber intrusions appears to be crucial for maintaining functionality under extreme conditions.
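As a back-of-the-envelope illustration of why compound threats stress replica sizing (this is my assumption, borrowed from prior intrusion-tolerant SCADA work rather than a formula stated in this paper): tolerating f Byzantine replicas plus k crashed or disconnected replicas is commonly sized at n ≥ 3f + 2k + 1, so a deployment that was adequate against a cyberattack alone can become inadequate once a hurricane also takes out a site.

```python
# Hedged sizing sketch: n >= 3f + 2k + 1 replicas to tolerate f Byzantine
# faults and k crashed/disconnected replicas (assumed rule from related
# intrusion-tolerant SCADA work, not from this paper).

def replicas_needed(f_byzantine: int, k_crashed: int) -> int:
    return 3 * f_byzantine + 2 * k_crashed + 1

def still_safe(n_deployed: int, f_byzantine: int, k_crashed: int) -> bool:
    return n_deployed >= replicas_needed(f_byzantine, k_crashed)

# A 4-replica deployment tolerates 1 Byzantine fault (3f+1), but add one
# hurricane-downed site on top of the intrusion and it no longer suffices.
print(still_safe(4, f_byzantine=1, k_crashed=0))   # True
print(still_safe(4, f_byzantine=1, k_crashed=1))   # False -> reconfigure / add sites
```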

