Notes from USENIX NSDI 18 First Day
I have been attending USENIX NSDI, one of the premier conferences on networking, in Seattle, WA. Here are some notes from the first day, Monday, April 9.
Pre-session announcements
NSDI accepted 40 papers out of 255 submissions. There was a mention of 1000 reviews done for the conference. That is a lot of reviews by very highly qualified people. It is a shame those reviews are not shared openly; they could be really useful for the community to learn from, and publishing them might also expose any sub-par reviews. There is a movement for an open review process, and I hope it catches on at a wider scale.

The best paper award went to "NetChain: Scale-Free Sub-RTT Coordination" by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.
The community award (for the best paper whose code and dataset are made publicly available) went to "Stateless Datacenter Load-balancing with Beamer" by Vladimir Olteanu, Alexandru Agache, Andrei Voinescu, and Costin Raiciu, University Politehnica of Bucharest.
NSDI makes all papers publicly available, so if any of the papers here interest you, you can download and read them.
First session
The first session was on new hardware. The main theme was how to get the performance that hardware solutions offer together with the programmability of software solutions. I provide short summaries of the presentations of two papers. The session also included two other papers, "PASTE: A Network Programming Interface for Non-Volatile Main Memory" and "Azure Accelerated Networking: SmartNICs in the Public Cloud" from Microsoft.

Approximating fair queueing on reconfigurable switches
The paper is by Naveen Kr. Sharma and Ming Liu, University of Washington; Kishore Atreya, Cavium; Arvind Krishnamurthy, University of Washington.

Congestion control today is done via end-to-end protocols, and the switches are dumb. What if they were smarter? That would provide benefits for the end hosts, fairness, etc.
But smart switches are challenging to realize in high-speed networks, where it is hard to:
- maintain a sorted packet buffer
- store per-flow counters
- access and modify the current round number.
The work implements simulated fair queueing (fair queueing without per-flow queues) in high-speed switches. The approach is based on approximate fair queueing: simulating a bit-by-bit round-robin scheme with key approximations.
The approximate flow counters are stored using a variation of the count-min sketch. The results show that approximate fair queueing is achieved with 12-14 queues. The evaluation also shows that approximate fair queueing leads to a 4-8x improvement in flow completion times.
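To make the bit-by-bit round-robin approximation concrete, here is a minimal Python sketch of the scheme as I understand it from the talk. The class and parameter choices (sketch dimensions, number of queues, bytes per round) are mine for illustration, not the paper's: per-flow state lives in a count-min sketch instead of exact counters, and each packet is mapped to one of a handful of FIFO queues based on its flow's estimated finish round.

```python
import hashlib

class CountMinSketch:
    """Approximate per-flow counters in a fixed amount of space."""
    def __init__(self, rows=4, cols=1024):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def _index(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.cols

    def add(self, key, value):
        # Conservative update: never push a cell above the new estimate,
        # which keeps the overestimation error small.
        target = self.estimate(key) + value
        for r in range(self.rows):
            i = self._index(r, key)
            self.table[r][i] = max(self.table[r][i], target)

    def estimate(self, key):
        return min(self.table[r][self._index(r, key)] for r in range(self.rows))


class ApproxFairQueue:
    """Map packets to a few FIFO queues by their flow's finish round."""
    def __init__(self, num_queues=12, bytes_per_round=1500):
        self.counters = CountMinSketch()
        self.num_queues = num_queues
        self.bytes_per_round = bytes_per_round
        self.current_round = 0  # advanced as queues drain (not modeled here)

    def enqueue(self, flow_id, pkt_bytes):
        # A flow's departure round grows with the bytes it has sent, so
        # heavy flows get pushed into later rounds -- approximating
        # bit-by-bit round robin without per-flow queues.
        sent = self.counters.estimate(flow_id)
        depart_round = max(self.current_round, sent // self.bytes_per_round)
        if depart_round - self.current_round >= self.num_queues:
            return None  # flow is too far ahead of the round: drop
        self.counters.add(flow_id, pkt_bytes)
        return depart_round % self.num_queues  # which FIFO queue to join


afq = ApproxFairQueue()
print(afq.enqueue("flowA", 1500))  # round 0 -> queue 0
print(afq.enqueue("flowA", 1500))  # a round behind now -> queue 1
print(afq.enqueue("flowB", 1500))  # a new flow starts in the current round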
NetChain: Scale-Free Sub-RTT Coordination
The paper is by Xin Jin, Johns Hopkins University; Xiaozhou Li, Barefoot Networks; Haoyu Zhang, Princeton University; Nate Foster, Cornell University; Jeongkeun Lee, Barefoot Networks; Robert Soulé, Università della Svizzera italiana; Changhoon Kim, Barefoot Networks; Ion Stoica, UC Berkeley.

This work received the best paper award. It is also in the distributed coordination area I am interested in, so I have relatively long coverage of this work.
Conventional wisdom says coordination is expensive. NetChain aims to provide lightning-fast coordination enabled by programmable switches. Some applications of such a coordination service are configuration management, group membership, locking, and barrier synchronization.
Right now these are done over a coordination service running Paxos, which is often implemented over a strongly-consistent, fault-tolerant key-value store. Can we do better in terms of latency and throughput?
The opportunity for using in-network coordination is that distributed coordination is communication-heavy rather than computation-heavy. The idea is to run coordination in the switches using consensus. The design goals are to achieve high-throughput and strong consistency.
How do they build a strongly-consistent, fault-tolerant, in-network (which means in the switches) key-value store? They use the chain replication protocol. That is, there is a master configurator managing the chain configuration, and the storage nodes on the chain replicate the key-values.
The in-network (that is, on-the-switch) key-value storage builds on the SOSP'17 paper titled "NetCache: Balancing Key-Value Stores with Fast In-Network Caching," which leverages register arrays in the switches.
There is a problem of possible out-of-order delivery between consecutive switches in the chain, which can lead to inconsistency. The presentation says they solve this by serializing writes with sequence numbers and dropping out-of-order packets. The onus is then on the client to retry when its request doesn't get a reply in time.
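To illustrate the chain replication pattern with sequence numbers, here is a minimal toy Python model (my own sketch, not NetChain's switch implementation): the head assigns sequence numbers, writes propagate from head to tail and are acknowledged at the tail, reads are served by the tail, and each node drops any packet whose sequence number is not strictly larger than the last one it applied.

```python
class ChainNode:
    """One replica (a 'switch') in a chain-replicated key-value store."""
    def __init__(self):
        self.store = {}
        self.last_seq = 0

    def apply_write(self, seq, key, value):
        # Drop out-of-order or duplicate packets to preserve consistency;
        # the client will time out and retry through the head.
        if seq <= self.last_seq:
            return False
        self.store[key] = value
        self.last_seq = seq
        return True


class Chain:
    """Head assigns sequence numbers; writes propagate head -> tail."""
    def __init__(self, num_nodes=3):
        self.nodes = [ChainNode() for _ in range(num_nodes)]
        self.seq = 0

    def write(self, key, value):
        self.seq += 1
        for node in self.nodes:  # in reality the packet hops switch to switch
            if not node.apply_write(self.seq, key, value):
                return False  # dropped mid-chain; client retries
        return True  # ack is sent from the tail

    def read(self, key):
        return self.nodes[-1].store.get(key)  # reads are served by the tail


chain = Chain()
chain.write("lock:resource1", "owner=clientA")
assert chain.read("lock:resource1") == "owner=clientA"
```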
Of course, the complicated part is handling a switch failure. The chain replication technique for recovery is adapted: the master configurator (which is itself maintained via Paxos) reconfigures the chain by removing the crashed switch. But then, to preserve fault tolerance, a new node needs to be added later. The paper says the master first copies the state to the newly added node and then adds it to the chain. Of course, there is a catching-up problem for the newly added switch if the previous node keeps receiving and inserting new items. There needs to be some blocking to coordinate this, probably via two-phase commit. I quickly scanned the paper and didn't see this discussed.
The paper has a TLA+ specification in the extended arXiv version. I checked the TLA+ model, and it assumes that copying state to the new switch is done atomically, which seems to be an over-simplification of what needs to be implemented in reality.
The work is evaluated over 4 Barefoot Tofino switches and 4 commodity servers. I think the presenter said that up to 100K key-value pairs could be stored in the 8Mb of storage available on the switches. Compared to ZooKeeper, the solution provided three orders of magnitude improvement in throughput, one order of magnitude improvement in read latency, and two orders of magnitude improvement in write latency.
The presenter concluded by asking what kind of new applications could be enabled by a faster coordination service in datacenters. I am still unclear what sub-RTT means in the title, but I will read the paper and write a longer review later.
Second session
The second session was on distributed systems, my favorite topic. I have brief summaries of the three papers presented in this session.

zkLedger: Privacy-Preserving Auditing for Distributed Ledgers
This paper is by Neha Narula, MIT Media Lab; Willy Vasquez, University of Texas at Austin; Madars Virza, MIT Media Lab.

Neha presented this paper, and pulled it off without using the word blockchain even once.
The verification provided by distributed ledgers should not mean everything is out in the open. Companies require some privacy to keep business strategies and positions secret. On the other hand, too much secrecy is also a bad thing; there needs to be some auditability to prevent bad actors.
zkLedger provides practical privacy and complete auditing.
A big challenge here is that a bad-actor bank could omit transactions. zkLedger jointly addresses the privacy and auditability requirements by keeping an entry for every bank in every transaction. To hide values, zkLedger uses Pedersen commitments. The key insight is that the auditor audits every transaction. zkLedger also uses an interactive map/reduce paradigm over the ledger with non-interactive zero-knowledge proofs (NIZKs) to compute measurements that go beyond sums.
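As a refresher on the commitment scheme, a Pedersen commitment hides a value but is additively homomorphic, which is what allows an auditor to check sums over committed amounts without seeing them. Here is a toy Python sketch with insecure, made-up parameters purely to show the mechanics; real implementations work over elliptic-curve groups.

```python
import secrets

# Toy group parameters (far too small to be secure; illustration only).
p = 2**127 - 1   # a Mersenne prime used as the modulus
g, h = 3, 7      # generators; in practice h's discrete log w.r.t. g must be unknown

def commit(value, blinding=None):
    """Pedersen commitment: com = g^value * h^blinding mod p."""
    r = blinding if blinding is not None else secrets.randbelow(p - 1)
    return pow(g, value, p) * pow(h, r, p) % p, r

# Commit to two transaction amounts.
c1, r1 = commit(100)
c2, r2 = commit(250)

# Homomorphism: the product of commitments commits to the sum of values
# under the sum of the blinding factors. An auditor can thus verify a
# claimed total without ever seeing the individual amounts.
c_sum, _ = commit(100 + 250, blinding=(r1 + r2))
assert c1 * c2 % p == c_sum
```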
The abstract of the paper provides a good summary, so here it is:
Banks create digital asset transactions that are visible only to the organizations party to the transaction, but are publicly verifiable. An auditor sends queries to banks, for example "What is the outstanding amount of a certain digital asset on your balance sheet?" and gets a response and cryptographic assurance that the response is correct. zkLedger has two important benefits over previous work. First, zkLedger provides fast, rich auditing with a new proof scheme using Schnorr-type non-interactive zero-knowledge proofs. Unlike zk-SNARKs, our techniques do not require trusted setup and only rely on widely-used cryptographic assumptions. Second, zkLedger provides completeness; it uses a columnar ledger construction so that banks cannot hide transactions from the auditor, and participants can use rolling caches to produce and verify answers quickly. We implement a distributed version of zkLedger that can produce provably correct answers to auditor queries on a ledger with a hundred thousand transactions in less than 10 milliseconds.
Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization
This paper is by Yilong Geng, Shiyu Liu, and Zi Yin, Stanford University; Ashish Naik, Google Inc.; Balaji Prabhakar and Mendel Rosenblum, Stanford University; Amin Vahdat, Google Inc.

This work aims to provide accurate timestamping as a primitive (with 10-nanosecond accuracy) in datacenters at scale. Tightly-synchronized clocks have applications in distributed systems, particularly for distributed databases such as Spanner and CockroachDB.
The system they develop, Huygens, takes NIC timestamps (which are supported by most current-generation NICs) and doesn't require specialized switches, unlike PTP. Other than the NIC timestamping support, Huygens is software-based.
Huygens leverages three key ideas. First, coded probes are used for identifying and rejecting impure probe data that suffered queueing delays. To detect this, Huygens sends two consecutive probe packets with a known gap of 10 microseconds (using NIC timestamping) and checks the gap between them on the receiving end. Only the probe pairs arriving with the original 10-microsecond gap are accepted as pure, since they most likely experienced zero queueing delay. Since the queues change fast, it is very unlikely that both consecutive packets were subject to the same nonzero queueing delay.
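Here is a minimal Python sketch of that purity filter; the function name and the tolerance threshold are invented for illustration:

```python
SEND_GAP_US = 10.0   # known gap between the two probes in a coded pair
TOLERANCE_US = 0.05  # how much gap distortion to accept (assumed value)

def pure_pairs(probe_pairs):
    """Keep only probe pairs whose receive gap matches the send gap.

    Each element is (tx1, rx1, tx2, rx2): NIC timestamps of the two
    probes in a pair. If the receive-side gap rx2 - rx1 deviates from
    the send gap, at least one probe sat in a queue, so the pair is
    rejected as impure.
    """
    return [
        (tx1, rx1, tx2, rx2)
        for (tx1, rx1, tx2, rx2) in probe_pairs
        if abs((rx2 - rx1) - SEND_GAP_US) <= TOLERANCE_US
    ]

# Pair 1 is kept (gap preserved); pair 2 is dropped (second probe got queued).
pairs = [(0.0, 52.0, 10.0, 62.0), (100.0, 153.0, 110.0, 168.5)]
assert len(pure_pairs(pairs)) == 1
```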
Second, Huygens processes the purified data with support vector machines (SVMs), a widely-used and powerful classifier, to accurately estimate one-way propagation times. (Huygens assumes the delay between two servers is symmetric.)
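To see why an SVM-style separator recovers the offset: purified forward probes measure roughly delay + offset and reverse probes measure delay - offset, so under the symmetry assumption the offset sits exactly halfway between the two bands. A toy one-dimensional Python illustration (the paper additionally fits a line over time with an SVM to capture drift; this sketch just takes the max-margin midpoint at one instant):

```python
TRUE_OFFSET_US = 3.0  # B's clock is 3 us ahead of A's (unknown to the estimator)
TRUE_DELAY_US = 50.0  # symmetric one-way propagation delay

# Purified (near-zero-queueing) probe measurements in each direction.
forward = [TRUE_DELAY_US + TRUE_OFFSET_US + n for n in (0.00, 0.02, 0.01)]  # A -> B
reverse = [TRUE_DELAY_US - TRUE_OFFSET_US + n for n in (0.01, 0.00, 0.03)]  # B -> A

# Forward samples lie at delay + offset; negated reverse samples lie at
# offset - delay. The max-margin separator between the two bands sits
# halfway between them, i.e., at the clock offset itself.
est_offset = (min(forward) + max(-d for d in reverse)) / 2
print(round(est_offset, 2))  # ~3.0
```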
Finally, to detect and correct synchronization errors even further, Huygens exploits the network effect: a group of pairwise-synchronized clocks must also be transitively synchronized.
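The network effect can be made concrete: if pairwise offset estimates were perfect, they would sum to zero around any loop of servers, so a nonzero loop residual exposes estimation error that can be spread back over the edges. A toy Python illustration of this loop correction (my simplification; the paper solves this at datacenter scale):

```python
# Estimated pairwise clock offsets (microseconds) around a loop A->B->C->A.
# With perfect estimates, off_ab + off_bc + off_ca == 0.
off_ab, off_bc, off_ca = 4.2, -1.1, -2.5

residual = off_ab + off_bc + off_ca  # 0.6 us of total estimation error
correction = residual / 3            # spread it evenly over the three edges

off_ab -= correction
off_bc -= correction
off_ca -= correction

# After correction, the loop constraint holds (up to floating point).
assert abs(off_ab + off_bc + off_ca) < 1e-9
```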
One of the questions was about the high probing rate employed by this work. Another question was whether this could be extended to the WAN. The presenter mentioned 10-microsecond accuracy in a WAN experiment, but I wonder if that is due to Google datacenters having private, high-speed links.
One very interesting question was whether this could be used for measuring the temperature in datacenters. This is a great question, not a crazy one. High temperature makes a local clock run slower, and there is a predictable linear relationship. So if you have tightly synchronized clocks, you can measure the drift from the ideal and infer the temperature increase/decrease.
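To spell out that inference, here is a back-of-the-envelope Python sketch; the slope and baseline are hypothetical numbers purely for illustration, since the actual drift-vs-temperature relationship depends on the oscillator:

```python
# Assumed linear model: drift_ppm = K * (temp_c - BASELINE_TEMP_C).
# K and BASELINE_TEMP_C are hypothetical values for illustration.
K = -0.04             # ppm of clock slowdown per degree Celsius
BASELINE_TEMP_C = 25.0

def temp_from_drift(drift_ppm):
    """Invert the linear model: infer temperature from measured drift."""
    return BASELINE_TEMP_C + drift_ppm / K

# A clock measured (against its tightly synchronized peers) to run
# 0.6 ppm slow suggests the machine is about 15 C above baseline.
print(temp_from_drift(-0.6))  # -> 40.0
```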
I will read and summarize this paper later, since I am interested in mechanisms and applications of clock synchronization in distributed systems. After our work on hybrid logical clocks (HLC), we built a distributed monitoring system, Retroscope, using HLC.
SnailTrail: Generalizing Critical Paths for Online Analysis of Distributed Dataflows
This paper is by Moritz Hoffmann, Andrea Lattuada, John Liagouris, Vasiliki Kalavri, Desislava Dimitrova, Sebastian Wicki, Zaheer Chothia, and Timothy Roscoe, ETH Zurich.

I had provided a summary of this work earlier on my blog; please see that post.
Third session
The third session was on traffic management. I am not really a networking guy, but I listened to these talks to get more understanding of what is going on in this domain. The papers presented were:
- "Balancing on the Edge: Transport Affinity without Network State",
- "Stateless Datacenter Load-balancing with Beamer",
- "Larry: Practical Network Reconfigurability in the Data Center", and
- "Semi-Oblivious Traffic Engineering: The Road Not Taken"
Fourth session
The fourth session was on network function virtualization and hardware. The papers presented were:
- Metron: NFV Service Chains at the True Speed of the Underlying Hardware
- G-NET: Effective GPU Sharing in NFV Systems
- SafeBricks: Shielding Network Functions in the Cloud
MAD questions
Really, you read this far into this long post and still expect me to write some MAD questions? OK, here is just one, so that my promise is not broken.

The afternoon, and especially the late afternoon, isn't really great for listening to conference talks. Following talks requires a lot of mental effort, because the deliveries are very fast. Moreover, the context changes drastically from talk to talk, and that also depletes attention.
I wonder if the late afternoons would be better spent on panels or other more lively, less concentration-demanding activities.
I have heard of a book on these topics, "When: The Scientific Secrets of Perfect Timing" by Daniel Pink. I plan to check it out.