PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Last week we covered the Portland paper by UC San Diego, which appeared at Sigcomm'09. This paper is well written and I really appreciate its clarity and simplicity.

The motivation for the work is the need to scale at the data center networks (DCNs). Data centers today may have around 100K machines, with 32 virtual machines at each, which yield a total of 3 million hosts. Assuming one switch is required for every 25 physical hosts and accounting for interior nodes, the topology would consist of 8,000 switches. Current network protocols impose significant management overhead at this scale let alone supporting seamless VM migration across machines. For example, broadcasting ARPs and updates to every swithch is out of the question for such a network.

The desiderata for the DCN is given as:
R1. Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses as doing so will break pre-existing TCP connections and application-level state.
R2. An administrator should not need to configure any switch before deployment.
R3. Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.
R4. There should be no forwarding loops.
R5. Failures will be common at scale, so failure detection and recovery should be rapid and efficient.

To achieve these desiderata the paper proposes Portland, a scalable fault-tolerant layer-2 routing and forwarding protocol for DCNs. The principal insight behind Portland is the observation that DCNs have a known baseline topology imposed by racks and rows of racks, and need not support an arbitrary topology. Using this observation, Portland enables switches to discover their positions in topology, and assigns internal hierarchical pseudo mac (pmac) addresses to all end hosts, which enable efficient, provably loop-free forwarding, and easy VM migration.

Portland provides layer 2 network forwarding based on flat MAC addresses. A layer 2 fabric imposes less administrative overhead compared to layer 3 forwarding. A layer 3 protocol would require configuring each switch with its subnet information and synchronizing DHCP servers to distribute IP addresses based on the host's subnet. A layer 3 protocol would also render transparent VM migration difficult because VMs must switch their IP addresses if they migrate to a host on a different subnet. While VLANs allow a middle ground between layer 2 and layer 3, VLANs lacks flexibility (as bandwidth resources to be explicitly assigned to each VLAN at each participating switch) and scalability (as each switch must maintain state for all hosts in each VLAN that they participate in).

Fabric manager:
PortLand employs a logically centralized fabric manager that maintains soft-state about network configuration information. The fabric manager is a user process running on a dedicated machine responsible for assisting with ARP resolution, fault tolerance, and multicast. This fabric manager idea is very similar to the open flow networking idea we discussed in an earlier paper. Using soft-state for network configuration and maintaining this at the fabric manager relieves the manual administration requirements.

Pseudo MAC address:
The basis for efficient forwarding and routing as well as VM migration in Portland's design is the hierarchical Pseudo MAC (PMAC) addresses used to identify each end host. The PMAC encodes the location of an end host in the topology. For example, all end points in the same pod will have the same prefix in their assigned PMAC. PMAC address has the form: pod.position.port.vmid. The end hosts remain unmodified, believing that they maintain their actual MAC addresses.

Location discovery protocol:
Portland employs a location discovery protocol so that switches learn whether they are edge, aggregate, or core switches and configure themselves. Protocol works by first the edge switches discovering that they are on the edge (they don't get location discovery packets from majority of ports [which are terminals]). After that aggregation switches discover they are aggregation switches, because they talk to at least one edge switch. Finally, core switches discover they are core switches as they do not talk to any edge switches. This location discovery protocol assumes that the protocol (and protocol timers) is started at the same time in the data center. But it is easy to relax that requirement by writing a self-stabilizing location discovery protocol.

VM migration:
VM keeps the same IP after migration, but the PMAC address needs to be changed. To this end, the VM from the new location sends a message to the fabric manager, and the IP mapping to PMAC is changed at fabric manager. The fabric manager also propogates this change to the switch that VM has left. (The new switch already knows this updated mapping as the packet was sent by the VM via the new switch.)

Portland achieves loop free routing using a simple rule: once the packet starts to go downwards the hierarchy (core to aggregate to edge) it cannot go back upwards. Portland achieves fault-tolerant routing by utilizing the fabric manager as a centralized control for the network; this is very much like what open-flow does.

Comments

Thanks for the post. It was very interesting and meaningful.
Vee Eee Technologies
Unknown said…
Great post! Your data center visuals really helped to hammer home the key points here.

Jackie
Enterprise IT Asset Management Software

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Advice to the young

Foundational distributed systems papers

Distributed Transactions at Scale in Amazon DynamoDB

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Designing Data Intensive Applications (DDIA) Book