Paper Summary: A self-configurable geo-replicated cloud storage system
This paper is a follow-up to the Pileus work, which I had covered here. Pileus aimed to help developers find a suitable consistency/latency combination for their application deployment. In Pileus, the configuration of primary and secondary nodes is assumed to be fixed (some storage nodes are designated as primary nodes, which hold the master data, while others are secondary nodes). The developer uses an SLA to state ranked preferences for the latency and consistency of reads that make the most sense for the application, and using this SLA, Pileus dynamically tunes the application's performance by deciding which read to forward to which replica/master.
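To make the SLA idea concrete, here is a minimal sketch of what a ranked consistency/latency SLA might look like. The SubSLA fields and the example values are illustrative assumptions, not Pileus's actual API.

```python
from dataclasses import dataclass

# Sketch of a consistency-based SLA in the style of Pileus/Tuba: a ranked list
# of subSLAs, each pairing a consistency level and a latency bound with the
# utility the application gets if that subSLA is satisfied. Names are illustrative.

@dataclass
class SubSLA:
    consistency: str      # e.g. "strong", "read_my_writes", "eventual"
    latency_ms: int       # acceptable round-trip latency bound
    utility: float        # value to the application if this subSLA is met

# Example: prefer strong reads under 100 ms, fall back to read-my-writes
# under 150 ms, and accept eventual consistency otherwise.
example_sla = [
    SubSLA("strong", 100, 1.0),
    SubSLA("read_my_writes", 150, 0.7),
    SubSLA("eventual", 300, 0.1),
]
```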
This paper introduces a follow-up system, called Tuba, where the configuration is not fixed and can be changed on the fly. Tuba extends Pileus to address the problem of finding an optimal configuration of primary and secondary replicas that maximizes the overall utility and minimizes the cost for the application.
Tuba extends Pileus with a configuration service (CS) delivering the following capabilities:
1. performing a reconfiguration periodically for different tablets, and
2. informing clients of the current configuration for different tablets.
Data is organized into tablets as access units, and each tablet is assigned to primary and secondary nodes. Of course different tablets may be assigned to different primary and secondary nodes. In order for the CS to configure a tablet's replicas to maximize the overall utility, the CS must be aware of the way the tablet is being accessed globally. Therefore, all clients in the system periodically send their observed latency and the hit and miss ratios of their SLAs to the CS.
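As a rough illustration of this client-side bookkeeping, the following sketch aggregates observed latencies per site and per-subSLA hit/miss counts, and periodically reports them to the CS. The class and the cs_client.report call are hypothetical names for illustration, not Tuba's actual client library.

```python
import time
from collections import defaultdict

# Hypothetical sketch of the per-client bookkeeping described above: record
# observed latencies per replica site and hit/miss counts per subSLA, and
# periodically flush them to the configuration service (CS).

class ClientStats:
    def __init__(self, cs_client, report_interval_s=60):
        self.cs_client = cs_client
        self.report_interval_s = report_interval_s
        self.latencies = defaultdict(list)   # site -> observed latencies (ms)
        self.hits = defaultdict(int)         # subSLA index -> hit count
        self.misses = defaultdict(int)       # subSLA index -> miss count
        self.last_report = time.time()

    def record_read(self, site, latency_ms, subsla_index, hit):
        self.latencies[site].append(latency_ms)
        (self.hits if hit else self.misses)[subsla_index] += 1
        if time.time() - self.last_report >= self.report_interval_s:
            self.flush()

    def flush(self):
        # Ship the aggregated observations to the CS and reset local state.
        self.cs_client.report(dict(self.latencies), dict(self.hits), dict(self.misses))
        self.latencies.clear(); self.hits.clear(); self.misses.clear()
        self.last_report = time.time()
```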
Once a new configuration is decided, one or more of the following operations are performed as the system changes to the new configuration: (i) changing the primary replica, (ii) adding or removing secondary replicas, and (iii) changing the synchronization periods between primary and secondary replicas.
Configuration service (CS)
The CS is a centralized service. To improve utility, the CS selects a new configuration by first generating all configurations that satisfy a list of defined constraints. To generate candidate configurations, it is free to apply one or more of the following operations (a sketch of this enumeration appears after the list):
+ Adjust the Synchronization Period
+ Add/Remove Secondary Replica
+ Change Primary Replica
+ Add Primary Replica
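Here is a rough sketch of how such candidate generation might look. The configuration helper methods (with_secondary, with_primary_added, etc.) and the satisfies_constraints predicate are assumptions for illustration, not Tuba's actual code.

```python
import itertools

# Illustrative sketch: apply the operations listed above to the current
# configuration to enumerate candidates, then keep only those that satisfy
# the application's constraints.

def candidate_configs(current, sites, sync_periods, satisfies_constraints):
    candidates = []
    # Adjust the synchronization period of an existing secondary replica.
    for site, period in itertools.product(current.secondaries, sync_periods):
        candidates.append(current.with_sync_period(site, period))
    # Add or remove a secondary replica.
    for site in sites:
        if site in current.secondaries:
            candidates.append(current.without_secondary(site))
        elif site not in current.primaries:
            candidates.append(current.with_secondary(site))
    # Change the (solo) primary replica, or add another primary (multi-primary mode).
    for site in sites:
        if site not in current.primaries:
            candidates.append(current.with_primary_changed_to(site))
            candidates.append(current.with_primary_added(site))
    return [c for c in candidates if satisfies_constraints(c)]
```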
For example, for a missed subSLA with strong consistency, two potential new configurations would be: (i) creating a new replica near the client and making it the solo primary replica, or (ii) adding a new primary replica near the client and making the system run in multi-primary mode. As another example, for a social network application that spans Brazil and India, the CS may decide to swap the primary and secondary replica roles to improve utility. During peak times in India, the secondary replica in South Asia becomes the primary replica. Likewise, during peak times in Brazil, the replica in Brazil becomes primary.
The constraints for configurations may include (i) Geo-replication factor, (ii) Location, (iii) Synchronization period, and (iv) Cost constraints. With the cost constraint, the application developers can indicate how much they are willing to pay (in terms of dollars) to switch to a new configuration. For instance, one possible configuration is to put secondary replicas in all available datacenters. While the gained utility for this configuration would be great, the cost of this configuration is likely unacceptably large.
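A hedged sketch of how an application developer might express these constraints; the field names and defaults are illustrative, not Tuba's actual configuration schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative constraint record covering the four categories above:
# geo-replication factor, location, synchronization period, and cost.

@dataclass
class ReconfigurationConstraints:
    min_replicas: int = 2                       # geo-replication factor (lower bound)
    max_replicas: int = 4                       # geo-replication factor (upper bound)
    allowed_sites: Optional[List[str]] = None   # location constraint; None means any datacenter
    max_sync_period_s: int = 300                # bound on primary-to-secondary sync period
    max_reconfiguration_cost: float = 50.0      # dollars the developer is willing to pay
```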
For each configuration possibility that meets the constraints, the CS computes the expected gained utility and the cost of reconfiguration. The CS considers the following costs for a new configuration:
+ Storage: the cost of storing a tablet in a particular site.
+ Read/Write Operation: the cost of performing read/write operations.
+ Synchronization: the cost of synchronizing a secondary replica with a primary one.
Finally, the CS chooses the configuration that offers the highest utility-to-cost ratio, and executes the reconfiguration operations required to transition to the new one.
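The selection step can be sketched as follows, assuming placeholder functions for the CS's utility and cost models (the actual models are not spelled out here).

```python
# Sketch of the selection step described above: for each candidate that meets
# the constraints, estimate the utility gained and the cost of moving to it,
# then pick the configuration with the best utility-to-cost ratio.

def choose_configuration(current, candidates, expected_utility_gain, reconfiguration_cost):
    best, best_ratio = current, 0.0
    for config in candidates:
        gain = expected_utility_gain(config)          # from client-reported SLA hit/miss ratios
        cost = reconfiguration_cost(current, config)  # storage + read/write + synchronization costs
        if cost > 0 and gain / cost > best_ratio:
            best, best_ratio = config, gain / cost
    return best
```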
Client execution modes
Clients need to avoid two potential safety violations: (i) performing a read operation with strong consistency on a non-primary replica, or (ii) executing a write operation on a non-primary replica. Based on the freshness of a client's view, the client is either in fast or slow mode. A client is in fast mode for a given tablet if it knows the locations of the primary and secondary replicas and it is guaranteed that the configuration will not change in the near future. Whenever a client suspects that the configuration may have changed, it enters slow mode until it refreshes its local cache.

Fast Mode. When a client is in fast mode, read and single-primary write operations involve a single round-trip to one selected replica. No additional overhead is imposed on these operations. Multi-primary write operations use a three-phase protocol whether the client is in fast or slow mode.
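A minimal sketch of the fast-mode fast path, with hypothetical view/replica helpers standing in for Tuba's client library:

```python
# Sketch of fast-mode operations: with a fresh configuration view, a read goes
# to one selected replica and a single-primary write goes to the primary, each
# in a single round-trip. All helper names are assumptions for illustration.

def fast_mode_read(view, key, consistency, pick_replica):
    # Strong-consistency reads must pick a primary; relaxed reads may pick
    # any sufficiently fresh secondary.
    replica = pick_replica(view, consistency)
    return replica.read(key)

def fast_mode_write(view, key, value):
    if len(view.primaries) == 1:
        # Single-primary write: one round-trip to the primary replica.
        return view.primaries[0].write(key, value)
    # Multi-primary writes use a three-phase protocol in both fast and slow mode.
    return multi_primary_write(view.primaries, key, value)

def multi_primary_write(primaries, key, value):
    # Placeholder: the three-phase multi-primary write protocol is not shown here.
    raise NotImplementedError
```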
Slow Mode. Slow mode has no effect on read operations with relaxed consistency. On the other hand, since read operations with strong consistency must always go to a primary replica, the client needs to validate that the replica that answered a strong-consistency read is still a primary replica. If not, the client retries the read operation. Write operations are more involved when a client is in slow mode. Any client in slow mode that wishes to execute a write operation on a tablet needs to take a non-exclusive lock on the tablet's configuration before issuing the write operation. Correspondingly, the CS needs to take an exclusive lock on the configuration if it decides to change the set of primary replicas. This locking procedure is required to ensure the linearizability of write operations.
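The locking protocol can be sketched as follows, assuming a hypothetical configuration lock object with shared and exclusive modes; the replica API is likewise an assumption, not Tuba's actual interface.

```python
# Sketch of the slow-mode locking described above: writers take a shared
# (non-exclusive) lock on the tablet's configuration, while the CS takes an
# exclusive lock to change the set of primary replicas.

def slow_mode_write(config_lock, refresh_view, key, value):
    with config_lock.shared():        # blocks while the CS holds the exclusive lock
        view = refresh_view()         # re-fetch the current configuration from the CS
        return view.primaries[0].write(key, value)   # then write as in fast mode

def change_primary_replicas(config_lock, install_new_configuration):
    # Executed by the CS: no write can be in flight while the set of primaries
    # changes, which is what preserves linearizability of writes.
    with config_lock.exclusive():
        install_new_configuration()
```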
Evaluation
They implemented Tuba to extend Microsoft Azure Storage, with broad consistency choices (as in Bayou), consistency-based SLAs (as in Pileus), and a novel replication configuration service. Their evaluation compared against a statically configured system. An experiment with clients distributed in datacenters around the world shows that reconfiguring every two hours increases the fraction of reads guaranteeing strong consistency from 33% to 54%. This confirms that automatic reconfiguration can yield substantial benefits that are realizable in practice.
Comments