OceanBase: A 707 Million tpmC Distributed Relational Database System
This paper appeared in VLDB 2022. It presents the design of OceanBase, a scale-out multitenant relational database deployed on tens of thousands of servers at Alipay.com.
The design of OceanBase is not impressive. It looks like OceanBase uses a standard run-of-the-mill Paxos-groups architecture. Maybe there is some interesting work going on in the integration with storage. Unfortunately, the paper is poorly written. It talks and boasts about many things, some repeatedly, but it doesn't give enough detail about any of them. For example, even the type of concurrency control protocol used is not discussed in the paper. It turns out OceanBase uses an MVCC-based approach to concurrency control. (DBDB's catalog on OceanBase gives a good overview of OceanBase.) The paper would be much more useful and enjoyable if it focused on a certain component of OceanBase and did a thorough walkthrough of that part. For those interested in diving deeper, OceanBase has a GitHub repo here. The open-source version is limited in features, and is not what is running at Alibaba/Alipay.
Don't get my comments above wrong. The decade of engineering work on OceanBase and its scale and production use at Alipay are very impressive. Here are other things I found interesting about OceanBase:
- compaction in storage (section 3.3)
- row caches and block caches in storage (Fig 3)
- different replica types (full, data, log)
- tables are partitioned explicitly by the user
The paper puts the TPC-C benchmark performance in the title, and talks about this repeatedly in the body. But this is more of a gimmick in my opinion. They heavily optimized for the benchmark to get that number. That is a lot of warehouses. "A total of 2,360 Alibaba Cloud ECS servers have been employed in this benchmark test. There are 400 remote terminal emulator (RTE) servers to emulate the total 559,440,000 users, and 55,944,000 warehouses."
Last week, I was listening to the Corecursive podcast, and there was a discussion about how Alibaba does things their own way. In relation to OceanBase, I found the MySQL part of the discussion especially interesting. (This is also not mentioned in the paper.)
"They build on top of the knowledge they learn from Facebook about forking the storage engine underneath MySQL to RocksDB and then this guy were running into problem where RocksDB is an LSM-tree database storage engine. The problem with that is that the compaction of these different layer of logs was getting quite slow and consume a lot of CBU. So guess what they did? They offload that computation to an FPGA. These guys were running custom hardware, custom chips so that their database can handle the lot better and that’s just massively impressive. That’s just not something that you can find in any tech company."
Below I provide an abridged form of the paper as a summary, without much commentary from me.
Design
The design goals of OceanBase are:
- Fast scale-out (scale-in) on commodity hardware
- Cross-region deployment and fault tolerance
- Compatibility with MySQL
Each OceanBase cluster consists of several zones. These zones can be restricted to one region or spread over multiple regions. In each zone, OceanBase is deployed as shared-nothing. Transactions are replicated among the zones using Paxos. OceanBase supports cross-region disaster tolerance and zero data loss (Recovery Point Objective = 0, Recovery Time Objective <= 30 seconds).
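To make the zero-data-loss claim concrete, here is a minimal sketch (my own illustration, not OceanBase code) of the majority rule behind it: a transaction's redo log must be durably persisted on a majority of the replicas in its Paxos group, which are spread across zones, before the transaction can commit.

```python
def majority_acked(ack_flags):
    """ack_flags[i] is True if replica i durably persisted the redo log."""
    return sum(ack_flags) * 2 > len(ack_flags)

# Example: a 5-replica Paxos group spread over 5 zones. Three durable acks
# suffice, so losing one zone (or even two) loses no committed data (RPO = 0).
print(majority_acked([True, True, True, False, False]))   # True: can commit
print(majority_acked([True, True, False, False, False]))  # False: cannot commit yet
```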
Database tables, especially the large ones, are partitioned explicitly by the user and these partitions are the basic units for the data distribution and load balance.
In each node, OceanBase is similar to a classical RDBMS. On receiving a SQL statement, the node compiles it to produce a SQL execution plan. If this is a local plan, it is executed by this OceanBase node. Otherwise, the plan is executed using the two-phase commit protocol, with this OceanBase node acting as the coordinator. A transaction is committed only after its redo logs persist on a majority of replicas in each of the corresponding Paxos groups.
Among all Paxos groups of an OceanBase cluster, one Paxos group is in charge of the management of the cluster, e.g., load balance, adding (removing) nodes, failure detection of nodes, and fault tolerance for failed nodes.
OceanBase implements a plan cache. It uses a fast/lightweight parser to do only lexical analysis, and then attempts to match the statement against an existing plan in the plan cache. It does not require grammar/syntax checking as long as it can match a statement already in the plan cache. This approach is 10 times faster than using a normal parser.
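Here is a hedged sketch of how I read the plan-cache trick; the function names and the literal-parameterization detail are my own assumptions, not taken from the paper:

```python
import re

plan_cache = {}

def lightweight_parameterize(sql):
    # Lexical pass only (no syntax checking): replace string and numeric
    # literals with '?', so statements that differ only in literals share a key.
    return re.sub(r"('[^']*'|\b\d+\b)", "?", sql)

def get_plan(sql, full_parse_and_optimize):
    key = lightweight_parameterize(sql)
    plan = plan_cache.get(key)
    if plan is None:                       # miss: pay the full parse/optimize cost once
        plan = full_parse_and_optimize(sql)
        plan_cache[key] = plan
    return plan

# Usage: the second statement hits the cached plan without a full parse.
get_plan("SELECT * FROM t WHERE id = 42", lambda s: "plan-for-t-by-id")
print(get_plan("SELECT * FROM t WHERE id = 7", lambda s: "(never called)"))
```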
OceanBase is multitenant. It includes two categories of tenants: system and ordinary tenants. The system tenant maintains the system tables and performs system maintenance and management actions. An ordinary tenant can be regarded as a MySQL instance created by the system tenant. As such, it has its own independent schema and variables, and it can create its own users, databases, and tables.

OceanBase uses Resource Units (including CPU, memory, IOPS, and disk size) for allocating resources to tenants. OceanBase does not rely on Docker or virtual machine technology; it implements resource isolation within the database in a more lightweight manner.
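As a toy illustration of the resource-unit idea (my own sketch; this is not OceanBase's actual API or syntax), each tenant gets a unit specification, and the database itself enforces the limits instead of delegating to containers or VMs:

```python
from dataclasses import dataclass

@dataclass
class ResourceUnit:
    cpu_cores: float   # CPU share reserved for the tenant
    memory_gb: int     # memory budget
    iops: int          # I/O operations per second
    disk_gb: int       # disk size

# Hypothetical tenants with different unit specs on the same cluster.
tenants = {
    "tenant_a": ResourceUnit(cpu_cores=4, memory_gb=16, iops=10_000, disk_gb=500),
    "tenant_b": ResourceUnit(cpu_cores=2, memory_gb=8, iops=5_000, disk_gb=200),
}

def fits_on_node(units, node_cpu, node_mem_gb):
    """Placement check: do the tenants' reservations fit within a node's capacity?"""
    return (sum(u.cpu_cores for u in units) <= node_cpu and
            sum(u.memory_gb for u in units) <= node_mem_gb)

print(fits_on_node(tenants.values(), node_cpu=8, node_mem_gb=32))  # True
```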
Storage engine
OceanBase has a Log-Structured Merge-tree (LSM-tree) storage system. The data is grouped into two parts: the static baseline data (placed in SSTables) and the dynamic incremental data (placed in MemTables). An SSTable is read-only and will not be modified after it is generated. A MemTable supports reading and writing, and is stored in memory. CRUD operations are first written to the MemTable. When the MemTable reaches a certain size, it is dumped to disk and becomes an SSTable. While querying, OceanBase queries the SSTable and MemTable separately, merges the query results, and returns the merged results to the SQL layer. A block cache and a row cache are implemented in memory to reduce random reads of the baseline data.

Besides caching data blocks, OceanBase also caches rows. Row caching significantly accelerates single-row queries. For "empty checks" where rows do not exist, Bloom filters can be constructed and cached. The bulk of OLTP business operations are small queries. Through small-query optimization, OceanBase achieves performance close to that of an in-memory database.
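A minimal sketch of the single-row read path described above, assuming a simplified model where the MemTable holds full rows rather than deltas (names and structure are mine, not the engine's):

```python
def read_row(key, memtable, sstable, row_cache, bloom_filter):
    # 1. Dynamic incremental data in memory takes precedence over baseline data.
    if key in memtable:
        return memtable[key]
    # 2. The row cache accelerates single-row reads of the baseline data.
    if key in row_cache:
        return row_cache[key]
    # 3. A Bloom filter answers most "row does not exist" checks without disk I/O.
    if bloom_filter is not None and key not in bloom_filter:
        return None
    # 4. Fall back to the SSTable (block cache / disk read in the real system).
    row = sstable.get(key)
    if row is not None:
        row_cache[key] = row
    return row

# Usage: a set stands in for the Bloom filter in this toy example.
print(read_row("k1", {"k1": "v1"}, {}, {}, set()))        # served from the MemTable
print(read_row("k2", {}, {"k2": "v2"}, {}, {"k2"}))       # served from the SSTable
```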
The basic read unit of OceanBase, called a microblock, has a relatively small size, e.g., 4KB or 8KB, and it is configurable by a database administrator. In contrast, the write unit of OceanBase, called a macroblock, is 2MB. The macroblock is also the basic unit of allocation and garbage collection in the storage system. Many microblocks are packed into a macroblock, and this makes disk utilization more efficient at the cost of a larger write amplification.
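A back-of-the-envelope illustration of the sizing: the block sizes come from the paper, but the packing logic below is my own simplification.

```python
MICROBLOCK_SIZE = 4 * 1024         # basic read unit, configurable (e.g. 4KB or 8KB)
MACROBLOCK_SIZE = 2 * 1024 * 1024  # basic write/allocation/GC unit, fixed at 2MB

def pack_microblocks(microblocks):
    """Group microblocks (given as byte sizes) into 2MB macroblocks for writing."""
    macroblocks, current, used = [], [], 0
    for size in microblocks:
        if current and used + size > MACROBLOCK_SIZE:
            macroblocks.append(current)   # macroblock is full; start a new one
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        macroblocks.append(current)
    return macroblocks

# Roughly 512 four-KB microblocks fit into one 2MB macroblock.
print(len(pack_microblocks([MICROBLOCK_SIZE] * 1024)))  # 2 macroblocks
```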
In OceanBase, a complete replica of a partition or table consists of the baseline, the mutation increment, and the redo log; such a complete replica is called a full replica. Besides the full replica, there are also data replica and log replica types.
Transaction processing engine
A table partition is the basic unit for the data distribution, load balance, and Paxos synchronization. Typically, there is a Paxos group for each partition.
OceanBase introduces the Paxos distributed consensus protocol into 2PC, enabling distributed transactions with automatic fault tolerance. Each participant in the two-phase commit is backed by multiple replicas, which are kept available through the Paxos protocol. When a participant node fails, the Paxos protocol can quickly elect another replica to replace the original participant, continue providing service, and restore the state of the original participant. This way, the outcome of the distributed transaction is determined and the two-phase commit protocol is advanced to completion.
As shown in Figure 6, the coordinator replies to the client before completing the commit phase with the participants. This results in two, instead of three, Paxos synchronizations in a two-phase commit, and the user-perceived delay of a two-phase commit is reduced to one Paxos synchronization. If the coordinator crashes before commit, its state can be reconstructed from the local states of all the participants during recovery. In a failure scenario, the OceanBase 2PC protocol advances the transaction to its final outcome depending on whether all participants of the transaction have completed the prepare phase.
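Here is a hedged sketch of that commit flow as I understand Figure 6; the names are mine, and the failure-recovery machinery the paragraph mentions is omitted:

```python
def two_phase_commit(participants, client):
    # Phase 1: prepare. Each participant persists its prepare record via Paxos.
    for p in participants:
        if not p.prepare_and_paxos_persist():      # one Paxos synchronization
            for q in participants:
                q.abort()
            client.reply("ABORT")
            return
    # Early reply: the outcome is now decided and recoverable from the
    # participants' persisted prepare states, so the user-visible latency is
    # roughly one Paxos synchronization.
    client.reply("COMMIT")
    # Phase 2: commit, driven asynchronously (the second Paxos synchronization).
    for p in participants:
        p.commit_and_paxos_persist()
```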
For compatibility and performance considerations, OceanBase supports read committed (RC) and snapshot isolation (SI), and makes RC the default isolation level. A normal read (i.e., a general select query) happens only on the leader. OceanBase also supports weakly consistent queries, which need to be explicitly specified by the user through a hint in SQL.
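Since the paper does not spell out the MVCC details, here is a generic snapshot-read sketch (my illustration, not OceanBase's internals): under SI a transaction reads the latest committed version of each row that is no newer than its snapshot timestamp, while under RC each statement takes a fresh snapshot.

```python
def snapshot_read(versions, snapshot_ts):
    """versions: list of (commit_ts, value) pairs sorted by commit_ts."""
    visible = None
    for commit_ts, value in versions:
        if commit_ts <= snapshot_ts:
            visible = value   # latest committed version not newer than the snapshot
        else:
            break
    return visible

# Example: a row written at ts=5 and ts=9. A transaction with snapshot_ts=7 sees
# the ts=5 version, even if the ts=9 write committed before this read executed.
print(snapshot_read([(5, "v1"), (9, "v2")], snapshot_ts=7))  # "v1"
```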
Discussion
The paper has a Lessons Learned section. I found these parts noteworthy.
- The application layer should not use a database system as a key-value store, and should instead rely on the advanced features of the database;
- The stored procedure (written in SQL) still has great value for OLTP applications;
- For applications using distributed databases, every transaction and every SQL statement should have a timeout set, because of the higher failure rate of distributed-system networks and nodes.
Comments
1. They wanted to build a KV system to replace MySQL etc. at the very beginning. But,
2. Application developers wanted SQL [in the company].
3. Oracle and MySQL were widely used in sharded deployments [in the company].
4. Classical applications running on Oracle have many stored procedures [out of the company].