Distributed Systems in Engineering
NuRaft: a Lightweight C++ Raft Core
Aug 14, 2019
By: Gene Zhang and Jung-Sang Ahn
0
We are excited to announce the public release of NuRaft, a lightweight C++ Raft core, under the Apache 2.0 open source license. NuRaft is based on the cornerstone C++ Raft implementation, but with various additions and changes, and is the result of over two years of development and testing for production use within eBay for storage server data replication. This post discusses what NuRaft is, and how it can be used.

We are excited to announce the public release of NuRaft, a lightweight C++ Raft core, under the Apache 2.0 open source license. NuRaft is based on the cornerstone C++ Raft implementation, but with various additions and changes, and is the result of over two years of development and testing for production use within eBay for storage server data replication. This post discusses what NuRaft is, and how it can be used.

A consensus protocol, such as Raft, can be used for strongly consistent replication by servers to achieve high availability and high read throughput with replicas. It can also be used to coordinate distributed computing agents with a globally consistent ordering. University researchers as well as developers at companies who need an efficient C++ replication protocol for their data replication or distributed log stores to support distributed transactions, as we do at eBay, can benefit from it. The NuRaft protocol requires at least three servers or virtual machines to tolerate one failure, although you can run three processes on a single machine for testing and learning purposes.

NuRaft: A critical building block in eBay's data infrastructure

eBay's NuData initiative aims to develop and operate new-generation, cloud-native database services for eBay's core businesses, leveraging open source and contributing to the open source community. NuRaft is the first graduate from our overall effort, which also includes a LSM-tree-based storage engine (Jungle), a replicated high-performance log store (which works with distributed knowledge graph store Akutan), a multi-purpose log player, a multi-master NuKV service (with optional session guarantees), a distributed transaction protocol (GRIT), a transactional graph store (NuGraph), a transactional doc store, global secondary indexes (GSIs), CDCStreamer for change data capture, NuSQL, and machine-learning-based anomaly detection and prediction (GRANO - demo at VLDB 2019).

Over two years ago, when we started to look for a robust, efficient data replication component, we analyzed a few open source choices. Given our development environment, we needed a C or C++ implementation of a consensus protocol, such as Multi-Paxos, or Raft. After some hands-on prototyping and evaluation using some of the options, we settled on the cornerstone C++ Raft implementation. It's lightweight, has the least dependencies (on ASIO only), and yet is functionally complete in terms of cluster management, recovery from peers, in addition to including basic replication features. It provides an interface for log store plugin, state machine, as well as configuration parameters.

The following diagram illustrates the interface and its main calls. As you can see, NuRaft allows flexible design for users through User-defined Log Store and User-defined State Machine with a fixed set of APIs.

Some significant enhancements in NuRaft over cornerstone Raft include:

  • SSL/TLS support.
  • Logical snapshot: original snapshot recovery from peers is based on the assumption that a snapshot is a file. We added a logical snapshot interface.
  • Leader election: pre-vote and priority for leadership. Pre-vote addresses a problem of annoying members that interrupt the current legitimate leader from an isolated part of the network.
  • Learner only member.
  • Custom quorum sizes for commit and leader election.
  • Asynchronous replication.

The following diagram depicts the replication flows for both strongly consistent and asynchronous replications. The main differences between the two are when to execute the requests and return results to clients and quorum sizes. For strongly consistent replication, the requests are executed at commit time, while for async replication, the requests are executed at pre-commit time before real replication occurs.

What are the purposes of the enhancements?
  1. Logical snapshot support
    We needed a snapshot interface that is not file only, but a snapshot in a KV store. Physically creating a file as a snapshot is quite expensive and disruptive to the normal operations in many application scenarios. Instead, logical snapshots would be much efficient. For example, when we started implementing the replicated storage servers, we used ForestDB as the underlying storage engine, and it provides a logical snapshot capability. That's the reason we introduced this support.
    Alternatively, if a snapshot can be established from a copied file that was changing but contains logical snapshots, shipping a changing file as a snapshot could work, but we didn't take that approach.
  2. Avoiding disruption of a leader by an unstable member
    The pre-vote mechanism is to avoid disrupting a leader by an unstable member. The original protocol will force a leader to step down when a follower sees a new term and agreeing to a leader vote request from a member disconnected from the leader. During pre-vote, the voter checks if it received heartbeat from the leader recently before its election timer expires, that means the leader is possibly alive. Then the node rejects the pre-vote request and the vote initiator will not move forward. With pre-vote, a legitimate leader will not step down just because a follower is temporarily out of touch with the leader while the majority of the members are working properly.
  3. Reading latest values from the leader requires non-overlapping leadership
    The leadership expiration is to avoid, in rare cases of a leader being isolated from the rest, more than one leader acting at the same time. It is important to avoid multiple simultaneous leaders when we use reads from the leader as the latest values. To tackle this problem, we introduce a parameter leadership_expiry_ to set the expiration time of the current leader who needs to receive responses from quorum nodes within this expiration time to maintain its leadership.
  4. Priority and learner-only membership
    Sometimes you would like to have leaders be in certain data centers, or have a learner-only member (with priority 0) for backup purposes. Priority is used to achieve that, although it's not guaranteed to strictly follow the priority. A member with priority 0 will be guaranteed not to participate in a leader election.
  5. Custom quorum sizes for commit and leader election
    This helps to achieve the same functionality of Flexible Paxos. Overlap between the leader election quorum and replication commit quorum as in flexible Paxos is not enforced. It's your responsibility to choose the right values for your scenarios.
  6. Asynchronous replication
    Since NuRaft is lightweight and efficient, we want to use it for replications with lower consistency requirements and lower latency. Asynchronous replication can be achieved using NuRaft by calling cluster_config::set_async_replicatoin(true). And we can use the same code to achieve tunable consistency levels. Note that asynchronous replication may cause data loss.
Get started

If you are ready to explore NuRaft, go for a quick tutorial in the repo.

Your contributions are welcome

Our goal is to make NuRaft the best standalone C++ Raft core open source implementation in the industry. Your contributions, in forms of bug reports, pull requests, feature requests, test cases or examples, or benchmarks, are all welcome.

Tags:Cloud, Data Infrastructure and Services, Distributed Systems

Previous Article:API Mindset at eBay

Attachments

  • Original document
  • Permalink

Disclaimer

eBay Inc. published this content on 14 August 2019 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 14 August 2019 19:06:02 UTC