Byzantine Fault Tolerant Consistent Hash Rings

One of the recurring design patterns in Apollo is the use of a cleverly constructed random graph structure from the original Fireflies paper. I discussed this structure in my post on consensus in Delos. In that post, I focused on the use of this graph structure to form byzantine fault tolerant subset within a larger membership context. In this post, we’re still using the ring structure of the fireflies graphs, but I want to focus more on the consistent hashing properties of these rings and how Apollo leverages this property to solve some rather tricky design problems.

Consistent Hash Rings

The basic concept of a consistent hash ring is, of course, a ring. That is, a circle. It has no beginning and no end. A consistent hashing orders the hashed members on this ring. When the membership changes, we don’t have to rehash what hasn’t changed.

In the figure above, you can see Nodes 1-4 mapped to positions on the ring. Also note that there are names that are interspersed between these nodes. In essence, this is just a number line and the items just map to a particular position (via its hash); this is a very simple structure and the key point is the wrap around that makes this a ring and give it some interesting properties.

As you can see from the diagram above, we can use this ring mapping to partition the ring into regions that are “owned” by different nodes. In this case, we see 4 nodes – 1 through 4 – and can follow a simple rule that maps the hashed names to these nodes. The rule is simple and ammounts to mapping nodes that are the successors of the hashed names as the “owner” of these names. It’s just simply following the ring around to the right from the hashed name to the first node one encounters.

Circling back to the Fireflies paper, you can see the same structure at work. In Fireflies, the rings map the members (nodes) in the group to the rings by hashing their IDs to provide the ring position of the member.

In this diagram from the Fireflies paper, you can see members A-G mapped to a ring by some hash function. In Fireflies, there are multiple rings (see my previous post on consensus as well as the Fireflies paper, of course) and each of these rings has an ID – simply the index number of the ring, for example. The upshot of this is that because we’re including the ring ID in the calculation of the hash, we end up “randomly” distributing the same membership set to different positions across different rings as you can see in the diagram below.

In this diagrm we have the same membership set hashed to 3 different rings, but notice the members are in different positions on different rings due to inclusion of the ring ID in the calculation of the member’s hash, and thus position.

As I discussed in my post on consensus in Apollo, we can use these rings to provide us with Byzantine Fault Tolerant subsets of the larger member set. This is possible due to the particular way these rings are determined. Like the examples above where the names are hashed to positions in the ring we can use the same principle to find byzantine fault tolerant subsets from any hashed position.

This diagram illustrates how to construct a BFT subset of the members by using the multiple rings and simply finding the successor node/member of a given hash position.

In the diagram above, we select nodes D, E and A based on some chosen position within the ring by selecting the next successor node. And given the way these rings are constructed, these members turn out to be a “BFT subset” of the members in the group.

The practical application of these BFT subsets is that we simply don’t need a byzantine fault tolerant set that includes all of the members. Due to the way these rings are constructed, based on the number of members – i.e. the cardinality of the membership set. That is, the number of these rings – i.e. 3 in the examples I’ve been using – only grows logarithmically with the number of members. This means that the cardinality of these BFT subsets is far, far smaller than the cardinality of the total members in the group.

This becomes a rather useful property that we can leverage in rather interesting ways throughout Apollo. The particular use that is of interest is the consensus mechanism in Apollo’s implementation of the Fireflies.

Consensus In Group Membership

One of the very nice features of Apollo’s implementation of Fireflies is that it provides a stable and consistent membership View that is shared by all of the membership. That probably sounds kind of trivial as of course the members of a group know who is in the group and all agree on this membership set! This is, of course, non trivial as getting all the members to agree on the same membership set involves a consensus agreement by these members on that memberhip set.

This consensus view model of membership in Apollo is loosely based on the Rapid Membership service. The principle idea is that this shared membership view is also consistent and stable. This last point really turns out to be essential as without a stable view of group membership, you really can’t build much useful on top of this membership view.

I recommend that you read the Rapid paper for some really interesting discussion and data around various membership models that are available out there in the wild. But the upshot is that we can leverage these BFT subsets as delegates the group collectively uses for the View Consensus in Delos Fireflies.

Accommodating Failure

When implementing any of these consensus schemes – whether the “fast Paxos” of Rapid, or my hair brained committee delegate scheme in Delos – one quickly runs into the issue of failure. After all, this is precisely what a view membership tracks – additions and removals. Rapid deals with this using a 3/4 agreement of the members. And if this is not possible, then backs off to a leader driven consensus to establish the view membership.

Needless to say, that’s pretty damn expensive at scale and so Delos uses the aforementioned Byzantine Fault Tolerant subsets. While this allows Delos Fireflies to scale exceptionally well, the problem becomes that the members of these BFT subsets may fail – of course. And so while one could use 3/4 agreement…. And this is where the Consistent Hashing comes in.

A Hash In Time

In Delos, I simply use the ring structure to choose a new member of the BFT subset. This operation can be done locally by every node. My rationale is that one condition of the consensus is determining failed nodes. And in the Fireflies protocol that the View implements, knowledge of the failed nodes should be “universal” across the group. That is, failures – in Fireflies – are global knowledge.

And so everyone left alive when consensus view change occurs can make the same calculation that everyone else is making (with high probability, of course!). And that’s just following the ring around to get the live member. And, of course, everyone has to agree on this committee subset, so everything works out pretty well. It’s a fast and efficient one shot consensus mechanism for a stable membership View.

And as I’ve said previously, this View consensus (and integrity) is what underlies the rest of the system that is built upon it. I think it’s pretty cool to use the properties of consistent hashing to determine what amounts to an asynchronous committee consensus protocol. And BFT, no less 😎

Tensegrity