garagehq.deuxfleurs.fr/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
Alex Auvolat d3c5a34d27
Some checks failed
continuous-integration/drone/push Build is failing
remove --- in article
2023-12-06 18:03:33 +01:00

11 KiB

+++ title='Thoughts on "Leaderless Consensus"' date=2023-11-30 +++

Consensus algorithms such as Raft and Paxos, which are used in many distributed databases, have notoriously unpredictable performance in low-quality networks that suffer from latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use them and uses only CRDTs. A new paper by Antoniadis et al., Leaderless Consensus, introduces a new category of algorithms that better tolerate the frequent unavailability of a subset of nodes. However, additional research and practical work is required before these results can be put into practice. Read for more details.


As I have said many times when presenting Garage, we have made a point of not using any consensus algorithm in Garage and using only CRDTs, for several reasons. The first, and most important reason, is that all of the consensus algorithms that we know of[^1] (in particular Raft, which is very popular in distributed databases) suffer from unpredictable performance when nodes or the network are unreliable. Even in relatively stable conditions, Raft-like algorithms can still be much slower than CRDTs (as we have shown in some benchmarks) because they elect a leader node and require all operations to pass through the leader, which can become a bottleneck. Other than performance issues, Raft is a complex algorithm and implementing it correctly is a challenging software engineering endeavor that we did not wish to undertake, preferring instead simplicity as a foundational principle to help us write correct software.

However, writing a distributed system such as Garage can be challenging when consensus is not available, as we can only use CRDTs (conflict-free replicated data types) in the code, and we cannot rely on state machine replication. This means that the specific semantics of CRDTs have to be taken into account everywhere in the code, which is often not a problem but sometimes adds some complexity. More importantly, this means that a whole class of features cannot be implemented in Garage, like those that would require some form of locking or exclusive access. In practice, this has been causing us issues on the CreateBucket endpoint, which by definition is meant to exclusively associate a bucket name to a newly created bucket. In current Garage versions, concurrent calls to CreateBucket with the same name may create several buckets and leave Garage in an inconsistent state.

This leads naturally to the following question: is it possible to implement a consensus algorithm that eschews the shortcomings of Raft-like algorithms in unreliable systems? And in particular, is it possible to implement a consensus algorithm that does not elect a leader, and is therefore not sensitive to temporary slowdowns or unavailabilities of individual nodes? A new paper by Antoniadis et al., Leaderless Consensus [PDF], suggests that the answer is yes. However, as with all new research, putting it into practice will take some time and a lot of work. I will discuss in this article practical questions posed by the Leaderless Consensus paper, and further steps that could be taken to advance on these issues.

Please note that the entire content of this article is purely speculative and does not include any positive results. Note also that we are not discussing Byzantine-tolerant systems, which seem to be the main focus of Leaderless Consensus, even though the authors also propose an algorithm for non-Byzantine systems (the one we are interested in).

Main takeaways of Leaderless Consensus

To be able to meaningfully say that an algorithm is leaderless, one has to first determine what leaderless precisely means. The paper starts by offering such a definition, using a network model they call synchronous-k ("synchronous minus k"), where n nodes are running in synchronous steps where at most k nodes might be offline, paused, or otherwise unavailable, at each step. The synchronous-k model has a variant called eventually synchronous-k which seems to better model the behaviour of WAN links on the Internet, although I am not sure of the precise difference between the two. Once the synchronous-k network model is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm that still works (i.e. it terminates, giving a decision), in a synchronous-1 system. Concretely, this means that at any given time, a random node in the network may be disconnected (not always the same one), and the consensus algorithm will be impacted only minimally. In other words, we can say that a leaderless consensus algorithm degrades gracefully in the presence of transient node failures. This "graceful degradation" property, which Raft does not have, seems to be exactly what we are looking for in a potential consensus algorithm that could be added to Garage.

Having given this definition, the paper continues by offering concrete algorithms to implement leaderless consensus. Of particular interest to us, the paper presents in Section 5 a leaderless consensus algorithm, which they call OFT-Archipelago, which works in message passing systems without Byzantine nodes, where the only faults that can occur are message omissions (like messages being dropped by the network, or temporary node crashes). This is exactly the premise made by Garage, so this algorithm could be a good candidate for us. Interestingly, while leaderless consensus is formally defined as a consensus algorithm that works in a synchronous-1 system (i.e. tolerating only one failed node at each step), Archipelago works with up to f < n/2 unavailable nodes at each time steps.

According to the benchmarks in the leaderless consensus paper, while Archipelago has very good throughput (around 50kops/s), the latency of individual operations is generally between 1 or 2 seconds. This seems to be acceptable for application in Garage if used only for administrative operations on buckets and access keys which are relatively rare. From a theoretical point of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario, however it is not clear to me whether there is an upper bound on the termination time, or whether there is a probabilistic analysis of the termination delays that could be made. It is also not very clear to me the link between this algorithm and the FLP impossibility theorem: since Archipelago seems to do things that are forbidden by FLP, it means that the premise of a synchronous-k system is probably in fact much stronger that the network asynchrony assumed by FLP.

Among the other advantages of OFT-Archipelago is the fact that the algorithm seems to be very simple, much more than Raft, as it is described in the paper in only 42 lines of very understandable pseudocode. There is also a BFT variant of Archipelago, which is not of interest to us in the context of Garage as we are making the hypothesis that all nodes are trusted.

Where to go from now?

Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental questions need to be answered, among which:

  • How should Archipelago interact with Garage's use of CRDT data types? Do we have to create a fully separate subsystem for things that are managed under consensus, or can we hopefully share some logic? More precisely, can we use a consensus algorithm simply as a total order broadcast primitive that becomes a mandatory passing point for all modification requests on a set of metadata tables, with those tables still being based on the CRDT table replication and synchronisation library which is currently in use in Garage? In this situation, nodes that come back from a crash can simply catch up on old changes using the Merkle tree algorithm synchronisation algorithm that we already have. Or must we use the consensus algorithm as the only way to broadcast operations and data for the tables that are managed by it? This would mean that we must add specific logic to handle the case of a node coming back from a crash, where it must either download all the log of operations since it was last up, or an entire snapshot of the metadata tables in question. I think this is mostly related to the reason we want to add consensus, and the exact consistency guarantees we are expecting it to provide to us.

  • Can Archipelago be made correct under cluster reconfiguration scenarios? This is linked to the work done for task 3 of the 2023 NLnet project (#495, #667), which focuses on making the Quorum-based algorithm for CRDT updates reliable even when the cluster layout is updated. I will be writing more about this topic in a future blog post, but in a nutshell, the NLnet task is mainly focused on maintaining read-after-write consistency in Garage at all times, which has led us to develop a relatively general framework for modeling algorithm based on quorums. Since Archipelago also guarantees its correctness using a non-empty-intersection-of-quorums property, it could benefit from the work that was originally made on quorums for the CRDT algorithms.

If we obtain satisfactory answers to these questions, the remaining work will be the technical implementation of Archipelago in Garage and its validation:

  • Determine more precisely how the pipelined version of Archipelago is made, as its complete description is not given in the leaderless consensus paper, only a few basic pointers (Section 8.1 of the JPDC version).

  • Implement Archipelago in Rust, ideally under the form of a generic reusable crate that could be used outside of the context of Garage.

  • Do a benchmark of Archipelago vs. existing Raft implementations (for instance the async-raft crate). We should benchmark the algorithms in the following scenarios: stable networking, high latency and jitter, evolutive situation with different phases. My hypothesis is that Archipelago could be slower (in terms of latency, not necessarily in throughput) than Raft in the stable networking scenario, but the other two scenarios would force Raft to reconfigure often (i.e. change leaders), which could be the source of huge performance penalties, which Archipelago would not suffer from.

  • Integrate Archipelago with Garage to solve the CreateBucket issue.

  • To validate our implementation, we would want to test it using automated testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task 3 and I'm starting to understand quite well how it works, so this could be relatively easy.

  • If we want to go further, there is always the possibility of formalizing a proof of our implementation, however I don't know what are the good tools to do this, and in all cases it would be an extreme amount of work.

Please send your comments and feedback to garagehq@deuxfleurs.fr if you have any.


1: We are concerned only with consensus algorithms in the context of closed, trusted systems such as distributed databases, and not in large trustless networks such as blockchains.

Written by Alex Auvolat.