diff --git a/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf b/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf new file mode 100644 index 0000000..0e6292e Binary files /dev/null and b/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf differ diff --git a/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md b/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md new file mode 100644 index 0000000..1dce0e6 --- /dev/null +++ b/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md @@ -0,0 +1,196 @@ ++++ +title='Thoughts on "Leaderless Consensus"' +date=2023-11-30 ++++ + +*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases, +have notoriously unpredictable performance in low-quality networks that suffer from +latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use +them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151), +introduces a new category of algorithms that better tolerate the frequent +unavailability of a subset of nodes. However, additional research and practical work is required before +these results can be put into practice. Read for more details.* + + + +--- + +As I have said many times when presenting Garage, we have made a point of not +using any consensus algorithm in Garage and using only CRDTs, for several +reasons. The first, and most important reason, is that all of the consensus +algorithms that we know of[^1] (in particular Raft, which is very popular in +distributed databases) suffer from unpredictable performance when nodes or the +network are unreliable. Even in relatively stable conditions, Raft-like +algorithms can still be much slower than CRDTs (as we have shown in some +[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network)) +because they elect a leader node and require all operations to pass through the +leader, which can become a bottleneck. Other than performance issues, Raft is +a complex algorithm and implementing it correctly is a challenging software +engineering endeavor that we did not wish to undertake, preferring instead +simplicity as a foundational principle to help us write correct software. + +However, writing a distributed system such as Garage can be challenging when +consensus is not available, as we can only use CRDTs (conflict-free replicated +data types) in the code, and we cannot rely on state machine replication. This +means that the specific semantics of CRDTs have to be taken into account +everywhere in the code, which is often not a problem but sometimes adds some +complexity. More importantly, this means that a whole class of features cannot +be implemented in Garage, like those that would require some form of locking or +exclusive access. In practice, this has been causing us issues on the +CreateBucket endpoint, which by definition is meant to exclusively associate a +bucket name to a newly created bucket. In current Garage versions, concurrent +calls to CreateBucket with the same name may create several buckets and leave +Garage in an inconsistent state. + +This leads naturally to the following question: is it possible to implement a +consensus algorithm that eschews the shortcomings of Raft-like algorithms in +unreliable systems? And in particular, is it possible to implement a consensus +algorithm that does not elect a leader, and is therefore not sensitive to +temporary slowdowns or unavailabilities of individual nodes? A new paper by +Antoniadis et al., [*Leaderless +Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151) +[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)], +suggests that the answer is *yes*. However, as with all new research, putting +it into practice will take some time and a lot of work. I will discuss in this +article practical questions posed by the *Leaderless Consensus* paper, and +further steps that could be taken to advance on these issues. + +Please note that the entire content of this article is **purely speculative** +and does not include any *positive results*. Note also that we are not +discussing Byzantine-tolerant systems, which seem to be the main focus of +*Leaderless Consensus*, even though the authors also propose an algorithm for +non-Byzantine systems (the one we are interested in). + +--- + +## Main takeaways of *Leaderless Consensus* + +To be able to meaningfully say that an algorithm is *leaderless*, one has to first +determine what *leaderless* precisely means. The paper starts by offering such +a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"), +where *n* nodes are running in synchronous steps where at most *k* nodes might be +offline, paused, or otherwise unavailable, at each step. +The *synchronous-k* model has a variant called *eventually synchronous-k* which seems +to better model the behaviour of WAN links on the Internet, although I am not sure +of the precise difference between the two. Once the *synchronous-k* network model +is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm +that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system. +Concretely, this means that at any given time, a random node in the network may be +disconnected (not always the same one), and the consensus algorithm will be impacted +only minimally. In other words, we can say that a leaderless consensus algorithm +degrades gracefully in the presence of transient node failures. +This "graceful degradation" property, which Raft does not have, +seems to be exactly what we are looking for in a potential consensus algorithm that +could be added to Garage. + +Having given this definition, the paper continues by offering concrete +algorithms to implement leaderless consensus. Of particular interest to us, the +paper presents in Section 5 a leaderless consensus algorithm, which they call +OFT-Archipelago, which works in message passing systems without Byzantine +nodes, where the only faults that can occur are message omissions (like +messages being dropped by the network, or temporary node crashes). This is +exactly the premise made by Garage, so this algorithm could be a good candidate +for us. Interestingly, while leaderless consensus is formally defined as a +consensus algorithm that works in a *synchronous-1* system (i.e. tolerating +only one failed node at each step), Archipelago works with up to *f < n/2* +unavailable nodes at each time steps. + +According to the benchmarks in the leaderless consensus paper, while +Archipelago has very good throughput (around 50kops/s), the latency of +individual operations is generally between 1 or 2 seconds. This seems to be +acceptable for application in Garage if used only for administrative operations +on buckets and access keys which are relatively rare. From a theoretical point +of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario, +however it is not clear to me whether there is an upper bound on the +termination time, or whether there is a probabilistic analysis of the +termination delays that could be made. It is also not very clear to me the +link between this algorithm and the FLP impossibility theorem: since +Archipelago seems to do things that are forbidden by FLP, it means that the +premise of a *synchronous-k* system is probably in fact much stronger that the +network asynchrony assumed by FLP. + +Among the other advantages of OFT-Archipelago is the fact that the algorithm +seems to be very simple, much more than Raft, as it is described in the paper +in only 42 lines of very understandable pseudocode. There is also a BFT +variant of Archipelago, which is not of interest to us in the context of Garage +as we are making the hypothesis that all nodes are trusted. + +--- + +## Where to go from now? + +Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental +questions need to be answered, among which: + +- How should Archipelago interact with Garage's use of CRDT data types? Do we + have to create a fully separate subsystem for things that are managed under + consensus, or can we hopefully share some logic? More precisely, can we use + a consensus algorithm simply as a total order broadcast primitive that + becomes a mandatory passing point for all modification requests on a set of + metadata tables, with those tables still being based on the CRDT table + replication and synchronisation library which is currently in use in Garage? + In this situation, nodes that come back from a crash can simply catch up on + old changes using the Merkle tree algorithm synchronisation algorithm that we + already have. Or must we use the consensus algorithm as the only way to + broadcast operations and data for the tables that are managed by it? This + would mean that we must add specific logic to handle the case of a node + coming back from a crash, where it must either download all the log of + operations since it was last up, or an entire snapshot of the metadata tables + in question. I think this is mostly related to the reason we want to add + consensus, and the exact consistency guarantees we are expecting it to + provide to us. + +- Can Archipelago be made correct under cluster reconfiguration scenarios? This + is linked to the work done for task 3 of the 2023 NLnet project + ([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495), + [#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses + on making the Quorum-based algorithm for CRDT updates reliable even when the + cluster layout is updated. I will be writing more about this topic in a + future blog post, but in a nutshell, the NLnet task is mainly focused on + maintaining read-after-write consistency in Garage at all times, which has + led us to develop a relatively general framework for modeling algorithm based + on quorums. Since Archipelago also guarantees its correctness using a + non-empty-intersection-of-quorums property, it could benefit from the work + that was originally made on quorums for the CRDT algorithms. + +If we obtain satisfactory answers to these questions, the remaining work will be +the technical implementation of Archipelago in Garage and its validation: + +- Determine more precisely how the pipelined version of Archipelago is made, + as its complete description is not given in the leaderless consensus paper, + only a few basic pointers (Section 8.1 of the JPDC version). + +- Implement Archipelago in Rust, ideally under the form of a generic reusable crate + that could be used outside of the context of Garage. + +- Do a benchmark of Archipelago vs. existing Raft implementations (for instance + the async-raft crate). We should benchmark the algorithms in the following + scenarios: stable networking, high latency and jitter, evolutive situation + with different phases. My hypothesis is that Archipelago could be slower (in + terms of latency, not necessarily in throughput) than Raft in the stable + networking scenario, but the other two scenarios would force Raft to + reconfigure often (i.e. change leaders), which could be the source of huge + performance penalties, which Archipelago would not suffer from. + +- Integrate Archipelago with Garage to solve the CreateBucket issue. + +- To validate our implementation, we would want to test it using automated + testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task + 3 and I'm starting to understand quite well how it works, so this could be + relatively easy. + +- If we want to go further, there is always the possibility of formalizing a + proof of our implementation, however I don't know what are the good tools to + do this, and in all cases it would be an extreme amount of work. + + +Please send your comments and feedback to +[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any. + +--- + +1: We are concerned only with consensus algorithms in the +context of closed, trusted systems such as distributed databases, and not in +large trustless networks such as blockchains. + +Written by [Alex Auvolat](https://adnab.me).