garagehq.deuxfleurs.fr/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md

+++
title='Thoughts on "Leaderless Consensus"'
date=2023-11-30
+++

*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
have notoriously unpredictable performance in low-quality networks that suffer from
latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
introduces a new category of algorithms that better tolerate the frequent 
unavailability of a subset of nodes. However, additional research and practical work is required before
these results can be put into practice. Read for more details.*

<!-- more -->

---

As I have said many times when presenting Garage, we have made a point of not
using any consensus algorithm in Garage and using only CRDTs, for several
reasons.  The first, and most important reason, is that all of the consensus
algorithms that we know of[^1] (in particular Raft, which is very popular in
distributed databases) suffer from unpredictable performance when nodes or the
network are unreliable.  Even in relatively stable conditions, Raft-like
algorithms can still be much slower than CRDTs (as we have shown in some
[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
because they elect a leader node and require all operations to pass through the
leader, which can become a bottleneck.  Other than performance issues, Raft is
a complex algorithm and implementing it correctly is a challenging software
engineering endeavor that we did not wish to undertake, preferring instead
simplicity as a foundational principle to help us write correct software.

However, writing a distributed system such as Garage can be challenging when
consensus is not available, as we can only use CRDTs (conflict-free replicated
data types) in the code, and we cannot rely on state machine replication.  This
means that the specific semantics of CRDTs have to be taken into account
everywhere in the code, which is often not a problem but sometimes adds some
complexity.  More importantly, this means that a whole class of features cannot
be implemented in Garage, like those that would require some form of locking or
exclusive access.  In practice, this has been causing us issues on the
CreateBucket endpoint, which by definition is meant to exclusively associate a
bucket name to a newly created bucket. In current Garage versions, concurrent
calls to CreateBucket with the same name may create several buckets and leave
Garage in an inconsistent state.

This leads naturally to the following question: is it possible to implement a
consensus algorithm that eschews the shortcomings of Raft-like algorithms in
unreliable systems?  And in particular, is it possible to implement a consensus
algorithm that does not elect a leader, and is therefore not sensitive to
temporary slowdowns or unavailabilities of individual nodes?  A new paper by
Antoniadis et al., [*Leaderless
Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
suggests that the answer is *yes*.  However, as with all new research, putting
it into practice will take some time and a lot of work.  I will discuss in this
article practical questions posed by the *Leaderless Consensus* paper, and
further steps that could be taken to advance on these issues.

Please note that the entire content of this article is **purely speculative**
and does not include any *positive results*. Note also that we are not
discussing Byzantine-tolerant systems, which seem to be the main focus of
*Leaderless Consensus*, even though the authors also propose an algorithm for
non-Byzantine systems (the one we are interested in).

## Main takeaways of *Leaderless Consensus*

To be able to meaningfully say that an algorithm is *leaderless*, one has to first
determine what *leaderless* precisely means. The paper starts by offering such
a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
where *n* nodes are running in synchronous steps where at most *k* nodes might be
offline, paused, or otherwise unavailable, at each step.
The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
to better model the behaviour of WAN links on the Internet, although I am not sure
of the precise difference between the two. Once the *synchronous-k* network model
is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
Concretely, this means that at any given time, a random node in the network may be
disconnected (not always the same one), and the consensus algorithm will be impacted
only minimally. In other words, we can say that a leaderless consensus algorithm
degrades gracefully in the presence of transient node failures.
This "graceful degradation" property, which Raft does not have,
seems to be exactly what we are looking for in a potential consensus algorithm that
could be added to Garage.

Having given this definition, the paper continues by offering concrete
algorithms to implement leaderless consensus. Of particular interest to us, the
paper presents in Section 5 a leaderless consensus algorithm, which they call
OFT-Archipelago, which works in message passing systems without Byzantine
nodes, where the only faults that can occur are message omissions (like
messages being dropped by the network, or temporary node crashes).  This is
exactly the premise made by Garage, so this algorithm could be a good candidate
for us. Interestingly, while leaderless consensus is formally defined as a
consensus algorithm that works in a *synchronous-1* system (i.e.  tolerating
only one failed node at each step), Archipelago works with up to *f < n/2*
unavailable nodes at each time steps.

According to the benchmarks in the leaderless consensus paper, while
Archipelago has very good throughput (around 50kops/s), the latency of
individual operations is generally between 1 or 2 seconds. This seems to be
acceptable for application in Garage if used only for administrative operations
on buckets and access keys which are relatively rare. From a theoretical point
of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
however it is not clear to me whether there is an upper bound on the
termination time, or whether there is a probabilistic analysis of the
termination delays that could be made.  It is also not very clear to me the
link between this algorithm and the FLP impossibility theorem: since
Archipelago seems to do things that are forbidden by FLP, it means that the
premise of a *synchronous-k* system is probably in fact much stronger that the
network asynchrony assumed by FLP.

Among the other advantages of OFT-Archipelago is the fact that the algorithm
seems to be very simple, much more than Raft, as it is described in the paper
in only 42 lines of very understandable pseudocode.  There is also a BFT
variant of Archipelago, which is not of interest to us in the context of Garage
as we are making the hypothesis that all nodes are trusted.

## Where to go from now?

Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
questions need to be answered, among which:

- How should Archipelago interact with Garage's use of CRDT data types? Do we
  have to create a fully separate subsystem for things that are managed under
  consensus, or can we hopefully share some logic?  More precisely, can we use
  a consensus algorithm simply as a total order broadcast primitive that
  becomes a mandatory passing point for all modification requests on a set of
  metadata tables, with those tables still being based on the CRDT table
  replication and synchronisation library which is currently in use in Garage?
  In this situation, nodes that come back from a crash can simply catch up on
  old changes using the Merkle tree algorithm synchronisation algorithm that we
  already have.  Or must we use the consensus algorithm as the only way to
  broadcast operations and data for the tables that are managed by it? This
  would mean that we must add specific logic to handle the case of a node
  coming back from a crash, where it must either download all the log of
  operations since it was last up, or an entire snapshot of the metadata tables
  in question. I think this is mostly related to the reason we want to add
  consensus, and the exact consistency guarantees we are expecting it to
  provide to us.

- Can Archipelago be made correct under cluster reconfiguration scenarios? This
  is linked to the work done for task 3 of the 2023 NLnet project
  ([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
  [#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
  on making the Quorum-based algorithm for CRDT updates reliable even when the
  cluster layout is updated. I will be writing more about this topic in a
  future blog post, but in a nutshell, the NLnet task is mainly focused on
  maintaining read-after-write consistency in Garage at all times, which has
  led us to develop a relatively general framework for modeling algorithm based
  on quorums. Since Archipelago also guarantees its correctness using a
  non-empty-intersection-of-quorums property, it could benefit from the work
  that was originally made on quorums for the CRDT algorithms.

If we obtain satisfactory answers to these questions, the remaining work will be
the technical implementation of Archipelago in Garage and its validation:

- Determine more precisely how the pipelined version of Archipelago is made,
  as its complete description is not given in the leaderless consensus paper,
  only a few basic pointers (Section 8.1 of the JPDC version).

- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
  that could be used outside of the context of Garage.

- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
  the async-raft crate).  We should benchmark the algorithms in the following
  scenarios: stable networking, high latency and jitter, evolutive situation
  with different phases.  My hypothesis is that Archipelago could be slower (in
  terms of latency, not necessarily in throughput) than Raft in the stable
  networking scenario, but the other two scenarios would force Raft to
  reconfigure often (i.e. change leaders), which could be the source of huge
  performance penalties, which Archipelago would not suffer from.

- Integrate Archipelago with Garage to solve the CreateBucket issue.

- To validate our implementation, we would want to test it using automated
  testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
  3 and I'm starting to understand quite well how it works, so this could be
  relatively easy.

- If we want to go further, there is always the possibility of formalizing a
  proof of our implementation, however I don't know what are the good tools to
  do this, and in all cases it would be an extreme amount of work.


Please send your comments and feedback to
[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.

---

<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
context of closed, trusted systems such as distributed databases, and not in
large trustless networks such as blockchains.

Written by [Alex Auvolat](https://adnab.me).
add blog post on leaderless consensus 2023-11-30 12:41:38 +00:00			`+++`
			`title='Thoughts on "Leaderless Consensus"'`
			`date=2023-11-30`
			`+++`

			`*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,`
			`have notoriously unpredictable performance in low-quality networks that suffer from`
			`latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use`
			`them and uses only CRDTs. A new paper by Antoniadis et al., [Leaderless Consensus](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),`
			`introduces a new category of algorithms that better tolerate the frequent`
			`unavailability of a subset of nodes. However, additional research and practical work is required before`
			`these results can be put into practice. Read for more details.*`

			`<!-- more -->`

			`---`

			`As I have said many times when presenting Garage, we have made a point of not`
			`using any consensus algorithm in Garage and using only CRDTs, for several`
			`reasons. The first, and most important reason, is that all of the consensus`
			`algorithms that we know of[^1] (in particular Raft, which is very popular in`
			`distributed databases) suffer from unpredictable performance when nodes or the`
			`network are unreliable. Even in relatively stable conditions, Raft-like`
			`algorithms can still be much slower than CRDTs (as we have shown in some`
			`[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))`
			`because they elect a leader node and require all operations to pass through the`
			`leader, which can become a bottleneck. Other than performance issues, Raft is`
			`a complex algorithm and implementing it correctly is a challenging software`
			`engineering endeavor that we did not wish to undertake, preferring instead`
			`simplicity as a foundational principle to help us write correct software.`

			`However, writing a distributed system such as Garage can be challenging when`
			`consensus is not available, as we can only use CRDTs (conflict-free replicated`
			`data types) in the code, and we cannot rely on state machine replication. This`
			`means that the specific semantics of CRDTs have to be taken into account`
			`everywhere in the code, which is often not a problem but sometimes adds some`
			`complexity. More importantly, this means that a whole class of features cannot`
			`be implemented in Garage, like those that would require some form of locking or`
			`exclusive access. In practice, this has been causing us issues on the`
			`CreateBucket endpoint, which by definition is meant to exclusively associate a`
			`bucket name to a newly created bucket. In current Garage versions, concurrent`
			`calls to CreateBucket with the same name may create several buckets and leave`
			`Garage in an inconsistent state.`

			`This leads naturally to the following question: is it possible to implement a`
			`consensus algorithm that eschews the shortcomings of Raft-like algorithms in`
			`unreliable systems? And in particular, is it possible to implement a consensus`
			`algorithm that does not elect a leader, and is therefore not sensitive to`
			`temporary slowdowns or unavailabilities of individual nodes? A new paper by`
			`Antoniadis et al., [*Leaderless`
			`Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)`
			`[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],`
			`suggests that the answer is yes. However, as with all new research, putting`
			`it into practice will take some time and a lot of work. I will discuss in this`
			`article practical questions posed by the Leaderless Consensus paper, and`
			`further steps that could be taken to advance on these issues.`

			`Please note that the entire content of this article is purely speculative`
			`and does not include any positive results. Note also that we are not`
			`discussing Byzantine-tolerant systems, which seem to be the main focus of`
			`Leaderless Consensus, even though the authors also propose an algorithm for`
			`non-Byzantine systems (the one we are interested in).`

			`## Main takeaways of Leaderless Consensus`

			`To be able to meaningfully say that an algorithm is leaderless, one has to first`
			`determine what leaderless precisely means. The paper starts by offering such`
			`a definition, using a network model they call synchronous-k ("synchronous minus k"),`
			`where n nodes are running in synchronous steps where at most k nodes might be`
			`offline, paused, or otherwise unavailable, at each step.`
			`The synchronous-k model has a variant called eventually synchronous-k which seems`
			`to better model the behaviour of WAN links on the Internet, although I am not sure`
			`of the precise difference between the two. Once the synchronous-k network model`
			`is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm`
			`that still works (i.e. it terminates, giving a decision), in a synchronous-1 system.`
			`Concretely, this means that at any given time, a random node in the network may be`
			`disconnected (not always the same one), and the consensus algorithm will be impacted`
			`only minimally. In other words, we can say that a leaderless consensus algorithm`
			`degrades gracefully in the presence of transient node failures.`
			`This "graceful degradation" property, which Raft does not have,`
			`seems to be exactly what we are looking for in a potential consensus algorithm that`
			`could be added to Garage.`

			`Having given this definition, the paper continues by offering concrete`
			`algorithms to implement leaderless consensus. Of particular interest to us, the`
			`paper presents in Section 5 a leaderless consensus algorithm, which they call`
			`OFT-Archipelago, which works in message passing systems without Byzantine`
			`nodes, where the only faults that can occur are message omissions (like`
			`messages being dropped by the network, or temporary node crashes). This is`
			`exactly the premise made by Garage, so this algorithm could be a good candidate`
			`for us. Interestingly, while leaderless consensus is formally defined as a`
			`consensus algorithm that works in a synchronous-1 system (i.e. tolerating`
			`only one failed node at each step), Archipelago works with up to f < n/2`
			`unavailable nodes at each time steps.`

			`According to the benchmarks in the leaderless consensus paper, while`
			`Archipelago has very good throughput (around 50kops/s), the latency of`
			`individual operations is generally between 1 or 2 seconds. This seems to be`
			`acceptable for application in Garage if used only for administrative operations`
			`on buckets and access keys which are relatively rare. From a theoretical point`
			`of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,`
			`however it is not clear to me whether there is an upper bound on the`
			`termination time, or whether there is a probabilistic analysis of the`
			`termination delays that could be made. It is also not very clear to me the`
			`link between this algorithm and the FLP impossibility theorem: since`
			`Archipelago seems to do things that are forbidden by FLP, it means that the`
			`premise of a synchronous-k system is probably in fact much stronger that the`
			`network asynchrony assumed by FLP.`

			`Among the other advantages of OFT-Archipelago is the fact that the algorithm`
			`seems to be very simple, much more than Raft, as it is described in the paper`
			`in only 42 lines of very understandable pseudocode. There is also a BFT`
			`variant of Archipelago, which is not of interest to us in the context of Garage`
			`as we are making the hypothesis that all nodes are trusted.`

			`## Where to go from now?`

			`Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental`
			`questions need to be answered, among which:`

			`- How should Archipelago interact with Garage's use of CRDT data types? Do we`
			`have to create a fully separate subsystem for things that are managed under`
			`consensus, or can we hopefully share some logic? More precisely, can we use`
			`a consensus algorithm simply as a total order broadcast primitive that`
			`becomes a mandatory passing point for all modification requests on a set of`
			`metadata tables, with those tables still being based on the CRDT table`
			`replication and synchronisation library which is currently in use in Garage?`
			`In this situation, nodes that come back from a crash can simply catch up on`
			`old changes using the Merkle tree algorithm synchronisation algorithm that we`
			`already have. Or must we use the consensus algorithm as the only way to`
			`broadcast operations and data for the tables that are managed by it? This`
			`would mean that we must add specific logic to handle the case of a node`
			`coming back from a crash, where it must either download all the log of`
			`operations since it was last up, or an entire snapshot of the metadata tables`
			`in question. I think this is mostly related to the reason we want to add`
			`consensus, and the exact consistency guarantees we are expecting it to`
			`provide to us.`

			`- Can Archipelago be made correct under cluster reconfiguration scenarios? This`
			`is linked to the work done for task 3 of the 2023 NLnet project`
			`([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),`
			`[#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses`
			`on making the Quorum-based algorithm for CRDT updates reliable even when the`
			`cluster layout is updated. I will be writing more about this topic in a`
			`future blog post, but in a nutshell, the NLnet task is mainly focused on`
			`maintaining read-after-write consistency in Garage at all times, which has`
			`led us to develop a relatively general framework for modeling algorithm based`
			`on quorums. Since Archipelago also guarantees its correctness using a`
			`non-empty-intersection-of-quorums property, it could benefit from the work`
			`that was originally made on quorums for the CRDT algorithms.`

			`If we obtain satisfactory answers to these questions, the remaining work will be`
			`the technical implementation of Archipelago in Garage and its validation:`

			`- Determine more precisely how the pipelined version of Archipelago is made,`
			`as its complete description is not given in the leaderless consensus paper,`
			`only a few basic pointers (Section 8.1 of the JPDC version).`

			`- Implement Archipelago in Rust, ideally under the form of a generic reusable crate`
			`that could be used outside of the context of Garage.`

			`- Do a benchmark of Archipelago vs. existing Raft implementations (for instance`
			`the async-raft crate). We should benchmark the algorithms in the following`
			`scenarios: stable networking, high latency and jitter, evolutive situation`
			`with different phases. My hypothesis is that Archipelago could be slower (in`
			`terms of latency, not necessarily in throughput) than Raft in the stable`
			`networking scenario, but the other two scenarios would force Raft to`
			`reconfigure often (i.e. change leaders), which could be the source of huge`
			`performance penalties, which Archipelago would not suffer from.`

			`- Integrate Archipelago with Garage to solve the CreateBucket issue.`

			`- To validate our implementation, we would want to test it using automated`
			`testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task`
			`3 and I'm starting to understand quite well how it works, so this could be`
			`relatively easy.`

			`- If we want to go further, there is always the possibility of formalizing a`
			`proof of our implementation, however I don't know what are the good tools to`
			`do this, and in all cases it would be an extreme amount of work.`


			`Please send your comments and feedback to`
			`[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.`

			`---`

			`<sup id="1">1</sup>: We are concerned only with consensus algorithms in the`
			`context of closed, trusted systems such as distributed databases, and not in`
			`large trustless networks such as blockchains.`

			`Written by [Alex Auvolat](https://adnab.me).`