193 lines
11 KiB
Markdown
193 lines
11 KiB
Markdown
|
+++
|
||
|
title='Thoughts on "Leaderless Consensus"'
|
||
|
date=2023-11-30
|
||
|
+++
|
||
|
|
||
|
*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
|
||
|
have notoriously unpredictable performance in low-quality networks that suffer from
|
||
|
latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
|
||
|
them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
|
||
|
introduces a new category of algorithms that better tolerate the frequent
|
||
|
unavailability of a subset of nodes. However, additional research and practical work is required before
|
||
|
these results can be put into practice. Read for more details.*
|
||
|
|
||
|
<!-- more -->
|
||
|
|
||
|
---
|
||
|
|
||
|
As I have said many times when presenting Garage, we have made a point of not
|
||
|
using any consensus algorithm in Garage and using only CRDTs, for several
|
||
|
reasons. The first, and most important reason, is that all of the consensus
|
||
|
algorithms that we know of[^1] (in particular Raft, which is very popular in
|
||
|
distributed databases) suffer from unpredictable performance when nodes or the
|
||
|
network are unreliable. Even in relatively stable conditions, Raft-like
|
||
|
algorithms can still be much slower than CRDTs (as we have shown in some
|
||
|
[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
|
||
|
because they elect a leader node and require all operations to pass through the
|
||
|
leader, which can become a bottleneck. Other than performance issues, Raft is
|
||
|
a complex algorithm and implementing it correctly is a challenging software
|
||
|
engineering endeavor that we did not wish to undertake, preferring instead
|
||
|
simplicity as a foundational principle to help us write correct software.
|
||
|
|
||
|
However, writing a distributed system such as Garage can be challenging when
|
||
|
consensus is not available, as we can only use CRDTs (conflict-free replicated
|
||
|
data types) in the code, and we cannot rely on state machine replication. This
|
||
|
means that the specific semantics of CRDTs have to be taken into account
|
||
|
everywhere in the code, which is often not a problem but sometimes adds some
|
||
|
complexity. More importantly, this means that a whole class of features cannot
|
||
|
be implemented in Garage, like those that would require some form of locking or
|
||
|
exclusive access. In practice, this has been causing us issues on the
|
||
|
CreateBucket endpoint, which by definition is meant to exclusively associate a
|
||
|
bucket name to a newly created bucket. In current Garage versions, concurrent
|
||
|
calls to CreateBucket with the same name may create several buckets and leave
|
||
|
Garage in an inconsistent state.
|
||
|
|
||
|
This leads naturally to the following question: is it possible to implement a
|
||
|
consensus algorithm that eschews the shortcomings of Raft-like algorithms in
|
||
|
unreliable systems? And in particular, is it possible to implement a consensus
|
||
|
algorithm that does not elect a leader, and is therefore not sensitive to
|
||
|
temporary slowdowns or unavailabilities of individual nodes? A new paper by
|
||
|
Antoniadis et al., [*Leaderless
|
||
|
Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
|
||
|
[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
|
||
|
suggests that the answer is *yes*. However, as with all new research, putting
|
||
|
it into practice will take some time and a lot of work. I will discuss in this
|
||
|
article practical questions posed by the *Leaderless Consensus* paper, and
|
||
|
further steps that could be taken to advance on these issues.
|
||
|
|
||
|
Please note that the entire content of this article is **purely speculative**
|
||
|
and does not include any *positive results*. Note also that we are not
|
||
|
discussing Byzantine-tolerant systems, which seem to be the main focus of
|
||
|
*Leaderless Consensus*, even though the authors also propose an algorithm for
|
||
|
non-Byzantine systems (the one we are interested in).
|
||
|
|
||
|
## Main takeaways of *Leaderless Consensus*
|
||
|
|
||
|
To be able to meaningfully say that an algorithm is *leaderless*, one has to first
|
||
|
determine what *leaderless* precisely means. The paper starts by offering such
|
||
|
a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
|
||
|
where *n* nodes are running in synchronous steps where at most *k* nodes might be
|
||
|
offline, paused, or otherwise unavailable, at each step.
|
||
|
The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
|
||
|
to better model the behaviour of WAN links on the Internet, although I am not sure
|
||
|
of the precise difference between the two. Once the *synchronous-k* network model
|
||
|
is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
|
||
|
that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
|
||
|
Concretely, this means that at any given time, a random node in the network may be
|
||
|
disconnected (not always the same one), and the consensus algorithm will be impacted
|
||
|
only minimally. In other words, we can say that a leaderless consensus algorithm
|
||
|
degrades gracefully in the presence of transient node failures.
|
||
|
This "graceful degradation" property, which Raft does not have,
|
||
|
seems to be exactly what we are looking for in a potential consensus algorithm that
|
||
|
could be added to Garage.
|
||
|
|
||
|
Having given this definition, the paper continues by offering concrete
|
||
|
algorithms to implement leaderless consensus. Of particular interest to us, the
|
||
|
paper presents in Section 5 a leaderless consensus algorithm, which they call
|
||
|
OFT-Archipelago, which works in message passing systems without Byzantine
|
||
|
nodes, where the only faults that can occur are message omissions (like
|
||
|
messages being dropped by the network, or temporary node crashes). This is
|
||
|
exactly the premise made by Garage, so this algorithm could be a good candidate
|
||
|
for us. Interestingly, while leaderless consensus is formally defined as a
|
||
|
consensus algorithm that works in a *synchronous-1* system (i.e. tolerating
|
||
|
only one failed node at each step), Archipelago works with up to *f < n/2*
|
||
|
unavailable nodes at each time steps.
|
||
|
|
||
|
According to the benchmarks in the leaderless consensus paper, while
|
||
|
Archipelago has very good throughput (around 50kops/s), the latency of
|
||
|
individual operations is generally between 1 or 2 seconds. This seems to be
|
||
|
acceptable for application in Garage if used only for administrative operations
|
||
|
on buckets and access keys which are relatively rare. From a theoretical point
|
||
|
of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
|
||
|
however it is not clear to me whether there is an upper bound on the
|
||
|
termination time, or whether there is a probabilistic analysis of the
|
||
|
termination delays that could be made. It is also not very clear to me the
|
||
|
link between this algorithm and the FLP impossibility theorem: since
|
||
|
Archipelago seems to do things that are forbidden by FLP, it means that the
|
||
|
premise of a *synchronous-k* system is probably in fact much stronger that the
|
||
|
network asynchrony assumed by FLP.
|
||
|
|
||
|
Among the other advantages of OFT-Archipelago is the fact that the algorithm
|
||
|
seems to be very simple, much more than Raft, as it is described in the paper
|
||
|
in only 42 lines of very understandable pseudocode. There is also a BFT
|
||
|
variant of Archipelago, which is not of interest to us in the context of Garage
|
||
|
as we are making the hypothesis that all nodes are trusted.
|
||
|
|
||
|
## Where to go from now?
|
||
|
|
||
|
Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
|
||
|
questions need to be answered, among which:
|
||
|
|
||
|
- How should Archipelago interact with Garage's use of CRDT data types? Do we
|
||
|
have to create a fully separate subsystem for things that are managed under
|
||
|
consensus, or can we hopefully share some logic? More precisely, can we use
|
||
|
a consensus algorithm simply as a total order broadcast primitive that
|
||
|
becomes a mandatory passing point for all modification requests on a set of
|
||
|
metadata tables, with those tables still being based on the CRDT table
|
||
|
replication and synchronisation library which is currently in use in Garage?
|
||
|
In this situation, nodes that come back from a crash can simply catch up on
|
||
|
old changes using the Merkle tree algorithm synchronisation algorithm that we
|
||
|
already have. Or must we use the consensus algorithm as the only way to
|
||
|
broadcast operations and data for the tables that are managed by it? This
|
||
|
would mean that we must add specific logic to handle the case of a node
|
||
|
coming back from a crash, where it must either download all the log of
|
||
|
operations since it was last up, or an entire snapshot of the metadata tables
|
||
|
in question. I think this is mostly related to the reason we want to add
|
||
|
consensus, and the exact consistency guarantees we are expecting it to
|
||
|
provide to us.
|
||
|
|
||
|
- Can Archipelago be made correct under cluster reconfiguration scenarios? This
|
||
|
is linked to the work done for task 3 of the 2023 NLnet project
|
||
|
([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
|
||
|
[#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
|
||
|
on making the Quorum-based algorithm for CRDT updates reliable even when the
|
||
|
cluster layout is updated. I will be writing more about this topic in a
|
||
|
future blog post, but in a nutshell, the NLnet task is mainly focused on
|
||
|
maintaining read-after-write consistency in Garage at all times, which has
|
||
|
led us to develop a relatively general framework for modeling algorithm based
|
||
|
on quorums. Since Archipelago also guarantees its correctness using a
|
||
|
non-empty-intersection-of-quorums property, it could benefit from the work
|
||
|
that was originally made on quorums for the CRDT algorithms.
|
||
|
|
||
|
If we obtain satisfactory answers to these questions, the remaining work will be
|
||
|
the technical implementation of Archipelago in Garage and its validation:
|
||
|
|
||
|
- Determine more precisely how the pipelined version of Archipelago is made,
|
||
|
as its complete description is not given in the leaderless consensus paper,
|
||
|
only a few basic pointers (Section 8.1 of the JPDC version).
|
||
|
|
||
|
- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
|
||
|
that could be used outside of the context of Garage.
|
||
|
|
||
|
- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
|
||
|
the async-raft crate). We should benchmark the algorithms in the following
|
||
|
scenarios: stable networking, high latency and jitter, evolutive situation
|
||
|
with different phases. My hypothesis is that Archipelago could be slower (in
|
||
|
terms of latency, not necessarily in throughput) than Raft in the stable
|
||
|
networking scenario, but the other two scenarios would force Raft to
|
||
|
reconfigure often (i.e. change leaders), which could be the source of huge
|
||
|
performance penalties, which Archipelago would not suffer from.
|
||
|
|
||
|
- Integrate Archipelago with Garage to solve the CreateBucket issue.
|
||
|
|
||
|
- To validate our implementation, we would want to test it using automated
|
||
|
testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
|
||
|
3 and I'm starting to understand quite well how it works, so this could be
|
||
|
relatively easy.
|
||
|
|
||
|
- If we want to go further, there is always the possibility of formalizing a
|
||
|
proof of our implementation, however I don't know what are the good tools to
|
||
|
do this, and in all cases it would be an extreme amount of work.
|
||
|
|
||
|
|
||
|
Please send your comments and feedback to
|
||
|
[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
|
||
|
|
||
|
---
|
||
|
|
||
|
<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
|
||
|
context of closed, trusted systems such as distributed databases, and not in
|
||
|
large trustless networks such as blockchains.
|
||
|
|
||
|
Written by [Alex Auvolat](https://adnab.me).
|