add blog post on leaderless consensus
Some checks failed
continuous-integration/drone/push Build is failing
Some checks failed
continuous-integration/drone/push Build is failing
This commit is contained in:
parent
16fa02f53a
commit
c8514a3793
2 changed files with 196 additions and 0 deletions
Binary file not shown.
196
content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
Normal file
196
content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
Normal file
|
@ -0,0 +1,196 @@
|
|||
+++
|
||||
title='Thoughts on "Leaderless Consensus"'
|
||||
date=2023-11-30
|
||||
+++
|
||||
|
||||
*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
|
||||
have notoriously unpredictable performance in low-quality networks that suffer from
|
||||
latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
|
||||
them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
|
||||
introduces a new category of algorithms that better tolerate the frequent
|
||||
unavailability of a subset of nodes. However, additional research and practical work is required before
|
||||
these results can be put into practice. Read for more details.*
|
||||
|
||||
<!-- more -->
|
||||
|
||||
---
|
||||
|
||||
As I have said many times when presenting Garage, we have made a point of not
|
||||
using any consensus algorithm in Garage and using only CRDTs, for several
|
||||
reasons. The first, and most important reason, is that all of the consensus
|
||||
algorithms that we know of[^1] (in particular Raft, which is very popular in
|
||||
distributed databases) suffer from unpredictable performance when nodes or the
|
||||
network are unreliable. Even in relatively stable conditions, Raft-like
|
||||
algorithms can still be much slower than CRDTs (as we have shown in some
|
||||
[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
|
||||
because they elect a leader node and require all operations to pass through the
|
||||
leader, which can become a bottleneck. Other than performance issues, Raft is
|
||||
a complex algorithm and implementing it correctly is a challenging software
|
||||
engineering endeavor that we did not wish to undertake, preferring instead
|
||||
simplicity as a foundational principle to help us write correct software.
|
||||
|
||||
However, writing a distributed system such as Garage can be challenging when
|
||||
consensus is not available, as we can only use CRDTs (conflict-free replicated
|
||||
data types) in the code, and we cannot rely on state machine replication. This
|
||||
means that the specific semantics of CRDTs have to be taken into account
|
||||
everywhere in the code, which is often not a problem but sometimes adds some
|
||||
complexity. More importantly, this means that a whole class of features cannot
|
||||
be implemented in Garage, like those that would require some form of locking or
|
||||
exclusive access. In practice, this has been causing us issues on the
|
||||
CreateBucket endpoint, which by definition is meant to exclusively associate a
|
||||
bucket name to a newly created bucket. In current Garage versions, concurrent
|
||||
calls to CreateBucket with the same name may create several buckets and leave
|
||||
Garage in an inconsistent state.
|
||||
|
||||
This leads naturally to the following question: is it possible to implement a
|
||||
consensus algorithm that eschews the shortcomings of Raft-like algorithms in
|
||||
unreliable systems? And in particular, is it possible to implement a consensus
|
||||
algorithm that does not elect a leader, and is therefore not sensitive to
|
||||
temporary slowdowns or unavailabilities of individual nodes? A new paper by
|
||||
Antoniadis et al., [*Leaderless
|
||||
Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
|
||||
[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
|
||||
suggests that the answer is *yes*. However, as with all new research, putting
|
||||
it into practice will take some time and a lot of work. I will discuss in this
|
||||
article practical questions posed by the *Leaderless Consensus* paper, and
|
||||
further steps that could be taken to advance on these issues.
|
||||
|
||||
Please note that the entire content of this article is **purely speculative**
|
||||
and does not include any *positive results*. Note also that we are not
|
||||
discussing Byzantine-tolerant systems, which seem to be the main focus of
|
||||
*Leaderless Consensus*, even though the authors also propose an algorithm for
|
||||
non-Byzantine systems (the one we are interested in).
|
||||
|
||||
---
|
||||
|
||||
## Main takeaways of *Leaderless Consensus*
|
||||
|
||||
To be able to meaningfully say that an algorithm is *leaderless*, one has to first
|
||||
determine what *leaderless* precisely means. The paper starts by offering such
|
||||
a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
|
||||
where *n* nodes are running in synchronous steps where at most *k* nodes might be
|
||||
offline, paused, or otherwise unavailable, at each step.
|
||||
The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
|
||||
to better model the behaviour of WAN links on the Internet, although I am not sure
|
||||
of the precise difference between the two. Once the *synchronous-k* network model
|
||||
is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
|
||||
that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
|
||||
Concretely, this means that at any given time, a random node in the network may be
|
||||
disconnected (not always the same one), and the consensus algorithm will be impacted
|
||||
only minimally. In other words, we can say that a leaderless consensus algorithm
|
||||
degrades gracefully in the presence of transient node failures.
|
||||
This "graceful degradation" property, which Raft does not have,
|
||||
seems to be exactly what we are looking for in a potential consensus algorithm that
|
||||
could be added to Garage.
|
||||
|
||||
Having given this definition, the paper continues by offering concrete
|
||||
algorithms to implement leaderless consensus. Of particular interest to us, the
|
||||
paper presents in Section 5 a leaderless consensus algorithm, which they call
|
||||
OFT-Archipelago, which works in message passing systems without Byzantine
|
||||
nodes, where the only faults that can occur are message omissions (like
|
||||
messages being dropped by the network, or temporary node crashes). This is
|
||||
exactly the premise made by Garage, so this algorithm could be a good candidate
|
||||
for us. Interestingly, while leaderless consensus is formally defined as a
|
||||
consensus algorithm that works in a *synchronous-1* system (i.e. tolerating
|
||||
only one failed node at each step), Archipelago works with up to *f < n/2*
|
||||
unavailable nodes at each time steps.
|
||||
|
||||
According to the benchmarks in the leaderless consensus paper, while
|
||||
Archipelago has very good throughput (around 50kops/s), the latency of
|
||||
individual operations is generally between 1 or 2 seconds. This seems to be
|
||||
acceptable for application in Garage if used only for administrative operations
|
||||
on buckets and access keys which are relatively rare. From a theoretical point
|
||||
of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
|
||||
however it is not clear to me whether there is an upper bound on the
|
||||
termination time, or whether there is a probabilistic analysis of the
|
||||
termination delays that could be made. It is also not very clear to me the
|
||||
link between this algorithm and the FLP impossibility theorem: since
|
||||
Archipelago seems to do things that are forbidden by FLP, it means that the
|
||||
premise of a *synchronous-k* system is probably in fact much stronger that the
|
||||
network asynchrony assumed by FLP.
|
||||
|
||||
Among the other advantages of OFT-Archipelago is the fact that the algorithm
|
||||
seems to be very simple, much more than Raft, as it is described in the paper
|
||||
in only 42 lines of very understandable pseudocode. There is also a BFT
|
||||
variant of Archipelago, which is not of interest to us in the context of Garage
|
||||
as we are making the hypothesis that all nodes are trusted.
|
||||
|
||||
---
|
||||
|
||||
## Where to go from now?
|
||||
|
||||
Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
|
||||
questions need to be answered, among which:
|
||||
|
||||
- How should Archipelago interact with Garage's use of CRDT data types? Do we
|
||||
have to create a fully separate subsystem for things that are managed under
|
||||
consensus, or can we hopefully share some logic? More precisely, can we use
|
||||
a consensus algorithm simply as a total order broadcast primitive that
|
||||
becomes a mandatory passing point for all modification requests on a set of
|
||||
metadata tables, with those tables still being based on the CRDT table
|
||||
replication and synchronisation library which is currently in use in Garage?
|
||||
In this situation, nodes that come back from a crash can simply catch up on
|
||||
old changes using the Merkle tree algorithm synchronisation algorithm that we
|
||||
already have. Or must we use the consensus algorithm as the only way to
|
||||
broadcast operations and data for the tables that are managed by it? This
|
||||
would mean that we must add specific logic to handle the case of a node
|
||||
coming back from a crash, where it must either download all the log of
|
||||
operations since it was last up, or an entire snapshot of the metadata tables
|
||||
in question. I think this is mostly related to the reason we want to add
|
||||
consensus, and the exact consistency guarantees we are expecting it to
|
||||
provide to us.
|
||||
|
||||
- Can Archipelago be made correct under cluster reconfiguration scenarios? This
|
||||
is linked to the work done for task 3 of the 2023 NLnet project
|
||||
([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
|
||||
[#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
|
||||
on making the Quorum-based algorithm for CRDT updates reliable even when the
|
||||
cluster layout is updated. I will be writing more about this topic in a
|
||||
future blog post, but in a nutshell, the NLnet task is mainly focused on
|
||||
maintaining read-after-write consistency in Garage at all times, which has
|
||||
led us to develop a relatively general framework for modeling algorithm based
|
||||
on quorums. Since Archipelago also guarantees its correctness using a
|
||||
non-empty-intersection-of-quorums property, it could benefit from the work
|
||||
that was originally made on quorums for the CRDT algorithms.
|
||||
|
||||
If we obtain satisfactory answers to these questions, the remaining work will be
|
||||
the technical implementation of Archipelago in Garage and its validation:
|
||||
|
||||
- Determine more precisely how the pipelined version of Archipelago is made,
|
||||
as its complete description is not given in the leaderless consensus paper,
|
||||
only a few basic pointers (Section 8.1 of the JPDC version).
|
||||
|
||||
- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
|
||||
that could be used outside of the context of Garage.
|
||||
|
||||
- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
|
||||
the async-raft crate). We should benchmark the algorithms in the following
|
||||
scenarios: stable networking, high latency and jitter, evolutive situation
|
||||
with different phases. My hypothesis is that Archipelago could be slower (in
|
||||
terms of latency, not necessarily in throughput) than Raft in the stable
|
||||
networking scenario, but the other two scenarios would force Raft to
|
||||
reconfigure often (i.e. change leaders), which could be the source of huge
|
||||
performance penalties, which Archipelago would not suffer from.
|
||||
|
||||
- Integrate Archipelago with Garage to solve the CreateBucket issue.
|
||||
|
||||
- To validate our implementation, we would want to test it using automated
|
||||
testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
|
||||
3 and I'm starting to understand quite well how it works, so this could be
|
||||
relatively easy.
|
||||
|
||||
- If we want to go further, there is always the possibility of formalizing a
|
||||
proof of our implementation, however I don't know what are the good tools to
|
||||
do this, and in all cases it would be an extreme amount of work.
|
||||
|
||||
|
||||
Please send your comments and feedback to
|
||||
[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
|
||||
|
||||
---
|
||||
|
||||
<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
|
||||
context of closed, trusted systems such as distributed databases, and not in
|
||||
large trustless networks such as blockchains.
|
||||
|
||||
Written by [Alex Auvolat](https://adnab.me).
|
Loading…
Add table
Reference in a new issue