Compare commits
No commits in common. "c8514a37935492cc3657d317d26c8bf1974de4f6" and "1fd49ef40fec827b5c5bdc04e2edfc613f687ef3" have entirely different histories.
c8514a3793
...
1fd49ef40f
3 changed files with 1 additions and 197 deletions
Binary file not shown.
|
@ -1,196 +0,0 @@
|
||||||
+++
|
|
||||||
title='Thoughts on "Leaderless Consensus"'
|
|
||||||
date=2023-11-30
|
|
||||||
+++
|
|
||||||
|
|
||||||
*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
|
|
||||||
have notoriously unpredictable performance in low-quality networks that suffer from
|
|
||||||
latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
|
|
||||||
them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
|
|
||||||
introduces a new category of algorithms that better tolerate the frequent
|
|
||||||
unavailability of a subset of nodes. However, additional research and practical work is required before
|
|
||||||
these results can be put into practice. Read for more details.*
|
|
||||||
|
|
||||||
<!-- more -->
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
As I have said many times when presenting Garage, we have made a point of not
|
|
||||||
using any consensus algorithm in Garage and using only CRDTs, for several
|
|
||||||
reasons. The first, and most important reason, is that all of the consensus
|
|
||||||
algorithms that we know of[^1] (in particular Raft, which is very popular in
|
|
||||||
distributed databases) suffer from unpredictable performance when nodes or the
|
|
||||||
network are unreliable. Even in relatively stable conditions, Raft-like
|
|
||||||
algorithms can still be much slower than CRDTs (as we have shown in some
|
|
||||||
[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
|
|
||||||
because they elect a leader node and require all operations to pass through the
|
|
||||||
leader, which can become a bottleneck. Other than performance issues, Raft is
|
|
||||||
a complex algorithm and implementing it correctly is a challenging software
|
|
||||||
engineering endeavor that we did not wish to undertake, preferring instead
|
|
||||||
simplicity as a foundational principle to help us write correct software.
|
|
||||||
|
|
||||||
However, writing a distributed system such as Garage can be challenging when
|
|
||||||
consensus is not available, as we can only use CRDTs (conflict-free replicated
|
|
||||||
data types) in the code, and we cannot rely on state machine replication. This
|
|
||||||
means that the specific semantics of CRDTs have to be taken into account
|
|
||||||
everywhere in the code, which is often not a problem but sometimes adds some
|
|
||||||
complexity. More importantly, this means that a whole class of features cannot
|
|
||||||
be implemented in Garage, like those that would require some form of locking or
|
|
||||||
exclusive access. In practice, this has been causing us issues on the
|
|
||||||
CreateBucket endpoint, which by definition is meant to exclusively associate a
|
|
||||||
bucket name to a newly created bucket. In current Garage versions, concurrent
|
|
||||||
calls to CreateBucket with the same name may create several buckets and leave
|
|
||||||
Garage in an inconsistent state.
|
|
||||||
|
|
||||||
This leads naturally to the following question: is it possible to implement a
|
|
||||||
consensus algorithm that eschews the shortcomings of Raft-like algorithms in
|
|
||||||
unreliable systems? And in particular, is it possible to implement a consensus
|
|
||||||
algorithm that does not elect a leader, and is therefore not sensitive to
|
|
||||||
temporary slowdowns or unavailabilities of individual nodes? A new paper by
|
|
||||||
Antoniadis et al., [*Leaderless
|
|
||||||
Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
|
|
||||||
[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
|
|
||||||
suggests that the answer is *yes*. However, as with all new research, putting
|
|
||||||
it into practice will take some time and a lot of work. I will discuss in this
|
|
||||||
article practical questions posed by the *Leaderless Consensus* paper, and
|
|
||||||
further steps that could be taken to advance on these issues.
|
|
||||||
|
|
||||||
Please note that the entire content of this article is **purely speculative**
|
|
||||||
and does not include any *positive results*. Note also that we are not
|
|
||||||
discussing Byzantine-tolerant systems, which seem to be the main focus of
|
|
||||||
*Leaderless Consensus*, even though the authors also propose an algorithm for
|
|
||||||
non-Byzantine systems (the one we are interested in).
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Main takeaways of *Leaderless Consensus*
|
|
||||||
|
|
||||||
To be able to meaningfully say that an algorithm is *leaderless*, one has to first
|
|
||||||
determine what *leaderless* precisely means. The paper starts by offering such
|
|
||||||
a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
|
|
||||||
where *n* nodes are running in synchronous steps where at most *k* nodes might be
|
|
||||||
offline, paused, or otherwise unavailable, at each step.
|
|
||||||
The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
|
|
||||||
to better model the behaviour of WAN links on the Internet, although I am not sure
|
|
||||||
of the precise difference between the two. Once the *synchronous-k* network model
|
|
||||||
is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
|
|
||||||
that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
|
|
||||||
Concretely, this means that at any given time, a random node in the network may be
|
|
||||||
disconnected (not always the same one), and the consensus algorithm will be impacted
|
|
||||||
only minimally. In other words, we can say that a leaderless consensus algorithm
|
|
||||||
degrades gracefully in the presence of transient node failures.
|
|
||||||
This "graceful degradation" property, which Raft does not have,
|
|
||||||
seems to be exactly what we are looking for in a potential consensus algorithm that
|
|
||||||
could be added to Garage.
|
|
||||||
|
|
||||||
Having given this definition, the paper continues by offering concrete
|
|
||||||
algorithms to implement leaderless consensus. Of particular interest to us, the
|
|
||||||
paper presents in Section 5 a leaderless consensus algorithm, which they call
|
|
||||||
OFT-Archipelago, which works in message passing systems without Byzantine
|
|
||||||
nodes, where the only faults that can occur are message omissions (like
|
|
||||||
messages being dropped by the network, or temporary node crashes). This is
|
|
||||||
exactly the premise made by Garage, so this algorithm could be a good candidate
|
|
||||||
for us. Interestingly, while leaderless consensus is formally defined as a
|
|
||||||
consensus algorithm that works in a *synchronous-1* system (i.e. tolerating
|
|
||||||
only one failed node at each step), Archipelago works with up to *f < n/2*
|
|
||||||
unavailable nodes at each time steps.
|
|
||||||
|
|
||||||
According to the benchmarks in the leaderless consensus paper, while
|
|
||||||
Archipelago has very good throughput (around 50kops/s), the latency of
|
|
||||||
individual operations is generally between 1 or 2 seconds. This seems to be
|
|
||||||
acceptable for application in Garage if used only for administrative operations
|
|
||||||
on buckets and access keys which are relatively rare. From a theoretical point
|
|
||||||
of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
|
|
||||||
however it is not clear to me whether there is an upper bound on the
|
|
||||||
termination time, or whether there is a probabilistic analysis of the
|
|
||||||
termination delays that could be made. It is also not very clear to me the
|
|
||||||
link between this algorithm and the FLP impossibility theorem: since
|
|
||||||
Archipelago seems to do things that are forbidden by FLP, it means that the
|
|
||||||
premise of a *synchronous-k* system is probably in fact much stronger that the
|
|
||||||
network asynchrony assumed by FLP.
|
|
||||||
|
|
||||||
Among the other advantages of OFT-Archipelago is the fact that the algorithm
|
|
||||||
seems to be very simple, much more than Raft, as it is described in the paper
|
|
||||||
in only 42 lines of very understandable pseudocode. There is also a BFT
|
|
||||||
variant of Archipelago, which is not of interest to us in the context of Garage
|
|
||||||
as we are making the hypothesis that all nodes are trusted.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Where to go from now?
|
|
||||||
|
|
||||||
Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
|
|
||||||
questions need to be answered, among which:
|
|
||||||
|
|
||||||
- How should Archipelago interact with Garage's use of CRDT data types? Do we
|
|
||||||
have to create a fully separate subsystem for things that are managed under
|
|
||||||
consensus, or can we hopefully share some logic? More precisely, can we use
|
|
||||||
a consensus algorithm simply as a total order broadcast primitive that
|
|
||||||
becomes a mandatory passing point for all modification requests on a set of
|
|
||||||
metadata tables, with those tables still being based on the CRDT table
|
|
||||||
replication and synchronisation library which is currently in use in Garage?
|
|
||||||
In this situation, nodes that come back from a crash can simply catch up on
|
|
||||||
old changes using the Merkle tree algorithm synchronisation algorithm that we
|
|
||||||
already have. Or must we use the consensus algorithm as the only way to
|
|
||||||
broadcast operations and data for the tables that are managed by it? This
|
|
||||||
would mean that we must add specific logic to handle the case of a node
|
|
||||||
coming back from a crash, where it must either download all the log of
|
|
||||||
operations since it was last up, or an entire snapshot of the metadata tables
|
|
||||||
in question. I think this is mostly related to the reason we want to add
|
|
||||||
consensus, and the exact consistency guarantees we are expecting it to
|
|
||||||
provide to us.
|
|
||||||
|
|
||||||
- Can Archipelago be made correct under cluster reconfiguration scenarios? This
|
|
||||||
is linked to the work done for task 3 of the 2023 NLnet project
|
|
||||||
([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
|
|
||||||
[#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
|
|
||||||
on making the Quorum-based algorithm for CRDT updates reliable even when the
|
|
||||||
cluster layout is updated. I will be writing more about this topic in a
|
|
||||||
future blog post, but in a nutshell, the NLnet task is mainly focused on
|
|
||||||
maintaining read-after-write consistency in Garage at all times, which has
|
|
||||||
led us to develop a relatively general framework for modeling algorithm based
|
|
||||||
on quorums. Since Archipelago also guarantees its correctness using a
|
|
||||||
non-empty-intersection-of-quorums property, it could benefit from the work
|
|
||||||
that was originally made on quorums for the CRDT algorithms.
|
|
||||||
|
|
||||||
If we obtain satisfactory answers to these questions, the remaining work will be
|
|
||||||
the technical implementation of Archipelago in Garage and its validation:
|
|
||||||
|
|
||||||
- Determine more precisely how the pipelined version of Archipelago is made,
|
|
||||||
as its complete description is not given in the leaderless consensus paper,
|
|
||||||
only a few basic pointers (Section 8.1 of the JPDC version).
|
|
||||||
|
|
||||||
- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
|
|
||||||
that could be used outside of the context of Garage.
|
|
||||||
|
|
||||||
- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
|
|
||||||
the async-raft crate). We should benchmark the algorithms in the following
|
|
||||||
scenarios: stable networking, high latency and jitter, evolutive situation
|
|
||||||
with different phases. My hypothesis is that Archipelago could be slower (in
|
|
||||||
terms of latency, not necessarily in throughput) than Raft in the stable
|
|
||||||
networking scenario, but the other two scenarios would force Raft to
|
|
||||||
reconfigure often (i.e. change leaders), which could be the source of huge
|
|
||||||
performance penalties, which Archipelago would not suffer from.
|
|
||||||
|
|
||||||
- Integrate Archipelago with Garage to solve the CreateBucket issue.
|
|
||||||
|
|
||||||
- To validate our implementation, we would want to test it using automated
|
|
||||||
testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
|
|
||||||
3 and I'm starting to understand quite well how it works, so this could be
|
|
||||||
relatively easy.
|
|
||||||
|
|
||||||
- If we want to go further, there is always the possibility of formalizing a
|
|
||||||
proof of our implementation, however I don't know what are the good tools to
|
|
||||||
do this, and in all cases it would be an extreme amount of work.
|
|
||||||
|
|
||||||
|
|
||||||
Please send your comments and feedback to
|
|
||||||
[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
|
|
||||||
context of closed, trusted systems such as distributed databases, and not in
|
|
||||||
large trustless networks such as blockchains.
|
|
||||||
|
|
||||||
Written by [Alex Auvolat](https://adnab.me).
|
|
2
garage
2
garage
|
@ -1 +1 @@
|
||||||
Subproject commit ffa659433d45d5186acc618134c5561bf9b21f37
|
Subproject commit 36bd21a148089fac4a04e1559f2456435e66de87
|
Loading…
Add table
Reference in a new issue