garagehq.deuxfleurs.fr/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md

193 lines
11 KiB
Markdown
Raw Permalink Normal View History

2023-11-30 12:41:38 +00:00
+++
title='Thoughts on "Leaderless Consensus"'
date=2023-11-30
+++
*Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
have notoriously unpredictable performance in low-quality networks that suffer from
latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
introduces a new category of algorithms that better tolerate the frequent
unavailability of a subset of nodes. However, additional research and practical work is required before
these results can be put into practice. Read for more details.*
<!-- more -->
---
As I have said many times when presenting Garage, we have made a point of not
using any consensus algorithm in Garage and using only CRDTs, for several
reasons. The first, and most important reason, is that all of the consensus
algorithms that we know of[^1] (in particular Raft, which is very popular in
distributed databases) suffer from unpredictable performance when nodes or the
network are unreliable. Even in relatively stable conditions, Raft-like
algorithms can still be much slower than CRDTs (as we have shown in some
[benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
because they elect a leader node and require all operations to pass through the
leader, which can become a bottleneck. Other than performance issues, Raft is
a complex algorithm and implementing it correctly is a challenging software
engineering endeavor that we did not wish to undertake, preferring instead
simplicity as a foundational principle to help us write correct software.
However, writing a distributed system such as Garage can be challenging when
consensus is not available, as we can only use CRDTs (conflict-free replicated
data types) in the code, and we cannot rely on state machine replication. This
means that the specific semantics of CRDTs have to be taken into account
everywhere in the code, which is often not a problem but sometimes adds some
complexity. More importantly, this means that a whole class of features cannot
be implemented in Garage, like those that would require some form of locking or
exclusive access. In practice, this has been causing us issues on the
CreateBucket endpoint, which by definition is meant to exclusively associate a
bucket name to a newly created bucket. In current Garage versions, concurrent
calls to CreateBucket with the same name may create several buckets and leave
Garage in an inconsistent state.
This leads naturally to the following question: is it possible to implement a
consensus algorithm that eschews the shortcomings of Raft-like algorithms in
unreliable systems? And in particular, is it possible to implement a consensus
algorithm that does not elect a leader, and is therefore not sensitive to
temporary slowdowns or unavailabilities of individual nodes? A new paper by
Antoniadis et al., [*Leaderless
Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
[[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
suggests that the answer is *yes*. However, as with all new research, putting
it into practice will take some time and a lot of work. I will discuss in this
article practical questions posed by the *Leaderless Consensus* paper, and
further steps that could be taken to advance on these issues.
Please note that the entire content of this article is **purely speculative**
and does not include any *positive results*. Note also that we are not
discussing Byzantine-tolerant systems, which seem to be the main focus of
*Leaderless Consensus*, even though the authors also propose an algorithm for
non-Byzantine systems (the one we are interested in).
## Main takeaways of *Leaderless Consensus*
To be able to meaningfully say that an algorithm is *leaderless*, one has to first
determine what *leaderless* precisely means. The paper starts by offering such
a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
where *n* nodes are running in synchronous steps where at most *k* nodes might be
offline, paused, or otherwise unavailable, at each step.
The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
to better model the behaviour of WAN links on the Internet, although I am not sure
of the precise difference between the two. Once the *synchronous-k* network model
is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
Concretely, this means that at any given time, a random node in the network may be
disconnected (not always the same one), and the consensus algorithm will be impacted
only minimally. In other words, we can say that a leaderless consensus algorithm
degrades gracefully in the presence of transient node failures.
This "graceful degradation" property, which Raft does not have,
seems to be exactly what we are looking for in a potential consensus algorithm that
could be added to Garage.
Having given this definition, the paper continues by offering concrete
algorithms to implement leaderless consensus. Of particular interest to us, the
paper presents in Section 5 a leaderless consensus algorithm, which they call
OFT-Archipelago, which works in message passing systems without Byzantine
nodes, where the only faults that can occur are message omissions (like
messages being dropped by the network, or temporary node crashes). This is
exactly the premise made by Garage, so this algorithm could be a good candidate
for us. Interestingly, while leaderless consensus is formally defined as a
consensus algorithm that works in a *synchronous-1* system (i.e. tolerating
only one failed node at each step), Archipelago works with up to *f < n/2*
unavailable nodes at each time steps.
According to the benchmarks in the leaderless consensus paper, while
Archipelago has very good throughput (around 50kops/s), the latency of
individual operations is generally between 1 or 2 seconds. This seems to be
acceptable for application in Garage if used only for administrative operations
on buckets and access keys which are relatively rare. From a theoretical point
of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
however it is not clear to me whether there is an upper bound on the
termination time, or whether there is a probabilistic analysis of the
termination delays that could be made. It is also not very clear to me the
link between this algorithm and the FLP impossibility theorem: since
Archipelago seems to do things that are forbidden by FLP, it means that the
premise of a *synchronous-k* system is probably in fact much stronger that the
network asynchrony assumed by FLP.
Among the other advantages of OFT-Archipelago is the fact that the algorithm
seems to be very simple, much more than Raft, as it is described in the paper
in only 42 lines of very understandable pseudocode. There is also a BFT
variant of Archipelago, which is not of interest to us in the context of Garage
as we are making the hypothesis that all nodes are trusted.
## Where to go from now?
Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
questions need to be answered, among which:
- How should Archipelago interact with Garage's use of CRDT data types? Do we
have to create a fully separate subsystem for things that are managed under
consensus, or can we hopefully share some logic? More precisely, can we use
a consensus algorithm simply as a total order broadcast primitive that
becomes a mandatory passing point for all modification requests on a set of
metadata tables, with those tables still being based on the CRDT table
replication and synchronisation library which is currently in use in Garage?
In this situation, nodes that come back from a crash can simply catch up on
old changes using the Merkle tree algorithm synchronisation algorithm that we
already have. Or must we use the consensus algorithm as the only way to
broadcast operations and data for the tables that are managed by it? This
would mean that we must add specific logic to handle the case of a node
coming back from a crash, where it must either download all the log of
operations since it was last up, or an entire snapshot of the metadata tables
in question. I think this is mostly related to the reason we want to add
consensus, and the exact consistency guarantees we are expecting it to
provide to us.
- Can Archipelago be made correct under cluster reconfiguration scenarios? This
is linked to the work done for task 3 of the 2023 NLnet project
([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
[#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
on making the Quorum-based algorithm for CRDT updates reliable even when the
cluster layout is updated. I will be writing more about this topic in a
future blog post, but in a nutshell, the NLnet task is mainly focused on
maintaining read-after-write consistency in Garage at all times, which has
led us to develop a relatively general framework for modeling algorithm based
on quorums. Since Archipelago also guarantees its correctness using a
non-empty-intersection-of-quorums property, it could benefit from the work
that was originally made on quorums for the CRDT algorithms.
If we obtain satisfactory answers to these questions, the remaining work will be
the technical implementation of Archipelago in Garage and its validation:
- Determine more precisely how the pipelined version of Archipelago is made,
as its complete description is not given in the leaderless consensus paper,
only a few basic pointers (Section 8.1 of the JPDC version).
- Implement Archipelago in Rust, ideally under the form of a generic reusable crate
that could be used outside of the context of Garage.
- Do a benchmark of Archipelago vs. existing Raft implementations (for instance
the async-raft crate). We should benchmark the algorithms in the following
scenarios: stable networking, high latency and jitter, evolutive situation
with different phases. My hypothesis is that Archipelago could be slower (in
terms of latency, not necessarily in throughput) than Raft in the stable
networking scenario, but the other two scenarios would force Raft to
reconfigure often (i.e. change leaders), which could be the source of huge
performance penalties, which Archipelago would not suffer from.
- Integrate Archipelago with Garage to solve the CreateBucket issue.
- To validate our implementation, we would want to test it using automated
testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
3 and I'm starting to understand quite well how it works, so this could be
relatively easy.
- If we want to go further, there is always the possibility of formalizing a
proof of our implementation, however I don't know what are the good tools to
do this, and in all cases it would be an extreme amount of work.
Please send your comments and feedback to
[garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
---
<sup id="1">1</sup>: We are concerned only with consensus algorithms in the
context of closed, trusted systems such as distributed databases, and not in
large trustless networks such as blockchains.
Written by [Alex Auvolat](https://adnab.me).