3 changed files with 1 additions and 197 deletions
--- a/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf
+++ b/content/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf
--- a/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
+++ b/content/blog/2023-11-thoughts-on-leaderless-consensus/index.md
@ -1,196 +0,0 @@
 +++
 title='Thoughts on "Leaderless Consensus"'
 date=2023-11-30
 +++
 *Consensus algorithms such as Raft and Paxos, which are used in many distributed databases,
 have notoriously unpredictable performance in low-quality networks that suffer from
 latency, jitter, packet loss and/or unavailable nodes, which is why Garage does not use
 them and uses only CRDTs. A new paper by Antoniadis et al., [*Leaderless Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151),
 introduces a new category of algorithms that better tolerate the frequent 
 unavailability of a subset of nodes. However, additional research and practical work is required before
 these results can be put into practice. Read for more details.*
 <!-- more -->
 ---
 As I have said many times when presenting Garage, we have made a point of not
 using any consensus algorithm in Garage and using only CRDTs, for several
 reasons.  The first, and most important reason, is that all of the consensus
 algorithms that we know of[^1] (in particular Raft, which is very popular in
 distributed databases) suffer from unpredictable performance when nodes or the
 network are unreliable.  Even in relatively stable conditions, Raft-like
 algorithms can still be much slower than CRDTs (as we have shown in some
 [benchmarks](https://garagehq.deuxfleurs.fr/documentation/design/benchmarks/#on-a-complex-simulated-network))
 because they elect a leader node and require all operations to pass through the
 leader, which can become a bottleneck.  Other than performance issues, Raft is
 a complex algorithm and implementing it correctly is a challenging software
 engineering endeavor that we did not wish to undertake, preferring instead
 simplicity as a foundational principle to help us write correct software.
 However, writing a distributed system such as Garage can be challenging when
 consensus is not available, as we can only use CRDTs (conflict-free replicated
 data types) in the code, and we cannot rely on state machine replication.  This
 means that the specific semantics of CRDTs have to be taken into account
 everywhere in the code, which is often not a problem but sometimes adds some
 complexity.  More importantly, this means that a whole class of features cannot
 be implemented in Garage, like those that would require some form of locking or
 exclusive access.  In practice, this has been causing us issues on the
 CreateBucket endpoint, which by definition is meant to exclusively associate a
 bucket name to a newly created bucket. In current Garage versions, concurrent
 calls to CreateBucket with the same name may create several buckets and leave
 Garage in an inconsistent state.
 This leads naturally to the following question: is it possible to implement a
 consensus algorithm that eschews the shortcomings of Raft-like algorithms in
 unreliable systems?  And in particular, is it possible to implement a consensus
 algorithm that does not elect a leader, and is therefore not sensitive to
 temporary slowdowns or unavailabilities of individual nodes?  A new paper by
 Antoniadis et al., [*Leaderless
 Consensus*](https://www.sciencedirect.com/science/article/abs/pii/S0743731523000151)
 [[PDF](/blog/2023-11-thoughts-on-leaderless-consensus/2023-Leaderless_consensus_JPDC.pdf)],
 suggests that the answer is *yes*.  However, as with all new research, putting
 it into practice will take some time and a lot of work.  I will discuss in this
 article practical questions posed by the *Leaderless Consensus* paper, and
 further steps that could be taken to advance on these issues.
 Please note that the entire content of this article is **purely speculative**
 and does not include any *positive results*. Note also that we are not
 discussing Byzantine-tolerant systems, which seem to be the main focus of
 *Leaderless Consensus*, even though the authors also propose an algorithm for
 non-Byzantine systems (the one we are interested in).
 ---
 ## Main takeaways of *Leaderless Consensus*
 To be able to meaningfully say that an algorithm is *leaderless*, one has to first
 determine what *leaderless* precisely means. The paper starts by offering such
 a definition, using a network model they call *synchronous-k* ("synchronous minus *k*"),
 where *n* nodes are running in synchronous steps where at most *k* nodes might be
 offline, paused, or otherwise unavailable, at each step.
 The *synchronous-k* model has a variant called *eventually synchronous-k* which seems
 to better model the behaviour of WAN links on the Internet, although I am not sure
 of the precise difference between the two. Once the *synchronous-k* network model
 is defined, a leaderless consensus algorithm is simply defined as a consensus algorithm
 that still works (i.e. it terminates, giving a decision), in a *synchronous-1* system.
 Concretely, this means that at any given time, a random node in the network may be
 disconnected (not always the same one), and the consensus algorithm will be impacted
 only minimally. In other words, we can say that a leaderless consensus algorithm
 degrades gracefully in the presence of transient node failures.
 This "graceful degradation" property, which Raft does not have,
 seems to be exactly what we are looking for in a potential consensus algorithm that
 could be added to Garage.
 Having given this definition, the paper continues by offering concrete
 algorithms to implement leaderless consensus. Of particular interest to us, the
 paper presents in Section 5 a leaderless consensus algorithm, which they call
 OFT-Archipelago, which works in message passing systems without Byzantine
 nodes, where the only faults that can occur are message omissions (like
 messages being dropped by the network, or temporary node crashes).  This is
 exactly the premise made by Garage, so this algorithm could be a good candidate
 for us. Interestingly, while leaderless consensus is formally defined as a
 consensus algorithm that works in a *synchronous-1* system (i.e.  tolerating
 only one failed node at each step), Archipelago works with up to *f < n/2*
 unavailable nodes at each time steps.
 According to the benchmarks in the leaderless consensus paper, while
 Archipelago has very good throughput (around 50kops/s), the latency of
 individual operations is generally between 1 or 2 seconds. This seems to be
 acceptable for application in Garage if used only for administrative operations
 on buckets and access keys which are relatively rare. From a theoretical point
 of view, OFT-Archipelago can terminate in 3 RTT in the optimal scenario,
 however it is not clear to me whether there is an upper bound on the
 termination time, or whether there is a probabilistic analysis of the
 termination delays that could be made.  It is also not very clear to me the
 link between this algorithm and the FLP impossibility theorem: since
 Archipelago seems to do things that are forbidden by FLP, it means that the
 premise of a *synchronous-k* system is probably in fact much stronger that the
 network asynchrony assumed by FLP.
 Among the other advantages of OFT-Archipelago is the fact that the algorithm
 seems to be very simple, much more than Raft, as it is described in the paper
 in only 42 lines of very understandable pseudocode.  There is also a BFT
 variant of Archipelago, which is not of interest to us in the context of Garage
 as we are making the hypothesis that all nodes are trusted.
 ---
 ## Where to go from now?
 Before an algorithm such as OFT-Archipelago can be added to Garage, a few fundamental
 questions need to be answered, among which:
 - How should Archipelago interact with Garage's use of CRDT data types? Do we
  have to create a fully separate subsystem for things that are managed under
  consensus, or can we hopefully share some logic?  More precisely, can we use
  a consensus algorithm simply as a total order broadcast primitive that
  becomes a mandatory passing point for all modification requests on a set of
  metadata tables, with those tables still being based on the CRDT table
  replication and synchronisation library which is currently in use in Garage?
  In this situation, nodes that come back from a crash can simply catch up on
  old changes using the Merkle tree algorithm synchronisation algorithm that we
  already have.  Or must we use the consensus algorithm as the only way to
  broadcast operations and data for the tables that are managed by it? This
  would mean that we must add specific logic to handle the case of a node
  coming back from a crash, where it must either download all the log of
  operations since it was last up, or an entire snapshot of the metadata tables
  in question. I think this is mostly related to the reason we want to add
  consensus, and the exact consistency guarantees we are expecting it to
  provide to us.
 - Can Archipelago be made correct under cluster reconfiguration scenarios? This
  is linked to the work done for task 3 of the 2023 NLnet project
  ([#495](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/495),
  [#667](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/667)), which focuses
  on making the Quorum-based algorithm for CRDT updates reliable even when the
  cluster layout is updated. I will be writing more about this topic in a
  future blog post, but in a nutshell, the NLnet task is mainly focused on
  maintaining read-after-write consistency in Garage at all times, which has
  led us to develop a relatively general framework for modeling algorithm based
  on quorums. Since Archipelago also guarantees its correctness using a
  non-empty-intersection-of-quorums property, it could benefit from the work
  that was originally made on quorums for the CRDT algorithms.
 If we obtain satisfactory answers to these questions, the remaining work will be
 the technical implementation of Archipelago in Garage and its validation:
 - Determine more precisely how the pipelined version of Archipelago is made,
  as its complete description is not given in the leaderless consensus paper,
  only a few basic pointers (Section 8.1 of the JPDC version).
 - Implement Archipelago in Rust, ideally under the form of a generic reusable crate
  that could be used outside of the context of Garage.
 - Do a benchmark of Archipelago vs. existing Raft implementations (for instance
  the async-raft crate).  We should benchmark the algorithms in the following
  scenarios: stable networking, high latency and jitter, evolutive situation
  with different phases.  My hypothesis is that Archipelago could be slower (in
  terms of latency, not necessarily in throughput) than Raft in the stable
  networking scenario, but the other two scenarios would force Raft to
  reconfigure often (i.e. change leaders), which could be the source of huge
  performance penalties, which Archipelago would not suffer from.
 - Integrate Archipelago with Garage to solve the CreateBucket issue.
 - To validate our implementation, we would want to test it using automated
  testing frameworks such as Jepsen. I've been using Jepsen for the NLnet task
  3 and I'm starting to understand quite well how it works, so this could be
  relatively easy.
 - If we want to go further, there is always the possibility of formalizing a
  proof of our implementation, however I don't know what are the good tools to
  do this, and in all cases it would be an extreme amount of work.
 Please send your comments and feedback to
 [garagehq@deuxfleurs.fr](mailto:garagehq@deuxfleurs.fr) if you have any.
 ---
 <sup id="1">1</sup>: We are concerned only with consensus algorithms in the
 context of closed, trusted systems such as distributed databases, and not in
 large trustless networks such as blockchains.
 Written by [Alex Auvolat](https://adnab.me).
--- a/2
+++ b/2
@ -1 +1 @@
-Subproject commit ffa659433d45d5186acc618134c5561bf9b21f37
+Subproject commit 36bd21a148089fac4a04e1559f2456435e66de87
		`@ -1 +1 @@`
			`Subproject commit ffa659433d45d5186acc618134c5561bf9b21f37`				`Subproject commit 36bd21a148089fac4a04e1559f2456435e66de87`