+++
title="Maintaining Read-after-Write consistency in all circumstances"
date=2023-12-25
+++

*Garage is a data storage system that is based on CRDTs internally. It does not
use a consensus algorithm such as Raft, so maintaining consistency in a cluster
has to be done by other means. Since its inception, Garage has made use of read
and write quorums to guarantee read-after-write consistency, the only
consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
is not maintained when the composition of a cluster is updated and data is
moved between storage nodes. As part of our current NLnet-funded project, we
are developing a solution to this problem, which is briefly explained in this
blog post.*

<!-- more -->

---

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:
**Read-after-write consistency.** *If a client A writes an object x (e.g. using PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket.

This consistency guarantee at the level of objects in the S3 API is in fact a
reflection of read-after-write consistency in the internal metadata engine of
Garage (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that handles both operation A and operation B. In the case
where A is a write (an update) and B is a read, that node will have the
opportunity to return the value written in A to the reading client B. A visual
depiction of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).
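
To make the quorum argument more concrete, here is a minimal, self-contained
sketch of a 2-out-of-3 quorum key/value store. This is a toy model, not
Garage's actual code, and the node indices, key and value are made up for
illustration; the point is simply that any read quorum shares at least one node
with any earlier write quorum, and that merging responses by version number
makes that node's copy win:

```rust
use std::collections::HashMap;

const QUORUM: usize = 2; // out of 3 replica nodes

/// One storage node: an in-memory key/value map. Values carry a version
/// number so that the most recent one wins when read responses are merged.
#[derive(Default)]
struct Node {
    data: HashMap<String, (u64, String)>,
}

struct Cluster {
    nodes: Vec<Node>,
}

impl Cluster {
    /// Send the write to every node that is currently reachable; report OK
    /// to the client only if a quorum of nodes stored the value.
    fn write(&mut self, reachable: &[usize], key: &str, version: u64, value: &str) -> bool {
        for &i in reachable {
            self.nodes[i]
                .data
                .insert(key.to_string(), (version, value.to_string()));
        }
        reachable.len() >= QUORUM
    }

    /// Gather responses from a quorum of nodes and keep the highest version.
    fn read(&self, reachable: &[usize], key: &str) -> Option<String> {
        if reachable.len() < QUORUM {
            return None; // not enough responses: the read fails
        }
        reachable
            .iter()
            .filter_map(|&i| self.nodes[i].data.get(key))
            .max_by_key(|(version, _)| *version)
            .map(|(_, value)| value.clone())
    }
}

fn main() {
    let mut cluster = Cluster {
        nodes: (0..3).map(|_| Node::default()).collect(),
    };
    // The write only reaches nodes 1 and 2 (node 0 is slow or unreachable):
    // that is still a quorum, so the client gets an OK.
    assert!(cluster.write(&[1, 2], "x", 1, "hello"));
    // A later read gets answers from nodes 0 and 1: node 1 belongs to both
    // quorums, so the read necessarily sees the value written above.
    assert_eq!(cluster.read(&[0, 1], "x"), Some("hello".to_string()));
}
```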

Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for the bucket
and access key tables, which might not be something we can fix due to different
requirements on the quorums.
## Current issues with Read-after-Write consistency

Maintaining Read-after-Write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where this
intersection may be empty: when the set of nodes assigned to storing some
entries changes, for instance when nodes are added or removed and data is being
rebalanced between nodes.
### A concrete example

Take the case of a partition (a subset of the data stored by Garage) which is
stored on nodes A, B and C. At some point, a layout change occurs in the
cluster, and after the change, nodes A, D and E are responsible for storing the
partition. All read and write operations that were initiated before the layout
change, or by nodes that were not yet aware of the new layout version, will be
directed to nodes A, B and C, and will be handled by a quorum of two nodes among
those three. However, once the new layout is introduced in the cluster, read
and write operations will start being directed to nodes A, D and E, expecting a
quorum of two nodes among this new set of three nodes.

Crucially, coordinating when operations start being directed to the new layout
is a hard problem, and in all cases we must assume that due to some network
asynchrony, there can still be some nodes that keep sending requests to nodes
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this basically means that there is no way to guarantee the
intersection-of-quorums property, which is necessary for read-after-write, with
such a simplistic scheme.
Concretely, here is a very simple scenario in which read-after-write is broken:

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully. The written data sent to node A is lost or delayed
   for a long time.
2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. This request receives OK responses from nodes D and E,
   both containing no data but still forming a quorum of 2 responses. So the
   read returns a null value instead of the value that was written before, even
   though the write operation reported a success.
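
In the same toy-model style as the sketch above (again, not Garage's actual
code; the node names follow the scenario), here is how the write quorum {B, C}
and the read quorum {D, E} fail to intersect once the layout has changed:

```rust
use std::collections::HashMap;

fn main() {
    // One copy of the value (or nothing) per node, for a single key.
    let mut nodes: HashMap<&str, Option<&str>> =
        ["A", "B", "C", "D", "E"].iter().map(|&n| (n, None)).collect();

    // 1. Write under the old layout {A, B, C}: only B and C store the value,
    //    which is a quorum of 2, so the client is told the write succeeded.
    //    The copy destined for node A is lost or delayed.
    nodes.insert("B", Some("hello"));
    nodes.insert("C", Some("hello"));

    // 2. The new layout {A, D, E} is introduced; D and E have not yet pulled
    //    the data stored on B and C.

    // 3. Read under the new layout: D and E answer first, forming a quorum of
    //    two responses that both contain no data, so the read returns nothing
    //    even though the earlier write reported a success.
    let responses = [nodes["D"], nodes["E"]];
    assert!(responses.iter().all(|r| r.is_none()));
}
```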
### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated.

To make such tests, I chose to use the Jepsen testing framework, which helps us
put distributed software in complex adverse scenarios and verify whether it
respects some claimed consistency guarantees or not. I will not go into too
much detail on the testing procedure, but suffice it to say that issues were
found. More precisely, I was able to show that Garage did guarantee
read-after-write in a variety of adverse scenarios such as network partitions,
node crashes and clock scrambling, but that it was unable to do so as soon as
regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544).
## Fixing Read-after-Write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of information
in the cluster. We will also have to adapt our data transfer strategy and our
quorums to make sure that data can be found when it is requested.
Basically, here is how we will make sure that read-after-write is guaranteed:
- Several versions of the cluster layout can be live in the cluster at the same time.

- When multiple cluster layout versions are live, the writes are directed to
  all of the live versions.

- Nodes will synchronize data so that the nodes in the newest live layout
  version will catch up with the older live layout versions.

- Reads are initially directed to the oldest live layout version, but will
  progressively be moved to the newer versions once the synchronizations are
  complete.

- Once all nodes are reading from newer layout versions, the oldest live versions
  can be pruned and the corresponding data deleted.

More precisely, the following modifications are made to how quorums are used in
read/write operations and how the sync is performed (a small code sketch of
these rules follows the list):

- Writes are sent to all nodes responsible for the partition in all live
  layout versions, and will return OK only when they receive a quorum of OK
  responses for each of the live layout versions. This means that writes could
  be a bit slower when a layout change is being synchronized in the cluster.

- Reads are sent to the newest live layout version for which all nodes have
  completed a sync to catch up on existing data, and only expect a quorum of 2
  responses among the three nodes of that layout version. This way, reads
  always stay as performant as when no layout update is in progress.

- A sync for a new layout version is not initiated until all cluster nodes have
  acknowledged receiving that version and having finished all write operations
  that were only addressed to previous layout versions. This makes sure that no
  data will be missed by the sync: once the sync has started, no more data can
  be written only to old layout versions. All of the writes will also be
  directed to the new nodes (more exactly: all data that the source nodes of
  the sync do not yet contain when the sync starts is written by a write
  operation that is also directed at a quorum of nodes among the new ones),
  meaning that at the end of the sync, a read quorum among the new nodes will
  necessarily return an up-to-date copy of all of the data.

- The oldest live layout version can be pruned once all nodes have completed a
  sync to a newer version AND all nodes have acknowledged that fact, signaling
  that they are no longer reading from that old version and are now reading
  from a newer version instead. After being pruned, the old layout version is
  no longer live, and nodes that are no longer designated to store data in the
  newer layout versions can simply delete the data that they were storing.
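
Here is that sketch: a rough illustration with hypothetical types and function
names, not Garage's implementation. It shows which node sets a write and a read
target while two layout versions are live, and when a write may report OK:

```rust
const QUORUM: usize = 2; // out of 3 nodes per layout version

/// One live layout version: the nodes responsible for a given partition,
/// and whether they have finished syncing pre-existing data.
struct LayoutVersion {
    version: u64,
    nodes: [&'static str; 3],
    synced: bool,
}

/// A write is sent to the nodes of *every* live layout version, and returns
/// OK only once it has a quorum of acknowledgements in each of them.
fn write_is_ok(acks_per_version: &[usize]) -> bool {
    acks_per_version.iter().all(|&acks| acks >= QUORUM)
}

/// A read is only sent to the newest layout version that is fully synced,
/// and expects a plain quorum of 2 among those three nodes.
fn read_targets(live: &[LayoutVersion]) -> &[&'static str; 3] {
    &live
        .iter()
        .filter(|l| l.synced)
        .max_by_key(|l| l.version)
        .expect("the oldest live version is always synced")
        .nodes
}

fn main() {
    let live = [
        LayoutVersion { version: 1, nodes: ["A", "B", "C"], synced: true },
        LayoutVersion { version: 2, nodes: ["A", "D", "E"], synced: false },
    ];
    // Two acks among {A, B, C} but only one among {A, D, E}: the write waits.
    assert!(!write_is_ok(&[2, 1]));
    assert!(write_is_ok(&[2, 2]));
    // Reads keep going to the old version until the new one has caught up.
    assert_eq!(read_targets(&live), &["A", "B", "C"]);
}
```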

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update
trackers", which track the following (a simplified sketch of this state follows
the list):

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions. A node will not "ack" (acknowledge) a new layout version while it
  still has outstanding write operations that were not directed to the nodes
  included in that version. Once all nodes have acknowledged a new version, we
  know that all write operations that are made in the cluster are directed to
  the nodes that were added in this layout version.

- The `sync` layout tracker keeps track of nodes finishing a full metadata table
  sync, which is only started once all nodes have `ack`'ed the new layout
  version.

- The `sync_ack` layout tracker keeps track of nodes receiving the `sync`
  tracker update for all cluster nodes, and thus starting to direct reads to
  the newly synchronized layout version. This makes it possible to know when no
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.
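
Here is that simplified sketch (a made-up structure, not Garage's actual code):
each tracker is modeled as a map from node ID to the layout version number that
node has reached for the corresponding stage, and the conditions described
above are derived from the minimum over all nodes:

```rust
use std::collections::HashMap;

/// Per-node progress trackers. Each one maps a node ID to the layout version
/// number that this node has reached for the corresponding stage.
struct UpdateTrackers {
    ack: HashMap<String, u64>,      // latest layout version acknowledged
    sync: HashMap<String, u64>,     // latest layout version fully synced
    sync_ack: HashMap<String, u64>, // latest version known to be synced
                                    // everywhere, i.e. the one read from
}

impl UpdateTrackers {
    /// The version reached by the slowest node for a given stage.
    fn min(tracker: &HashMap<String, u64>) -> u64 {
        tracker.values().copied().min().unwrap_or(0)
    }

    /// The sync of version `v` may only start once every node has ack'ed it:
    /// after that point, no write can go only to older layout versions.
    fn sync_can_start(&self, v: u64) -> bool {
        Self::min(&self.ack) >= v
    }

    /// Version `v` is synced everywhere once the slowest node has synced it.
    fn sync_is_complete(&self, v: u64) -> bool {
        Self::min(&self.sync) >= v
    }

    /// Layout versions strictly older than this one are read by nobody
    /// anymore, so they can be pruned and their data deleted.
    fn oldest_live_version(&self) -> u64 {
        Self::min(&self.sync_ack)
    }
}

fn main() {
    // Hypothetical cluster of three nodes A, D and E in the middle of a
    // transition from layout version 1 to layout version 2.
    let per_node = |values: &[u64]| -> HashMap<String, u64> {
        ["A", "D", "E"].iter().zip(values).map(|(n, &v)| (n.to_string(), v)).collect()
    };
    let trackers = UpdateTrackers {
        ack: per_node(&[2, 2, 2]),      // everyone has seen layout version 2
        sync: per_node(&[2, 2, 1]),     // node E has not finished syncing it
        sync_ack: per_node(&[1, 1, 1]), // so everyone still reads version 1
    };
    assert!(trackers.sync_can_start(2));
    assert!(!trackers.sync_is_complete(2));
    assert_eq!(trackers.oldest_live_version(), 1); // version 1 not prunable yet
}
```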
## Current status and future work