garagehq.deuxfleurs.fr/content/blog/2023-12-preserving-read-after-write-consistency/index.md
Alex 96b7e90976
All checks were successful
continuous-integration/drone/push Build is passing
wip: blog post on read-after-write and nlnet task 3
2023-12-01 13:30:29 +01:00

7.1 KiB

+++ title="Maintaining Read-after-Write consistency in all circumstances" date=2023-12-25 +++

Garage is a data storage system that is based on CRDTs internally. It does not use a consensus algorithm such as Raft, therefore maintaining consistency in a cluster has to be done by other means. Since its inception, Garage has made use of read and write quorums to guarantee read-after-write consistency, the only consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee is not maintained when the composition of a cluster is updated and data is moved between storage nodes. As part of our current NLnet-funded project, we are developping a solution to this problem, that is briefly explained in this blog post.


Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

Read-after-write consistency. If a client A writes an object x (e.g. using PutObject) and receives a HTTP 200 OK response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.

The consistency guarantee offered by Garage is slightly more general than this simplistic formulation, as it also applies to other S3 endpoints such as ListObjects, which are always guaranteed to reflect the latest version of objects inserted in a bucket.

This consistency guarantee at the level of objects in the S3 API is in fact a reflection of read-after-write consistency in the internal metadata engine of Garage (which is a distributed key/value store with CRDT values). Reads and writes to metadata tables use quorums of 2 out of 3 nodes for each operation, ensuring that if operation B starts after operation A has completed, then there is at least one node that is handling both operation A and B. In the case where A is a write (an update) and B is a read, that node will have the opportunity to return the value written in A to the reading client B. A visual depiction of this process can be found in this presentation on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).

Note that read-after-write guarantees are broken and have always been for the bucket and access key tables, which might not be something we can fix due to different requirements on the quorums.

Current issues with Read-after-Write consistency

Maintaining Read-after-Write consistency depends crucially on the intersection of the quorums being non-empty. There is however a scenario where these quorums may be empty: when the set of nodes affected to storing some entries changes, for instance when nodes are added or removed and data is being rebalanced between nodes.

A concrete example

Take the case of a partition (a subset of the data stored by Garage) which is stored on nodes A, B and C. At some point, a layout change occurs in the cluster, and after the change, nodes A, D and E are responsible for storing the partition. All read and write operations that were initiated before the layout change, or by nodes that were not yet aware of the new layout version, will be directed to nodes A, B and C, and will be handled by a quorum of two nodes among those three. However, once the new layout is introduced in the cluster, read and write operations will start being directed to nodes A, D and E, expecting a quorum of two nodes among this new set of three nodes.

Crucially, coordinating when operations start being directed to the new layout is a hard problem, and in all cases we must assume that due to some network asynchrony, there can still be some nodes that keep sending requests to nodes A, B and C for a long time even after everyone else is aware of the new layout. Moreover, data will be progressively moved from nodes B and C to nodes D and E, which can take a long time depending on the quantity of data. This creates a period of uncertainty as to where exactly the data is stored in the cluster. Overall, this basically means that there is no way to guarantee the intersection-of-quorums property, which is necessary for read-after-write, with such a simplistic scheme.

Concretely, here is a very simple scenario in which read-after-write is broken:

  1. A write operation is directed to nodes A, B and C (the old layout), and receives OK responses from nodes B and C, forming a quorum, so the write completes successfully. The written data sent to node A is lost or delayed for a long time.

  2. The new layout version is introduced in the cluster.

  3. Before nodes D and E have had the chance to retrieve the data that was stored on nodes B and C, a read operation for the same key is directed to nodes A, D and E. This request receives OK responses from nodes D and E, both containing no data but still forming a quorum of 2 responses. So the read returns a null value instead of the value that was written before, even though the write operation reported a success.

Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing framework to show that read-after-write consistency issues could in fact arise in Garage when the cluster layout was updated.

To make such tests, I chose to use the Jepsen testing framework, which helps us put distributed software in complex adverse scenarios and verify whether they respect some claimed consistency guarantees or not. I will not enter into too much detail on the testing procedure, but suffice to say that issues were found. More precisely, I was able to show that Garage did guarantee read-after-write in a variety of adverse scenarios such as network partitions, node crashes and clock scrambling, but that it was unable to do so as soon as regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in PR #544

Fixing Read-after-Write consistency when layouts change

To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency:

  • use write quorums amongst nodes of layout version n
  • use two read quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n

This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast, however it might be a bit more tricky to implement:

  • use two write quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
  • use read quorum amongst nodes of layout n-1

We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval.

Current status and future work