+++
title="Maintaining Read-after-Write consistency in all circumstances"
date=2023-12-25
+++

*Garage is a data storage system that is based on CRDTs internally. It does not
use a consensus algorithm such as Raft, so maintaining consistency in a cluster
has to be done by other means. Since its inception, Garage has made use of read
and write quorums to guarantee read-after-write consistency, the only
consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
is not maintained when the composition of a cluster is updated and data is
moved between storage nodes. As part of our current NLnet-funded project, we
are developing a solution to this problem, which is briefly explained in this
blog post.*

<!-- more -->

---

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:
**Read-after-write consistency.** *If a client A writes an object x (e.g. using PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket.

This consistency guarantee at the level of objects in the S3 API is in fact a
reflection of read-after-write consistency in the internal metadata engine of
Garage (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that handles both operation A and operation B. In the case
where A is a write (an update) and B is a read, that node will have the
opportunity to return the value written in A to the reading client B. A visual
depiction of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).
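
To make the quorum argument more concrete, here is a minimal, self-contained
sketch of a 2-out-of-3 quorum key/value store. This is a toy model, not
Garage's actual code, and the node indices, key and value are made up for
illustration; the point is simply that any read quorum shares at least one node
with any earlier write quorum, and that merging responses by version number
makes that node's copy win:

```rust
use std::collections::HashMap;

const QUORUM: usize = 2; // out of 3 replica nodes

/// One storage node: an in-memory key/value map. Values carry a version
/// number so that the most recent one wins when read responses are merged.
#[derive(Default)]
struct Node {
    data: HashMap<String, (u64, String)>,
}

struct Cluster {
    nodes: Vec<Node>,
}

impl Cluster {
    /// Send the write to every node that is currently reachable; report OK
    /// to the client only if a quorum of nodes stored the value.
    fn write(&mut self, reachable: &[usize], key: &str, version: u64, value: &str) -> bool {
        for &i in reachable {
            self.nodes[i]
                .data
                .insert(key.to_string(), (version, value.to_string()));
        }
        reachable.len() >= QUORUM
    }

    /// Gather responses from a quorum of nodes and keep the highest version.
    fn read(&self, reachable: &[usize], key: &str) -> Option<String> {
        if reachable.len() < QUORUM {
            return None; // not enough responses: the read fails
        }
        reachable
            .iter()
            .filter_map(|&i| self.nodes[i].data.get(key))
            .max_by_key(|(version, _)| *version)
            .map(|(_, value)| value.clone())
    }
}

fn main() {
    let mut cluster = Cluster {
        nodes: (0..3).map(|_| Node::default()).collect(),
    };
    // The write only reaches nodes 1 and 2 (node 0 is slow or unreachable):
    // that is still a quorum, so the client gets an OK.
    assert!(cluster.write(&[1, 2], "x", 1, "hello"));
    // A later read gets answers from nodes 0 and 1: node 1 belongs to both
    // quorums, so the read necessarily sees the value written above.
    assert_eq!(cluster.read(&[0, 1], "x"), Some("hello".to_string()));
}
```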

Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for the bucket
and access key tables, which might not be something we can fix due to different
requirements on the quorums.
## Current issues with Read-after-Write consistency

Maintaining Read-after-Write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where this
intersection may be empty: when the set of nodes assigned to storing some
entries changes, for instance when nodes are added or removed and data is being
rebalanced between nodes.
### A concrete example

Take the case of a partition (a subset of the data stored by Garage) which is
stored on nodes A, B and C. At some point, a layout change occurs in the
cluster, and after the change, nodes A, D and E are responsible for storing the
partition. All read and write operations that were initiated before the layout
change, or by nodes that were not yet aware of the new layout version, will be
directed to nodes A, B and C, and will be handled by a quorum of two nodes among
those three. However, once the new layout is introduced in the cluster, read
and write operations will start being directed to nodes A, D and E, expecting a
quorum of two nodes among this new set of three nodes.

Crucially, coordinating when operations start being directed to the new layout
is a hard problem, and in all cases we must assume that due to some network
asynchrony, there can still be some nodes that keep sending requests to nodes
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this basically means that there is no way to guarantee the
intersection-of-quorums property, which is necessary for read-after-write, with
such a simplistic scheme.
Concretely, here is a very simple scenario in which read-after-write is broken:

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully. The written data sent to node A is lost or delayed
   for a long time.
2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. This request receives OK responses from nodes D and E,
   both containing no data but still forming a quorum of 2 responses. So the
   read returns a null value instead of the value that was written before, even
   though the write operation reported a success.
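
In the same toy-model style as the sketch above (again, not Garage's actual
code; the node names follow the scenario), here is how the write quorum {B, C}
and the read quorum {D, E} fail to intersect once the layout has changed:

```rust
use std::collections::HashMap;

fn main() {
    // One copy of the value (or nothing) per node, for a single key.
    let mut nodes: HashMap<&str, Option<&str>> =
        ["A", "B", "C", "D", "E"].iter().map(|&n| (n, None)).collect();

    // 1. Write under the old layout {A, B, C}: only B and C store the value,
    //    which is a quorum of 2, so the client is told the write succeeded.
    //    The copy destined for node A is lost or delayed.
    nodes.insert("B", Some("hello"));
    nodes.insert("C", Some("hello"));

    // 2. The new layout {A, D, E} is introduced; D and E have not yet pulled
    //    the data stored on B and C.

    // 3. Read under the new layout: D and E answer first, forming a quorum of
    //    two responses that both contain no data, so the read returns nothing
    //    even though the earlier write reported a success.
    let responses = [nodes["D"], nodes["E"]];
    assert!(responses.iter().all(|r| r.is_none()));
}
```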
### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated.

To make such tests, I chose to use the Jepsen testing framework, which helps us
put distributed software in complex adverse scenarios and verify whether it
respects some claimed consistency guarantees or not. I will not go into too
much detail on the testing procedure, but suffice it to say that issues were
found. More precisely, I was able to show that Garage did guarantee
read-after-write in a variety of adverse scenarios such as network partitions,
node crashes and clock scrambling, but that it was unable to do so as soon as
regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544).
## Fixing Read-after-Write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of information
in the cluster. We will also have to adapt our data transfer strategy and our
quorums to make sure that data can be found when it is requested.
Basically, here is how we will make sure that read-after-write is guaranteed:
- Several versions of the cluster layout can be live in the cluster at the same time.

- When multiple cluster layout versions are live, the writes are directed to
  all of the live versions.

- Nodes will synchronize data so that the nodes in the newest live layout
  version will catch up with the older live layout versions.

- Reads are initially directed to the oldest live layout version, but will
  progressively be moved to the newer versions once the synchronizations are
  complete.

- Once all nodes are reading from newer layout versions, the oldest live versions
  can be pruned and the corresponding data deleted.

More precisely, the following modifications are made to how quorums are used in
read/write operations and how the sync is performed (a small code sketch of
these rules follows the list):

- Writes are sent to all nodes responsible for the partition in all live
  layout versions, and will return OK only when they receive a quorum of OK
  responses for each of the live layout versions. This means that writes could
  be a bit slower when a layout change is being synchronized in the cluster.

- Reads are sent to the newest live layout version for which all nodes have
  completed a sync to catch up on existing data, and only expect a quorum of 2
  responses among the three nodes of that layout version. This way, reads
  always stay as performant as when no layout update is in progress.

- A sync for a new layout version is not initiated until all cluster nodes have
  acknowledged receiving that version and having finished all write operations
  that were only addressed to previous layout versions. This makes sure that no
  data will be missed by the sync: once the sync has started, no more data can
  be written only to old layout versions. All of the writes will also be
  directed to the new nodes (more exactly: all data that the source nodes of
  the sync do not yet contain when the sync starts is written by a write
  operation that is also directed at a quorum of nodes among the new ones),
  meaning that at the end of the sync, a read quorum among the new nodes will
  necessarily return an up-to-date copy of all of the data.

- The oldest live layout version can be pruned once all nodes have completed a
  sync to a newer version AND all nodes have acknowledged that fact, signaling
  that they are no longer reading from that old version and are now reading
  from a newer version instead. After being pruned, the old layout version is
  no longer live, and nodes that are no longer designated to store data in the
  newer layout versions can simply delete the data that they were storing.
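
Here is that sketch: a rough illustration with hypothetical types and function
names, not Garage's implementation. It shows which node sets a write and a read
target while two layout versions are live, and when a write may report OK:

```rust
const QUORUM: usize = 2; // out of 3 nodes per layout version

/// One live layout version: the nodes responsible for a given partition,
/// and whether they have finished syncing pre-existing data.
struct LayoutVersion {
    version: u64,
    nodes: [&'static str; 3],
    synced: bool,
}

/// A write is sent to the nodes of *every* live layout version, and returns
/// OK only once it has a quorum of acknowledgements in each of them.
fn write_is_ok(acks_per_version: &[usize]) -> bool {
    acks_per_version.iter().all(|&acks| acks >= QUORUM)
}

/// A read is only sent to the newest layout version that is fully synced,
/// and expects a plain quorum of 2 among those three nodes.
fn read_targets(live: &[LayoutVersion]) -> &[&'static str; 3] {
    &live
        .iter()
        .filter(|l| l.synced)
        .max_by_key(|l| l.version)
        .expect("the oldest live version is always synced")
        .nodes
}

fn main() {
    let live = [
        LayoutVersion { version: 1, nodes: ["A", "B", "C"], synced: true },
        LayoutVersion { version: 2, nodes: ["A", "D", "E"], synced: false },
    ];
    // Two acks among {A, B, C} but only one among {A, D, E}: the write waits.
    assert!(!write_is_ok(&[2, 1]));
    assert!(write_is_ok(&[2, 2]));
    // Reads keep going to the old version until the new one has caught up.
    assert_eq!(read_targets(&live), &["A", "B", "C"]);
}
```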

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update
trackers", which track the following (a simplified sketch of this state follows
the list):

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions. A node will not "ack" (acknowledge) a new layout version while it
  still has outstanding write operations that were not directed to the nodes
  included in that version. Once all nodes have acknowledged a new version, we
  know that all write operations that are made in the cluster are directed to
  the nodes that were added in this layout version.

- The `sync` layout tracker keeps track of nodes finishing a full metadata table
  sync, which is only started once all nodes have `ack`'ed the new layout
  version.

- The `sync_ack` layout tracker keeps track of nodes receiving the `sync`
  tracker update for all cluster nodes, and thus starting to direct reads to
  the newly synchronized layout version. This makes it possible to know when no
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.
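
Here is that simplified sketch (a made-up structure, not Garage's actual code):
each tracker is modeled as a map from node ID to the layout version number that
node has reached for the corresponding stage, and the conditions described
above are derived from the minimum over all nodes:

```rust
use std::collections::HashMap;

/// Per-node progress trackers. Each one maps a node ID to the layout version
/// number that this node has reached for the corresponding stage.
struct UpdateTrackers {
    ack: HashMap<String, u64>,      // latest layout version acknowledged
    sync: HashMap<String, u64>,     // latest layout version fully synced
    sync_ack: HashMap<String, u64>, // latest version known to be synced
                                    // everywhere, i.e. the one read from
}

impl UpdateTrackers {
    /// The version reached by the slowest node for a given stage.
    fn min(tracker: &HashMap<String, u64>) -> u64 {
        tracker.values().copied().min().unwrap_or(0)
    }

    /// The sync of version `v` may only start once every node has ack'ed it:
    /// after that point, no write can go only to older layout versions.
    fn sync_can_start(&self, v: u64) -> bool {
        Self::min(&self.ack) >= v
    }

    /// Version `v` is synced everywhere once the slowest node has synced it.
    fn sync_is_complete(&self, v: u64) -> bool {
        Self::min(&self.sync) >= v
    }

    /// Layout versions strictly older than this one are read by nobody
    /// anymore, so they can be pruned and their data deleted.
    fn oldest_live_version(&self) -> u64 {
        Self::min(&self.sync_ack)
    }
}

fn main() {
    // Hypothetical cluster of three nodes A, D and E in the middle of a
    // transition from layout version 1 to layout version 2.
    let per_node = |values: &[u64]| -> HashMap<String, u64> {
        ["A", "D", "E"].iter().zip(values).map(|(n, &v)| (n.to_string(), v)).collect()
    };
    let trackers = UpdateTrackers {
        ack: per_node(&[2, 2, 2]),      // everyone has seen layout version 2
        sync: per_node(&[2, 2, 1]),     // node E has not finished syncing it
        sync_ack: per_node(&[1, 1, 1]), // so everyone still reads version 1
    };
    assert!(trackers.sync_can_start(2));
    assert!(!trackers.sync_is_complete(2));
    assert_eq!(trackers.oldest_live_version(), 1); // version 1 not prunable yet
}
```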
## Current status and future work