+++
title="Maintaining read-after-write consistency in all circumstances"
date=2023-12-06
+++

*Garage is a data storage system that is based on CRDTs internally. It does not
use a consensus algorithm such as Raft; maintaining consistency in a cluster
therefore has to be done by other means. Since its inception, Garage has made
use of read and write quorums to guarantee read-after-write consistency, the
only consistency guarantee it provides. However, as of Garage v0.9.0, this
guarantee is not maintained when the composition of a cluster is updated and
data is moved between storage nodes. As part of our current NLnet-funded
project, we are developing a solution to this problem. This blog post gives a
high-level overview of the proposed solution.*

<!-- more -->

---

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

**Read-after-write consistency.** *If a client A writes an object x (e.g. using
PutObject) and receives an `HTTP 200 OK` response, and later a client B tries to
read object x (e.g. using GetObject), then B will read the version written by
A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket. Note that Amazon calls this guarantee [*strong*
read-after-write consistency](https://aws.amazon.com/s3/consistency/) (they
also have it on AWS), to differentiate it from [another definition of
read-after-write consistency](https://avikdas.com/2020/04/13/scalability-concepts-read-after-write-consistency.html)
that only applies to data that is read by the same client that wrote it. Since
that weaker form is also called
[read-your-writes](https://jepsen.io/consistency/models/read-your-writes), I
will always be referring to the strong version when using the term
"read-after-write consistency".

In Garage, this consistency guarantee at the level of objects in the S3 API is
in fact a reflection of read-after-write consistency in the internal metadata
engine (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that handles both operations A and B. In the case where
A is a write (an update) and B is a read, that node will have the opportunity
to return the value written in A to the reading client B. A visual depiction
of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page
54).

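To make this argument concrete, here is a minimal sketch (plain Rust, not
Garage's code) of why overlapping quorums yield read-after-write consistency:
with 3 replicas and quorums of 2 nodes for both reads and writes, any read
quorum intersects any write quorum, so at least one node answering the read
has already seen the write.

```rust
use std::collections::HashSet;

/// Minimal sketch: with `n` replicas, a write quorum of `w` and a read quorum
/// of `r`, read-after-write holds as long as any two quorums intersect,
/// i.e. `r + w > n`. Garage's metadata engine uses n = 3, w = 2, r = 2.
fn quorums_always_intersect(n: usize, w: usize, r: usize) -> bool {
    r + w > n
}

fn main() {
    // The quorums used by Garage's metadata engine.
    assert!(quorums_always_intersect(3, 2, 2));

    // Example: a write acknowledged by {B, C} and a read answered by {A, B}
    // share node B, which can return the freshly written value.
    let write_quorum: HashSet<&str> = ["B", "C"].into_iter().collect();
    let read_quorum: HashSet<&str> = ["A", "B"].into_iter().collect();
    assert!(!write_quorum.is_disjoint(&read_quorum));
}
```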
|
|
Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for metadata
related to buckets and access keys, which might not be something we can fix due
to different requirements on the quorums for the related metadata tables.

## Current issues with read-after-write consistency

Maintaining read-after-write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where these quorums
may not intersect: when the set of nodes assigned to storing some entries
changes, for instance when nodes are added or removed and data is being
rebalanced between nodes.

### A concrete example

Take the case of a partition (a subset of the data stored by Garage) which is
stored on nodes A, B and C. At some point, a layout change occurs in the
cluster, and after the change, nodes A, D and E are responsible for storing the
partition. All read and write operations that were initiated before the layout
change, or by nodes that were not yet aware of the new layout version, will be
directed to nodes A, B and C, and will be handled by a quorum of two nodes among
those three. However, once the new layout is introduced in the cluster, read
and write operations will start being directed to nodes A, D and E, expecting a
quorum of two nodes among this new set of three nodes.

Crucially, coordinating when operations start being directed to the new layout
is a hard problem, and in all cases we must assume that due to some network
asynchrony, there can still be some nodes that keep sending requests to nodes
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this means that this simplistic scheme gives us no way to guarantee
the intersection-of-quorums property, which is necessary for read-after-write
consistency.

Concretely, here is a very simple scenario in which read-after-write is broken
(the quorum arithmetic is sketched in code right after the list):

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully. The written data then arrives at node A as well.

2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. D and E both return an OK response with no data (a null
   value), because they are not yet up-to-date. An answer from node A is not
   received in time. The two responses from nodes D and E, which contain no
   data, still form a quorum, so the read returns a null value instead of the
   value that was written before, even though the write operation reported a
   success.

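As a quick illustration of this scenario (a minimal sketch in plain Rust, not
Garage code): the write quorum taken in the old layout and the read quorum
taken in the new layout do not intersect, so nothing forces the read to
observe the write.

```rust
use std::collections::HashSet;

fn main() {
    // Nodes responsible for the partition before and after the layout change.
    let old_layout: HashSet<&str> = ["A", "B", "C"].into_iter().collect();
    let new_layout: HashSet<&str> = ["A", "D", "E"].into_iter().collect();

    // Quorums from the scenario above: the write was acknowledged by B and C,
    // the read was answered by D and E.
    let write_quorum: HashSet<&str> = ["B", "C"].into_iter().collect();
    let read_quorum: HashSet<&str> = ["D", "E"].into_iter().collect();

    assert!(write_quorum.is_subset(&old_layout));
    assert!(read_quorum.is_subset(&new_layout));

    // The two quorums do not intersect: no node answering the read has
    // necessarily seen the write, so read-after-write can be violated.
    assert!(write_quorum.is_disjoint(&read_quorum));
}
```
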
### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated. To make such tests, I chose to
use the [Jepsen](https://jepsen.io/) testing framework, which helps us put
distributed software in complex adverse scenarios and verify whether it
respects some claimed consistency guarantees or not.

I will not go into too much detail on the testing procedure, but suffice it to
say that issues were found. More precisely, I was able to show that Garage
*did* guarantee read-after-write in a variety of adverse scenarios such as
network partitions, node crashes and clock scrambling, but that it was unable
to do so as soon as regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544).

## Fixing read-after-write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of
information in the cluster. We will also have to adapt our read/write quorums
and our data transfer strategy during rebalancing to make sure that data can be
found when it is requested.

First of all, we adapted Garage's code to be able to handle *several versions
of the cluster layout* that can be live in the cluster at the same time, to
keep track of multiple possible locations for data that is currently being
transferred between nodes. When multiple cluster layout versions are live,
write operations are directed to all of the nodes responsible for storing the
data in all the live versions. This ensures that the nodes in the oldest live
layout version always have an up-to-date view of the data, and that a read
quorum among those nodes is always a safe way to ensure read-after-write
consistency.

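As a rough sketch of this write rule (hypothetical, simplified types, not
Garage's actual internals), a write can be considered complete only once it
has gathered a quorum of acknowledgements in every live layout version:

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one live layout version: the set of
/// nodes responsible for a given partition in that version.
struct LayoutVersion {
    nodes: Vec<&'static str>,
}

/// A write is only considered complete once a quorum of nodes has
/// acknowledged it in *every* live layout version.
fn write_is_complete(live_versions: &[LayoutVersion], acked_by: &HashSet<&str>) -> bool {
    live_versions.iter().all(|layout| {
        let acks = layout.nodes.iter().filter(|n| acked_by.contains(*n)).count();
        acks >= layout.nodes.len() / 2 + 1
    })
}

fn main() {
    // Old version: {A, B, C}; new version: {A, D, E}. Writes are sent to the
    // union {A, B, C, D, E} while both versions are live.
    let live = vec![
        LayoutVersion { nodes: vec!["A", "B", "C"] },
        LayoutVersion { nodes: vec!["A", "D", "E"] },
    ];

    // Acks from A and B form a quorum in the old version only: keep waiting.
    let acks: HashSet<&str> = ["A", "B"].into_iter().collect();
    assert!(!write_is_complete(&live, &acks));

    // With D as well, both versions have a 2-out-of-3 quorum: the write is OK.
    let acks: HashSet<&str> = ["A", "B", "D"].into_iter().collect();
    assert!(write_is_complete(&live, &acks));
}
```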
|
|
Nodes will progressively synchronize data so that the nodes in the newest live
layout version will catch up with data stored by nodes in the older live layout
version. Once nodes in the newer layout versions also have an up-to-date view
of the data, read operations will progressively start using a quorum of nodes
in the new layout version instead of the old one.

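The corresponding read rule could be sketched as follows (again with
hypothetical names, not Garage's actual API): reads target the newest live
layout version whose synchronization has completed everywhere, which is the
old version until the new one has caught up.

```rust
/// Hypothetical, simplified description of a live layout version.
struct LayoutVersion {
    version: u64,
    nodes: Vec<&'static str>,
    /// True once all nodes of this version have finished syncing the data
    /// they are responsible for (this corresponds to the `sync` tracker
    /// described later in this post).
    synced: bool,
}

/// Reads are directed to the newest live layout version that is fully
/// synced; newer versions are only write targets until their sync completes.
fn read_targets(live_versions: &[LayoutVersion]) -> &[&'static str] {
    live_versions
        .iter()
        .filter(|v| v.synced)
        .max_by_key(|v| v.version)
        .map(|v| v.nodes.as_slice())
        .expect("the oldest live version is always synced")
}

fn main() {
    let live = vec![
        LayoutVersion { version: 1, nodes: vec!["A", "B", "C"], synced: true },
        LayoutVersion { version: 2, nodes: vec!["A", "D", "E"], synced: false },
    ];
    // While version 2 is still syncing, reads keep using {A, B, C}.
    assert_eq!(read_targets(&live), &["A", "B", "C"]);
}
```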
|
|
Once all nodes are reading from newer layout versions, the oldest live versions
can be pruned. This means that writes will stop being directed to those nodes,
and the nodes will delete the data they were storing. Obviously, in the (very
common) case where some nodes are in both the old and new layout versions,
those nodes will not delete their data and they will continue to receive
writes.

### Performance impacts

When multiple layout versions are live, writes are sent to all nodes
responsible for the partition of the requested key in all live layout
versions, and will return OK only when they receive a quorum of OK responses
for each of the live layout versions. This means that writes could be a bit
slower when a layout change is being synchronized in the cluster. Typically, if
only one node changes between the old and the new layout version, the write
operation will wait for 3 responses among 4 requests, instead of the classical
2 responses among 3 requests.

Concerning reads, they are still sent to only three nodes. Indeed, they are
sent to the nodes of the newest live layout version whose nodes have completed
a sync to catch up on existing data, and they only expect a quorum of 2
responses among the 3 nodes of that layout version. This way, reads always
stay as performant as when no layout change is being processed.

### Ensuring that new nodes are up-to-date

An additional coordination mechanism is necessary for the data synchronization
procedure, to ensure that it is not started too early and that after it
completes, the nodes in the new layout indeed contain an up-to-date view of
the data.

Indeed, imagine the following adverse scenario, which we want to avoid: a new
layout version is introduced in the cluster, and nodes immediately start
copying the data to the new nodes. However, some write operations that were
initiated before the new layout was introduced (or that were handled by a node
not yet aware of the layout) could be delayed, and the written data might not
yet have been received by the old nodes when they sent their copy of
everything. When the sync reports completion, and read operations start being
directed to nodes of the new layout, the written data might be missing from
the nodes handling the read, and read-after-write consistency could be
violated.

To avoid this situation, the synchronization operation is not initiated until
all cluster nodes have reported an "acknowledge" of the new layout version,
indicating that they have received the new layout version, and that they are no
longer processing write operations that were only addressed to nodes of the
previous layout versions. This makes sure that no data will be missed by the
sync: once the sync has started, no more data can be written only to old layout
versions. All of the writes will also be directed to the new nodes. More
precisely, any data that the source nodes of the sync do not yet contain when
the sync starts is written by a write operation that is also directed at a
quorum of nodes among the new ones. This means that at the end of the sync, a
read quorum among the new nodes will necessarily return an up-to-date copy of
all of the data.

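Sketched in code (hypothetical structures, not Garage's actual
implementation), this gating condition boils down to comparing the smallest
layout version acknowledged across all nodes with the version we want to
synchronize to:

```rust
use std::collections::HashMap;

/// Hypothetical view of the `ack` tracker: for each node, the newest layout
/// version it has acknowledged (i.e. it no longer sends writes that target
/// only older versions).
fn sync_can_start(ack_tracker: &HashMap<&str, u64>, new_version: u64) -> bool {
    // The sync towards `new_version` may only begin once *every* node has
    // acknowledged that version (or a newer one).
    ack_tracker.values().all(|&acked| acked >= new_version)
}

fn main() {
    let mut ack: HashMap<&str, u64> = HashMap::new();
    ack.insert("A", 2);
    ack.insert("B", 2);
    ack.insert("C", 1); // C still sends writes that target only version 1.
    assert!(!sync_can_start(&ack, 2));

    ack.insert("C", 2); // Once C acknowledges version 2...
    assert!(sync_can_start(&ack, 2)); // ...the sync can safely begin.
}
```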
|
|
### Details on update trackers

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update
trackers", which track the following information (a sketch in code follows the
list):

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions and indicating that they are no longer processing writes addressed
  only to older layout versions. Once all nodes have acknowledged a new
  version, we know that all in-progress and future write operations that are
  made in the cluster are directed to the nodes that were added in this layout
  version as well.

- The `sync` layout tracker keeps track of nodes finishing a full metadata
  table sync that was started after all nodes `ack`'ed the new layout version.

- The `sync_ack` layout tracker keeps track of nodes receiving the `sync`
  tracker update for all cluster nodes, and thus starting to direct reads to
  the newly synchronized layout version. This makes it possible to know when no
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.

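Pulling the previous conditions together, here is a minimal sketch of how the
three trackers could be represented and queried (a hypothetical structure, not
Garage's actual code):

```rust
use std::collections::HashMap;

/// Hypothetical representation of the three layout update trackers: for each
/// node, the newest layout version it has reached for that stage.
struct UpdateTrackers {
    ack: HashMap<&'static str, u64>,
    sync: HashMap<&'static str, u64>,
    sync_ack: HashMap<&'static str, u64>,
}

/// Smallest version reached by every node for one tracker.
fn min_version(tracker: &HashMap<&'static str, u64>) -> u64 {
    tracker.values().copied().min().unwrap_or(0)
}

impl UpdateTrackers {
    /// The sync towards version `v` may start once every node has acked `v`.
    fn sync_can_start(&self, v: u64) -> bool {
        min_version(&self.ack) >= v
    }

    /// Reads target the newest version that every node has finished syncing.
    fn read_version(&self) -> u64 {
        min_version(&self.sync)
    }

    /// Versions strictly older than this are read from by no one and can be
    /// pruned: writes stop going to their nodes and extra data is deleted.
    fn prune_floor(&self) -> u64 {
        min_version(&self.sync_ack)
    }
}

fn main() {
    let trackers = UpdateTrackers {
        ack: HashMap::from([("A", 2), ("B", 2), ("C", 2)]),
        sync: HashMap::from([("A", 2), ("B", 2), ("C", 1)]),
        sync_ack: HashMap::from([("A", 1), ("B", 1), ("C", 1)]),
    };
    assert!(trackers.sync_can_start(2)); // every node acked version 2
    assert_eq!(trackers.read_version(), 1); // C has not finished syncing v2
    assert_eq!(trackers.prune_floor(), 1); // version 1 must be kept for now
}
```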
|
|
In the simplest scenario, only two layout versions are live, and these trackers
therefore can only have the values `n` (the new layout version) and `n-1` (the
old one). However, this mechanism handles the general case where several
successive layout updates are being processed and more than two layout versions
are live simultaneously. The layout update trackers can take as values the
version numbers of any currently live layout version.

### What about dead nodes?

In this post, I have repeatedly used phrases like "once all nodes have
acknowledged a new layout version" or "once all nodes have completed a sync".
This obviously means that if some nodes are dead or unresponsive, the
processing of the layout update can be delayed indefinitely, and nodes in the
old layout versions will keep receiving writes and storing unnecessary data.
This is an unfortunate fact with the method proposed here. To cover these
situations, the following workarounds can be applied:

- A layout change is generally a supervised operation, meaning that a system
  administrator may manually intervene to inform the cluster that certain nodes
  are dead and that their layout tracker values should not be taken into
  account.

- For the `sync` update tracker, we don't actually need to wait for all of the
  synchronizations to terminate: quorums can be used instead, as they should be
  sufficient to ensure that the copied data is up-to-date.

- For the `ack` and `sync_ack` update trackers, we can automatically increase
  them for all nodes (even dead ones) after a certain time delay, as there is
  no reason for the changes to take more than e.g. 10 minutes to propagate in
  regular conditions. We might not enable this behaviour by default, though,
  due to its possible impacts on consistency.

## Current status and future work

The work described in this blog post is currently almost complete, but it still
needs to be ironed out. I have made a first run of Jepsen testing on the new
code, which showed that the changes seem to fix the issue. I will be running
longer and more intensive runs of Jepsen testing once the code is finished, to
make sure everything is fine. The changes will require a major update of
Garage: this will be the v0.10.0 release, which will probably be finished in
January or February of 2024. This will be a very safe and transparent update,
as only the layout data structure is changed and nothing related to object
storage itself is touched.

If I had the time to do so, I would write up the algorithm described in this
post in a formal way, in the form of a scientific paper. I believe such a paper
would be worth presenting at a scientific conference or publishing in a
journal, especially since it is motivated by a very concrete use case and has
been validated quite thoroughly (with Jepsen). Unfortunately, this is not my
highest priority at the moment.

---

Written by [Alex Auvolat](https://adnab.me).