+++
title="Maintaining read-after-write consistency in all circumstances"
date=2023-12-25
+++

*[...] of read and write quorums to guarantee read-after-write consistency, the
only consistency guarantee it provides. However, as of Garage v0.9.0, this
guarantee is not maintained when the composition of a cluster is updated and
data is moved between storage nodes. As part of our current NLnet-funded
project, we are developing a solution to this problem. This blog post gives a
high-level overview of the proposed solution.*

<!-- more -->

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

**Read-after-write consistency.** *If a client A writes an object x (e.g. using
PutObject) and receives an `HTTP 200 OK` response, and later a client B tries
to read object x (e.g. using GetObject), then B will read the version written
by A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket. Note that Amazon calls this guarantee [*strong*
read-after-write consistency](https://aws.amazon.com/s3/consistency/) (they
also have it on AWS), to differentiate it from [another definition of
read-after-write
consistency](https://avikdas.com/2020/04/13/scalability-concepts-read-after-write-consistency.html)
that only applies to data that is read by the same client that wrote it. Since
that weaker form is also called
[read-your-writes](https://jepsen.io/consistency/models/read-your-writes), I
will always be referring to the strong version when using the term
"read-after-write consistency".

In Garage, this consistency guarantee at the level of objects in the S3 API is
in fact a reflection of read-after-write consistency in the internal metadata
engine (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that is handling both operation A and B. In the case where
A is a write (an update) and B is a read, that node will have the opportunity
to return the value written in A to the reading client B. A visual depiction
of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page
54).
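
To make this pigeonhole argument concrete, here is a minimal, standalone sketch
(in Rust, deliberately simplified and not taken from Garage's actual code) that
enumerates every possible 2-out-of-3 write quorum and read quorum and checks
that each pair shares at least one node:

```rust
fn main() {
    let nodes = ["A", "B", "C"];

    // Every quorum of 2 nodes among 3 is obtained by leaving out exactly one node.
    let quorums: Vec<Vec<&str>> = (0..nodes.len())
        .map(|skip| {
            nodes
                .iter()
                .enumerate()
                .filter(|(i, _)| *i != skip)
                .map(|(_, n)| *n)
                .collect()
        })
        .collect();

    // 2 + 2 > 3: any write quorum and any read quorum share at least one node,
    // so a read quorum always contains a node that has seen the latest write.
    for write_quorum in &quorums {
        for read_quorum in &quorums {
            assert!(write_quorum.iter().any(|n| read_quorum.contains(n)));
        }
    }
    println!("every 2-of-3 write quorum intersects every 2-of-3 read quorum");
}
```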

Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for metadata
related to buckets and access keys, which might not be something we can fix due
to different requirements on the quorums for the related metadata tables.

## Current issues with read-after-write consistency

Maintaining read-after-write consistency depends crucially on the intersection
of the quorums being non-empty. There is however a scenario where these quorums
may be empty: when the set of nodes responsible for storing some entries
changes, for instance when nodes are added or removed and data is being
rebalanced

[...] A, B and C for a long time even after everyone else is aware of the new
layout. Moreover, data will be progressively moved from nodes B and C to nodes
D and E, which can take a long time depending on the quantity of data. This
creates a period of uncertainty as to where exactly the data is stored in the
cluster. Overall, this basically means that this simplistic scheme gives us no
way to guarantee the intersection-of-quorums property, which is necessary for
read-after-write.

Concretely, here is a very simple scenario in which read-after-write is broken:

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully. The written data then arrives at node A as well.

2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. D and E both return an OK response with no data (a null
   value), because they are not yet up-to-date. An answer from node A is not
   received in time. The two responses from nodes D and E, which contain no
   data, still form a quorum, so the read returns a null value instead of the
   value that was written before, even though the write operation reported a
   success.
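
The scenario above can be replayed with a toy model. In the following sketch
(purely illustrative: nodes are reduced to in-memory maps and quorums are
simulated by hand), the two empty answers from D and E form a valid quorum that
hides the acknowledged write:

```rust
use std::collections::HashMap;

fn main() {
    // Each node is reduced to a simple in-memory key/value map.
    let mut nodes: HashMap<&str, HashMap<&str, &str>> = ["A", "B", "C", "D", "E"]
        .iter()
        .map(|&n| (n, HashMap::new()))
        .collect();

    // 1. Write under the old layout {A, B, C}: B and C acknowledge it, which is
    //    already a quorum of 2, so the client is told that the write succeeded.
    //    (The data also reaches A, but too late to matter below.)
    for n in ["A", "B", "C"] {
        nodes.get_mut(n).unwrap().insert("x", "v1");
    }

    // 2. The new layout {A, D, E} becomes live, but D and E have not yet
    //    synchronized the data that was stored on B and C.

    // 3. Read under the new layout: D and E answer first (the answer from A
    //    arrives too late), and those two empty answers form a quorum of 2.
    let answers: Vec<Option<&str>> = ["D", "E"]
        .iter()
        .map(|&n| nodes[n].get("x").copied())
        .collect();

    assert!(answers.iter().all(|a| a.is_none()));
    println!("the read saw nothing, although the write had been acknowledged");
}
```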

### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated. To make such tests, I chose to
use the [Jepsen](https://jepsen.io/) testing framework, which helps us put
distributed software in complex adverse scenarios and verify whether it
respects its claimed consistency guarantees or not.

I will not go into too much detail on the testing procedure, but suffice it to
say that issues were found. More precisely, I was able to show that Garage
*did* guarantee read-after-write in a variety of adverse scenarios such as
network partitions, node crashes and clock scrambling, but that it was unable
to do so as soon as regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544).

## Fixing read-after-write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of
information in the cluster. We will also have to adapt our read/write quorums
and our data transfer strategy during rebalancing to make sure that data can be
found when it is requested.

First of all, we adapted Garage's code to be able to handle *several versions
of the cluster layout* that can be live in the cluster at the same time, to
keep track of multiple possible locations for data that is currently being
transferred between nodes. When multiple cluster layout versions are live,
write operations are directed to all of the nodes responsible for storing the
data in all the live versions. This ensures that the nodes in the oldest live
layout version always have an up-to-date view of the data, and that a read
quorum among those nodes is always a safe way to ensure read-after-write
consistency.
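
As a rough illustration of this idea, here is a hypothetical sketch (the type
and function names are invented for this post; they are not Garage's actual
internals) of keeping several live layout versions and fanning a write out to
all of them:

```rust
use std::collections::BTreeSet;

/// One version of the cluster layout: for the partition we are interested in,
/// the set of nodes responsible for storing its data.
struct LayoutVersion {
    version: u64,
    nodes: Vec<&'static str>,
}

/// All layout versions that are still live in the cluster, oldest first.
struct LayoutHistory {
    live: Vec<LayoutVersion>,
}

impl LayoutHistory {
    /// A write must be acknowledged in every live version, so it is sent to
    /// the union of the nodes responsible for the data in each of them.
    fn write_targets(&self) -> BTreeSet<&'static str> {
        self.live.iter().flat_map(|v| v.nodes.iter().copied()).collect()
    }
}

fn main() {
    let history = LayoutHistory {
        live: vec![
            LayoutVersion { version: 1, nodes: vec!["A", "B", "C"] },
            LayoutVersion { version: 2, nodes: vec!["A", "D", "E"] },
        ],
    };
    let versions: Vec<u64> = history.live.iter().map(|v| v.version).collect();
    // While versions 1 and 2 are both live, writes go to A, B, C, D and E at once.
    println!("live versions {:?}, write to {:?}", versions, history.write_targets());
}
```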

Nodes will progressively synchronize data so that the nodes in the newest live
layout version will catch up with data stored by nodes in the older live layout
version. Once nodes in the newer layout versions also have an up-to-date view
of the data, read operations will progressively start using a quorum of nodes
in the new layout version instead of the old one.

Once all nodes are reading from newer layout versions, the oldest live versions
can be pruned. This means that writes will stop being directed to those nodes,
and the nodes will delete the data they were storing. Obviously, in the (very
common) case where some nodes are both in the old and new layout versions,
those nodes will not delete their data and they will continue to receive
writes.

### Performance impacts

When multiple layout versions are live, writes are sent to all nodes
responsible for the partition of the requested key in all live layout
versions, and will return OK only when they receive a quorum of OK responses
for each of the live layout versions. This means that writes could be a bit
slower when a layout change is being synchronized in the cluster. Typically, if
only one node is changing between the old and the new layout version, the write
operation will wait for 3 responses among 4 requests, instead of the classical
2 responses among 3 requests.
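
Continuing with the same kind of hypothetical sketch, deciding whether a write
may be reported as successful amounts to checking the quorum separately in each
live layout version. With the layouts {A, B, C} and {A, B, D}, where only one
node changes, three OK responses among the four nodes contacted are enough:

```rust
/// Hypothetical helper: a write is successful only if, in every live layout
/// version, at least `quorum` of that version's nodes have acknowledged it.
fn write_is_successful(live: &[Vec<&str>], acks: &[&str], quorum: usize) -> bool {
    live.iter()
        .all(|nodes| nodes.iter().filter(|n| acks.contains(*n)).count() >= quorum)
}

fn main() {
    let old_layout = vec!["A", "B", "C"];
    let new_layout = vec!["A", "B", "D"]; // only one node changed
    let live = [old_layout, new_layout];

    // OK responses from A, B and D: a quorum of 2 is reached in both versions,
    // i.e. 3 responses among the 4 nodes that were contacted.
    assert!(write_is_successful(&live, &["A", "B", "D"], 2));

    // A and B alone also suffice, since they belong to both versions...
    assert!(write_is_successful(&live, &["A", "B"], 2));

    // ...but C and D alone do not: neither version reaches its quorum.
    assert!(!write_is_successful(&live, &["C", "D"], 2));
    println!("per-version quorums checked");
}
```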

Concerning reads, they are still sent to only three nodes. Indeed, they are
sent to the nodes of the newest live layout version for which nodes have
completed a sync to catch up on existing data, and they only expect a quorum of
2 responses among the three nodes of that layout version. This way, reads
always stay as performant as when no layout change is being processed.
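
Read routing can be sketched in the same spirit (again with invented names, as
an assumption about how this could look rather than Garage's actual code):
among the live layout versions, pick the newest one whose sync has completed
and query only its three nodes, with the usual quorum of 2:

```rust
/// Hypothetical description of one live layout version.
struct LiveVersion {
    version: u64,
    nodes: Vec<&'static str>,
    /// Whether the data sync that brings this version's nodes up to date
    /// has completed cluster-wide.
    sync_done: bool,
}

/// Reads go to the nodes of the newest fully-synced version; older versions
/// are ignored, so a read still contacts only three nodes.
fn read_targets(live: &[LiveVersion]) -> &LiveVersion {
    live.iter()
        .filter(|v| v.sync_done)
        .max_by_key(|v| v.version)
        .expect("the oldest live version is always synced")
}

fn main() {
    let live = vec![
        LiveVersion { version: 1, nodes: vec!["A", "B", "C"], sync_done: true },
        LiveVersion { version: 2, nodes: vec!["A", "D", "E"], sync_done: false },
    ];
    let target = read_targets(&live);
    // While the sync of version 2 is still running, reads keep using version 1.
    println!("read from version {} on {:?} (quorum 2)", target.version, target.nodes);
}
```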

### Ensuring that new nodes are up-to-date

An additional coordination mechanism is necessary for the data synchronization
procedure, to ensure that it is not started too early and that, after it
completes, the nodes in the new layout indeed contain an up-to-date view of
the data.

Indeed, imagine the following adverse scenario, which we want to avoid: a new
layout version is introduced in the cluster, and nodes immediately start
copying the data to the new nodes. However, some write operations that were
initiated before the new layout was introduced (or that were handled by a node
not yet aware of the layout) could be delayed, so the written data might not
yet have been received by the old nodes when they sent their copy of
everything. When the sync reports completion, and read operations start being
directed to nodes of the new layout, the written data might be missing from the
nodes handling the read, and read-after-write consistency could be violated.

To avoid this situation, the synchronization operation is not initiated until
all cluster nodes have reported an "acknowledge" of the new layout version,
indicating that they have received the new layout version, and that they are no
longer processing write operations that were only addressed to nodes of the
previous layout versions. This makes sure that no data will be missed by the
sync: once the sync has started, no more data can be written only to old layout
versions. All of the writes will also be directed to the new nodes. More
exactly: all data that the source nodes of the sync do not yet contain when the
sync starts is written by a write operation that is also directed at a quorum
of nodes among the new ones. This means that at the end of the sync, a read
quorum among the new nodes will necessarily return an up-to-date copy of all of
the data.

### Details on update trackers

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update
trackers", which keep track of the following information:

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions and indicating that they are no longer processing writes addressed
  only to older layout versions. Once all nodes have acknowledged a new
  version, we know that all in-progress and future write operations that are
  made in the cluster are directed to the nodes that were added in this layout
  version as well.

- The `sync` layout tracker keeps track of nodes finishing a full metadata
  table sync that was started after all nodes `ack`'ed the new layout version.

- The `sync_ack` layout tracker keeps track of nodes acknowledging that a sync
  has completed and that they are now reading from the newer layout version.
  This tells us when no
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.

In the simplest scenario, only two layout versions are live, and these trackers
therefore can only have the values `n` (the new layout version) and `n-1` (the
old one). However this mechanism handles the general case where several
successive layout updates are being processed and more than two layout versions
are live simultaneously. The layout update trackers can take as values the
version numbers of any currently live layout version.
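
As a rough sketch of how these trackers could be combined (simplified, with
invented names, not Garage's actual implementation), each tracker can be seen
as a map from node to the highest layout version for which that node has
completed the corresponding step, and the conditions described above become
comparisons against the minimum over all nodes:

```rust
use std::collections::HashMap;

/// Simplified sketch of the three layout update trackers.
struct UpdateTrackers {
    ack: HashMap<&'static str, u64>,
    sync: HashMap<&'static str, u64>,
    sync_ack: HashMap<&'static str, u64>,
}

impl UpdateTrackers {
    /// The cluster-wide value of a tracker is the minimum over all nodes.
    fn min(tracker: &HashMap<&'static str, u64>) -> u64 {
        tracker.values().copied().min().unwrap_or(0)
    }

    /// The sync towards version `v` may only start once every node has
    /// acknowledged `v`, i.e. no write is still addressed only to older versions.
    fn can_start_sync(&self, v: u64) -> bool {
        Self::min(&self.ack) >= v
    }

    /// Reads use the newest layout version that every node has finished syncing.
    fn read_version(&self) -> u64 {
        Self::min(&self.sync)
    }

    /// Layout versions older than `v` can be pruned (and their data deleted)
    /// once every node has acknowledged that it now reads from `v` or newer.
    fn can_prune_below(&self, v: u64) -> bool {
        Self::min(&self.sync_ack) >= v
    }
}

fn main() {
    let nodes = ["A", "B", "C", "D", "E"];
    let all_at = |v: u64| nodes.iter().map(|&n| (n, v)).collect::<HashMap<_, _>>();

    // Every node has acknowledged version 2, but the sync towards version 2
    // has not finished yet: `sync` and `sync_ack` are still at version 1.
    let trackers = UpdateTrackers {
        ack: all_at(2),
        sync: all_at(1),
        sync_ack: all_at(1),
    };

    assert!(trackers.can_start_sync(2)); // the sync to version 2 may start
    assert_eq!(trackers.read_version(), 1); // reads still use version 1
    assert!(!trackers.can_prune_below(2)); // version 1 cannot be pruned yet
    println!("trackers checked");
}
```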

### What about dead nodes?

In this post, I have used the phrases "once all nodes have acknowledged a new
layout version" or "once all nodes have completed a sync" many times. This
obviously means that if some nodes are dead or unresponsive, the processing of
the layout update can be delayed indefinitely, and nodes in the old layout
versions will keep receiving writes and storing unnecessary data. This is an
unfortunate fact of the method proposed here. To cover for these situations,
the following workarounds can be applied:

- A layout change is generally a supervised operation, meaning that a system
  administrator may manually intervene to inform the cluster that certain nodes
  are dead and that their layout tracker values should not be taken into
  account.

- For the `sync` update tracker, we don't actually need to wait for all of the
  synchronizations to terminate: quorums can be used instead, as they should be
  sufficient to ensure that the copied data is up-to-date.

- For the `ack` and `sync_ack` update trackers, we can automatically increase
  them for all nodes (even dead ones) after a certain time delay, as there is
  no reason for the changes to take more than e.g. 10 minutes to propagate
  under regular conditions. We might not enable this behaviour by default,
  though, due to its possible impacts on consistency.

## Current status and future work

The work described in this blog post is currently almost complete, but it still
needs to be ironed out. I have made a first run of Jepsen testing on the new
code, which showed that the changes seem to fix the issue. I will be running
longer and more intensive runs of Jepsen testing once the code is finished, to
make sure everything is fine. The changes will require a major update of
Garage: this will be the v0.10.0 release, which will probably be finished in
January or February of 2024. This will be a very safe and transparent update,
as only the layout data structure is changed and nothing related to object
storage itself is touched.

If I had the time to do so, I would write the algorithm described in this post
in a formal way, in the form of a scientific paper. I believe such a paper
would be worth presenting at a scientific conference or journal, especially
given the fact that it is motivated by a very concrete use case and has been
validated quite thoroughly (with Jepsen). Unfortunately, this is not my
highest priority at the moment.

---

Written by [Alex Auvolat](https://adnab.me).