+++
title="Maintaining read-after-write consistency in all circumstances"
date=2023-12-25
+++

of read and write quorums to guarantee read-after-write consistency, the only
consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
is not maintained when the composition of a cluster is updated and data is
moved between storage nodes. As part of our current NLnet-funded project, we
are developing a solution to this problem. This blog post proposes a
high-level overview of the proposed solution.*

<!-- more -->

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

**Read-after-write consistency.** *If a client A writes an object x (e.g. using
PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to
read object x (e.g. using GetObject), then B will read the version written by
A, or a more recent version.*

The consistency guarantee offered by Garage is slightly more general than this
simplistic formulation, as it also applies to other S3 endpoints such as
ListObjects, which are always guaranteed to reflect the latest version of
objects inserted in a bucket. Note that Amazon calls this guarantee [*strong*
read-after-write consistency](https://aws.amazon.com/s3/consistency/) (they
also have it on AWS), to differentiate it from [another definition of
read-after-write
consistency](https://avikdas.com/2020/04/13/scalability-concepts-read-after-write-consistency.html)
that only applies to data that is read by the same client that wrote it. Since
that weaker form is also called
[read-your-writes](https://jepsen.io/consistency/models/read-your-writes), I
will always be referring to the strong version when using the term
"read-after-write consistency".

In Garage, this consistency guarantee at the level of objects in the S3 API is
in fact a reflection of read-after-write consistency in the internal metadata
engine (which is a distributed key/value store with CRDT values). Reads and
writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
ensuring that if operation B starts after operation A has completed, then there
is at least one node that is handling both operation A and B. In the case where
A is a write (an update) and B is a read, that node will have the opportunity
to return the value written in A to the reading client B. A visual depiction
of this process can be found in [this
presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page
54).
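
To make the quorum-intersection argument more concrete, here is a small
illustrative Rust sketch of 2-out-of-3 reads and writes. The types and
functions used here (`Node`, `Versioned`, `write`, `read`) are invented for
this post and are not Garage's actual metadata code:

```rust
use std::collections::HashMap;

// A value tagged with a version number, so that reads can keep the newest copy.
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    version: u64,
    value: String,
}

// Each node stores its own local copy of the key/value map.
type Node = HashMap<String, Versioned>;

const QUORUM: usize = 2; // out of 3 replicas

/// Store the value on the given quorum of nodes (indices into `nodes`).
fn write(nodes: &mut [Node], quorum: &[usize], key: &str, val: Versioned) -> bool {
    if quorum.len() < QUORUM {
        return false;
    }
    for &i in quorum {
        nodes[i].insert(key.to_string(), val.clone());
    }
    true
}

/// Ask the given quorum of nodes for the key and keep the most recent version seen.
fn read(nodes: &[Node], quorum: &[usize], key: &str) -> Option<Versioned> {
    quorum
        .iter()
        .filter_map(|&i| nodes[i].get(key).cloned())
        .max_by_key(|v| v.version)
}

fn main() {
    let mut nodes = vec![Node::new(), Node::new(), Node::new()];
    let v1 = Versioned { version: 1, value: "hello".into() };

    // The write is acknowledged by nodes {0, 1}: a quorum of 2 out of 3.
    assert!(write(&mut nodes, &[0, 1], "x", v1.clone()));

    // Any read quorum of 2 nodes out of the same 3 nodes shares at least one
    // node with {0, 1}, so the read necessarily observes the written value.
    assert_eq!(read(&nodes, &[1, 2], "x"), Some(v1));
}
```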
|
|
Note that read-after-write guarantees [are broken and have always
been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for metadata
related to buckets and access keys, which might not be something we can fix due
to different requirements on the quorums for the related metadata tables.

## Current issues with read-after-write consistency

Maintaining read-after-write consistency depends crucially on the intersection
of the quorums being non-empty. There is, however, a scenario where these
quorums may be empty: when the set of nodes assigned to store some entries
changes, for instance when nodes are added or removed and data is being rebalanced
|
A, B and C for a long time even after everyone else is aware of the new layout.
Moreover, data will be progressively moved from nodes B and C to nodes D and E,
which can take a long time depending on the quantity of data. This creates a
period of uncertainty as to where exactly the data is stored in the cluster.
Overall, this basically means that this simplistic scheme gives us no way to
guarantee the intersection-of-quorums property, which is necessary for
read-after-write.

Concretely, here is a very simple scenario in which read-after-write is broken
(a code sketch of this scenario follows the list):

1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
   completes successfully. The written data then arrives at node A as well.

2. The new layout version is introduced in the cluster.

3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
   nodes A, D and E. D and E both return an OK response with no data (a null
   value), because they are not yet up-to-date. An answer from node A is not
   received in time. The two responses from nodes D and E, which contain no
   data, still form a quorum, so the read returns a null value instead of the
   value that was written before, even though the write operation reported a
   success.
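
In terms of the toy quorum sketch from earlier, the same broken scenario could
be written down roughly as follows (again purely illustrative, reusing the
hypothetical `Node`, `Versioned`, `write` and `read` definitions from above):

```rust
// Five nodes: the old layout stores the key on {A, B, C} = indices 0, 1, 2,
// while the new layout stores it on {A, D, E} = indices 0, 3, 4.
#[test]
fn broken_scenario() {
    let mut nodes = vec![Node::new(); 5];
    let v1 = Versioned { version: 1, value: "hello".into() };

    // Step 1: the write is acknowledged by nodes B and C, a quorum of the old layout.
    assert!(write(&mut nodes, &[1, 2], "x", v1));

    // Step 3: before any data has been moved, the read contacts nodes D and E,
    // a quorum of the new layout, and finds nothing: read-after-write is broken.
    assert_eq!(read(&nodes, &[3, 4], "x"), None);
}
```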
|
|
### Evidencing the issue with Jepsen testing

The first thing that I had to do for the NLnet project was to develop a testing
framework to show that read-after-write consistency issues could in fact arise
in Garage when the cluster layout was updated. To make such tests, I chose to
use the [Jepsen](https://jepsen.io/) testing framework, which helps us put
distributed software in complex adverse scenarios and verify whether they
respect some claimed consistency guarantees or not.

I will not enter into too much detail on the testing procedure, but suffice it
to say that issues were found. More precisely, I was able to show that Garage
*did* guarantee read-after-write in a variety of adverse scenarios such as
network partitions, node crashes and clock scrambling, but that it was unable
to do so as soon as regular layout updates were introduced.

The progress of the Jepsen testing work is tracked in [PR
#544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544).

## Fixing read-after-write consistency when layouts change

To solve this issue, we will have to keep track of several pieces of
information in the cluster. We will also have to adapt our read/write quorums
and our data transfer strategy during rebalancing to make sure that data can be
found when it is requested.

First of all, we adapted Garage's code to be able to handle *several versions
of the cluster layout* that can be live in the cluster at the same time, to
keep track of multiple possible locations for data that is currently being
transferred between nodes. When multiple cluster layout versions are live,
write operations are directed to all of the nodes responsible for storing the
data in all the live versions. This ensures that the nodes in the oldest live
layout version always have an up-to-date view of the data, and that a read
quorum among those nodes is always a safe way to ensure read-after-write
consistency.

Nodes will progressively synchronize data so that the nodes in the newest live
layout version will catch up with data stored by nodes in the older live layout
version. Once nodes in the newer layout versions also have an up-to-date view
of the data, read operations will progressively start using a quorum of nodes
in the new layout version instead of the old one.

Once all nodes are reading from newer layout versions, the oldest live versions
can be pruned. This means that writes will stop being directed to those nodes,
and the nodes will delete the data they were storing. Obviously, in the (very
common) case where some nodes are both in the old and new layout versions,
those nodes will not delete their data and they will continue to receive
writes.
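
Here is a rough Rust sketch of these multi-version rules. Everything in it (the
`LayoutVersion` struct, the helper functions, the way quorums are counted) is a
simplified assumption made for illustration, not Garage's actual internals:

```rust
type NodeId = u64;

/// One live version of the cluster layout, restricted to a single partition.
struct LayoutVersion {
    version: u64,
    /// Nodes responsible for storing this partition in this layout version.
    nodes: Vec<NodeId>,
    /// Whether the data sync towards this version's nodes has completed.
    synced: bool,
}

/// Writes are sent to the union of the nodes of all live layout versions.
fn write_targets(live: &[LayoutVersion]) -> Vec<NodeId> {
    let mut targets: Vec<NodeId> = live
        .iter()
        .flat_map(|v| v.nodes.iter().copied())
        .collect();
    targets.sort_unstable();
    targets.dedup();
    targets
}

/// A write returns OK only once it has a quorum of acks in every live version.
fn write_is_successful(live: &[LayoutVersion], acks: &[NodeId]) -> bool {
    live.iter().all(|v| {
        let acked = v.nodes.iter().filter(|n| acks.contains(*n)).count();
        acked > v.nodes.len() / 2
    })
}

/// Reads only use the newest live version whose sync has completed
/// (the oldest live version, `live[0]`, is up to date by construction).
fn read_targets(live: &[LayoutVersion]) -> &LayoutVersion {
    live.iter()
        .filter(|v| v.synced)
        .max_by_key(|v| v.version)
        .unwrap_or(&live[0])
}

fn main() {
    let old = LayoutVersion { version: 1, nodes: vec![1, 2, 3], synced: true };
    let new = LayoutVersion { version: 2, nodes: vec![1, 2, 4], synced: false };
    let live = [old, new];

    // Only one node differs between the two versions, so writes go to 4 nodes.
    assert_eq!(write_targets(&live), vec![1, 2, 3, 4]);
    // Acks from nodes 1, 3 and 4 form a quorum of 2 in both versions: success.
    assert!(write_is_successful(&live, &[1, 3, 4]));
    // Acks from nodes 3 and 4 alone do not form a quorum in either version.
    assert!(!write_is_successful(&live, &[3, 4]));
    // Reads keep using the old, already-synced version until the new one catches up.
    assert_eq!(read_targets(&live).version, 1);
}
```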

### Performance impacts

When multiple layout versions are live, writes are sent to all nodes
responsible for the partition of the requested key in all live layout
versions, and will return OK only when they receive a quorum of OK responses
for each of the live layout versions. This means that writes could be a bit
slower when a layout change is being synchronized in the cluster. Typically, if
only one node is changing between the old and the new layout version, the write
operation will wait for 3 responses among 4 requests, instead of the classical
2 responses among 3 requests.

Concerning reads, they are still sent to only three nodes. Indeed, they are
sent to the nodes of the newest live layout version for which nodes have
completed a sync to catch up on existing data, and they only expect a quorum of
2 responses among the three nodes of that layout version. This way, reads
always stay as performant as when no layout change is being processed.
|
### Ensuring that new nodes are up-to-date

An additional coordination mechanism is necessary for the data synchronization
procedure, to ensure that it is not started too early and that, after it
completes, the nodes in the new layout indeed contain an up-to-date view of
the data.

Indeed, imagine the following adverse scenario, which we want to avoid: a new
layout version is introduced in the cluster, and nodes immediately start
copying the data to the new nodes. However, some write operations that were
initiated before the new layout was introduced (or that were handled by a node
not yet aware of the new layout) could be delayed, so that the written data had
not yet been received by the old nodes when they sent their copy of everything.
When the sync reports completion, and read operations start being directed to
nodes of the new layout, the written data might be missing from the nodes
handling the read, and read-after-write consistency could be violated.

To avoid this situation, the synchronization operation is not initiated until
all cluster nodes have reported an "acknowledge" of the new layout version,
indicating that they have received the new layout version and that they are no
longer processing write operations that were only addressed to nodes of the
previous layout versions. This makes sure that no data will be missed by the
sync: once the sync has started, no more data can be written only to old layout
versions. All of the writes will also be directed to the new nodes. More
exactly: all data that the source nodes of the sync do not yet contain when
the sync starts is written by a write operation that is also directed at a
quorum of nodes among the new ones. This means that at the end of the sync, a
read quorum among the new nodes will necessarily return an up-to-date copy of
all of the data.

### Details on update trackers

As you can see, the previous algorithm needs to keep track of a lot of
information in the cluster. This information is kept in three "layout update
trackers", which keep track of the following information:

- The `ack` layout tracker keeps track of nodes receiving the latest layout
  versions and indicating that they are no longer processing writes addressed
  only to older layout versions. Once all nodes have acknowledged a new
  version, we know that all in-progress and future write operations that are
  made in the cluster are directed to the nodes that were added in this layout
  version as well.

- The `sync` layout tracker keeps track of nodes finishing a full metadata
  table sync that was started after all nodes `ack`'ed the new layout version.
|
- The `sync_ack` layout tracker keeps track of nodes that have seen the
  completion of that sync and have switched their reads over to the newer
  layout version. This makes it possible to know when no more nodes are
  reading from an old version, at which point the corresponding data can be
  deleted.
|
In the simplest scenario, only two layout versions are live, and these trackers
therefore can only have the values `n` (the new layout version) and `n-1` (the
old one). However, this mechanism handles the general case where several
successive layout updates are being processed and more than two layout versions
are live simultaneously. The layout update trackers can take as values the
version numbers of any currently live layout version.
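
As a rough illustration, the trackers can be thought of as per-node version
numbers whose cluster-wide value is the minimum over all nodes. The sketch
below is a simplified model with hypothetical names (`UpdateTrackers`,
`can_start_sync`, ...), not Garage's actual implementation:

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Default)]
struct UpdateTrackers {
    /// Latest layout version acknowledged by each node.
    ack: HashMap<NodeId, u64>,
    /// Latest layout version for which each node has completed a metadata sync.
    sync: HashMap<NodeId, u64>,
    /// Latest synced layout version that each node is now using for its reads.
    sync_ack: HashMap<NodeId, u64>,
}

/// The cluster-wide value of a tracker is the minimum over all cluster nodes.
fn global_min(tracker: &HashMap<NodeId, u64>, nodes: &[NodeId]) -> u64 {
    nodes
        .iter()
        .map(|n| *tracker.get(n).unwrap_or(&0))
        .min()
        .unwrap_or(0)
}

impl UpdateTrackers {
    /// The sync towards version `v` may only start once every node has acked `v`.
    fn can_start_sync(&self, nodes: &[NodeId], v: u64) -> bool {
        global_min(&self.ack, nodes) >= v
    }

    /// Layout versions strictly older than this one are no longer read by any
    /// node, so they can be pruned and their extra data deleted.
    fn oldest_live_version(&self, nodes: &[NodeId]) -> u64 {
        global_min(&self.sync_ack, nodes)
    }
}
```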
|
### What about dead nodes?

In this post, I have used the phrases "once all nodes have acknowledged a new
layout version" or "once all nodes have completed a sync" many times. This
obviously means that if some nodes are dead or unresponsive, the processing of
the layout update can be delayed indefinitely, and nodes in the old layout
versions will keep receiving writes and storing unnecessary data. This is an
unfortunate limitation of the method proposed here. To cover for these
situations, the following workarounds can be applied:
|
- A layout change is generally a supervised operation, meaning that a system
  administrator may manually intervene to inform the cluster that certain nodes
  are dead and that their layout tracker values should not be taken into
  account.

- For the `sync` update tracker, we don't actually need to wait for all of the
  synchronizations to terminate: quorums can be used instead, as they should be
  sufficient to ensure that the copied data is up-to-date.

- For the `ack` and `sync_ack` update trackers, we can automatically increase
  them for all nodes (even dead ones) after a certain time delay, as there is
  no reason for the changes to take more than e.g. 10 minutes to propagate
  under regular conditions. We might not enable this behaviour by default,
  though, due to its possible impacts on consistency. A rough sketch of this
  idea is shown below.
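
As a sketch of that last workaround (under the same hypothetical tracker model
as above, with a made-up `AUTO_ACK_DELAY` constant): after a fixed delay, the
cluster-wide `ack` value is bumped even if some nodes never answered.

```rust
use std::time::{Duration, Instant};

// Assumption made for illustration: changes should propagate well within 10 minutes.
const AUTO_ACK_DELAY: Duration = Duration::from_secs(10 * 60);

/// Cluster-wide `ack` value, possibly forcing the value for unresponsive nodes
/// once the new layout version has been announced for long enough.
fn global_ack_with_timeout(
    acked_by_node: &[u64], // latest layout version acked by each node
    latest_version: u64,   // newest layout version announced in the cluster
    announced_at: Instant, // when that version was announced
) -> u64 {
    if announced_at.elapsed() >= AUTO_ACK_DELAY {
        // Dead or unresponsive nodes are considered to have acked after the delay.
        latest_version
    } else {
        acked_by_node.iter().copied().min().unwrap_or(latest_version)
    }
}
```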
|
|
## Current status and future work

The work described in this blog post is currently almost complete, but it still
needs to be ironed out. I have made a first run of Jepsen testing on the new
code, which showed that the changes seem to fix the issue. I will be running
longer and more intensive runs of Jepsen testing once the code is finished, to
make sure everything is fine. The changes will require a major update of
Garage: this will be the v0.10.0 release, which will probably be finished in
January or February of 2024. This will be a very safe and transparent update,
as only the layout data structure is changed and nothing related to object
storage itself is touched.

If I had the time to do so, I would write the algorithm described in this post
down formally, in the form of a scientific paper. I believe such a paper would
be worth presenting at a scientific conference or in a journal, especially
given the fact that it is motivated by a very concrete use case and has been
validated quite thoroughly (with Jepsen). Unfortunately, this is not my
highest priority at the moment.

---

Written by [Alex Auvolat](https://adnab.me).
|
|