complete blog post on nlnet task3

2023-12-06 18:01:47 +01:00 · 2023-12-06 18:01:47 +01:00 · e66f6de458
commit e66f6de458
parent 10bd783ce6
1 changed files with 171 additions and 80 deletions
--- a/content/blog/2023-12-preserving-read-after-write-consistency/index.md
+++ b/content/blog/2023-12-preserving-read-after-write-consistency/index.md
@ -1,5 +1,5 @@
 +++
-title="Maintaining Read-after-Write consistency in all circumstances"
+title="Maintaining read-after-write consistency in all circumstances"
 date=2023-12-25
 +++

@ -10,8 +10,8 @@ of read and write quorums to guarantee read-after-write consistency, the only
 consistency guarantee it provides. However, as of Garage v0.9.0, this guarantee
 is not maintained when the composition of a cluster is updated and data is
 moved between storage nodes. As part of our current NLnet-funded project, we
-are developping a solution to this problem, that is briefly explained in this
-blog post.*
+are developing a solution to this problem. This blog post proposes a
+high-level overview of the proposed solution.*

 <!-- more -->

@ -19,16 +19,28 @@ blog post.*

 Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

-**Read-after-write consistency.** *If a client A writes an object x (e.g. using PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.*
+**Read-after-write consistency.** *If a client A writes an object x (e.g. using
+PutObject) and receives a `HTTP 200 OK` response, and later a client B tries to
+read object x (e.g. using GetObject), then B will read the version written by
+A, or a more recent version.*

 The consistency guarantee offered by Garage is slightly more general than this
 simplistic formulation, as it also applies to other S3 endpoints such as
 ListObjects, which are always guaranteed to reflect the latest version of
-objects inserted in a bucket. 
+objects inserted in a bucket.  Note that Amazon calls this guarantee [*strong*
+read-after-write consistency](https://aws.amazon.com/s3/consistency/) (they
+also have it on AWS), to differentiate it from [another definition of
+read-after-write
+consistency](https://avikdas.com/2020/04/13/scalability-concepts-read-after-write-consistency.html)
+that only applies to data that is read by the same client that wrote it.  Since
+that weaker form is also called
+[read-your-writes](https://jepsen.io/consistency/models/read-your-writes), I
+will always be referring to the strong version when using the term
+"read-after-write consistency".

-This consistency guarantee at the level of objects in the S3 API is in fact a
-reflection of read-after-write consistency in the internal metadata engine of
-Garage (which is a distributed key/value store with CRDT values).  Reads and
+In Garage, this consistency guarantee at the level of objects in the S3 API is
+in fact a reflection of read-after-write consistency in the internal metadata
+engine (which is a distributed key/value store with CRDT values).  Reads and
 writes to metadata tables use quorums of 2 out of 3 nodes for each operation,
 ensuring that if operation B starts after operation A has completed, then there
 is at least one node that is handling both operation A and B. In the case where
@ -36,16 +48,18 @@ A is a write (an update) and B is a read, that node will have the opportunity
 to return the value written in A to the reading client B.  A visual depiction
 of this process can be found in [this
 presentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a8b0e01f88b947bc34c05d818d51860b4d171967/doc/talks/2023-09-20-ocp/talk.pdf)
-on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page 54).
+on slide 32 (pages 57-64), and the algorithm is written down on slide 33 (page
+54).

 Note that read-after-write guarantees [are broken and have always
-been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for the bucket
-and access key tables, which might not be something we can fix due to different
-requirements on the quorums. 
+been](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/147) for metadata
+related to buckets and access keys, which might not be something we can fix due
+to different requirements on the quorums for the related metadata tables. 

-## Current issues with Read-after-Write consistency

-Maintaining Read-after-Write consistency depends crucially on the intersection
+## Current issues with read-after-write consistency
+
+Maintaining read-after-write consistency depends crucially on the intersection
 of the quorums being non-empty. There is however a scenario where these quorums
 may be empty: when the set of nodes affected to storing some entries changes,
 for instance when nodes are added or removed and data is being rebalanced
@ -70,111 +84,136 @@ A, B and C for a long time even after everyone else is aware of the new layout.
 Moreover, data will be progressively moved from nodes B and C to nodes D and E,
 which can take a long time depending on the quantity of data. This creates a
 period of uncertainty as to where exactly the data is stored in the cluster.
-Overall, this basically means that there is no way to guarantee the
-intersection-of-quorums property, which is necessary for read-after-write, with
-such a simplistic scheme.
+Overall, this basically means that this simplistic scheme gives us no way to
+guarantee the intersection-of-quorums property, which is necessary for
+read-after-write.

 Concretely, here is a very simple scenario in which read-after-write is broken:

 1. A write operation is directed to nodes A, B and C (the old layout), and
   receives OK responses from nodes B and C, forming a quorum, so the write
-   completes successfully.  The written data sent to node A is lost or delayed
-   for a long time.
+   completes successfully.  The written data then arrives to node A as well.

 2. The new layout version is introduced in the cluster.

 3. Before nodes D and E have had the chance to retrieve the data that was
   stored on nodes B and C, a read operation for the same key is directed to
-   nodes A, D and E.  This request receives OK responses from nodes D and E,
-   both containing no data but still forming a quorum of 2 responses. So the
-   read returns a null value instead of the value that was written before, even
-   though the write operation reported a success.
+   nodes A, D and E. D and E both return an OK response with no data (a null
+   value), because they is not yet up-to-date.  An answer from node A is not
+   received in time.  The two responses from nodes D and E, that contain no
+   data, still form a quorum, so the read returns a null value instead of the
+   value that was written before, even though the write operation reported a
+   success.


 ### Evidencing the issue with Jepsen testing

 The first thing that I had to do for the NLnet project was to develop a testing
 framework to show that read-after-write consistency issues could in fact arise
-in Garage when the cluster layout was updated.
+in Garage when the cluster layout was updated.  To make such tests, I chose to
+use the [Jepsen](https://jepsen.io/) testing framework, which helps us put
+distributed software in complex adverse scenarios and verify whether they
+respect some claimed consistency guarantees or not.

-To make such tests, I chose to use the Jepsen testing framework, which helps us
-put distributed software in complex adverse scenarios and verify whether they
-respect some claimed consistency guarantees or not. I will not enter into too
-much detail on the testing procedure, but suffice to say that issues were
-found. More precisely, I was able to show that Garage did guarantee
-read-after-write in a variety of adverse scenarios such as network partitions,
-node crashes and clock scrambling, but that it was unable to do so as soon as
-regular layout updates were introduced.
+I will not enter into too much detail on the testing procedure, but suffice to
+say that issues were found. More precisely, I was able to show that Garage
+*did* guarantee read-after-write in a variety of adverse scenarios such as
+network partitions, node crashes and clock scrambling, but that it was unable
+to do so as soon as regular layout updates were introduced.

 The progress of the Jepsen testing work is tracked in [PR
 #544](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/544)


-## Fixing Read-after-Write consistency when layouts change
+## Fixing read-after-write consistency when layouts change

-To solve this issue, we will have to keep track of several information in the cluster.
-We will also have to adapt our data transfer strategy and our quorums to make sure that
-data can be found when it is requested.
+To solve this issue, we will have to keep track of several pieces of
+information in the cluster.  We will also have to adapt our read/write quorums
+and our data transfer strategy during rebalancing to make sure that data can be
+found when it is requested.

-Basically, here is how we will make sure that read-after-write is guaranteed:
+First of all, we adapted Garage's code to be able to handle *several versions
+of the cluster layout* that can be live in the cluster at the same time, to
+keep track of multiple possible locations for data that is currently being
+transferred between nodes.  When multiple cluster layout versions are live,
+write operations are directed to all of the nodes responsible for storing the
+data in all the live versions. This ensures that the nodes in the oldest live
+layout version always have an up-to-date view of the data, and that a read
+quorum among those nodes is always a safe way to ensure read-after-write
+consistency.

- Several versions of the cluster layout can be live in the cluster at the same time.
+Nodes will progressively synchronize data so that the nodes in the newest live
+layout version will catch up with data stored by nodes in the older live layout
+version.  Once nodes in the newer layout versions also have an up-to-date view
+of the data, read operations will progressively start using a quorum of nodes
+in the new layout version instead of the old one.

- When multiple cluster layout versions are live, the writes are directed to
-  all of the live versions.
+Once all nodes are reading from newer layout versions, the oldest live versions
+can be pruned. This means that writes will stop being directed to those nodes,
+and the nodes will delete the data they were storing.  Obviously, in the (very
+common) case where some nodes are both in the old and new layout versions,
+those nodes will not delete their data and they will continue to receive
+writes.

- Nodes will synchronize data so that the nodes in the newest live layout
-  version will catch up with the older live layout versions.
+### Performance impacts

- Reads are initially directed to the oldest live layout version, but will
-  progressively be moved to the newer versions once the synchronizations are
-  complete.
+When multiple layout versions are live, writes are sent to all nodes
+responsible for the partition of the requested key in all live layout
+versions, and will return OK only when they receive a quorum of OK responses
+for each of the live layout versions.  This means that writes could be a bit
+slower when a layout change is being synchronized in the cluster. Typically if
+only one node is changing between the old and the new layout version, the write
+operation will await for 3 responses among 4 requests, instead of the classical
+2 responses among 3 requests.

- Once all nodes are reading from newer layout versions, the oldest live versions
-  can be pruned and the corresponding data deleted.
+Concerning reads, they are still sent to only three nodes.  Indeed,  they are
+sent to the nodes of the newest live layout version for which nodes have
+completed a sync to catch up on existing data, and they only expect a quorum of
+2 responses among the three nodes of that layout version.  This way, reads
+always stay as performant as when no layout change is being processed.

+### Ensuring that new nodes are up-to-date

-More precisely, the following modifications are made to how quorums are used in
-read/write operations and how the sync is made:
+An additional coordination mechanism is necessary for the data synchronization
+procedure, to ensure that it is not started too early and that after it
+completes, the nodes in the new layout indeed contains an up-to-date view of
+the data.

- Writes are sent to all nodes responsible for the paritition in all live
-  layout versions, and will return OK only when they receive a quorum of OK
-  responses for each of the live layout versions.  This means that writes could
-  be a bit slower whan a layout change is being synchronized in the cluster.
+Indeed, imagine the following adverse scenario, which we want to avoid: a new
+layout version is introduced in the cluster, and nodes immediately start
+copying the data to the new nodes. However, some write operations that were
+initiated before the new layout was introduced (or that were handled by a node
+not yet aware of the layout) could be delayed, and the written data was not yet
+received by the old nodes when they sent their copy of everything.  When the
+sync reports completion, and read operations start being directed to nodes of
+the new layout, the written data might be missing from the nodes handling the
+read, and read-after-write consistency could be violated.

- Reads are sent to the newest live layout version for which all nodes have
-  completed a sync to catch up on existing data, and only expect a quorum of 2
-  responses among the three nodes of that layout version.  This way, reads
-  always stay as performant as when no layout update is in progress.
+To avoid this situation, the synchronization operation is not initiated until
+all cluster nodes have reported an "acknowledge" of the new layout version,
+indicating that they have received the new layout version, and that they are no
+longer processing write operations that were only addressed to nodes of the
+previous layout versions.  This makes sure that no data will be missed by the
+sync: once the sync has started, no more data can be written only to old layout
+versions.  All of the writes will also be directed to the new nodes. More
+exactly: all data that the source nodes of the sync does not yet contain when
+the sync starts, is written by a write operation that is also directed at a
+quorum of nodes among the new ones. This means that at the end of the sync, a
+read quorum among the new nodes will necessarily return an up-to-date copy of
+all of the data.

- A sync for a new layout version is not initiated until all cluster nodes have
-  acknowledged receiving that version and having finished all write operations
-  that were only addressed to previous layout versions. This makes sure that no
-  data will be missed by the sync: once the sync has started, no more data can
-  be written only to old layout versions.  All of the writes will also be
-  directed to the new nodes (more exactly: all data that the source nodes of
-  the sync does not yet contain when the sync starts, is written by a write
-  operation that is also directed at a quorum of nodes among the new ones),
-  meaning that at the end of the sync, a read quorum among the new nodes will
-  necessarily return an up-to-date copy of all of the data.
-
- The oldest live layout version can be pruned once all nodes have completed a
-  sync to a newer version AND all nodes have acknowleged that fact, signaling
-  that they are no longer reading from that old version and are now reading
-  from a newer version instead. After being pruned, the old layout version is
-  no longer live, and nodes that are no longer designated to store data in the
-  newer layout versions can simply delete the data that they were storing.
+### Details on update trackers

 As you can see, the previous algorithm needs to keep track of a lot of
-information in the cluster. Ths information is kept in three "layout update trackers",
+information in the cluster. This information is kept in three "layout update trackers",
 which keep track of the following information:

 - The `ack` layout tracker keeps track of nodes receiving the latest layout
-  versions.  A node will not "ack" (acknowledge) a new layout version while it
-  still has outstanding write operations that were not directed to the nodes
-  included in that version. Once all nodes have acknowledged a new version, we
-  know that all write operations that are made in the cluster are directed to
-  the nodes that were added in this layout version.
+  versions and indicating that they are no longer processing writes addressed
+  only to older layout versions. Once all nodes have acknowledged a new
+  version, we know that all in-progress and future write operations that are
+  made in the cluster are directed to the nodes that were added in this layout
+  version as well.

 - The `sync` layout tracker keeps track of nodes finishing a full metadata table
  sync, that was started after all nodes `ack`'ed the new layout version.
@ -185,6 +224,58 @@ which keep track of the following information:
  more nodes are reading from an old version, at which point the corresponding
  data can be deleted.

+In the simplest scenario, only two layout versions are live, and these trackers
+therefore can only have the values `n` (the new layout version) and `n-1` (the
+old one).  However this mechanism handles the general case where several
+successive layout updates are being processed and more than two layout versions
+are live simultaneously. The layout update trackers can take as values the
+version numbers of any currently live layout version.
+
+### What about dead nodes?
+
+In this post I have used many times the phrases "once all nodes have
+acknowledged a new layout version", or "once all nodes have completed a sync".
+This obviously means that if some nodes are dead or unresponsive, the
+processing of the layout update can be delayed indefinitely, and nodes in the
+old layout versions will keep receiving writes and storing unnecessary data.
+This is an unfortunate fact with the method proposed here. To cover for these
+situations, the following workarounds can be made:
+
+- A layout change is generally a supervised operation, meaning that a system
+  administrator may manually intervene to inform the cluster that certain nodes
+  are dead and that their layout tracker values should not be taken into
+  account.
+
+- For the `sync` update tracker, we don't actually need to wait for all of the
+  synchronizations to terminate, quorums can be used instead as they should be
+  sufficient to ensure that the copied data is up-to-date.
+
+- For the `ack` and `sync_ack` update trackers, we can automatically increase
+  them for all nodes (even dead ones) after a certain time delay, as there is
+  no reason for the changes taking more than e.g. 10 minutes to propagate in
+  regular conditions. We might not enable this behaviour by default, though,
+  due to its possible impacts on consistency.


 ## Current status and future work
+
+The work described in this blog post is currently almost complete but it still
+needs to be ironed out.  I have made a first run of Jepsen testing on the new
+code that showed that the changes seem to be fixing the issue.  I will be
+running longer and more intensive runs of Jepsen testing once the code is
+finished, to make sure everything is fine.  The changes will require a major
+update of Garage: this will be the v0.10.0 release, which will probably be
+finished in January or February of 2024. This update will be a very safe and
+transparent update, as only the layout data structure is changed and nothing
+related to object storage itself is touched.
+
+If I had the time to do so, I would write the algorithm described in this post
+in a formal way, in the form of a scientific paper. I believe such a paper
+would be worthy of presenting at a scientific conference or journal, especially
+given the fact that it is motivated by a very concrete use case and has been
+validated quite thoroughly (with Jepsen).  Unfortunately, this is not my
+highest priority at the moment.
+
+---
+
+Written by [Alex Auvolat](https://adnab.me).