diff --git a/content/blog/2023-12-preserving-read-after-write-consistency/index.md b/content/blog/2023-12-preserving-read-after-write-consistency/index.md index f83d1cb..53c3d43 100644 --- a/content/blog/2023-12-preserving-read-after-write-consistency/index.md +++ b/content/blog/2023-12-preserving-read-after-write-consistency/index.md @@ -112,17 +112,79 @@ The progress of the Jepsen testing work is tracked in [PR ## Fixing Read-after-Write consistency when layouts change -To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency: +To solve this issue, we will have to keep track of several information in the cluster. +We will also have to adapt our data transfer strategy and our quorums to make sure that +data can be found when it is requested. -- use write quorums amongst nodes of layout version n -- use two read quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n +Basically, here is how we will make sure that read-after-write is guaranteed: -This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast, however it might be a bit more tricky to implement: +- Several versions of the cluster layout can be live in the cluster at the same time. -- use two write quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n -- use read quorum amongst nodes of layout n-1 +- When multiple cluster layout versions are live, the writes are directed to + all of the live versions. + +- Nodes will synchronize data so that the nodes in the newest live layout + version will catch up with the older live layout versions. + +- Reads are initially directed to the oldest live layout version, but will + progressively be moved to the newer versions once the synchronizations are + complete. + +- Once all nodes are reading from newer layout versions, the oldest live versions + can be pruned and the corresponding data deleted. + + +More precisely, the following modifications are made to how quorums are used in +read/write operations and how the sync is made: + +- Writes are sent to all nodes responsible for the paritition in all live + layout versions, and will return OK only when they receive a quorum of OK + responses for each of the live layout versions. This means that writes could + be a bit slower whan a layout change is being synchronized in the cluster. + +- Reads are sent to the newest live layout version for which all nodes have + completed a sync to catch up on existing data, and only expect a quorum of 2 + responses among the three nodes of that layout version. This way, reads + always stay as performant as when no layout update is in progress. + +- A sync for a new layout version is not initiated until all cluster nodes have + acknowledged receiving that version and having finished all write operations + that were only addressed to previous layout versions. This makes sure that no + data will be missed by the sync: once the sync has started, no more data can + be written only to old layout versions. All of the writes will also be + directed to the new nodes (more exactly: all data that the source nodes of + the sync does not yet contain when the sync starts, is written by a write + operation that is also directed at a quorum of nodes among the new ones), + meaning that at the end of the sync, a read quorum among the new nodes will + necessarily return an up-to-date copy of all of the data. + +- The oldest live layout version can be pruned once all nodes have completed a + sync to a newer version AND all nodes have acknowleged that fact, signaling + that they are no longer reading from that old version and are now reading + from a newer version instead. After being pruned, the old layout version is + no longer live, and nodes that are no longer designated to store data in the + newer layout versions can simply delete the data that they were storing. + +As you can see, the previous algorithm needs to keep track of a lot of +information in the cluster. Ths information is kept in three "layout update trackers", +which keep track of the following information: + +- The `ack` layout tracker keeps track of nodes receiving the latest layout + versions. A node will not "ack" (acknowledge) a new layout version while it + still has outstanding write operations that were not directed to the nodes + included in that version. Once all nodes have acknowledged a new version, we + know that all write operations that are made in the cluster are directed to + the nodes that were added in this layout version. + +- The `sync` layout tracker keeps track of nodes finishing a full metadata table + sync, that was started after all nodes `ack`'ed the new layout version. + +- The `sync_ack` layout tracker keeps track of nodes receiving the `sync` + tracker update for all cluster nodes, and thus starting to direct reads to + the newly synchronized layout version. This makes it possible to know when no + more nodes are reading from an old version, at which point the corresponding + data can be deleted. -We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval. ## Current status and future work