Read-after-write consistency may not be maintained when layout changes #495

Closed
opened 2023-01-30 14:17:26 +00:00 by lx · 1 comment
Owner

Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

Read-after-write consistency. If a client A writes an object x (e.g. using PutObject) and receives an HTTP 200 OK response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.

This consistency guarantee at the level of objects in the object store API is in fact a reflection of read-after-write consistency in the internal metadata engine of Garage (which is a distributed key/value store with CRDT values). Reads and writes to metadata tables use quorums of 2 out of 3 nodes for each operation, ensuring that if operation B starts after operation A has completed, then there is at least one node that handles both operations A and B. In the case where A is a write (an update) and B is a read, that node has the opportunity to return the value written by A to the reading client B.
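
As an illustration of this argument, here is a small standalone Rust sketch (not Garage code) that enumerates all 2-out-of-3 quorums over a single replica set and checks that any two of them share at least one node, which is the pigeonhole fact (2 + 2 > 3) the guarantee relies on:

```rust
use std::collections::HashSet;

/// Replica set of one partition: 3 nodes, identified here by small integers.
const REPLICAS: [u32; 3] = [1, 2, 3];
const QUORUM: usize = 2;

/// Enumerate every subset of `nodes` of size `k` (i.e. every possible quorum).
fn quorums(nodes: &[u32], k: usize) -> Vec<HashSet<u32>> {
    if k == 0 {
        return vec![HashSet::new()];
    }
    if nodes.len() < k {
        return vec![];
    }
    let mut out = Vec::new();
    // Quorums that contain nodes[0]...
    for mut q in quorums(&nodes[1..], k - 1) {
        q.insert(nodes[0]);
        out.push(q);
    }
    // ...and quorums that do not.
    out.extend(quorums(&nodes[1..], k));
    out
}

fn main() {
    let qs = quorums(&REPLICAS, QUORUM);
    // Any write quorum and any read quorum share at least one node,
    // so that node can return the latest written value to the reader.
    for write_q in &qs {
        for read_q in &qs {
            assert!(!write_q.is_disjoint(read_q));
        }
    }
    println!("all {} x {} quorum pairs intersect", qs.len(), qs.len());
}
```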

The issue. Maintaining this property depends crucially on the intersection of the quorums being non-empty. There is however a scenario where these quorums may be disjoint: when the set of nodes assigned to storing some entries changes, for instance when nodes are added or removed and data is being rebalanced between nodes. For instance, a partition (a subset of the data stored by Garage) might be stored by nodes 1, 2 and 3 before a layout change, and by nodes 1, 4 and 5 after the layout change. All operations done before the layout change will have been handled by two nodes among 1, 2 and 3, but there is no guarantee that those two nodes intersect a quorum of two nodes among 1, 4 and 5; moreover, nodes 4 and 5 will not have caught up with all of the data stored by nodes 2 and 3 until rebalancing completes, which can take a significant amount of time. So read-after-write consistency is broken while the rebalance is in progress.
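
Concretely, the disjoint-quorum scenario from the example above can be written down as follows (again an illustrative Rust sketch, not Garage code):

```rust
use std::collections::HashSet;

fn main() {
    // Replica sets of the same partition before and after the layout change.
    let old_replicas: HashSet<u32> = HashSet::from([1, 2, 3]); // layout version n-1
    let new_replicas: HashSet<u32> = HashSet::from([1, 4, 5]); // layout version n

    // A write acknowledged before the change may have reached only this quorum:
    let write_quorum: HashSet<u32> = HashSet::from([2, 3]);
    assert!(write_quorum.is_subset(&old_replicas));

    // A read issued after the change may contact only this quorum:
    let read_quorum: HashSet<u32> = HashSet::from([4, 5]);
    assert!(read_quorum.is_subset(&new_replicas));

    // The two quorums are disjoint, so the read can miss the write entirely
    // until rebalancing has copied the data onto nodes 4 and 5.
    assert!(write_quorum.is_disjoint(&read_quorum));
    println!("write and read quorums are disjoint: read-after-write can be violated");
}
```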

Possible solutions. To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency (a sketch follows the list below):

  • use a write quorum amongst nodes of layout version n
  • use two read quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
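
Here is a minimal sketch of this first strategy; the types, helper functions and quorum arithmetic below are made up for illustration and do not reflect Garage's actual internals:

```rust
type NodeId = u32;

/// Majority quorum size, e.g. 2 out of 3.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Count how many of the nodes that responded belong to a given replica set.
fn count_in(responded: &[NodeId], set: &[NodeId]) -> usize {
    responded.iter().copied().filter(|node| set.contains(node)).count()
}

/// Nodes storing one partition in the old and new layout versions.
struct PartitionReplicas {
    old_nodes: Vec<NodeId>, // layout version n-1
    new_nodes: Vec<NodeId>, // layout version n
}

impl PartitionReplicas {
    /// A write can be acknowledged once a quorum of layout-n nodes accepted it.
    fn write_is_complete(&self, acked: &[NodeId]) -> bool {
        count_in(acked, &self.new_nodes) >= quorum(self.new_nodes.len())
    }

    /// A read must gather a quorum in BOTH layout versions, so that it
    /// intersects writes made both before and after the layout change.
    fn read_is_complete(&self, replied: &[NodeId]) -> bool {
        count_in(replied, &self.old_nodes) >= quorum(self.old_nodes.len())
            && count_in(replied, &self.new_nodes) >= quorum(self.new_nodes.len())
    }
}

fn main() {
    let p = PartitionReplicas { old_nodes: vec![1, 2, 3], new_nodes: vec![1, 4, 5] };
    assert!(p.write_is_complete(&[4, 5]));      // a quorum among {1, 4, 5}
    assert!(!p.read_is_complete(&[4, 5]));      // no quorum among {1, 2, 3} yet
    assert!(p.read_is_complete(&[2, 3, 4, 5])); // quorums in both layout versions
}
```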

This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast; however, it might be a bit trickier to implement (again, a sketch follows the list):

  • use two write quorums, one amongst nodes of layout version n-1 and one amongst nodes of layout version n
  • use a read quorum amongst nodes of layout version n-1
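
A corresponding sketch of the flipped strategy, under the same disclaimer (illustrative names and helpers, not Garage's actual code):

```rust
type NodeId = u32;

/// Majority quorum size, e.g. 2 out of 3.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Count how many of the nodes that responded belong to a given replica set.
fn count_in(responded: &[NodeId], set: &[NodeId]) -> usize {
    responded.iter().copied().filter(|node| set.contains(node)).count()
}

/// A write can be acknowledged only once it holds a quorum among layout-(n-1)
/// nodes AND a quorum among layout-n nodes.
fn write_is_complete(old_nodes: &[NodeId], new_nodes: &[NodeId], acked: &[NodeId]) -> bool {
    count_in(acked, old_nodes) >= quorum(old_nodes.len())
        && count_in(acked, new_nodes) >= quorum(new_nodes.len())
}

/// A read only needs a single quorum among layout-(n-1) nodes, which still
/// hold all data written before the layout change.
fn read_is_complete(old_nodes: &[NodeId], replied: &[NodeId]) -> bool {
    count_in(replied, old_nodes) >= quorum(old_nodes.len())
}

fn main() {
    let (old_nodes, new_nodes) = (vec![1u32, 2, 3], vec![1u32, 4, 5]);
    assert!(!write_is_complete(&old_nodes, &new_nodes, &[4, 5])); // old quorum missing
    assert!(write_is_complete(&old_nodes, &new_nodes, &[2, 3, 4, 5]));
    assert!(read_is_complete(&old_nodes, &[1, 2])); // reads stay fast: 2 of 3 old nodes
}
```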

We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval.

Description of this task. We want to solve this the best we can before we tag Garage v1.0. Here are the steps we should take:

  1. Try to break Garage by putting it in a situation where this inconsistency actually appears. We could use Jepsen or other similar tools for this; a toy sketch of the property to be checked follows the list below.

  2. Understand exactly what is happening and why it breaks.

  3. Make a theoretical model of the system that reflects the issue, and figure out an algorithm that works in this model (e.g. based on one of the two solutions proposed above).

  4. Implement the chosen solution.

  5. Check that the issue is resolved using the tools and baseline defined in step 1.
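
As a starting point for steps 1 and 5, the property we want to check can be stated as a small history checker. The sketch below is a toy illustration only: it is not Jepsen, not an existing Garage test harness, and the history format is an assumption made for this example.

```rust
/// A completed write: the version it wrote and the time the 200 OK was received.
struct Write {
    version: u64,
    completed_at: u64,
}

/// A read: the time it was issued and the version it returned (None = not found).
struct Read {
    started_at: u64,
    returned: Option<u64>,
}

/// Read-after-write: every read that starts after a write has completed must
/// return that write's version or a more recent one.
fn read_after_write_holds(writes: &[Write], reads: &[Read]) -> bool {
    reads.iter().all(|r| {
        // Most recent version whose write completed before this read started.
        let must_see = writes
            .iter()
            .filter(|w| w.completed_at < r.started_at)
            .map(|w| w.version)
            .max();
        match (must_see, r.returned) {
            (None, _) => true,                // nothing was required to be visible yet
            (Some(v), Some(got)) => got >= v, // must see that version or a newer one
            (Some(_), None) => false,         // a stale "not found" is a violation
        }
    })
}

fn main() {
    // A write completes at t=10; a read starting at t=20 does not see it.
    let writes = vec![Write { version: 1, completed_at: 10 }];
    let reads = vec![Read { started_at: 20, returned: None }];
    assert!(!read_after_write_holds(&writes, &reads));
    println!("read-after-write violation detected in this toy history");
}
```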

lx added this to the v0.9 milestone 2023-01-30 14:17:26 +00:00
lx added the Correctness label 2023-01-30 14:17:26 +00:00
lx added the Bug label 2023-03-20 10:56:37 +00:00
lx changed title from Consistency when layout changes to Read-after-write consistency may not be maintained when layout changes 2023-03-20 10:56:51 +00:00
lx modified the milestone from v0.9 to v1.0 2023-03-20 11:21:58 +00:00
lx (Author, Owner) commented:

Issue #151 could probably benefit from this, and might even be fixed entirely by it.

lx closed this issue 2024-01-11 10:52:13 +00:00
Reference: Deuxfleurs/garage#495