Read-after-write consistency may not be maintained when layout changes #495
Reference: Deuxfleurs/garage#495
Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:

Read-after-write consistency. If a client A writes an object x (e.g. using PutObject) and receives an HTTP 200 OK response, and a client B later tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.

This consistency guarantee at the level of objects in the object store API is in fact a reflection of read-after-write consistency in Garage's internal metadata engine (a distributed key/value store with CRDT values). Reads and writes to metadata tables use quorums of 2 out of 3 nodes for each operation, ensuring that if operation B starts after operation A has completed, then there is at least one node that handles both operation A and operation B. In the case where A is a write (an update) and B is a read, that node has the opportunity to return the value written by A to the reading client B.
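The quorum-intersection argument above can be checked mechanically. The following sketch (node names are hypothetical) enumerates all 2-out-of-3 quorums and verifies that any two of them share a node, which is the usual R + W > N condition:

```python
from itertools import combinations

# With a replication factor of 3 and read/write quorums of size 2,
# any two quorums share at least one node, because 2 + 2 > 3.
nodes = {"n1", "n2", "n3"}
quorums = [set(q) for q in combinations(nodes, 2)]

# Every pair of quorums intersects, so a read quorum always contains
# at least one node that handled the preceding write.
assert all(a & b for a in quorums for b in quorums)
```

This is exactly the property that guarantees read-after-write as long as the set of nodes stays fixed.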
The issue. Maintaining this property depends crucially on the intersection of the quorums being non-empty. There is however a scenario where this intersection may be empty: when the set of nodes assigned to store some entries changes, for instance when nodes are added or removed and data is being rebalanced between nodes. For instance, a partition (a subset of the data stored by Garage) might be stored by nodes 1, 2 and 3 before a layout change, and by nodes 1, 4 and 5 after it. All operations done before the layout change were handled by two nodes among 1, 2 and 3, but a quorum of two nodes among 1, 4 and 5 is not guaranteed to intersect them; moreover, nodes 4 and 5 will not have caught up with all of the data stored by nodes 2 and 3 until a significant rebalancing delay has passed. So read-after-write consistency is broken while the rebalance is in progress.
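The failure case can be demonstrated the same way. Using the hypothetical node sets from the example above, there exist an old-layout write quorum and a new-layout read quorum that are completely disjoint:

```python
from itertools import combinations

# Hypothetical node sets for one partition before and after a
# layout change, as in the example above.
old_nodes = {"n1", "n2", "n3"}   # layout version n-1
new_nodes = {"n1", "n4", "n5"}   # layout version n

old_quorums = [set(q) for q in combinations(old_nodes, 2)]
new_quorums = [set(q) for q in combinations(new_nodes, 2)]

# Some old-layout write quorum and some new-layout read quorum are
# disjoint, e.g. {n2, n3} and {n4, n5}: the read can miss the write
# entirely until rebalancing has copied the data over.
assert any(not (w & r) for w in old_quorums for r in new_quorums)
```

As soon as both layouts are in play, the R + W > N reasoning no longer applies, because R and W are taken over different node sets.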
Possible solutions. To solve this issue, we will have to track the progress of the transfer of data from the nodes of layout version n-1 to the nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy, taking quorums in both the old and the new node sets, to ensure consistency.
This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast; however, it might be a bit trickier to implement.
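The details of the two variants were given as bullet lists in the original issue and did not survive this copy, so the following is only a hedged sketch of one of them, with all names hypothetical: while the transition is in progress, writes must reach a quorum in both the old and the new node sets, so that a read quorum taken in a single layout still intersects every committed write. The flipped variant would instead make reads pay the dual-quorum cost.

```python
# Hypothetical sketch of a dual-quorum strategy during a layout
# transition; all names are illustrative, not Garage's real API.

def has_quorum(acks: set[str], nodes: set[str], q: int = 2) -> bool:
    """True if at least q of the acknowledging nodes belong to `nodes`."""
    return len(acks & nodes) >= q

def write_ok(acks: set[str], old: set[str], new: set[str]) -> bool:
    # Writes must reach a quorum in BOTH layouts while the
    # rebalance is in progress.
    return has_quorum(acks, old) and has_quorum(acks, new)

def read_ok(acks: set[str], old: set[str], new: set[str]) -> bool:
    # Reads then only need a quorum in one layout (here: the old one),
    # since every committed write also covered a quorum there.
    return has_quorum(acks, old)

old_nodes = {"n1", "n2", "n3"}
new_nodes = {"n1", "n4", "n5"}

# A write acknowledged by n1, n2 and n4 has a quorum in both layouts.
assert write_ok({"n1", "n2", "n4"}, old_nodes, new_nodes)
# A write acknowledged only by n2 and n3 must not commit mid-transition.
assert not write_ok({"n2", "n3"}, old_nodes, new_nodes)
# Any old-layout read quorum intersects every committed write.
assert read_ok({"n2", "n3"}, old_nodes, new_nodes)
```

The trade-off is visible in the signatures: whichever operation takes the dual quorum pays extra latency during the transition, which is why the choice between the two variants hinges on whether reads or writes need to be fast.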
We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not all be informed of the layout change at exactly the same time, and small inconsistencies may appear in that interval.
Description of this task. We want to solve this as well as we can before we tag Garage v1.0. Here are the steps we should take:

1. Try to break Garage by putting it in a situation where this inconsistency actually appears. We could use Jepsen or similar tools for this.
2. Understand exactly what is happening and why it breaks.
3. Make a theoretical model of the system that reflects the issue, and figure out an algorithm that works in this model (e.g. based on one of the two solutions proposed above).
4. Implement the chosen solution.
5. Check that the issue is resolved, using the tools and baseline defined in step 1.
The issue was renamed from "Consistency when layout changes" to "Read-after-write consistency may not be maintained when layout changes".

Issue #151 could probably benefit from this, and might even be fixed entirely.