Read-after-write consistency may not be maintained when layout changes #495
Labels
No labels
action
check-aws
action
discussion-needed
action
for-external-contributors
action
for-newcomers
action
more-info-needed
action
need-funding
action
triage-required
kind
correctness
kind
ideas
kind
improvement
kind
performance
kind
testing
kind
usability
kind
wrong-behavior
prio
critical
prio
low
scope
admin-api
scope
background-healing
scope
build
scope
documentation
scope
k8s
scope
layout
scope
metadata
scope
ops
scope
rpc
scope
s3-api
scope
security
scope
telemetry
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: Deuxfleurs/garage#495
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Garage provides mainly one consistency guarantee, read-after-write for objects, which can be described as follows:
Read-after-write consistency. If a client A writes an object x (e.g. using PutObject) and receives a
HTTP 200 OK
response, and later a client B tries to read object x (e.g. using GetObject), then B will read the version written by A, or a more recent version.This consistency guarantee at the level of objects in the object store API is in fact a reflection of read-after-write consistency in the internal metadata engine of Garage (which is a distributed key/value store with CRDT values). Reads and writes to metadata tables use quorums of 2 out of 3 nodes for each operation, ensuring that if operation B starts after operation A has completed, then there is at least one node that is handling both operation A and B. In the case where A is a write (an update) and B is a read, that node will have the opportunity to return the value written in A to the reading client B.
The issue. Maintaining this property depends crucially on the intersection of the quorums being non-empty. There is however a scenario where these quorums may be empty: when the set of nodes affected to storing some entries changes, for instance when nodes are added or removed and data is being rebalanced between nodes. For instance, a partition (a subset of the data stored by Garage) might be stored by nodes 1, 2 and 3 before a layout change, and by nodes 1, 4 and 5 after the layout change. All operations done before the layout change will have been handled by two nodes among 1, 2, 3, but there is no guarantee of the intersection with two nodes among 1, 4, 5, and moreover nodes 4 and 5 will not catch up with all of the data stored by nodes 2 and 3 before some significant rebalancing delay. So read-after-write consistency is broken while the rebalance is in progress.
Possible solutions. To solve this issue, we will have to track the progress of the transfer of data from nodes of layout version n-1 to nodes of layout version n. As long as that transfer has not finished, we will have to use a dual-quorum strategy to ensure consistency:
This can be flipped the other way around, which might make more sense if we assume that reads are the most frequent operations and need to complete fast, however it might be a bit more tricky to implement:
We will also have to add more synchronization to ensure that data is not saved to nodes that are no longer responsible for a given data partition, as nodes may not be informed of the layout change at exactly the same time and small inconsistencies may appear in this interval.
Description of this task. We want to solve this the best we can before we tag Garage v1.0. Here are the steps we should take:
Try to break Garage, by putting it in a situation where this inconsistency actually appears. We could use Jepsen or other similar tools for this.
Understand exactly what is happening and why it breaks.
Make a theoretical model of the system that reflects the issue, and figure out an algorithm that works in this model (e.g. based on one of the two solutions proposed above).
Implement the chosen solution.
Check that the issue is resolved using the tools and baseline defined in step 1.
Consistency when layout changesto Read-after-write consistency may not be maintained when layout changesIssue #151 could probably benefit from this, maybe be fixed entirely