Try to solve persistence issues #259
Reference: Deuxfleurs/garage#259
Branch: fix-resync
fsync() at appropriate places in write_block
bae45fa6c1 to d1eefea917
d1eefea917 to 2f9d606bd6
@@ -579,6 +578,9 @@ impl BlockManager {
// don't do resync and return early, but still
// make sure the item is still in queue at expected time
self.put_to_resync_at(&hash, ec.next_try())?;
// ec.next_try() > now >= time_msec, so this remove
I don't understand the condition under which we get here; can you confirm the following reasoning?
We get a block to repair in the queue
-> its scheduled resync time is before now, so we handle it
-> we get the error counter associated with the block
-> the error counter has a next_try method that implements exponential backoff
-> (the block may have been added to the queue by another tool that does not take the exponential backoff into account?)
-> the exponential backoff says we should not retry now, as its computed retry time is later than the previously scheduled one
-> we re-add the block at the time computed by the exponential backoff
-> we remove the block entry at the current time value
After this analysis, I have a question:
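The backoff step in the reasoning above can be sketched as follows. This is a minimal illustration, not Garage's actual ErrorCounter: the field names, the base delay, and the cap are all hypothetical.

```rust
// Hypothetical sketch of the exponential-backoff reasoning discussed above.
// Field names and constants are illustrative, not Garage's actual values.
#[derive(Clone, Copy)]
struct ErrorCounter {
    errors: u32,
    last_try_msec: u64,
}

impl ErrorCounter {
    /// Next allowed retry time: last_try + 1s * 2^errors, capped at 1 hour.
    fn next_try(&self) -> u64 {
        let backoff_msec = 1000u64.saturating_mul(1u64 << self.errors.min(22));
        self.last_try_msec + backoff_msec.min(3_600_000)
    }
}

fn main() {
    let now = 10_000u64;
    let ec = ErrorCounter { errors: 3, last_try_msec: now };
    // A queue entry scheduled for `time_msec <= now` was popped, but the
    // counter says retry only at next_try() > now: so the entry is
    // re-inserted at next_try() and the stale entry is removed, which is
    // the invariant stated in the diff comment above.
    assert!(ec.next_try() > now);
}
```

With errors = 3 and a 1 s base, next_try() is 8 s after the last attempt, so an entry popped "too early" is simply pushed back to that later time.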
True. I'll see if I can refactor this logic to make the handling of the resync queue more self-contained and more understandable. But I think that to implement what you are suggesting, we need a transaction that takes a lock on both trees at once (resync_notify and resync_errors), which we cannot do with the SledCountedTree wrapper, so we probably need a mutex for all operations on the queue. I have to think about it.

@@ -892,6 +902,14 @@ impl BlockManagerLocked {
fs::remove_file(to_delete).await?;
}
let dir = fs::OpenOptions::new()
If I am correct, this code is used to fsync a move, as you mentioned on Matrix?
I suggest we add a comment, to avoid this code being removed if someone else refactors this part of the code.
5bf58bd539 to d78bf379fb
dae0d8aebd to ba6b56ae68
fixes #256
fixes #257