block/repair.rs: Added a random element of 10 days to SCRUB_INTERVAL #516
Reference: Deuxfleurs/garage#516
When a Garage cluster is initially set up, every node is scheduled to perform its first scrub 30 days in the future.
This means that potentially all of those nodes will be scrubbing together while also handling user traffic and recovery operations. To help spread the scrub load across the cluster, this PR changes the interval to 25 days and adds a random delay of up to 10 days on top of that.
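As a rough illustration of the arithmetic described above, the next scrub time can be computed as "now + 25 days + a random offset below 10 days". This is only a sketch: the constant names, the function name `next_scrub_time_msec`, and passing the random value in as a parameter are assumptions made for the example, not the code in this PR.

```rust
/// Base interval between scrubs: 25 days, in milliseconds (assumed name).
const SCRUB_INTERVAL_MSEC: u64 = 25 * 24 * 3600 * 1000;
/// Width of the random window added on top: 10 days, in milliseconds (assumed name).
const SCRUB_JITTER_MSEC: u64 = 10 * 24 * 3600 * 1000;

/// Hypothetical helper: maps a raw random number onto a timestamp in
/// [now + 25 days, now + 35 days). A real caller would draw `random`
/// from an RNG; it is a parameter here so the sketch stays deterministic.
fn next_scrub_time_msec(now_msec: u64, random: u64) -> u64 {
    let jitter = random % SCRUB_JITTER_MSEC;
    now_msec + SCRUB_INTERVAL_MSEC + jitter
}
```

With different random values on each node, the scrub start times land anywhere in a 10-day window instead of all on the same day.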
ff049da88d to bd458d3a12
I think if we want this to be robust, the planned time of the next scrub has to be chosen exactly once and persisted to disk, not chosen randomly every time a sleep call is made.
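One way to read this suggestion: the deadline is computed once, written to disk, and afterwards only compared against the clock. The sketch below assumes a simple two-line text file as the on-disk format and a hypothetical `ScrubState` type; only the two field names are taken from the diff in this PR, and Garage's actual persister works differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical persisted state; the field names mirror those in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ScrubState {
    time_last_complete_scrub: u64, // msec since the Unix epoch
    time_next_run_scrub: u64,      // msec since the Unix epoch
}

impl ScrubState {
    /// Write both timestamps as decimal lines (illustrative format only).
    fn save(&self, path: &Path) -> io::Result<()> {
        let text = format!(
            "{}\n{}\n",
            self.time_last_complete_scrub, self.time_next_run_scrub
        );
        fs::write(path, text)
    }

    /// Read the state back, e.g. after a restart.
    fn load(path: &Path) -> io::Result<ScrubState> {
        let text = fs::read_to_string(path)?;
        let mut fields = text.lines().map(|l| l.trim().parse::<u64>());
        let mut next_field = || {
            fields
                .next()
                .and_then(|r| r.ok())
                .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed scrub state"))
        };
        Ok(ScrubState {
            time_last_complete_scrub: next_field()?,
            time_next_run_scrub: next_field()?,
        })
    }

    /// The worker only compares against the stored deadline; it never re-rolls it.
    fn scrub_due(&self, now_msec: u64) -> bool {
        now_msec >= self.time_next_run_scrub
    }
}
```

Because the deadline survives restarts and is not redrawn on every sleep call, a node cannot keep pushing its scrub into the future by chance.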
bd458d3a12 to 34ebbacec1
@lx I think I've added this in now.
LGTM, just two minor remarks
@ -178,2 +180,4 @@
pub(crate) corruptions_detected: u64,
}
fn randomize_next_run_time() -> u64 {
The name of this function should contain the word scrub, maybe randomize_next_scrub_run_time.
@ -337,2 +362,4 @@
self.persister
.set_with(|p| p.time_last_complete_scrub = now_msec())?;
self.persister
.set_with(|p| p.time_next_run_scrub = randomize_next_run_time())?;
Probably better to combine those two calls into one: set_with does non-async (possibly blocking) IO to write the persistent data to disk, so it is better to do it only once if possible.
34ebbacec1 to 148b66b843
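The review remark about merging the two set_with calls can be illustrated with a stand-in persister that counts its writes. Everything here is hypothetical (`ScrubInfo`, `CountingPersister`, `finish_scrub`, and this simplified `set_with` signature); it is meant only to show that updating both fields in one closure costs a single write instead of two.

```rust
/// Hypothetical state with the two fields touched in the diff.
#[derive(Debug, Default)]
struct ScrubInfo {
    time_last_complete_scrub: u64,
    time_next_run_scrub: u64,
}

/// Hypothetical stand-in for a persister: counts how often state is flushed.
#[derive(Debug, Default)]
struct CountingPersister {
    state: ScrubInfo,
    writes: u32,
}

impl CountingPersister {
    /// Apply the update, then "persist" it. A real persister would do
    /// (possibly blocking) disk IO at this point.
    fn set_with(&mut self, update: impl FnOnce(&mut ScrubInfo)) {
        update(&mut self.state);
        self.writes += 1;
    }
}

/// Record the end of a scrub and the next deadline in a single write.
fn finish_scrub(persister: &mut CountingPersister, now_msec: u64, next_msec: u64) {
    persister.set_with(|p| {
        p.time_last_complete_scrub = now_msec;
        p.time_next_run_scrub = next_msec;
    });
}
```

Besides halving the IO, a single call also means the two timestamps are never persisted in a half-updated state.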
Thanks!