block/repair.rs: Added a random element of 10 days to SCRUB_INTERVAL #516

Merged
lx merged 2 commits from jpds/garage:scrub-randomize-window into main 2023-03-06 14:11:25 +00:00
Contributor

When a Garage cluster is initially set up - the nodes would be scheduled to perform a scrub in 30 days in the future.

This then means that potentially all those nodes will be scrubbing togther whilst also handling user traffic/recovery operations. To help balance this scrub load across cluster - this PR changes the interval to 25 days and adds a random element of 10 days after that.

When a Garage cluster is initially set up - the nodes would be scheduled to perform a scrub in 30 days in the future. This then means that potentially all those nodes will be scrubbing togther whilst also handling user traffic/recovery operations. To help balance this scrub load across cluster - this PR changes the interval to 25 days and adds a random element of 10 days after that.
jpds force-pushed scrub-randomize-window from ff049da88d to bd458d3a12 2023-03-05 11:32:44 +00:00 Compare
Owner

I think if we want this to be robust, the planned time of the next scrub has to be chosen exactly once and persisted to disk, not chosen randomly every time a sleep call is made.

I think if we want this to be robust, the planned time of the next scrub has to be chosen exactly once and persisted to disk, not chosen randomly every time a sleep call is made.
jpds force-pushed scrub-randomize-window from bd458d3a12 to 34ebbacec1 2023-03-06 12:32:10 +00:00 Compare
Author
Contributor

@lx I think I've added this in now.

@lx I think I've added this in now.
lx reviewed 2023-03-06 13:25:12 +00:00
lx left a comment
Owner

LGTM, just two minor remarks

LGTM, just two minor remarks
@ -178,2 +180,4 @@
pub(crate) corruptions_detected: u64,
}
fn randomize_next_run_time() -> u64 {
Owner

the name of this function should contain the word scrub, maybe randomize_next_scrub_run_time

the name of this function should contain the word `scrub`, maybe `randomize_next_scrub_run_time`
jpds marked this conversation as resolved
@ -337,2 +362,4 @@
self.persister
.set_with(|p| p.time_last_complete_scrub = now_msec())?;
self.persister
.set_with(|p| p.time_next_run_scrub = randomize_next_run_time())?;
Owner

Probably better to concatenate those two calls in one:

self.persister.set_with(|p| 
  p.time_last_complete_scrub = now_msec();
  p.time_next_run_scrub = randomize_next_run_time();
})?

set_with does non-async (possibly blocking) IO to write the persistent data to disk, better do it only once if possible

Probably better to concatenate those two calls in one: ```rust self.persister.set_with(|p| p.time_last_complete_scrub = now_msec(); p.time_next_run_scrub = randomize_next_run_time(); })? ``` `set_with` does non-async (possibly blocking) IO to write the persistent data to disk, better do it only once if possible
jpds marked this conversation as resolved
jpds force-pushed scrub-randomize-window from 34ebbacec1 to 148b66b843 2023-03-06 13:46:35 +00:00 Compare
Owner

Thanks!

Thanks!
lx merged commit 2dc80abbb1 into main 2023-03-06 14:11:25 +00:00
jpds deleted branch scrub-randomize-window 2023-03-06 14:18:30 +00:00
Sign in to join this conversation.
No description provided.