Try to solve persistence issues #259

Merged
lx merged 5 commits from fix-resync into main 2022-03-14 15:27:15 +00:00
Owner
  • call fsync() at appropriate places in write_block
  • make sure blocks in resync queue stay there until they are sucessfully processed
- [x] call `fsync()` at appropriate places in `write_block` - [x] make sure blocks in resync queue stay there until they are **sucessfully** processed
lx force-pushed fix-resync from bae45fa6c1 to d1eefea917 2022-03-01 11:03:44 +00:00 Compare
lx force-pushed fix-resync from d1eefea917 to 2f9d606bd6 2022-03-01 11:04:39 +00:00 Compare
lx changed target branch from main to bug/resync-exponential-backoff 2022-03-01 11:05:08 +00:00
lx added 1 commit 2022-03-01 13:55:59 +00:00
Fix resync queue to not drop items
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
5bf58bd539
quentin added the
kind
correctness
label 2022-03-03 13:51:14 +00:00
quentin approved these changes 2022-03-04 09:41:41 +00:00
@ -579,6 +578,9 @@ impl BlockManager {
// don't do resync and return early, but still
// make sure the item is still in queue at expected time
self.put_to_resync_at(&hash, ec.next_try())?;
// ec.next_try() > now >= time_msec, so this remove
Owner

I don't understand the condition in which we are here, can you confirm my following reasoning ?

We get a block to repair in the queue
-> its scheduled resync time is before now, so we handle it
-> we get the block associated error counter
-> the error counter has a next_try method that implement exponential backoff
-> (the block can be added by another tool in the queue that do not consider exponential backoff?)
-> The exponential backoff says we should not reschedule now, as its exponential backoff value is greater than the previously scheduled one
-> We re-add the block at the value computed by the exponential backoff
-> We remove the block at the current time value

After this analysis, I have a question:

  • Why do we not have an helper/method to put a block in the resync queue that would directly check the value of the ErrorCounter (if any) and its associated exponential backoff time?
I don't understand the condition in which we are here, can you confirm my following reasoning ? We get a block to repair in the queue -> its scheduled resync time is before now, so we handle it -> we get the block associated error counter -> the error counter has a next_try method that implement exponential backoff -> (the block can be added by another tool in the queue that do not consider exponential backoff?) -> The exponential backoff says we should not reschedule now, as its exponential backoff value is greater than the previously scheduled one -> We re-add the block at the value computed by the exponential backoff -> We remove the block at the current time value After this analysis, I have a question: - Why do we not have an helper/method to put a block in the resync queue that would directly check the value of the ErrorCounter (if any) and its associated exponential backoff time?
Author
Owner

True. I'll see if I can refactor this logic to make the handling of the resync queue more self-contained and more understandable. But I think that to implement what you are saying, we need to have a transaction that takes a lock on the two trees at once (resync_notify and resync_errors), which we cannot do with the SledCountedTree wrapper, so we probably need to have a mutex for all operations on the queue. I have to think about it.

True. I'll see if I can refactor this logic to make the handling of the resync queue more self-contained and more understandable. But I think that to implement what you are saying, we need to have a transaction that takes a lock on the two trees at once (resync_notify and resync_errors), which we cannot do with the `SledCountedTree` wrapper, so we probably need to have a mutex for all operations on the queue. I have to think about it.
@ -892,6 +902,14 @@ impl BlockManagerLocked {
fs::remove_file(to_delete).await?;
}
let dir = fs::OpenOptions::new()
Owner

If I am correct, this code is used to fsync a move, as you mentionned on Matrix?
I suggest we add a comment to avoid removing it if someone else refactor this part of the code.

If I am correct, this code is used to fsync a move, as you mentionned on Matrix? I suggest we add a comment to avoid removing it if someone else refactor this part of the code.
lx marked this conversation as resolved
lx changed target branch from bug/resync-exponential-backoff to main 2022-03-14 10:50:55 +00:00
lx force-pushed fix-resync from 5bf58bd539 to d78bf379fb 2022-03-14 10:51:43 +00:00 Compare
lx added 1 commit 2022-03-14 10:54:08 +00:00
Add comment for fsync
Some checks reported errors
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build was killed
0af314b295
lx force-pushed fix-resync from dae0d8aebd to ba6b56ae68 2022-03-14 11:28:07 +00:00 Compare
Author
Owner

fixes #256
fixes #257

fixes #256 fixes #257
lx merged commit ba6b56ae68 into main 2022-03-14 15:27:15 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#259
No description provided.