Try to solve persistence issues #259

lx · 2022-03-01T10:53:52Z

lx commented

2022-03-01 10:53:52 +00:00

call fsync() at appropriate places in write_block
make sure blocks in resync queue stay there until they are sucessfully processed

- [x] call `fsync()` at appropriate places in `write_block` - [x] make sure blocks in resync queue stay there until they are **sucessfully** processed

lx force-pushed fix-resync from bae45fa6c1 to d1eefea917

2022-03-01 11:03:44 +00:00

Compare

lx force-pushed fix-resync from d1eefea917 to 2f9d606bd6

2022-03-01 11:04:39 +00:00

Compare

lx changed target branch from main to bug/resync-exponential-backoff

2022-03-01 11:05:08 +00:00

lx added 1 commit 2022-03-01 13:55:59 +00:00

Fix resync queue to not drop items

continuous-integration/drone/push Build is passing

Details

continuous-integration/drone/pr Build is passing

Details

5bf58bd539

quentin added the

kind

correctness

label 2022-03-03 13:51:14 +00:00

quentin approved these changes 2022-03-04 09:41:41 +00:00

src/model/block.rs Outdated

					
				@ -579,6 +578,9 @@ impl BlockManager {

										// don't do resync and return early, but still

										// make sure the item is still in queue at expected time

										self.put_to_resync_at(&hash, ec.next_try())?;

										// ec.next_try() > now >= time_msec, so this remove

quentin commented

2022-03-04 09:41:09 +00:00

I don't understand the condition in which we are here, can you confirm my following reasoning ?

We get a block to repair in the queue
-> its scheduled resync time is before now, so we handle it
-> we get the block associated error counter
-> the error counter has a next_try method that implement exponential backoff
-> (the block can be added by another tool in the queue that do not consider exponential backoff?)
-> The exponential backoff says we should not reschedule now, as its exponential backoff value is greater than the previously scheduled one
-> We re-add the block at the value computed by the exponential backoff
-> We remove the block at the current time value

After this analysis, I have a question:

Why do we not have an helper/method to put a block in the resync queue that would directly check the value of the ErrorCounter (if any) and its associated exponential backoff time?

I don't understand the condition in which we are here, can you confirm my following reasoning ? We get a block to repair in the queue -> its scheduled resync time is before now, so we handle it -> we get the block associated error counter -> the error counter has a next_try method that implement exponential backoff -> (the block can be added by another tool in the queue that do not consider exponential backoff?) -> The exponential backoff says we should not reschedule now, as its exponential backoff value is greater than the previously scheduled one -> We re-add the block at the value computed by the exponential backoff -> We remove the block at the current time value After this analysis, I have a question: - Why do we not have an helper/method to put a block in the resync queue that would directly check the value of the ErrorCounter (if any) and its associated exponential backoff time?

lx commented

2022-03-07 10:33:55 +00:00

True. I'll see if I can refactor this logic to make the handling of the resync queue more self-contained and more understandable. But I think that to implement what you are saying, we need to have a transaction that takes a lock on the two trees at once (resync_notify and resync_errors), which we cannot do with the SledCountedTree wrapper, so we probably need to have a mutex for all operations on the queue. I have to think about it.

True. I'll see if I can refactor this logic to make the handling of the resync queue more self-contained and more understandable. But I think that to implement what you are saying, we need to have a transaction that takes a lock on the two trees at once (resync_notify and resync_errors), which we cannot do with the `SledCountedTree` wrapper, so we probably need to have a mutex for all operations on the queue. I have to think about it.

src/model/block.rs Outdated

					
				@ -892,6 +902,14 @@ impl BlockManagerLocked {

							fs::remove_file(to_delete).await?;

						}

						let dir = fs::OpenOptions::new()

quentin commented

2022-03-04 09:25:25 +00:00

If I am correct, this code is used to fsync a move, as you mentionned on Matrix?
I suggest we add a comment to avoid removing it if someone else refactor this part of the code.

If I am correct, this code is used to fsync a move, as you mentionned on Matrix? I suggest we add a comment to avoid removing it if someone else refactor this part of the code.

lx marked this conversation as resolved

lx changed target branch from bug/resync-exponential-backoff to main

2022-03-14 10:50:55 +00:00

lx force-pushed fix-resync from 5bf58bd539 to d78bf379fb

2022-03-14 10:51:43 +00:00

Compare

lx added 1 commit 2022-03-14 10:54:08 +00:00

Add comment for fsync

continuous-integration/drone/push Build is passing

Details

continuous-integration/drone/pr Build was killed

Details

0af314b295

lx force-pushed fix-resync from dae0d8aebd to ba6b56ae68

2022-03-14 11:28:07 +00:00

Compare

lx commented

2022-03-14 15:26:29 +00:00

fixes #256
fixes #257

fixes #256 fixes #257

lx merged commit ba6b56ae68 into main

2022-03-14 15:27:15 +00:00

Sign in to join this conversation.

No reviewers

No milestone

No project

No assignees

2 participants

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#259