db-snapshot: Add error handling to metadata snapshot creation #930

Open
baptiste wants to merge 2 commits from handle_snapshot_errors into main
Owner

Fix #920

Fix #920
baptiste added 1 commit 2025-01-22 12:33:09 +00:00
db-snapshot: Add error handling to metadata snapshot creation
All checks were successful
ci/woodpecker/push/debug Pipeline was successful
ci/woodpecker/pr/debug Pipeline was successful
e5e054187e
Fix #920
baptiste force-pushed handle_snapshot_errors from e5e054187e to 9178de819c 2025-01-24 18:32:23 +00:00 Compare
Author
Owner

My first attempt at fixing the issue was completely wrong (I'm still discoverying Rust error handling)

I finally found the actual root cause of the bug, so this fix should be better. Review welcome, there may be better Rust idioms to use.

My first attempt at fixing the issue was completely wrong (I'm still discoverying Rust error handling) I finally found the actual root cause of the bug, so this fix should be better. Review welcome, there may be better Rust idioms to use.
requested review from trinity-1686a 2025-01-26 15:41:13 +00:00
Armael approved these changes 2025-01-26 17:04:24 +00:00
Armael left a comment
Owner

the logic of the change is good AFAICT

the logic of the change is good AFAICT
@ -483,3 +483,3 @@
PRIO_NORMAL,
)
.await
.await?
Owner

This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested Result being returned, one for the outcome of the RPC call and one for the RPC operation itself and we are simply flattening them.

(If I understand correctly, the issue before this PR is that we could get an OK (for the call) wrapping an Error (for the RPC operation) and that would get simply interpreted as an overall OK.

This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested `Result` being returned, one for the outcome of the RPC call and one for the RPC operation itself and we are simply flattening them. (If I understand correctly, the issue before this PR is that we could get an OK (for the call) wrapping an Error (for the RPC operation) and that would get simply interpreted as an overall OK.
Author
Owner

Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested Result! I'll add a comment to explain.

Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested `Result`! I'll add a comment to explain.
Author
Owner

Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).

Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).
Owner

I approve adding the ?. Indeed the function of "making a snapshot on all nodes" should fail if either:

  1. one node could not be contacted
  2. one node could be contacted but failed when doing the snapshot

which is what this change does.

I approve adding the `?`. Indeed the function of "making a snapshot on all nodes" should fail if either: 1. one node could not be contacted 2. one node could be contacted but failed when doing the snapshot which is what this change does.
@ -499,0 +499,4 @@
Err(_) => true,
Ok(_) => false,
}) {
Err(Error::BadRequest(format_table_to_string(ret)).into())
Owner

any particular reason for using the BadRequest error case? Elsewhere it seems used to report incorrect uses of the CLI, but here this seems like a different kind of error?

any particular reason for using the `BadRequest` error case? Elsewhere it seems used to report incorrect uses of the CLI, but here this seems like a different kind of error?
Author
Owner

I haven't found a better error class, and this one is already used in many different cases, for example:

Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))

I could use a GarageError::Message instead, but the effect from the CLI is the same, the message gets prefixed with "Error: ".

I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.

I haven't found a better error class, and this one is already used in many different cases, for example: `Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))` I could use a `GarageError::Message` instead, but the effect from the CLI is the same, the message gets prefixed with "Error: ". I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.
Owner

I think using Error::BadRequests for other similar cases is a mistake and should be fixed, Error::Message is the correct choice

I think using `Error::BadRequests` for other similar cases is a mistake and should be fixed, `Error::Message` is the correct choice
Owner

@baptiste Could you rebase this PR on the current main to allow the updated CI to go through? Thanks, and sorry for the inconvenience.

@baptiste Could you rebase this PR on the current main to allow the updated CI to go through? Thanks, and sorry for the inconvenience.
baptiste force-pushed handle_snapshot_errors from 9178de819c to 8ff2aa729b 2025-01-27 17:33:44 +00:00 Compare
lx requested changes 2025-01-27 18:06:32 +00:00
@ -496,3 +496,3 @@
}
Ok(AdminRpc::Ok(format_table_to_string(ret)))
if resps.iter().any(|resp| match resp {
Owner

simplify: resps.iter().any(Result::is_err)

simplify: `resps.iter().any(Result::is_err)`
Author
Owner

I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)

I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)
@ -499,0 +499,4 @@
Err(_) => true,
Ok(_) => false,
}) {
Err(Error::BadRequest(format_table_to_string(ret)).into())
Owner

Error::Message is more appropriate here

`Error::Message` is more appropriate here
All checks were successful
ci/woodpecker/pr/debug Pipeline was successful
Required
Details
ci/woodpecker/push/debug Pipeline was successful
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin handle_snapshot_errors:handle_snapshot_errors
git checkout handle_snapshot_errors
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#930
No description provided.