db-snapshot: Add error handling to metadata snapshot creation #930
No reviewers
Labels
No labels
action
check-aws
action
discussion-needed
action
for-external-contributors
action
for-newcomers
action
more-info-needed
action
need-funding
action
triage-required
kind
correctness
kind
ideas
kind
improvement
kind
performance
kind
testing
kind
usability
kind
wrong-behavior
prio
critical
prio
low
scope
admin-api
scope
background-healing
scope
build
scope
documentation
scope
k8s
scope
layout
scope
metadata
scope
ops
scope
rpc
scope
s3-api
scope
security
scope
telemetry
No milestone
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: Deuxfleurs/garage#930
Loading…
Add table
Reference in a new issue
No description provided.
Delete branch "handle_snapshot_errors"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fix #920
e5e054187e
to9178de819c
My first attempt at fixing the issue was completely wrong (I'm still discoverying Rust error handling)
I finally found the actual root cause of the bug, so this fix should be better. Review welcome, there may be better Rust idioms to use.
the logic of the change is good AFAICT
@ -483,3 +483,3 @@
PRIO_NORMAL,
)
.await
.await?
This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested
Result
being returned, one for the outcome of the RPC call and one for the RPC operation itself and we are simply flattening them.(If I understand correctly, the issue before this PR is that we could get an OK (for the call) wrapping an Error (for the RPC operation) and that would get simply interpreted as an overall OK.
Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested
Result
! I'll add a comment to explain.Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).
I approve adding the
?
. Indeed the function of "making a snapshot on all nodes" should fail if either:which is what this change does.
@ -499,0 +499,4 @@
Err(_) => true,
Ok(_) => false,
}) {
Err(Error::BadRequest(format_table_to_string(ret)).into())
any particular reason for using the
BadRequest
error case? Elsewhere it seems used to report incorrect uses of the CLI, but here this seems like a different kind of error?I haven't found a better error class, and this one is already used in many different cases, for example:
Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))
I could use a
GarageError::Message
instead, but the effect from the CLI is the same, the message gets prefixed with "Error: ".I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.
I think using
Error::BadRequests
for other similar cases is a mistake and should be fixed,Error::Message
is the correct choice@baptiste Could you rebase this PR on the current main to allow the updated CI to go through? Thanks, and sorry for the inconvenience.
9178de819c
to8ff2aa729b
@ -496,3 +496,3 @@
}
Ok(AdminRpc::Ok(format_table_to_string(ret)))
if resps.iter().any(|resp| match resp {
simplify:
resps.iter().any(Result::is_err)
I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)
@ -499,0 +499,4 @@
Err(_) => true,
Ok(_) => false,
}) {
Err(Error::BadRequest(format_table_to_string(ret)).into())
Error::Message
is more appropriate hereView command line instructions
Checkout
From your project repository, check out a new branch and test the changes.