db-snapshot: Add error handling to metadata snapshot creation #930

Open
baptiste wants to merge 2 commits from handle_snapshot_errors into main

View file

@ -482,7 +482,7 @@ impl AdminRpcHandler {
AdminRpc::MetaOperation(MetaOperation::Snapshot { all: false }),
PRIO_NORMAL,
)
.await
.await?

This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested Result being returned, one for the outcome of the RPC call and one for the RPC operation itself and we are simply flattening them.

(If I understand correctly, the issue before this PR is that we could get an OK (for the call) wrapping an Error (for the RPC operation) and that would get simply interpreted as an overall OK.

This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested `Result` being returned, one for the outcome of the RPC call and one for the RPC operation itself and we are simply flattening them. (If I understand correctly, the issue before this PR is that we could get an OK (for the call) wrapping an Error (for the RPC operation) and that would get simply interpreted as an overall OK.

Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested Result! I'll add a comment to explain.

Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested `Result`! I'll add a comment to explain.

Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).

Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).
Outdated
Review

I approve adding the ?. Indeed the function of "making a snapshot on all nodes" should fail if either:

  1. one node could not be contacted
  2. one node could be contacted but failed when doing the snapshot

which is what this change does.

I approve adding the `?`. Indeed the function of "making a snapshot on all nodes" should fail if either: 1. one node could not be contacted 2. one node could be contacted but failed when doing the snapshot which is what this change does.
}))
.await;
@ -495,7 +495,14 @@ impl AdminRpcHandler {
ret.push(format!("{:?}\t{}", to, res_str));
}
Ok(AdminRpc::Ok(format_table_to_string(ret)))
if resps.iter().any(|resp| match resp {
Review

simplify: resps.iter().any(Result::is_err)

simplify: `resps.iter().any(Result::is_err)`
Review

I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)

I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)
Err(_) => true,
Ok(_) => false,
}) {
Err(Error::BadRequest(format_table_to_string(ret)).into())

any particular reason for using the BadRequest error case? Elsewhere it seems used to report incorrect uses of the CLI, but here this seems like a different kind of error?

any particular reason for using the `BadRequest` error case? Elsewhere it seems used to report incorrect uses of the CLI, but here this seems like a different kind of error?

I haven't found a better error class, and this one is already used in many different cases, for example:

Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))

I could use a GarageError::Message instead, but the effect from the CLI is the same, the message gets prefixed with "Error: ".

I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.

I haven't found a better error class, and this one is already used in many different cases, for example: `Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))` I could use a `GarageError::Message` instead, but the effect from the CLI is the same, the message gets prefixed with "Error: ". I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.
Outdated
Review

I think using Error::BadRequests for other similar cases is a mistake and should be fixed, Error::Message is the correct choice

I think using `Error::BadRequests` for other similar cases is a mistake and should be fixed, `Error::Message` is the correct choice
Review

Error::Message is more appropriate here

`Error::Message` is more appropriate here
} else {
Ok(AdminRpc::Ok(format_table_to_string(ret)))
}
}
MetaOperation::Snapshot { all: false } => {
garage_model::snapshot::async_snapshot_metadata(&self.garage).await?;