db-snapshot: Add error handling to metadata snapshot creation #930

My first attempt at fixing the issue was completely wrong (I'm still discovering Rust error handling). I finally found the actual root cause of the bug, so this fix should be better. Review welcome; there may be better Rust idioms to use.

baptiste referenced this pull request

2025-01-24 18:41:01 +00:00

Metadata snapshot does not indicate an error even when disk is full #920

requested review from trinity-1686a

2025-01-26 15:41:13 +00:00

Armael approved these changes 2025-01-26 17:04:24 +00:00

Armael left a comment

the logic of the change is good AFAICT

src/garage/admin/mod.rs Outdated

					
@@ -483,3 +483,3 @@
 						PRIO_NORMAL,
 					)
-					.await
+					.await?

Armael commented

2025-01-26 16:59:57 +00:00

This part is indeed somewhat perplexing without having all the types. Maybe it's worth adding a comment explaining that there are two nested `Result`s being returned, one for the outcome of the RPC call and one for the RPC operation itself, and that we are simply flattening them.

(If I understand correctly, the issue before this PR is that we could get an `Ok` (for the call) wrapping an `Err` (for the RPC operation), and that would simply get interpreted as an overall OK.)

baptiste commented

2025-01-26 17:48:42 +00:00

Ah, I did not understand why the type-checker didn't catch the issue, thanks for the explanation about the nested `Result`! I'll add a comment to explain.

baptiste commented

2025-01-27 17:58:34 +00:00

Actually, we may need to check errors at the two Result levels (RPC errors for the first level, and actual snapshot errors for the second level).

lx commented

2025-01-27 18:08:43 +00:00

I approve adding the `?`. Indeed the function of "making a snapshot on all nodes" should fail if either:

1. one node could not be contacted
2. one node could be contacted but failed when doing the snapshot

which is what this change does.
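To make the nested `Result` discussion concrete, here is a minimal synchronous sketch of the situation. It is not the Garage code: `call_node`, the `Error` variants and the boolean parameters are hypothetical stand-ins for the real async RPC call. The outer `Result` models "could the node be contacted", the inner one models "did the snapshot succeed".

```rust
// Hypothetical error type, only for this illustration.
#[derive(Debug, PartialEq)]
enum Error {
    Rpc(String),
    Snapshot(String),
}

// Hypothetical stand-in for the RPC call.
// Outer Result: outcome of the call itself.
// Inner Result: outcome of the snapshot on the remote node.
fn call_node(reachable: bool, disk_full: bool) -> Result<Result<(), Error>, Error> {
    if !reachable {
        return Err(Error::Rpc("node unreachable".to_string()));
    }
    if disk_full {
        Ok(Err(Error::Snapshot("no space left on device".to_string())))
    } else {
        Ok(Ok(()))
    }
}

// Before: only the outer layer is inspected, so Ok(Err(..)) counts as success.
fn node_failed_before(reachable: bool, disk_full: bool) -> bool {
    call_node(reachable, disk_full).is_err()
}

// After: `?` propagates the outer error and leaves the inner Result
// as the return value, which flattens the two layers into one.
fn flatten(reachable: bool, disk_full: bool) -> Result<(), Error> {
    call_node(reachable, disk_full)?
}

fn node_failed_after(reachable: bool, disk_full: bool) -> bool {
    flatten(reachable, disk_full).is_err()
}
```

With a reachable node and a full disk, `node_failed_before(true, true)` reports no failure while `node_failed_after(true, true)` does, which matches the two failure conditions listed above.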

src/garage/admin/mod.rs Outdated

					
@@ -499,0 +499,4 @@
+					Err(_) => true,
+					Ok(_) => false,
+				}) {
+					Err(Error::BadRequest(format_table_to_string(ret)).into())

Armael commented

2025-01-26 17:03:30 +00:00

Any particular reason for using the `BadRequest` error case? Elsewhere it seems to be used to report incorrect uses of the CLI, but here this seems like a different kind of error.

baptiste commented

2025-01-26 17:47:13 +00:00

I haven't found a better error class, and this one is already used in many different cases, for example:

`Error::BadRequest(format!("Could not launch repair on nodes: {:?} (launched successfully on other nodes)", failures))`

I could use a `GarageError::Message` instead, but the effect from the CLI is the same: the message gets prefixed with "Error: ".

I have no clear view of the interactions between internal error messages, RPC error messages, and CLI-related error messages, so I'm open to suggestions.

lx commented

2025-01-27 18:07:11 +00:00

I think using `Error::BadRequest` for other similar cases is a mistake and should be fixed; `Error::Message` is the correct choice.
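As a rough illustration of the distinction being drawn here, the sketch below uses a simplified, hypothetical `Error` enum with just the two variants mentioned in this thread. The real Garage error types have more variants and different `Display` output; `summarize` and its parameters are also invented for the example. The point is only that a failed operation on a node is reported through `Message`, leaving `BadRequest` for invalid input from the caller.

```rust
use std::fmt;

// Simplified, hypothetical error type; not the actual Garage definition.
#[derive(Debug)]
enum Error {
    // The caller asked for something invalid (e.g. incorrect CLI usage).
    BadRequest(String),
    // The request was valid, but the operation itself failed.
    Message(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::BadRequest(m) => write!(f, "bad request: {}", m),
            Error::Message(m) => write!(f, "{}", m),
        }
    }
}

// Hypothetical helper: return the per-node table as an error
// if any node reported a failure, and as a success otherwise.
fn summarize(resps: &[Result<(), String>], table: String) -> Result<String, Error> {
    if resps.iter().any(Result::is_err) {
        Err(Error::Message(table))
    } else {
        Ok(table)
    }
}
```

Either variant would carry the same table text to the user; the choice mostly affects how other code can classify the failure.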

lx commented

2025-01-27 15:58:51 +00:00

@baptiste Could you rebase this PR on the current main to allow the updated CI to go through? Thanks, and sorry for the inconvenience.

baptiste force-pushed handle_snapshot_errors from 9178de819c to 8ff2aa729b

2025-01-27 17:33:44 +00:00

lx requested changes 2025-01-27 18:06:32 +00:00

src/garage/admin/mod.rs Outdated

					
@@ -496,3 +496,3 @@
 				}
-				Ok(AdminRpc::Ok(format_table_to_string(ret)))
+				if resps.iter().any(|resp| match resp {

lx commented

2025-01-27 18:05:22 +00:00

simplify: `resps.iter().any(Result::is_err)`
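For reference, a small sketch showing that the suggested form is equivalent to the explicit `match` closure. The function names and the `Result<String, String>` element type are placeholders chosen for the example, not the types used in the PR.

```rust
// Explicit version, as in the original patch.
fn any_failed_match(resps: &[Result<String, String>]) -> bool {
    resps.iter().any(|resp| match resp {
        Err(_) => true,
        Ok(_) => false,
    })
}

// Suggested version: `Result::is_err` takes `&self`, so it can be passed
// directly as the predicate over the `&Result<_, _>` items of `iter()`.
fn any_failed_short(resps: &[Result<String, String>]) -> bool {
    resps.iter().any(Result::is_err)
}
```

Both functions return `false` for an empty slice, since `any` is `false` when there are no items to test.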

baptiste commented

2025-01-27 18:11:32 +00:00

I tried many different ways to simplify this code but I did not find anything satisfying. Thanks for the very nice solution :)
