fix crash in layout computation when changing all nodes of a zone to gateway mode #937

Merged
lx merged 1 commit from baptiste/garage:fix_layout_crash into main 2025-02-19 17:09:11 +00:00
Owner

This change is probably not a proper fix, somebody with more expertise on
this code should look at it.

Here is how to reproduce the crash:

  • start with a layout with two zones
  • move all nodes of a zone to gateway mode: garage layout assign fea54bcc081f318 -g
  • garage layout show will panic with a backtrace

Fortunately, the crash is only on the RPC client side, not on the Garage
server itself, and garage layout revert still works to go back to the
previous state.

As far as I can tell, this bug is present since Garage 0.9.0 which
includes the new layout assignation algorithm:

#296

This change is probably not a proper fix, somebody with more expertise on this code should look at it. Here is how to reproduce the crash: - start with a layout with two zones - move all nodes of a zone to gateway mode: `garage layout assign fea54bcc081f318 -g` - `garage layout show` will panic with a backtrace Fortunately, the crash is only on the RPC client side, not on the Garage server itself, and `garage layout revert` still works to go back to the previous state. As far as I can tell, this bug is present since Garage 0.9.0 which includes the new layout assignation algorithm: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296
baptiste added 1 commit 2025-01-27 18:41:06 +00:00
WIP: fix crash in layout computation when changing all nodes of a zone to gateway mode
All checks were successful
ci/woodpecker/pr/debug Pipeline was successful
6d798c640f
This change is probably not a proper fix, somebody with more expertise on
this code should look at it.

Here is how to reproduce the crash:

- start with a layout with two zones
- move all nodes of a zone to gateway mode: `garage layout assign fea54bcc081f318 -g`
- `garage layout show` will panic with a backtrace

Fortunately, the crash is only on the RPC client side, not on the Garage
server itself, and `garage layout revert` still works to go back to the
previous state.

As far as I can tell, this bug is present since Garage 0.9.0 which
includes the new layout assignation algorithm:

  #296
lx approved these changes 2025-02-14 12:39:00 +00:00
lx left a comment
Owner

I think it's fine to merge this as-is

I think it's fine to merge this as-is
@ -650,8 +650,11 @@ impl LayoutVersion {
let mut cost = CostFunction::new();
for (p, assoc_p) in prev_assign.iter().enumerate() {
for n in assoc_p.iter() {
Owner

Maybe add a // FIXME comment to indicate that we are not sure that this is the correct solution.

Even if this is the "wrong" solution, I don't think it will have a big impact. At worse, too much data will be migrated between nodes, but I don't think it can crash Garage or put it in an invalid state.

Maybe add a `// FIXME` comment to indicate that we are not sure that this is the correct solution. Even if this is the "wrong" solution, I don't think it will have a big impact. At worse, too much data will be migrated between nodes, but I don't think it can crash Garage or put it in an invalid state.
@ -754,2 +757,2 @@
old_zones_of_p
.push(zone_to_id[self.expect_get_node_zone(&self.node_id_vec[*n])]);
if let Some(&zone_id) =
zone_to_id.get(self.expect_get_node_zone(&self.node_id_vec[*n]))
Owner

This one does not impact correctness of the layout computation, at worse wrong statistics will be shown but I think that's ok to leave for a future issue if it happens.

This one does not impact correctness of the layout computation, at worse wrong statistics will be shown but I think that's ok to leave for a future issue if it happens.
lx added the
kind
correctness
scope
ops
labels 2025-02-14 12:52:21 +00:00
Member

The issue comes from the fact if node was used in the previous assignment but becomes a gateway node, it still appears in prev_assign (with an index >= self.nongateway_nodes().len()).

I think too that the proposed fix is fine to be merged as is. It does not introduce any suboptimality since it's implicit consequence is to ignore gateway node in the last optimization step.

A further fix would be to explicitly exclude gateway node from prev_assign and old_assignment_opt by editing the last loop of update_node_id_vec(...). I can find some time to do it.

The issue comes from the fact if node was used in the previous assignment but becomes a gateway node, it still appears in `prev_assign` (with an index >= `self.nongateway_nodes().len()`). I think too that the proposed fix is fine to be merged as is. It does not introduce any suboptimality since it's implicit consequence is to ignore gateway node in the last optimization step. A further fix would be to explicitly exclude gateway node from `prev_assign` and `old_assignment_opt` by editing the last loop of `update_node_id_vec(...)`. I can find some time to do it.
lx changed title from WIP: fix crash in layout computation when changing all nodes of a zone to gateway mode to fix crash in layout computation when changing all nodes of a zone to gateway mode 2025-02-19 17:09:06 +00:00
lx merged commit 1cb0ae10a8 into main 2025-02-19 17:09:11 +00:00
maximilien deleted branch fix_layout_crash 2025-02-19 21:07:42 +00:00
Sign in to join this conversation.
No reviewers
lx
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#937
No description provided.