fix crash in layout computation when changing all nodes of a zone to gateway mode #937
No reviewers
Labels
No labels
action
check-aws
action
discussion-needed
action
for-external-contributors
action
for-newcomers
action
more-info-needed
action
need-funding
action
triage-required
kind
correctness
kind/experimental
kind
ideas
kind
improvement
kind
performance
kind
testing
kind
usability
kind
wrong-behavior
prio
critical
prio
low
scope
admin-api
scope
background-healing
scope
build
scope
documentation
scope
k8s
scope
layout
scope
metadata
scope
ops
scope
rpc
scope
s3-api
scope
security
scope
telemetry
No milestone
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference: Deuxfleurs/garage#937
Loading…
Add table
Reference in a new issue
No description provided.
Delete branch "baptiste/garage:fix_layout_crash"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This change is probably not a proper fix, somebody with more expertise on
this code should look at it.
Here is how to reproduce the crash:
garage layout assign fea54bcc081f318 -g
garage layout show
will panic with a backtraceFortunately, the crash is only on the RPC client side, not on the Garage
server itself, and
garage layout revert
still works to go back to theprevious state.
As far as I can tell, this bug is present since Garage 0.9.0 which
includes the new layout assignation algorithm:
#296
I think it's fine to merge this as-is
@ -650,8 +650,11 @@ impl LayoutVersion {
let mut cost = CostFunction::new();
for (p, assoc_p) in prev_assign.iter().enumerate() {
for n in assoc_p.iter() {
Maybe add a
// FIXME
comment to indicate that we are not sure that this is the correct solution.Even if this is the "wrong" solution, I don't think it will have a big impact. At worse, too much data will be migrated between nodes, but I don't think it can crash Garage or put it in an invalid state.
@ -754,2 +757,2 @@
old_zones_of_p
.push(zone_to_id[self.expect_get_node_zone(&self.node_id_vec[*n])]);
if let Some(&zone_id) =
zone_to_id.get(self.expect_get_node_zone(&self.node_id_vec[*n]))
This one does not impact correctness of the layout computation, at worse wrong statistics will be shown but I think that's ok to leave for a future issue if it happens.
The issue comes from the fact if node was used in the previous assignment but becomes a gateway node, it still appears in
prev_assign
(with an index >=self.nongateway_nodes().len()
).I think too that the proposed fix is fine to be merged as is. It does not introduce any suboptimality since it's implicit consequence is to ignore gateway node in the last optimization step.
A further fix would be to explicitly exclude gateway node from
prev_assign
andold_assignment_opt
by editing the last loop ofupdate_node_id_vec(...)
. I can find some time to do it.WIP: fix crash in layout computation when changing all nodes of a zone to gateway modeto fix crash in layout computation when changing all nodes of a zone to gateway mode