Panic during application of new cluster layout in 0.8.0-rc1 #414
Seeing a panic when trying to add a new instance to our cluster via the admin REST interface. The instance that the POST /v0/layout/apply is being performed against is the same one being added to the cluster. The logs on that instance look like:

Even on another host, when running garage layout show, we get the same panic:

This is all on garage 0.8.0-rc1 (2197753d), all linux/amd64.

I can attempt to retrieve the exact request being made to the instance, but it will be a bit of effort; hoping the bug/solution will be evident just from this 🙏 but please let me know if I need to provide more info.
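For context, the admin API flow described above looks roughly like the following. This is an editorial sketch only: apart from POST /v0/layout/apply, which is named in this report, the endpoint path, token, node ID, and request-body fields are assumptions rather than a verified reference for Garage's admin API.

```rust
// Hypothetical reconstruction of the admin API calls described above.
// Except for POST /v0/layout/apply, the paths, body shapes, and identifiers
// here are assumptions, not verified Garage API documentation.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let admin = "http://127.0.0.1:3903"; // assumed admin API bind address
    let token = "example-admin-token";   // assumed admin token from the config
    let client = reqwest::blocking::Client::new();

    // Stage a role change for the new node (node ID, zone, and capacity are placeholders).
    client
        .post(format!("{admin}/v0/layout"))
        .bearer_auth(token)
        .json(&json!({
            "0123456789abcdef": { "zone": "zone1", "capacity": 1, "tags": [] }
        }))
        .send()?
        .error_for_status()?;

    // Apply the staged layout; this is the request that triggered the panic above.
    client
        .post(format!("{admin}/v0/layout/apply"))
        .bearer_auth(token)
        .json(&json!({ "version": 2 }))
        .send()?
        .error_for_status()?;

    Ok(())
}
```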
The error seems to be in the layout algorithm; more specifically, the algorithm detects that an invariant has been violated and crashes instead of building a buggy layout. So it (probably) means that 1) you have found a bug and 2) we caught it early with this assertion.
@mediocregopher: Thanks for reporting this bug. I have 3 questions/requests for you that could help us solve it:
@lx: officially, our current algorithm should make a best-effort attempt to spread data over different zones when replication_factor is set to 3 but only 2 zones are available. Would it be possible that we still have a regression/bug in this logic?

Thanks @quentin, glad the assertion caught the bug :)
Yes, even with a third instance in a third zone the bug still happens. It doesn't seem to matter if I'm applying the layout change on the new instance or an existing instance either.
Attached is the TOML file for the instance being added. The other config files differ only in their storage locations and bind addresses.
Yes, it happens every time. I added some logging to capture the exact HTTP requests that are occurring:
** REQUEST 1 **
** REQUEST 2 **
** REQUEST 3 **
At this point the request never returns, and this shows up in the logs:
Note: the instance that those Not connected errors are for, 7d4aafc54fdae0f8, is neither the instance being added nor the instance the REST requests are being performed against. I don't know if they have anything to do with this.

@mediocregopher thanks for the detailed report.
Could you check if the bug still happens:
- if you set all capacity values to 100 instead of 1?
- if you delete the previous layout, i.e. the cluster_layout file in the metadata directory, on all your nodes? (stop all nodes, delete cluster_layout everywhere, restart nodes, retry creating the layout you tried)

@lx
The panic did not occur after setting capacity values to 100, rather than 1. Are capacity values on the admin API not units of 100GB? I assumed they would behave the same as on the CLI, but the docs don't explicitly say.
After deleting the cluster_layout on all instances the panic still did not occur. I suppose that indicates that the old cluster_layout had gotten itself into some kind of weird state?
Thanks for testing, @mediocregopher.
This issue is indeed a bug in the layout computation code. The current version of that code is at end of life, soon to be replaced by #296 (planned for v0.9), so I don't think we should spend too much time debugging this. There are two courses of action we could take from here:
- Do nothing and just tell people not to use too-small capacity values
- Add a hack that detects when the capacity values are too small and multiplies them all by 10 or 100 when the layout is computed

What do you think?
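For concreteness, option (2) could look something like the sketch below. This is an editorial illustration with hypothetical names, not Garage's actual layout code; the factor of 100 is just the multiplier mentioned above.

```rust
// Editorial sketch of option (2): scale up "too small" capacity values before
// the layout is computed. Names are hypothetical; this is not Garage's code.
fn scale_small_capacities(capacities: &mut [u64]) {
    // Capacities only matter relative to each other, so multiplying every value
    // by the same factor preserves the intended data distribution while giving
    // the layout algorithm larger integers to work with.
    if capacities.iter().any(|&c| c > 0 && c < 100) {
        for c in capacities.iter_mut() {
            *c *= 100;
        }
    }
}

fn main() {
    let mut caps = vec![1, 1, 2];
    scale_small_capacities(&mut caps);
    assert_eq!(caps, vec![100, 100, 200]);
}
```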
Can you clarify what is meant by "too small capacity values"? I was under the impression that a capacity value of 1 corresponded to 100 GB, as per the Deployment on a cluster cookbook.

Are capacity values actually unit-less and only used to determine the proportion of data each node will replicate? If so then (2) sounds OK, if I understand your suggestion right: within the computation code you would multiply the capacity of each instance by 100, and since the capacities only have meaning relative to each other it wouldn't change anything.

It sounds like I can remove the "divide capacity by 100" operation in my codebase, since I was assuming capacity is a multiple of 100 GB.

Or perhaps I'm completely misunderstanding :)
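As a toy check of the "only meaningful relative to each other" reasoning above (an editorial example, not Garage code): if capacities are pure relative weights, scaling them all by the same factor leaves every node's share unchanged.

```rust
// Toy illustration (not Garage code): when capacities are relative weights,
// a node's share is capacity / total, so a uniform scale factor cancels out.
fn shares(capacities: &[u64]) -> Vec<f64> {
    let total: u64 = capacities.iter().sum();
    capacities
        .iter()
        .map(|&c| c as f64 / total as f64)
        .collect()
}

fn main() {
    // [1, 1, 2] and [100, 100, 200] describe the same 25% / 25% / 50% split.
    assert_eq!(shares(&[1, 1, 2]), shares(&[100, 100, 200]));
}
```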
@quentin @lx we're still seeing the panic even when using larger capacity values:
Thanks for this new report @mediocregopher. Since the "solution" of multiplying everything by 100 doesn't work, I won't try fixing this for v0.8.0, which is long overdue and which I'm going to try and release today. This will likely be fixed once the new algorithm in #296 is merged (I can't give a precise timeframe, but we're making good progress!). In the meantime, I think your options are:
- starting from an empty cluster layout by deleting the cluster_layout file
- tweaking the capacity values to see if there are some values for which the algorithm doesn't panic

I'm also quite intrigued by the fact that you are, AFAIK, the only person who has stumbled on this issue; please tell me if you hear of anyone else having the same problem.
@lx I'm also surprised no one else has run into this, though perhaps not so many people are using multiple instances in the same zone with the same capacity.
Can you give any hints as to what capacity values might result in this panic? Also, do you think 0.7.0 would have this same issue? We had tried it previously but had to reset our cluster for other reasons. I was hoping to be able to start from scratch with 0.8.0, but if 0.7.0 would work fine then we might just start there.
I might have figured out the cause of this issue; if I make a branch with a patch, will you try it out?
(I think this part of the code hasn't moved since v0.7.0, so you would probably also have the panic.)
Yes absolutely!
Here is the proposed fix: #429
Sorry I don't have time to make a full build today, tell me if you need one and I'll do it ASAP.
np, I can build it. Will get back asap
@lx This seems to work better! Neither of the configurations that caused panics previously is doing so now. One piece of oddness I'm seeing, though, is that the partition layout is a bit wacky:

I'd expect the partition counts of c15e3709cacb1ae2, fa4caaa1121129f4, and 9bee8edce347bad3 to be roughly the same, but that doesn't seem to be so.

Maybe this isn't a real issue, or not something you'd want to fix for 0.8.0; just letting you know.
In any case, no more panics!
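For reference, here is a back-of-the-envelope way to read those partition counts. This is an editorial sketch that assumes Garage's default of 256 partitions and a replication factor of 3, and it ignores zone constraints; the capacities used are placeholders, not the actual values from this cluster.

```rust
// Editorial sketch: naive expectation for the partition counts shown by
// `garage layout show`, ignoring zone constraints. Assumes 256 partitions
// (Garage's default) replicated 3 times, i.e. 768 partition-replicas spread
// in proportion to capacity. Capacities below are placeholders.
fn expected_partition_counts(capacities: &[u64], replication: u64, partitions: u64) -> Vec<f64> {
    let total: u64 = capacities.iter().sum();
    let replicas = (partitions * replication) as f64;
    capacities
        .iter()
        .map(|&c| replicas * c as f64 / total as f64)
        .collect()
}

fn main() {
    // Four equal-capacity nodes would each ideally hold 768 / 4 = 192
    // partition-replicas, which is why markedly different counts for
    // equal-capacity nodes look surprising.
    println!("{:?}", expected_partition_counts(&[100, 100, 100, 100], 3, 256));
}
```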
Great news!
I think that's because the algorithm is fundamentally broken: it's unable to properly optimize in a global way in some cases. I don't think this is something I can fix on its own; #296 will hopefully solve this by redoing everything.
Thanks for your help working through this issue :)
Yeah fair enough, thank you!