After removing a node with garage layout remove <node id>, the remaining nodes are still trying to connect to that node #555
After the node has been removed from the layout, it shows up with an unassigned role.
If this removed node is then turned off, the other nodes complain that they can't reach it.
How can this node be completely removed from the cluster?
So it seems there is a layout remove, but no node remove.
Thus, if the node is removed from the layout and then turned off, the other nodes keep complaining that they can't connect to it.
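For reference, this is roughly the sequence of commands involved (a sketch only; the node ID and layout version below are placeholders, not values from a real cluster):

```sh
garage status                          # note the ID of the node to remove
garage layout remove 563e1ac825ee3323  # placeholder ID: drop the node's role from the staged layout
garage layout apply --version 2        # placeholder version: commit the new layout
# After this, the node still appears in `garage status` with no role assigned,
# and once it is powered off the remaining nodes keep logging connection errors.
```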
Following.
I am glad I came here to see this before I pulled a node out. As I am still in the testing phases, rebuilding my cluster is easy, but this would have driven me nuts to not figure out.
As a temporary workaround, one way I found to silence these log messages was:
1. Shut down all the Garage instances.
2. Delete just the cluster_layout file in each node's metadata directory.
3. Start Garage back up.
4. Redo the layout assignment step.
Garage will simply come up, find all the existing metadata and data, but not know about the node you removed (see the command sketch below).
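Spelled out as commands, the workaround looks roughly like this on a systemd-managed node; the service name, metadata path, node ID, zone and capacity below are assumptions, so substitute your own values:

```sh
# Run on every node (service name and metadata path are assumptions
# for a typical install; adjust to your setup):
sudo systemctl stop garage                    # 1. shut down the Garage instance
sudo rm /var/lib/garage/meta/cluster_layout   # 2. delete only the cluster_layout file
sudo systemctl start garage                   # 3. start Garage back up

# Then, from any one node, redo the layout assignment
# (node ID, zone and capacity are placeholders):
garage status
garage layout assign -z dc1 -c 1T 563e1ac825ee3323
garage layout apply --version 1
```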
Is this bug fixed in v0.9?
It is not. This has relatively low priority for me, as it does not impact the usefulness of Garage. However, if you have any specific reasons why I should consider this more important, please say so and we can add it to a specific milestone.
This year we did intensive tests of Garage. We had 6 servers in 6 localities, each with a 6 TB disk.
We moved servers between localities, changed IP addresses, replaced disks, and much more, all on a running cluster. Everything was fine until we met this bug.
The suggested workaround doesn't look good to me.
So thank you very much, you have done a lot of work.
But I will wait for the 1.0 milestone.
Are all of you using Consul or Kubernetes discovery? Ping @jpds @kristof.p @flamingm0e @tradingpost3
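If you are not sure, one quick way to check is to look for the discovery sections in your configuration; the config path below is an assumption, and the section names are those used in the reference configuration:

```sh
# Assumed config path; adjust to wherever your garage.toml actually lives.
grep -E '^\[(consul_discovery|kubernetes_discovery)\]' /etc/garage.toml
```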
Can anyone check if PR #719 fixes this issue?
A development build for commit fa7c7780243e461d9b95eb18d8eff992dca8ae5b should be available from the download page soon, if that helps with testing. If the issue is still present after the patch (which will be published with 0.9.2 / 1.0), please reopen the issue.
lx referenced this issue 2024-03-01 14:14:56 +00:00