After removing a node with `garage layout remove <node id>`, the remaining nodes still try to connect to that node #555

Closed
opened 2023-04-29 08:13:59 +00:00 by tradingpost3 · 9 comments
```
Apr 29 17:50:07 node1.test.com garage[851]: 2023-04-29T07:50:07.848480Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: 12ea5dacb669c7de1b3ca1d61379e326a189e>
Apr 29 17:50:07 node1.test.com garage[851]: This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
Apr 29 17:50:07 node1.test.com garage[851]: Handshake error: i/o: unexpected end of file
```

After the node has been removed from the layout, it shows up with an unassigned role.
If the removed node is then turned off, the other nodes complain that they can't reach it.

How do I completely remove this node from the cluster?
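For reference, the commands involved look roughly like this (a sketch; `<node_id>` and the layout version `<n>` are placeholders, and exact flags may vary between Garage versions):

```
# Sketch of the sequence that produces this state; <node_id> and <n>
# are placeholders.
garage status                       # note the ID of the node to remove
garage layout remove <node_id>      # stage the removal in a new layout
garage layout apply --version <n>   # apply the staged layout

# The removed node now shows an unassigned role, and once it is powered
# off, the remaining nodes keep logging the RPC errors above.
```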

Author

So it seems there's `layout remove`, but no node remove.
Thus, if this node is removed and turned off, the other nodes keep complaining that they can't connect to it.

lx added the kind/wrong-behavior label 2023-05-09 08:57:11 +00:00

Following.

I am glad I came here to see this before I pulled a node out. As I am still in the testing phase, rebuilding my cluster is easy, but not being able to figure this out would have driven me nuts.

Contributor

As a temporary workaround, one way I found to silence these log messages was:

  1. Shut down all the Garage instances

  2. Delete **just** the `cluster_layout` file in each of your nodes' metadata directories

  3. Start Garage back up

  4. Redo the layout assignment step

Garage will simply come up, find all the existing metadata and data, but not know about the node you removed.
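A minimal shell sketch of those steps, assuming a systemd unit named `garage` and `metadata_dir = /var/lib/garage/meta` in the config (both are assumptions, adapt them to your deployment):

```
# Steps 1-3 on every node; step 4 on any one node.

# 1. Stop Garage
sudo systemctl stop garage

# 2. Delete just the cluster_layout file, nothing else in the metadata dir
sudo rm /var/lib/garage/meta/cluster_layout

# 3. Start Garage back up
sudo systemctl start garage

# 4. Redo the layout assignment for the remaining nodes and apply it
#    (zone, capacity and layout version here are example values; the
#    capacity syntax depends on the Garage version)
garage layout assign -z dc1 -c 1T <node_id>
garage layout apply --version <n>
```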


Is this bug fixed in v0.9?

Owner

> Is this bug fixed in v0.9?

It is not. This has relatively low priority for me, as it does not impact the usefulness of Garage. However, if you have any specific reasons why I should consider this more important, please say so and we can add it to a specific milestone.

lx added this to the v1.0 milestone 2023-10-16 09:48:21 +00:00

This year we did intensive tests of Garage. We had 6 servers in 6 localities, each with a 6 TB disk.
We moved servers between localities, changed IP addresses, replaced disks and much more while the cluster was running. Everything was fine until we hit this bug.
The suggested workaround doesn't look good to me.
So thank you very much, you have done a lot of work.
But I will wait for the 1.0 milestone.

Owner

Are all of you guys using Consul or Kubernetes discovery? Ping @jpds @kristof.p @flamingm0e @tradingpost3

Owner

Can anyone check if PR #719 fixes this issue?

A development build for commit `fa7c7780243e461d9b95eb18d8eff992dca8ae5b` should be available from the download page soon, if that helps with testing.

lx closed this issue 2024-02-20 10:37:12 +00:00
Owner

If the issue is still present after the patch (which will be published with 0.9.2 / 1.0), please reopen it.
