After removing a node with garage layout remove <node id>, the remaining nodes still try to connect to that node

tradingpost3 commented

2023-04-29 08:13:59 +00:00

Apr 29 17:50:07 node1.test.com garage[851]: 2023-04-29T07:50:07.848480Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: 12ea5dacb669c7de1b3ca1d61379e326a189e>
Apr 29 17:50:07 node1.test.com garage[851]: This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
Apr 29 17:50:07 node1.test.com garage[851]: Handshake error: i/o: unexpected end of file

After the node has been removed from the layout, it shows up with an unassigned role.
If the removed node is then turned off, the other nodes complain that they can't reach it.

How can this node be removed from the cluster completely?
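
For anyone triaging the same symptom, here is a minimal sketch (Python, not part of Garage) that pulls the remote node IDs out of log lines like the ones above, so they can be compared against the node that was removed. The regex and the function name `failing_node_ids` are assumptions based only on the message format shown in this report. IDs that the pager has cut off (the trailing `>`) stay cut off.

```python
import re
from collections import Counter

# Matches the error message format shown in the log excerpt above.
# Assumption: node IDs are printed as lowercase hex strings.
_RPC_ERROR = re.compile(
    r"Error establishing RPC connection to remote node: ([0-9a-f]+)"
)


def failing_node_ids(log_lines):
    """Count how often each remote node ID appears in RPC connection errors."""
    counts = Counter()
    for line in log_lines:
        match = _RPC_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of a saved log file (for example `failing_node_ids(open("garage.log"))`, where `garage.log` is a hypothetical file name) gives one counter entry per unreachable node ID.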


tradingpost3 commented

2023-05-07 14:54:57 +00:00

Author

So it seems there is a layout remove, but no node remove.
As a result, if a node is removed and turned off, the other nodes keep complaining that they can't connect to it.
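
For reference, the layout-side removal discussed here can be sketched roughly as follows. This is an illustration in Python, not a Garage tool: the helper names `layout_removal_commands` and `run_all` are hypothetical, and `new_version` is assumed to be the next layout version that `garage layout show` reports. It covers only the layout change; as described above, it does not make the other nodes forget the removed peer.

```python
import subprocess


def layout_removal_commands(node_id, new_version):
    """Build the CLI calls that stage, review and apply a node removal."""
    return [
        ["garage", "layout", "remove", node_id],  # stage the removal
        ["garage", "layout", "show"],             # review the staged changes
        ["garage", "layout", "apply", "--version", str(new_version)],
    ]


def run_all(commands, dry_run=True):
    """Print each command; run it as well unless dry_run is set."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling `run_all(layout_removal_commands("<node id>", 2))` with the default `dry_run=True` only prints the commands, which is a safe way to check them before running anything against a live cluster.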


lx added the kind wrong-behavior label 2023-05-09 08:57:11 +00:00

flamingm0e commented

2023-05-13 05:04:23 +00:00

Following.

I am glad I came here and saw this before I pulled a node out. As I am still in the testing phase, rebuilding my cluster is easy, but not being able to figure this out would have driven me nuts.


jpds commented

2023-05-14 04:55:09 +00:00

Contributor

As a temporary workaround, one way I found to silence these log messages was:

1. Shut down all the Garage instances
2. Delete **just** the `cluster_layout` file in the metadata directory of each of your nodes
3. Start Garage back up
4. Redo the layout assignment step

Garage will simply come up, find all the existing metadata and data, but not know of the node you removed.
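
A minimal sketch of step 2 only, assuming Garage is already stopped on every node and that `metadata_dir` is whatever path your configuration uses. The helper name `set_aside_cluster_layout` is hypothetical, and renaming the file instead of deleting it is a variation on the step above, chosen so the file can be put back if something goes wrong.

```python
from pathlib import Path


def set_aside_cluster_layout(metadata_dir, suffix=".bak"):
    """Rename <metadata_dir>/cluster_layout and touch nothing else.

    Returns the backup path, or None if there was no cluster_layout file.
    """
    layout = Path(metadata_dir) / "cluster_layout"
    if not layout.is_file():
        return None
    backup = layout.with_name(layout.name + suffix)
    if backup.exists():
        # Never clobber an earlier backup silently.
        raise FileExistsError(f"refusing to overwrite {backup}")
    layout.rename(backup)
    return backup
```

Run it once per node against that node's metadata directory, then continue with steps 3 and 4. Move the backup elsewhere afterwards if you prefer a clean metadata directory.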


👍 1

kristof.p commented

2023-10-12 05:32:19 +00:00

Is this bug fixed in v0.9?

lx commented

2023-10-12 11:07:01 +00:00

Owner

> Is this bug fixed in v0.9?

It is not. This has relatively low priority for me as it does not impact the usefulness of Garage. However, if you have any specific reasons why I should consider this more important, please say so and we can add it to a specific milestone.


lx added this to the v1.0 milestone 2023-10-16 09:48:21 +00:00

kristof.p commented

2023-10-16 13:20:34 +00:00

This year we tested Garage intensively. We had 6 servers in 6 locations, each with a 6 TB disk.
We moved servers between locations, changed IP addresses, replaced disks, and much more, all on a running cluster. Everything was fine until we hit this bug.
The suggested workaround doesn't look good to me.
So thank you very much, you have done a lot of work.
But I will wait for the 1.0 milestone.


lx commented

2024-02-16 09:43:20 +00:00

Owner

Are all of you guys using Consul or Kubernetes discovery? Ping @jpds @kristof.p @flamingm0e @tradingpost3

lx referenced this issue from a pull request that will close it, 2024-02-16 10:06:35 +00:00: Filter nodes Garage tries to connect to #719

lx commented

2024-02-16 10:33:17 +00:00

Owner

Can anyone check if PR #719 fixes this issue?

A development build for commit `fa7c7780243e461d9b95eb18d8eff992dca8ae5b` should be available from the download page soon if that helps testing.


lx closed this issue

2024-02-20 10:37:12 +00:00

lx commented

2024-02-20 10:37:47 +00:00

Owner

If the issue is still here after the patch (that will be published with 0.9.2 / 1.0), please reopen the issue.

lx referenced this issue 2024-03-01 14:14:56 +00:00: Bump version to v0.9.2 #747

After removing a node with garage layout remove <node id>, the remaining nodes still try to connect to that node #555