Changing IP address of a node leads to a half-connected and broken cluster #652

Closed
opened 2023-10-15 10:46:14 +00:00 by baptiste · 2 comments
Owner

I had to change the IP address of a node, so I changed both `rpc_bind_addr` and `rpc_public_addr` for this node. There's no NAT.

Old config of node A:

```
rpc_bind_addr = "192.168.0.126:3901"
rpc_public_addr = "192.168.0.126:3901"
```

New config of node A:

```
rpc_bind_addr = "192.168.0.93:3901"
rpc_public_addr = "192.168.0.93:3901"
```

After restarting node A, here is the status on node A, which says it's correctly connected again to node B:

```
==== HEALTHY NODES ====
ID                Hostname     Address              Tags  Zone     Capacity  DataAvail
6ea290fbe1cbf9d9  nodeA        192.168.0.93:3901    []    zoneA    100       36.3 GB (34.6%)
8c8a4ab1878f5f80  ?            192.168.0.173:3901   []    zoneB    600       ?
```

But on node B, it says that node A is still disconnected:

```
==== HEALTHY NODES ====
ID                Hostname   Address              Tags  Zone     Capacity  DataAvail
8c8a4ab1878f5f80  nodeB      192.168.0.173:3901   []    zoneB    600       552.3 GB (87.2%)

==== FAILED NODES ====
ID                Hostname   Address              Tags  Zone     Capacity  Last seen
6ea290fbe1cbf9d9  nodeA      192.168.0.126:3901   []    zoneA    100       1 week ago
```

Note how node B still has the previous IP address of node A.

When I look at the logs of node B, it even accepts the connection from node A:

```
INFO netapp::netapp: Incoming connection from 192.168.0.93:32982, negotiating handshake...
INFO netapp::netapp: Accepted connection from 6ea290fbe1cbf9d9 at 192.168.0.93:32982
```

But this is never reflected in the status of node B.

This issue is not transient: I waited maybe 20 minutes and nothing changed. It also prevents node B from reaching a quorum when it receives queries.

This is using Garage 0.8.4 on Debian.
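For reference, a peer can also be told about a node's new address by hand. A minimal sketch, assuming the `garage node connect` subcommand applies here; `<full_node_id>` is a placeholder for node A's full node identifier, as printed by `garage node id` on node A:

```
# Hypothetical manual reconnection, run on node B:
# point it at node A's new rpc_public_addr.
garage node connect <full_node_id>@192.168.0.93:3901
```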

Owner

I think I already had this issue, and it is generally fixed by restarting the garage daemon on other nodes.

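A minimal sketch of that workaround, assuming Garage runs under systemd with a unit named `garage` (adjust to your setup):

```
# On each of the other nodes (e.g. node B):
sudo systemctl restart garage

# Then check that the peer list now shows node A's new address:
garage status
```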
lx added the
Bug
label 2023-10-16 09:35:19 +00:00
lx added this to the v1.0 milestone 2023-10-16 09:35:25 +00:00
lx closed this issue 2024-02-19 17:37:02 +00:00
Owner

PR #724 probably fixes the issue; it will be published with v0.9.2 / v1.0. If the issue is still there, please reopen it.

Reference: Deuxfleurs/garage#652