Garage fails to count to 3? #597

Closed

opened 2023-07-12 20:40:07 +00:00 by withinboredom · 2 comments

withinboredom commented

2023-07-12 20:40:07 +00:00

Contributor

```
Jul 12 20:34:52 capital garage[939]: Could not reach quorum of 2. 1 of 2 request succeeded, others returned errors: ["Netapp error: Not connected: cfa6eef8e72ff56c"]
Jul 12 20:34:57 capital garage[939]: 2023-07-12T20:34:57.603669Z  INFO garage_rpc::system: Doing a bootstrap/discovery step (not_configured: false, no_peers: false, bad_peers: true)
Jul 12 20:34:57 capital garage[939]: 2023-07-12T20:34:57.604309Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: cfa6eef8e72ff56c29ed54e5ab312008bbab09fded391a256adc252cbe10b9e1@[2a01:4f9:6b:5601::2]:3901.
Jul 12 20:34:57 capital garage[939]: This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
Jul 12 20:34:57 capital garage[939]: IO error: Connection refused (os error 111)
Jul 12 20:35:41 capital garage[939]: 2023-07-12T20:35:41.341862Z  INFO garage_block::resync: Resync block cf31ef58792cb3fb: offloading and deleting
Jul 12 20:35:41 capital garage[939]: 2023-07-12T20:35:41.365887Z ERROR garage_block::resync: Error when resyncing cf31ef58792cb3fb: NeedBlockQuery RPC
Jul 12 20:35:41 capital garage[939]: Netapp error: Not connected: cfa6eef8e72ff56c
```

For some reason, garage cannot reach a quorum despite 2 nodes being available in the region and replicas being available outside the region.

```
garage status
==== HEALTHY NODES ====
ID                Hostname      Address                       Tags         Zone  Capacity  DataAvail
fd7e70e6ac3b71fe  capital       65.108.75.198:3901            [capital]    hel   5         405.7 GB (80.8%)
6f9edc9a20c362d0  storage-1-de  [2a01:4f8:1c1b:8d94::1]:3901  [storage-1]  nde   1         52.4 GB (48.9%)
29f3b149599f324e  cantor        [::ffff:65.108.6.254]:3901    [cantor]     hel   5         371.9 GB (74.0%)
7bc582cfe6c98d31  storage-2-de  [2a01:4f8:c2c:3ce8::1]:3901   [storage-2]  nde   1         90.9 GB (84.7%)
b9a421e6ef5a3ee1  storage-0-de  [2a01:4f8:1c1b:c9d0::1]:3901  [storage-0]  nde   1         86.3 GB (80.4%)

==== FAILED NODES ====
ID                Hostname  Address                     Tags     Zone  Capacity  Last seen
cfa6eef8e72ff56c  cameo     [2a01:4f9:6b:5601::2]:3901  [cameo]  hel   10        9 minutes ago
```

Any suggestions on how to fix this? The failed node is basically fubar (see #595), and I'm trying to reach a good state again, so it will be removed completely.


jpds commented

2023-07-13 16:48:44 +00:00

Contributor

> capital 65.108.75.198:3901

Presumably, the issue is that `capital` and `cantor` are on IPv4 and thus have no way to communicate with the other nodes, which have IPv6 addresses registered.

You would have to re-add them to the deployment with their IPv6 addresses.
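
To make the address-family point concrete, here is a minimal sketch that classifies the addresses from the `garage status` output using Python's standard `ipaddress` module. The `NODES` mapping and the helper names `address_family` and `group_by_family` are made up for this illustration and are not part of Garage. The sketch only inspects how each address is written; it says nothing about whether the hosts can actually reach each other.

```python
import ipaddress

# Addresses copied from the `garage status` output in this issue
# (ports and brackets stripped). Hypothetical mapping for illustration.
NODES = {
    "capital": "65.108.75.198",
    "storage-1-de": "2a01:4f8:1c1b:8d94::1",
    "cantor": "::ffff:65.108.6.254",
    "storage-2-de": "2a01:4f8:c2c:3ce8::1",
    "storage-0-de": "2a01:4f8:1c1b:c9d0::1",
    "cameo": "2a01:4f9:6b:5601::2",
}


def address_family(addr: str) -> str:
    """Return 'ipv4', 'ipv4-mapped' (an IPv4 address in IPv6 notation), or 'ipv6'."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return "ipv4"
    if ip.ipv4_mapped is not None:
        return "ipv4-mapped"
    return "ipv6"


def group_by_family(nodes: dict) -> dict:
    """Group node names by the family of their registered address."""
    groups = {}
    for name, addr in nodes.items():
        groups.setdefault(address_family(addr), []).append(name)
    return groups


if __name__ == "__main__":
    for family, names in sorted(group_by_family(NODES).items()):
        print(f"{family}: {', '.join(names)}")
```

Note that `cantor` is registered as `::ffff:65.108.6.254`, which is an IPv4-mapped address, so it falls on the IPv4 side together with `capital`.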

> capital 65.108.75.198:3901 Presumably, this would be an issue with the fact that `capital` and `cantor` are on IPv4 - and thus have no way to communicate with any of the other nodes as they have IPv6 addresses registered. You would have to readd them to the deployment with their IPv6 addresses.

lx commented

2023-07-14 15:55:40 +00:00

Owner

Are you using replication mode 2 or 3? Your logs look like you are using replication mode 2, which would explain why a single unavailable node breaks your cluster. If that's the case, try setting it to `2-dangerous` to restore write capability to your cluster.

Also, if your plan is to eventually remove the failed node from the cluster, you can remove it now from the layout to rebuild copies of all your data; there is no particular reason to wait.
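
The quorum arithmetic behind this can be sketched as follows. This is a simplified model, not Garage's implementation: the replica counts and write quorums in `MODES` are assumptions based on how the replication modes are usually described (mode 2 writes to both copies, `2-dangerous` acknowledges after one, mode 3 needs two of three), and the names `Mode`, `MODES`, and `can_write` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    replicas: int       # number of nodes that hold each piece of data
    write_quorum: int   # acknowledgements needed for a write to succeed


# Assumed values for illustration; check the Garage documentation
# for the authoritative definition of each replication mode.
MODES = {
    "2": Mode(replicas=2, write_quorum=2),
    "2-dangerous": Mode(replicas=2, write_quorum=1),
    "3": Mode(replicas=3, write_quorum=2),
}


def can_write(mode_name: str, unreachable: int) -> bool:
    """True if enough replicas remain reachable to meet the write quorum."""
    mode = MODES[mode_name]
    return mode.replicas - unreachable >= mode.write_quorum


if __name__ == "__main__":
    for name in MODES:
        status = "ok" if can_write(name, unreachable=1) else "blocked"
        print(f"mode {name}: writes with one replica down are {status}")
```

Under these assumptions, mode 2 with one of the two replicas unreachable is exactly the "Could not reach quorum of 2. 1 of 2 request succeeded" situation in the logs, while `2-dangerous` and mode 3 would still accept the write.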