Cluster does not come up when Discovery fails #808

Open
opened 2024-04-16 16:25:19 +00:00 by ulim · 4 comments

I have had a K3s cluster running v1.24.13+k3s1 for close to a year, with three Garage nodes as part of it. It worked for quite some time, but then something apparently broke discovery (I assume IP addresses changed for some reason). I then saw repeated messages like this in all three pods' logs:

```
ERROR garage_rpc::system: Error while publishing node to Kubernetes: HyperError: error trying to connect: invalid peer certificate encoding
```

When I restart the StatefulSet, the Garage cluster does not come up. Apparently it can't discover the other nodes, and I see something like this:

```
==== HEALTHY NODES ====
ID                Hostname  Address           Tags  Zone  Capacity  DataAvail
6c81bd7650d73850  garage-2  10.42.8.210:3901  []    dc1   1         15.9 GB (99.1%)

==== FAILED NODES ====
ID                Hostname  Address                    Tags  Zone  Capacity  Last seen
5b46b160c569e307  garage-1  [::ffff:10.42.7.104]:3901  []    dc1   1         never seen
f67a0c86684b9534  garage-0  [::ffff:10.42.6.215]:3901  []    dc1   1         never seen
```

I can then manually reconnect all nodes, but I have to do that on EACH of my three nodes to get the cluster working again. However, the `HyperError` message still persists, so I can only assume that the next time the StatefulSet restarts, the cluster will not come up again.
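For reference, the manual reconnection looks roughly like this on my side (`<node_id>` and `<pod_ip>` are placeholders taken from the `garage status` output of the other pods):

```
# run inside each garage pod (e.g. via `kubectl exec`), once per missing peer:
garage status                                # lists healthy and failed nodes
garage node connect <node_id>@<pod_ip>:3901  # <node_id>, <pod_ip>: placeholders
```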

So it would be nice to know how to debug the error and fix discovery. Connecting to the K8s API server works fine, the TLS handshake and certificate (ECDSA signature) look normal.
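To rule out the curve itself being the problem, a comparable certificate can be generated locally and inspected the same way (just a sketch; the paths and subject below are made up, on the cluster you would point `openssl x509` at the real CA file instead):

```shell
# Generate a throwaway P-256 CA certificate shaped like the k3s-server-ca one
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
  -keyout /tmp/test-ca.key -out /tmp/test-ca.crt -nodes \
  -subj "/CN=test-server-ca" -days 1 2>/dev/null

# Inspect it the same way as the real API server cert
openssl x509 -in /tmp/test-ca.crt -noout -text | grep -E 'Signature Algorithm|NIST CURVE'
```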

Garage version is 0.8.2 and I installed it using the Helm chart 0.4.1.
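For completeness, the Kubernetes discovery settings in my garage.toml are whatever the Helm chart generates; I believe the relevant section looks roughly like this (namespace and service name shown with placeholder values):

```toml
[kubernetes_discovery]
namespace = "garage"        # placeholder: the namespace the chart deploys into
service_name = "garage"     # placeholder: the headless service used for discovery
skip_crd = false
```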

Owner

Would you mind copying here the public part of the API server certificate, possibly with some fields redacted? Something like the output of `openssl x509 -in <cert_file_path> -noout -text`.

Author
```
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
        Signature Algorithm: ecdsa-with-SHA256
        Issuer: CN = k3s-server-ca@1683825034
        Validity
            Not Before: May 11 17:10:34 2023 GMT
            Not After : May  8 17:10:34 2033 GMT
        Subject: CN = k3s-server-ca@1683825034
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:b9:99:1f:98:e6:a2:a0:5c:e4:36:1e:28:3b:26:
                    6d:d2:90:7c:2f:a5:a4:ef:d9:fd:b4:23:18:16:92:
                    36:4f:d8:a3:b8:8b:02:ca:93:0c:e3:d2:c4:8b:a4:
                    bf:e8:ae:6b:68:84:b0:53:b3:ab:05:39:32:f0:35:
                    5d:5c:f1:45:35
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Subject Key Identifier: 
                15:9A:89:DB:D7:8A:F8:21:DB:24:67:29:82:81:D9:3A:92:45:75:83
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:21:00:b2:0a:06:7e:9f:26:26:4c:c4:7b:38:08:b7:
         f9:58:76:6d:02:e4:51:8b:b0:f3:e3:fa:0a:75:62:0b:9a:85:
         59:02:20:2a:2d:b1:7f:ff:12:f6:17:b4:ff:31:a3:da:cf:c7:
         ee:bb:ec:cc:ea:7b:63:72:0f:8d:04:91:49:27:3c:6b:de
```
Owner

Garage 0.8.2 uses [rustls 0.19.1](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/tag/v0.8.2/Cargo.lock#L1630), which is [already supposed to support ECDSA certs and P-256 curves](https://docs.rs/crate/rustls/0.19.1)... I would still suggest that you either switch your cluster to a more classic RSA 4096 API server cert, or update Garage to at least a 0.9.x release; 0.8.2 is VERY old at this point and we don't support it.

Author

After upgrading to 0.9.4 the problem seems to have disappeared. All three pods come up; only the third pod logs an error (after apparently connecting to the other two):

```
2024-06-07T15:59:18.843859Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: f67a0c86684b95346e0845e7a21a7cc8e9580175049d2e3391c5e72b70486fd1@10.42.6.216:3901.
This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
IO error: Host is unreachable (os error 113)
2024-06-07T15:59:18.885473Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: 5b46b160c569e30712497e715f91e2c7a667758f1cf69d63e80d2f5fe95506fd@10.42.7.105:3901.
This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
IO error: Host is unreachable (os error 113)
```

But it is possible to read from and write to the cluster.
