Cluster does not come up when Discovery fails #808
Reference: Deuxfleurs/garage#808
I have had a K3s cluster running v1.24.13+k3s1 for close to a year, with three Garage nodes as part of it. It worked for quite some time, but then something apparently broke discovery (I assume that IP addresses changed for some reason). I then saw repeated messages like this in the logs of all three pods:
ERROR garage_rpc::system: Error while publishing node to Kubernetes: HyperError: error trying to connect: invalid peer certificate encoding
When I restart the StatefulSet, the Garage cluster does not come up. Apparently it can't discover the other nodes, and I am seeing something like this:
I can then manually reconnect all nodes, but I have to do that on EACH of my three nodes to get the cluster working again. However, the "HyperError" message persists, and I can only assume that the cluster will again fail to come up the next time the StatefulSet restarts.
So it would be nice to know how to debug the error and fix discovery. Connecting to the K8s API server works fine, the TLS handshake and certificate (ECDSA signature) look normal.
Garage version is 0.8.2 and I installed it using the Helm chart 0.4.1.
Would you mind copying here the public part of the API server certificate, possibly with some fields redacted? Something like the output of:
openssl x509 -in <cert_file_path> -noout -text
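If the certificate file is not at hand, the certificate that the API server actually presents can also be dumped over the network from inside a pod. The following is only a minimal sketch: it assumes `openssl` is available in the pod and that the standard in-cluster variables `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` are set. The function names `api_server_endpoint` and `dump_api_server_cert` are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the certificate presented by the Kubernetes API server.
# Assumption: run inside a pod where openssl is installed.

# Build "host:port" from the standard in-cluster environment variables,
# falling back to the default service DNS name and port 443.
api_server_endpoint() {
    printf '%s:%s\n' \
        "${KUBERNETES_SERVICE_HOST:-kubernetes.default.svc}" \
        "${KUBERNETES_SERVICE_PORT:-443}"
}

# Open a TLS connection, take the leaf certificate the server sends,
# and print it in the same text form as `openssl x509 -noout -text`.
dump_api_server_cert() {
    openssl s_client -connect "$(api_server_endpoint)" </dev/null 2>/dev/null \
        | openssl x509 -noout -text
}
```

After sourcing the snippet, running `dump_api_server_cert` prints the certificate details; redact anything sensitive before posting the output.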
Garage 0.8.2 is using rustls 0.19.1, which is already supposed to support ECDSA certs and P-256 curves... I would still suggest that you either try switching your cluster to a more classic RSA 4096 API server cert or update Garage to at least the 0.9.x release. 0.8.2 is VERY old at this point and we don't support it.