Cluster does not come up when Discovery fails #808
I have had a K3s cluster running v1.24.13+k3s1 for close to a year, with three Garage nodes as part of it. It worked for quite some time, but then something apparently broke discovery (I assume that IP addresses changed for some reason). I then saw messages like this repeated in all three pods' logs:
ERROR garage_rpc::system: Error while publishing node to Kubernetes: HyperError: error trying to connect: invalid peer certificate encoding
When I restart the StatefulSet, the Garage cluster does not come up. Apparently it can't discover the other nodes, and I see something like this:
I can then manually reconnect all nodes, and I have to do that on EACH of my three nodes to get the cluster working again. However, the "HyperError" message still persists, and I can only assume that the next time the StatefulSet restarts, the cluster will not come up again.
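For reference, the manual reconnection looks roughly like this on each node (the pod name, node ID and address below are placeholders, not my real values):

# Inside each Garage pod, e.g. kubectl exec -it garage-0 -- sh
garage status                                  # shows this node's ID and the peers it currently knows about
garage node connect <node_id>@<pod_ip>:3901    # repeat for each of the other two nodes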
So it would be nice to know how to debug the error and fix discovery. Connecting to the K8s API server works fine; the TLS handshake and certificate (ECDSA signature) look normal.
The Garage version is 0.8.2, installed using version 0.4.1 of the Helm chart.
Would you mind copying here the public part of the API server certificate, possibly with some fields redacted? Something like the output of:
openssl x509 -in <cert_file_path> -noout -text
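If the certificate file is not directly at hand inside a pod, dumping the API server's serving certificate over the wire should work just as well (assuming kubernetes.default.svc resolves from within the cluster):

openssl s_client -connect kubernetes.default.svc:443 </dev/null 2>/dev/null | openssl x509 -noout -text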
Garage 0.8.2 uses rustls 0.19.1, which is already supposed to support ECDSA certs and P-256 curves... I would still suggest that you either switch your cluster to a more classic RSA 4096 API server cert or update Garage to at least a 0.9.x release; 0.8.2 is VERY old at this point and we no longer support it.
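If you go the upgrade route and installed via the Helm chart, something along these lines should do it (release name and namespace are placeholders, and the image.tag value key is an assumption about the chart; check its values.yaml):

helm upgrade <release> <chart> -n <namespace> --set image.tag=v0.9.4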
After upgrading to 0.9.4, the problem seems to have disappeared. All three pods come up; only the third pod shows an error message (after apparently connecting to the other two):
But it is possible to read from and write to the cluster.