New metric listing Garage peers #545
Reference: Deuxfleurs/garage#545
To monitor the health of a Garage cluster, we would like to know when a node is disconnected from other nodes.
It would be nice to have a new metric with this information. It could look like this for each peer (beware, I have no idea about best practices for naming metrics):
The value would be 0 (unreachable/disconnected) or 1 (connected). While at it, we can include all information shown in `garage status`.
I'm hoping to add this once I have time to finish integrating the OpenTelemetry version bump; however, it'll probably be in the form:
From the aggregate of this across all nodes, you'll be able to spot which one is disconnected as its value plummets and set an alert on `configured != connected`.
Adding too many labels (for instance, `capacity` is already covered by `garage_local_disk_total`) is advised against because of cardinality, see:
Thanks, you're right about cardinality. I believe `id`, `hostname` and `zone` are fine as labels because they are mostly constant, and could be used to e.g. aggregate by zone.
Another alternative to the 0/1 metric could be a "last seen" metric, where 0 means "currently connected" and any positive value means "disconnected since X seconds":
This would allow alerting rules such as "send an alert if last seen is above 5 minutes, but stop alerting if it's over 7 days (e.g. because the remote node is really dead)"
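The alerting window described above can be sketched as a small predicate. This is only an illustration of the rule's logic in Python; the function and constant names are hypothetical, and the PromQL in the comment is an assumed equivalent rather than a rule taken from Garage's documentation.

```python
FIVE_MINUTES = 5 * 60            # lower bound of the alerting window, in seconds
SEVEN_DAYS = 7 * 24 * 60 * 60    # upper bound: beyond this the peer is assumed dead


def should_alert(last_seen_seconds: float) -> bool:
    """Return True while a peer has been gone for more than 5 minutes
    but not more than 7 days.

    A value of 0 means "currently connected".
    An assumed PromQL equivalent would be:
      garage_peer_last_seen_seconds > 300
        and garage_peer_last_seen_seconds <= 604800
    """
    return FIVE_MINUTES < last_seen_seconds <= SEVEN_DAYS


if __name__ == "__main__":
    for value in (0, 60, 600, 8 * 24 * 60 * 60):
        print(value, should_alert(value))
```

A 0/1 gauge cannot express the upper bound on its own, because it does not record how long the peer has been unreachable.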
With your proposed configured_total and connected_total metrics, would you be able to send an alert saying "node M is disconnected from the point of view of node N"?
It seems you will only be able to say "N is missing a peer", but without being able to know who the missing peer actually is. If M has crashed hard, you won't get any metric from it.
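To make the difference concrete, here is a minimal Python sketch comparing what can be derived from aggregate counts with what per-peer samples allow. The data shapes and function names are hypothetical, chosen only for illustration; node names `N1`, `N2` and `M` are placeholders.

```python
from typing import Dict, List, Tuple


def nodes_missing_peers(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    # counts maps a reporting node to (configured, connected).
    # Only tells us WHICH REPORTER is missing a peer, not which peer.
    return sorted(
        node
        for node, (configured, connected) in counts.items()
        if configured != connected
    )


def disconnected_pairs(views: Dict[str, Dict[str, bool]]) -> List[Tuple[str, str]]:
    # views maps a reporting node to {peer: is_connected}.
    # Yields (observer, peer) pairs, so the missing peer is named explicitly.
    return sorted(
        (observer, peer)
        for observer, peers in views.items()
        for peer, connected in peers.items()
        if not connected
    )


if __name__ == "__main__":
    # M has crashed hard, so it reports nothing itself.
    counts = {"N1": (2, 1), "N2": (2, 1)}
    views = {
        "N1": {"N2": True, "M": False},
        "N2": {"N1": True, "M": False},
    }
    print(nodes_missing_peers(counts))
    print(disconnected_pairs(views))
```

With counts alone, identifying `M` requires knowing the full node list from elsewhere and noticing which node stopped reporting.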
`zone` might be tricky, as I already add that in with Prometheus's `external_labels`:
However, I'm able to do this as I run Thanos on Garage, and run a separate Prometheus instance within each Garage zone, in line with best practice:
I'll experiment with seeing if Prometheus simply merges them.
And the `garage_peer_last_seen_seconds` idea is a great one.
Yes, but at that point you want to alert on `up == 0`, using the built-in Prometheus metric.
I think I will add at least a basic version of this for v1.0.
lx referenced this issue 2024-02-20 10:40:30 +00:00
lx referenced this issue 2024-03-01 14:14:56 +00:00