New metric listing Garage peers #545

Closed
opened 2023-04-19 21:35:34 +00:00 by baptiste · 6 comments
Owner

To monitor the health of a Garage cluster, we would like to know when a node is disconnected from other nodes.

It would be nice to have a new metric with this information. It could look like this for each peer (beware, I have no idea about best practices for naming metrics):

garage_peer_status{id="8c8a4ab1878f5f80", hostname="foo", address="10.X.Y.Z:PPP", zone="earth", capacity="100"} 1

The value would be 0 (unreachable/disconnected) or 1 (connected). While at it, we can include all information shown in garage status.

Contributor

I'm hoping to add this once I have time to finish integrating the OpenTelemetry version bump; however, it'll probably be in the form:

garage_peers_configured_total{id="8c8a4ab1878f5f80"} 5
garage_peers_connected_total{id="8c8a4ab1878f5f80"} 4

From the aggregate of this across all nodes, you'll be able to spot which one is disconnected as its value plummets and set an alert on configured != connected.
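
As a rough illustration of the configured != connected comparison described above, here is a minimal Python sketch. It assumes the two metric names proposed in this comment (they are proposals in this thread, not confirmed Garage metrics) and works on samples already reduced to a mapping from node id to value. The helper name `nodes_missing_peers` and the second node id's values are hypothetical.

```python
def nodes_missing_peers(configured, connected):
    """Return {node_id: missing_count} for nodes that see fewer peers than configured.

    Both arguments map a node id to the value of the proposed metrics
    garage_peers_configured_total and garage_peers_connected_total.
    """
    missing = {}
    for node_id, want in configured.items():
        # Assumption: a node with no "connected" sample is treated as seeing zero peers.
        have = connected.get(node_id, 0)
        if have != want:
            missing[node_id] = want - have
    return missing


configured = {"8c8a4ab1878f5f80": 5, "21262c899a68909d": 5}
connected = {"8c8a4ab1878f5f80": 4, "21262c899a68909d": 5}
print(nodes_missing_peers(configured, connected))
```

Note that the result only says how many peers a node is missing, not which ones.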

Adding too many labels (for instance, capacity is already covered by garage_local_disk_total) is advised against because of cardinality, see:

- https://prometheus.io/docs/practices/naming/#labels
- https://www.robustperception.io/cardinality-is-key/
Author
Owner

Thanks, you're right about cardinality. I believe id, hostname and zone are fine as labels because they are mostly constant, and could be used to e.g. aggregate by zone.
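
To make the "aggregate by zone" idea concrete, here is a minimal Python sketch that sums 0/1 status values per zone label. It assumes the `garage_peer_status` metric proposed at the top of this issue (not an existing Garage metric) and takes samples as already-parsed (labels, value) pairs; the function name `connected_by_zone` is hypothetical.

```python
from collections import defaultdict


def connected_by_zone(samples):
    """Count connected and total peers per zone.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict
    with at least a "zone" key and `value` is 0 (disconnected) or 1 (connected).
    """
    totals = defaultdict(lambda: {"connected": 0, "total": 0})
    for labels, value in samples:
        zone = labels["zone"]
        totals[zone]["total"] += 1
        totals[zone]["connected"] += int(value)
    return dict(totals)


samples = [
    ({"id": "8c8a4ab1878f5f80", "hostname": "foo", "zone": "earth"}, 1),
    ({"id": "21262c899a68909d", "hostname": "bar", "zone": "mars"}, 0),
]
print(connected_by_zone(samples))
```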

Author
Owner

Another alternative to the 0/1 metric could be a "last seen" metric, where 0 means "currently connected" and any positive value means "disconnected for X seconds":

garage_peer_last_seen_seconds{id="8c8a4ab1878f5f80", hostname="foo", zone="earth"} 3142
garage_peer_last_seen_seconds{id="21262c899a68909d", hostname="bar", zone="mars"} 0

This would allow alerting rules such as "send an alert if last seen is above 5 minutes, but stop alerting if it's over 7 days (e.g. because the remote node is really dead)"
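
The alerting window described above can be sketched in a few lines of Python. This is only an illustration of the rule as worded here (above 5 minutes, silent again past 7 days); the constant and function names are hypothetical, and a real deployment would more likely express this as a Prometheus alerting rule.

```python
FIVE_MINUTES = 5 * 60
SEVEN_DAYS = 7 * 24 * 60 * 60


def should_alert(last_seen_seconds):
    """True when a peer has been gone for more than 5 minutes but not more than 7 days."""
    return FIVE_MINUTES < last_seen_seconds <= SEVEN_DAYS


for value in (0, 3142, 8 * 24 * 60 * 60):
    print(value, should_alert(value))
```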

Author
Owner

With your proposed configured_total and connected_total metrics, would you be able to send an alert saying "node M is disconnected from the point of view of node N"?

It seems you will only be able to say "N is missing a peer", but without being able to know who the missing peer actually is. If M has crashed hard, you won't get any metric from it.
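
For contrast, a per-peer metric does let you name the missing peer from N's point of view. The Python sketch below assumes the `garage_peer_last_seen_seconds` metric proposed earlier in this thread and a scrape of node N in Prometheus text format. The parser is deliberately simplified (it does not handle escaped quotes in label values), and `disconnected_peers` is a hypothetical helper name.

```python
import re

# Simplified patterns for lines such as: name{k="v", k2="v2"} 42
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def disconnected_peers(exposition_text, metric="garage_peer_last_seen_seconds"):
    """Return the hostname (or id, if no hostname label) of each peer with a value above zero."""
    missing = []
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        if float(match.group(3)) > 0:
            missing.append(labels.get("hostname", labels.get("id")))
    return missing


scrape_from_node_n = '''
garage_peer_last_seen_seconds{id="8c8a4ab1878f5f80", hostname="foo", zone="earth"} 3142
garage_peer_last_seen_seconds{id="21262c899a68909d", hostname="bar", zone="mars"} 0
'''
print(disconnected_peers(scrape_from_node_n))
```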

Contributor

zone might be tricky, as I already add that in with Prometheus's external_labels:

- https://prometheus.io/docs/prometheus/latest/configuration/configuration/

However, I'm able to do this as I run Thanos on Garage, and run a separate Prometheus instance within each Garage zone, in line with best practice:

- https://thanos.io/tip/thanos/quick-tutorial.md/#prometheus

I'll experiment with seeing if Prometheus simply merges them.

And the garage_peer_last_seen_seconds idea is a great one.

> without being able to know who the missing peer actually is. If M has crashed hard, you won't get any metric from it.

Yes, but at that point you want it to alert on up == 0, from the built-in Prometheus metric.
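
A minimal Python sketch of how the two signals might be combined: `up` is the per-target metric Prometheus records for each scrape (1 on success, 0 on failure), while the configured/connected counts are the metrics proposed earlier in this thread. The function name `classify_node` and the returned state strings are hypothetical.

```python
def classify_node(up, configured=None, connected=None):
    """Classify a node from its scrape health and its (proposed) peer counts."""
    if up == 0:
        # Scrape failed: the node exposes no peer metrics, so report it as down directly.
        return "down"
    if configured is not None and connected is not None and connected != configured:
        return "missing-peers"
    return "ok"


print(classify_node(up=0))
print(classify_node(up=1, configured=5, connected=4))
print(classify_node(up=1, configured=5, connected=5))
```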
lx added the Improvement and AdminAPI labels 2023-04-20 15:54:47 +00:00
lx added this to the v1.0 milestone 2024-02-16 10:26:57 +00:00
Owner

I think I will add at least a basic version of this for v1.0

lx closed this issue 2024-02-20 15:35:13 +00:00
Reference: Deuxfleurs/garage#545