New metric listing Garage peers

baptiste commented

2023-04-19 21:35:34 +00:00

Owner

To monitor the health of a Garage cluster, we would like to know when a node is disconnected from other nodes.

It would be nice to have a new metric with this information. It could look like this for each peer (beware, I have no idea about best practices for naming metrics):

garage_peer_status{id="8c8a4ab1878f5f80", hostname="foo", address="10.X.Y.Z:PPP", zone="earth", capacity="100"} 1

The value would be 0 (unreachable/disconnected) or 1 (connected). While at it, we can include all information shown in garage status.
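As a rough illustration of the proposal above, here is a minimal Python sketch that renders one such sample in the Prometheus text exposition format. The metric name `garage_peer_status` is only the name suggested in this issue, and the helper functions are hypothetical; nothing here reflects how Garage actually exports metrics.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_peer_status(peer: dict, connected: bool) -> str:
    """Render one sample line of the (proposed, hypothetical) garage_peer_status metric."""
    labels = ",".join(
        f'{name}="{escape_label_value(str(value))}"' for name, value in peer.items()
    )
    # 1 means connected, 0 means unreachable/disconnected.
    return f"garage_peer_status{{{labels}}} {1 if connected else 0}"


line = format_peer_status(
    {"id": "8c8a4ab1878f5f80", "hostname": "foo", "zone": "earth"}, connected=True
)
print(line)
```

Label values are always quoted strings in this format, so a numeric label such as `capacity` would also have to be written as `capacity="100"`.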


jpds commented

2023-04-19 22:48:35 +00:00

Contributor

I'm hoping to add this once I have time to finish integrating the opentelemetry version bump, however it'll probably be in the form:

garage_peers_configured_total{id="8c8a4ab1878f5f80"} 5
garage_peers_connected_total{id="8c8a4ab1878f5f80"} 4

From the aggregate of this across all nodes, you'll be able to spot which one is disconnected as its value plummets and set an alert on configured != connected.
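To make the `configured != connected` idea concrete, here is a small Python sketch that compares the two counts per reporting node. The sample values and the third node id are made up for illustration, and the function name is hypothetical; in practice this comparison would live in an alerting rule rather than in Python.

```python
def nodes_missing_peers(configured: dict, connected: dict) -> dict:
    """Return {node_id: missing_count} for nodes whose two counts differ."""
    missing = {}
    for node_id, expected in configured.items():
        seen = connected.get(node_id, 0)
        if seen != expected:
            missing[node_id] = expected - seen
    return missing


# Hypothetical samples, keyed by the id of the node reporting them.
configured = {"8c8a4ab1878f5f80": 5, "21262c899a68909d": 5, "0123456789abcdef": 5}
connected = {"8c8a4ab1878f5f80": 4, "21262c899a68909d": 4, "0123456789abcdef": 0}

missing = nodes_missing_peers(configured, connected)
# The node whose own count dropped the most is the likely isolated one.
most_isolated = max(missing, key=missing.get)
print(missing, most_isolated)
```

This only works as long as the isolated node is still being scraped; the counts alone do not name the peer that went missing.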

Adding too many labels (for instance, capacity is already covered by garage_local_disk_total) is advised against because of cardinality, see:

- https://prometheus.io/docs/practices/naming/#labels
- https://www.robustperception.io/cardinality-is-key/

baptiste commented

2023-04-19 23:12:07 +00:00

Author

Owner

Thanks, you're right about cardinality. I believe id, hostname and zone are fine as labels because they are mostly constant, and could be used to e.g. aggregate by zone.


baptiste commented

2023-04-19 23:13:29 +00:00

Author

Owner

Another alternative to the 0/1 metric could be a "last seen" metric, where 0 means "currently connected" and any positive value means "disconnected for X seconds":

garage_peer_last_seen_seconds{id="8c8a4ab1878f5f80", hostname="foo", zone="earth"} 3142
garage_peer_last_seen_seconds{id="21262c899a68909d", hostname="bar", zone="mars"} 0

This would allow alerting rules such as "send an alert if last seen is above 5 minutes, but stop alerting if it's over 7 days (e.g. because the remote node is really dead)".
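The alerting window described above can be sketched in a few lines of Python. The two thresholds come from the example in this comment; the constant and function names are hypothetical, and a real deployment would express this as an alerting rule on the metric instead.

```python
ALERT_AFTER_SECONDS = 5 * 60                 # start alerting after 5 minutes
GIVE_UP_AFTER_SECONDS = 7 * 24 * 60 * 60     # stop alerting after 7 days


def should_alert(last_seen_seconds: float) -> bool:
    """True while a peer has been gone long enough to matter, but not so long
    that it is presumably dead for good. 0 means currently connected."""
    return ALERT_AFTER_SECONDS < last_seen_seconds <= GIVE_UP_AFTER_SECONDS


for value in (0, 3142, 8 * 24 * 60 * 60):
    print(value, should_alert(value))
```

With the sample values from this comment, the peer last seen 3142 seconds ago falls inside the window and the connected peer (value 0) does not.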


baptiste commented

2023-04-19 23:18:40 +00:00

Author

Owner

With your proposed configured_total and connected_total metrics, would you be able to send an alert saying "node M is disconnected from the point of view of node N"?

It seems you will only be able to say "N is missing a peer", but without being able to know who the missing peer actually is. If M has crashed hard, you won't get any metric from it.
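For comparison, a per-peer 0/1 metric would carry enough information to name both sides. The Python sketch below assumes each sample is available as an (observer, peer id, value) tuple, where the observer would typically come from the `instance` label Prometheus attaches to the scraped target; the hostnames and the function name are hypothetical.

```python
def disconnected_pairs(samples):
    """samples: iterable of (observer, peer_id, value) from a per-peer 0/1 metric."""
    return [
        f"node {peer_id} is disconnected from the point of view of node {observer}"
        for observer, peer_id, value in samples
        if value == 0
    ]


# Hypothetical scrape results from two observing nodes.
samples = [
    ("node-n:3903", "8c8a4ab1878f5f80", 0),
    ("node-n:3903", "21262c899a68909d", 1),
    ("node-p:3903", "8c8a4ab1878f5f80", 1),
]
messages = disconnected_pairs(samples)
print(messages)
```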


jpds commented

2023-04-19 23:36:24 +00:00

Contributor

zone might be tricky, as I already add that in with Prometheus's external_labels:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

However, I'm able to do this as I run Thanos on Garage, and run a separate Prometheus instance within each Garage zone, in line with best practice:

https://thanos.io/tip/thanos/quick-tutorial.md/#prometheus

I'll experiment to see whether Prometheus simply merges them.

And the garage_peer_last_seen_seconds idea is a great one.

> without being able to know who the missing peer actually is. If M has crashed hard, you won't get any metric from it.

Yes, but at that point you want to alert on up == 0, using the built-in Prometheus metric.
