Improved monitoring of repair operations #207
Reference: Deuxfleurs/garage#207
It would be nice to monitor the progress of repair operations.
In particular, for scrub operations we need to know when a corrupted data file is found. (Note that a corrupted data file can also be found in the course of normal operation, in which case it should also be reported.)
We talked about an alerting system using webhooks that could notify the admin when something goes wrong, e.g. by sending them a text message.
Note that on other systems, alerting is often done through metrics and Prometheus.
So another way to do it is to report errors in our future OpenTelemetry endpoint and then put Prometheus alerts on these values.
(see #111)
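As a sketch of that approach: assuming a hypothetical counter metric such as `garage_corrupted_data_files_total` exported through the OpenTelemetry/Prometheus endpoint (the metric name is illustrative, not an actual Garage metric), a Prometheus alerting rule could look like:

```yaml
# prometheus-rules.yml -- illustrative sketch only.
# garage_corrupted_data_files_total is a hypothetical metric name,
# not one that Garage actually exports.
groups:
  - name: garage-repair
    rules:
      - alert: GarageCorruptedDataFile
        # Fires whenever the corruption counter increased
        # over the last 5 minutes.
        expr: increase(garage_corrupted_data_files_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Garage found a corrupted data file on {{ $labels.instance }}"
```

With a rule like this, the admin can route the alert to text messages, e-mail, etc. through Alertmanager instead of Garage implementing webhooks itself.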
I agree, I've seldom seen webhooks for those things inside actual products (besides alerting-specific products).
Having a standard metric that people can alert upon, a message in `garage status`, and logging the cause are the usual approaches.

Be more verbose when there is a data error when running repair or scrub.
Closing this, as we now have improved monitoring of background tasks thanks to #332.