Improved monitoring of repair operations #207

Closed
opened 2022-01-27 13:03:26 +00:00 by lx · 7 comments
Owner

It would be nice to monitor the progress of repair operations.
In particular, for scrub operations, we need to know when a corrupted data file is found. (Note that a corrupted data file can also be found in the course of normal operation, in which case it should be reported as well.)
lx added the Improvement label 2022-01-27 13:05:22 +00:00
Author
Owner

We talked about an alerting system using webhooks, so that the admin can be notified when something goes wrong, e.g. by sending them a text message. So we have these things to do:

  • when errors happen, add them to a log
  • when errors happen, call a webhook if it is configured
  • add a command that shows the errors in the log
  • add a command that shows the progress of the current scrub (like zpool status does)
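
The first three items could fit together roughly as in the sketch below. This is only an illustration, not Garage code: `ErrorEvent`, `Webhook`, `ErrorLog` and `CountingHook` are hypothetical names, the log is a bounded in-memory queue, and the webhook is modelled as a trait so that a real implementation could send an HTTP request to a configured URL.

```rust
use std::collections::VecDeque;

/// Hypothetical error record; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct ErrorEvent {
    source: String,  // e.g. "scrub" or "normal read"
    message: String, // human-readable description of the error
}

/// Hypothetical notification hook. A real implementation would
/// send the event to a configured URL.
trait Webhook {
    fn notify(&mut self, event: &ErrorEvent);
}

/// Bounded in-memory error log with an optional webhook.
struct ErrorLog<W: Webhook> {
    capacity: usize,
    events: VecDeque<ErrorEvent>,
    webhook: Option<W>,
}

impl<W: Webhook> ErrorLog<W> {
    fn new(capacity: usize, webhook: Option<W>) -> Self {
        // Keep at least one entry so the log is never useless.
        Self { capacity: capacity.max(1), events: VecDeque::new(), webhook }
    }

    /// Record an error and call the webhook if one is configured.
    fn report(&mut self, source: &str, message: &str) {
        let event = ErrorEvent {
            source: source.to_string(),
            message: message.to_string(),
        };
        if let Some(hook) = self.webhook.as_mut() {
            hook.notify(&event);
        }
        // Drop the oldest entry once the log is full.
        if self.events.len() == self.capacity {
            self.events.pop_front();
        }
        self.events.push_back(event);
    }

    /// What a "show errors" command would iterate over, oldest first.
    fn list(&self) -> impl Iterator<Item = &ErrorEvent> + '_ {
        self.events.iter()
    }
}

/// Stand-in webhook that only counts how often it was called.
struct CountingHook {
    calls: usize,
}

impl Webhook for CountingHook {
    fn notify(&mut self, _event: &ErrorEvent) {
        self.calls += 1;
    }
}
```

With `webhook` set to `None`, `report` only appends to the log, which matches the "if it is configured" condition above.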
Owner

Note that on other systems, alerting is often done through metrics and Prometheus:

  • https://docs.min.io/minio/baremetal/monitoring/metrics-alerts/minio-metrics-and-alerts.html
  • https://docs.riak.com/riak/cs/latest/cookbooks/monitoring-and-metrics/index.html

So another way to do it is to report errors through our future OpenTelemetry endpoint and then put Prometheus alerts on these values.
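
As a rough illustration of the metrics approach, the sketch below keeps a counter of detected corruptions and renders it in the Prometheus text exposition format. The type `CorruptionCounter` and the metric name `block_corruptions_total` are assumptions made for this example, not names taken from Garage.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter of corrupted data blocks.
struct CorruptionCounter {
    count: AtomicU64,
}

impl CorruptionCounter {
    /// Illustrative metric name, not an existing Garage metric.
    const NAME: &'static str = "block_corruptions_total";

    fn new() -> Self {
        Self { count: AtomicU64::new(0) }
    }

    /// Called whenever a corrupted block is found, by a scrub or otherwise.
    fn record(&self) {
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Render the counter in the Prometheus text exposition format.
    fn render(&self) -> String {
        format!(
            "# HELP {name} Corrupted data blocks detected.\n# TYPE {name} counter\n{name} {value}\n",
            name = Self::NAME,
            value = self.count.load(Ordering::Relaxed),
        )
    }
}
```

An alert could then be defined on the Prometheus side, for instance on an expression such as `increase(block_corruptions_total[1h]) > 0`, so no notification logic is needed in the storage daemon itself.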
Author
Owner

(see #111)
Owner

I agree, I've seldom seen webhooks for this kind of thing inside actual products (besides alerting-specific products).

Having a standard metric that people can alert upon, a message in `garage status`, and logging the cause are the usual approaches.
Owner

Be more verbose when a data error occurs while running a repair or scrub.
Author
Owner
  • API to view repair operations in progress
  • API to dynamically change parameters of some operations (e.g. tranquility of a scrub)
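
A minimal sketch of the state such an API might expose is shown below. Everything here is hypothetical (`ScrubState` and its methods are not Garage's actual types), and it assumes that tranquility means "after a unit of work that took time t, pause for t times the tranquility value".

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical shared state of a running scrub, readable and
/// adjustable while the operation is in progress.
struct ScrubState {
    total_blocks: u64,
    done_blocks: AtomicU64,
    tranquility: AtomicU32,
}

impl ScrubState {
    fn new(total_blocks: u64, tranquility: u32) -> Self {
        Self {
            total_blocks,
            done_blocks: AtomicU64::new(0),
            tranquility: AtomicU32::new(tranquility),
        }
    }

    /// Called by the worker after each block has been checked.
    fn block_done(&self) {
        self.done_blocks.fetch_add(1, Ordering::Relaxed);
    }

    /// What a "change parameters" API call would do.
    fn set_tranquility(&self, value: u32) {
        self.tranquility.store(value, Ordering::Relaxed);
    }

    /// What a "view operations in progress" API call would report.
    fn progress_percent(&self) -> f64 {
        if self.total_blocks == 0 {
            return 100.0;
        }
        100.0 * self.done_blocks.load(Ordering::Relaxed) as f64 / self.total_blocks as f64
    }

    /// Pause to take after a unit of work that lasted `busy`
    /// (assumed semantics: busy time multiplied by tranquility).
    fn pause_after(&self, busy: Duration) -> Duration {
        busy * self.tranquility.load(Ordering::Relaxed)
    }
}
```

Because the fields are atomics, the worker and the API handler can share one `ScrubState` behind an `Arc` without further locking, and a new tranquility value takes effect at the next pause.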
Author
Owner

Closing this as we now have improved monitoring of background tasks thanks to #332
lx closed this issue 2022-09-14 11:12:43 +00:00
Reference: Deuxfleurs/garage#207