Improved monitoring of repair operations #207
Reference: Deuxfleurs/garage#207
It would be nice to monitor the progress of repair operations.
In particular, for scrub operations we need to know when a corrupted data file is found. (Note that a corrupted data file can also be found in the course of normal operation, in which case it should also be reported.)
We talked about an alerting system using webhooks to be able to notify the admin when something is going wrong, e.g. by sending them a text message. So we have these things to do:
Note that on other systems, alerting is often done through metrics and Prometheus:
So another way to do it is to report errors through our future OpenTelemetry endpoint and then put Prometheus alerts on these values.
(see #111)
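To make the metrics-based approach concrete, here is a minimal sketch of what a scrub worker could export. It is only an illustration under assumptions: `ScrubMetrics`, `record_check`, and the metric names `scrub_blocks_checked_total` and `scrub_blocks_corrupted_total` are hypothetical and not taken from Garage, and it renders the Prometheus text exposition format directly with the standard library instead of going through an OpenTelemetry endpoint.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters for a scrub worker (names are illustrative, not Garage's).
pub struct ScrubMetrics {
    blocks_checked: AtomicU64,
    blocks_corrupted: AtomicU64,
}

impl ScrubMetrics {
    pub const fn new() -> Self {
        Self {
            blocks_checked: AtomicU64::new(0),
            blocks_corrupted: AtomicU64::new(0),
        }
    }

    /// Record the outcome of verifying one data block.
    pub fn record_check(&self, corrupted: bool) {
        self.blocks_checked.fetch_add(1, Ordering::Relaxed);
        if corrupted {
            self.blocks_corrupted.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Render both counters in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        format!(
            "# TYPE scrub_blocks_checked_total counter\n\
             scrub_blocks_checked_total {}\n\
             # TYPE scrub_blocks_corrupted_total counter\n\
             scrub_blocks_corrupted_total {}\n",
            self.blocks_checked.load(Ordering::Relaxed),
            self.blocks_corrupted.load(Ordering::Relaxed),
        )
    }
}
```

With counters like these, an alert could be a Prometheus rule on an expression such as `increase(scrub_blocks_corrupted_total[1h]) > 0` (again using the hypothetical metric name), so the admin is notified by whatever alerting channel they already have rather than by a Garage-specific webhook.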
I agree, I've seldom seen webhooks for those things inside of actual products (besides alerting-specific products).
Having a standard metric that people can alert upon, a message in `garage status`, and logging the cause are the usual approaches.
Be more verbose when there is a data error when running repair or scrub.
Closing this as we now have improved monitoring of background tasks thanks to #332