Improved monitoring of repair operations #207
Reference: Deuxfleurs/garage#207
It would be nice to monitor the progress of repair operations.
In particular, for scrub operations we need to know when a corrupted data file is found. (Note that a corrupted data file can also be found in the course of normal operation, in which case it should also be reported.)
We talked about an alerting system using webhooks that could notify the admin when something goes wrong, e.g. by sending them a text message.
Note that on other systems, alerting is often done through metrics and Prometheus.
So another way to do it is to report errors in our future OpenTelemetry endpoint and then put Prometheus alerts on these values.
(see #111)
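As a sketch of that approach: assuming a hypothetical counter metric such as `garage_corrupted_data_files_total` exported through the OpenTelemetry/Prometheus endpoint (the metric name is illustrative, not an actual Garage metric), a Prometheus alerting rule could look like:

```yaml
# prometheus-rules.yml -- illustrative sketch only.
# garage_corrupted_data_files_total is a hypothetical metric name,
# not one that Garage actually exports.
groups:
  - name: garage-repair
    rules:
      - alert: GarageCorruptedDataFile
        # Fires whenever the corruption counter increased
        # over the last 5 minutes.
        expr: increase(garage_corrupted_data_files_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Garage found a corrupted data file on {{ $labels.instance }}"
```

With a rule like this, the admin can route the alert to text messages, e-mail, etc. through Alertmanager instead of Garage implementing webhooks itself.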
I agree, I've seldom seen webhooks for those things inside actual products (besides alerting-specific products).
Having a standard metric that people can alert upon, a message in `garage status`, and logging the cause are the usual approaches.

Be more verbose when there is a data error when running repair or scrub.
Closing this, as we now have improved monitoring of background tasks thanks to #332.