Proposal: Webhook notifications for regular operations #338
Reference: Deuxfleurs/garage#338
Building an application using Garage for storage is quite awesome (especially the upcoming k/v store). However, I (personally) would love to be able to configure webhooks when operations complete in order to perform analysis, enforce simple policies, and other useful things without polling. Here are some operations I'm thinking of:
Note that this is similar to AWS Notifications, except that it would be defined as part of the Garage configuration and would use regular HTTP callbacks with a JSON payload. The payload would look something like:
and be defined in the configuration maybe something like:
It would be up to engineers/operators to handle the hooks and route hooks they are interested in receiving.
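To make the receiving side concrete, here is a minimal sketch of how an operator might route incoming hooks to only the handlers that care about them. The `operation` and `key` fields, the operation names, and the `on`/`route` helpers are all hypothetical; the proposal does not settle a payload schema.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[dict], None]

# Hypothetical routing table: operation name -> handlers interested in it.
ROUTES: Dict[str, List[Handler]] = {}


def on(operation: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for one operation name."""
    def register(fn: Handler) -> Handler:
        ROUTES.setdefault(operation, []).append(fn)
        return fn
    return register


def route(raw_body: bytes) -> int:
    """Dispatch one webhook body; return how many handlers were invoked.

    Payloads for operations nobody registered for are silently ignored.
    """
    payload = json.loads(raw_body)
    handlers = ROUTES.get(payload.get("operation"), [])
    for handler in handlers:
        handler(payload)
    return len(handlers)
```

An HTTP endpoint would simply pass the request body to `route`; hooks the operator is not interested in cost one dictionary lookup.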
In the future, it may make sense to allow buckets to opt-in/out of webhooks, or provide a more fine-grained model, in order to prevent a massive amount of calls for larger clusters.
I'd be happy to take a stab at implementing this, but first I wanted to propose this to you to see if you'd be interested in having it, get thoughts on how it might be better, and/or what kind of data should be sent.
@withinboredom, thanks for the suggestion and for providing such a detailed description of your desired use case. I think this would be an interesting feature, and probably not too hard to implement by hooking into the correct places in the API. I've tagged this with "Newcomer" because I believe it would be easy to add a rudimentary version into Garage without too much effort.
I would however like to bring to your attention a specific point that might require a bit of thought: reliability of the webhook triggers. Indeed, Garage has to tolerate many kinds of failures, and we are operating under the assumption that node failure can happen at any point in the code. In particular, if we simply implement webhooks as a function that is triggered by the API node answering a particular request just after the request has been completed, we expose ourselves to two kinds of pathological situations where the webhook isn't triggered:
The API node crashed after the operation completed but before triggering the webhook
The API node was not able to reach quorum when doing its operation, thus it won't trigger the webhook as it considers the operation failed, but the operation in fact still reached one correct node and will be eventually propagated to the entire cluster (this is a special kind of behaviour that is quite specific to how Garage works).
If we define the webhook semantics as "might skip some events, make sure to poll regularly for changes", then this is fine. However if we expect some kind of strong reliability from the webhook, we would need to devise some more advanced way of recovering missed events and triggering webhooks for them.
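Under the "might skip some events" semantics, the consumer side could look roughly like the sketch below: webhooks drive the fast path, and a periodic full listing catches whatever was missed. The `reconcile` function and the idea of comparing per-key version tags (an ETag, for instance) are assumptions made for illustration, not part of Garage.

```python
from typing import Dict, Set, Tuple


def reconcile(
    processed: Dict[str, str],
    listing: Dict[str, str],
) -> Tuple[Set[str], Set[str]]:
    """Find events that webhooks failed to deliver.

    processed: key -> version tag recorded when a webhook was handled
    listing:   key -> version tag from a periodic full bucket listing
    Returns (keys that are new or changed, keys that disappeared).
    """
    # A key is stale if we never saw it or saw a different version.
    changed = {k for k, tag in listing.items() if processed.get(k) != tag}
    # A key we processed but that is no longer listed was deleted.
    deleted = set(processed) - set(listing)
    return changed, deleted
```

The polling interval then bounds how long a missed event can go unnoticed, and it can be much longer than it would be without webhooks.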
Alternatively, we could also implement webhooks by adding a trigger at each of the replica nodes: in this case, each webhook will be triggered not once but three times for each operation. This might seem wasteful, but if your handling of webhooks is idempotent then this is one of the easiest ways to implement reliability.
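As a sketch of what idempotent handling could mean on the receiving end, the class below applies each event once and drops the duplicates sent by the other replicas. The event fields (`operation`, `bucket`, `key`, `version`) are hypothetical, and it assumes every replica reports the same identifying values for a given operation.

```python
from typing import List, Set, Tuple


class IdempotentReceiver:
    """Applies each distinct event once, ignoring replica duplicates."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str, str, str]] = set()
        self.applied: List[dict] = []

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        # Hypothetical identity: all replicas must agree on these fields.
        ident = (
            event["operation"],
            event["bucket"],
            event["key"],
            event["version"],
        )
        if ident in self._seen:
            return False
        self._seen.add(ident)
        self.applied.append(event)
        return True
```

The in-memory set grows without bound here; a real receiver would more likely keep the identities in a persistent store and expire old entries.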
In your use case, do you envision a scenario where missing a webhook trigger causes a lot of issues in your system, or could you recover from such missed events in some way? More generally, what do you think of this issue?
In general, I think they should be reliable. As the k2v feature matures, the operator could configure a bucket and we could potentially store the status of webhooks there as a way to make them reliable and queryable (for debugging hook failures, cancelling all pending webhooks, etc). This would work especially well if k2v were to support TTL for keys.
The downside to adding reliability is that no matter what, something is going to be "amplified" by performing operations, either by increasing the load on the disks due to performing more writes or increasing the load on the network by making multiple calls; this should still be less intensive than polling infinitely (I think).
So, my 2¢ is that for an initial implementation, focus on getting the basics ("might skip some events, make sure to poll regularly for changes") in place and then evaluate how we might make them reliable at a later date.
I've created a POC in #340