Proposal: Webhook notifications for regular operations #338

Open
opened 2022-07-07 11:58:23 +00:00 by withinboredom · 3 comments
Contributor

Building an application using Garage for storage is quite awesome (especially the upcoming k/v store). However, I (personally) would love to be able to configure webhooks when operations complete in order to perform analysis, enforce simple policies, and other useful things without polling. Here are some operations I'm thinking of:

  • BucketCreated
  • BucketDeleted
  • ObjectCopied
  • ObjectDeleted
  • ObjectCreated
  • MultipartObjectStarted
  • MultipartObjectFinished
  • WebsiteCorsChanged
  • WebsiteCreated
  • WebsiteDeleted

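Note that this is similar to AWS Notifications (https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketNotificationConfiguration.html), except that it would be defined as part of the Garage configuration and would use regular HTTP callbacks with a JSON payload. The payload would look something like: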

{
  "type": "ObjectDeleted",
  "bucket": "{bucket name}",
  "via": "{key id}",
  "at": "{time}",
  "object": "{object uri}"
}

and be defined in the configuration maybe something like:

webhook_uri = "https://example.com/garage_hooks/"
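For illustration, here is a minimal Python sketch of how a node might serialize the proposed payload before POSTing it to `webhook_uri`. It is not Garage code (Garage is written in Rust). `WebhookEvent` and `make_payload` are hypothetical names, and the ISO 8601 UTC format for `at` is an assumption, since the proposal leaves the time format open.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class WebhookEvent:
    """Hypothetical record mirroring the proposed JSON payload."""
    type: str    # e.g. "ObjectDeleted"
    bucket: str  # bucket name
    via: str     # access key id that performed the operation
    at: str      # timestamp; ISO 8601 UTC is an assumption
    object: str  # object URI


def make_payload(event_type: str, bucket: str, key_id: str, obj: str,
                 when: Optional[datetime] = None) -> str:
    """Serialize one event as the JSON body of a webhook POST."""
    when = when or datetime.now(timezone.utc)
    event = WebhookEvent(type=event_type, bucket=bucket, via=key_id,
                         at=when.isoformat(), object=obj)
    return json.dumps(asdict(event))
```

Keeping the payload flat and identical across event types means a receiver can parse every hook the same way and branch only on `type`.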

It would be up to engineers/operators to handle the hooks and route hooks they are interested in receiving.
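On the receiving side, routing could be as simple as dispatching on the `type` field. The Python sketch below assumes the payload shape proposed above; `HookRouter` is a hypothetical helper, not part of Garage. Events with no registered handler are ignored, which fits the idea that operators only act on the hooks they care about.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class HookRouter:
    """Dispatch incoming webhook bodies to handlers keyed by event type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event_type: str, handler: Handler) -> None:
        """Register (or replace) the handler for one event type."""
        self._handlers[event_type] = handler

    def dispatch(self, body: str) -> bool:
        """Return True if a handler ran, False if the event was ignored."""
        event = json.loads(body)
        handler = self._handlers.get(event.get("type"))
        if handler is None:
            return False
        handler(event)
        return True
```

For example, `router.on("ObjectDeleted", audit_delete)` would send only deletions to a (hypothetical) `audit_delete` function while every other event type is dropped.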

In the future, it may make sense to allow buckets to opt-in/out of webhooks, or provide a more fine-grained model, in order to prevent a massive amount of calls for larger clusters.
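A finer-grained model could combine a cluster-wide default with per-bucket opt-in/opt-out and an optional allow-list of event types. This Python sketch is purely hypothetical: `WebhookPolicy` and its fields are illustrative names, not existing Garage configuration options.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class WebhookPolicy:
    """Hypothetical per-bucket webhook settings (not a Garage option)."""
    default_enabled: bool = True
    # Per-bucket override: True = opt in, False = opt out.
    bucket_overrides: Dict[str, bool] = field(default_factory=dict)
    # Optional per-bucket allow-list of event types (missing = all types).
    bucket_event_types: Dict[str, Optional[FrozenSet[str]]] = field(default_factory=dict)

    def should_send(self, bucket: str, event_type: str) -> bool:
        """Decide whether an event for this bucket triggers a webhook."""
        enabled = self.bucket_overrides.get(bucket, self.default_enabled)
        if not enabled:
            return False
        allowed = self.bucket_event_types.get(bucket)
        return allowed is None or event_type in allowed
```

On a large cluster, setting `default_enabled=False` and opting in only the buckets that need hooks would bound the number of outgoing calls.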

I'd be happy to take a stab at implementing this, but first I wanted to propose this to you to see if you'd be interested in having it, get thoughts on how it might be better, and/or what kind of data should be sent.

lx added the Improvement and Newcomer labels 2022-07-07 13:23:12 +00:00
Owner

@withinboredom, thanks for the suggestion and for providing such a detailed description of your desired use case. I think this would be an interesting feature, and probably not too hard to implement by hooking into the correct places in the API. I've tagged this with "Newcomer" because I believe it would be easy to add a rudimentary version into Garage without too much effort.

I would, however, like to bring to your attention a specific point that might require a bit of thought: the reliability of the webhook triggers. Garage has to tolerate many kinds of failures, and we operate under the assumption that a node can fail at any point in the code. In particular, if we simply implement webhooks as a function triggered by the API node just after it has completed a particular request, we expose ourselves to two kinds of pathological situations where the webhook isn't triggered:

  1. The API node crashed after the operation completed but before triggering the webhook.

  2. The API node was not able to reach quorum when doing its operation, thus it won't trigger the webhook as it considers the operation failed, but the operation in fact still reached one correct node and will be eventually propagated to the entire cluster (this is a special kind of behaviour that is quite specific to how Garage works).

If we define the webhook semantics as "might skip some events, make sure to poll regularly for changes", then this is fine. However, if we expect strong reliability from the webhooks, we would need to devise a more advanced way of recovering missed events and triggering webhooks for them.
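Under the "might skip some events" semantics, the trigger on the API node could look roughly like the following Python sketch. It is not Garage code; `notify_best_effort` and `http_post` are hypothetical names, and the retry count and backoff values are arbitrary. Note that neither failure mode listed above is covered: if the node crashes before this runs, or the request is reported as failed even though it later propagates, nothing is sent.

```python
import json
import time
import urllib.request
from typing import Callable

Sender = Callable[[str, bytes], None]


def http_post(url: str, body: bytes) -> None:
    """POST a JSON body; urlopen raises URLError/HTTPError on failure."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def notify_best_effort(url: str, event: dict, send: Sender = http_post,
                       attempts: int = 3, backoff_s: float = 0.5) -> bool:
    """Try a few times with exponential backoff, then drop the event.

    Never raises, so a failing webhook endpoint cannot fail the S3 request.
    Dropped events must be recovered by the consumer polling for changes.
    """
    body = json.dumps(event).encode("utf-8")
    for attempt in range(attempts):
        try:
            send(url, body)
            return True
        except Exception:
            if attempt + 1 < attempts:
                time.sleep(backoff_s * (2 ** attempt))
    return False
```

Because a timed-out request may still have reached the receiver, even this simple version can occasionally deliver an event twice, so receivers benefit from idempotent handling either way.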

Alternatively, we could implement webhooks by adding a trigger on all of the replica nodes: in this case, each webhook will be triggered not once but three times for each operation (once per replica). This might seem wasteful, but if your webhook handling is idempotent, this is one of the easiest ways to implement reliability.
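With one trigger per replica, the receiver sees each event several times and has to collapse duplicates. This Python sketch assumes every event carries a stable `id` that is identical on all replicas; the proposed payload has no such field, so `id` is a hypothetical addition, as is `DedupingReceiver`. Only a bounded number of recent ids is remembered, so the handler should still be idempotent for very late duplicates.

```python
from collections import OrderedDict
from typing import Callable


class DedupingReceiver:
    """Run a handler at most once per (recently seen) event id."""

    def __init__(self, handler: Callable[[dict], None], capacity: int = 10_000) -> None:
        self._handler = handler
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._capacity = capacity

    def receive(self, event: dict) -> bool:
        """Return True if the handler ran, False for a duplicate delivery."""
        event_id = event["id"]  # hypothetical field shared by all replicas
        if event_id in self._seen:
            self._seen.move_to_end(event_id)
            return False
        # Record the id only after the handler succeeds, so a failed
        # attempt can be retried when another replica delivers the event.
        self._handler(event)
        self._seen[event_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the least recently seen id
        return True
```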

In your use case, do you envision a scenario where missing a webhook trigger causes a lot of issues in your system, or could you recover from such missed events in some way? More generally, what do you think of this issue?

Author
Contributor

> In your use case, do you envision a scenario where missing a webhook trigger causes a lot of issues in your system, or could you recover from such missed events in some way? More generally, what do you think of this issue?

In general, I think they should be reliable. As the K2V feature matures, the operator could configure a bucket where we store the status of webhooks, as a way to make them reliable and queryable (for debugging hook failures, cancelling all pending webhooks, etc.). This would work especially well if K2V were to support TTL for keys.
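The idea of keeping webhook state in a bucket resembles an outbox queue: record the event, attempt delivery, delete the entry on success, and let entries expire after a TTL. In this Python sketch an in-memory dict stands in for the K2V bucket and expiry is emulated with a per-entry timestamp; `WebhookOutbox` and `Pending` are hypothetical names, and nothing here reflects the actual K2V API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Pending:
    event: dict
    expires_at: float           # emulates a per-key TTL
    attempts: int = 0
    last_error: Optional[str] = None


class WebhookOutbox:
    """In-memory stand-in for a bucket holding undelivered webhooks."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.time) -> None:
        self._items: Dict[str, Pending] = {}
        self._ttl_s = ttl_s
        self._clock = clock

    def enqueue(self, event_id: str, event: dict) -> None:
        self._items[event_id] = Pending(event, self._clock() + self._ttl_s)

    def _expire(self) -> None:
        now = self._clock()
        for k in [k for k, p in self._items.items() if p.expires_at <= now]:
            del self._items[k]

    def flush(self, send: Callable[[dict], None]) -> int:
        """Try every pending event; delete on success, keep the error otherwise."""
        self._expire()
        delivered = 0
        for event_id, p in list(self._items.items()):
            p.attempts += 1
            try:
                send(p.event)
            except Exception as exc:
                p.last_error = repr(exc)
                continue
            del self._items[event_id]
            delivered += 1
        return delivered

    def pending(self) -> List[str]:
        """Queryable view of undelivered events, for debugging hook failures."""
        self._expire()
        return sorted(self._items)

    def cancel_all(self) -> None:
        self._items.clear()
```

The extra writes for each enqueue and delete are the disk-side "amplification" that reliability costs.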

The downside to adding reliability is that, no matter what, something is going to be "amplified" by each operation: either the load on the disks increases because of the extra writes, or the load on the network increases because of the multiple calls. This should still be less intensive than polling indefinitely (I think).

So, my 2¢ is that for an initial implementation, focus on getting the basics ("might skip some events, make sure to poll regularly for changes") in place and then evaluate how we might make them reliable at a later date.

Author
Contributor

I've created a POC in #340

Reference: Deuxfleurs/garage#338