WIP: POC for webhooks #340

Closed
withinboredom wants to merge 2 commits from withinboredom/garage:main into main
Contributor

I've created this PR as a way to experiment with webhooks and get to know the codebase better. It may or may not be indicative of the final PR. I don't want to take up too much of your time, so I'm mostly looking for a review of the general approach and implementation, and less of a "full" code review (I'm fairly certain -- even as a newcomer to Rust -- that I've done some terrible-ish things here).

## Intended Behavior

After our brief discussion in #338, I decided to go with a slightly different tack. Instead of worrying about unreliable webhooks, perhaps a webhook can be an indication that something MIGHT have happened. IOW, it is telling the application "hey, something might have happened with this (bucket, key) and you should check it out." The application can make its own determination about when, or if, it should do anything with that notification.

This has a few slight effects. For example, on receiving a webhook for a PutObject, the object may not have finished writing to disk and/or may be cancelled. However, an application knows to expect something at that location in the near future and can check for it with an exponential backoff, eventually giving up after a reasonable time.
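To make that concrete, here is a minimal sketch of what a handler could do on receiving a PutObject hook (illustrative only: `object_exists` stands in for whatever HEAD/GetObject call the application makes, and the retry bounds are arbitrary):

```rust
use std::time::Duration;

/// Hypothetical check against the (bucket, key) named in the hook,
/// e.g. a HEAD Object request through the application's S3 client.
async fn object_exists(bucket: &str, key: &str) -> bool {
    unimplemented!()
}

/// Poll with exponential backoff, giving up after a bounded number of
/// tries (~25s worst case here), since the write may never complete.
async fn await_object(bucket: &str, key: &str) -> bool {
    let mut delay = Duration::from_millis(100);
    for _ in 0..8 {
        if object_exists(bucket, key).await {
            return true;
        }
        tokio::time::sleep(delay).await;
        delay *= 2;
    }
    false // give up: the write may have been cancelled
}
```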

I believe this gives the best trade-off for reliability and speed, but I'm open to suggestions.

## Manual Testing

Start a simple webserver to output requests in your terminal:

```
echo '<?php error_log(file_get_contents("php://input"));' > server.php
php -S 0.0.0.0:8088 server.php
```

Add a configuration line to `script/dev-cluster.sh`:

```
...
webhook_uri = "http://localhost:8088"
...
```

Start up a local testing cluster:

```
make
./script/dev-cluster.sh
# in another terminal
./script/dev-configure.sh
./script/dev-bucket.sh
dd if=/dev/urandom of=/tmp/garage.1.rnd bs=1k count=2
source ./script/dev-env-s3cmd.sh
s3cmd put /tmp/garage.1.rnd "s3://eprouvette/test"
```

You should see some output from the web server like:

```
[Fri Jul  8 16:10:17 2022] PHP 8.1.2 Development Server (http://0.0.0.0:8088) started
[Fri Jul  8 16:11:36 2022] 127.0.0.1:40714 Accepted
[Fri Jul  8 16:11:36 2022] {"hook_type":"PutObject","bucket":"eprouvette","object":"test","via":"GK36b72cfa483251da1a1e27b4"}
[Fri Jul  8 16:11:36 2022] 127.0.0.1:40714 Closing
```
withinboredom added 2 commits 2022-07-08 14:14:34 +00:00
Owner

Thanks for the work! It's great to see people dive into the codebase to try to add new things :)

I like your idea of having webhooks be more like an indication that something "might have happened", rather than a source of truth, as it lightens the constraints a lot on how to implement them.

Some remarks:

- Calling a webhook should rather be a background job. Here you are waiting for the webhook to complete its action before returning to the S3 client that did the PutObject, which could make the entire S3 service much slower if the webhook handler is slow for some reason. (There is a specific idiom in Garage for launching background tasks: `garage.background.spawn(async move { /* your task */ })`, instead of `tokio::spawn`, in which case Garage will try to finish as many spawned tasks as possible before exiting when it receives a SIGINT/SIGTERM.)

- I understand your idea of launching the webhook at the same time the `put_object` handler starts, so that the webhook handler gets the notification ASAP. However, I think in most cases something like this will happen: the webhook handler gets the notification, tries to read the newly written object, sees nothing because Garage hasn't completed the write internally, and goes into an exponential-backoff waiting strategy (a PutObject has several internal steps in Garage, and the completion of the last of these steps is necessary for the new data to be visible). Since the handler doesn't know how long to wait, it has to use a pessimistic waiting strategy with a relatively large delay, actually making everything much slower.

  An even worse scenario could happen if there was already an object before the PutObject: the webhook handler could read the old version of the object and believe it's the new version, in which case the changes in the new version won't be handled at all by the webhook handler.

  In Garage, we do have the guarantee that after `put_object` has finished, any GetObject call will see the new version. I think it's worth the cost of waiting for `put_object` to finish before spawning the webhook, as it ensures that a webhook handler will always see the updated version. Note that this would also allow the webhook handler to receive information on the return code of the PutObject call. Basically, it would look something like this:

```rust
let put_result = handle_put(garage, req, &bucket, &key, content_sha256).await;
let result_code = http_response_code(&put_result); // (this function doesn't exist yet)
let hook = create_put_object_hook(bucket_name, &key, api_key.key_id, result_code);
garage.background.spawn(call_hook(garage.clone(), hook));
put_result
```
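As an aside, a hedged sketch of what that hypothetical `http_response_code` helper could look like, assuming the handler returns a `Result<Response<Body>, E>` and that the error type can report an HTTP status (none of this is Garage's actual API):

```rust
use hyper::{Body, Response, StatusCode};

/// Assumed trait for the sketch: the API error type knows which
/// HTTP status it maps to. Garage's real error type may differ.
trait HttpStatus {
    fn status_code(&self) -> StatusCode;
}

/// Hypothetical helper: extract the status code a PutObject call will
/// return, so it can be embedded in the webhook payload.
fn http_response_code<E: HttpStatus>(result: &Result<Response<Body>, E>) -> u16 {
    match result {
        Ok(resp) => resp.status().as_u16(),
        Err(e) => e.status_code().as_u16(),
    }
}
```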

Also, just a tiny remark on JSON serialization: we use `#[serde(rename_all = "camelCase")]` on all our APIs (see for instance [here](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/api/admin/cluster.rs#L98)), so it would make sense to use it also on the webhooks (here it would rename the `hook_type` field to `hookType` in the JSON struct, consistent with the other APIs of Garage).
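Applied to the hook struct behind this PR's payload, that would look roughly like this (field set taken from the example output above; the struct name is illustrative):

```rust
use serde::Serialize;

// With rename_all = "camelCase", snake_case field names are converted
// on serialization, consistent with Garage's other JSON APIs.
#[derive(Serialize)]
#[serde(rename_all = "camelCase")]
struct ObjectHook {
    hook_type: String, // becomes "hookType" in the JSON output
    bucket: String,
    object: String,
    via: String,
}
```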

Author
Contributor

I've had to vanish off the face of the earth due to some family issues. I'm back and I'm hoping to get back into this. I just wanted to give an update now that I'm back on the web.

Owner

Thanks for the update @withinboredom, I hope everything is well. Take care.

Owner

For the final implementation, it could be interesting to implement the following, so that only some notifications are sent to the webhook rather than all of them:

- https://docs.aws.amazon.com/AmazonS3/latest/userguide/NotificationHowTo.html
- https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketNotificationConfiguration.html

We would then say that we guarantee "at most once" delivery instead of "at least once"? You can pass only the `TopicConfiguration` entry and it should work with your current implementation.
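For reference, a `TopicConfiguration` entry in the standard S3 notification configuration XML looks like this (the `Id` value and topic ARN here are placeholders):

```xml
<NotificationConfiguration>
   <TopicConfiguration>
      <Id>object-created-hook</Id>
      <Topic>arn:aws:sns:us-east-1:123456789012:my-topic</Topic>
      <Event>s3:ObjectCreated:*</Event>
   </TopicConfiguration>
</NotificationConfiguration>
```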

Now that we are filtering the events that are sent to the webhook, we can imagine that we don't want all events going to the same webhook.

We could then imagine that the Garage config file could contain a list of webhooks, each with an associated name. Each entry name would be the "topic name".
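A hedged sketch of what that could look like in the config file (section name, keys, and URLs are all illustrative, not an implemented syntax):

```toml
# Hypothetical: each entry name is a "topic name",
# each value the webhook target it maps to.
[webhooks]
thumbnailer = "http://localhost:8088/hooks"
audit = "https://audit.internal:9443/s3-events"
```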

Pros:

- the gateway cannot send data to an unknown target (a security feature)
- we keep compatibility with the AWS ecosystem

Cons:

- the gateway(s) must be restarted to add a new webhook target

Another option would be to configure the webhooks dynamically, from the admin API.
It would still involve some admin work, but it could be easier to manage.

But implementing webhooks raises a question: we will have some queuing somewhere, so how do we want to manage it? Do we want to perform this queuing inside Garage or outside?

We could also take inspiration from the Ceph implementation here:

- https://docs.ceph.com/en/quincy/radosgw/notifications/
- https://docs.ceph.com/en/quincy/radosgw/s3/bucketops/#id22

Basically, we would follow an even more abstract approach:

- an administrator creates a topic
- a topic is associated with zero, one, or more endpoints
- endpoints can be a webhook, an AMQP endpoint, a Kafka endpoint, etc.
- (we could have the same permission logic between keys and topics)
- a key could then add a `BucketNotificationConfiguration` to a bucket configured with one of these topics.
Owner

Is this alive?

First-time contributor

I came here looking to integrate Garage into an event-driven system, similar to how S3/Lambda works today. In that case, at-least-once delivery is quite important.

> But implementing webhooks raises a question: we will have some queuing somewhere, so how do we want to manage it? Do we want to perform this queuing inside Garage or outside?

Could this be handily managed by Garage K/V? That way the queue itself is distributed with the same guarantees that K/V provides.

I do also see value in being able to separate configuration for a bucket and the events defined on that bucket.

Happy to pitch in on this

Owner

> Could this be handily managed by Garage K/V? That way the queue itself is distributed with the same guarantees that K/V provides.

It could be stored in a new table of Garage's metadata engine, which basically implies the same guarantees, but means that 1/ queue items are internal data and are not made available publicly in k2v (they don't pollute the k2v namespace), and 2/ appropriate CRDT data types can be used. That's how I would do it anyway.
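A hedged sketch of what such an internal queue entry could look like (plain serde types only; a real implementation would plug into Garage's table schema and CRDT machinery, which is not shown here):

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical entry in a dedicated metadata table holding
/// pending notifications, invisible from the public k2v namespace.
#[derive(Serialize, Deserialize, Clone)]
struct PendingNotification {
    bucket_id: Vec<u8>, // partition key: the bucket the event belongs to
    event_id: u64,      // sort key: ordering of events within the bucket
    event_type: String, // e.g. "s3:ObjectCreated:Put"
    object_key: String, // the object the event refers to
    delivered: bool,    // flipped once the webhook call has succeeded
}
```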

> I do also see value in being able to separate configuration for a bucket and the events defined on that bucket.

Can you give an example?

Owner

Closing as there is no active development on this PR. Discussion can continue in #338.

lx closed this pull request 2023-08-28 09:27:16 +00:00