WIP: POC for webhooks #340
I've created this PR as a way to experiment with webhooks and get to know the codebase better. It may or may not be indicative of the final PR. I don't want to take up too much of your time, so I'm mostly looking for a review of the general approach and implementation, and less of a "full" code review (I'm fairly certain -- even as a newcomer to Rust -- that I've done some terrible-ish things here).
Intended Behavior
After our brief discussion in #338, I decided to go with a slightly different tack. Instead of worrying about unreliable webhooks, perhaps a webhook can be an indication that something MIGHT have happened. In other words, it tells the application: "hey, something might have happened with this (bucket, key) and you should check it out." The application can make its own determination about when, or if, it should do anything with that notification.
This has a few slight effects. For example, on receiving a webhook for a PutObject, the object may not have finished writing to disk, or the write may even be cancelled. However, the application knows to expect something at that location in the near future and can check for it with an exponential backoff, eventually giving up after a reasonable time.
I believe this gives the best trade-off for reliability and speed, but I'm open to suggestions.
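Concretely, such a notification could carry as little as the event type and location. Something like this sketch (field names other than `hook_type` are illustrative):

```rust
use serde::Serialize;

// A hint that something MIGHT have happened at (bucket, key); the
// receiving application decides when, or if, to act on it.
#[derive(Serialize)]
struct WebhookPayload {
    hook_type: String, // e.g. "PutObject"
    bucket: String,
    key: String,
}
```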
Manual Testing
1. Start a simple webserver to output requests in your terminal (one option is sketched below).
2. Add a configuration line to `script/dev-cluster.sh`.
3. Start up a local testing cluster.

You should then see the webhook requests printed by the web server.
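For step 1, any server that dumps incoming requests will do. For example, a minimal sketch using only the Rust standard library (the port is arbitrary):

```rust
use std::io::{Read, Write};
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Listen locally; point the webhook configuration at http://127.0.0.1:8081/.
    let listener = TcpListener::bind("127.0.0.1:8081")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // A single read is enough to see small webhook requests; a real
        // server would loop until the full request has been consumed.
        let mut buf = [0u8; 8192];
        let n = stream.read(&mut buf)?;
        println!("{}", String::from_utf8_lossy(&buf[..n]));
        // Minimal valid HTTP response so the sender doesn't hang.
        stream.write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 0\r\n\r\n")?;
    }
    Ok(())
}
```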
Thanks for the work! It's great to see people dive into the codebase to try to add new things :)
I like your idea of having webhooks be more like an indication that something "might have happened", rather than a source of truth, as it lightens the constraints a lot on how to implement them.
Some remarks:
Calling a webhook should rather be a background job. Here you are waiting for the webhook to complete its action before returning to the S3 client that did the PutObject, which could make the entire S3 service much slower if the webhook handler is slow for some reason. (There is a specific idiom in Garage for launching background tasks: `garage.background.spawn(async move { /* your task */ })` instead of `tokio::spawn`, in which case Garage will try to finish as many spawned tasks as possible before exiting when it receives a SIGINT/SIGTERM.)

I understand your idea of launching the webhook at the same time the `put_object` handler starts, so that the webhook handler gets the notification ASAP. However, I think in most cases something like this will happen: the webhook handler gets the notification, tries to read the newly written object, sees nothing because Garage didn't complete the write internally, and goes into an exponential-backoff waiting strategy (a PutObject has several internal steps in Garage, and the completion of the last of these steps is really necessary for the new data to be visible). Since the handler doesn't know how long to wait for, it has to use a pessimistic waiting strategy with a relatively large delay, actually making everything much slower.

An even worse scenario could happen if there was already an object at that key before the PutObject: the webhook handler could read the old version of the object and believe that's the new version, in which case the changes in the new version won't be handled at all by the webhook handler.
In Garage, we do have the guarantee that after `put_object` has finished, any GetObject call will see the new version. I think it's worth the cost of waiting for the `put_object` to finish before spawning the webhook, as it ensures that a webhook handler will always see the updated version. Note that this could also allow the webhook handler to receive information on the return code of the PutObject call. Basically it would look something like this (a rough sketch; handler and helper names are illustrative):
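```rust
// Finish the PutObject first: after this returns, any GetObject
// is guaranteed to see the new version.
let resp = handle_put_object(garage.clone(), req, bucket_id, &key).await;
let status = resp.as_ref().map(|r| r.status().as_u16()).unwrap_or(500);

// Notify in the background so a slow webhook endpoint cannot slow
// down the S3 response path; Garage drains these tasks on shutdown.
garage.background.spawn(async move {
    // `call_webhook` is an illustrative helper, not an existing API.
    call_webhook(bucket_name, key, "PutObject", status).await
});

resp
```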
Also, just a tiny remark on JSON serialization: we use `#[serde(rename_all = "camelCase")]` on all our APIs (see for instance here), so it would make sense to use it also on the webhooks (here it would rename the `hook_type` field to `hookType` in the JSON struct, consistent with the other APIs of Garage).
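For example, applied to the payload struct sketched earlier:

```rust
use serde::Serialize;

#[derive(Serialize)]
#[serde(rename_all = "camelCase")] // `hook_type` serializes as `hookType`
struct WebhookPayload {
    hook_type: String,
    bucket: String,
    key: String,
}
```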
I've had to vanish off the face of the earth due to some family issues. I'm back and hoping to get back into this. I just wanted to give an update now that I'm back on the web.
Thanks for the update @withinboredom, I hope everything is well. Take care.
For the final implementation, it could be interesting to implement event filtering, so that not all notifications are sent to the webhook but only some. We would then say that we guarantee an "at most once" delivery instead of an "at least once"? You can pass only the `TopicConfiguration` entry and it should work with your current implementation.
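For illustration, such a filter (modeled loosely on the S3 notification configuration; every name below is hypothetical) could look like:

```rust
// Hypothetical filter: a set of event patterns plus optional
// key prefix/suffix rules, as in S3's TopicConfiguration.
struct EventFilter {
    events: Vec<String>, // e.g. ["s3:ObjectCreated:*"]
    prefix: Option<String>,
    suffix: Option<String>,
}

impl EventFilter {
    fn matches(&self, event: &str, key: &str) -> bool {
        let event_ok = self.events.iter().any(|pat| match pat.strip_suffix('*') {
            Some(p) => event.starts_with(p), // trailing-wildcard pattern
            None => pat.as_str() == event,   // exact match
        });
        event_ok
            && self.prefix.as_deref().map_or(true, |p| key.starts_with(p))
            && self.suffix.as_deref().map_or(true, |s| key.ends_with(s))
    }
}
```

A notification would then only be posted to the webhook when `matches` returns true.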
Now that we are filtering the events that are sent to the webhook, we can imagine that we don't want all events going to the same webhook.
We could then imagine that the garage config file could contain a list of webhooks with an associated name. Each entry name would be the "topic name".
Pros:
Cons:
Another option would be to configure the webhooks dynamically, through the admin API.
It would still involve some admin work but it could be easier to manage.
But implementing webhooks raises a question: we will have some queuing somewhere, so how do we want to manage it? Do we want to perform this queuing inside Garage or outside of it?
We could also take inspiration from the Ceph implementation; basically, we would follow an even more abstract approach.
Is this alive?
I came here looking to integrate Garage into an event-driven system, similar to how S3/Lambda works today. In this case, at-least-once delivery is quite important.

Could this be handled by Garage K/V? That way the queue itself would be distributed with the same guarantees that K/V provides.

I do also see value in being able to separate the configuration for a bucket from the events defined on that bucket.

Happy to pitch in on this.
It could be stored in a new table in Garage's metadata engine, which implies basically the same guarantees, but means that 1/ queue items are internal data items and are not made available publicly in k2v (they don't pollute the k2v namespace), and 2/ appropriate CRDT data types can be used. That's how I would do it anyway.
Can you give an example?
Closing as there is no active development on this PR. Discussion can continue in #338.
Pull request closed