WIP: Implement preemptive sends to alleviate slow RPC propagation #860
Observations
As part of investigations for #851, we noticed that when some nodes experience slowdowns that fail to be captured in the ping measurements, the entire cluster can suffer as the other nodes start waiting for the slow ones indefinitely. This can be quickly mitigated by setting `rpc_timeout` to a suitable value.

The biggest impact we noted was on the `PutObject` endpoint, which starts with an existing object lookup. If the latter is unexpectedly slow, the entire write slows down. This behaviour was observed on multiple traces in our testing cluster, and was first mitigated by enabling `RequestStrategy::rs_send_all_at_once` on table read operations. We feel there may be a better way to achieve similar improvements, without flooding (potentially larger) clusters with unnecessary RPCs.

Proposal
In scenarios where the RPC scheduling logic fails to take those slowdowns into account, it would be nice to have countermeasures against slow RPCs. For table reads, this is actually already a suggested TODO in `rpc_helper::try_call_many_inner`.

This PR is a first attempt at implementing this, we'd love to have your thoughts on it!
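For readers who have not opened the diff, the gist of the approach is roughly the following. This is a simplified, self-contained sketch rather than the actual patch: `read_with_watchdog`, `avg_ping` and the surrounding types are invented for illustration, while the real code lives in `rpc_helper.rs`.

```rust
use std::time::Duration;
use futures::stream::{FuturesUnordered, StreamExt};

/// Simplified sketch of a quorum read with a preemptive-send watchdog.
/// `send_req` issues one RPC to a peer; `avg_ping` is the current average
/// ping estimate, from which the watchdog delay is derived (2x the ping).
async fn read_with_watchdog<F, Fut, T>(
    peers: Vec<String>,
    quorum: usize,
    avg_ping: Duration,
    send_req: F,
) -> Vec<T>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let watchdog_delay = avg_ping * 2;
    let mut remaining = peers.into_iter();
    let mut in_flight = FuturesUnordered::new();
    let mut results = Vec::new();

    // Start with `quorum` requests, as the current try_call_many logic does.
    for _ in 0..quorum {
        if let Some(peer) = remaining.next() {
            in_flight.push(send_req(peer));
        }
    }

    while results.len() < quorum {
        match tokio::time::timeout(watchdog_delay, in_flight.next()).await {
            // One of the in-flight requests completed.
            Ok(Some(res)) => results.push(res),
            // Nothing left in flight and no more peers to ask: give up.
            Ok(None) => break,
            // Watchdog fired: preemptively contact one more peer,
            // if there is still one we have not asked yet.
            Err(_elapsed) => {
                if let Some(peer) = remaining.next() {
                    in_flight.push(send_req(peer));
                }
            }
        }
    }
    results
}
```

Note that the watchdog branch only adds a request when an uncontacted peer remains, which matches the guard discussed further down in the thread.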
Hello, thank you very much for this proposal! This is definitely going in the right direction.
I think it would be interesting to add metrics, so that we can see how often the watchdog triggers. Ideally, we want it to never trigger if the cluster is not overloaded. Adding such metrics is easy: new counters should be declared in `rpc/metrics.rs` and incremented at the appropriate places in `rpc_helper.rs`. We could have two counters.

There is however a small issue with this way of measuring things: a timed-out watchdog does not necessarily cause a new request to be sent, because maybe there are no more peers to send requests to. More precisely, if we are reaching for a quorum of 2 among 3 nodes, the first watchdog will cause the 3rd request to be sent, but further watchdogs will have no effect. It would be nice if we could distinguish these cases: what we really want to know is the proportion of these 3rd requests that were sent or not. Maybe a simpler way to count would be the number of calls to `try_call_many` with preemptive send enabled, and the number of such calls where a watchdog triggers (only counting the first watchdog trigger, not further ones).

If you have a cluster with a big workload at withings, I would love to see some data on this. It would help validate that this is a relevant approach and that we are not just adding complexity for nothing.
There is another point which I think is important here: how do we measure the average latency of requests to each peer? Currently, the average ping metric that is available, and that you are using in this prototype, is based on ping requests to which the pinged node responds immediately. It does not take into account the time needed to read or write data from disk. This means that if the network is very fast, almost all requests that require reading from metadata will take significantly longer and the watchdog will trigger almost all the time, even if the cluster is running perfectly fine. In practice we would need to measure the average latency of individual RPC requests, such as `TableRpc::ReadEntry`, to have a relevant idea of how long they should take and to avoid triggering the watchdog needlessly.

I think the next step is to try to obtain data: what are typical values for the average ping as it is measured currently? What are typical values for the latency of `ReadEntry` RPCs? With the patch in its current form, do watchdogs trigger only when necessary, or do they trigger all the time even when not needed?
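To make this concrete, here is one possible shape for such a measurement: an illustrative, standalone tracker keeping an exponentially weighted moving average per (peer, RPC kind). Nothing like `RpcLatencyTracker` exists in Garage today.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Illustrative per-peer, per-RPC-kind latency tracker (not Garage code).
/// An exponentially weighted moving average lets a slow metadata read shift
/// the estimate without a single outlier erasing all history.
pub struct RpcLatencyTracker {
    /// Keyed by (peer id, rpc kind), e.g. ("node1", "TableRpc::ReadEntry").
    avg: HashMap<(String, String), Duration>,
    /// Smoothing factor in (0, 1]; higher reacts faster to changes.
    alpha: f64,
}

impl RpcLatencyTracker {
    pub fn new(alpha: f64) -> Self {
        Self { avg: HashMap::new(), alpha }
    }

    /// Record the observed duration of one completed RPC.
    pub fn record(&mut self, peer: &str, rpc_kind: &str, observed: Duration) {
        let key = (peer.to_string(), rpc_kind.to_string());
        let entry = self.avg.entry(key).or_insert(observed);
        let updated = self.alpha * observed.as_secs_f64()
            + (1.0 - self.alpha) * entry.as_secs_f64();
        *entry = Duration::from_secs_f64(updated);
    }

    /// Estimate used to derive a watchdog delay for this peer and RPC kind,
    /// falling back to `default` (e.g. the ping-based value) when unknown.
    pub fn estimate(&self, peer: &str, rpc_kind: &str, default: Duration) -> Duration {
        self.avg
            .get(&(peer.to_string(), rpc_kind.to_string()))
            .copied()
            .unwrap_or(default)
    }
}

fn main() {
    let mut tracker = RpcLatencyTracker::new(0.2);
    tracker.record("node1", "TableRpc::ReadEntry", Duration::from_millis(8));
    let est = tracker.estimate("node1", "TableRpc::ReadEntry", Duration::from_millis(1));
    println!("estimated ReadEntry latency for node1: {:?}", est);
}
```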
The low-tech solution: this is all starting to look like a very complex matter, but in practice, you were able to solve your performance problems by setting `rs_send_all_at_once` on all table RPC requests. Maybe we could just make this behavior part of Garage, possibly with a new configuration option? (We could call it `rpc_always_send_all_at_once`.) This is actually in line with the best distributed-system engineering practices as described by Amazon.
Hello,

Definitely with you regarding metrics, I've just pushed a commit adding `rpc_watchdogs_started_counter` and `rpc_watchdogs_preemption_counter`. I've adjusted the names slightly in the hope of making the meaning clearer: the first counts RPCs sent out under a watchdog, the second counts the number of times an "extra" RPC had to be sent preemptively because a watchdog triggered.

Indeed, the "preemption" counter should only be increased if the timeout is going to lead to another request being sent. I believe this can be enforced by adding a simple check before increasing the target number of outgoing requests (`target_outbound_count`). I have also added this in the commit.
In the scenario you describe, I believe this approach would increase `rpc_watchdogs_started` by 3 (in total) and `rpc_watchdogs_preemption` by 1. Overall, those metrics can tell you how many RPCs are sent out due to a watchdog timing out (in this case, 1/3). Is this what you had in mind?

At the moment, I am only running this on a test cluster with ~2M objects. I think I can give you a quick overview using the dashboard from #851. This is a one-hour benchmark without preemptive sends (and no `send_all_at_once`):

And with preemptive send enabled:
Both in the same picture (with preemptive sends, then without):
The metrics for `rpc_watchdogs_started` and `rpc_watchdogs_preemption` look something like this, with ~5% of all RPCs issued due to a watchdog timeout:

Admittedly, now that !855 is merged, the gain isn't very perceptible (the original issue amplified the delays and this was one of our leads then). We'd have to try it in a cluster where 1 of the 3 nodes is much slower (unfortunately that's not something I have on hand at the moment). When everything is fine and the cluster is healthy, I wouldn't expect a big improvement tbh.
That idea has actually come up in discussion: @quentin suggested eventually moving to a percentile (p95/p99) of the time needed so far for each node to complete a read/write. This would indeed provide interesting metrics for RPC scheduling, although it is probably less straightforward to implement.
For a cluster with 3 nodes and 2 zones, that's perfectly acceptable, yes. It's also a very small network cost in our case, so we could definitely go with that. It should also be noted that this only applies to reads, since writes have their own logic.
I would however be wary of such a feature in a larger cluster, with perhaps weaker or less balanced network links. This may create a significant amount of noise on the network and consume more bandwidth than may feel reasonable (i.e. billable bandwidth).
For small-ish clusters, that sounds like a good compromise IMO. This could be the case for preemptive sends too, should they end up in a release.
Ok, these results look promising!
@quentin Do you have an opinion on this?
I'm kind of tempted to merge this as-is, but I'm just worried that with preemptive send enabled, in the case of slow machines with fast networking, the watchdog would trigger nearly all of the time. Maybe we could make preemptive send an option in the Garage config and leave it disabled by default for now? At least until we implement precise measuring of the time taken for RPCs that read/write metadata.
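For what it's worth, such an opt-in flag would be a very small addition. A self-contained sketch, with an invented field name (`enable_preemptive_rpc_send`) and the `toml` crate standing in for Garage's actual config loading:

```rust
use serde::Deserialize;

/// Hypothetical configuration knob along the lines discussed above; the
/// field name is invented and nothing like it exists in Garage today.
#[derive(Deserialize, Debug, Default)]
pub struct RpcTuning {
    /// Off by default: the watchdog-based preemptive send is only used
    /// when the operator opts in explicitly.
    #[serde(default)]
    pub enable_preemptive_rpc_send: bool,
}

fn main() {
    // Parsing a TOML fragment, the same way a config file section would be read.
    let cfg: RpcTuning = toml::from_str("enable_preemptive_rpc_send = true").unwrap();
    assert!(cfg.enable_preemptive_rpc_send);
}
```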
Hey, very interesting topic! I will review the discussion more carefully and answer tonight or tomorrow morning :-)
➡️ First, can you confirm I correctly understand the issue?
To make an API write (PutObject), we need to perform some internal reads (RPC requests resulting in metadata reads). Our current logic is to pick 2 nodes among the 3, send them the read RPC request, wait for the answer to both requests, and then continue the API write logic. To the best of my knowledge, this behavior is different from Dynamo, Riak or Cassandra, which send the 3 requests at the same time and drop the slowest one (this logic is named `rs_send_all_at_once` in Garage).

In our logic, it appears that if 1 of the 2 picked nodes is slow, the RPC request is slow, thus the API write is slow. In some ways, the RPC read latency is the maximum latency of the 2 nodes. The problem can thus be articulated in 2 ways:
The Dynamo / Cassandra / Riak / `rs_send_all_at_once` approach sends more RPC requests and reads on the cluster as a whole (33% more exactly) to get an RPC read latency that is the median latency of the 3 nodes. In practice, it is a well-known, advertised feature of these pieces of software: it erases tail latency. We can present this as a trade-off: we do more work to have stabler latency. JFK argues however that in some cases this behavior could be an issue, especially in those cases:

➡️ Then, can you also confirm I correctly understand the proposed solution?
The proposed solution aims at detecting slow requests and tries to dynamically erase latency by sending a 3rd request. It expects that, even if the request is sent later, it will complete faster than the one to the slow node. "Slow" is defined as "twice the average network latency".

Based on this definition, it expects that 1) the network latency is quite stable (i.e. jitter is low) and 2) the internal read latency is way lower than the network latency.

In some ways, we can see the proposed solution as a fallback mechanism, which in turn creates 2 operating modes for Garage: the default one, trying to minimize RPC requests, and the fallback one, trying to minimize tail latency.
Sorry, I am running out of time but I have a (strong) opinion on the subject. I will try to share it with you later today. Stay tuned! In the meantime, feel free to tell me if my summary is correct or not.
➡️ It's possible to make safe fallbacks but generally it's better to completely avoid them
Amazon is very clear about avoiding fallbacks in distributed systems. Following Amazon's blog post, I see 2 issues with fallbacks: 1) they are less tested/robust than the normal flow and 2) they are an inefficient pattern preventing better solutions from emerging.
Fallbacks are typically not executed often, hence we often notice bugs only when they are triggered in production. In the end, instead of improving system reliability, they often degrade it. Amazon says that, if you really want to implement a fallback, you should execute it periodically even when the main mechanism is working. In our case, it would mean, for example, sending ~20% of the requests in fallback mode even if everything is fine, and only switching to 100% if there is an observed issue. But even then, some unexpected behavior could emerge only when 100% of the requests are sent in fallback mode.
Fallbacks must also be triggered at the right moment. Here Alex is not sure that it will work as intended on machines with slow I/O and fast networks. In fact, even with fast I/O, the behavior could be strange on fast networks. If your machines are on a fast LAN, their average network RTT could be as low as 1 ms. It means the fallback would be triggered at 2 ms, which can occur for many reasons even with fast I/O.

As mentioned, we don't measure the right thing (network latency instead of network + metadata read). But even if we change the code to do it, we can create an oscillating system: by switching from the normal mode to the fallback mode, we release pressure on a given node, which becomes fast enough again, so we switch back to normal mode, and so on and so forth.

So in the end, we would want to send a certain percentage of the requests as fallback and the rest as normal, and adjust this percentage to find an equilibrium that minimizes the average (or p95 or p99) delay to complete the RPC requests. We fall into control theory here, where the observed value is something related to RPC delay over a window of time, and our action is done by tweaking the percentage of fallback requests vs. normal requests.
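As a toy illustration of that feedback loop (not a proposal for Garage, and all names are invented): a proportional controller that nudges the fallback ratio based on how far an observed p99 latency is from a target.

```rust
/// Toy feedback loop (illustration only): adjust the fraction of requests
/// sent in "fallback" mode based on an observed p99 latency. A real
/// implementation would need smoothing, step-size limits and a carefully
/// chosen observation window.
struct FallbackController {
    /// Fraction of requests sent with the fallback strategy, in [0, 1].
    ratio: f64,
    /// Latency we are trying to stay under, in milliseconds.
    target_p99_ms: f64,
    /// Proportional gain: how aggressively we react to the error.
    gain: f64,
}

impl FallbackController {
    /// Called once per observation window with the measured p99 latency.
    fn update(&mut self, observed_p99_ms: f64) {
        let error = observed_p99_ms - self.target_p99_ms;
        // Positive error (too slow) pushes the ratio up, negative pulls it down.
        self.ratio = (self.ratio + self.gain * error).clamp(0.0, 1.0);
    }

    /// Decide, per request, whether to use the fallback strategy.
    fn use_fallback(&self, rand01: f64) -> bool {
        rand01 < self.ratio
    }
}

fn main() {
    let mut ctl = FallbackController { ratio: 0.2, target_p99_ms: 50.0, gain: 0.001 };
    ctl.update(120.0); // observed p99 of 120 ms: increase the fallback ratio
    println!("fallback ratio is now {:.2}", ctl.ratio);
    println!("use fallback for this request: {}", ctl.use_fallback(0.1));
}
```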
More philosophically, Amazon argues that fallbacks are an inefficient pattern. Indeed, if 1) the fallback is an acceptable mechanism and 2) it is expected to work when the main mechanism does not, why not use the fallback every time, as the default solution? Here, it would mean always using `rs_send_all_at_once`, as proposed by Alex.

It has been pointed out that using this fallback as the default could increase the bandwidth cost and have a performance impact. If these concerns materialize, triggering this behavior dynamically is potentially worse than setting it as the default. Indeed, operators dimension and benchmark their systems by observing the default behavior; once deployed, their cloud bill may suddenly increase by 33% or their network may saturate, without them understanding why - we can't assume all operators are familiar with Garage internals.

In some ways, I fell into the same trap with LMDB: I only benchmarked it when "LMDB size" < "RAM size" and did not notice that the behavior would change once this condition no longer holds, leading to misleading benchmarks and poor configuration.

As a conclusion, I don't see any benefit in switching dynamically from one behavior to another: systems are designed and tuned according to an expected workload, and changing the behavior breaks these assumptions. Instead of implementing a fallback mechanism, Amazon recommends investing energy in building a better default system, and that's what I will try to draft in my next messages.
➡️ Some alternatives to implementing a fallback
As a reminder, we have more or less 3 goals (that are more or less connected) with Garage currently:
Switching to `rs_send_all_at_once` - RPC requests on tables are expected to be way smaller than blocks, so we can expect that the additional data sent is negligible in terms of bandwidth. Nodes in the cluster will have to process more requests, which means more work done by the cluster overall. But currently, as RPC requests are not spread, the performance gain is not spread among the nodes either, hence the cluster does not benefit from it.

Try a spread logic on `try_call_many` - Instead of sorting by latency, we could benchmark the cluster while randomizing the nodes (see the sketch after this list). We would then compare it to 1) the current `try_call_many` and 2) `rs_send_all_at_once`. In fact, the current `try_call_many` is efficient only when the main source of latency is the network latency due to distance - and not load. So the question would be: 1) should we randomize nodes, or 2) send all at once, or 3) try to be more clever, at the risk of doing dumb things. In the end, I think that if we want

Use better datastructures - We store some queues in LMDB, a sorted collection. Sorted collections are expensive datastructures, and queues don't need to be sorted according to a key. We could completely bypass LMDB in this case, either by implementing our own on-disk queue or by using an existing one. Considering the PutObject workload, it would be beneficial as many writes are done in the "todo merkle tree", which is, to the best of my knowledge, a queue.
Implement backpressure logic - We have unbounded queues in some places in the code (the merkle todo is an example of such infinite queues). I think we should avoid them and instead block, so that the parent / calling process is informed. Queues don't fix overload.
Avoiding mmap - We learnt that mmap might be the source of our performance issue. Implementing an LSM-tree metadata engine that doesn't rely on mmap, like RocksDB or Fjall, could be very interesting.
Continue instrumenting the code & benchmarking - I still don't completely understand these waves we see when 1) the cluster has many cores and 2) it is benchmarked with minio warp on PutObject. We might have missed something.
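Here is the sketch referenced in the "spread logic" item above: a standalone illustration of randomized peer selection versus the current sort-by-ping selection. Function names and the `(peer, ping)` representation are invented for the example.

```rust
use rand::seq::SliceRandom;

/// Sketch of the "spread" idea (illustration only): instead of always picking
/// the `quorum` peers with the lowest measured ping, shuffle the preference
/// list so that read load is spread evenly across nodes.
fn pick_peers_spread(mut peers: Vec<String>, quorum: usize) -> Vec<String> {
    let mut rng = rand::thread_rng();
    peers.shuffle(&mut rng);
    peers.truncate(quorum);
    peers
}

/// The current behavior, for comparison: sort by ping estimate and keep the
/// fastest `quorum` peers (ping values here are placeholders).
fn pick_peers_by_ping(mut peers: Vec<(String, u64)>, quorum: usize) -> Vec<String> {
    peers.sort_by_key(|(_, ping_ms)| *ping_ms);
    peers.into_iter().take(quorum).map(|(peer, _)| peer).collect()
}

fn main() {
    let names = vec!["n1".to_string(), "n2".to_string(), "n3".to_string()];
    println!("spread:  {:?}", pick_peers_spread(names.clone(), 2));
    let with_pings = vec![(names[0].clone(), 12), (names[1].clone(), 3), (names[2].clone(), 7)];
    println!("by ping: {:?}", pick_peers_by_ping(with_pings, 2));
}
```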
➡️ Finally my opinion on the subject
I am not a huge fan of this dynamic fallback. I would prefer that we switch to `rs_send_all_at_once`, as it's the most stable / reliable / robust strategy (when a node is failing, user experience is not impacted), without giving the user the possibility to choose.

Which would mean, I confess, abandoning this pull request while keeping the idea of sending to all 3 nodes.
Enthusiasm: ➕➕
If we want users to select their RPC strategy, I would prefer that 1) we add an `experimental` section in `garage.toml` with configurations we don't commit to maintaining from one version to another, and 2) we add a field like `rpc_read_strategy={high_latency_link,spread_load,reduce_tail_latency}` and try to benchmark these strategies in the near future. After a while, we would have a discussion to stabilize this configuration.

It would mean abandoning the "dynamic selection" based on latency from this pull request but keeping the idea of multiple RPC strategies.
Enthusiasm: ➕
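As a sketch of what such an experimental setting could look like: only the field name and the three variants come from the message above; the enum layout, the serde derives and the `[experimental]` struct are assumptions.

```rust
use serde::Deserialize;

/// Hypothetical experimental setting mirroring the proposal above.
#[derive(Deserialize, Clone, Copy, Debug, Default)]
#[serde(rename_all = "snake_case")]
pub enum RpcReadStrategy {
    /// Current behavior: contact only a quorum, picking low-latency peers.
    #[default]
    HighLatencyLink,
    /// Randomize peer selection to spread read load across the cluster.
    SpreadLoad,
    /// Send to all nodes at once and keep the fastest quorum of answers.
    ReduceTailLatency,
}

/// Hypothetical [experimental] section of garage.toml.
#[derive(Deserialize, Debug, Default)]
pub struct ExperimentalConfig {
    #[serde(default)]
    pub rpc_read_strategy: RpcReadStrategy,
}
```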
If we really want to try a fallback logic, I would recommend we add it as a 4th option, `rpc_read_strategy={...,dynamic}`, and document the cases where we think it could be useful. I think the work needed to cleverly switch from one mode to another is not obvious at all. I think we should not merge the existing dynamic selection mechanism, and instead try to build a more robust one. Also, I think the change should be done progressively (adjusting a ratio) instead of being binary, as the strategy we use has an impact on the observed latency.

It would mean keeping the whole idea of this PR but redesigning/reworking the "dynamic selection".
Enthusiasm: ➖
Finally, I think we would benefit from more knowledge of the current Garage behavior. So before committing to a solution, I think exploring yaqueue could be an easy way to both 1) add backpressure and 2) reduce pressure on LMDB, and to see whether it drastically changes the observed behavior or not. I don't know whether RocksDB or Fjall is the better LSM-tree candidate, but the storage interface is not that big, so not something huge to code either, and again we would be able to evaluate LMDB's impact here. Collecting additional metrics and doing new benchmarks on powerful servers would also help us get a more accurate picture.

I think we still need to collect some knowledge about how Garage behaves under pressure, as I don't fully understand what happens currently.
Enthusiasm: ➕➕
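To illustrate the backpressure point from the list above in isolation (a generic tokio example, unrelated to Garage's actual internals): with a bounded channel, a producer that outruns the consumer is made to wait instead of letting a todo queue grow without limit.

```rust
use tokio::sync::mpsc;

/// Generic illustration of backpressure with a bounded queue: when the
/// consumer (think: a worker persisting merkle-tree updates) falls behind,
/// `send` waits instead of letting the queue grow without bound, so the
/// overload becomes visible upstream instead of piling up silently.
#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(100); // at most 100 pending items

    let producer = tokio::spawn(async move {
        for i in 0..1_000u64 {
            // Waits (asynchronously) once 100 items are already pending.
            tx.send(i).await.expect("consumer dropped");
        }
    });

    let consumer = tokio::spawn(async move {
        while let Some(item) = rx.recv().await {
            // Simulate slow processing of one queue item.
            tokio::time::sleep(std::time::Duration::from_millis(1)).await;
            let _ = item;
        }
    });

    producer.await.unwrap();
    consumer.await.unwrap();
}
```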
Thank you very much for your feedback!
I agree that improving the node selection metric beyond ping times may in fact be a prerequisite before such a strategy can be considered.
Regarding the use of the broadcasting `send_all_at_once` strategy option, I am tempted to agree as well. This does offer better determinism, and I am now unsure that a dynamic approach would make much sense in a cluster where preference lists are more balanced. In a small cluster, the option would allow administrators to simply disable all node selection logic for reads, which may at times be suitable. This PR does not bring in that feature.

To be honest, I am no longer convinced that this strategy would help. The initial results outlined above only showed minor improvements, and other leads have now emerged regarding the performance issues in #851.

All in all, I have no objection to closing this PR for the time being. The idea it introduces adds a significant amount of complexity to the code, while "simpler" approaches (improved latency metric, send all at once) could offer similar or better results.