WIP: Implement preemptive sends to alleviate slow RPC propagation #860
Observations
As part of investigations for #851, we noticed that when some nodes experience slowdowns that fail to be captured in the ping measurements, the entire cluster can suffer as the other nodes start waiting for the slow ones indefinitely. This can be quickly mitigated by setting `rpc_timeout` to a suitable value.

The biggest impact we noted was on the `PutObject` endpoint, which starts with an existing object lookup. If the latter is unexpectedly slow, the entire write slows down. This behaviour was observed on multiple traces in our testing cluster, and was first mitigated by enabling `RequestStrategy::rs_send_all_at_once` on table read operations. We feel there may be a better way to achieve similar improvements, without flooding (potentially larger) clusters with unnecessary RPCs.

Proposal
In scenarios where the RPC scheduling logic fails to take those slowdowns into account, it would be nice to have countermeasures against slow RPCs. For table reads, this is actually already a suggested TODO in `rpc_helper::try_call_many_inner`.

This PR is a first attempt at implementing this, we'd love to have your thoughts on it!
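For readers who have not opened the diff, the gist of the approach is roughly the following. This is a simplified, self-contained sketch rather than the actual patch: `read_with_watchdog`, `avg_ping` and the surrounding types are invented for illustration, while the real code lives in `rpc_helper.rs`.

```rust
use std::time::Duration;
use futures::stream::{FuturesUnordered, StreamExt};

/// Simplified sketch of a quorum read with a preemptive-send watchdog.
/// `send_req` issues one RPC to a peer; `avg_ping` is the current average
/// ping estimate, from which the watchdog delay is derived (2x the ping).
async fn read_with_watchdog<F, Fut, T>(
    peers: Vec<String>,
    quorum: usize,
    avg_ping: Duration,
    send_req: F,
) -> Vec<T>
where
    F: Fn(String) -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let watchdog_delay = avg_ping * 2;
    let mut remaining = peers.into_iter();
    let mut in_flight = FuturesUnordered::new();
    let mut results = Vec::new();

    // Start with `quorum` requests, as the current try_call_many logic does.
    for _ in 0..quorum {
        if let Some(peer) = remaining.next() {
            in_flight.push(send_req(peer));
        }
    }

    while results.len() < quorum {
        match tokio::time::timeout(watchdog_delay, in_flight.next()).await {
            // One of the in-flight requests completed.
            Ok(Some(res)) => results.push(res),
            // Nothing left in flight and no more peers to ask: give up.
            Ok(None) => break,
            // Watchdog fired: preemptively contact one more peer,
            // if there is still one we have not asked yet.
            Err(_elapsed) => {
                if let Some(peer) = remaining.next() {
                    in_flight.push(send_req(peer));
                }
            }
        }
    }
    results
}
```

Note that the watchdog branch only adds a request when an uncontacted peer remains, which matches the guard discussed further down in the thread.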
Hello, thank you very much for this proposal! This is definitely going in the right direction.
I think it would be interesting to add metrics, so that we can see how often the watchdog triggers. Ideally, we want it to never trigger if the cluster is not overloaded. Adding such metrics is easy: new counters should be declared in `rpc/metrics.rs` and incremented at the appropriate places in `rpc_helper.rs`. We could have two counters.

There is however a small issue with this way of measuring things: a timed-out watchdog does not necessarily cause a new request to be sent, because maybe there are no more peers to send requests to. More precisely, if we are reaching for a quorum of 2 among 3 nodes, the first watchdog will cause the 3rd request to be sent, but further watchdogs will have no effect. It would be nice if we could distinguish these cases: what we really want to know is the proportion of these 3rd requests that were sent or not. Maybe a simpler way to count would be the number of calls to `try_call_many` with preemptive send enabled, and the number of such calls where a watchdog triggers (only counting the first watchdog trigger, not further ones).

If you have a cluster with a big workload at withings, I would love to see some data on this. It would help validate that this is a relevant approach and that we are not just adding complexity for nothing.
There is another point which I think is important here: how do we measure the average latency of requests to each peer? Currently, the average ping metric that is available, and that you are using in this prototype, is based on ping requests to which the pinged node responds immediately. It does not take into account the time needed to read or write data from disk. This means that if the network is very fast, almost all requests that require reading from metadata will take significantly longer and the watchdog will trigger almost all the time, even if the cluster is running perfectly fine. In practice we would need to measure the average latency of individual RPC requests, such as `TableRpc::ReadEntry`, to have a relevant idea of how long they should take and to avoid triggering the watchdog needlessly.

I think the next step is to try to obtain data: what are typical values for the average ping as it is measured currently? What are typical values for the latency of `ReadEntry` RPCs? With the patch in its current form, do watchdogs trigger only when necessary, or do they trigger all the time even when not needed?
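To make this concrete, here is one possible shape for such a measurement: an illustrative, standalone tracker keeping an exponentially weighted moving average per (peer, RPC kind). Nothing like `RpcLatencyTracker` exists in Garage today.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Illustrative per-peer, per-RPC-kind latency tracker (not Garage code).
/// An exponentially weighted moving average lets a slow metadata read shift
/// the estimate without a single outlier erasing all history.
pub struct RpcLatencyTracker {
    /// Keyed by (peer id, rpc kind), e.g. ("node1", "TableRpc::ReadEntry").
    avg: HashMap<(String, String), Duration>,
    /// Smoothing factor in (0, 1]; higher reacts faster to changes.
    alpha: f64,
}

impl RpcLatencyTracker {
    pub fn new(alpha: f64) -> Self {
        Self { avg: HashMap::new(), alpha }
    }

    /// Record the observed duration of one completed RPC.
    pub fn record(&mut self, peer: &str, rpc_kind: &str, observed: Duration) {
        let key = (peer.to_string(), rpc_kind.to_string());
        let entry = self.avg.entry(key).or_insert(observed);
        let updated = self.alpha * observed.as_secs_f64()
            + (1.0 - self.alpha) * entry.as_secs_f64();
        *entry = Duration::from_secs_f64(updated);
    }

    /// Estimate used to derive a watchdog delay for this peer and RPC kind,
    /// falling back to `default` (e.g. the ping-based value) when unknown.
    pub fn estimate(&self, peer: &str, rpc_kind: &str, default: Duration) -> Duration {
        self.avg
            .get(&(peer.to_string(), rpc_kind.to_string()))
            .copied()
            .unwrap_or(default)
    }
}

fn main() {
    let mut tracker = RpcLatencyTracker::new(0.2);
    tracker.record("node1", "TableRpc::ReadEntry", Duration::from_millis(8));
    let est = tracker.estimate("node1", "TableRpc::ReadEntry", Duration::from_millis(1));
    println!("estimated ReadEntry latency for node1: {:?}", est);
}
```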
The low-tech solution: this is all starting to look like a very complex matter, but in practice, you were able to solve your performance problems by setting `rs_send_all_at_once` on all table RPC requests. Maybe we could just make this behavior part of Garage, possibly with a new configuration option? (We could call it `rpc_always_send_all_at_once`.) This is actually in line with the best distributed-system engineering practices as described by Amazon.
Hello,

Definitely with you regarding metrics, I've just pushed a commit adding `rpc_watchdogs_started_counter` and `rpc_watchdogs_preemption_counter`. I've adjusted the names slightly in the hope of making the meaning clearer: the first counts RPCs sent out under a watchdog, the second counts the number of times an "extra" RPC had to be sent preemptively because a watchdog triggered.

Indeed, the "preemption" counter should only be increased if the timeout is going to lead to another request being sent. I believe this can be enforced by adding a simple check before increasing the target number of outgoing requests (`target_outbound_count`). I have also added this in the commit.
In the scenario you describe, I believe this approach would increase `rpc_watchdogs_started` by 3 (in total) and `rpc_watchdogs_preemption` by 1. Overall, those metrics can tell you how many RPCs are sent out due to a watchdog timing out (in this case, 1/3). Is this what you had in mind?

At the moment, I am only running this on a test cluster with ~2M objects. I think I can give you a quick overview using the dashboard from #851. This is a one-hour benchmark without preemptive sends (and no `send_all_at_once`):

And with preemptive send enabled:
Both in the same picture (with preemptive sends, then without):
The metrics for `rpc_watchdogs_started` and `rpc_watchdogs_preemption` look something like this, with ~5% of all RPCs issued due to a watchdog timeout:

Admittedly, now that !855 is merged, the gain isn't very perceptible (the original issue amplified the delays and this was one of our leads then). We'd have to try it in a cluster where 1 of the 3 nodes is much slower (unfortunately that's not something I have on hand at the moment). When everything is fine and the cluster is healthy, I wouldn't expect a big improvement tbh.
That idea has actually come up in discussion: @quentin suggested eventually moving to a percentile (p95/p99) of the time needed so far for each node to complete a read/write. This would indeed provide interesting metrics for RPC scheduling, although it is probably less straightforward to implement.
For a cluster with 3 nodes and 2 zones, that's perfectly acceptable, yes. It's also a very small network cost in our case, so we could definitely go with that. It should also be noted that this only applies to reads, since writes have their own logic.
I would however be wary of such a feature in a larger cluster, with perhaps weaker or less balanced network links. This may create a significant amount of noise on the network and consume more bandwidth than may feel reasonable (i.e. billable bandwidth).
For small-ish clusters, that sounds like a good compromise IMO. This could be the case for preemptive sends too, should they end up in a release.
Ok, these results look promising!
@quentin Do you have an opinion on this?
I'm kind of tempted to merge this as-is, but I'm just worried that with preemptive send enabled, in the case of slow machines with fast networking, the watchdog would trigger nearly all of the time. Maybe we could make preemptive send an option in the Garage config and leave it disabled by default for now? At least until we implement precise measuring of the time taken for RPCs that read/write metadata.
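For what it's worth, such an opt-in flag would be a very small addition. A self-contained sketch, with an invented field name (`enable_preemptive_rpc_send`) and the `toml` crate standing in for Garage's actual config loading:

```rust
use serde::Deserialize;

/// Hypothetical configuration knob along the lines discussed above; the
/// field name is invented and nothing like it exists in Garage today.
#[derive(Deserialize, Debug, Default)]
pub struct RpcTuning {
    /// Off by default: the watchdog-based preemptive send is only used
    /// when the operator opts in explicitly.
    #[serde(default)]
    pub enable_preemptive_rpc_send: bool,
}

fn main() {
    // Parsing a TOML fragment, the same way a config file section would be read.
    let cfg: RpcTuning = toml::from_str("enable_preemptive_rpc_send = true").unwrap();
    assert!(cfg.enable_preemptive_rpc_send);
}
```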
Hey, very interesting topic! I will review the discussion more carefully and answer tonight or tomorrow morning :-)
➡️ First, can you confirm I correctly understand the issue?
To make an API write (PutObject), we need to perform some internal reads (RPC requests resulting in metadata reads). Our current logic is to pick 2 nodes among the 3, send them the read RPC request, wait for the answer to both requests, and then continue the API write logic. To the best of my knowledge, this behavior is different from Dynamo, Riak or Cassandra, which send the 3 requests at the same time and drop the slowest one (this logic is named `rs_send_all_at_once` in Garage).

In our logic, it appears that if 1 of the 2 picked nodes is slow, the RPC request is slow, thus the API write is slow. In some ways, the RPC read latency is the maximum latency of the 2 nodes. The problem can thus be articulated in 2 ways:
The Dynamo / Cassandra / Riak / `rs_send_all_at_once` approach sends more RPC requests and reads on the cluster as a whole (33% more exactly) to get an RPC read latency that is the median latency of the 3 nodes. In practice, it is a well-known, advertised feature of these pieces of software: it erases tail latency. We can present this as a trade-off: we do more work to have stabler latency. JFK argues however that in some cases this behavior could be an issue, especially in those cases:

➡️ Then, can you also confirm I correctly understand the proposed solution?
The proposed solution aims at detecting slow requests and tries to dynamically erase latency by sending a 3rd request. It expects that, even if the request is sent later, it will complete faster than the one to the slow node. "Slow" is defined as "twice the average network latency".

Based on this definition, it expects that 1) the network latency is quite stable (i.e. jitter is low) and 2) the internal read latency is way lower than the network latency.

In some ways, we can see the proposed solution as a fallback mechanism, which in turn creates 2 operating modes for Garage: the default one, trying to minimize RPC requests, and the fallback one, trying to minimize tail latency.
Sorry, I am running out of time but I have a (strong) opinion on the subject. I will try to share it with you later today. Stay tuned! In the meantime, feel free to tell me if my summary is correct or not.
➡️ It's possible to make safe fallbacks but generally it's better to completely avoid them
Amazon is very clear about avoiding fallbacks in distributed systems. Following Amazon's blog post, I see 2 issues with fallbacks: 1) they are less tested/robust than the normal flow and 2) they are an inefficient pattern preventing better solutions from emerging.
Fallbacks are typically not executed often, hence we often notice bugs only when they are triggered in production. In the end, instead of improving system reliability, they often degrade it. Amazon says that, if you really want to implement a fallback, you should execute it periodically even when the main mechanism is working. In our case, it would mean, for example, sending ~20% of the requests in fallback mode even if everything is fine, and only switching to 100% if there is an observed issue. But even then, some unexpected behavior could emerge only when 100% of the requests are sent in fallback mode.
Fallbacks must also be triggered at the right moment. Here Alex is not sure that it will work as intended on machines with slow I/O and fast networks. In fact, even with fast I/O, the behavior could be strange on fast networks. If your machines are on a fast LAN, their average network RTT could be as low as 1 ms. It means the fallback would be triggered at 2 ms, which can occur for many reasons even with fast I/O.

As mentioned, we don't measure the right thing (network latency instead of network + metadata read). But even if we change the code to do it, we can create an oscillating system: by switching from the normal mode to the fallback mode, we release pressure on a given node, which becomes fast enough again, so we switch back to normal mode, and so on and so forth.

So in the end, we would want to send a certain percentage of the requests as fallback and the rest as normal, and adjust this percentage to find an equilibrium that minimizes the average (or p95 or p99) delay to complete the RPC requests. We fall into control theory here, where the observed value is something related to RPC delay over a window of time, and our action is done by tweaking the percentage of fallback requests vs. normal requests.
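As a toy illustration of that feedback loop (not a proposal for Garage, and all names are invented): a proportional controller that nudges the fallback ratio based on how far an observed p99 latency is from a target.

```rust
/// Toy feedback loop (illustration only): adjust the fraction of requests
/// sent in "fallback" mode based on an observed p99 latency. A real
/// implementation would need smoothing, step-size limits and a carefully
/// chosen observation window.
struct FallbackController {
    /// Fraction of requests sent with the fallback strategy, in [0, 1].
    ratio: f64,
    /// Latency we are trying to stay under, in milliseconds.
    target_p99_ms: f64,
    /// Proportional gain: how aggressively we react to the error.
    gain: f64,
}

impl FallbackController {
    /// Called once per observation window with the measured p99 latency.
    fn update(&mut self, observed_p99_ms: f64) {
        let error = observed_p99_ms - self.target_p99_ms;
        // Positive error (too slow) pushes the ratio up, negative pulls it down.
        self.ratio = (self.ratio + self.gain * error).clamp(0.0, 1.0);
    }

    /// Decide, per request, whether to use the fallback strategy.
    fn use_fallback(&self, rand01: f64) -> bool {
        rand01 < self.ratio
    }
}

fn main() {
    let mut ctl = FallbackController { ratio: 0.2, target_p99_ms: 50.0, gain: 0.001 };
    ctl.update(120.0); // observed p99 of 120 ms: increase the fallback ratio
    println!("fallback ratio is now {:.2}", ctl.ratio);
    println!("use fallback for this request: {}", ctl.use_fallback(0.1));
}
```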
More philosophically, Amazon argues that fallbacks are an inefficient pattern. Indeed, if 1) the fallback is an acceptable mechanism and 2) it is expected to work when the main mechanism does not, why not use the fallback every time, as the default solution? Here, it would mean always using `rs_send_all_at_once`, as proposed by Alex.

It has been pointed out that using this fallback as the default could increase the bandwidth cost and have a performance impact. If these concerns materialize, triggering this behavior dynamically is potentially worse than setting it as the default. Indeed, operators dimension and benchmark their systems by observing the default behavior; once deployed, their cloud bill may suddenly increase by 33% or their network may saturate, without them understanding why - we can't assume all operators are familiar with Garage internals.

In some ways, I fell into the same trap with LMDB: I only benchmarked it when "LMDB size" < "RAM size" and did not notice that the behavior would change once this condition no longer holds, leading to misleading benchmarks and poor configuration.

As a conclusion, I don't see any benefit in switching dynamically from one behavior to another: systems are designed and tuned according to an expected workload, and changing the behavior breaks these assumptions. Instead of implementing a fallback mechanism, Amazon recommends investing energy in building a better default system, and that's what I will try to draft in my next messages.
➡️ Some alternatives to implementing a fallback
As a reminder, we have more or less 3 goals (that are more or less connected) with Garage currently:
Switching to `rs_send_all_at_once` - RPC requests on tables are expected to be way smaller than blocks, so we can expect that the additional data sent is negligible in terms of bandwidth. Nodes in the cluster will have to process more requests, which means more work done by the cluster overall. But currently, as RPC requests are not spread, the performance gain is not spread among the nodes either, hence the cluster does not benefit from it.

Try a spread logic on `try_call_many` - Instead of sorting by latency, we could benchmark the cluster while randomizing the nodes (see the sketch after this list). We would then compare it to 1) the current `try_call_many` and 2) `rs_send_all_at_once`. In fact, the current `try_call_many` is efficient only when the main source of latency is the network latency due to distance - and not load. So the question would be: 1) should we randomize nodes, or 2) send all at once, or 3) try to be more clever, at the risk of doing dumb things. In the end, I think that if we want

Use better datastructures - We store some queues in LMDB, a sorted collection. Sorted collections are expensive datastructures, and queues don't need to be sorted according to a key. We could completely bypass LMDB in this case, either by implementing our own on-disk queue or by using an existing one. Considering the PutObject workload, it would be beneficial as many writes are done in the "todo merkle tree", which is, to the best of my knowledge, a queue.
Implement backpressure logic - We have unbounded queues in some places in the code (the merkle todo is an example of such infinite queues). I think we should avoid them and instead block, so that the parent / calling process is informed. Queues don't fix overload.
Avoiding mmap - We learnt that mmap might be the source of our performance issue. Implementing an LSM-tree metadata engine that doesn't rely on mmap, like RocksDB or Fjall, could be very interesting.
Continue instrumenting the code & benchmarking - I still don't completely understand these waves we see when 1) the cluster has many cores and 2) it is benchmarked with minio warp on PutObject. We might have missed something.
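Here is the sketch referenced in the "spread logic" item above: a standalone illustration of randomized peer selection versus the current sort-by-ping selection. Function names and the `(peer, ping)` representation are invented for the example.

```rust
use rand::seq::SliceRandom;

/// Sketch of the "spread" idea (illustration only): instead of always picking
/// the `quorum` peers with the lowest measured ping, shuffle the preference
/// list so that read load is spread evenly across nodes.
fn pick_peers_spread(mut peers: Vec<String>, quorum: usize) -> Vec<String> {
    let mut rng = rand::thread_rng();
    peers.shuffle(&mut rng);
    peers.truncate(quorum);
    peers
}

/// The current behavior, for comparison: sort by ping estimate and keep the
/// fastest `quorum` peers (ping values here are placeholders).
fn pick_peers_by_ping(mut peers: Vec<(String, u64)>, quorum: usize) -> Vec<String> {
    peers.sort_by_key(|(_, ping_ms)| *ping_ms);
    peers.into_iter().take(quorum).map(|(peer, _)| peer).collect()
}

fn main() {
    let names = vec!["n1".to_string(), "n2".to_string(), "n3".to_string()];
    println!("spread:  {:?}", pick_peers_spread(names.clone(), 2));
    let with_pings = vec![(names[0].clone(), 12), (names[1].clone(), 3), (names[2].clone(), 7)];
    println!("by ping: {:?}", pick_peers_by_ping(with_pings, 2));
}
```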
➡️ Finally my opinion on the subject
I am not a huge fan of this dynamic fallback. I would prefer that we switch to `rs_send_all_at_once`, as it's the most stable / reliable / robust strategy (when a node is failing, user experience is not impacted), without giving the user the possibility to choose.

Which would mean, I confess, abandoning this pull request while keeping the idea of sending to all 3 nodes.
Enthusiasm: ➕➕
If we want users to select their RPC strategy, I would prefer that 1) we add an `experimental` section in `garage.toml` with configurations we don't commit to maintaining from one version to another, and 2) we add a field like `rpc_read_strategy={high_latency_link,spread_load,reduce_tail_latency}` and try to benchmark these strategies in the near future. After a while, we would have a discussion to stabilize this configuration.

It would mean abandoning the "dynamic selection" based on latency from this pull request but keeping the idea of multiple RPC strategies.
Enthusiasm: ➕
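As a sketch of what such an experimental setting could look like: only the field name and the three variants come from the message above; the enum layout, the serde derives and the `[experimental]` struct are assumptions.

```rust
use serde::Deserialize;

/// Hypothetical experimental setting mirroring the proposal above.
#[derive(Deserialize, Clone, Copy, Debug, Default)]
#[serde(rename_all = "snake_case")]
pub enum RpcReadStrategy {
    /// Current behavior: contact only a quorum, picking low-latency peers.
    #[default]
    HighLatencyLink,
    /// Randomize peer selection to spread read load across the cluster.
    SpreadLoad,
    /// Send to all nodes at once and keep the fastest quorum of answers.
    ReduceTailLatency,
}

/// Hypothetical [experimental] section of garage.toml.
#[derive(Deserialize, Debug, Default)]
pub struct ExperimentalConfig {
    #[serde(default)]
    pub rpc_read_strategy: RpcReadStrategy,
}
```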
If we really want to try a fallback logic, I would recommend we add it as a 4th option, `rpc_read_strategy={...,dynamic}`, and document the cases where we think it could be useful. I think the work needed to cleverly switch from one mode to another is not obvious at all. I think we should not merge the existing dynamic selection mechanism, and instead try to build a more robust one. Also, I think the change should be done progressively (adjusting a ratio) instead of being binary, as the strategy we use has an impact on the observed latency.

It would mean keeping the whole idea of this PR but redesigning/reworking the "dynamic selection".
Enthusiasm: ➖
Finally, I think we would benefit from more knowledge of the current Garage behavior. So before committing to a solution, I think exploring yaqueue could be an easy way to both 1) add backpressure and 2) reduce pressure on LMDB, and to see whether it drastically changes the observed behavior or not. I don't know whether RocksDB or Fjall is the better LSM-tree candidate, but the storage interface is not that big, so not something huge to code either, and again we would be able to evaluate LMDB's impact here. Collecting additional metrics and doing new benchmarks on powerful servers would also help us get a more accurate picture.

I think we still need to collect some knowledge about how Garage behaves under pressure, as I don't fully understand what happens currently.
Enthusiasm: ➕➕
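To illustrate the backpressure point from the list above in isolation (a generic tokio example, unrelated to Garage's actual internals): with a bounded channel, a producer that outruns the consumer is made to wait instead of letting a todo queue grow without limit.

```rust
use tokio::sync::mpsc;

/// Generic illustration of backpressure with a bounded queue: when the
/// consumer (think: a worker persisting merkle-tree updates) falls behind,
/// `send` waits instead of letting the queue grow without bound, so the
/// overload becomes visible upstream instead of piling up silently.
#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u64>(100); // at most 100 pending items

    let producer = tokio::spawn(async move {
        for i in 0..1_000u64 {
            // Waits (asynchronously) once 100 items are already pending.
            tx.send(i).await.expect("consumer dropped");
        }
    });

    let consumer = tokio::spawn(async move {
        while let Some(item) = rx.recv().await {
            // Simulate slow processing of one queue item.
            tokio::time::sleep(std::time::Duration::from_millis(1)).await;
            let _ = item;
        }
    });

    producer.await.unwrap();
    consumer.await.unwrap();
}
```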
Thank you very much for your feedback!
I agree that improving the node selection metric beyond ping times may in fact be a prerequisite before such a strategy can be considered.
Regarding the use of the broadcasting `send_all_at_once` strategy option, I am tempted to agree as well. This does offer better determinism, and I am now unsure that a dynamic approach would make much sense in a cluster where preference lists are more balanced. In a small cluster, the option would allow administrators to simply disable all node selection logic for reads, which may at times be suitable. This PR does not bring in that feature.

To be honest, I am no longer convinced that this strategy would help. The initial results outlined above only showed minor improvements, and other leads have now emerged regarding the performance issues in #851.

All in all, I have no objection to closing this PR for the time being. The idea it introduces adds a significant amount of complexity to the code, while "simpler" approaches (improved latency metric, send all at once) could offer similar or better results.