When TCP sockets are not closed by the OS, a node failure is not reported #264
Expected
Stopping a node should not impact the response time or the availability of the data. The node must be listed as failed in `garage status`.
Observed
All requests take 10 seconds to complete. The node is still listed as healthy.
How to reproduce
Given this docker-compose
And this config.toml:
Then setup:
Then test:
But it does not seem 100% deterministic: sometimes the request is answered in 1 second even if a node is unavailable.
We have some netapp logs complaining that it is not able to communicate with both servers:
After thinking about the problem, it might be linked to a special case where we do not diagnose a node as offline even though it is. It might be due to the fact that we rely only on the TCP socket being closed by the OS to mark a node as failed. Building on this first hypothetical bug, the non-deterministic 10 s timeout might be due to our request scheduler. In the original paper we took inspiration from, 3 requests are sent to the cluster at the same time, and as soon as 2 are successfully answered, a response is sent. We have the same approach, but we do not send the 3 requests at the same time: in our case, probably 2 are sent at the same time, and a 3rd one only if one of the first 2 times out, after 10 seconds. The non-determinism comes from the fact that, sometimes, Garage chooses the 2 remaining available servers first, and sometimes it chooses 1 remaining + 1 down server first, waits for the timeout on the unavailable server, and only then sends the request to the 3rd one.
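To make this hypothesis concrete, here is a minimal, self-contained sketch of the "contact only a quorum of nodes first, add one more on timeout" strategy described above. It is not Garage's actual `rpc_helper` code: the node list, the 10-second `RPC_TIMEOUT` value, and the `call_node` helper are assumptions for illustration, and it only needs the `tokio` and `futures` crates.

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::timeout;

const RPC_TIMEOUT: Duration = Duration::from_secs(10); // assumed value

// Simulated RPC to one node: a "down" node never answers and hits the timeout.
async fn call_node(node: usize, down: bool) -> Result<usize, &'static str> {
    if down {
        tokio::time::sleep(Duration::from_secs(3600)).await;
    }
    Ok(node)
}

// Lazy strategy: contact only `quorum` nodes at first, and fall back to an
// extra node only after a timeout or an error on one of them.
async fn quorum_lazy(nodes: &[(usize, bool)], quorum: usize) -> Vec<usize> {
    let mut pending = FuturesUnordered::new();
    let mut next = 0;
    // Launch only as many requests as needed for the quorum.
    while pending.len() < quorum && next < nodes.len() {
        let (id, down) = nodes[next];
        pending.push(timeout(RPC_TIMEOUT, call_node(id, down)));
        next += 1;
    }
    let mut successes = Vec::new();
    while let Some(res) = pending.next().await {
        match res {
            Ok(Ok(id)) => successes.push(id),
            _ => {
                // Only now is an additional node contacted, so a request that
                // initially picked a down node pays the full RPC_TIMEOUT.
                if next < nodes.len() {
                    let (id, down) = nodes[next];
                    pending.push(timeout(RPC_TIMEOUT, call_node(id, down)));
                    next += 1;
                }
            }
        }
        if successes.len() >= quorum {
            break;
        }
    }
    successes
}

#[tokio::main]
async fn main() {
    // In this sketch the down node (1) is always among the two contacted
    // first, so the call only answers after the full 10 s timeout. Garage's
    // real scheduler orders nodes by estimated latency, so which nodes go
    // first varies, hence the non-determinism described above.
    let nodes = [(0, false), (1, true), (2, false)];
    println!("quorum reached with nodes {:?}", quorum_lazy(&nodes, 2).await);
}
```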
To confirm this hypothesis, we will need to:
My hypothesis does not explain why Netapp says it has failed to communicate with 2 nodes (while only 1 is unavailable).
Some tests by @Rune: https://pastebin.com/7bvLV6Us
I've been going over the code and trying to narrow down the problem more.
- The server first hangs at `src/api/s3_get.rs`, in the `pub async fn handle_get` function, line 187: the `.get` call initiates TABLE_RPC traffic for the object_table.
- Lowering the `TABLE_RPC_TIMEOUT` timeout to 1 second makes all requests time out after 1-2 seconds (and also makes it much quicker to troubleshoot). More precisely, we can limit this change to the `get` function (a rough sketch is given below). But this is of course a workaround at best, since it now always takes 1 second to make each request.
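As a rough illustration of that workaround, the snippet below shows what restricting a shorter timeout to the read path could look like. Only the `TABLE_RPC_TIMEOUT` name is taken from the comment above; the `table_get` wrapper and everything around it are hypothetical, not Garage's actual table code.

```rust
use std::time::Duration;

use tokio::time::{error::Elapsed, timeout};

// Value used by the workaround above: 1 s instead of the default 10 s.
const TABLE_RPC_TIMEOUT: Duration = Duration::from_secs(1);

// Hypothetical wrapper limiting the shorter timeout to the read path only:
// an RPC to a down node now gives up after ~1 s instead of ~10 s, but any
// legitimately slow read is also cut off after 1 s.
async fn table_get<F, T>(rpc: F) -> Result<T, Elapsed>
where
    F: std::future::Future<Output = T>,
{
    timeout(TABLE_RPC_TIMEOUT, rpc).await
}
```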
I traced this down to the `try_call_many` function in `src/rpc/rpc_helper.rs`, which is working as intended, but it perhaps begs the question whether querying all nodes at once would "fix" the client impact of this issue (see the sketch below). Of course this is also a workaround for the nodes not being marked as failed. I've just started going over the netapp dependency, but I've yet to grok exactly how it works.
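For comparison with the earlier sketch, this is roughly what the "query all nodes at once" idea could look like. Again, this is an illustrative sketch rather than Garage's actual `try_call_many`; `call_node` and `RPC_TIMEOUT` are the same assumed stand-ins as before.

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::timeout;

const RPC_TIMEOUT: Duration = Duration::from_secs(10); // assumed value

// Same simulated RPC as in the previous sketch.
async fn call_node(node: usize, down: bool) -> Result<usize, &'static str> {
    if down {
        tokio::time::sleep(Duration::from_secs(3600)).await;
    }
    Ok(node)
}

// Eager variant: contact every node right away and return as soon as a
// quorum of them has answered, so a single down node no longer adds a full
// RPC_TIMEOUT to the request latency.
async fn quorum_eager(nodes: &[(usize, bool)], quorum: usize) -> Vec<usize> {
    let mut pending: FuturesUnordered<_> = nodes
        .iter()
        .map(|&(id, down)| timeout(RPC_TIMEOUT, call_node(id, down)))
        .collect();
    let mut successes = Vec::new();
    while let Some(res) = pending.next().await {
        if let Ok(Ok(id)) = res {
            successes.push(id);
            if successes.len() >= quorum {
                break;
            }
        }
    }
    successes
}
```

The trade-off is extra RPC traffic on every request, which is why this only addresses the client-side latency and not the missing failure detection.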
Netapp regularly tries to ping other nodes to estimate link latency, which is then used in Garage's `rpc_helper` to prioritize the nodes to which a request is made. The following things seem to be wrong with Netapp's ping logic:

All of this is in Netapp's `src/peering/fullmesh.rs`. We need to fix both points so that when pings time out, the connection is closed (there is no way in the Netapp API to signal that a connection is faulty: when an error occurs on the network, the connection is closed, and that is how Netapp signals an error). We probably don't want to close the connection as soon as a single ping message fails, though; we probably want a counter, and close the connection when we reach 3 failed attempts (a sketch of this logic is given below, after the note).

Issue title changed from "docker-compose stop node leads to 10s commands & missing node is not detected" to "When TCP sockets are not closed by the OS, a node failure is not reported".

Note: Netapp is our low-level library to handle network communications. Its repository is hosted here: https://git.deuxfleurs.fr/lx/netapp
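Here is a minimal sketch of the counter-based logic proposed above: keep pinging, reset the counter on success, and close the connection after 3 consecutive failed pings. This is not Netapp's actual `src/peering/fullmesh.rs` code; the `Connection` stand-in and the interval/timeout values are assumptions.

```rust
use std::time::Duration;

const PING_INTERVAL: Duration = Duration::from_secs(10); // assumed value
const PING_TIMEOUT: Duration = Duration::from_secs(5); // assumed value
const MAX_FAILED_PINGS: u32 = 3;

// Minimal stand-in for the real connection type.
struct Connection;

impl Connection {
    // A ping that may fail or hang; in the real library this is a network RPC.
    async fn ping(&self) -> Result<(), ()> {
        Ok(())
    }

    // Closing the connection is how the failure is signalled to the rest of
    // the stack, since there is no separate "this peer is faulty" API.
    fn close(&self) {}
}

async fn ping_loop(conn: Connection) {
    let mut failed_pings: u32 = 0;
    loop {
        tokio::time::sleep(PING_INTERVAL).await;
        match tokio::time::timeout(PING_TIMEOUT, conn.ping()).await {
            // A successful ping resets the counter.
            Ok(Ok(())) => failed_pings = 0,
            // A timeout or an error increments it; after 3 consecutive
            // failures we close the connection ourselves instead of waiting
            // for the OS to eventually close the TCP socket.
            _ => {
                failed_pings += 1;
                if failed_pings >= MAX_FAILED_PINGS {
                    conn.close();
                    return;
                }
            }
        }
    }
}
```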
It is possible to build the examples shipped with netapp by using its Makefile (just run `make`), and then use the `fullmesh` binary to reproduce the bug (and check it has been fixed).

I pushed an update to Netapp, and we have a Garage version that uses the updated version in the `update-netapp` branch. Can you test this branch and see if it works better now?

This works far better.
Dropping all traffic to gar3 causes the gar3 node to be marked as failed in 10-20 seconds. After the node is marked as failed, client performance is unaffected.
As expected, I get ~10 second response times from my client if I make a request right away after blocking gar3.
Similarly, dropping all traffic to gar2 and gar3 at the same time causes the client to temporarily see 10 s response times, and then an immediate quorum error once the 10-20 seconds needed to notice the node outage have passed.
Unblocking traffic puts the node back online within 5-6 seconds and client traffic is served again (if it had just regained quorum). Client requests are completely unaffected when going from 2 -> 3 working nodes.
I'd say this fixes the problem in a satisfactory way, but, if I had to be nitpicky, a small tweak could be made to the timeout for the client GET requests. 10 seconds is a fairly long time if you place the request moments before a node is marked as failed.
Thank you for your detailed feedback.
I'm afraid that lowering `TABLE_RPC_TIMEOUT` too much could possibly lead to RPC requests timing out not because the other node is down, but just because they're legitimately taking a long time to process, or because of temporary network congestion. We would risk losing quorum in cases where waiting just a little longer would have allowed us to receive enough responses. It's always a hard compromise, defining timeout values, but to me it's better to risk being slow in a node-failure scenario (an abnormal condition) than to risk timeout-induced failures when all nodes are up but just answering a bit slowly (which can happen much more often).