When TCP sockets are not closed by the OS, a node failure is not reported #264
Expected
Stopping a node should not impact the response time or the availability of the data. The node must be listed as failed in `garage status`.
Observed
All requests take 10 seconds to complete. The node is still listed as healthy.
How to reproduce
Given this docker-compose
And this config.toml:
Then setup:
Then test:
But it does not seem 100% deterministic: sometimes the request is answered in 1 second even if a node is unavailable.
We have some netapp logs complaining that it is not able to communicate with both servers:
After thinking about the problem, it might be linked to a special case where we do not diagnose a node as offline even though it is. It might be due to the fact that we rely only on the TCP socket being closed by the OS to mark a node as failed. Building on this first hypothetical bug, the non-deterministic 10 s timeout might be due to our request scheduler. In the original paper we took inspiration from, 3 requests are sent to the cluster at the same time, and as soon as 2 are successfully answered, a response is sent. We have the same approach, but we do not send the 3 requests at the same time: in our case, probably 2 are sent at the same time, and a 3rd one only if one of the first 2 times out, after 10 seconds. The non-determinism comes from the fact that, sometimes, Garage chooses the 2 remaining available servers first, and sometimes it chooses 1 remaining + 1 down server first, waits for the timeout on the unavailable server, and only then sends the request to the 3rd one.
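To make this hypothesis concrete, here is a minimal, self-contained sketch of the "contact only a quorum of nodes first, add one more on timeout" strategy described above. It is not Garage's actual `rpc_helper` code: the node list, the 10-second `RPC_TIMEOUT` value, and the `call_node` helper are assumptions for illustration, and it only needs the `tokio` and `futures` crates.

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::timeout;

const RPC_TIMEOUT: Duration = Duration::from_secs(10); // assumed value

// Simulated RPC to one node: a "down" node never answers and hits the timeout.
async fn call_node(node: usize, down: bool) -> Result<usize, &'static str> {
    if down {
        tokio::time::sleep(Duration::from_secs(3600)).await;
    }
    Ok(node)
}

// Lazy strategy: contact only `quorum` nodes at first, and fall back to an
// extra node only after a timeout or an error on one of them.
async fn quorum_lazy(nodes: &[(usize, bool)], quorum: usize) -> Vec<usize> {
    let mut pending = FuturesUnordered::new();
    let mut next = 0;
    // Launch only as many requests as needed for the quorum.
    while pending.len() < quorum && next < nodes.len() {
        let (id, down) = nodes[next];
        pending.push(timeout(RPC_TIMEOUT, call_node(id, down)));
        next += 1;
    }
    let mut successes = Vec::new();
    while let Some(res) = pending.next().await {
        match res {
            Ok(Ok(id)) => successes.push(id),
            _ => {
                // Only now is an additional node contacted, so a request that
                // initially picked a down node pays the full RPC_TIMEOUT.
                if next < nodes.len() {
                    let (id, down) = nodes[next];
                    pending.push(timeout(RPC_TIMEOUT, call_node(id, down)));
                    next += 1;
                }
            }
        }
        if successes.len() >= quorum {
            break;
        }
    }
    successes
}

#[tokio::main]
async fn main() {
    // In this sketch the down node (1) is always among the two contacted
    // first, so the call only answers after the full 10 s timeout. Garage's
    // real scheduler orders nodes by estimated latency, so which nodes go
    // first varies, hence the non-determinism described above.
    let nodes = [(0, false), (1, true), (2, false)];
    println!("quorum reached with nodes {:?}", quorum_lazy(&nodes, 2).await);
}
```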
To confirm this hypothesis, we will need to:
My hypothesis does not explain why Netapp says it has failed to communicate with 2 nodes (while only 1 is unavailable).
Some tests by @Rune: https://pastebin.com/7bvLV6Us
I've been going over the code and trying to narrow down the problem more.
- The server first hangs at `src/api/s3_get.rs`, in the `pub async fn handle_get` function, line 187: the `.get` call initiates TABLE_RPC traffic for the object_table.
- Lowering the `TABLE_RPC_TIMEOUT` timeout to 1 second makes all requests time out after 1-2 seconds (and also makes it much quicker to troubleshoot). More precisely, we can limit this change to the `get` function (a rough sketch is given below). But this is of course a workaround at best, since it now always takes 1 second to make each request.
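As a rough illustration of that workaround, the snippet below shows what restricting a shorter timeout to the read path could look like. Only the `TABLE_RPC_TIMEOUT` name is taken from the comment above; the `table_get` wrapper and everything around it are hypothetical, not Garage's actual table code.

```rust
use std::time::Duration;

use tokio::time::{error::Elapsed, timeout};

// Value used by the workaround above: 1 s instead of the default 10 s.
const TABLE_RPC_TIMEOUT: Duration = Duration::from_secs(1);

// Hypothetical wrapper limiting the shorter timeout to the read path only:
// an RPC to a down node now gives up after ~1 s instead of ~10 s, but any
// legitimately slow read is also cut off after 1 s.
async fn table_get<F, T>(rpc: F) -> Result<T, Elapsed>
where
    F: std::future::Future<Output = T>,
{
    timeout(TABLE_RPC_TIMEOUT, rpc).await
}
```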
I traced this down to the `try_call_many` function in `src/rpc/rpc_helper.rs`, which is working as intended, but it perhaps begs the question whether querying all nodes at once would "fix" the client impact of this issue (see the sketch below). Of course this is also a workaround for the nodes not being marked as failed. I've just started going over the netapp dependency, but I've yet to grok exactly how it works.
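For comparison with the earlier sketch, this is roughly what the "query all nodes at once" idea could look like. Again, this is an illustrative sketch rather than Garage's actual `try_call_many`; `call_node` and `RPC_TIMEOUT` are the same assumed stand-ins as before.

```rust
use std::time::Duration;

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::timeout;

const RPC_TIMEOUT: Duration = Duration::from_secs(10); // assumed value

// Same simulated RPC as in the previous sketch.
async fn call_node(node: usize, down: bool) -> Result<usize, &'static str> {
    if down {
        tokio::time::sleep(Duration::from_secs(3600)).await;
    }
    Ok(node)
}

// Eager variant: contact every node right away and return as soon as a
// quorum of them has answered, so a single down node no longer adds a full
// RPC_TIMEOUT to the request latency.
async fn quorum_eager(nodes: &[(usize, bool)], quorum: usize) -> Vec<usize> {
    let mut pending: FuturesUnordered<_> = nodes
        .iter()
        .map(|&(id, down)| timeout(RPC_TIMEOUT, call_node(id, down)))
        .collect();
    let mut successes = Vec::new();
    while let Some(res) = pending.next().await {
        if let Ok(Ok(id)) = res {
            successes.push(id);
            if successes.len() >= quorum {
                break;
            }
        }
    }
    successes
}
```

The trade-off is extra RPC traffic on every request, which is why this only addresses the client-side latency and not the missing failure detection.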
Netapp regularly tries to ping other nodes to estimate link latency, which is then used in Garage's `rpc_helper` to prioritize the nodes to which a request is made. The following things seem to be wrong with Netapp's ping logic:

All of this is in Netapp's `src/peering/fullmesh.rs`. We need to fix both points so that when pings time out, the connection is closed (there is no way in the Netapp API to signal that a connection is faulty: when an error occurs on the network, the connection is closed, and that is how Netapp signals an error). We probably don't want to close the connection as soon as a single ping message fails, though; we probably want a counter, and close the connection when we reach 3 failed attempts (a sketch of this logic is given below, after the note).

Issue title changed from "docker-compose stop node leads to 10s commands & missing node is not detected" to "When TCP sockets are not closed by the OS, a node failure is not reported".

Note: Netapp is our low-level library to handle network communications. Its repository is hosted here: https://git.deuxfleurs.fr/lx/netapp
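Here is a minimal sketch of the counter-based logic proposed above: keep pinging, reset the counter on success, and close the connection after 3 consecutive failed pings. This is not Netapp's actual `src/peering/fullmesh.rs` code; the `Connection` stand-in and the interval/timeout values are assumptions.

```rust
use std::time::Duration;

const PING_INTERVAL: Duration = Duration::from_secs(10); // assumed value
const PING_TIMEOUT: Duration = Duration::from_secs(5); // assumed value
const MAX_FAILED_PINGS: u32 = 3;

// Minimal stand-in for the real connection type.
struct Connection;

impl Connection {
    // A ping that may fail or hang; in the real library this is a network RPC.
    async fn ping(&self) -> Result<(), ()> {
        Ok(())
    }

    // Closing the connection is how the failure is signalled to the rest of
    // the stack, since there is no separate "this peer is faulty" API.
    fn close(&self) {}
}

async fn ping_loop(conn: Connection) {
    let mut failed_pings: u32 = 0;
    loop {
        tokio::time::sleep(PING_INTERVAL).await;
        match tokio::time::timeout(PING_TIMEOUT, conn.ping()).await {
            // A successful ping resets the counter.
            Ok(Ok(())) => failed_pings = 0,
            // A timeout or an error increments it; after 3 consecutive
            // failures we close the connection ourselves instead of waiting
            // for the OS to eventually close the TCP socket.
            _ => {
                failed_pings += 1;
                if failed_pings >= MAX_FAILED_PINGS {
                    conn.close();
                    return;
                }
            }
        }
    }
}
```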
It is possible to build the examples shipped with netapp by using its Makefile (just run `make`), and then use the `fullmesh` binary to reproduce the bug (and check it has been fixed).

I pushed an update to Netapp, and we have a Garage version that uses the updated version in the `update-netapp` branch. Can you test this branch and see if it works better now?

This works far better.
Dropping all traffic to gar3 causes the gar3 node to be marked as failed in 10-20 seconds. After the node is marked as failed, client performance is unaffected.
As expected, I get ~10 second response times from my client if I make a request right away after blocking gar3.
Similarly, dropping all traffic to gar2 and gar3 at the same time causes the client to temporarily see 10 s response times, and then an immediate quorum error once the 10-20 seconds needed to notice the node outage have passed.
Unblocking traffic puts the node back online within 5-6 seconds and client traffic is served again (if it had just regained quorum). Client requests are completely unaffected when going from 2 -> 3 working nodes.
I'd say this fixes the problem in a satisfactory way, but, if I had to be nitpicky, a small tweak could be made to the timeout for the client GET requests. 10 seconds is a fairly long time if you place the request moments before a node is marked as failed.
Thank you for your detailed feedback.
I'm afraid that lowering `TABLE_RPC_TIMEOUT` too much could possibly lead to RPC requests timing out not because the other node is down, but just because they're legitimately taking a long time to process, or because of temporary network congestion. We would risk losing quorum in cases where waiting just a little longer would have allowed us to receive enough responses. It's always a hard compromise, defining timeout values, but to me it's better to risk being slow in a node-failure scenario (an abnormal condition) than to risk timeout-induced failures when all nodes are up but just answering a bit slowly (which can happen much more often).