mknet/liveness.md

# Liveness issues

Some users have reported timeouts when putting load on their Garage cluster.
Our production cluster runs without much pressure, so we do not really observe this behavior there.

But when I tried to run a benchmark created by Minio developers, I hit the same limit.
So I wanted to reproduce this behavior in a more controlled environment.

## Reproducing the error in mknet

I used mknet to emulate a simple network with close to zero latency but with a very small bandwidth: 1 Mbit/s. The idea is that the network will be the bottleneck, and not the CPU, the memory or the disk, even on my low-powered laptop.
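
To get a feel for how tight this link is, here is a back-of-the-envelope sketch of the ideal transfer time at 1 Mbit/s. It ignores protocol overhead and replication traffic, and the helper name `transfer_seconds` is only illustrative:

```python
def transfer_seconds(size_bytes: int, link_bits_per_s: float) -> float:
    """Ideal time to push size_bytes through a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_s


LINK = 1_000_000   # 1 Mbit/s, the bandwidth emulated with mknet
MIB = 1024 * 1024

for size_mib in (1, 10, 100):
    secs = transfer_seconds(size_mib * MIB, LINK)
    print(f"{size_mib:>3} MiB -> {secs:7.1f} s")
```

A single 1 MiB transfer already takes more than 8 seconds on this link, and a 100 MiB object close to 14 minutes, so any fixed timeout on the data path is easy to exceed.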

Shortly after starting the benchmark, we observe that the cluster is not reacting very well:

```
[nix-shell:/home/quentin/Documents/dev/deuxfleurs/mknet]# warp get --host=[fc00:9a7a:9e::1]:3900 --obj.size 100MiB --obj.randsize --duration=10m --concurrent 8 --objects 200 --access-key=GKc1e16da48142bdb95d98a4e4 --secret-key=c4ef5d5f7ee24ccae12a98912bf5b1fda28120a7e3a8f90cb3710c8683478b31
Creating Bucket "warp-benchmark-bucket"...Element { tag_name: {http://s3.amazonaws.com/doc/2006-03-01/}LocationConstraint, attributes: [], namespaces: [Namespace { name: None, uri: "http://s3.amazonaws.com/doc/2006-03-01/" }] }
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/xVdzjy23/1.KximayVLlhLwfE5f.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/N4zQvKhs/1.XkkO6DJ%28hVpGIrMj.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/GQrsevhN/1.7hglGIP%28mXTJMgFE.rnd": read tcp [fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:57008->[fc00:9a7a:9e::1]:3900: read: connection reset by peer

warp: <ERROR> Error preparing server: upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"].
```

We observe many different types of errors, which I categorize as follows:
  - [RPC] Quorum errors caused by timeouts, which are probably generated by a ping between nodes
  - [RPC] "Not connected" errors: after a timeout, a reconnection is not triggered immediately
  - [S3 Gateway] The gateway took too much time to answer and a timeout was triggered in the client
  - [S3 Gateway] The S3 gateway closes the TCP connection before answering
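
To count how often each category shows up in a longer run, the lines can be bucketed with simple substring matching. This is a minimal sketch: the substrings come from the log excerpt above, while the category labels and the `classify` helper are my own naming, not something provided by Garage or warp.

```python
from collections import Counter


def classify(line: str) -> str:
    """Map one error line to one of the four categories (labels are illustrative)."""
    if "Could not reach quorum" in line:
        if "Not connected" in line:
            return "rpc-not-connected"
        if '"Timeout"' in line:
            return "rpc-timeout"
    if "i/o timeout" in line:
        return "gateway-client-timeout"
    if "connection reset by peer" in line:
        return "gateway-connection-reset"
    return "other"


# Shortened sample lines modeled on the warp output above.
sample = [
    'Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]',
    'Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55"]',
    "dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout",
    "read: connection reset by peer",
]

print(Counter(classify(line) for line in sample))
```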

As a first conclusion, we have narrowed the scope of the problem: this undesirable behavior is triggered by a network bottleneck.

Looking at the Garage logs, we see that:
  - node1, which is our S3 gateway, has many quorum errors and netapp "not connected" errors, which are the same as the ones reported earlier
  - node2 and node3, which are only used as storage nodes, have no errors or warnings in their logs

It is starting to look like a congestion control / flow control / scheduling issue: our S3 gateway seems to receive more data than it can send over the network, which in turn triggers timeouts, which trigger disconnections, and then everything breaks.
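
To illustrate this hypothesis, here is a toy model of a gateway that accepts data faster than it can forward it. It is not a model of Garage's actual scheduler: the rates, the 30 second timeout and the `first_timeout` function are all assumptions chosen only for the example.

```python
def first_timeout(ingress_bps, egress_bps, timeout_s, step_s=1.0, max_s=3600.0):
    """Return the first time (in seconds) at which the queued data would need
    longer than timeout_s to drain, or None if that never happens."""
    backlog_bits = 0.0
    t = 0.0
    while t < max_s:
        # The backlog grows by the surplus of what comes in over what goes out.
        backlog_bits = max(0.0, backlog_bits + (ingress_bps - egress_bps) * step_s)
        t += step_s
        # A message queued now waits for the whole backlog to drain first.
        if backlog_bits / egress_bps > timeout_s:
            return t
    return None


# Hypothetical numbers: clients push 2 Mbit/s into a gateway with a 1 Mbit/s uplink.
print(first_timeout(ingress_bps=2e6, egress_bps=1e6, timeout_s=30))
# If the gateway only accepted what it can forward, the backlog would never grow.
print(first_timeout(ingress_bps=1e6, egress_bps=1e6, timeout_s=30))
```

In this simplified picture, any sustained surplus of incoming data eventually pushes the queueing delay past a fixed timeout, which matches the symptom we see: errors on the gateway only, and none on the storage nodes.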

## Writing a custom client exhibiting the issue

We know how to trigger the issue with `warp`, Minio's benchmark tool, but we don't yet understand well what kind of load it puts on the cluster, except that it sends Multipart and PutObject requests concurrently. So, before investigating the issue in more depth, we want to know:
  - Can a single large PUT request trigger this issue?
  - How many parallel requests are needed to trigger this issue?
  - Are Multipart transfers more impacted by this issue?

To answer these questions, we wrote a dedicated benchmark.
Named s3concurrent, it is available here: https://git.deuxfleurs.fr/quentin/s3concurrent
The benchmark starts by sending 1 file, then 2 files concurrently,
then 3, then 4, up to 16 (this is hardcoded for now).
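
The ramp-up logic can be sketched as follows. This is not the code of s3concurrent, only an illustration of the idea: `upload` is a hypothetical callable standing in for a real S3 PutObject call, and the key naming is invented.

```python
from concurrent.futures import ThreadPoolExecutor


def ramp(upload, max_parallel=16):
    """Run 1 upload, then 2 at once, ... up to max_parallel at once.

    Returns a dict mapping each concurrency level to its number of failures.
    """
    failures = {}
    for n in range(1, max_parallel + 1):
        failed = 0
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(upload, f"file-{n}-{i}") for i in range(n)]
            for fut in futures:
                try:
                    fut.result()
                except Exception:
                    failed += 1
        failures[n] = failed
    return failures


def fake_upload(key):
    # Stand-in for a real S3 PutObject call (hypothetical); always succeeds.
    return key


print(ramp(fake_upload, max_parallel=4))
```

Counting failures per concurrency level makes it easy to see at which level the cluster starts to misbehave.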

When run on our mknet cluster, it starts triggering issues as soon as we send 2 files at once:

```
$ ./s3concurrent
2022/08/11 20:35:28 created bucket 3ffd6798-bdab-4218-b6d0-973a07e46ea9
2022/08/11 20:35:28 start concurrent loop with 1 coroutines
2022/08/11 20:35:55 done, 1 coroutines returned
2022/08/11 20:35:55 start concurrent loop with 2 coroutines
2022/08/11 20:36:34 1/2 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/11 20:36:37 done, 2 coroutines returned
2022/08/11 20:36:37 start concurrent loop with 3 coroutines
2022/08/11 20:37:13 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 done, 3 coroutines returned
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
```