I used mknet to emulate a simple network with close to zero latency but with a very small bandwidth: 1Mbit/s. The idea is that the network will be the bottleneck, but not the CPU, the memory or the disk, even on my low powered laptop. The following configuration describes the simulated network configuration I used:
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/GQrsevhN/1.7hglGIP%28mXTJMgFE.rnd": read tcp [fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:57008->[fc00:9a7a:9e::1]:3900: read: connection reset by peer
warp: <ERROR> Error preparing server: upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"].
As a first conclusion, we started to clearly reduce the scope of the problem by identifying that this undesirable behavior is triggered by a network bottleneck.
Looking at Garage logs, we see that:
- node1, which is our S3 gateway, has many quorum errors / netapp not connected errors, which are the same than the ones reported earlier
- node2 and node3 which are only used as storage nodes, have no error/warn in their logs
It starts to really look like a congestion control/control flow error/scheduler issue: our S3 gateway seems to receive more data than it can send over the network, which in turn trigger timeouts, that trigger disconnect, and breaks everything.
We know how to trigger the issue with `warp`, Minio's benchmark tool but we don't yet understand well what kind of load it puts on the cluster except that it sends concurrently Multipart and PutObject requests concurrently. So, before investigating the issue more in depth, we want to know:
Named s3concurrent, it is available here: https://git.deuxfleurs.fr/quentin/s3concurrent
The benchmark starts by sending 1 file, then 2 files concurrently,
then 3, then 4, up to 16 (this is hardcoded for now).
When ran on our mknet cluster, we start triggering issues as soon as we send 2 files at once:
```
$ ./s3concurrent
2022/08/11 20:35:28 created bucket 3ffd6798-bdab-4218-b6d0-973a07e46ea9
2022/08/11 20:35:28 start concurrent loop with 1 coroutines
2022/08/11 20:35:55 done, 1 coroutines returned
2022/08/11 20:35:55 start concurrent loop with 2 coroutines
2022/08/11 20:36:34 1/2 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/11 20:36:37 done, 2 coroutines returned
2022/08/11 20:36:37 start concurrent loop with 3 coroutines
2022/08/11 20:37:13 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 done, 3 coroutines returned
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
We observe that Garage starts its timeout as soon as 2 coroutines are executed in parallel. We know that pushing blocks on the same channel as RPC messages is a weakness of Garage, as while a block is sent, no RPC message can be sent. In our specific deployment, it seems that sending 2 blocks takes enough time to trigger a timeout.
That's also the reason why we selected small blocks at the beginning (1MB and not 10MB), and why we are working on a mechanism to transfer blocks on a dedicated channel (https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343).
So the idea is to observe the behaviour of Garage with a smaller block size: RPC Ping messages will probably be better multiplexed in this case. We select 128 KiB blocks instead of 1MiB ones (10 times smaller).
This time, we can handle 2 coroutines at once but not 3:
```
2022/08/12 10:50:08 created bucket a565074b-0609-4f5f-8d46-389f86565197
2022/08/12 10:50:08 start concurrent loop with 1 coroutines
2022/08/12 10:50:32 done, 1 coroutines returned
2022/08/12 10:50:32 start concurrent loop with 2 coroutines
2022/08/12 10:51:18 done, 2 coroutines returned
2022/08/12 10:51:18 start concurrent loop with 3 coroutines
2022/08/12 10:51:35 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:51:45 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 done, 3 coroutines returned
2022/08/12 10:51:45 start concurrent loop with 4 coroutines
2022/08/12 10:52:09 1/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:13 2/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:13 3/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:15 4/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:15 done, 4 coroutines returned
```
So we see that improving multiplexing opportunities has a positive impact, it does not solve all our problems! We might need a scheduler to always prioritize Ping RPC, and also ensure after that we have a bounded queue - otherwise we will simply fill a queue.