# Liveness issues

We know that some people have reported timeouts when putting some load on their Garage cluster.

On our production cluster, which runs without pressure, we don't really observe this behavior.

But when I wanted to run a benchmark created by the Minio developers, I hit the same limit.

So I wanted to reproduce this behavior in a more controlled environment.

I thus chose to use mknet to emulate a simple network with close to zero latency but a very small bandwidth (1M). The idea is that the network, and not the CPU, the memory, or the disk, will be the bottleneck.
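
As a rough reference for what this shaping means, a similar constraint can be approximated on a single interface with tc/netem; the snippet below is only an illustrative sketch (eth0 is a placeholder name, and mknet drives its own virtual interfaces from its topology file, which is not shown here). Assuming 1M means 1 Mbit/s, a 100 MiB object (the maximum size used by the benchmark below) already needs on the order of 14 minutes to transfer, so timeouts are not surprising.

```
# Illustrative sketch only: cap an interface at ~1 Mbit/s with a negligible
# added delay, similar in spirit to the network emulated here with mknet.
# "eth0" is a placeholder interface name.
tc qdisc add dev eth0 root netem delay 1ms rate 1mbit

# Remove the shaping once the experiment is over
tc qdisc del dev eth0 root
```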

We quickly observe that the cluster is not reacting very well:

```
[nix-shell:/home/quentin/Documents/dev/deuxfleurs/mknet]# warp get --host=[fc00:9a7a:9e::1]:3900 --obj.size 100MiB --obj.randsize --duration=10m --concurrent 8 --objects 200 --access-key=GKc1e16da48142bdb95d98a4e4 --secret-key=c4ef5d5f7ee24ccae12a98912bf5b1fda28120a7e3a8f90cb3710c8683478b31
Creating Bucket "warp-benchmark-bucket"...Element { tag_name: {http://s3.amazonaws.com/doc/2006-03-01/}LocationConstraint, attributes: [], namespaces: [Namespace { name: None, uri: "http://s3.amazonaws.com/doc/2006-03-01/" }] }
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/xVdzjy23/1.KximayVLlhLwfE5f.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/N4zQvKhs/1.XkkO6DJ%28hVpGIrMj.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/GQrsevhN/1.7hglGIP%28mXTJMgFE.rnd": read tcp [fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:57008->[fc00:9a7a:9e::1]:3900: read: connection reset by peer

warp: <ERROR> Error preparing server: upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"].
```

We observe many different types of errors:

- [RPC] Timeout quorum errors, probably generated by pings between nodes
- [RPC] Not connected errors: after a timeout, a reconnection is not triggered directly
- [S3 Gateway] The gateway took too much time to answer and a timeout was triggered in the client
- [S3 Gateway] The S3 gateway closes the TCP connection before answering

As a first conclusion, we have clearly narrowed the scope of the problem by identifying that this undesirable behavior is triggered by a network bottleneck.