forked from Deuxfleurs/mknet
Starting my write-up
This commit is contained in:
parent 5a8be6fc73
commit e83f059675
4 changed files with 62 additions and 2 deletions
@@ -26,7 +26,7 @@ you must add to your path some tools.
 ```bash
 # see versions on https://garagehq.deuxfleurs.fr/_releases.html
 export GRG_ARCH=x86_64-unknown-linux-musl
-export GRG_VERSION=v0.5.0
+export GRG_VERSION=v0.7.2.1
 
 sudo wget https://garagehq.deuxfleurs.fr/_releases/${GRG_VERSION}/${GRG_ARCH}/garage -O /usr/local/bin/garage
 sudo chmod +x /usr/local/bin/garage
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 set -euo pipefail
 IFS=$'\n\t'
liveness.md (new file, 35 lines)
@@ -0,0 +1,35 @@
# Liveness issues

We know that some people have reported timeouts when putting some load on their Garage cluster.
On our production cluster, which runs without pressure, we do not really observe this behavior.

But when I wanted to run a benchmark created by the Minio developers, I hit the same limit.
So I wanted to reproduce this behavior in a more controlled environment.

I thus chose to use mknet to emulate a simple network with close to zero latency but a very small bandwidth, 1M (the slow-net.yml scenario added in this commit). The idea is that the network, and not the CPU, the memory or the disk, will be the bottleneck.
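
To get an idea of the orders of magnitude (my own back-of-the-envelope estimate, assuming that 1M in the scenario means 1 Mbit/s), pushing a single object of up to 100 MiB, the size used by the benchmark below, already keeps such a link busy for many minutes:

```bash
# Time to transfer one 100 MiB object over a 1 Mbit/s link:
# 100 MiB = 100 * 1024 * 1024 * 8 bits; at 10^6 bits/s that is ~838 s (~14 min),
# far beyond any usual client or RPC timeout.
echo $(( 100 * 1024 * 1024 * 8 / 1000000 ))   # prints 838 (seconds)
```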

After a short while, we observe that the cluster is not reacting very well:

```
[nix-shell:/home/quentin/Documents/dev/deuxfleurs/mknet]# warp get --host=[fc00:9a7a:9e::1]:3900 --obj.size 100MiB --obj.randsize --duration=10m --concurrent 8 --objects 200 --access-key=GKc1e16da48142bdb95d98a4e4 --secret-key=c4ef5d5f7ee24ccae12a98912bf5b1fda28120a7e3a8f90cb3710c8683478b31
Creating Bucket "warp-benchmark-bucket"...Element { tag_name: {http://s3.amazonaws.com/doc/2006-03-01/}LocationConstraint, attributes: [], namespaces: [Namespace { name: None, uri: "http://s3.amazonaws.com/doc/2006-03-01/" }] }
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/xVdzjy23/1.KximayVLlhLwfE5f.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/N4zQvKhs/1.XkkO6DJ%28hVpGIrMj.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/GQrsevhN/1.7hglGIP%28mXTJMgFE.rnd": read tcp [fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:57008->[fc00:9a7a:9e::1]:3900: read: connection reset by peer

warp: <ERROR> Error preparing server: upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"].
```
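
For readability, here is the same warp invocation split over several lines; every flag, the host and the keys are taken verbatim from the run above, only the quotes around the IPv6 host are added so the shell does not try to expand the brackets:

```bash
warp get \
  --host='[fc00:9a7a:9e::1]:3900' \
  --obj.size 100MiB --obj.randsize \
  --duration=10m --concurrent 8 --objects 200 \
  --access-key=GKc1e16da48142bdb95d98a4e4 \
  --secret-key=c4ef5d5f7ee24ccae12a98912bf5b1fda28120a7e3a8f90cb3710c8683478b31
```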

We observe many different types of errors:

- [RPC] Quorum timeout errors: they are probably generated by a ping between nodes
- [RPC] "Not connected" errors: after a timeout, a reconnection is not triggered immediately
- [S3 Gateway] The gateway took too much time to answer and a timeout was triggered in the client
- [S3 Gateway] The S3 gateway closes the TCP connection before answering

As a first conclusion, we have clearly narrowed the scope of the problem by identifying that this undesirable behavior is triggered by a network bottleneck.
slow-net.yml (new file, 25 lines)
@@ -0,0 +1,25 @@
links:
  - &slow
    bandwidth: 1M
    latency: 500us
  - &1000
    bandwidth: 1000M
    latency: 100us

servers:
  - name: node1
    <<: *slow
  - name: node2
    <<: *slow
  - name: node3
    <<: *slow

global:
  subnet:
    base: 'fc00:9a7a:9e::'
    local: 64
    zone: 16
  latency-offset: 3ms
  upstream:
    ip: fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff
    conn: *1000
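
A side note on the YAML: `&slow` and `&1000` are anchors that define reusable link profiles, and `<<: *slow` merges a profile into each server entry. A quick way to check how an entry resolves, using python3 with PyYAML (not part of this commit, just a convenient checker):

```bash
# Print the fully resolved entry for node1: the merge key (<<: *slow) should
# pull in bandwidth: 1M and latency: 500us from the &slow link profile.
# Expected output, up to key order: {'name': 'node1', 'bandwidth': '1M', 'latency': '500us'}
python3 -c 'import yaml; print(yaml.safe_load(open("slow-net.yml"))["servers"][0])'
```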