Starting my write-up
parent
5a8be6fc73
commit
e83f059675
4 changed files with 62 additions and 2 deletions
|
@ -26,7 +26,7 @@ you must add to your path some tools.
|
||||||
```bash
|
```bash
|
||||||
# see versions on https://garagehq.deuxfleurs.fr/_releases.html
|
# see versions on https://garagehq.deuxfleurs.fr/_releases.html
|
||||||
export GRG_ARCH=x86_64-unknown-linux-musl
|
export GRG_ARCH=x86_64-unknown-linux-musl
|
||||||
export GRG_VERSION=v0.5.0
|
export GRG_VERSION=v0.7.2.1
|
||||||
|
|
||||||
sudo wget https://garagehq.deuxfleurs.fr/_releases/${GRG_VERSION}/${GRG_ARCH}/garage -O /usr/local/bin/garage
|
sudo wget https://garagehq.deuxfleurs.fr/_releases/${GRG_VERSION}/${GRG_ARCH}/garage -O /usr/local/bin/garage
|
||||||
sudo chmod +x /usr/local/bin/garage
|
sudo chmod +x /usr/local/bin/garage
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
#!/bin/bash
|
#!/usr/bin/env bash
|
||||||
|
|
||||||
set -euo pipefail
|
set -euo pipefail
|
||||||
IFS=$'\n\t'
|
IFS=$'\n\t'
|
||||||
|
|
35
liveness.md
Normal file
|
@ -0,0 +1,35 @@
|
||||||
|
# Liveness issues
|
||||||
|
|
||||||
|
Some people have reported timeouts when putting load on their Garage cluster.
|
||||||
|
On our production cluster, which runs without pressure, we don't really observe this behavior.
|
||||||
|
|
||||||
|
But when I tried to run warp, a benchmark tool created by the MinIO developers, I hit the same limit.
|
||||||
|
So I wanted to reproduce this behavior in a more controlled environment.
|
||||||
|
|
||||||
|
I thus chose to use mknet to emulate a simple network with close to zero latency but very low bandwidth (1M). The idea is that the network becomes the bottleneck, rather than the CPU, the memory, or the disk.
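
To get a feel for how quickly such a link saturates, here is a rough back-of-the-envelope sketch. It assumes that `1M` means 1 Mbit/s (10^6 bits per second), which is an assumption about mknet's units, and it ignores protocol overhead and replication traffic. The `transfer_seconds` helper is hypothetical and only illustrates the arithmetic.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: estimate how long one object takes to cross a link.
# Assumption: the link rate is expressed in Mbit/s (10^6 bits per second).
# Protocol overhead and concurrent transfers are ignored.
#   $1 = object size in MiB
#   $2 = link rate in Mbit/s
transfer_seconds() {
  local size_mib=$1 rate_mbit=$2
  local bits=$(( size_mib * 1024 * 1024 * 8 ))
  # Integer division: the result is rounded down to whole seconds.
  echo $(( bits / (rate_mbit * 1000000) ))
}

transfer_seconds 100 1   # prints 838
```

Under these assumptions, a single 100 MiB object needs roughly 14 minutes to cross the link, and several concurrent transfers share that same budget, so it is plausible that timeouts fire long before a request completes.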
|
||||||
|
|
||||||
|
We quickly observe that the cluster does not react well:
|
||||||
|
|
||||||
|
```
|
||||||
|
[nix-shell:/home/quentin/Documents/dev/deuxfleurs/mknet]# warp get --host=[fc00:9a7a:9e::1]:3900 --obj.size 100MiB --obj.randsize --duration=10m --concurrent 8 --objects 200 --access-key=GKc1e16da48142bdb95d98a4e4 --secret-key=c4ef5d5f7ee24ccae12a98912bf5b1fda28120a7e3a8f90cb3710c8683478b31
|
||||||
|
Creating Bucket "warp-benchmark-bucket"...Element { tag_name: {http://s3.amazonaws.com/doc/2006-03-01/}LocationConstraint, attributes: [], namespaces: [Namespace { name: None, uri: "http://s3.amazonaws.com/doc/2006-03-01/" }] }
|
||||||
|
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
|
||||||
|
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
|
||||||
|
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/xVdzjy23/1.KximayVLlhLwfE5f.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
|
||||||
|
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/N4zQvKhs/1.XkkO6DJ%28hVpGIrMj.rnd": dial tcp [fc00:9a7a:9e::1]:3900: i/o timeout
|
||||||
|
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
|
||||||
|
warp: <ERROR> upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 3cb7ed98f7c66a55", "Netapp error: Not connected: 92c7fb74ed89f289"]
|
||||||
|
warp: <ERROR> upload error: Put "http://[fc00:9a7a:9e::1]:3900/warp-benchmark-bucket/GQrsevhN/1.7hglGIP%28mXTJMgFE.rnd": read tcp [fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:57008->[fc00:9a7a:9e::1]:3900: read: connection reset by peer
|
||||||
|
|
||||||
|
warp: <ERROR> Error preparing server: upload error: Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"].
|
||||||
|
```
|
||||||
|
|
||||||
|
We observe several different types of errors:
|
||||||
|
- [RPC] Quorum errors caused by timeouts; they are probably triggered by a ping between nodes
|
||||||
|
- [RPC] "Not connected" errors: after a timeout, a reconnection is not triggered immediately
|
||||||
|
- [S3 Gateway] The gateway took too much time to answer and a timeout was triggered in the client
|
||||||
|
- [S3 Gateway] The S3 gateway closes the TCP connection before answering
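
The four categories above can be counted mechanically from the benchmark output. The following is a minimal sketch, not part of the original tooling: the `classify_errors` function and its counter names are hypothetical, and the patterns are simply taken from the error messages shown in the log above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read warp output on stdin and count each error type.
# The patterns are copied from the messages in the log above; any other
# wording would need its own rule.
classify_errors() {
  awk '
    /Could not reach quorum/ && /"Timeout"/ { rpc_timeout++;   next }
    /Not connected/                         { not_connected++; next }
    /i\/o timeout/                          { client_timeout++; next }
    /connection reset by peer/              { reset++;         next }
    END {
      printf "rpc_timeout=%d\n",       rpc_timeout
      printf "rpc_not_connected=%d\n", not_connected
      printf "client_timeout=%d\n",    client_timeout
      printf "connection_reset=%d\n",  reset
    }
  '
}
```

It could be used by saving the benchmark output to a file (for example a hypothetical `warp.log`) and running `classify_errors < warp.log`, which makes it easier to compare runs once the network parameters change.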
|
||||||
|
|
||||||
|
As a first conclusion, we have clearly narrowed the scope of the problem: this undesirable behavior is triggered by a network bottleneck.
|
||||||
|
|
||||||
|
|
25
slow-net.yml
Normal file
|
@ -0,0 +1,25 @@
|
||||||
|
links:
|
||||||
|
- &slow
|
||||||
|
bandwidth: 1M
|
||||||
|
latency: 500us
|
||||||
|
- &1000
|
||||||
|
bandwidth: 1000M
|
||||||
|
latency: 100us
|
||||||
|
|
||||||
|
servers:
|
||||||
|
- name: node1
|
||||||
|
<<: *slow
|
||||||
|
- name: node2
|
||||||
|
<<: *slow
|
||||||
|
- name: node3
|
||||||
|
<<: *slow
|
||||||
|
|
||||||
|
global:
|
||||||
|
subnet:
|
||||||
|
base: 'fc00:9a7a:9e::'
|
||||||
|
local: 64
|
||||||
|
zone: 16
|
||||||
|
latency-offset: 3ms
|
||||||
|
upstream:
|
||||||
|
ip: fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff
|
||||||
|
conn: *1000
|