From c99b9b4abb63c581bde59fc4ea9f75a6a153c310 Mon Sep 17 00:00:00 2001
From: Quentin Dufour
Date: Fri, 12 Aug 2022 18:32:56 +0200
Subject: [PATCH] Investigation about liveness issues

---
 example/deploy_garage.sh |  2 +-
 liveness.md              | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/example/deploy_garage.sh b/example/deploy_garage.sh
index c011c7b..215ad37 100755
--- a/example/deploy_garage.sh
+++ b/example/deploy_garage.sh
@@ -49,7 +49,7 @@ admin_token = "ae8cb40ea7368bbdbb6430af11cca7da833d3458a5f52086f4e805a570fb5c2a"
 trace_sink = "http://[fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:4317"
 EOF
 
-RUST_LOG=garage=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
+RUST_LOG=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
 sleep 2
 
 CONFIG_NODE_FPATH=$(find /tmp/garage-testnet/ -maxdepth 3 -name garage.toml|head -n 1)
diff --git a/liveness.md b/liveness.md
index 768543c..83de29e 100644
--- a/liveness.md
+++ b/liveness.md
@@ -147,6 +147,31 @@ So, here is my current mental model of our issue when we send multiple PutObject
 
 To check this hypothesis, I will start by logging netapp queues and their content.
 
+It appears that the problem is more complicated than it first seemed, as we have 2 ping mechanisms, one at the netapp layer and one at the Garage layer, each running in both directions. And it seems that netapp pings are failing from the storage node to the gateway node.
+
+```
+ WARN netapp::peering::fullmesh > Error pinging 90af93030366c0da: Ping timeout
+ DEBUG netapp::peering::fullmesh > Ping from 591ac2bffb05a3ec
+ WARN netapp::peering::fullmesh > Error pinging 90af93030366c0da: Ping timeout
+ WARN netapp::peering::fullmesh > Too many failed pings from 90af93030366c0da, closing connection.
+ DEBUG netapp::netapp > Closing connection to 90af93030366c0da ([fc00:9a7a:9e::1]:3901)
+```
+
+Currently Garage does not pipeline writes, so it waits until a chunk has been written before writing the next one. So in the end, we do not have that many entries in the queue:
+ - the first chunk of upload 1
+ - the first chunk of upload 2
+ - the first chunk of upload 3, and so on, and so forth
+
+But we can see that problems can still occur with numerous uploads!
+And if we start pipelining sends, it will make the problem even worse!
+
+It seems that we could improve the situation by:
+ - Deleting Garage pings as netapp is handling them for us (even if it seems that they are used to measure an average ping - not sure of this point)
+ - Deleting timeouts on RPC blocks as failure detection is handled by netapp
+
+But before implementing these solutions, we must understand why netapp pings are failing. This is even more surprising as they have a 5 second timeout, compared with a 2 second one in Garage...
+
+
 ## Overview of available tools to observe Garage internals