Investigation about liveness issues

2022-08-12 18:32:56 +02:00 · c99b9b4abb
commit c99b9b4abb
parent 7c7eea6d26
2 changed files with 26 additions and 1 deletions
--- a/example/deploy_garage.sh
+++ b/example/deploy_garage.sh
@@ -49,7 +49,7 @@ admin_token = "ae8cb40ea7368bbdbb6430af11cca7da833d3458a5f52086f4e805a570fb5c2a"
 trace_sink = "http://[fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:4317"
 EOF

-RUST_LOG=garage=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
+RUST_LOG=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
 sleep 2

 CONFIG_NODE_FPATH=$(find /tmp/garage-testnet/ -maxdepth 3 -name garage.toml|head -n 1)
--- a/liveness.md
+++ b/liveness.md
@@ -147,6 +147,31 @@ So, here is my current mental model of our issue when we send multiple PutObject

 To check this hypothesis, I will start by logging netapp queues and their content.

+It appears that the problem is more complicated than it first seemed, as we have two ping mechanisms, one at the netapp layer and one at the Garage layer, each running in both directions. It seems that the netapp pings from the storage node to the gateway node are the ones failing.
+
+```
+ WARN  netapp::peering::fullmesh    > Error pinging 90af93030366c0da: Ping timeout
+ DEBUG netapp::peering::fullmesh    > Ping from 591ac2bffb05a3ec
+ WARN  netapp::peering::fullmesh    > Error pinging 90af93030366c0da: Ping timeout
+ WARN  netapp::peering::fullmesh    > Too many failed pings from 90af93030366c0da, closing connection.
+ DEBUG netapp::netapp               > Closing connection to 90af93030366c0da ([fc00:9a7a:9e::1]:3901)
+```
+
+Currently Garage does not pipeline writes, so it waits until a chunk has been written before writing the next one. As a result, the queue holds only a few entries:
+  - the first chunk of upload 1
+  - the first chunk of upload 2
+  - the first chunk of upload 3, and so on
+
+But we can see that problems can still occur with numerous uploads!
+And if we start pipelining sends, it will make the problem even worse!
+
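The queue reasoning above can be put into rough numbers. This is only a back-of-the-envelope sketch: it assumes a strict FIFO queue in which a ping waits behind every queued chunk, which may not match how netapp actually schedules messages. The function names, the 1 MiB chunk size and the 1 MiB/s link speed are hypothetical values chosen for illustration, not measurements from the testnet. Only the 5 second timeout comes from these notes.

```rust
use std::time::Duration;

/// Chunks sitting in the queue: one per upload without pipelining,
/// `pipeline_depth` per upload if sends were pipelined (hypothetical model).
fn queued_chunks(uploads: u64, pipeline_depth: u64) -> u64 {
    uploads * pipeline_depth
}

/// Time a ping would wait behind the queued chunks, assuming a strict FIFO
/// queue drained at `bytes_per_sec` (an assumption, not netapp's real policy).
fn ping_wait(queued_chunks: u64, chunk_bytes: u64, bytes_per_sec: u64) -> Duration {
    let queued_bytes = queued_chunks * chunk_bytes;
    Duration::from_secs_f64(queued_bytes as f64 / bytes_per_sec as f64)
}

fn main() {
    // Illustrative numbers only.
    let chunk_bytes = 1024 * 1024; // 1 MiB per chunk
    let bytes_per_sec = 1024 * 1024; // 1 MiB/s link
    let ping_timeout = Duration::from_secs(5); // netapp timeout quoted in the notes

    // (concurrent uploads, pipeline depth)
    for (uploads, depth) in [(3, 1), (10, 1), (3, 4)] {
        let wait = ping_wait(queued_chunks(uploads, depth), chunk_bytes, bytes_per_sec);
        println!(
            "uploads={} depth={} wait={:?} exceeds_timeout={}",
            uploads,
            depth,
            wait,
            wait > ping_timeout
        );
    }
}
```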
+It seems that we could improve the situation by:
+  - Removing Garage pings, as netapp already handles them for us (although they seem to be used to measure an average ping; I am not sure about this point)
+  - Removing timeouts on RPC blocks, as failure detection is handled by netapp
+
+But before implementing these solutions, we must understand why netapp pings are failing. This is all the more surprising as they have a 5-second timeout, compared with the 2-second one in Garage...
+
+

 ## Overview of available tools to observe Garage internals