Investigation about liveness issues

Quentin 2022-08-12 18:32:56 +02:00
parent 7c7eea6d26
commit c99b9b4abb
Signed by: quentin
GPG Key ID: E9602264D639FF68
2 changed files with 26 additions and 1 deletions

@ -49,7 +49,7 @@ admin_token = "ae8cb40ea7368bbdbb6430af11cca7da833d3458a5f52086f4e805a570fb5c2a"
trace_sink = "http://[fc00:9a7a:9e:ffff:ffff:ffff:ffff:ffff]:4317"
EOF
RUST_LOG=garage=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
RUST_LOG=debug ${GARAGE_PATH} server 2>> ${NODE_STORAGE_PATH}/logs & disown
sleep 2
CONFIG_NODE_FPATH=$(find /tmp/garage-testnet/ -maxdepth 3 -name garage.toml|head -n 1)

@ -147,6 +147,31 @@ So, here is my current mental model of our issue when we send multiple PutObject
To check this hypothesis, I will start by logging netapp queues and their content.
It appears that the problem is more complicated than it first seemed, as we have two ping mechanisms, one at the netapp layer and one at the Garage layer, each running in both directions. And it seems that netapp pings are failing from the storage node to the gateway node.
```
WARN netapp::peering::fullmesh > Error pinging 90af93030366c0da: Ping timeout
DEBUG netapp::peering::fullmesh > Ping from 591ac2bffb05a3ec
WARN netapp::peering::fullmesh > Error pinging 90af93030366c0da: Ping timeout
WARN netapp::peering::fullmesh > Too many failed pings from 90af93030366c0da, closing connection.
DEBUG netapp::netapp > Closing connection to 90af93030366c0da ([fc00:9a7a:9e::1]:3901)
```
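To make the log excerpt above easier to read, here is a minimal sketch of a consecutive-failure detector that would produce this kind of sequence. It is not netapp's actual code: the threshold value, the `PeerState` name, and the assumption that a successful ping resets the counter are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: the real value used by netapp is not stated in these notes.
FAILED_PING_THRESHOLD = 3


@dataclass
class PeerState:
    """Tracks consecutive ping failures for one peer (illustrative only)."""
    failed_pings: int = 0
    connected: bool = True

    def on_ping_result(self, ok: bool) -> None:
        if not self.connected:
            return
        if ok:
            # Assumption: any successful ping resets the failure counter.
            self.failed_pings = 0
            return
        self.failed_pings += 1
        if self.failed_pings >= FAILED_PING_THRESHOLD:
            # Corresponds to "Too many failed pings from ..., closing connection."
            self.connected = False


peer = PeerState()
for ok in [True, False, False, False]:
    peer.on_ping_result(ok)
print(peer.connected)
```

With such a detector, a few pings that are merely delayed past their timeout are enough to close an otherwise healthy connection.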
Currently, Garage does not pipeline writes: it waits until a chunk has been written before writing the next one. So in the end, we do not have that many entries in the queue:
- the first chunk of upload 1
- the first chunk of upload 2
- the first chunk of upload 3, and so on, and so forth
But we can see that problems can still occur with numerous uploads!
And if we start pipelining sends, it will make the problem even worse!
It seems that we could improve the situation by:
- Removing Garage pings, as netapp already handles them for us (although it seems that they are used to measure an average ping; I am not sure about this point)
- Removing timeouts on RPC blocks, as failure detection is handled by netapp
But before implementing these solutions, we must understand why netapp pings are failing. This is even more surprising as they have a 5-second timeout, whereas Garage pings have a 2-second one...
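The queueing hypothesis can be turned into a back-of-the-envelope model. The sketch below assumes a strict FIFO queue with no priorities, one chunk per upload ahead of the ping, and it ignores round-trip time. The chunk size and link throughput are illustrative assumptions, not measurements; only the two timeouts come from the text.

```python
# All values except the two timeouts are illustrative assumptions.
CHUNK_BYTES = 1024 * 1024        # assumed chunk size (1 MiB)
LINK_BYTES_PER_SEC = 1_000_000   # assumed link throughput (about 8 Mbit/s)
NETAPP_PING_TIMEOUT = 5.0        # seconds, from the text
GARAGE_PING_TIMEOUT = 2.0        # seconds, from the text


def ping_wait_seconds(concurrent_uploads: int) -> float:
    """Time a ping waits when queued behind one chunk per upload (strict FIFO)."""
    return concurrent_uploads * CHUNK_BYTES / LINK_BYTES_PER_SEC


def max_uploads_before_timeout(timeout: float) -> int:
    """Largest number of concurrent uploads for which the ping still leaves the queue in time."""
    return int(timeout * LINK_BYTES_PER_SEC // CHUNK_BYTES)


for n in (1, 2, 5, 10):
    wait = ping_wait_seconds(n)
    print(n, round(wait, 2), wait > GARAGE_PING_TIMEOUT, wait > NETAPP_PING_TIMEOUT)
```

Under these assumptions, the 2-second Garage ping would time out well before the 5-second netapp one as the number of uploads grows. So if netapp pings are the ones failing, plain FIFO queueing is probably not the whole story.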
## Overview of available tools to observe Garage internals