Update write up

Quentin 2022-08-12 14:32:38 +02:00
parent a8af0e657d
commit 7c7eea6d26
Signed by: quentin
GPG Key ID: E9602264D639FF68
1 changed file with 24 additions and 11 deletions


@@ -98,11 +98,14 @@ $ ./s3concurrent
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
```
We observe that Garage starts timing out as soon as 2 coroutines are executed in parallel. We know that pushing blocks on the same channel as RPC messages is a weakness of Garage: while a block is being sent, no RPC message can go through. In our specific deployment, it seems that sending 2 blocks concurrently takes long enough to trigger a timeout.
That's also the reason why we selected small blocks in the first place (1 MB and not 10 MB), and why we are working on a mechanism to transfer blocks on a dedicated channel (https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343).
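To get an idea of the orders of magnitude at play, here is a back-of-the-envelope sketch (the 5 Mbit/s uplink is an assumed figure for a small home deployment, not a measurement): once a block starts going out on the shared channel, a Ping queued behind it has to wait for the whole block, so the added latency is roughly block_size / bandwidth.

```rust
// Rough worst-case delay for an RPC message queued behind a single block
// on a shared channel: delay ≈ block_size / bandwidth.
// The 5 Mbit/s uplink is an assumed value, not a measured one.
fn blocking_delay_secs(block_size_bytes: f64, bandwidth_bits_per_sec: f64) -> f64 {
    (block_size_bytes * 8.0) / bandwidth_bits_per_sec
}

fn main() {
    let uplink = 5_000_000.0; // 5 Mbit/s, hypothetical
    for (label, size) in [("1 MiB", 1024.0 * 1024.0), ("128 KiB", 128.0 * 1024.0)] {
        println!(
            "{} block on a 5 Mbit/s link: a Ping may wait ~{:.2} s behind it",
            label,
            blocking_delay_secs(size, uplink)
        );
    }
}
```

Under these assumptions, two 1 MiB blocks in flight already add more than 3 seconds of delay, which is above the 2-second Ping timeout discussed below, and consistent with timeouts appearing as soon as 2 coroutines run in parallel.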
Internally, Garage uses netapp, a library of our own. This library has an integrated scheduler that knows how to:
- handle the priority of RPC packets,
- multiplex packets of the same priority so that every request makes progress.
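As an illustration only (this is not netapp's actual implementation; the priority values and chunk counts are made up), here is a minimal sketch of the behaviour described above: the highest priority with pending packets is served first, and packets of the same priority are sent chunk by chunk in round-robin so that they all progress together.

```rust
use std::collections::{BTreeMap, VecDeque};

// A message split into fixed-size chunks; `chunks_left` is how many chunks
// still have to be sent before the message is complete.
struct Message {
    name: &'static str,
    chunks_left: u32,
}

// Toy scheduler: one FIFO of in-flight messages per priority level
// (lower number = higher priority). At each step we pick the highest
// priority that has pending messages, send one chunk of the message at
// the front of that queue, then rotate it to the back so that all
// messages of that priority progress at the same pace.
struct Scheduler {
    queues: BTreeMap<u8, VecDeque<Message>>,
}

impl Scheduler {
    fn new() -> Self {
        Scheduler { queues: BTreeMap::new() }
    }

    fn push(&mut self, prio: u8, msg: Message) {
        self.queues.entry(prio).or_default().push_back(msg);
    }

    // Send one chunk of one message, if any; returns false when idle.
    fn send_one_chunk(&mut self) -> bool {
        for (prio, queue) in self.queues.iter_mut() {
            if let Some(mut msg) = queue.pop_front() {
                msg.chunks_left -= 1;
                println!("prio {}: sent one chunk of {}", prio, msg.name);
                if msg.chunks_left > 0 {
                    queue.push_back(msg); // round-robin within the priority
                } else {
                    println!("prio {}: {} fully sent", prio, msg.name);
                }
                return true;
            }
        }
        false
    }
}

fn main() {
    let mut sched = Scheduler::new();
    // Hypothetical numbers: a Ping is 1 chunk, a block is 8 chunks.
    sched.push(0, Message { name: "Ping", chunks_left: 1 });
    sched.push(1, Message { name: "Block A", chunks_left: 8 });
    sched.push(1, Message { name: "Block B", chunks_left: 8 });
    while sched.send_one_chunk() {}
}
```

Note that even in such a scheme, a newly arrived high-priority packet still has to wait for the chunk currently being transmitted, which is why the size of what gets sent in one go matters.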
So in theory, this scheduler should be able to handle all our packets seamlessly.
To better understand its behaviour, and to check whether we are facing a multiplexing problem, we observe Garage with a smaller block size: RPC Ping messages should be better multiplexed in that case. We select 128 KiB blocks instead of 1 MiB ones (8 times smaller).
This time, we can handle 2 coroutines at once but not 3:
@@ -125,9 +128,25 @@ This time, we can handle 2 coroutines at once but not 3:
2022/08/12 10:52:15 done, 4 coroutines returned
```
So we see that improving multiplexing opportunities has a positive impact, but it does not solve all our problems! We might need a scheduler that always prioritizes Ping RPCs, and we must also make sure the queue is bounded, otherwise we will simply keep filling it.
Despite dividing our block size by 8, we did not improve our parallelism by a similar factor. As a conclusion, we need to question our design.
## Making a hypothesis on netapp's inner workings
First, we took a look at netapp's failure detectors.
It seems that, even when the TCP socket is closed, netapp has no way to detect the failure.
So Garage has a second layer that detects failures through timeouts on RPC commands. In our analysis, we identified 2 critical RPC commands:
- Ping, which has a high priority and a timeout of 2 seconds,
- BlockRW, which has a normal priority and a timeout of 30 seconds.
It appears that the timeout is triggered by the second RPC command (BlockRW).
For a reason I don't understand yet, any timeout triggers a disconnect/reconnect of the node (with a delay).
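To make this two-layer picture concrete, here is a small sketch (not Garage's code; the operations, durations, and the shortened BlockRW timeout are placeholders) of how a per-request timeout on top of an otherwise silent transport turns a slow BlockRW into a detected failure, while a Ping stays well within its budget.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run `work` in the background and wait for its result at most `timeout`.
// This mimics a failure-detection layer built on RPC timeouts: the
// transport itself never reports the slowness, only the timer does.
fn call_with_timeout<T, F>(work: F, timeout: Duration) -> Result<T, &'static str>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).map_err(|_| "timeout")
}

fn main() {
    // Hypothetical durations: a Ping answers quickly, a BlockRW is stuck
    // behind many other block transfers and takes "too long".
    let ping = call_with_timeout(
        || { thread::sleep(Duration::from_millis(100)); "pong" },
        Duration::from_secs(2), // Ping timeout
    );
    println!("Ping: {:?}", ping);

    let block_rw = call_with_timeout(
        || { thread::sleep(Duration::from_secs(3)); "block data" },
        Duration::from_secs(1), // stands in for the 30 s BlockRW timeout
    );
    println!("BlockRW: {:?}", block_rw);
    // In Garage, a timeout like this one then triggers a disconnect/reconnect.
}
```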
So, here is my current mental model of our issue when we send multiple PutObject requests:
- Ping RPC commands are always handled in less than 2 seconds, thanks to their high priority and their low number.
- BlockRW commands accumulate in the queue without any limit. They all have the same priority, so each of them slowly progresses at the same pace. Because there are so many of them, none completes in less than 30 seconds, which triggers many timeouts in Garage (see the back-of-the-envelope sketch after this list).
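Here is the back-of-the-envelope calculation behind that last point, with the same assumed 5 Mbit/s uplink and 128 KiB blocks (illustrative numbers, not measurements): when N blocks of the same priority share the link fairly, they all finish at roughly the same time, t ≈ N × block_size / bandwidth.

```rust
// Back-of-the-envelope model of fair sharing (assumed numbers, not measurements):
// N blocks of S bytes share a link of B bit/s at the same priority, so they
// all progress together and all finish at roughly t = N * S * 8 / B.
fn completion_time_secs(n_blocks: f64, block_size_bytes: f64, bandwidth_bps: f64) -> f64 {
    n_blocks * block_size_bytes * 8.0 / bandwidth_bps
}

fn main() {
    let bandwidth = 5_000_000.0; // assumed 5 Mbit/s uplink
    let block = 128.0 * 1024.0;  // 128 KiB blocks
    for n in [10u32, 50, 100, 200] {
        let t = completion_time_secs(n as f64, block, bandwidth);
        println!(
            "{:>3} queued blocks -> every block completes in ~{:>5.1} s{}",
            n,
            t,
            if t > 30.0 { "  (over the 30 s BlockRW timeout)" } else { "" }
        );
    }
}
```

With these assumed numbers, as soon as roughly 145 blocks are queued, every single BlockRW misses its 30-second deadline even though the link is fully busy, and since the queue is unbounded, nothing prevents it from growing that large.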
To check this hypothesis, I will start by logging netapp's queues and their content.
But for now, we will start with an overview of the tools available to observe Garage's internals.
## Overview of available tools to observe Garage internals
@@ -193,9 +212,3 @@ And then I observe:
We see many traces containing a request that lasts exactly 30 seconds.
I suspect that we are hitting a timeout here, but I am not sure.
## Reading source code
(todo)