Test another Garage conf + try Jaeger

Quentin 2022-08-12 11:01:55 +02:00
parent 4e6ca1b724
commit a8af0e657d
Signed by: quentin
GPG key ID: E9602264D639FF68
3 changed files with 63 additions and 0 deletions

@@ -24,6 +24,7 @@ cat > ${GARAGE_CONFIG_FILE} << EOF
metadata_dir = "${NODE_STORAGE_PATH}/meta"
data_dir = "${NODE_STORAGE_PATH}/data"
block_size = 131072
replication_mode = "3"
rpc_bind_addr = "[::]:3901"

BIN img/jaeger_s3_put.png (new binary file, 184 KiB)

@@ -98,6 +98,37 @@ $ ./s3concurrent
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
```
We observe that Garage starts timing out as soon as 2 coroutines run in parallel. We know that pushing blocks on the same channel as RPC messages is a weakness of Garage: while a block is being sent, no RPC message can get through. In our specific deployment, it seems that sending 2 blocks takes long enough to trigger a timeout.
That's also the reason why we selected small blocks at the beginning (1MB and not 10MB), and why we are working on a mechanism to transfer blocks on a dedicated channel (https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343).
So the idea is to observe the behaviour of Garage with a smaller block size: RPC Ping messages will probably be better multiplexed in this case. We select 128 KiB blocks instead of 1 MiB ones (8 times smaller).
This time, we can handle 2 coroutines at once but not 3:
```
2022/08/12 10:50:08 created bucket a565074b-0609-4f5f-8d46-389f86565197
2022/08/12 10:50:08 start concurrent loop with 1 coroutines
2022/08/12 10:50:32 done, 1 coroutines returned
2022/08/12 10:50:32 start concurrent loop with 2 coroutines
2022/08/12 10:51:18 done, 2 coroutines returned
2022/08/12 10:51:18 start concurrent loop with 3 coroutines
2022/08/12 10:51:35 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:51:45 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 done, 3 coroutines returned
2022/08/12 10:51:45 start concurrent loop with 4 coroutines
2022/08/12 10:52:09 1/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:13 2/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:13 3/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:15 4/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:15 done, 4 coroutines returned
```
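The "Could not reach quorum of 2" errors follow from the replication settings: with `replication_mode = "3"`, a write is acknowledged only once a majority of the 3 replicas has it. The majority rule below is my reading of the error messages, not Garage's actual code:

```go
package main

import "fmt"

// quorum returns the majority threshold for n replicas:
// floor(n/2) + 1, i.e. 2 of 3 for replication_mode = "3".
func quorum(replicas int) int { return replicas/2 + 1 }

func main() {
	// One replica succeeded, the two others timed out, as in the logs above.
	succeeded, replicas := 1, 3
	fmt.Printf("need %d of %d, got %d -> ok=%v\n",
		quorum(replicas), replicas, succeeded, succeeded >= quorum(replicas))
	// → need 2 of 3, got 1 -> ok=false
}
```

So a single healthy replica is never enough: as soon as two of the three block transfers time out, the whole PutObject fails.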
So we see that improving multiplexing opportunities has a positive impact, but it does not solve all our problems! We might need a scheduler that always prioritizes Ping RPCs, and we must also make sure the queue is bounded - otherwise we would simply fill it up.
But for now, we will start with an overview of the available tools.
## Overview of available tools to observe Garage internals
Even if I have some theory on what is going wrong, I want to collect as much information as possible before making hypotheses,
@@ -135,5 +166,36 @@ api_s3_request_duration_sum{api_endpoint="PutObject"} 147.68400154399998
api_s3_request_duration_count{api_endpoint="PutObject"} 6
```
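These two counters already give a quick sanity check: the mean PutObject latency is simply `duration_sum / duration_count`. The helper name below is mine, the numbers are copied from the metrics output:

```go
package main

import "fmt"

// avgSeconds derives a mean latency from a Prometheus-style
// duration_sum / duration_count pair.
func avgSeconds(sum float64, count int) float64 {
	return sum / float64(count)
}

func main() {
	// Values from the api_s3_request_duration metrics above.
	fmt.Printf("average PutObject: %.1fs\n", avgSeconds(147.68400154399998, 6))
	// → average PutObject: 24.6s
}
```

An average of roughly 24.6 s per PutObject is already suspiciously close to the 30 s spans we will see in the traces below.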
### Traces with Jaeger and OTLP
Based on the Jaeger documentation, I run:
```
docker run --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-e COLLECTOR_OTLP_ENABLED=true \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.37
```
And then, after sending some requests through Garage, I observe in the Jaeger UI (served on port 16686):
![Jaeger trace screenshot](img/jaeger_s3_put.png)
We see many traces that last exactly 30 seconds.
I suspect that we are hitting a timeout here, but I am not sure.
## Reading source code
(todo)