Test another Garage conf + try Jaeger

Quentin 2022-08-12 11:01:55 +02:00
parent 4e6ca1b724
commit a8af0e657d
Signed by: quentin
GPG Key ID: E9602264D639FF68
3 changed files with 63 additions and 0 deletions


@@ -24,6 +24,7 @@ cat > ${GARAGE_CONFIG_FILE} << EOF
metadata_dir = "${NODE_STORAGE_PATH}/meta"
data_dir = "${NODE_STORAGE_PATH}/data"
block_size = 131072
replication_mode = "3"
rpc_bind_addr = "[::]:3901"

img/jaeger_s3_put.png: new binary file (184 KiB), not shown


@@ -98,6 +98,37 @@ $ ./s3concurrent
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
```
We observe that Garage starts hitting its timeout as soon as 2 coroutines run in parallel. We know that pushing blocks on the same channel as RPC messages is a weakness of Garage: while a block is being sent, no RPC message can get through. In our specific deployment, it seems that sending 2 blocks at once takes long enough to trigger a timeout.
That's also why we selected small blocks from the start (1 MB and not 10 MB), and why we are working on a mechanism to transfer blocks on a dedicated channel (https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343).
So the idea is to observe the behaviour of Garage with a smaller block size: RPC Ping messages will probably be better multiplexed in this case. We select 128 KiB blocks instead of the 1 MiB ones (8 times smaller).
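For reference, 128 KiB is 128 × 1024 = 131072 bytes, which is the value written into the node configuration in the first hunk of this commit:
```
# Garage node configuration (written to ${GARAGE_CONFIG_FILE} above):
# use 128 KiB data blocks instead of the 1 MiB used so far
block_size = 131072
```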
This time, we can handle 2 coroutines at once but not 3:
```
2022/08/12 10:50:08 created bucket a565074b-0609-4f5f-8d46-389f86565197
2022/08/12 10:50:08 start concurrent loop with 1 coroutines
2022/08/12 10:50:32 done, 1 coroutines returned
2022/08/12 10:50:32 start concurrent loop with 2 coroutines
2022/08/12 10:51:18 done, 2 coroutines returned
2022/08/12 10:51:18 start concurrent loop with 3 coroutines
2022/08/12 10:51:35 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:51:45 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:51:45 done, 3 coroutines returned
2022/08/12 10:51:45 start concurrent loop with 4 coroutines
2022/08/12 10:52:09 1/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:13 2/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:13 3/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
2022/08/12 10:52:15 4/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/12 10:52:15 done, 4 coroutines returned
```
So we see that while improving multiplexing opportunities has a positive impact, it does not solve all our problems! We might need a scheduler that always prioritizes Ping RPCs, and we should also make sure that the queue is bounded, otherwise we will simply keep filling it.
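To make the idea concrete, here is a minimal sketch of such a sender loop, with a bounded queue per message class and a bias towards pings. It is written in Go purely for illustration and under my own assumptions; it is not Garage's actual code (Garage and netapp are written in Rust):
```
package main

import (
	"fmt"
	"time"
)

// sender multiplexes two bounded queues onto one connection,
// always draining pending pings before starting another block transfer.
func sender(pings, blocks <-chan string) {
	for {
		// Non-blocking check: if a ping is waiting, send it first.
		select {
		case p := <-pings:
			fmt.Println("sent", p)
			continue
		default:
		}
		// Otherwise wait for whichever message arrives next.
		select {
		case p := <-pings:
			fmt.Println("sent", p)
		case b := <-blocks:
			fmt.Println("sent", b)
			time.Sleep(100 * time.Millisecond) // simulate a slow block transfer
		}
	}
}

func main() {
	pings := make(chan string, 16) // bounded: a full queue blocks the producer
	blocks := make(chan string, 4) // (backpressure) instead of growing forever
	go sender(pings, blocks)
	for i := 0; i < 3; i++ {
		blocks <- fmt.Sprintf("block-%d", i)
		pings <- fmt.Sprintf("ping-%d", i)
	}
	time.Sleep(time.Second) // let the sender drain the queues before exiting
}
```
Note that this sketch only reorders messages between transfers: a ping enqueued while a large block is already in flight still has to wait, which is exactly the weakness described above.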
But for now, we will start by observing what actually happens inside Garage with the tools already at our disposal.
## Overview of available tools to observe Garage internals
Even if I have some theories about what is going wrong, I want to collect as much information as possible before formulating hypotheses,
@@ -135,5 +166,36 @@ api_s3_request_duration_sum{api_endpoint="PutObject"} 147.68400154399998
api_s3_request_duration_count{api_endpoint="PutObject"} 6
```
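A quick sanity check on these two counters: assuming the duration is expressed in seconds (the usual Prometheus convention), 147.684 s over 6 PutObject requests is roughly 24.6 s per request on average. If these metrics are scraped by Prometheus, the same average can be followed over time with a standard sum/count ratio; a query sketch (the 5-minute window is an arbitrary choice):
```
rate(api_s3_request_duration_sum{api_endpoint="PutObject"}[5m])
  / rate(api_s3_request_duration_count{api_endpoint="PutObject"}[5m])
```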
### Traces with Jaeger and OTLP
Based on the Jaeger documentation, I run:
```
docker run --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
-e COLLECTOR_OTLP_ENABLED=true \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
-p 14250:14250 \
-p 14268:14268 \
-p 14269:14269 \
-p 9411:9411 \
jaegertracing/all-in-one:1.37
```
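For Garage to actually send its traces to this collector, the node must be told where the OTLP gRPC endpoint lives (port 4317 above). If I read Garage's configuration reference correctly (worth double-checking), this is the `trace_sink` option of the `[admin]` section; the address below assumes the collector runs on the same host as the node:
```
[admin]
# Assumed setting: export OpenTelemetry traces to the Jaeger all-in-one container
trace_sink = "http://localhost:4317"
```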
And then I observe:
![Jaeger trace screenshot](img/jaeger_s3_put.png)
We see many traces where the request takes exactly 30 seconds.
I suspect that we are hitting a timeout here, but I am not sure.
## Reading source code
(todo)