diff --git a/example/deploy_garage.sh b/example/deploy_garage.sh
index b9941aa..c011c7b 100755
--- a/example/deploy_garage.sh
+++ b/example/deploy_garage.sh
@@ -24,6 +24,7 @@ cat > ${GARAGE_CONFIG_FILE} << EOF
 metadata_dir = "${NODE_STORAGE_PATH}/meta"
 data_dir = "${NODE_STORAGE_PATH}/data"
+block_size = 131072
 replication_mode = "3"
 
 rpc_bind_addr = "[::]:3901"
diff --git a/img/jaeger_s3_put.png b/img/jaeger_s3_put.png
new file mode 100644
index 0000000..5d6877a
Binary files /dev/null and b/img/jaeger_s3_put.png differ
diff --git a/liveness.md b/liveness.md
index 4719c9a..83fcfeb 100644
--- a/liveness.md
+++ b/liveness.md
@@ -98,6 +98,37 @@ $ ./s3concurrent
 2022/08/11 20:37:51 start concurrent loop with 4 coroutines
 ```
+We observe that Garage starts timing out as soon as 2 coroutines run in parallel. We know that pushing blocks on the same channel as RPC messages is a weakness of Garage: while a block is being sent, no other RPC message can go through. In our specific deployment, it seems that sending 2 blocks takes enough time to trigger a timeout.
+
+That's also the reason why we selected small blocks at the beginning (1MB and not 10MB), and why we are working on a mechanism to transfer blocks on a dedicated channel (https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343).
+
+So the idea is to observe the behaviour of Garage with a smaller block size: RPC Ping messages will probably be better multiplexed in this case. We select 128 KiB blocks instead of 1 MiB ones (8 times smaller).
+
+This time, we can handle 2 coroutines at once but not 3:
+
+```
+2022/08/12 10:50:08 created bucket a565074b-0609-4f5f-8d46-389f86565197
+2022/08/12 10:50:08 start concurrent loop with 1 coroutines
+2022/08/12 10:50:32 done, 1 coroutines returned
+2022/08/12 10:50:32 start concurrent loop with 2 coroutines
+2022/08/12 10:51:18 done, 2 coroutines returned
+2022/08/12 10:51:18 start concurrent loop with 3 coroutines
+2022/08/12 10:51:35 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
+2022/08/12 10:51:45 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
+2022/08/12 10:51:45 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
+2022/08/12 10:51:45 done, 3 coroutines returned
+2022/08/12 10:51:45 start concurrent loop with 4 coroutines
+2022/08/12 10:52:09 1/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
+2022/08/12 10:52:13 2/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
+2022/08/12 10:52:13 3/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 0b36b6d0de0a6393", "Netapp error: Not connected: b61e6a192c9462c9"]
+2022/08/12 10:52:15 4/4 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
+2022/08/12 10:52:15 done, 4 coroutines returned
+```
+
+So we see that improving multiplexing opportunities has a positive impact, but it does not solve all our problems!
+We might need a scheduler that always prioritizes Ping RPCs, and on top of that we should make sure that the send queue is bounded - otherwise we will simply keep filling it. A rough sketch of this idea is given below.
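+
+A minimal sketch of such a priority scheduler, assuming a simplified model of a connection's sender task (this is not Garage's or netapp's actual code: the message types, queue sizes and function names are made up for illustration):
+
+```rust
+use tokio::sync::mpsc;
+
+// Simplified message kinds: small control messages (pings, RPC requests/replies)
+// versus large block payloads.
+enum CtrlMsg {
+    Ping,
+}
+struct Block(Vec<u8>);
+
+// Hypothetical sender loop for one connection: it drains two *bounded* queues,
+// and `biased` makes it always start a pending control message before a pending block.
+async fn sender_loop(mut ctrl_rx: mpsc::Receiver<CtrlMsg>, mut block_rx: mpsc::Receiver<Block>) {
+    loop {
+        tokio::select! {
+            biased; // poll the branches in order: control messages first
+            Some(msg) = ctrl_rx.recv() => send_control(msg).await,
+            Some(block) = block_rx.recv() => send_block(block).await,
+            else => break, // both queues closed: nothing left to send
+        }
+    }
+}
+
+async fn send_control(_msg: CtrlMsg) { /* write a small frame on the socket */ }
+async fn send_block(_block: Block) { /* write the block, ideally in small chunks */ }
+
+#[tokio::main]
+async fn main() {
+    // Bounded queues: when the block queue is full, producers wait instead of
+    // growing an unbounded backlog behind a slow link.
+    let (ctrl_tx, ctrl_rx) = mpsc::channel::<CtrlMsg>(16);
+    let (block_tx, block_rx) = mpsc::channel::<Block>(4);
+    tokio::spawn(sender_loop(ctrl_rx, block_rx));
+
+    // Elsewhere, the RPC layer would use `ctrl_tx` and the block manager `block_tx`.
+    ctrl_tx.send(CtrlMsg::Ping).await.unwrap();
+    block_tx.send(Block(vec![0u8; 128 * 1024])).await.unwrap();
+}
+```
+
+Note that such a priority only decides which pending message is started next: a block that is already being written still delays pings behind it, which is why chunking blocks into small frames (as explored in the dedicated-channel work linked above) matters as well.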
+
+But for now, we will start by observing what is happening inside Garage with the tools already available.
+
 ## Overview of available tools to observe Garage internals
 
 Even if I have some theory on what is going wrong, I want to collect as many information as possible before making hypothesis,
@@ -135,5 +166,36 @@ api_s3_request_duration_sum{api_endpoint="PutObject"} 147.68400154399998
 api_s3_request_duration_count{api_endpoint="PutObject"} 6
 ```
 
+### Traces with Jaeger and OTLP
+
+Based on the Jaeger documentation, I run:
+
+```
+docker run --name jaeger \
+  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
+  -e COLLECTOR_OTLP_ENABLED=true \
+  -p 6831:6831/udp \
+  -p 6832:6832/udp \
+  -p 5778:5778 \
+  -p 16686:16686 \
+  -p 4317:4317 \
+  -p 4318:4318 \
+  -p 14250:14250 \
+  -p 14268:14268 \
+  -p 14269:14269 \
+  -p 9411:9411 \
+  jaegertracing/all-in-one:1.37
+```
+
+And then I observe:
+
+![Jaeger trace screenshot](img/jaeger_s3_put.png)
+
+We see many traces where the request takes exactly 30s.
+I suspect that we are hitting a timeout here, but I am not sure.
+
+## Reading source code
+
+(todo)