From 430259d0508bf455d198eb27e364e9ef8f19bc57 Mon Sep 17 00:00:00 2001
From: Quentin Dufour
Date: Thu, 11 Aug 2022 22:16:01 +0200
Subject: [PATCH] Add some doc about our own bench tool

---
 liveness.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/liveness.md b/liveness.md
index bec4ea4..c33bd05 100644
--- a/liveness.md
+++ b/liveness.md
@@ -42,9 +42,30 @@ It starts to really look like a congestion control/control flow error/scheduler

## Write a custom client exhibiting the issue

We know how to trigger the issue with `warp`, Minio's benchmark tool, but we don't yet understand well what kind of load it puts on the cluster, except that it sends Multipart and PutObject requests concurrently. So, before investigating the issue more in depth, we want to know:
- Can a single large PUT request trigger this issue on its own?
- How many parallel requests are needed to trigger it?
- Are Multipart transfers more impacted by this issue than plain PutObject requests?

To answer these questions, we wrote a specific benchmark.
Named s3concurrent, it is available here: https://git.deuxfleurs.fr/quentin/s3concurrent
The benchmark starts by sending 1 file, then 2 files concurrently,
then 3, then 4, up to 16 (this is hardcoded for now).

When run on our mknet cluster, we start triggering issues as soon as we send 2 files at once:

```
$ ./s3concurrent
2022/08/11 20:35:28 created bucket 3ffd6798-bdab-4218-b6d0-973a07e46ea9
2022/08/11 20:35:28 start concurrent loop with 1 coroutines
2022/08/11 20:35:55 done, 1 coroutines returned
2022/08/11 20:35:55 start concurrent loop with 2 coroutines
2022/08/11 20:36:34 1/2 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Timeout", "Timeout"]
2022/08/11 20:36:37 done, 2 coroutines returned
2022/08/11 20:36:37 start concurrent loop with 3 coroutines
2022/08/11 20:37:13 1/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 2/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 3/3 failed with Internal error: Could not reach quorum of 2. 1 of 3 request succeeded, others returned errors: ["Netapp error: Not connected: 92c7fb74ed89f289", "Netapp error: Not connected: 3cb7ed98f7c66a55"]
2022/08/11 20:37:51 done, 3 coroutines returned
2022/08/11 20:37:51 start concurrent loop with 4 coroutines
```
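
The actual implementation lives in the repository linked above; to make the ramp-up loop described in this section concrete, here is a minimal sketch of what such a client can look like. It assumes Go with the minio-go v7 SDK, and the endpoint, credentials, object size, and object names are placeholders, not values taken from the real s3concurrent:

```go
package main

import (
	"context"
	"crypto/rand"
	"fmt"
	"io"
	"log"

	"github.com/google/uuid"
	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

// Placeholder object size; the real tool may use a different one.
const objectSize int64 = 100 * 1024 * 1024 // 100 MiB

func main() {
	// Placeholder endpoint and credentials for a local Garage gateway.
	client, err := minio.New("localhost:3900", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	bucket := uuid.NewString()
	if err := client.MakeBucket(ctx, bucket, minio.MakeBucketOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("created bucket %s", bucket)

	// Ramp up concurrency: 1 upload, then 2 in parallel, then 3, ..., up to 16.
	for n := 1; n <= 16; n++ {
		log.Printf("start concurrent loop with %d coroutines", n)
		errs := make(chan error, n)
		for i := 0; i < n; i++ {
			go func(i int) {
				key := fmt.Sprintf("object-%02d-%02d", n, i)
				// Upload a random payload of known size; for large objects
				// the SDK may switch to a multipart upload.
				_, err := client.PutObject(ctx, bucket, key,
					io.LimitReader(rand.Reader, objectSize), objectSize,
					minio.PutObjectOptions{})
				errs <- err
			}(i)
		}
		failed := 0
		for i := 0; i < n; i++ {
			if err := <-errs; err != nil {
				failed++
				log.Printf("%d/%d failed with %v", failed, n, err)
			}
		}
		log.Printf("done, %d coroutines returned", n)
	}
}
```

The sketch is only meant to show the kind of load involved: each round issues N PutObject calls in parallel against the same cluster, which is enough to reproduce the quorum errors shown in the output above starting at N=2.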