From bacebcfbf117286e4068728bfea78b2fb6c9e1bf Mon Sep 17 00:00:00 2001
From: Alex Auvolat
Date: Wed, 28 Sep 2022 15:45:32 +0200
Subject: [PATCH] Fixes until "myriads of objects"

---
 content/blog/2022-perf/index.md | 55 +++++++++++++++++----------------
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md
index 83fe06d..535cd16 100644
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -97,7 +97,7 @@ and the moment where the first bytes of the returned object are received by the
 Second, we will evaluate generic throughput, to understand how well Garage can
 leverage the underlying machine's performances.

-**Time To First Byte** - One specificity of Garage is that we implemented S3
+**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
 web endpoints, with the idea to make it a platform of choice to publish static
 websites. When publishing a website, TTFB can be directly observed by the end
 user, as it will impact the perceived reactivity of the websites.
@@ -136,46 +136,47 @@ whose results are shown on the following figure:

 ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)

-Garage v0.7, which does not support block streaming, features TTFB between 1.6s
-and 2s, which corresponds to the theoretical time to transfer the full block.
-On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the
-streaming feature (the lowest value is 43 ms). Minio sits between the two
+Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
+and 2s, which corresponds to the time to transfer the full block which we calculated before.
+On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
+streaming feature (the lowest value is 43ms). Minio sits between the two
 Garage versions: we suppose that it does some form of batching, but smaller
 than 1MB.

 **Throughput** - As soon as we publicly released Garage, people started
-benchmarking it, comparing its performances to writing directly on the
+benchmarking it, comparing its performances with writing directly on the
 filesystem, and observed that Garage was slower (eg.
 [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
-situation, we put costly processing like hashing on a dedicated thread and did
-many compute optimization
+situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
+and many others
 ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
 [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
-`v0.8 beta 1`. We also noted logic we wrote (to better control resource usage
-and detect errors, like semaphores or timeouts) was artificially limiting
+version 0.8 "Beta 1". We also noticed that some of the logic we wrote
+to better control resource usage
+and detect errors, like semaphores or timeouts, was artificially limiting
 performances. In another iteration, we made this logic less restrictive at the
 cost of higher resource consumption under load
 ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
-`v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we
+version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
 write a block. We know that this is expensive and did a test build without any
 `fsync` call ([see the
 commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
 that will not be merged, just to assess the impact of `fsync`. We refer to it
 as `no-fsync` in the following plot.

-*A note about fsync: for performance reasons, operating systems often do not
+*A note about `fsync`: for performance reasons, operating systems often do not
 write directly to the disk when a process creates or updates a file in your
-filesystem, instead, the write is kept in memory, and flushed later in a batch
-with other writes. If a power loss occurs before the OS has time to flush the
-writes on the disk, data will be lost. To ensure that a write is effectively
-written on disk, you must use the
-[fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it
-will block until your file or directory has been written from your volatile
-memory to your persisting storage device. Additionally, the exact semantic of
-fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
-and, even on battle-tested software like Postgres,
-[they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
-Note that on Garage, we are currently working on our "fsync" policy and thus, for
+filesystem. Instead, the write is kept in memory, and flushed later in a batch
+with other writes. If a power loss occurs before the OS has time to flush
+data to disk, some writes will be lost. To ensure that a write is effectively
+written on disk, the
+[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
+which blocks until the file or directory on which it is called has been written from volatile
+memory to the persistent storage device. Additionally, the exact semantic of
+`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
+and, even on battle-tested software like Postgres, it was
+["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
+Note that on Garage, we are currently working on our `fsync` policy and thus, for
 now, you should expect limited data durability in case of power loss, as we are
 aware of some inconsistency on this point (which we describe in the following
 and plan to solve).*
@@ -188,16 +189,16 @@ performance with a standardized and mixed workload.

 ![Plot showing IO perf of Garage configs and Minio](io.png)

-Minio, our ground truth, features the best performances in this test.
-Considering Garage, we observe that each improvement we made has a visible
-impact on its performances. We also note that we have a progress margin in
+Minio, our reference point, gives us the best performances in this test.
+Looking at Garage, we observe that each improvement we made has a visible
+impact on performances. We also note that we have a progress margin in
 terms of performances compared to Minio: additional benchmarks, tests, and
 monitoring could help better understand the remaining difference.


 ## A myriad of objects

-Object storage systems do not handle a single object but a myriad of them:
+Object storage systems do not handle a single object but huge numbers of them:
 Amazon claims to handle trillions of objects on their platform, and Red Hat
 communicates about Ceph being able to handle 10 billion objects. All these
 objects must be tracked efficiently in the system to be fetched, listed,