2022-09-29 11:16:04 +00:00
1 changed files with 86 additions and 86 deletions
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -8,8 +8,8 @@ date=2022-09-26
 their theoretical trade-offs for Garage. In particular, we pondered the impacts
 of data structures, networking methods, and scheduling algorithms.
 Garage worked well enough for our production
-cluster at Deuxfleurs, but we also knew that people started to discover some
-unexpected behaviors. We thus started a round of benchmarks and performance
+cluster at Deuxfleurs, but we also knew that people started to experience some
+unexpected behaviors, which motivated us to start a round of benchmarks and performance
 measurements to see how Garage behaves compared to our expectations.
 This post presents some of our first results, which cover
 3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
@@ -21,35 +21,35 @@ reflecting the high-level properties we are seeking.*

 ## ⚠️ Disclaimer

-The results presented in this blog post must be taken with a critical grain of salt due to some
+The results presented in this blog post must be taken with a (critical) grain of salt due to some
 limitations that are inherent to any benchmarking endeavor. We try to reference them as
-exhaustively as possible in this first section, but other limitations might exist.
+exhaustively as possible here, but other limitations might exist.

-Most of our tests were made on simulated networks, which by definition cannot represent all the
-diversity of real networks (dynamic drop, jitter, latency, all of which could be
+Most of our tests were made on _simulated_ networks, which by definition cannot represent all the
+diversity of _real_ networks (dynamic drop, jitter, latency, all of which could be
 correlated with throughput or any other external event). We also limited
 ourselves to very small workloads that are not representative of a production
 cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
-our results are thus not an evaluation of the performance of Garage as a whole.
+our results are not an evaluation of the performance of Garage as a whole.

 For some benchmarks, we used Minio as a reference. It must be noted that we did
 not try to optimize its configuration as we have done for Garage, and more
-generally, we have way less knowledge on Minio than on Garage, which can lead
+generally, we know significantly less about Minio's internals than about Garage's, which could lead
 to underrated performance measurements for Minio. It must also be noted that
 Garage and Minio are systems with different feature sets. For instance, Minio supports
-erasure coding for higher data density, which Garage doesn't, Minio implements
+erasure coding for higher data density, whereas Garage doesn't; Minio implements
 way more S3 endpoints than Garage, etc. Such features necessarily have a cost
 that you must keep in mind when reading the plots we will present. You should consider
-results on Minio as a way to contextualize our results on Garage, to see that our improvements
-are not artificial in the light of existing object storage implementations.
+Minio's results as a way to contextualize Garage's numbers, to justify that our improvements
+are not simply artificial in the light of existing object storage implementations.

 The impact of the testing environment is also not evaluated (kernel patches,
 configuration, parameters, filesystem, hardware configuration, etc.). Some of
 these parameters could favor one configuration or software product over another.
 Especially, it must be noted that most of the tests were done on a
-consumer-grade PC with only an SSD, which is different from most
+consumer-grade PC with only an SSD, which is different from most
 production setups. Finally, our results are also provided without statistical
-tests to check their significance, and might thus have insufficient significance
+tests to validate their significance, and might have insufficient grounds
 to be claimed as reliable.

 When reading this post, please keep in mind that **we are not making any
@@ -75,16 +75,16 @@ geo-distributed topology. We used the Grid5000 testbed only during our
 preliminary tests to identify issues when running Garage on many powerful
 servers. We then  reproduced these issues in a controlled environment
 outside of Grid5000, so don't be
-surprised then if Grid5000 is not mentioned often on our plots.
+surprised then if Grid5000 is not always mentioned on our plots.

 To reproduce some environments locally, we have a small set of Python scripts
 called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
-needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
+needs[^ref1]. Most of the following tests were run locally with `mknet` on a
 single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
-RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
+RAM and a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
 used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
-`vm.dirty_ratio` has been reduced to `2` and `1` respectively as, with default
-values, the system tends to freeze when it is under heavy I/O load.
+`vm.dirty_ratio` have been reduced to `2` and `1` respectively: with default
+values, the system tends to freeze under heavy I/O load.

 ## Efficient I/O

@@ -93,7 +93,7 @@ across the network, and the faster these two functions can be accomplished,
 the more efficient the system as a whole will be. For this analysis, we focus on
 2 aspects of performance. First, since many applications can start processing a file
 before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
-on GetObject requests, i.e. the duration between the moment a request is sent
+on `GetObject` requests, i.e. the duration between the moment a request is sent
 and the moment where the first bytes of the returned object are received by the client.
 Second, we will evaluate generic throughput, to understand how well
 Garage can leverage the underlying machine's performance.
@@ -101,18 +101,18 @@ Garage can leverage the underlying machine's performance.
 **Time-to-First-Byte** - One specificity of Garage is that we implemented S3
 web endpoints, with the idea to make it a platform of choice to publish
 static websites. When publishing a website, TTFB can be directly observed
-by the end user, as it will impact the perceived reactivity of the website.
+by the end user, as it will impact the perceived reactivity of the page being loaded.

 Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
 This can be explained by the fact that Garage was not able to handle data internally
-at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
+at a smaller granularity level than entire data blocks, which are chunks of up to 1MB of a given object
 (a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
-Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
-1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
-request, the first block had to be fully retrieved by the gateway node from the
-storage node before starting to send any data to the client.
+Let us take the example of a 4.5MB object, which Garage will split by default into four 1MB blocks and one 0.5MB block.
+With the old design, when you were sending a `GET`
+request, the first block had to be _fully_ retrieved by the gateway node from the
+storage node before it could start sending any data to the client.

-With Garage v0.8, we integrated a data streaming logic that allows the gateway
+With Garage v0.8, we added data streaming logic that allows the gateway
 to send the beginning of a block without having to wait for the full block to be received from
 the storage node. We can visually represent the difference as follows:

@@ -120,13 +120,13 @@ the storage node. We can visually represent the difference as follows:
 <img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
 </center>

-As our default block size is only 1MB, the difference should be very small on
+As our default block size is only 1MB, the difference should be marginal on
 fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
-thus adding at most 8ms of latency to a GetObject request (assuming no other
+adding at most 8ms of latency to a `GetObject` request (assuming no other
 data transfer is happening in parallel). However,
 on a very slow network, or a very congested link with many parallel requests
-handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
-to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
+handled, the impact can be much more important: on a 5Mbps network, it takes at least 1.6 seconds
+to transfer our 1MB block, and streaming will heavily improve user experience.

 We wanted to see if this theory holds in practice: we simulated a low latency
 but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
@@ -138,11 +138,11 @@ whose results are shown on the following figure:
 ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)

 Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
-and 2s, which corresponds to the time to transfer the full block which we calculated above.
+and 2s, which matches the time required to transfer the full block which we calculated above.
 On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
 streaming feature (the lowest value is 43ms). Minio sits between the two
 Garage versions: we suppose that it does some form of batching, but smaller
-than 1MB.
+than our initial 1MB default.

 **Throughput** - As soon as we publicly released Garage, people started
 benchmarking it, comparing its performances to writing directly on the
@@ -152,7 +152,7 @@ situation, we did some optimizations, such as putting costly processing like has
 and many others
 ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
 [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
-version 0.8 "Beta 1". We also noticed that some of the logic logic we wrote
+version 0.8 "Beta 1". We also noticed that some of the logic we wrote
 to better control resource usage
 and detect errors, including semaphores and timeouts, was artificially limiting
 performances. In another iteration, we made this logic less restrictive at the
@@ -162,7 +162,7 @@ version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time
 write a block. We know that this is expensive and did a test build without any
 `fsync` call ([see the
 commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
-that will not be merged, just to assess the impact of `fsync`. We refer to it
+that will not be merged, only to assess the impact of `fsync`. We refer to it
 as `no-fsync` in the following plot.

 *A note about `fsync`: for performance reasons, operating systems often do not
@@ -172,12 +172,12 @@ with other writes. If a power loss occurs before the OS has time to flush
 data to disk, some writes will be lost. To ensure that a write is effectively
 written to disk, the
 [`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
-which blocks until the file or directory on which it is called has been flushed from volatile
+which effectively blocks until the file or directory on which it is called has been flushed from volatile
 memory to the persistent storage device. Additionally, the exact semantic of
 `fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
 and, even on battle-tested software like Postgres, it was
 ["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
-Note that on Garage, we are currently working on our `fsync` policy and thus, for
+Note that on Garage, we are still working on our `fsync` policy and thus, for
 now, you should expect limited data durability in case of power loss, as we are
 aware of some inconsistencies on this point (which we describe in the following
 and plan to solve).*
@@ -194,14 +194,14 @@ Minio, our reference point, gives us the best performances in this test.
 Looking at Garage, we observe that each improvement we made had a visible
 impact on performances. We also note that we have a progress margin in
 terms of performances compared to Minio: additional benchmarks, tests, and
-monitoring could help us better understand the remaining difference.
+monitoring could help us better understand the remaining gap.


 ## A myriad of objects

 Object storage systems do not handle a single object but huge numbers of them:
 Amazon claims to handle trillions of objects on their platform, and Red Hat
-communicates about Ceph being able to handle 10 billion objects. All these
+touts Ceph as being able to handle 10 billion objects. All these
 objects must be tracked efficiently in the system to be fetched, listed,
 removed, etc. In Garage, we use a "metadata engine" component to track them.
 For this analysis, we compare different metadata engines in Garage and see how
@@ -214,25 +214,25 @@ the only supported option was [sled](https://sled.rs/), but we started having
 serious issues with it - and we were not alone
 ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
 v0.8, we introduce an abstraction semantic over the features we expect from our
-database, allowing us to switch from one back-end to another without touching
+database, allowing us to switch from one metadata back-end to another without touching
 the rest of our codebase. We added two additional back-ends: LMDB
 (through [heed](https://github.com/meilisearch/heed)) and SQLite
 (using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
-are both experimental: contrarily to sled, we have never run them in production
-for a long time.**
+are both experimental: contrary to sled, we have yet to run them in production
+for a significant time.**

 Similarly to the impact of `fsync` on block writing, each database engine we use
-has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
+has its own `fsync` policy. Sled flushes its writes every 2 seconds by
 default (this is
 [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
-LMDB by default does an `fsync` on each write, which on early tests led to
+LMDB defaults to an `fsync` on each write, which in early tests led to
 abysmal performance. We thus added 2 flags,
 [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
 and
 [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
-to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
+to deactivate `fsync` entirely. On SQLite, it is also possible to deactivate `fsync` with
 `pragma synchronous = off`, but we have not started any optimization work on it yet:
-our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
+our SQLite implementation currently still calls `fsync` for all write operations. Additionally, we are
 using these engines through Rust bindings that do not support async Rust,
 with which Garage is built, which has an impact on performance as well.
 **Our comparison will therefore not reflect the raw performances of
@@ -242,20 +242,20 @@ Still, we think it makes sense to evaluate our implementations in their current
 state in Garage. We designed a benchmark that is intensive on the metadata part
 of the software, i.e. handling large numbers of tiny files. We chose again
 `minio/warp` as a benchmark tool, but we
-configured it here with the smallest possible object size it supported, 256
-bytes, to put some pressure on the metadata engine. We evaluated sled twice:
+configured it with the smallest possible object size it supported, 256
+bytes, to put pressure on the metadata engine. We evaluated sled twice:
 with its default configuration, and with a configuration where we set a flush
-interval of 10 minutes to disable `fsync`.
+interval of 10 minutes (longer than the test) to disable `fsync`.

-*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
+*Note that S3 has not been designed for workloads that store huge numbers of small objects;
 a regular database, like Cassandra, would be more appropriate. This test has
-only been designed to stress our metadata engine, and is not indicative of
+only been designed to stress our metadata engine and is not indicative of
 real-world performances.*

 ![Plot of our metadata engines comparison with Warp](db_engine.png)

-Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
-the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
+Unsurprisingly, we observe abysmal performances with SQLite, as it is the engine we have not yet worked on,
+and it still does an `fsync` for each write. Garage with the `fsync`-disabled LMDB backend performs twice as well as
 with sled in its default version and 60% better than the "no `fsync`" sled version in our
 benchmark.  Furthermore, and not depicted on these plots, LMDB uses way less
 disk storage and RAM; we would like to quantify that in the future.  As we are
@@ -263,7 +263,7 @@ only at the very beginning of our work on metadata engines, it is hard to draw
 strong conclusions.  Still, we can say that SQLite is not ready for production
 workloads, and that LMDB looks very promising both in terms of performances and resource
 usage, and is a very good candidate for being Garage's default metadata engine in
-future releases. In the future, we will need to define a data policy for Garage to help us
+future releases, once we figure out the proper `fsync` tuning. In the future, we will need to define a data policy for Garage to help us
 arbitrate between performance and durability.

 *To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
@@ -300,35 +300,35 @@ We wrote our own benchmarking tool for this test,
 [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
 The benchmark procedure consists in 
 concurrently sending a defined number of tiny objects (8192 objects of 16
-bytes by default) and measuring the time it takes. This step is then repeated a given
+bytes by default) and measuring the wall-clock time until the last object is uploaded. This step is then repeated a given
 number of times (128 by default) to effectively create a target number of
 objects on the cluster (1M by default). On our local setup with 3
 nodes, both Minio and Garage with LMDB were able to achieve this target. In the
 following plot, we show how much time it took Garage and Minio to handle
 each batch.

-Before looking at the plot, **you must keep in mind some important points about
+Before looking at the plot, **you must keep in mind some important points regarding
 the internals of both Minio and Garage**.

 Minio has no metadata engine, it stores its objects directly on the filesystem.
 Sending 1 million objects on Minio results in creating one million inodes on
 the storage server in our current setup. So the performances of the filesystem
-probably have a substantial impact on the results we observe.
+probably have a substantial impact on the observed results.
 In our precise setup, we know that the
 filesystem we used is not adapted at all for Minio (encryption layer, fixed
 number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
 `fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
 creation of objects. Finally, object storage is designed for big objects, for which the
-costs measured here are negligible. In the end, again, we use Minio as a
+costs measured here are negligible. In the end, again, we use Minio only as a
 reference point to understand what performance budget we have for each part of our
 software.

 Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
 not created on the filesystem but the object is directly stored inline in the
 metadata engine.  In the future, we plan to evaluate how Garage behaves at scale with
->3KB objects, which we expect to be way closer to Minio, as it will have to create
+objects above 3KB, which we expect to be way closer to Minio, as it will have to create
 at least one inode per object. For now, we limit ourselves to evaluating our
-metadata engine and thus focus only on 16-byte objects.
+metadata engine and focus only on 16-byte objects.

 ![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)

@@ -339,27 +339,27 @@ time to complete a batch of inserts is constant, while on Garage it still increa
 It could be interesting to know if Garage's batch completion time would cross Minio's one
 for a very large number of objects.  If we reason per object, both Minio's and
 Garage's performances remain very good: it takes respectively around 20ms and
-5ms to create an object.  At 100 Mbps, the upload of a 10MB file takes
+5ms to create an object.  In a real-world scenario, at 100 Mbps, the upload of a 10MB file takes
 800ms, and goes up to 8sec for a 100MB file: in both cases
-handling the object metadata is only a fraction of the upload time.  The
+handling the object metadata would be only a fraction of the upload time.  The
 only cases where a difference would be noticeable would be when uploading a lot of very
-small files at once, which again is an unusual usage of the S3 API.
+small files at once, which again would be an unusual usage of the S3 API.

 Let us now focus on Garage's metrics only to better see its specific behavior:

 ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)

 Two effects are now more visible: 1., batch completion time increases with the
-number of objects in the bucket and 2., measurements are dispersed, at least
-more than for Minio. We expect this batch completion time increase to be logarithmic,
-but we don't have enough data points to conclude safety: additional
+number of objects in the bucket and 2., measurements are scattered, at least
+more than for Minio. We expected this batch completion time increase to be logarithmic,
+but we don't have enough data points to confidently conclude that this is the case: additional
 measurements are needed. Concerning the observed instability, it could
-be a symptom of what we saw with some other experiments in this machine,
+be a symptom of what we saw with some other experiments on this setup,
 which sometimes freezes under heavy I/O load.  Such freezes could lead to
 request timeouts and failures. If this occurs on our testing computer, it might
 occur on other servers as well: it would be interesting to better understand this
-issue, document how to avoid it, and potentially change how we handle our I/O
-internally in Garage.  But still, this was a very stressful test that will probably not be encountered in
+issue, document how to avoid it, and potentially change how we handle I/O
+internally in Garage.  But still, this was a very heavy test that will probably not be encountered in
 many setups: we were adding 273 objects per second for 30 minutes straight!

 To conclude this part, Garage can ingest 1 million tiny objects while remaining
@@ -382,45 +382,45 @@ core value proposition of Garage.  For example, our production cluster is
 hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
 behind consumer-grade fiber links across France and Belgium (if you are reading this,
 congratulations, you fetched this webpage from it!). That's why we are very
-careful that our internal protocol (named RPC protocol in our documentation)
+careful that our internal protocol (referred to as "RPC protocol" in our documentation)
 remains as lightweight as possible. For this analysis, we quantify how network
-latency and the number of nodes in the cluster impact the duration of the most
+latency and number of nodes in the cluster impact the duration of the most
 important kinds of S3 requests.

 **Latency amplification** - With the kind of networks we use (consumer-grade
 fiber links across the EU), the observed latency between nodes is in the 50ms range.
 When latency is not negligible, you will observe that request completion
 time is a factor of the observed latency. That's to be expected: in many cases, the
-node of the cluster you are contacting can not directly answer your request, and
-has to reach other nodes of the cluster to get the requested information. Each
+node of the cluster you are contacting cannot directly answer your request, and
+has to reach other nodes of the cluster to get the data. Each
 of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
 expensive.  This ratio between request duration and network latency is what we
 refer to as *latency amplification*.

-For example, on Garage, a GetObject request does two sequential calls: first,
-it fetches the descriptor of the requested object, which contains a reference
+For example, on Garage, a `GetObject` request does two sequential calls: first,
+it fetches from the metadata engine the descriptor of the requested object, which contains a reference
 to the first block of data, and then only in a second step it can start retrieving data blocks
 from storage nodes.  We can therefore expect that the
-request duration of a small GetObject request will be close to twice the
+request duration of a small `GetObject` request will be close to twice the
 network latency.

 We tested the latency amplification theory with another benchmark of our own named
 [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
-which does a single request at a time on an endpoint and measures its response
+which does a single request at a time on an endpoint and measures the response
 time. As we are not interested in bandwidth but latency, all our requests
-involving an object are made on a tiny file of around 16 bytes. Our benchmark
+involving objects are made on tiny files of around 16 bytes. Our benchmark
 tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
 RemoveObject. Here are the results:


 ![Latency amplification](amplification.png)

-As Garage has been optimized for this use case from the beginning, we don't see
+As Garage has been optimized for this use case from the very beginning, we don't see
 any significant evolution from one version to another (Garage v0.7.3 and Garage
 v0.8.0 Beta 1 here).  Compared to Minio, these values are either similar (for
-ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
-RemoveObject).  This can be easily understood by the fact that Minio has not been designed for
-environments with high latencies. Instead, it expects to run on clusters that are built
+ListObjects and ListBuckets) or significantly better (for GetObject, PutObject, and
+RemoveObject).  This can be easily explained by the fact that Minio has not been designed with
+high-latency environments in mind. Instead, it is expected to run on clusters that are built
 in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
 [bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
 feature.
@@ -488,7 +488,7 @@ terabytes of data and billions of objects on long-lasting experiments.

 In the meantime, stay tuned: we have released
 [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
-and are already working on a number of features for the next version.
+and are already working on several features for the next version.
 For instance, we are working on a new layout that will have enhanced optimality properties,
 as well as a theoretical proof of correctness
 ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also