2022-09-29 11:16:04 +00:00
2 changed files with 69 additions and 59 deletions
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -1,16 +1,17 @@
 +++
-title="Bringing theoretical design and observed performances face to face"
+title="Confronting theoretical design with observed performances"
 date=2022-09-26
 +++
-*For the past years, we have extensively analyzed possible design decisions and
+*During the past years, we have extensively analyzed possible design decisions and
-their theoretical tradeoffs on Garage, especially on the network, data
+their theoretical trade-offs for Garage, especially concerning networking, data
-structure, or scheduling side. And it worked well enough for our production
+structures, and scheduling. Garage worked well enough for our production
-cluster at Deuxfleurs, but we also knew that people started discovering some
+cluster at Deuxfleurs, but we also knew that people started to discover some
 unexpected behaviors. We thus started a round of benchmark and performance
-measurements to see how Garage behaves compared to our expectations. We split
+measurements to see how Garage behaves compared to our expectations.
-them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency"
+This post presents some of our first results, which cover
 3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
 to reflect the high-level properties we are seeking.*
 <!-- more -->
@@ -21,61 +22,63 @@ to reflect the high-level properties we are seeking.*
 The following results must be taken with a critical grain of salt due to some
 limitations that are inherent to any benchmark. We try to reference them as
-exhaustively as possible in this section, but other limitations might exist. 
+exhaustively as possible in this first section, but other limitations might exist. 
-Most of our tests are done on simulated networks that can not represent all the
+Most of our tests were made on simulated networks, which by definition cannot represent all the
-diversity of real networks (dynamic drop, jitter, latency, all of them could be
+diversity of real networks (dynamic drop, jitter, latency, all of which could be
 correlated with throughput or any other external event). We also limited
 ourselves to very small workloads that are not representative of a production
 cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
-our results are thus not an overview of the whole software performance.
+our results are thus not an evaluation of the performance of Garage as a whole.
 For some benchmarks, we used Minio as a reference. It must be noted that we did
 not try to optimize its configuration as we have done on Garage, and more
 generally, we have way less knowledge on Minio than on Garage, which can lead
 to underrated performance measurements for Minio.  It must also be noted that
-Garage and Minio are systems with different feature sets, *eg.* Minio supports
+Garage and Minio are systems with different feature sets. For instance, Minio supports
-erasure coding for better data density while Garage doesn't, Minio implements
+erasure coding for higher data density, which Garage doesn't; Minio implements
-way more S3 endpoints than Garage, etc. Such features have necessarily a cost
+way more S3 endpoints than Garage, etc. Such features necessarily have a cost
-that you must keep in mind when reading plots. You should consider Minio
+that you must keep in mind when reading the plots we will present. You should consider
-results as a way to contextualize our results, to check that our improvements
+results on Minio as a way to contextualize our results on Garage, to see that our improvements
-are not artificials compared to existing object storage implementations.
+are not artificial compared to existing object storage implementations.
 The impact of the testing environment is also not evaluated (kernel patches,
-configuration, parameters, filesystem, hardware configuration, etc.), some of
+configuration, parameters, filesystem, hardware configuration, etc.). Some of
-these configurations could favor one configuration/software over another.
+these parameters could favor one configuration or software product over another.
 Especially, it must be noted that most of the tests were done on a
-consumer-grade computer and SSD only, which will be different from most
+consumer-grade PC with only an SSD, which is different from most
 production setups. Finally, our results are also provided without statistical
-tests to check their significance, and thus might be statistically not
+tests to check their significance, and might thus have insufficient significance
-significant.
+to be claimed as reliable.
 When reading this post, please keep in mind that **we are not making any
-business or technical recommendations here, this is not a scientific paper
+business or technical recommendations here, and this is not a scientific paper
 either**; we only share bits of our development process as honestly as
-possible.  Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
+possible.
-make your own
+Make your own tests if you need to make a decision,
-tests if you need to take a decision, and remain supportive and caring with
+remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
-your peers...
+and to remain supportive and caring with your peers ;)
 ## About our testing environment
-We started a batch of tests on
+We made a first batch of tests on
 [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
-testbed for experiment-driven research in all areas of computer science, under
+testbed for experiment-driven research in all areas of computer science,
-the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program.
+which has an
 [Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
 During our tests, we used part of the following clusters:
 [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
 [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
-[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a
+[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
 geo-distributed topology. We used the Grid5000 testbed only during our
 preliminary tests to identify issues when running Garage on many powerful
-servers, issues that we then reproduced in a controlled environment; don't be
+servers. We then reproduced these issues in a controlled environment
 outside of Grid5000, so don't be
 surprised then if Grid5000 is not mentioned often on our plots.
 To reproduce some environments locally, we have a small set of Python scripts
-named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
+called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
-needs[^ref1]. Most of the following tests were thus run locally with mknet on a
+needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
 single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
 RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
 used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
@@ -84,24 +87,27 @@ values, the system tends to freeze when it is under heavy I/O load.
 ## Efficient I/O
-The main goal of an object storage system is to store or retrieve an object
+The main purpose of an object storage system is to store and retrieve objects
-across the network, and the faster, the better.  For this analysis, we focus on
+across the network, and the faster these two functions can be accomplished,
-2 aspects: time to first byte, as many applications can start processing a file
+the more efficient the system as a whole will be.  For this analysis, we focus on
-before receiving it completely, and generic throughput, to understand how well
+2 aspects of performance. First, since many applications can start processing a file
-Garage can leverage the underlying machine performances.
+before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
-
+on GetObject requests, i.e. the duration between the moment a request is sent
 and the moment where the first bytes of the returned object are received by the client.
 Second, we will evaluate generic throughput, to understand how well
 Garage can leverage the underlying machine's performances.
 **Time To First Byte** - One specificity of Garage is that we implemented S3
-web endpoints, with the idea to make it the platform of choice to publish your
+web endpoints, with the idea to make it a platform of choice to publish
-static website. When publishing a website, one metric you observe is Time To
+static websites. When publishing a website, TTFB can be directly observed
-First Byte (TTFB), as it will impact the perceived reactivity of your website.
+by the end user, as it will impact the perceived reactivity of the websites.
 On Garage, time to first byte was a bit high. 
-This is not surprising as, until now, the smallest level of granularity
+Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
-internally was handling full blocks. Blocks are 1MB chunks (this is
+This can be explained by the fact that Garage was not able to handle data internally
-[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size))
+at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
-of a given object. For example, a 4.5MB object will be split into 4 blocks of
+(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
-1MB and 1 block of 0.5MB. With this design, when you were sending a GET
+Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
 1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
 request, the first block had to be fully retrieved by the gateway node from the
 storage node before starting to send any data to the client. 
@@ -109,20 +115,24 @@ With Garage v0.8, we integrated a block streaming logic that allows the gateway
 to send the beginning of a block without having to wait for the full block from
 the storage node. We can visually represent the difference as follow:
-![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
+<center>
 <img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
 </center>
-As our default block size is only 1MB, the difference will be very small on
+As our default block size is only 1MB, the difference should be very small on
-fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However,
+fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
-on a very slow network (or a very congested link with many parallel requests
+thus adding at most 8ms of latency to a GetObject request (assuming no other
-handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds
+data transfer is happening in parallel). However,
-to transfer our 1MB block, and streaming could heavily improve user experience. 
+on a very slow network, or a very congested link with many parallel requests
 handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
 to transfer our 1MB block, and streaming has the potential of heavily improving user experience. 
 We wanted to see if this theory holds in practice: we simulated a low latency
-but slow network on mknet and did some requests with (garage v0.8 beta) and
+but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
-without (garage v0.7) block streaming. We also added Minio as a reference. To
+without (Garage v0.7.3). We also added Minio as a reference. To
 benchmark this behavior, we wrote a small test named
 [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
-its results are depicted in the following figure.
+whose results are shown in the following figure:
 ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
--- a/templates/section.html
+++ b/templates/section.html
@@ -42,7 +42,7 @@
          </div>
          <div class="content mt-2">
            <div class="text-gray-700 text-lg not-italic">
-              {{ page.summary | safe | striptags }}
+              {{ page.summary | striptags | safe }}
            </div>
            <a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
              <div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>