diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md
index bc91404..83fe06d 100644
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -1,16 +1,17 @@
 +++
-title="Bringing theoretical design and observed performances face to face"
+title="Confronting theoretical design with observed performances"
 date=2022-09-26
 +++
 
-*For the past years, we have extensively analyzed possible design decisions and
-their theoretical tradeoffs on Garage, especially on the network, data
-structure, or scheduling side. And it worked well enough for our production
-cluster at Deuxfleurs, but we also knew that people started discovering some
+*Over the past years, we have extensively analyzed possible design decisions and
+their theoretical trade-offs for Garage, especially concerning networking, data
+structures, and scheduling. Garage worked well enough for our production
+cluster at Deuxfleurs, but we also knew that people started to discover some
 unexpected behaviors. We thus started a round of benchmark and performance
-measurements to see how Garage behaves compared to our expectations. We split
-them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency"
+measurements to see how Garage behaves compared to our expectations.
+This post presents some of our first results, which cover
+3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
 to reflect the high-level properties we are seeking.*
 
 
@@ -21,61 +22,63 @@ to reflect the high-level properties we are seeking.*
 
 The following results must be taken with a critical grain of salt due to some
 limitations that are inherent to any benchmark. We try to reference them as
-exhaustively as possible in this section, but other limitations might exist.
+exhaustively as possible in this first section, but other limitations might exist.
 
-Most of our tests are done on simulated networks that can not represent all the
-diversity of real networks (dynamic drop, jitter, latency, all of them could be
+Most of our tests were run on simulated networks, which by definition cannot represent all the
+diversity of real networks (dynamic drop, jitter, latency, all of which could be
 correlated with throughput or any other external event). We also limited
 ourselves to very small workloads that are not representative of a production
 cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
-our results are thus not an overview of the whole software performance.
+our results are thus not an evaluation of the performance of Garage as a whole.
 
 For some benchmarks, we used Minio as a reference. It must be noted that we did
 not try to optimize its configuration as we have done on Garage, and more
 generally, we have way less knowledge on Minio than on Garage, which can lead
 to underrated performance measurements for Minio. It must also be noted that
-Garage and Minio are systems with different feature sets, *eg.* Minio supports
-erasure coding for better data density while Garage doesn't, Minio implements
-way more S3 endpoints than Garage, etc. Such features have necessarily a cost
-that you must keep in mind when reading plots. You should consider Minio
-results as a way to contextualize our results, to check that our improvements
-are not artificials compared to existing object storage implementations.
+Garage and Minio are systems with different feature sets. For instance, Minio supports
+erasure coding for higher data density (which Garage doesn't), implements
+way more S3 endpoints than Garage, etc. Such features necessarily have a cost
+that you must keep in mind when reading the plots we will present. You should consider
+results on Minio as a way to contextualize our results on Garage, to see that our improvements
+are not artificial compared to existing object storage implementations.
 
 The impact of the testing environment is also not evaluated (kernel patches,
-configuration, parameters, filesystem, hardware configuration, etc.), some of
-these configurations could favor one configuration/software over another.
+configuration, parameters, filesystem, hardware configuration, etc.). Some of
+these parameters could favor one configuration or software product over another.
 Especially, it must be noted that most of the tests were done on a
-consumer-grade computer and SSD only, which will be different from most
+consumer-grade PC with only an SSD, which is different from most
 production setups. Finally, our results are also provided without statistical
-tests to check their significance, and thus might be statistically not
-significant.
+tests to check their significance, and might thus not be significant enough
+to be considered reliable.
 
 When reading this post, please keep in mind that **we are not making any
-business or technical recommendations here, this is not a scientific paper
+business or technical recommendations here, and this is not a scientific paper
 either**; we only share bits of our development process as honestly as
-possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
-make your own
-tests if you need to take a decision, and remain supportive and caring with
-your peers...
+possible.
+Run your own tests if you need to make a decision,
+remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
+and to remain supportive and caring with your peers ;)
 
 ## About our testing environment
 
-We started a batch of tests on
+We ran a first batch of tests on
 [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
-testbed for experiment-driven research in all areas of computer science, under
-the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program.
+testbed for experiment-driven research in all areas of computer science,
+which has an
+[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
 During our tests, we used part of the following clusters:
 [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
 [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
-[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a
+[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
 geo-distributed topology. We used the Grid5000 testbed only during our
 preliminary tests to identify issues when running Garage on many powerful
-servers, issues that we then reproduced in a controlled environment; don't be
+servers. We then reproduced these issues in a controlled environment
+outside of Grid5000, so don't be
 surprised then if Grid5000 is not mentioned often on our plots.
 
 To reproduce some environments locally, we have a small set of Python scripts
-named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
-needs[^ref1]. Most of the following tests were thus run locally with mknet on a
+called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
+needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
 single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
 RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
 used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
@@ -84,24 +87,27 @@ values, the system tends to freeze when it is under heavy I/O load.
 
 ## Efficient I/O
 
-The main goal of an object storage system is to store or retrieve an object
-across the network, and the faster, the better. For this analysis, we focus on
-2 aspects: time to first byte, as many applications can start processing a file
-before receiving it completely, and generic throughput, to understand how well
-Garage can leverage the underlying machine performances.
-
+The main purpose of an object storage system is to store and retrieve objects
+across the network, and the faster these two functions can be accomplished,
+the more efficient the system as a whole will be. For this analysis, we focus on
+2 aspects of performance. First, since many applications can start processing a file
+before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
+on GetObject requests, i.e. the duration between the moment a request is sent
+and the moment when the first bytes of the returned object are received by the client.
+Second, we will evaluate generic throughput, to understand how well
+Garage can leverage the underlying machine's performances.
 
 **Time To First Byte** - One specificity of Garage is that we implemented S3
-web endpoints, with the idea to make it the platform of choice to publish your
-static website. When publishing a website, one metric you observe is Time To
-First Byte (TTFB), as it will impact the perceived reactivity of your website.
-On Garage, time to first byte was a bit high.
+web endpoints, with the idea to make it a platform of choice to publish
+static websites. When publishing a website, TTFB can be directly observed
+by the end user, as it will impact the perceived reactivity of the website.
 
-This is not surprising as, until now, the smallest level of granularity
-internally was handling full blocks. Blocks are 1MB chunks (this is
-[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size))
-of a given object. For example, a 4.5MB object will be split into 4 blocks of
-1MB and 1 block of 0.5MB. With this design, when you were sending a GET
+Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
+This can be explained by the fact that Garage was not able to handle data internally
+at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
+(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
+Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
+1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
 request, the first block had to be fully retrieved by the gateway node from
 the storage node before starting to send any data to the client.
 
@@ -109,20 +115,24 @@ With Garage v0.8, we integrated a block streaming logic that allows the gateway
 to send the beginning of a block without having to wait for the full block from
 the storage node. We can visually represent the difference as follows:
 
-![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
+
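
The block-splitting arithmetic described in the diff above (a 4.5MB object becoming 4 blocks of 1MB plus 1 block of 0.5MB) can be illustrated with a minimal Rust sketch. This is not Garage's actual implementation: the `BLOCK_SIZE` constant and the `split_into_blocks` helper are assumptions made purely for illustration, taking the post's "1MB" figure literally as 1,000,000 bytes.

```rust
// Illustrative sketch only (not Garage's code): cut an object into
// fixed-size blocks, the last block possibly being smaller.
const BLOCK_SIZE: usize = 1_000_000; // "1MB", configurable in Garage

fn split_into_blocks(object: &[u8]) -> Vec<&[u8]> {
    // `chunks` yields BLOCK_SIZE-byte slices; the final one may be shorter.
    object.chunks(BLOCK_SIZE).collect()
}

fn main() {
    let object = vec![0u8; 4_500_000]; // a 4.5MB object
    let blocks = split_into_blocks(&object);
    assert_eq!(blocks.len(), 5); // 4 blocks of 1MB + 1 block of 0.5MB
    assert_eq!(blocks.last().unwrap().len(), 500_000);
    println!("{} blocks, last block: {} bytes",
             blocks.len(), blocks.last().unwrap().len());
}
```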