Fixes up to TTFB plot

Alex 2022-09-28 15:31:22 +02:00
parent 59afb0d32b
commit 6133fcd3ca
Signed by: lx
GPG key ID: 0E496D15096376BE
2 changed files with 69 additions and 59 deletions

@ -1,16 +1,17 @@
+++ +++
title="Bringing theoretical design and observed performances face to face" title="Confronting theoretical design with observed performances"
date=2022-09-26 date=2022-09-26
+++ +++
*For the past years, we have extensively analyzed possible design decisions and *During the past years, we have extensively analyzed possible design decisions and
their theoretical tradeoffs on Garage, especially on the network, data their theoretical trade-offs for Garage, especially concerning networking, data
structure, or scheduling side. And it worked well enough for our production structures, and scheduling. Garage worked well enough for our production
cluster at Deuxfleurs, but we also knew that people started discovering some cluster at Deuxfleurs, but we also knew that people started to discover some
unexpected behaviors. We thus started a round of benchmark and performance unexpected behaviors. We thus started a round of benchmark and performance
measurements to see how Garage behaves compared to our expectations. We split measurements to see how Garage behaves compared to our expectations.
them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency" This post presents some of our first results, which cover
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
to reflect the high-level properties we are seeking.* to reflect the high-level properties we are seeking.*
<!-- more --> <!-- more -->
@ -21,61 +22,63 @@ to reflect the high-level properties we are seeking.*
The following results must be taken with a critical grain of salt due to some The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this section, but other limitations might exist. exhaustively as possible in this first section, but other limitations might exist.
Most of our tests are done on simulated networks that can not represent all the Most of our tests were made on simulated networks, which by definition cannot represent all the
diversity of real networks (dynamic drop, jitter, latency, all of them could be diversity of real networks (dynamic drop, jitter, latency, all of which could be
correlated with throughput or any other external event). We also limited correlated with throughput or any other external event). We also limited
ourselves to very small workloads that are not representative of a production ourselves to very small workloads that are not representative of a production
cluster. Furthermore, we only benchmarked some very specific aspects of Garage: cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
our results are thus not an overview of the whole software performance. our results are thus not an evaluation of the performance of Garage as a whole.
For some benchmarks, we used Minio as a reference. It must be noted that we did For some benchmarks, we used Minio as a reference. It must be noted that we did
not try to optimize its configuration as we have done on Garage, and more not try to optimize its configuration as we have done on Garage, and more
generally, we have way less knowledge on Minio than on Garage, which can lead generally, we have way less knowledge on Minio than on Garage, which can lead
to underrated performance measurements for Minio. It must also be noted that to underrated performance measurements for Minio. It must also be noted that
Garage and Minio are systems with different feature sets, *eg.* Minio supports Garage and Minio are systems with different feature sets. For instance Minio supports
erasure coding for better data density while Garage doesn't, Minio implements erasure coding for higher data density, which Garage doesn't, Minio implements
way more S3 endpoints than Garage, etc. Such features have necessarily a cost way more S3 endpoints than Garage, etc. Such features necessarily have a cost
that you must keep in mind when reading plots. You should consider Minio that you must keep in mind when reading the plots we will present. You should consider
results as a way to contextualize our results, to check that our improvements results on Minio as a way to contextualize our results on Garage, to see that our improvements
are not artificials compared to existing object storage implementations. are not artificial compared to existing object storage implementations.
The impact of the testing environment is also not evaluated (kernel patches, The impact of the testing environment is also not evaluated (kernel patches,
configuration, parameters, filesystem, hardware configuration, etc.), some of configuration, parameters, filesystem, hardware configuration, etc.). Some of
these configurations could favor one configuration/software over another. these parameters could favor one configuration or software product over another.
Especially, it must be noted that most of the tests were done on a Especially, it must be noted that most of the tests were done on a
consumer-grade computer and SSD only, which will be different from most consumer-grade PC with only an SSD, which is different from most
production setups. Finally, our results are also provided without statistical production setups. Finally, our results are also provided without statistical
tests to check their significance, and thus might be statistically not tests to check their significance, and might thus have insufficient significance
significant. to be claimed as reliable.
When reading this post, please keep in mind that **we are not making any When reading this post, please keep in mind that **we are not making any
business or technical recommendations here, this is not a scientific paper business or technical recommendations here, and this is not a scientific paper
either**; we only share bits of our development process as honestly as either**; we only share bits of our development process as honestly as
possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html), possible.
make your own Make your own tests if you need to take a decision,
tests if you need to take a decision, and remain supportive and caring with remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
your peers... and to remain supportive and caring with your peers ;)
## About our testing environment ## About our testing environment
We started a batch of tests on We made a first batch of tests on
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
testbed for experiment-driven research in all areas of computer science, under testbed for experiment-driven research in all areas of computer science,
the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. which has an
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
During our tests, we used part of the following clusters: During our tests, we used part of the following clusters:
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
geo-distributed topology. We used the Grid5000 testbed only during our geo-distributed topology. We used the Grid5000 testbed only during our
preliminary tests to identify issues when running Garage on many powerful preliminary tests to identify issues when running Garage on many powerful
servers, issues that we then reproduced in a controlled environment; don't be servers. We then reproduced these issues in a controlled environment
outside of Grid5000, so don't be
surprised then if Grid5000 is not mentioned often on our plots. surprised then if Grid5000 is not mentioned often on our plots.
To reproduce some environments locally, we have a small set of Python scripts To reproduce some environments locally, we have a small set of Python scripts
named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
needs[^ref1]. Most of the following tests were thus run locally with mknet on a needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
@ -84,24 +87,27 @@ values, the system tends to freeze when it is under heavy I/O load.
## Efficient I/O ## Efficient I/O
The main goal of an object storage system is to store or retrieve an object The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster, the better. For this analysis, we focus on across the network, and the faster these two functions can be accomplished,
2 aspects: time to first byte, as many applications can start processing a file the more efficient the system as a whole will be. For this analysis, we focus on
before receiving it completely, and generic throughput, to understand how well 2 aspects of performance. First, since many applications can start processing a file
Garage can leverage the underlying machine performances. before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
on GetObject requests, i.e. the duration between the moment a request is sent
and the moment where the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performances.
**Time To First Byte** - One specificity of Garage is that we implemented S3 **Time To First Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea to make it the platform of choice to publish your web endpoints, with the idea to make it a platform of choice to publish
static website. When publishing a website, one metric you observe is Time To static websites. When publishing a website, TTFB can be directly observed
First Byte (TTFB), as it will impact the perceived reactivity of your website. by the end user, as it will impact the perceived reactivity of the websites.
On Garage, time to first byte was a bit high.
This is not surprising as, until now, the smallest level of granularity Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
internally was handling full blocks. Blocks are 1MB chunks (this is This can be explained by the fact that Garage was not able to handle data internally
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
of a given object. For example, a 4.5MB object will be split into 4 blocks of (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
1MB and 1 block of 0.5MB. With this design, when you were sending a GET Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client. storage node before starting to send any data to the client.
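To make the block splitting described above concrete, here is a minimal sketch in Python. It is not Garage's actual code: the function name `split_into_blocks` is hypothetical, and we assume for simplicity that 1MB means 1,000,000 bytes.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes (in bytes) of the blocks an object would be cut into.

    Hypothetical helper for illustration only; block_size defaults to an
    assumed 1MB = 1,000,000 bytes.
    """
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    # Count the full blocks, then keep whatever remains as a shorter last block.
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 4.5MB object gives 4 blocks of 1MB and 1 block of 0.5MB.
print(split_into_blocks(4_500_000))
```

With the old design, the gateway had to receive the whole first element of this list before forwarding anything to the client.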
@ -109,20 +115,24 @@ With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from to send the beginning of a block without having to wait for the full block from
the storage node. We can visually represent the difference as follows: the storage node. We can visually represent the difference as follows:
![A schema depicting how streaming improves the delivery of a block](schema-streaming.png) <center>
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
</center>
As our default block size is only 1MB, the difference will be very small on As our default block size is only 1MB, the difference should be very small on
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
on a very slow network (or a very congested link with many parallel requests thus adding at most 8ms of latency to a GetObject request (assuming no other
handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds data transfer is happening in parallel). However,
to transfer our 1MB block, and streaming could heavily improve user experience. on a very slow network, or a very congested link with many parallel requests
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
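The two figures above can be checked with a short back-of-the-envelope computation. This is only a sketch: the helper name `transfer_time_seconds` is ours, we assume 1MB = 1,000,000 bytes, and we ignore latency, protocol overhead and any parallel traffic.

```python
def transfer_time_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Idealized time to push size_bytes through a link of the given bitrate.

    Hypothetical helper: ignores latency, protocol overhead and congestion.
    """
    if link_bits_per_second <= 0:
        raise ValueError("link_bits_per_second must be > 0")
    # 8 bits per byte, divided by the link speed in bits per second.
    return size_bytes * 8 / link_bits_per_second


BLOCK_SIZE = 1_000_000  # assumed 1MB block, in bytes

fast = transfer_time_seconds(BLOCK_SIZE, 1_000_000_000)  # 1Gbps link
slow = transfer_time_seconds(BLOCK_SIZE, 5_000_000)      # 5Mbps link
print(f"1Gbps: {fast * 1000:.0f} ms, 5Mbps: {slow:.1f} s")
```

Without streaming, this duration is roughly the delay added before the first byte can leave the gateway, which is why the effect is only visible on slow or congested links.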
We wanted to see if this theory holds in practice: we simulated a low latency We wanted to see if this theory holds in practice: we simulated a low latency
but slow network on mknet and did some requests with (garage v0.8 beta) and but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
without (garage v0.7) block streaming. We also added Minio as a reference. To without (Garage v0.7.3). We also added Minio as a reference. To
benchmark this behavior, we wrote a small test named benchmark this behavior, we wrote a small test named
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
its results are depicted in the following figure. whose results are shown on the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)

@ -42,7 +42,7 @@
</div> </div>
<div class="content mt-2"> <div class="content mt-2">
<div class="text-gray-700 text-lg not-italic"> <div class="text-gray-700 text-lg not-italic">
{{ page.summary | safe | striptags }} {{ page.summary | striptags | safe }}
</div> </div>
<a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'> <a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
<div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div> <div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>