New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
2 changed files with 69 additions and 59 deletions
Showing only changes of commit 6133fcd3ca


+++
title="Bringing theoretical design and observed performances face to face"
title="Confronting theoretical design with observed performances"
date=2022-09-26
+++
*During the past years, we have extensively analyzed possible design decisions and
their theoretical trade-offs for Garage, especially concerning networking, data
structures, and scheduling. Garage worked well enough for our production
cluster at Deuxfleurs, but we also knew that people started to discover some
unexpected behaviors. We thus started a round of benchmarks and performance
measurements to see how Garage behaves compared to our expectations.
This post presents some of our first results, which cover
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
to reflect the high-level properties we are seeking.*
<!-- more -->
The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist.
Most of our tests were made on simulated networks, which by definition cannot represent all the
diversity of real networks (dynamic drop, jitter, latency, all of which could be
correlated with throughput or any other external event). We also limited
ourselves to very small workloads that are not representative of a production
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
our results are thus not an evaluation of the performance of Garage as a whole.
For some benchmarks, we used Minio as a reference. It must be noted that we did
not try to optimize its configuration as we have done on Garage, and more
generally, we have way less knowledge on Minio than on Garage, which can lead
to underrated performance measurements for Minio. It must also be noted that
Garage and Minio are systems with different feature sets. For instance, Minio supports
erasure coding for higher data density, which Garage doesn't, and Minio implements
many more S3 endpoints than Garage. Such features necessarily have a cost
that you must keep in mind when reading the plots we will present. You should consider
results on Minio as a way to contextualize our results on Garage, to see that our improvements
are not artificial compared to existing object storage implementations.
The impact of the testing environment is also not evaluated (kernel patches,
configuration, parameters, filesystem, hardware configuration, etc.). Some of
these parameters could favor one configuration or software product over another.
In particular, it must be noted that most of the tests were done on a
consumer-grade PC with only an SSD, which is different from most
production setups. Finally, our results are also provided without statistical
tests to check their significance, and might thus not be significant enough
to be considered reliable.
When reading this post, please keep in mind that **we are not making any
business or technical recommendations here, and this is not a scientific paper
either**; we only share bits of our development process as honestly as
possible.
Run your own tests if you need to make a decision,
remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
and to remain supportive and caring with your peers ;)
## About our testing environment
We ran a first batch of tests on
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
testbed for experiment-driven research in all areas of computer science,
which has an
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
During our tests, we used part of the following clusters:
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
geo-distributed topology. We used the Grid5000 testbed only during our
preliminary tests to identify issues when running Garage on many powerful
servers. We then reproduced these issues in a controlled environment
outside of Grid5000, so don't be
surprised if Grid5000 is not mentioned often in our plots.
To reproduce some environments locally, we have a small set of Python scripts
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
RAM, and a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
`vm.dirty_ratio` sysctl parameters were also lowered, as with their default
values, the system tends to freeze when it is under heavy I/O load.
## Efficient I/O
The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster these two functions can be accomplished,
the more efficient the system as a whole will be. For this analysis, we focus on
2 aspects of performance. First, since many applications can start processing a file
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
on GetObject requests, i.e. the duration between the moment a request is sent
and the moment when the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performances.
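As an illustration, here is a minimal Python sketch of how TTFB can be measured from the
client side. This is not the benchmark tool we actually used (it is introduced further
below), and the URL and object name are hypothetical:

```python
import time
import requests

# Hypothetical URL of an object served over HTTP (e.g. by an S3 or web endpoint).
URL = "http://localhost:3902/my-bucket/large-object.bin"

def time_to_first_byte(url: str) -> float:
    """Time elapsed between sending the request and receiving the first byte of the body."""
    start = time.monotonic()
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        # iter_content() yields as soon as the first chunk arrives,
        # without waiting for the rest of the body.
        next(resp.iter_content(chunk_size=1))
    return time.monotonic() - start

if __name__ == "__main__":
    samples = sorted(time_to_first_byte(URL) for _ in range(10))
    print(f"median TTFB: {samples[len(samples) // 2] * 1000:.1f} ms")
```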
**Time To First Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea of making it a platform of choice for publishing
static websites. When publishing a website, TTFB can be directly observed
by the end user, as it impacts the perceived reactivity of the website.
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
This can be explained by the fact that Garage was not able to handle data internally
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client.
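To illustrate this splitting rule, here is a toy Python sketch (not Garage's actual Rust
implementation) showing how an object's payload maps to blocks:

```python
BLOCK_SIZE = 1024 * 1024  # default block size: 1MB (configurable in Garage)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut an object's payload into fixed-size blocks; only the last one may be smaller."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 4.5MB object yields 4 full blocks of 1MB plus one block of 0.5MB.
blocks = split_into_blocks(bytes(4 * BLOCK_SIZE + BLOCK_SIZE // 2))
print([len(b) for b in blocks])  # [1048576, 1048576, 1048576, 1048576, 524288]
```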
With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from
the storage node. We can visually represent the difference as follows:
<center>
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
</center>
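The same difference can be sketched in code. The following toy Python functions, operating
on in-memory file-like objects rather than Garage's actual Rust internals, contrast the two
behaviors:

```python
import io

def relay_buffered(storage: io.BufferedIOBase, client: io.BufferedIOBase,
                   block_size: int) -> None:
    """Old behavior: wait for the whole block before sending anything to the client."""
    block = storage.read(block_size)
    client.write(block)

def relay_streaming(storage: io.BufferedIOBase, client: io.BufferedIOBase,
                    chunk_size: int = 16 * 1024) -> None:
    """New behavior: forward every chunk to the client as soon as it is received."""
    while chunk := storage.read(chunk_size):
        client.write(chunk)

# Tiny demo on in-memory buffers.
src, dst = io.BytesIO(b"x" * (1024 * 1024)), io.BytesIO()
relay_streaming(src, dst)
assert dst.getvalue() == src.getvalue()
```

With streaming, the client starts receiving data as soon as the first chunk has crossed the
storage-to-gateway link, instead of waiting for the full block.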
As our default block size is only 1MB, the difference should be very small on
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
thus adding at most 8ms of latency to a GetObject request (assuming no other
data transfer is happening in parallel). However,
on a very slow network, or a very congested link with many parallel requests
being handled, the impact can be much greater: on a 5Mbps network, it takes 1.6 seconds
to transfer our 1MB block, and streaming has the potential to greatly improve the user experience.
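These orders of magnitude can be checked with a quick back-of-the-envelope computation,
ignoring latency and protocol overhead:

```python
BLOCK_BITS = 1_000_000 * 8  # one 1MB block expressed in bits

for name, rate_bps in [("1Gbps", 1_000_000_000), ("5Mbps", 5_000_000)]:
    print(f"{name}: {BLOCK_BITS / rate_bps:.3f}s to transfer one block")
# 1Gbps: 0.008s to transfer one block
# 5Mbps: 1.600s to transfer one block
```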
We wanted to see if this theory holds in practice: we simulated a low latency
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
without (Garage v0.7.3). We also added Minio as a reference. To
benchmark this behavior, we wrote a small test named
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
whose results are shown in the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)


</div>
<div class="content mt-2">
<div class="text-gray-700 text-lg not-italic">
{{ page.summary | striptags | safe }}
</div>
<a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
<div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>