New article: Bringing theoretical design and observed performances face to face #12
2 changed files with 69 additions and 59 deletions
|
@ -1,16 +1,17 @@
|
|||
+++
|
||||
title="Bringing theoretical design and observed performances face to face"
|
||||
title="Confronting theoretical design with observed performances"
|
||||
date=2022-09-26
|
||||
+++
|
||||
|
||||
|
||||
*For the past years, we have extensively analyzed possible design decisions and
|
||||
their theoretical tradeoffs on Garage, especially on the network, data
|
||||
structure, or scheduling side. And it worked well enough for our production
|
||||
cluster at Deuxfleurs, but we also knew that people started discovering some
|
||||
*During the past years, we have extensively analyzed possible design decisions and
|
||||
their theoretical trade-offs for Garage, especially concerning networking, data
|
||||
structures, and scheduling. Garage worked well enough for our production
|
||||
cluster at Deuxfleurs, but we also knew that people started to discover some
|
||||
unexpected behaviors. We thus started a round of benchmark and performance
|
||||
measurements to see how Garage behaves compared to our expectations. We split
|
||||
them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency"
|
||||
measurements to see how Garage behaves compared to our expectations.
|
||||
This post presents some of our first results, which cover
|
||||
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
|
||||
to reflect the high-level properties we are seeking.*
|
||||
|
||||
<!-- more -->
|
||||
|
@ -21,61 +22,63 @@ to reflect the high-level properties we are seeking.*
|
|||
|
||||
The following results must be taken with a critical grain of salt due to some
|
||||
limitations that are inherent to any benchmark. We try to reference them as
|
||||
exhaustively as possible in this section, but other limitations might exist.
|
||||
exhaustively as possible in this first section, but other limitations might exist.
|
||||
|
||||
Most of our tests are done on simulated networks that can not represent all the
|
||||
diversity of real networks (dynamic drop, jitter, latency, all of them could be
|
||||
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
||||
diversity of real networks (dynamic drop, jitter, latency, all of which could be
|
||||
correlated with throughput or any other external event). We also limited
|
||||
ourselves to very small workloads that are not representative of a production
|
||||
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
||||
our results are thus not an overview of the whole software performance.
|
||||
our results are thus not an evaluation of the performance of Garage as a whole.
|
||||
|
||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||
not try to optimize its configuration as we have done on Garage, and more
|
||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
||||
to underrated performance measurements for Minio. It must also be noted that
|
||||
Garage and Minio are systems with different feature sets, *eg.* Minio supports
|
||||
erasure coding for better data density while Garage doesn't, Minio implements
|
||||
way more S3 endpoints than Garage, etc. Such features have necessarily a cost
|
||||
that you must keep in mind when reading plots. You should consider Minio
|
||||
results as a way to contextualize our results, to check that our improvements
|
||||
are not artificials compared to existing object storage implementations.
|
||||
Garage and Minio are systems with different feature sets. For instance Minio supports
|
||||
erasure coding for higher data density, which Garage doesn't, Minio implements
|
||||
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||
that you must keep in mind when reading the plots we will present. You should consider
|
||||
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
||||
are not artificial compared to existing object storage implementations.
|
||||
|
||||
The impact of the testing environment is also not evaluated (kernel patches,
|
||||
configuration, parameters, filesystem, hardware configuration, etc.), some of
|
||||
these configurations could favor one configuration/software over another.
|
||||
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||
these parameters could favor one configuration or software product over another.
|
||||
Especially, it must be noted that most of the tests were done on a
|
||||
consumer-grade computer and SSD only, which will be different from most
|
||||
consumer-grade PC with only an SSD, which is different from most
|
||||
production setups. Finally, our results are also provided without statistical
|
||||
tests to check their significance, and thus might be statistically not
|
||||
significant.
|
||||
tests to check their significance, and might thus have insufficient significance
|
||||
to be claimed as reliable.
|
||||
|
||||
When reading this post, please keep in mind that **we are not making any
|
||||
business or technical recommendations here, this is not a scientific paper
|
||||
business or technical recommendations here, and this is not a scientific paper
|
||||
either**; we only share bits of our development process as honestly as
|
||||
possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
|
||||
make your own
|
||||
tests if you need to take a decision, and remain supportive and caring with
|
||||
your peers...
|
||||
possible.
|
||||
Make your own tests if you need to make a decision,
|
||||
remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
|
||||
and to remain supportive and caring with your peers ;)
|
||||
|
||||
## About our testing environment
|
||||
|
||||
We started a batch of tests on
|
||||
We made a first batch of tests on
|
||||
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
||||
testbed for experiment-driven research in all areas of computer science, under
|
||||
the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program.
|
||||
testbed for experiment-driven research in all areas of computer science,
|
||||
which has an
|
||||
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
||||
During our tests, we used part of the following clusters:
|
||||
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
||||
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
||||
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a
|
||||
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
|
||||
geo-distributed topology. We used the Grid5000 testbed only during our
|
||||
preliminary tests to identify issues when running Garage on many powerful
|
||||
servers, issues that we then reproduced in a controlled environment; don't be
|
||||
servers. We then reproduced these issues in a controlled environment
|
||||
outside of Grid5000, so don't be
|
||||
surprised then if Grid5000 is not mentioned often on our plots.
|
||||
|
||||
To reproduce some environments locally, we have a small set of Python scripts
|
||||
named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
||||
needs[^ref1]. Most of the following tests were thus run locally with mknet on a
|
||||
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
||||
needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
|
||||
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
||||
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
||||
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
||||
|
@ -84,24 +87,27 @@ values, the system tends to freeze when it is under heavy I/O load.
|
|||
|
||||
## Efficient I/O
|
||||
|
||||
The main goal of an object storage system is to store or retrieve an object
|
||||
across the network, and the faster, the better. For this analysis, we focus on
|
||||
2 aspects: time to first byte, as many applications can start processing a file
|
||||
before receiving it completely, and generic throughput, to understand how well
|
||||
Garage can leverage the underlying machine performances.
|
||||
|
||||
The main purpose of an object storage system is to store and retrieve objects
|
||||
across the network, and the faster these two functions can be accomplished,
|
||||
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||
2 aspects of performance. First, since many applications can start processing a file
|
||||
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
|
||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||
and the moment when the first bytes of the returned object are received by the client.
|
||||
Second, we will evaluate generic throughput, to understand how well
|
||||
Garage can leverage the underlying machine's performances.
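
To make the TTFB definition above concrete, here is a minimal Python sketch of how such a measurement could be taken. It is not the code of our benchmark: `time_to_first_byte` and its `open_stream` argument are hypothetical names introduced for illustration, and an in-memory stream stands in for a real GetObject response so that the snippet runs without a server.

```python
import io
import time
from typing import BinaryIO, Callable


def time_to_first_byte(open_stream: Callable[[], BinaryIO]) -> float:
    """Seconds elapsed between sending a request and reading its first body byte.

    `open_stream` is a hypothetical callable (an assumption of this sketch):
    it sends the request and returns a readable binary stream for the body.
    """
    start = time.perf_counter()
    stream = open_stream()   # the request is sent here
    first = stream.read(1)   # blocks until the first byte is available
    elapsed = time.perf_counter() - start
    if not first:
        raise ValueError("empty response body: there is no first byte to time")
    return elapsed


# In-memory stand-in for a real GetObject call, so the sketch is self-contained.
ttfb = time_to_first_byte(lambda: io.BytesIO(b"hello"))
print(f"TTFB: {ttfb * 1000:.3f} ms")
```

Against a real endpoint, one could for instance pass `lambda: urllib.request.urlopen(url)` for a publicly readable object, keeping in mind that the measured duration then also includes connection setup and response headers.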
|
||||
|
||||
**Time To First Byte** - One specificity of Garage is that we implemented S3
|
||||
web endpoints, with the idea to make it the platform of choice to publish your
|
||||
static website. When publishing a website, one metric you observe is Time To
|
||||
First Byte (TTFB), as it will impact the perceived reactivity of your website.
|
||||
On Garage, time to first byte was a bit high.
|
||||
web endpoints, with the idea to make it a platform of choice to publish
|
||||
static websites. When publishing a website, TTFB can be directly observed
|
||||
by the end user, as it will impact the perceived reactivity of the websites.
|
||||
|
||||
This is not surprising as, until now, the smallest level of granularity
|
||||
internally was handling full blocks. Blocks are 1MB chunks (this is
|
||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size))
|
||||
of a given object. For example, a 4.5MB object will be split into 4 blocks of
|
||||
1MB and 1 block of 0.5MB. With this design, when you were sending a GET
|
||||
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
|
||||
This can be explained by the fact that Garage was not able to handle data internally
|
||||
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
||||
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||
request, the first block had to be fully retrieved by the gateway node from the
|
||||
storage node before starting to send any data to the client.
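
The splitting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Garage's actual implementation: `split_into_blocks` is a hypothetical helper, and we assume decimal megabytes (1MB = 1,000,000 bytes) to match the round numbers used in the text.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes (in bytes) of the blocks an object would be split into.

    Assumption of this sketch: every block is full except possibly the last one.
    """
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last, partially filled block
    return sizes


# A 4.5MB object gives 4 blocks of 1MB and 1 block of 0.5MB.
print(split_into_blocks(4_500_000))
```

With the old design, the gateway had to receive the whole first entry of this list before the client saw any data.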
|
||||
|
||||
|
@ -109,20 +115,24 @@ With Garage v0.8, we integrated a block streaming logic that allows the gateway
|
|||
to send the beginning of a block without having to wait for the full block from
|
||||
the storage node. We can visually represent the difference as follows:
|
||||
|
||||
![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
|
||||
<center>
|
||||
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
|
||||
</center>
|
||||
|
||||
As our default block size is only 1MB, the difference will be very small on
|
||||
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However,
|
||||
on a very slow network (or a very congested link with many parallel requests
|
||||
handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds
|
||||
to transfer our 1MB block, and streaming could heavily improve user experience.
|
||||
As our default block size is only 1MB, the difference should be very small on
|
||||
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
|
||||
thus adding at most 8ms of latency to a GetObject request (assuming no other
|
||||
data transfer is happening in parallel). However,
|
||||
on a very slow network, or a very congested link with many parallel requests
|
||||
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
|
||||
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
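
The two figures above follow from a simple division, which the Python sketch below reproduces. It assumes an idealized link: no protocol overhead, no latency, no competing traffic, and 1MB = 1,000,000 bytes; `transfer_time_seconds` is a name chosen for illustration.

```python
def transfer_time_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Time needed to push `size_bytes` through an idealized link."""
    return size_bytes * 8 / link_bits_per_second


BLOCK = 1_000_000  # one default-sized block, in bytes (decimal MB assumed)

fast = transfer_time_seconds(BLOCK, 1_000_000_000)  # 1Gbps link
slow = transfer_time_seconds(BLOCK, 5_000_000)      # 5Mbps link

print(f"1Gbps: {fast * 1000:.0f} ms")  # 8 ms
print(f"5Mbps: {slow:.1f} s")          # 1.6 s
```

Without streaming, this duration is a lower bound on the extra delay added before the first byte reaches the client; on a real, shared link it can only be longer.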
|
||||
|
||||
We wanted to see if this theory holds in practice: we simulated a low latency
|
||||
but slow network on mknet and did some requests with (garage v0.8 beta) and
|
||||
without (garage v0.7) block streaming. We also added Minio as a reference. To
|
||||
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
||||
without (Garage v0.7.3). We also added Minio as a reference. To
|
||||
benchmark this behavior, we wrote a small test named
|
||||
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
|
||||
its results are depicted in the following figure.
|
||||
whose results are shown in the following figure:
|
||||
|
||||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||
|
||||
|
|
|
@ -42,7 +42,7 @@
|
|||
</div>
|
||||
<div class="content mt-2">
|
||||
<div class="text-gray-700 text-lg not-italic">
|
||||
{{ page.summary | safe | striptags }}
|
||||
{{ page.summary | striptags | safe }}
|
||||
</div>
|
||||
<a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
|
||||
<div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>
|
||||
|
|