New article: Bringing theoretical design and observed performances face to face #12
2 changed files with 69 additions and 59 deletions
|
@ -1,16 +1,17 @@
|
||||||
+++
|
+++
|
||||||
title="Bringing theoretical design and observed performances face to face"
|
title="Confronting theoretical design with observed performances"
|
||||||
date=2022-09-26
|
date=2022-09-26
|
||||||
+++
|
+++
|
||||||
|
|
||||||
|
|
||||||
*For the past years, we have extensively analyzed possible design decisions and
|
*During the past years, we have extensively analyzed possible design decisions and
|
||||||
their theoretical tradeoffs on Garage, especially on the network, data
|
their theoretical trade-offs for Garage, especially concerning networking, data
|
||||||
structure, or scheduling side. And it worked well enough for our production
|
structures, and scheduling. Garage worked well enough for our production
|
||||||
cluster at Deuxfleurs, but we also knew that people started discovering some
|
cluster at Deuxfleurs, but we also knew that people started to discover some
|
||||||
unexpected behaviors. We thus started a round of benchmark and performance
|
unexpected behaviors. We thus started a round of benchmark and performance
|
||||||
measurements to see how Garage behaves compared to our expectations. We split
|
measurements to see how Garage behaves compared to our expectations.
|
||||||
them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency"
|
This post presents some of our first results, which cover
|
||||||
|
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
|
||||||
to reflect the high-level properties we are seeking.*
|
to reflect the high-level properties we are seeking.*
|
||||||
|
|
||||||
<!-- more -->
|
<!-- more -->
|
||||||
|
@ -21,61 +22,63 @@ to reflect the high-level properties we are seeking.*
|
||||||
|
|
||||||
The following results must be taken with a critical grain of salt due to some
|
The following results must be taken with a critical grain of salt due to some
|
||||||
limitations that are inherent to any benchmark. We try to reference them as
|
limitations that are inherent to any benchmark. We try to reference them as
|
||||||
exhaustively as possible in this section, but other limitations might exist.
|
exhaustively as possible in this first section, but other limitations might exist.
|
||||||
|
|
||||||
Most of our tests are done on simulated networks that can not represent all the
|
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
||||||
diversity of real networks (dynamic drop, jitter, latency, all of them could be
|
diversity of real networks (dynamic drop, jitter, latency, all of which could be
|
||||||
correlated with throughput or any other external event). We also limited
|
correlated with throughput or any other external event). We also limited
|
||||||
ourselves to very small workloads that are not representative of a production
|
ourselves to very small workloads that are not representative of a production
|
||||||
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
||||||
our results are thus not an overview of the whole software performance.
|
our results are thus not an evaluation of the performance of Garage as a whole.
|
||||||
|
|
||||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||||
not try to optimize its configuration as we have done on Garage, and more
|
not try to optimize its configuration as we have done on Garage, and more
|
||||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
generally, we have way less knowledge on Minio than on Garage, which can lead
|
||||||
to underrated performance measurements for Minio. It must also be noted that
|
to underrated performance measurements for Minio. It must also be noted that
|
||||||
Garage and Minio are systems with different feature sets, *eg.* Minio supports
|
Garage and Minio are systems with different feature sets. For instance, Minio supports
|
||||||
erasure coding for better data density while Garage doesn't, Minio implements
|
erasure coding for higher data density, which Garage doesn't, and Minio implements
|
||||||
way more S3 endpoints than Garage, etc. Such features have necessarily a cost
|
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||||
that you must keep in mind when reading plots. You should consider Minio
|
that you must keep in mind when reading the plots we will present. You should consider
|
||||||
results as a way to contextualize our results, to check that our improvements
|
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
||||||
are not artificials compared to existing object storage implementations.
|
are not artificial compared to existing object storage implementations.
|
||||||
|
|
||||||
The impact of the testing environment is also not evaluated (kernel patches,
|
The impact of the testing environment is also not evaluated (kernel patches,
|
||||||
configuration, parameters, filesystem, hardware configuration, etc.), some of
|
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||||
these configurations could favor one configuration/software over another.
|
these parameters could favor one configuration or software product over another.
|
||||||
Especially, it must be noted that most of the tests were done on a
|
Especially, it must be noted that most of the tests were done on a
|
||||||
consumer-grade computer and SSD only, which will be different from most
|
consumer-grade PC with only an SSD, which is different from most
|
||||||
production setups. Finally, our results are also provided without statistical
|
production setups. Finally, our results are also provided without statistical
|
||||||
tests to check their significance, and thus might be statistically not
|
tests to check their significance, and might thus have insufficient significance
|
||||||
significant.
|
to be claimed as reliable.
|
||||||
|
|
||||||
When reading this post, please keep in mind that **we are not making any
|
When reading this post, please keep in mind that **we are not making any
|
||||||
business or technical recommendations here, this is not a scientific paper
|
business or technical recommendations here, and this is not a scientific paper
|
||||||
either**; we only share bits of our development process as honestly as
|
either**; we only share bits of our development process as honestly as
|
||||||
possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
|
possible.
|
||||||
make your own
|
Make your own tests if you need to make a decision,
|
||||||
tests if you need to take a decision, and remain supportive and caring with
|
remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
|
||||||
your peers...
|
and to remain supportive and caring with your peers ;)
|
||||||
|
|
||||||
## About our testing environment
|
## About our testing environment
|
||||||
|
|
||||||
We started a batch of tests on
|
We made a first batch of tests on
|
||||||
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
||||||
testbed for experiment-driven research in all areas of computer science, under
|
testbed for experiment-driven research in all areas of computer science,
|
||||||
the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program.
|
which has an
|
||||||
|
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
||||||
During our tests, we used part of the following clusters:
|
During our tests, we used part of the following clusters:
|
||||||
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
||||||
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
||||||
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a
|
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
|
||||||
geo-distributed topology. We used the Grid5000 testbed only during our
|
geo-distributed topology. We used the Grid5000 testbed only during our
|
||||||
preliminary tests to identify issues when running Garage on many powerful
|
preliminary tests to identify issues when running Garage on many powerful
|
||||||
servers, issues that we then reproduced in a controlled environment; don't be
|
servers. We then reproduced these issues in a controlled environment
|
||||||
|
outside of Grid5000, so don't be
|
||||||
surprised then if Grid5000 is not mentioned often on our plots.
|
surprised then if Grid5000 is not mentioned often on our plots.
|
||||||
|
|
||||||
To reproduce some environments locally, we have a small set of Python scripts
|
To reproduce some environments locally, we have a small set of Python scripts
|
||||||
named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
||||||
needs[^ref1]. Most of the following tests were thus run locally with mknet on a
|
needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
|
||||||
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
||||||
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
||||||
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
||||||
|
@ -84,24 +87,27 @@ values, the system tends to freeze when it is under heavy I/O load.
|
||||||
|
|
||||||
## Efficient I/O
|
## Efficient I/O
|
||||||
|
|
||||||
The main goal of an object storage system is to store or retrieve an object
|
The main purpose of an object storage system is to store and retrieve objects
|
||||||
across the network, and the faster, the better. For this analysis, we focus on
|
across the network, and the faster these two functions can be accomplished,
|
||||||
2 aspects: time to first byte, as many applications can start processing a file
|
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||||
before receiving it completely, and generic throughput, to understand how well
|
2 aspects of performance. First, since many applications can start processing a file
|
||||||
Garage can leverage the underlying machine performances.
|
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
|
||||||
|
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||||
|
and the moment where the first bytes of the returned object are received by the client.
|
||||||
|
Second, we will evaluate generic throughput, to understand how well
|
||||||
|
Garage can leverage the underlying machine's performance.
|
||||||
|
|
||||||
**Time To First Byte** - One specificity of Garage is that we implemented S3
|
**Time To First Byte** - One specificity of Garage is that we implemented S3
|
||||||
web endpoints, with the idea to make it the platform of choice to publish your
|
web endpoints, with the idea to make it a platform of choice to publish
|
||||||
static website. When publishing a website, one metric you observe is Time To
|
static websites. When publishing a website, TTFB can be directly observed
|
||||||
First Byte (TTFB), as it will impact the perceived reactivity of your website.
|
by the end user, as it will impact the perceived responsiveness of the website.
|
||||||
On Garage, time to first byte was a bit high.
|
|
||||||
|
|
||||||
This is not surprising as, until now, the smallest level of granularity
|
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
|
||||||
internally was handling full blocks. Blocks are 1MB chunks (this is
|
This can be explained by the fact that Garage was not able to handle data internally
|
||||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size))
|
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
||||||
of a given object. For example, a 4.5MB object will be split into 4 blocks of
|
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||||
1MB and 1 block of 0.5MB. With this design, when you were sending a GET
|
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||||
|
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||||
request, the first block had to be fully retrieved by the gateway node from the
|
request, the first block had to be fully retrieved by the gateway node from the
|
||||||
storage node before starting to send any data to the client.
|
storage node before starting to send any data to the client.
|
||||||
|
|
||||||
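The block split described in the example above (a 4.5MB object becoming 4 blocks of 1MB and 1 block of 0.5MB) can be sketched as follows. This is only an illustration of the arithmetic, not Garage's actual chunking code: the function name `split_into_blocks` is hypothetical, and 1MB is assumed to mean 10^6 bytes for simplicity.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes (in bytes) of the successive blocks of an object.

    Hypothetical helper for illustration only. block_size defaults to
    1MB, taken here as 10**6 bytes (an assumption, the post does not say
    whether it means decimal or binary megabytes).
    """
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left of the object.
        sizes.append(remainder)
    return sizes


# The 4.5MB example from the text: 4 full blocks and one half block.
print(split_into_blocks(4_500_000))
```

With the old design, the first of these blocks (1MB here) had to reach the gateway in full before the client received anything.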
|
@ -109,20 +115,24 @@ With Garage v0.8, we integrated a block streaming logic that allows the gateway
|
||||||
to send the beginning of a block without having to wait for the full block from
|
to send the beginning of a block without having to wait for the full block from
|
||||||
the storage node. We can visually represent the difference as follows:
|
the storage node. We can visually represent the difference as follows:
|
||||||
|
|
||||||
![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
|
<center>
|
||||||
|
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
|
||||||
|
</center>
|
||||||
|
|
||||||
As our default block size is only 1MB, the difference will be very small on
|
As our default block size is only 1MB, the difference should be very small on
|
||||||
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However,
|
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
|
||||||
on a very slow network (or a very congested link with many parallel requests
|
thus adding at most 8ms of latency to a GetObject request (assuming no other
|
||||||
handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds
|
data transfer is happening in parallel). However,
|
||||||
to transfer our 1MB block, and streaming could heavily improve user experience.
|
on a very slow network, or a very congested link with many parallel requests
|
||||||
|
handled, the impact can be much larger: on a 5Mbps network, it takes 1.6 seconds
|
||||||
|
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
|
||||||
|
|
||||||
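The two figures quoted in the paragraph above (8ms at 1Gbps, 1.6 seconds at 5Mbps) can be checked with a short calculation. This is a rough sketch under simplifying assumptions: 1MB is taken as 10^6 bytes, the link is used by a single transfer, and protocol overhead and latency are ignored. The function name `block_transfer_seconds` is made up for this example.

```python
def block_transfer_seconds(block_bytes: int, link_bits_per_second: float) -> float:
    """Time needed to push one block through a link, ignoring overhead.

    Hypothetical helper: it only converts bytes to bits and divides by
    the link rate, which is the worst-case delay added by waiting for a
    full block before sending the first byte to the client.
    """
    if link_bits_per_second <= 0:
        raise ValueError("link rate must be positive")
    return block_bytes * 8 / link_bits_per_second


BLOCK_BYTES = 1_000_000  # 1MB block, assumed to be 10**6 bytes

for label, rate in (("1Gbps", 1e9), ("5Mbps", 5e6)):
    delay_ms = block_transfer_seconds(BLOCK_BYTES, rate) * 1000
    print(f"{label}: {delay_ms:.0f} ms to transfer one block")
```

The same function can be reused with other block sizes or link rates to estimate how much buffering a full block could add to TTFB on a given network.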
We wanted to see if this theory holds in practice: we simulated a low latency
|
We wanted to see if this theory holds in practice: we simulated a low latency
|
||||||
but slow network on mknet and did some requests with (garage v0.8 beta) and
|
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
||||||
without (garage v0.7) block streaming. We also added Minio as a reference. To
|
without (Garage v0.7.3). We also added Minio as a reference. To
|
||||||
benchmark this behavior, we wrote a small test named
|
benchmark this behavior, we wrote a small test named
|
||||||
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
|
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
|
||||||
its results are depicted in the following figure.
|
whose results are shown in the following figure:
|
||||||
|
|
||||||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||||
|
|
||||||
|
|
|
@ -42,7 +42,7 @@
|
||||||
</div>
|
</div>
|
||||||
<div class="content mt-2">
|
<div class="content mt-2">
|
||||||
<div class="text-gray-700 text-lg not-italic">
|
<div class="text-gray-700 text-lg not-italic">
|
||||||
{{ page.summary | safe | striptags }}
|
{{ page.summary | striptags | safe }}
|
||||||
</div>
|
</div>
|
||||||
<a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
|
<a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
|
||||||
<div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>
|
<div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>
|
||||||
|
|