New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 86 additions and 86 deletions
|
@ -8,8 +8,8 @@ date=2022-09-26
|
|||
their theoretical trade-offs for Garage. In particular, we pondered the impacts
|
||||
of data structures, networking methods, and scheduling algorithms.
|
||||
Garage worked well enough for our production
|
||||
cluster at Deuxfleurs, but we also knew that people started to discover some
|
||||
unexpected behaviors. We thus started a round of benchmarks and performance
|
||||
cluster at Deuxfleurs, but we also knew that people started to experience some
|
||||
unexpected behaviors, which motivated us to start a round of benchmarks and performance
|
||||
measurements to see how Garage behaves compared to our expectations.
|
||||
This post presents some of our first results, which cover
|
||||
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
|
||||
|
@ -21,35 +21,35 @@ reflecting the high-level properties we are seeking.*
|
|||
|
||||
## ⚠️ Disclaimer
|
||||
|
||||
The results presented in this blog post must be taken with a critical grain of salt due to some
|
||||
The results presented in this blog post must be taken with a (critical) grain of salt due to some
|
||||
limitations that are inherent to any benchmarking endeavor. We try to reference them as
|
||||
exhaustively as possible in this first section, but other limitations might exist.
|
||||
exhaustively as possible here, but other limitations might exist.
|
||||
|
||||
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
||||
diversity of real networks (dynamic drop, jitter, latency, all of which could be
|
||||
Most of our tests were made on _simulated_ networks, which by definition cannot represent all the
|
||||
diversity of _real_ networks (dynamic drop, jitter, latency, all of which could be
|
||||
correlated with throughput or any other external event). We also limited
|
||||
ourselves to very small workloads that are not representative of a production
|
||||
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
||||
our results are thus not an evaluation of the performance of Garage as a whole.
|
||||
our results are not an evaluation of the performance of Garage as a whole.
|
||||
|
||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||
not try to optimize its configuration as we have done for Garage, and more
|
||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
||||
generally, we have significantly less knowledge of Minio's internals than of Garage's, which could lead
|
||||
to underrated performance measurements for Minio. It must also be noted that
|
||||
Garage and Minio are systems with different feature sets. For instance, Minio supports
|
||||
erasure coding for higher data density, which Garage doesn't, Minio implements
|
||||
erasure coding for higher data density and Garage doesn't, Minio implements
|
||||
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||
that you must keep in mind when reading the plots we will present. You should consider
|
||||
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
||||
are not artificial in the light of existing object storage implementations.
|
||||
Minio's results as a way to contextualize Garage's numbers, to justify that our improvements
|
||||
are not simply artificial in the light of existing object storage implementations.
|
||||
|
||||
The impact of the testing environment is also not evaluated (kernel patches,
|
||||
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||
these parameters could favor one configuration or software product over another.
|
||||
Especially, it must be noted that most of the tests were done on a
|
||||
consumer-grade PC with only an SSD, which is different from most
|
||||
consumer-grade PC with only an SSD, which is different from most
|
||||
production setups. Finally, our results are also provided without statistical
|
||||
tests to check their significance, and might thus have insufficient significance
|
||||
tests to validate their significance, and might have insufficient grounds
|
||||
to be claimed as reliable.
|
||||
|
||||
When reading this post, please keep in mind that **we are not making any
|
||||
|
@ -75,16 +75,16 @@ geo-distributed topology. We used the Grid5000 testbed only during our
|
|||
preliminary tests to identify issues when running Garage on many powerful
|
||||
servers. We then reproduced these issues in a controlled environment
|
||||
outside of Grid5000, so don't be
|
||||
surprised then if Grid5000 is not mentioned often on our plots.
|
||||
surprised then if Grid5000 is not always mentioned on our plots.
|
||||
|
||||
To reproduce some environments locally, we have a small set of Python scripts
|
||||
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
||||
needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
|
||||
needs[^ref1]. Most of the following tests were run locally with `mknet` on a
|
||||
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
||||
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
||||
RAM and a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
||||
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
||||
`vm.dirty_ratio` has been reduced to `2` and `1` respectively as, with default
|
||||
values, the system tends to freeze when it is under heavy I/O load.
|
||||
`vm.dirty_ratio` have been reduced to `2` and `1` respectively: with default
|
||||
values, the system tends to freeze under heavy I/O load.
|
||||
|
||||
## Efficient I/O
|
||||
|
||||
|
@ -93,7 +93,7 @@ across the network, and the faster these two functions can be accomplished,
|
|||
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||
2 aspects of performance. First, since many applications can start processing a file
|
||||
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
|
||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||
on `GetObject` requests, i.e. the duration between the moment a request is sent
|
||||
and the moment where the first bytes of the returned object are received by the client.
|
||||
Second, we will evaluate generic throughput, to understand how well
|
||||
Garage can leverage the underlying machine's performance.
|
||||
|
@ -101,18 +101,18 @@ Garage can leverage the underlying machine's performance.
|
|||
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
|
||||
web endpoints, with the idea to make it a platform of choice to publish
|
||||
static websites. When publishing a website, TTFB can be directly observed
|
||||
by the end user, as it will impact the perceived reactivity of the website.
|
||||
by the end user, as it will impact the perceived reactivity of the page being loaded.
|
||||
|
||||
Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
|
||||
This can be explained by the fact that Garage was not able to handle data internally
|
||||
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
||||
at a smaller granularity level than entire data blocks, which are up to 1MB chunks of a given object
|
||||
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||
request, the first block had to be fully retrieved by the gateway node from the
|
||||
storage node before starting to send any data to the client.
|
||||
Let us take the example of a 4.5MB object, which Garage will split by default into four 1MB blocks and one 0.5MB block.
|
||||
With the old design, when you were sending a `GET`
|
||||
request, the first block had to be _fully_ retrieved by the gateway node from the
|
||||
storage node before it could start sending any data to the client.
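The split described above can be sketched as follows. This is only an illustration under stated assumptions, not Garage's actual code (Garage is written in Rust): the helper name `split_into_blocks` is hypothetical, and 1MB is taken to mean 10^6 bytes.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes, in bytes, of the consecutive blocks of an object.

    Assumption: every block is full-sized except possibly the last one.
    """
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The trailing block only holds what is left of the object.
        sizes.append(remainder)
    return sizes


# The 4.5MB object from the example: four 1MB blocks and one 0.5MB block.
print(split_into_blocks(4_500_000))
```

Without streaming, the gateway has to receive the whole first entry of this list before the client sees a single byte.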
|
||||
|
||||
With Garage v0.8, we integrated a data streaming logic that allows the gateway
|
||||
With Garage v0.8, we added a data streaming logic that allows the gateway
|
||||
to send the beginning of a block without having to wait for the full block to be received from
|
||||
the storage node. We can visually represent the difference as follows:
|
||||
|
||||
|
@ -120,13 +120,13 @@ the storage node. We can visually represent the difference as follow:
|
|||
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
|
||||
</center>
|
||||
|
||||
As our default block size is only 1MB, the difference should be very small on
|
||||
As our default block size is only 1MB, the difference should be marginal on
|
||||
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
|
||||
thus adding at most 8ms of latency to a GetObject request (assuming no other
|
||||
adding at most 8ms of latency to a `GetObject` request (assuming no other
|
||||
data transfer is happening in parallel). However,
|
||||
on a very slow network, or a very congested link with many parallel requests
|
||||
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
|
||||
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
|
||||
handled, the impact can be much more important: on a 5Mbps network, it takes at least 1.6 seconds
|
||||
to transfer our 1MB block, and streaming will heavily improve user experience.
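These figures follow from a simple division. As a minimal sketch, assuming decimal units (1MB = 10^6 bytes, 1Gbps = 10^9 bits per second) and ignoring latency, protocol overhead and congestion, the hypothetical helper `transfer_time_seconds` reproduces them:

```python
def transfer_time_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Lower bound on the time needed to push `size_bytes` through a link.

    Assumption: decimal units, no protocol overhead, no competing traffic.
    """
    return size_bytes * 8 / link_bits_per_second


BLOCK_SIZE = 1_000_000  # the 1MB default block size discussed in the post

fast = transfer_time_seconds(BLOCK_SIZE, 1e9)  # 1Gbps network
slow = transfer_time_seconds(BLOCK_SIZE, 5e6)  # 5Mbps network
print(f"1Gbps: {fast * 1000:.0f} ms, 5Mbps: {slow:.1f} s")
```

Real transfers can only be slower than this bound, which is why the 5Mbps figure is a minimum.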
|
||||
|
||||
We wanted to see if this theory holds in practice: we simulated a low latency
|
||||
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
||||
|
@ -138,11 +138,11 @@ whose results are shown on the following figure:
|
|||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||
|
||||
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
||||
and 2s, which corresponds to the time to transfer the full block which we calculated above.
|
||||
and 2s, which matches the time required to transfer the full block which we calculated above.
|
||||
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
||||
streaming feature (the lowest value is 43ms). Minio sits between the two
|
||||
Garage versions: we suppose that it does some form of batching, but smaller
|
||||
than 1MB.
|
||||
than our initial 1MB default.
|
||||
|
||||
**Throughput** - As soon as we publicly released Garage, people started
|
||||
benchmarking it, comparing its performances to writing directly on the
|
||||
|
@ -152,7 +152,7 @@ situation, we did some optimizations, such as putting costly processing like has
|
|||
and many others
|
||||
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
|
||||
version 0.8 "Beta 1". We also noticed that some of the logic logic we wrote
|
||||
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
||||
to better control resource usage
|
||||
and detect errors, including semaphores and timeouts, was artificially limiting
|
||||
performances. In another iteration, we made this logic less restrictive at the
|
||||
|
@ -162,7 +162,7 @@ version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time
|
|||
write a block. We know that this is expensive and did a test build without any
|
||||
`fsync` call ([see the
|
||||
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
|
||||
that will not be merged, just to assess the impact of `fsync`. We refer to it
|
||||
that will not be merged, only to assess the impact of `fsync`. We refer to it
|
||||
as `no-fsync` in the following plot.
|
||||
|
||||
*A note about `fsync`: for performance reasons, operating systems often do not
|
||||
|
@ -172,12 +172,12 @@ with other writes. If a power loss occurs before the OS has time to flush
|
|||
data to disk, some writes will be lost. To ensure that a write is effectively
|
||||
written to disk, the
|
||||
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
||||
which blocks until the file or directory on which it is called has been flushed from volatile
|
||||
which effectively blocks until the file or directory on which it is called has been flushed from volatile
|
||||
memory to the persistent storage device. Additionally, the exact semantic of
|
||||
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
||||
and, even on battle-tested software like Postgres, it was
|
||||
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
||||
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
||||
Note that on Garage, we are still working on our `fsync` policy and thus, for
|
||||
now, you should expect limited data durability in case of power loss, as we are
|
||||
aware of some inconsistencies on this point (which we describe in the following
|
||||
and plan to solve).*
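To make the note above concrete, here is a minimal Python sketch of a write that is flushed explicitly. It is not Garage's implementation: the function name `durable_write` is hypothetical, and the directory `fsync` assumes a POSIX system such as Linux.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and block until the OS reports it flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush the file to the device
    # Also flush the parent directory so the new entry itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    block_path = os.path.join(tmp, "block.bin")
    durable_write(block_path, b"some block content")
    print(os.path.getsize(block_path), "bytes written and flushed")
```

Each `os.fsync` call blocks until the kernel reports completion, which is the cost the `no-fsync` build avoids.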
|
||||
|
@ -194,14 +194,14 @@ Minio, our reference point, gives us the best performances in this test.
|
|||
Looking at Garage, we observe that each improvement we made had a visible
|
||||
impact on performances. We also note that we have a progress margin in
|
||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
||||
monitoring could help us better understand the remaining difference.
|
||||
monitoring could help us better understand the remaining gap.
|
||||
|
||||
|
||||
## A myriad of objects
|
||||
|
||||
Object storage systems do not handle a single object but huge numbers of them:
|
||||
Amazon claims to handle trillions of objects on their platform, and Red Hat
|
||||
communicates about Ceph being able to handle 10 billion objects. All these
|
||||
tout Ceph as being able to handle 10 billion objects. All these
|
||||
objects must be tracked efficiently in the system to be fetched, listed,
|
||||
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
||||
For this analysis, we compare different metadata engines in Garage and see how
|
||||
|
@ -214,25 +214,25 @@ the only supported option was [sled](https://sled.rs/), but we started having
|
|||
serious issues with it - and we were not alone
|
||||
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
||||
v0.8, we introduce an abstraction semantic over the features we expect from our
|
||||
database, allowing us to switch from one back-end to another without touching
|
||||
database, allowing us to switch from one metadata back-end to another without touching
|
||||
the rest of our codebase. We added two additional back-ends: LMDB
|
||||
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
||||
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||
are both experimental: contrarily to sled, we have never run them in production
|
||||
for a long time.**
|
||||
are both experimental: contrary to sled, we have yet to run them in production
|
||||
for a significant time.**
|
||||
|
||||
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
|
||||
has its own `fsync` policy. Sled flushes its writes every 2 seconds by
|
||||
default (this is
|
||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||
LMDB by default does an `fsync` on each write, which on early tests led to
|
||||
LMDB defaults to an `fsync` on each write, which in early tests led to
|
||||
abysmal performance. We thus added 2 flags,
|
||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||
and
|
||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
||||
to deactivate `fsync` entirely. On SQLite, it is also possible to deactivate `fsync` with
|
||||
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
||||
our SQLite implementation currently still calls `fsync` for all write operations. Additionally, we are
|
||||
using these engines through Rust bindings that do not support async Rust,
|
||||
with which Garage is built, which has an impact on performance as well.
|
||||
**Our comparison will therefore not reflect the raw performances of
|
||||
|
@ -242,20 +242,20 @@ Still, we think it makes sense to evaluate our implementations in their current
|
|||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||
of the software, i.e. handling large numbers of tiny files. We chose again
|
||||
`minio/warp` as a benchmark tool, but we
|
||||
configured it here with the smallest possible object size it supported, 256
|
||||
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
||||
configured it with the smallest possible object size it supported, 256
|
||||
bytes, to put pressure on the metadata engine. We evaluated sled twice:
|
||||
with its default configuration, and with a configuration where we set a flush
|
||||
interval of 10 minutes to disable `fsync`.
|
||||
interval of 10 minutes (longer than the test) to disable `fsync`.
|
||||
|
||||
*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
|
||||
*Note that S3 has not been designed for workloads that store huge numbers of small objects;
|
||||
a regular database, like Cassandra, would be more appropriate. This test has
|
||||
only been designed to stress our metadata engine, and is not indicative of
|
||||
only been designed to stress our metadata engine and is not indicative of
|
||||
real-world performances.*
|
||||
|
||||
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
||||
|
||||
Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
|
||||
the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
|
||||
Unsurprisingly, we observe abysmal performances with SQLite, as it is the engine we have not yet put work into,
|
||||
and that still does an `fsync` for each write. Garage with the `fsync`-disabled LMDB backend performs twice as well as
|
||||
with sled in its default version and 60% better than the "no `fsync`" sled version in our
|
||||
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
||||
disk storage and RAM; we would like to quantify that in the future. As we are
|
||||
|
@ -263,7 +263,7 @@ only at the very beginning of our work on metadata engines, it is hard to draw
|
|||
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||
usage, and is a very good candidate for being Garage's default metadata engine in
|
||||
future releases. In the future, we will need to define a data policy for Garage to help us
|
||||
future releases, once we figure out the proper `fsync` tuning. In the future, we will need to define a data policy for Garage to help us
|
||||
arbitrate between performance and durability.
|
||||
|
||||
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||
|
@ -300,35 +300,35 @@ We wrote our own benchmarking tool for this test,
|
|||
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
|
||||
The benchmark procedure consists in
|
||||
concurrently sending a defined number of tiny objects (8192 objects of 16
|
||||
bytes by default) and measuring the time it takes. This step is then repeated a given
|
||||
bytes by default) and measuring the wall clock time to the last object upload. This step is then repeated a given
|
||||
number of times (128 by default) to effectively create a target number of
|
||||
objects on the cluster (1M by default). On our local setup with 3
|
||||
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
||||
following plot, we show how much time it took Garage and Minio to handle
|
||||
each batch.
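The procedure described above can be sketched as follows. This is not the code of s3billion: `run_batches`, the `upload` callable and the `workers` count are hypothetical names and values chosen for illustration, and an in-memory dictionary stands in for the S3 endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_batches(upload: Callable[[str, bytes], None],
                batches: int = 128,
                objects_per_batch: int = 8192,
                object_size: int = 16,
                workers: int = 16) -> list[float]:
    """Send `batches` batches of tiny objects; return each batch's duration."""
    payload = b"x" * object_size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"batch-{batch}/obj-{i}" for i in range(objects_per_batch)]
            start = time.perf_counter()
            # list() waits for every upload of the batch, so the measured
            # duration runs until the last object has been sent.
            list(pool.map(lambda key: upload(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# Stand-in for an S3 client: a dictionary, with small numbers for a quick run.
fake_bucket = {}
timings = run_batches(fake_bucket.__setitem__, batches=4, objects_per_batch=100)
print(len(fake_bucket), "objects stored in", len(timings), "batches")
```

With the default values, 128 batches of 8192 objects give 1,048,576 objects, which is the 1M target mentioned above.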
|
||||
|
||||
Before looking at the plot, **you must keep in mind some important points about
|
||||
Before looking at the plot, **you must keep in mind some important points regarding
|
||||
the internals of both Minio and Garage**.
|
||||
|
||||
Minio has no metadata engine, it stores its objects directly on the filesystem.
|
||||
Sending 1 million objects on Minio results in creating one million inodes on
|
||||
the storage server in our current setup. So the performances of the filesystem
|
||||
probably have a substantial impact on the results we observe.
|
||||
probably have a substantial impact on the observed results.
|
||||
In our precise setup, we know that the
|
||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
||||
creation of objects. Finally, object storage is designed for big objects, for which the
|
||||
costs measured here are negligible. In the end, again, we use Minio as a
|
||||
costs measured here are negligible. In the end, again, we use Minio only as a
|
||||
reference point to understand what performance budget we have for each part of our
|
||||
software.
|
||||
|
||||
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
||||
not created on the filesystem but the object is directly stored inline in the
|
||||
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
|
||||
>3KB objects, which we expect to be way closer to Minio, as it will have to create
|
||||
objects above 3KB, which we expect to be way closer to Minio, as it will have to create
|
||||
at least one inode per object. For now, we limit ourselves to evaluating our
|
||||
metadata engine and thus focus only on 16-byte objects.
|
||||
metadata engine and focus only on 16-byte objects.
|
||||
|
||||
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
|
||||
|
||||
|
@ -339,27 +339,27 @@ time to complete a batch of inserts is constant, while on Garage it still increa
|
|||
It could be interesting to know if Garage's batch completion time would cross Minio's
|
||||
for a very large number of objects. If we reason per object, both Minio's and
|
||||
Garage's performances remain very good: it takes respectively around 20ms and
|
||||
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
|
||||
5ms to create an object. In a real-world scenario, at 100 Mbps, the upload of a 10MB file takes
|
||||
800ms, and goes up to 8sec for a 100MB file: in both cases
|
||||
handling the object metadata is only a fraction of the upload time. The
|
||||
handling the object metadata would be only a fraction of the upload time. The
|
||||
only cases where a difference would be noticeable would be when uploading a lot of very
|
||||
small files at once, which again is an unusual usage of the S3 API.
|
||||
small files at once, which again would be an unusual usage of the S3 API.
|
||||
|
||||
Let us now focus on Garage's metrics only to better see its specific behavior:
|
||||
|
||||
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
||||
|
||||
Two effects are now more visible: 1., batch completion time increases with the
|
||||
number of objects in the bucket and 2., measurements are dispersed, at least
|
||||
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
||||
but we don't have enough data points to conclude safety: additional
|
||||
number of objects in the bucket and 2., measurements are scattered, at least
|
||||
more than for Minio. We expected this batch completion time increase to be logarithmic,
|
||||
but we don't have enough data points to conclude confidently that this is the case: additional
|
||||
measurements are needed. Concerning the observed instability, it could
|
||||
be a symptom of what we saw with some other experiments in this machine,
|
||||
be a symptom of what we saw with some other experiments on this setup,
|
||||
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
||||
request timeouts and failures. If this occurs on our testing computer, it might
|
||||
occur on other servers as well: it would be interesting to better understand this
|
||||
issue, document how to avoid it, and potentially change how we handle our I/O
|
||||
internally in Garage. But still, this was a very stressful test that will probably not be encountered in
|
||||
issue, document how to avoid it, and potentially change how we handle I/O
|
||||
internally in Garage. But still, this was a very heavy test that will probably not be encountered in
|
||||
many setups: we were adding 273 objects per second for 30 minutes straight!
|
||||
|
||||
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
||||
|
@ -382,45 +382,45 @@ core value proposition of Garage. For example, our production cluster is
|
|||
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
||||
behind consumer-grade fiber links across France and Belgium (if you are reading this,
|
||||
congratulations, you fetched this webpage from it!). That's why we are very
|
||||
careful that our internal protocol (named RPC protocol in our documentation)
|
||||
careful that our internal protocol (referred to as "RPC protocol" in our documentation)
|
||||
remains as lightweight as possible. For this analysis, we quantify how network
|
||||
latency and the number of nodes in the cluster impact the duration of the most
|
||||
latency and number of nodes in the cluster impact the duration of the most
|
||||
important kinds of S3 requests.
|
||||
|
||||
**Latency amplification** - With the kind of networks we use (consumer-grade
|
||||
fiber links across the EU), the observed latency between nodes is in the 50ms range.
|
||||
When latency is not negligible, you will observe that request completion
|
||||
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
||||
node of the cluster you are contacting can not directly answer your request, and
|
||||
has to reach other nodes of the cluster to get the requested information. Each
|
||||
node of the cluster you are contacting cannot directly answer your request, and
|
||||
has to reach other nodes of the cluster to get the data. Each
|
||||
of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
|
||||
expensive. This ratio between request duration and network latency is what we
|
||||
refer to as *latency amplification*.
|
||||
|
||||
For example, on Garage, a GetObject request does two sequential calls: first,
|
||||
it fetches the descriptor of the requested object, which contains a reference
|
||||
For example, on Garage, a `GetObject` request does two sequential calls: first,
|
||||
it fetches the descriptor of the requested object from the metadata engine, which contains a reference
|
||||
to the first block of data, and then only in a second step it can start retrieving data blocks
|
||||
from storage nodes. We can therefore expect that the
|
||||
request duration of a small GetObject request will be close to twice the
|
||||
request duration of a small `GetObject` request will be close to twice the
|
||||
network latency.
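This reasoning can be written down as a small model. It is a deliberately simplified sketch: both function names are hypothetical, each sequential call is assumed to cost exactly one network latency, and local processing time is ignored unless given.

```python
def expected_duration_ms(sequential_rpcs: int, latency_ms: float,
                         local_processing_ms: float = 0.0) -> float:
    """Estimate a request's duration from its number of sequential RPCs."""
    return sequential_rpcs * latency_ms + local_processing_ms


def latency_amplification(request_duration_ms: float, latency_ms: float) -> float:
    """Ratio between a request's duration and the network latency."""
    return request_duration_ms / latency_ms


# A small GetObject: one call for the object descriptor, one for the data block.
latency = 50.0  # ms, the range quoted for our consumer-grade fiber links
duration = expected_duration_ms(sequential_rpcs=2, latency_ms=latency)
print(duration, latency_amplification(duration, latency))
```

Under these assumptions, a small `GetObject` on a 50ms network takes about 100ms, which is an amplification of 2.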
|
||||
|
||||
We tested the latency amplification theory with another benchmark of our own named
|
||||
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
||||
which does a single request at a time on an endpoint and measures its response
|
||||
which does a single request at a time on an endpoint and measures the response
|
||||
time. As we are not interested in bandwidth but latency, all our requests
|
||||
involving an object are made on a tiny file of around 16 bytes. Our benchmark
|
||||
involving objects are made on tiny files of around 16 bytes. Our benchmark
|
||||
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
|
||||
RemoveObject. Here are the results:
|
||||
|
||||
|
||||
![Latency amplification](amplification.png)
|
||||
|
||||
As Garage has been optimized for this use case from the beginning, we don't see
|
||||
As Garage has been optimized for this use case from the very beginning, we don't see
|
||||
any significant evolution from one version to another (Garage v0.7.3 and Garage
|
||||
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
||||
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
||||
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
||||
environments with high latencies. Instead, it expects to run on clusters that are built
|
||||
ListObjects and ListBuckets) or significantly better (for GetObject, PutObject, and
|
||||
RemoveObject). This can be easily explained by the fact that Minio has not been designed with
|
||||
environments with high latencies in mind. Instead, it is expected to run on clusters that are built
|
||||
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
||||
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||
feature.
|
||||
|
@ -488,7 +488,7 @@ terabytes of data and billions of objects on long-lasting experiments.
|
|||
|
||||
In the meantime, stay tuned: we have released
|
||||
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
||||
and are already working on a number of features for the next version.
|
||||
and are already working on several features for the next version.
|
||||
For instance, we are working on a new layout that will have enhanced optimality properties,
|
||||
as well as a theoretical proof of correctness
|
||||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
|
||||
|
|