New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 86 additions and 86 deletions
|
@ -8,8 +8,8 @@ date=2022-09-26
|
||||||
their theoretical trade-offs for Garage. In particular, we pondered the impacts
|
their theoretical trade-offs for Garage. In particular, we pondered the impacts
|
||||||
of data structures, networking methods, and scheduling algorithms.
|
of data structures, networking methods, and scheduling algorithms.
|
||||||
Garage worked well enough for our production
|
Garage worked well enough for our production
|
||||||
cluster at Deuxfleurs, but we also knew that people started to discover some
|
cluster at Deuxfleurs, but we also knew that people started to experience some
|
||||||
unexpected behaviors. We thus started a round of benchmarks and performance
|
unexpected behaviors, which motivated us to start a round of benchmarks and performance
|
||||||
measurements to see how Garage behaves compared to our expectations.
|
measurements to see how Garage behaves compared to our expectations.
|
||||||
This post presents some of our first results, which cover
|
This post presents some of our first results, which cover
|
||||||
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
|
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
|
||||||
|
@ -21,35 +21,35 @@ reflecting the high-level properties we are seeking.*
|
||||||
|
|
||||||
## ⚠️ Disclaimer
|
## ⚠️ Disclaimer
|
||||||
|
|
||||||
The results presented in this blog post must be taken with a critical grain of salt due to some
|
The results presented in this blog post must be taken with a (critical) grain of salt due to some
|
||||||
limitations that are inherent to any benchmarking endeavor. We try to reference them as
|
limitations that are inherent to any benchmarking endeavor. We try to reference them as
|
||||||
exhaustively as possible in this first section, but other limitations might exist.
|
exhaustively as possible here, but other limitations might exist.
|
||||||
|
|
||||||
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
Most of our tests were made on _simulated_ networks, which by definition cannot represent all the
|
||||||
diversity of real networks (dynamic drop, jitter, latency, all of which could be
|
diversity of _real_ networks (dynamic drop, jitter, latency, all of which could be
|
||||||
correlated with throughput or any other external event). We also limited
|
correlated with throughput or any other external event). We also limited
|
||||||
ourselves to very small workloads that are not representative of a production
|
ourselves to very small workloads that are not representative of a production
|
||||||
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
||||||
our results are thus not an evaluation of the performance of Garage as a whole.
|
our results are not an evaluation of the performance of Garage as a whole.
|
||||||
|
|
||||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||||
not try to optimize its configuration as we have done for Garage, and more
|
not try to optimize its configuration as we have done for Garage, and more
|
||||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
generally, we have significantly less knowledge of Minio's internals compared to Garage, which could lead
|
||||||
to underrated performance measurements for Minio. It must also be noted that
|
to underrated performance measurements for Minio. It must also be noted that
|
||||||
Garage and Minio are systems with different feature sets. For instance, Minio supports
|
Garage and Minio are systems with different feature sets. For instance, Minio supports
|
||||||
erasure coding for higher data density, which Garage doesn't, Minio implements
|
erasure coding for higher data density and Garage doesn't, Minio implements
|
||||||
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||||
that you must keep in mind when reading the plots we will present. You should consider
|
that you must keep in mind when reading the plots we will present. You should consider
|
||||||
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
Minio's results as a way to contextualize Garage's numbers, to justify that our improvements
|
||||||
are not artificial in the light of existing object storage implementations.
|
are not simply artificial in the light of existing object storage implementations.
|
||||||
|
|
||||||
The impact of the testing environment is also not evaluated (kernel patches,
|
The impact of the testing environment is also not evaluated (kernel patches,
|
||||||
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||||
these parameters could favor one configuration or software product over another.
|
these parameters could favor one configuration or software product over another.
|
||||||
Especially, it must be noted that most of the tests were done on a
|
Especially, it must be noted that most of the tests were done on a
|
||||||
consumer-grade PC with only an SSD, which is different from most
|
consumer-grade PC with only an SSD, which is different from most
|
||||||
production setups. Finally, our results are also provided without statistical
|
production setups. Finally, our results are also provided without statistical
|
||||||
tests to check their significance, and might thus have insufficient significance
|
tests to validate their significance, and might have insufficient ground
|
||||||
to be claimed as reliable.
|
to be claimed as reliable.
|
||||||
|
|
||||||
When reading this post, please keep in mind that **we are not making any
|
When reading this post, please keep in mind that **we are not making any
|
||||||
|
@ -75,25 +75,25 @@ geo-distributed topology. We used the Grid5000 testbed only during our
|
||||||
preliminary tests to identify issues when running Garage on many powerful
|
preliminary tests to identify issues when running Garage on many powerful
|
||||||
servers. We then reproduced these issues in a controlled environment
|
servers. We then reproduced these issues in a controlled environment
|
||||||
outside of Grid5000, so don't be
|
outside of Grid5000, so don't be
|
||||||
surprised then if Grid5000 is not mentioned often on our plots.
|
surprised then if Grid5000 is not always mentioned on our plots.
|
||||||
|
|
||||||
To reproduce some environments locally, we have a small set of Python scripts
|
To reproduce some environments locally, we have a small set of Python scripts
|
||||||
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
|
||||||
needs[^ref1]. Most of the following tests were thus run locally with `mknet` on a
|
needs[^ref1]. Most of the following tests were run locally with `mknet` on a
|
||||||
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
|
||||||
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
RAM and a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
|
||||||
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
|
||||||
`vm.dirty_ratio` has been reduced to `2` and `1` respectively as, with default
|
`vm.dirty_ratio` have been reduced to `2` and `1` respectively: with default
|
||||||
values, the system tends to freeze when it is under heavy I/O load.
|
values, the system tends to freeze under heavy I/O load.
|
||||||
|
|
||||||
## Efficient I/O
|
## Efficient I/O
|
||||||
|
|
||||||
The main purpose of an object storage system is to store and retrieve objects
|
The main purpose of an object storage system is to store and retrieve objects
|
||||||
across the network, and the faster these two functions can be accomplished,
|
across the network, and the faster these two functions can be accomplished,
|
||||||
the more efficient the system as a whole will be. For this analysis, we focus on
|
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||||
2 aspects of performance. First, since many applications can start processing a file
|
2 aspects of performance. First, since many applications can start processing a file
|
||||||
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
|
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
|
||||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
on `GetObject` requests, i.e. the duration between the moment a request is sent
|
||||||
and the moment where the first bytes of the returned object are received by the client.
|
and the moment where the first bytes of the returned object are received by the client.
|
||||||
Second, we will evaluate generic throughput, to understand how well
|
Second, we will evaluate generic throughput, to understand how well
|
||||||
Garage can leverage the underlying machine's performance.
|
Garage can leverage the underlying machine's performance.
|
||||||
|
@ -101,18 +101,18 @@ Garage can leverage the underlying machine's performance.
|
||||||
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
|
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
|
||||||
web endpoints, with the idea to make it a platform of choice to publish
|
web endpoints, with the idea to make it a platform of choice to publish
|
||||||
static websites. When publishing a website, TTFB can be directly observed
|
static websites. When publishing a website, TTFB can be directly observed
|
||||||
by the end user, as it will impact the perceived reactivity of the website.
|
by the end user, as it will impact the perceived reactivity of the page being loaded.
|
||||||
|
|
||||||
Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
|
Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
|
||||||
This can be explained by the fact that Garage was not able to handle data internally
|
This can be explained by the fact that Garage was not able to handle data internally
|
||||||
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
at a smaller granularity level than entire data blocks, which are up to 1MB chunks of a given object
|
||||||
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
Let us take the example of a 4.5MB object, which Garage will split by default into four 1MB blocks and one 0.5MB block.
|
||||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
With the old design, when you were sending a `GET`
|
||||||
request, the first block had to be fully retrieved by the gateway node from the
|
request, the first block had to be _fully_ retrieved by the gateway node from the
|
||||||
storage node before starting to send any data to the client.
|
storage node before it could start sending any data to the client.
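
The block split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Garage's actual code: the function name `split_into_blocks` is hypothetical, and "1MB" is assumed to mean 1,000,000 bytes for readability.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes (in bytes) of the successive blocks of an object.

    Hypothetical helper: assumes fixed-size blocks plus one shorter
    trailing block, as in the 4.5MB example from the text.
    """
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left of the object.
        sizes.append(remainder)
    return sizes


# A 4.5MB object yields four full blocks and one half-sized block.
print(split_into_blocks(4_500_000))
```

With the old design, the gateway had to receive the whole first element of this list before the client saw any byte.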
|
||||||
|
|
||||||
With Garage v0.8, we integrated a data streaming logic that allows the gateway
|
With Garage v0.8, we added a data streaming logic that allows the gateway
|
||||||
to send the beginning of a block without having to wait for the full block to be received from
|
to send the beginning of a block without having to wait for the full block to be received from
|
||||||
the storage node. We can visually represent the difference as follow:
|
the storage node. We can visually represent the difference as follow:
|
||||||
|
|
||||||
|
@ -120,13 +120,13 @@ the storage node. We can visually represent the difference as follow:
|
||||||
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
|
<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
|
||||||
</center>
|
</center>
|
||||||
|
|
||||||
As our default block size is only 1MB, the difference should be very small on
|
As our default block size is only 1MB, the difference should be marginal on
|
||||||
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
|
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
|
||||||
thus adding at most 8ms of latency to a GetObject request (assuming no other
|
adding at most 8ms of latency to a `GetObject` request (assuming no other
|
||||||
data transfer is happening in parallel). However,
|
data transfer is happening in parallel). However,
|
||||||
on a very slow network, or a very congested link with many parallel requests
|
on a very slow network, or a very congested link with many parallel requests
|
||||||
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
|
handled, the impact can be much more important: on a 5Mbps network, it takes at least 1.6 seconds
|
||||||
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
|
to transfer our 1MB block, and streaming will heavily improve user experience.
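
The two figures quoted above follow directly from the block size and the link speed. The sketch below reproduces them under simplifying assumptions: 1MB is taken as 1,000,000 bytes, the link is fully available to a single transfer, and protocol overhead and latency are ignored. The helper name `transfer_time_s` is ours.

```python
def transfer_time_s(size_bytes: float, link_bits_per_s: float) -> float:
    """Lower bound on the time needed to push size_bytes through a link."""
    return size_bytes * 8 / link_bits_per_s


BLOCK_SIZE = 1_000_000  # assumption: "1MB" read as 10^6 bytes

for name, bps in (("1Gbps", 1e9), ("5Mbps", 5e6)):
    # Without streaming, this whole duration is added to the TTFB.
    print(f"{name}: {transfer_time_s(BLOCK_SIZE, bps) * 1000:.0f} ms")
```

Because this is a lower bound, any congestion or parallel transfer on the link can only increase the delay that a non-streaming gateway adds to the first byte.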
|
||||||
|
|
||||||
We wanted to see if this theory holds in practice: we simulated a low latency
|
We wanted to see if this theory holds in practice: we simulated a low latency
|
||||||
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
||||||
|
@ -138,11 +138,11 @@ whose results are shown on the following figure:
|
||||||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||||
|
|
||||||
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
||||||
and 2s, which corresponds to the time to transfer the full block which we calculated above.
|
and 2s, which matches the time required to transfer the full block which we calculated above.
|
||||||
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
||||||
streaming feature (the lowest value is 43ms). Minio sits between the two
|
streaming feature (the lowest value is 43ms). Minio sits between the two
|
||||||
Garage versions: we suppose that it does some form of batching, but smaller
|
Garage versions: we suppose that it does some form of batching, but smaller
|
||||||
than 1MB.
|
than our initial 1MB default.
|
||||||
|
|
||||||
**Throughput** - As soon as we publicly released Garage, people started
|
**Throughput** - As soon as we publicly released Garage, people started
|
||||||
benchmarking it, comparing its performances to writing directly on the
|
benchmarking it, comparing its performances to writing directly on the
|
||||||
|
@ -152,7 +152,7 @@ situation, we did some optimizations, such as putting costly processing like has
|
||||||
and many others
|
and many others
|
||||||
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
||||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
|
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
|
||||||
version 0.8 "Beta 1". We also noticed that some of the logic logic we wrote
|
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
||||||
to better control resource usage
|
to better control resource usage
|
||||||
and detect errors, including semaphores and timeouts, was artificially limiting
|
and detect errors, including semaphores and timeouts, was artificially limiting
|
||||||
performances. In another iteration, we made this logic less restrictive at the
|
performances. In another iteration, we made this logic less restrictive at the
|
||||||
|
@ -162,7 +162,7 @@ version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time
|
||||||
write a block. We know that this is expensive and did a test build without any
|
write a block. We know that this is expensive and did a test build without any
|
||||||
`fsync` call ([see the
|
`fsync` call ([see the
|
||||||
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
|
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
|
||||||
that will not be merged, just to assess the impact of `fsync`. We refer to it
|
that will not be merged, only to assess the impact of `fsync`. We refer to it
|
||||||
as `no-fsync` in the following plot.
|
as `no-fsync` in the following plot.
|
||||||
|
|
||||||
*A note about `fsync`: for performance reasons, operating systems often do not
|
*A note about `fsync`: for performance reasons, operating systems often do not
|
||||||
|
@ -172,12 +172,12 @@ with other writes. If a power loss occurs before the OS has time to flush
|
||||||
data to disk, some writes will be lost. To ensure that a write is effectively
|
data to disk, some writes will be lost. To ensure that a write is effectively
|
||||||
written to disk, the
|
written to disk, the
|
||||||
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
||||||
which blocks until the file or directory on which it is called has been flushed from volatile
|
which effectively blocks until the file or directory on which it is called has been flushed from volatile
|
||||||
memory to the persistent storage device. Additionally, the exact semantic of
|
memory to the persistent storage device. Additionally, the exact semantic of
|
||||||
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
||||||
and, even on battle-tested software like Postgres, it was
|
and, even on battle-tested software like Postgres, it was
|
||||||
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
||||||
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
Note that on Garage, we are still working on our `fsync` policy and thus, for
|
||||||
now, you should expect limited data durability in case of power loss, as we are
|
now, you should expect limited data durability in case of power loss, as we are
|
||||||
aware of some inconsistencies on this point (which we describe in the following
|
aware of some inconsistencies on this point (which we describe in the following
|
||||||
and plan to solve).*
|
and plan to solve).*
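
To make the note above concrete, here is a minimal Python sketch of a write that is explicitly flushed with `fsync`. It is not how Garage writes its blocks (Garage is written in Rust and its policy is still being worked on); the function name `durable_write` is hypothetical, and the directory `fsync` step assumes a POSIX system.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and block until the OS reports it flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush the file to the device
    # Assumption (POSIX): also fsync the parent directory so that the
    # new directory entry itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    durable_write(os.path.join(tmp, "block"), b"some block content")
```

Each `os.fsync` call blocks until the device acknowledges the flush, which is why removing these calls changes throughput so visibly in a benchmark.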
|
||||||
|
@ -194,14 +194,14 @@ Minio, our reference point, gives us the best performances in this test.
|
||||||
Looking at Garage, we observe that each improvement we made had a visible
|
Looking at Garage, we observe that each improvement we made had a visible
|
||||||
impact on performances. We also note that we have a progress margin in
|
impact on performances. We also note that we have a progress margin in
|
||||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
terms of performances compared to Minio: additional benchmarks, tests, and
|
||||||
monitoring could help us better understand the remaining difference.
|
monitoring could help us better understand the remaining gap.
|
||||||
|
|
||||||
|
|
||||||
## A myriad of objects
|
## A myriad of objects
|
||||||
|
|
||||||
Object storage systems do not handle a single object but huge numbers of them:
|
Object storage systems do not handle a single object but huge numbers of them:
|
||||||
Amazon claims to handle trillions of objects on their platform, and Red Hat
|
Amazon claims to handle trillions of objects on their platform, and Red Hat
|
||||||
communicates about Ceph being able to handle 10 billion objects. All these
|
touts Ceph as being able to handle 10 billion objects. All these
|
||||||
objects must be tracked efficiently in the system to be fetched, listed,
|
objects must be tracked efficiently in the system to be fetched, listed,
|
||||||
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
||||||
For this analysis, we compare different metadata engines in Garage and see how
|
For this analysis, we compare different metadata engines in Garage and see how
|
||||||
|
@ -214,25 +214,25 @@ the only supported option was [sled](https://sled.rs/), but we started having
|
||||||
serious issues with it - and we were not alone
|
serious issues with it - and we were not alone
|
||||||
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
||||||
v0.8, we introduce an abstraction semantic over the features we expect from our
|
v0.8, we introduce an abstraction semantic over the features we expect from our
|
||||||
database, allowing us to switch from one back-end to another without touching
|
database, allowing us to switch from one metadata back-end to another without touching
|
||||||
the rest of our codebase. We added two additional back-ends: LMDB
|
the rest of our codebase. We added two additional back-ends: LMDB
|
||||||
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
||||||
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||||
are both experimental: contrarily to sled, we have never run them in production
|
are both experimental: contrarily to sled, we have yet to run them in production
|
||||||
for a long time.**
|
for a significant time.**
|
||||||
|
|
||||||
Similarly to the impact of `fsync` on block writing, each database engine we use
|
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||||
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
|
has its own `fsync` policy. Sled flushes its writes every 2 seconds by
|
||||||
default (this is
|
default (this is
|
||||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||||
LMDB by default does an `fsync` on each write, which on early tests led to
|
LMDB defaults to an `fsync` on each write, which on early tests led to
|
||||||
abysmal performance. We thus added 2 flags,
|
abysmal performance. We thus added 2 flags,
|
||||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||||
and
|
and
|
||||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||||
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
to deactivate `fsync` entirely. On SQLite, it is also possible to deactivate `fsync` with
|
||||||
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||||
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
our SQLite implementation currently still calls `fsync` for all write operations. Additionally, we are
|
||||||
using these engines through Rust bindings that do not support async Rust,
|
using these engines through Rust bindings that do not support async Rust,
|
||||||
with which Garage is built, which has an impact on performance as well.
|
with which Garage is built, which has an impact on performance as well.
|
||||||
**Our comparison will therefore not reflect the raw performances of
|
**Our comparison will therefore not reflect the raw performances of
|
||||||
|
@ -242,20 +242,20 @@ Still, we think it makes sense to evaluate our implementations in their current
|
||||||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||||
of the software, i.e. handling large numbers of tiny files. We chose again
|
of the software, i.e. handling large numbers of tiny files. We chose again
|
||||||
`minio/warp` as a benchmark tool, but we
|
`minio/warp` as a benchmark tool, but we
|
||||||
configured it here with the smallest possible object size it supported, 256
|
configured it with the smallest possible object size it supported, 256
|
||||||
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
bytes, to put pressure on the metadata engine. We evaluated sled twice:
|
||||||
with its default configuration, and with a configuration where we set a flush
|
with its default configuration, and with a configuration where we set a flush
|
||||||
interval of 10 minutes to disable `fsync`.
|
interval of 10 minutes (longer than the test) to disable `fsync`.
|
||||||
|
|
||||||
*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
|
*Note that S3 has not been designed for workloads that store huge numbers of small objects;
|
||||||
a regular database, like Cassandra, would be more appropriate. This test has
|
a regular database, like Cassandra, would be more appropriate. This test has
|
||||||
only been designed to stress our metadata engine, and is not indicative of
|
only been designed to stress our metadata engine and is not indicative of
|
||||||
real-world performances.*
|
real-world performances.*
|
||||||
|
|
||||||
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
||||||
|
|
||||||
Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
|
Unsurprisingly, we observe abysmal performances with SQLite, as it is the engine we have not worked on yet,
|
||||||
the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
|
and that still does an `fsync` for each write. Garage with the `fsync`-disabled LMDB backend performs twice as well as
|
||||||
with sled in its default version and 60% better than the "no `fsync`" sled version in our
|
with sled in its default version and 60% better than the "no `fsync`" sled version in our
|
||||||
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
||||||
disk storage and RAM; we would like to quantify that in the future. As we are
|
disk storage and RAM; we would like to quantify that in the future. As we are
|
||||||
|
@ -263,7 +263,7 @@ only at the very beginning of our work on metadata engines, it is hard to draw
|
||||||
strong conclusions. Still, we can say that SQLite is not ready for production
|
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||||
workloads, and that LMDB looks very promising both in terms of performances and resource
|
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||||
usage, and is a very good candidate for being Garage's default metadata engine in
|
usage, and is a very good candidate for being Garage's default metadata engine in
|
||||||
future releases. In the future, we will need to define a data policy for Garage to help us
|
future releases, once we figure out the proper `fsync` tuning. In the future, we will need to define a data policy for Garage to help us
|
||||||
arbitrate between performance and durability.
|
arbitrate between performance and durability.
|
||||||
|
|
||||||
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||||
|
@ -300,35 +300,35 @@ We wrote our own benchmarking tool for this test,
|
||||||
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
|
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
|
||||||
The benchmark procedure consists in
|
The benchmark procedure consists in
|
||||||
concurrently sending a defined number of tiny objects (8192 objects of 16
|
concurrently sending a defined number of tiny objects (8192 objects of 16
|
||||||
bytes by default) and measuring the time it takes. This step is then repeated a given
|
bytes by default) and measuring the wall clock time to the last object upload. This step is then repeated a given
|
||||||
number of times (128 by default) to effectively create a target number of
|
number of times (128 by default) to effectively create a target number of
|
||||||
objects on the cluster (1M by default). On our local setup with 3
|
objects on the cluster (1M by default). On our local setup with 3
|
||||||
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
||||||
following plot, we show how much time it took Garage and Minio to handle
|
following plot, we show how much time it took Garage and Minio to handle
|
||||||
each batch.
|
each batch.
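
The procedure described above can be sketched as follows. This is not the s3billion code, only an illustration of the measurement loop: `run_batches` and its `upload` callback are hypothetical names, and an in-memory dictionary stands in for the S3 cluster so that the example runs without any network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_batches(upload, objects_per_batch=8192, batches=128,
                object_size=16, workers=16):
    """Send `batches` batches of tiny objects and time each batch.

    `upload(key, payload)` is a stand-in for an S3 PutObject call.
    Returns the wall clock duration of each batch, in seconds.
    """
    payload = b"\0" * object_size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for b in range(batches):
            keys = [f"batch{b}/obj{i}" for i in range(objects_per_batch)]
            start = time.perf_counter()
            # list() waits for the last upload before stopping the clock.
            list(pool.map(lambda key: upload(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# Default parameters give the "1M" target: 8192 * 128 = 1,048,576 objects.
store = {}
timings = run_batches(store.__setitem__, objects_per_batch=64, batches=4)
print(len(store), "objects stored in", len(timings), "batches")
```

Plotting the returned durations against the batch index shows whether the time per batch stays flat or grows as the number of stored objects increases.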
|
||||||
|
|
||||||
Before looking at the plot, **you must keep in mind some important points about
|
Before looking at the plot, **you must keep in mind some important points regarding
|
||||||
the internals of both Minio and Garage**.
|
the internals of both Minio and Garage**.
|
||||||
|
|
||||||
Minio has no metadata engine, it stores its objects directly on the filesystem.
|
Minio has no metadata engine, it stores its objects directly on the filesystem.
|
||||||
Sending 1 million objects on Minio results in creating one million inodes on
|
Sending 1 million objects on Minio results in creating one million inodes on
|
||||||
the storage server in our current setup. So the performances of the filesystem
|
the storage server in our current setup. So the performances of the filesystem
|
||||||
probably have a substantial impact on the results we observe.
|
probably have a substantial impact on the observed results.
|
||||||
In our precise setup, we know that the
|
In our precise setup, we know that the
|
||||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||||
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
||||||
creation of objects. Finally, object storage is designed for big objects, for which the
|
creation of objects. Finally, object storage is designed for big objects, for which the
|
||||||
costs measured here are negligible. In the end, again, we use Minio as a
|
costs measured here are negligible. In the end, again, we use Minio only as a
|
||||||
reference point to understand what performance budget we have for each part of our
|
reference point to understand what performance budget we have for each part of our
|
||||||
software.
|
software.
|
||||||
|
|
||||||
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
||||||
not created on the filesystem but the object is directly stored inline in the
|
not created on the filesystem but the object is directly stored inline in the
|
||||||
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
|
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
|
||||||
>3KB objects, which we expect to be way closer to Minio, as it will have to create
|
objects above 3KB, which we expect to be way closer to Minio, as it will have to create
|
||||||
at least one inode per object. For now, we limit ourselves to evaluating our
|
at least one inode per object. For now, we limit ourselves to evaluating our
|
||||||
metadata engine and thus focus only on 16-byte objects.
|
metadata engine and focus only on 16-byte objects.
|
||||||
|
|
||||||
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
|
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
|
||||||
|
|
||||||
|
@ -339,27 +339,27 @@ time to complete a batch of inserts is constant, while on Garage it still increa
|
||||||
It could be interesting to know if Garage's batch completion time would cross Minio's one
|
It could be interesting to know if Garage's batch completion time would cross Minio's one
|
||||||
for a very large number of objects. If we reason per object, both Minio's and
|
for a very large number of objects. If we reason per object, both Minio's and
|
||||||
Garage's performances remain very good: it takes respectively around 20ms and
|
Garage's performances remain very good: it takes respectively around 20ms and
|
||||||
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
|
5ms to create an object. In a real-world scenario, at 100 Mbps, the upload of a 10MB file takes
|
||||||
800ms, and goes up to 8sec for a 100MB file: in both cases
|
800ms, and goes up to 8sec for a 100MB file: in both cases
|
||||||
handling the object metadata is only a fraction of the upload time. The
|
handling the object metadata would be only a fraction of the upload time. The
|
||||||
only cases where a difference would be noticeable would be when uploading a lot of very
|
only cases where a difference would be noticeable would be when uploading a lot of very
|
||||||
small files at once, which again is an unusual usage of the S3 API.
|
small files at once, which again would be an unusual usage of the S3 API.
|
||||||
|
|
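The orders of magnitude quoted above can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal megabytes, a fully used link, and no protocol overhead; the function names are ours, and the 20 ms and 5 ms figures are the per-object costs quoted in the text.

```python
def transfer_time_ms(size_megabytes: float, link_mbps: float) -> float:
    """Time to push `size_megabytes` over a `link_mbps` link, ignoring protocol overhead."""
    megabits = size_megabytes * 8  # 1 byte = 8 bits
    return megabits / link_mbps * 1000.0


def metadata_share(metadata_ms: float, size_megabytes: float, link_mbps: float) -> float:
    """Ratio between the per-object metadata cost and the raw transfer time."""
    return metadata_ms / transfer_time_ms(size_megabytes, link_mbps)


if __name__ == "__main__":
    for name, cost_ms in (("Minio", 20.0), ("Garage", 5.0)):
        for size in (10, 100):
            total = transfer_time_ms(size, 100)
            share = metadata_share(cost_ms, size, 100)
            print(f"{name}: {size} MB takes {total:.0f} ms, metadata is {share:.2%} of that")
```

Under these assumptions, the metadata cost stays below 3% of the transfer time in every case considered here.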
||||||
Let us now focus on Garage's metrics only to better see its specific behavior:
|
Let us now focus on Garage's metrics only to better see its specific behavior:
|
||||||
|
|
||||||
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
||||||
|
|
||||||
Two effects are now more visible: 1., batch completion time increases with the
|
Two effects are now more visible: 1., batch completion time increases with the
|
||||||
number of objects in the bucket and 2., measurements are dispersed, at least
|
number of objects in the bucket and 2., measurements are scattered, at least
|
||||||
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
more than for Minio. We expected this batch completion time increase to be logarithmic,
|
||||||
but we don't have enough data points to conclude safety: additional
|
but we don't have enough data points to conclude confidently that it is the case: additional
|
||||||
measurements are needed. Concerning the observed instability, it could
|
measurements are needed. Concerning the observed instability, it could
|
||||||
be a symptom of what we saw with some other experiments in this machine,
|
be a symptom of what we saw with some other experiments on this setup,
|
||||||
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
||||||
request timeouts and failures. If this occurs on our testing computer, it might
|
request timeouts and failures. If this occurs on our testing computer, it might
|
||||||
occur on other servers as well: it would be interesting to better understand this
|
occur on other servers as well: it would be interesting to better understand this
|
||||||
issue, document how to avoid it, and potentially change how we handle our I/O
|
issue, document how to avoid it, and potentially change how we handle I/O
|
||||||
internally in Garage. But still, this was a very stressful test that will probably not be encountered in
|
internally in Garage. But still, this was a very heavy test that will probably not be encountered in
|
||||||
many setups: we were adding 273 objects per second for 30 minutes straight!
|
many setups: we were adding 273 objects per second for 30 minutes straight!
|
||||||
|
|
||||||
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
||||||
|
@ -382,45 +382,45 @@ core value proposition of Garage. For example, our production cluster is
|
||||||
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
||||||
behind consumer-grade fiber links across France and Belgium (if you are reading this,
|
behind consumer-grade fiber links across France and Belgium (if you are reading this,
|
||||||
congratulations, you fetched this webpage from it!). That's why we are very
|
congratulations, you fetched this webpage from it!). That's why we are very
|
||||||
careful that our internal protocol (named RPC protocol in our documentation)
|
careful that our internal protocol (referred to as "RPC protocol" in our documentation)
|
||||||
remains as lightweight as possible. For this analysis, we quantify how network
|
remains as lightweight as possible. For this analysis, we quantify how network
|
||||||
latency and the number of nodes in the cluster impact the duration of the most
|
latency and number of nodes in the cluster impact the duration of the most
|
||||||
important kinds of S3 requests.
|
important kinds of S3 requests.
|
||||||
|
|
||||||
**Latency amplification** - With the kind of networks we use (consumer-grade
|
**Latency amplification** - With the kind of networks we use (consumer-grade
|
||||||
fiber links across the EU), the observed latency between nodes is in the 50ms range.
|
fiber links across the EU), the observed latency between nodes is in the 50ms range.
|
||||||
When latency is not negligible, you will observe that request completion
|
When latency is not negligible, you will observe that request completion
|
||||||
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
||||||
node of the cluster you are contacting can not directly answer your request, and
|
node of the cluster you are contacting cannot directly answer your request, and
|
||||||
has to reach other nodes of the cluster to get the requested information. Each
|
has to reach other nodes of the cluster to get the data. Each
|
||||||
of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
|
of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
|
||||||
expensive. This ratio between request duration and network latency is what we
|
expensive. This ratio between request duration and network latency is what we
|
||||||
refer to as *latency amplification*.
|
refer to as *latency amplification*.
|
||||||
|
|
||||||
For example, on Garage, a GetObject request does two sequential calls: first,
|
For example, on Garage, a `GetObject` request does two sequential calls: first,
|
||||||
it fetches the descriptor of the requested object, which contains a reference
|
it fetches the descriptor of the requested object from the metadata engine, which contains a reference
|
||||||
to the first block of data, and then only in a second step it can start retrieving data blocks
|
to the first block of data, and then only in a second step it can start retrieving data blocks
|
||||||
from storage nodes. We can therefore expect that the
|
from storage nodes. We can therefore expect that the
|
||||||
request duration of a small GetObject request will be close to twice the
|
request duration of a small `GetObject` request will be close to twice the
|
||||||
network latency.
|
network latency.
|
||||||
|
|
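The reasoning above amounts to a very small model, sketched below. It assumes that each sequential RPC costs exactly one network latency and that local processing time is negligible for tiny objects; both are simplifications of ours, and the function names are illustrative.

```python
def expected_duration_ms(sequential_rpcs: int, latency_ms: float, local_ms: float = 0.0) -> float:
    """Estimated request duration when `sequential_rpcs` calls must happen one after another."""
    return sequential_rpcs * latency_ms + local_ms


def amplification(duration_ms: float, latency_ms: float) -> float:
    """Latency amplification: ratio between request duration and network latency."""
    return duration_ms / latency_ms


if __name__ == "__main__":
    latency = 50.0  # ms, the range quoted for the links between nodes
    duration = expected_duration_ms(sequential_rpcs=2, latency_ms=latency)  # GetObject: 2 calls
    print(f"expected GetObject duration: {duration:.0f} ms, "
          f"amplification: {amplification(duration, latency):.1f}x")
```

A measured amplification noticeably above the number of sequential calls would hint at extra round trips or processing time that this model ignores.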
||||||
We tested the latency amplification theory with another benchmark of our own named
|
We tested the latency amplification theory with another benchmark of our own named
|
||||||
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
||||||
which does a single request at a time on an endpoint and measures its response
|
which does a single request at a time on an endpoint and measures the response
|
||||||
time. As we are not interested in bandwidth but latency, all our requests
|
time. As we are not interested in bandwidth but latency, all our requests
|
||||||
involving an object are made on a tiny file of around 16 bytes. Our benchmark
|
involving objects are made on tiny files of around 16 bytes. Our benchmark
|
||||||
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
|
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
|
||||||
RemoveObject. Here are the results:
|
RemoveObject. Here are the results:
|
||||||
|
|
||||||
|
|
||||||
![Latency amplification](amplification.png)
|
![Latency amplification](amplification.png)
|
||||||
|
|
||||||
As Garage has been optimized for this use case from the beginning, we don't see
|
As Garage has been optimized for this use case from the very beginning, we don't see
|
||||||
any significant evolution from one version to another (Garage v0.7.3 and Garage
|
any significant evolution from one version to another (Garage v0.7.3 and Garage
|
||||||
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
||||||
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
ListObjects and ListBuckets) or significantly better (for GetObject, PutObject, and
|
||||||
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
RemoveObject). This can be easily explained by the fact that Minio has not been designed with
|
||||||
environments with high latencies. Instead, it expects to run on clusters that are built
|
environments with high latencies in mind. Instead, it is expected to run on clusters that are built
|
||||||
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
||||||
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||||
feature.
|
feature.
|
||||||
|
@ -488,7 +488,7 @@ terabytes of data and billions of objects on long-lasting experiments.
|
||||||
|
|
||||||
In the meantime, stay tuned: we have released
|
In the meantime, stay tuned: we have released
|
||||||
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
||||||
and are already working on a number of features for the next version.
|
and are already working on several features for the next version.
|
||||||
For instance, we are working on a new layout that will have enhanced optimality properties,
|
For instance, we are working on a new layout that will have enhanced optimality properties,
|
||||||
as well as a theoretical proof of correctness
|
as well as a theoretical proof of correctness
|
||||||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
|
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
|
||||||
|
|