New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 55 additions and 54 deletions
|
@ -22,7 +22,7 @@ to reflect the high-level properties we are seeking.*
|
|||
|
||||
The following results must be taken with a critical grain of salt due to some
|
||||
limitations that are inherent to any benchmark. We try to reference them as
|
||||
exhaustively as possible in this first section, but other limitations might exist.
|
||||
|
||||
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
||||
diversity of real networks (dynamic drop, jitter, latency, all of which could be
|
||||
|
@ -109,7 +109,7 @@ at a smaller granularity level than entire data blocks, which are 1MB chunks of
|
|||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||
request, the first block had to be fully retrieved by the gateway node from the
|
||||
storage node before starting to send any data to the client.
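As an illustration of the splitting described above, here is a minimal Python sketch. The function name `split_into_blocks` and the decimal 1 MB block size are assumptions made for this example; it is not Garage's actual code, which is written in Rust.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes of the blocks an object of object_size bytes is cut into."""
    # Hypothetical helper: block_size defaults to a decimal 1 MB (assumption).
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left over
    return sizes


blocks = split_into_blocks(4_500_000)  # the 4.5 MB object from the text
print(blocks)
```

With the old design, the gateway had to receive the whole first entry of this list before sending anything to the client.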
|
||||
|
||||
With Garage v0.8, we integrated a block streaming logic that allows the gateway
|
||||
to send the beginning of a block without having to wait for the full block from
|
||||
|
@ -125,7 +125,7 @@ thus adding at most 8ms of latency to a GetObject request (assuming no other
|
|||
data transfer is happening in parallel). However,
|
||||
on a very slow network, or a very congested link with many parallel requests
|
||||
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
|
||||
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
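The figures above follow from a simple size-over-bandwidth calculation, sketched below. The function name is hypothetical, the block is taken as a decimal 1 MB, and the 1 Gbps link is our assumption for the fast case; latency and protocol overhead are ignored.

```python
def transfer_time_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time needed to push size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / link_bits_per_second


BLOCK_SIZE = 1_000_000  # one 1 MB block, decimal megabyte (assumption)

slow = transfer_time_seconds(BLOCK_SIZE, 5_000_000)      # 5 Mbps link
fast = transfer_time_seconds(BLOCK_SIZE, 1_000_000_000)  # 1 Gbps link (assumption)
print(f"{slow:.1f} s on 5 Mbps, {fast * 1000:.0f} ms on 1 Gbps")
```

Without streaming, this is roughly the delay a client would wait before receiving the first byte of a block.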
|
||||
|
||||
We wanted to see if this theory holds in practice: we simulated a low latency
|
||||
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
|
||||
|
@ -185,7 +185,7 @@ To assess performance improvements, we used the benchmark tool
|
|||
[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
|
||||
adapted for small-scale tests, and we kept only the aggregated result named
|
||||
"cluster total". The goal of this experiment is to get an idea of the cluster
|
||||
performance with a standardized and mixed workload.
|
||||
|
||||
![Plot showing IO perf of Garage configs and Minio](io.png)
|
||||
|
||||
|
@ -194,7 +194,7 @@ Looking at Garage, we observe that each improvement we made has a visible
|
|||
impact on performances. We also note that we have a progress margin in
|
||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
||||
monitoring could help better understand the remaining difference.
|
||||
|
||||
|
||||
|
||||
## A myriad of objects
|
||||
|
||||
|
@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
|
|||
For this analysis, we compare different metadata engines in Garage and see how
|
||||
well the best one scales to a million objects.
|
||||
|
||||
**Testing metadata engines** - With Garage, we chose to not store metadata
|
||||
directly on the filesystem, like Minio for example, but in an on-disk fancy
|
||||
B-Tree structure, in other words, in an embedded database engine. Until now,
|
||||
the only available option was [sled](https://sled.rs/), but we started having
|
||||
serious issues with it, and we were not alone
|
||||
**Testing metadata engines** - With Garage, we chose not to store metadata
|
||||
directly on the filesystem, like Minio for example, but in a specialized on-disk
|
||||
B-Tree data structure; in other words, in an embedded database engine. Until now,
|
||||
the only supported option was [sled](https://sled.rs/), but we started having
|
||||
serious issues with it - and we were not alone
|
||||
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
||||
v0.8, we introduce an abstraction semantic over the features we expect from our
|
||||
database, allowing us to switch from one backend to another without touching
|
||||
the rest of our codebase. We added two additional backends: lmdb
|
||||
([heed](https://github.com/meilisearch/heed)) and sqlite
|
||||
([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||
the rest of our codebase. We added two additional backends: LMDB
|
||||
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
||||
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||
are both experimental: unlike sled, we have never run them in production
|
||||
for a long time.**
|
||||
|
||||
Similarly to the impact of fsync on block writing, each database engine we use
|
||||
has its own policy with fsync. Sled flushes its write every 2 seconds by
|
||||
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||
has its own policy with `fsync`. Sled flushes its write every 2 seconds by
|
||||
default (this is
|
||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||
lmdb by default does an `fsync` on each write, on early tests it led to very
|
||||
slow resynchronizations between nodes. We added 2 flags:
|
||||
LMDB by default does an `fsync` on each write, which on early tests led to very
|
||||
slow resynchronizations between nodes. We thus added 2 flags,
|
||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||
and
|
||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
|
||||
which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
|
||||
`pragma synchronous = off;`, but we did not start any optimization work on it:
|
||||
our sqlite implementation fsync all the data on the disk. Additionally, we are
|
||||
using these engines through a Rust binding that had to do some tradeoff on the
|
||||
concurrency part. **Our comparison will not reflect the raw performances of
|
||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
||||
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
||||
using these engines through Rust bindings that do not support async Rust,
|
||||
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
|
||||
these database engines, but instead, our integration choices.**
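To make the SQLite setting mentioned above concrete, the following sketch uses Python's standard `sqlite3` module rather than the Rusqlite binding used by Garage. It only shows the pragma being applied and read back on a throwaway in-memory database; it says nothing about how Garage configures its own backend.

```python
import sqlite3

# Throwaway in-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")

# Ask SQLite to stop waiting for data to reach the disk after each write.
conn.execute("PRAGMA synchronous = OFF")

# Reading the pragma back returns the numeric level (0 means OFF).
level = conn.execute("PRAGMA synchronous").fetchone()[0]
print("synchronous level:", level)

conn.close()
```

The trade-off is the one discussed in this section: with the setting off, writes return faster but can be lost if the machine loses power.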
|
||||
|
||||
Still, we think it makes sense to evaluate our implementations in their current
|
||||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||
of the software, ie. handling tiny files. We chose again minio/warp but we
|
||||
configure it with the smallest possible object size supported by warp, 256
|
||||
bytes, to put some pressure on the metadata engine. We evaluate sled twice:
|
||||
of the software, i.e. handling large numbers of tiny files. We chose again
|
||||
`minio/warp` as a benchmark tool but we
|
||||
configured it with the smallest possible object size it supported, 256
|
||||
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
||||
with its default configuration, and with a configuration where we set a flush
|
||||
interval of 10 minutes to disable fsync.
|
||||
interval of 10 minutes to disable `fsync`.
|
||||
|
||||
*Note that S3 has not been designed for such small objects; a regular database,
|
||||
like Cassandra, would be more appropriate for such workloads. This test has
|
||||
only been designed to stress our metadata engine, it is not indicative of
|
||||
*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
|
||||
a regular database, like Cassandra, would be more appropriate. This test has
|
||||
only been designed to stress our metadata engine, and is not indicative of
|
||||
real-world performances.*
|
||||
|
||||
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
||||
|
||||
Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
|
||||
the less tested and kept fsync for each write. lmdb performs twice better than
|
||||
sled in its default version and 60% better than the "no fsync" version in our
|
||||
Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
|
||||
the least tested and that still does an `fsync` for each write. Garage with LMDB performs twice as well as
|
||||
with sled in its default version and 60% better than the "no `fsync`" sled version in our
|
||||
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
||||
disk storage and RAM; we would like to quantify that in the future. As we are
|
||||
only at the very beginning of our work on metadata engines, it is hard to draw
|
||||
strong conclusions. Still, we can say that sqlite is not ready for production
|
||||
workloads, LMDB looks very promising both in terms of performances and resource
|
||||
usage, it is a very good candidate for Garage's default metadata engine in the
|
||||
future, and we need to define a data policy for Garage that would help us
|
||||
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||
usage, and is a very good candidate for being Garage's default metadata engine in the
|
||||
future. Going forward, we will need to define a data policy for Garage to help us
|
||||
arbitrate between performances and durability.
|
||||
|
||||
*To fsync or not to fsync? Performance is nothing without reliability, so we
|
||||
need to better assess the impact of validating a write and then losing it.
|
||||
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||
need to better assess the impact of validating a write and then possibly losing it.
|
||||
Because Garage is a distributed system, even if a node loses its write due to a
|
||||
power loss, it will fetch it back from the 2 other nodes storing it. But rare
|
||||
situations where 1 node is down and the 2 others validated the write and then
|
||||
lost power can occur, what is our policy in this case? For storage durability,
|
||||
situations can occur, where 1 node is down and the 2 others validated the write and then
|
||||
lost power. What is our policy in this case? For storage durability,
|
||||
we are already supposing that we never lose the storage of more than 2 nodes,
|
||||
should we also expect that we don't lose power on more than 2 nodes at the same
|
||||
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
|
||||
time? What should we think about people hosting all their nodes at the same
|
||||
place without a UPS? Historically, it seems that Minio developers also accepted
|
||||
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
|
||||
some compromises on this side
|
||||
([#3536](https://github.com/minio/minio/issues/3536),
|
||||
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
|
||||
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
|
||||
only data and not metadata are persisted on disk - in combination with
|
||||
only data and not metadata is persisted on disk - in combination with
|
||||
`O_DIRECT` for direct I/O
|
||||
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
|
||||
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
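The difference between acknowledging a write with and without `fsync` can be sketched in a few lines of Python. The helper names `write_durably` and `write_fast` are hypothetical, and this is not how Garage or Minio implement their write paths; it only illustrates the system call being discussed.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the OS to persist it before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move data from Python's buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to flush it to the storage device


def write_fast(path: str, data: bytes) -> None:
    """Write data without fsync: it may still be in the page cache on return."""
    with open(path, "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    durable_path = os.path.join(tmp, "durable.bin")
    fast_path = os.path.join(tmp, "fast.bin")
    write_durably(durable_path, b"metadata entry")
    write_fast(fast_path, b"metadata entry")
    with open(durable_path, "rb") as f:
        durable_content = f.read()
    with open(fast_path, "rb") as f:
        fast_content = f.read()
```

Both files read back identically while the machine stays up; the two functions only differ in what survives a power loss right after they return.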
|
||||
|
@ -301,10 +302,10 @@ number of times (128 by default) to effectively create a certain number of
|
|||
objects on the target cluster (1M by default). On our local setup with 3
|
||||
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
||||
following plot, we show how much time it took Garage and Minio to handle
|
||||
each batch.
|
||||
|
||||
Before looking at the plot, **you must keep in mind some important points about
|
||||
Minio and Garage internals**.
|
||||
|
||||
Minio has no metadata engine, it stores its objects directly on the filesystem.
|
||||
Sending 1 million objects on Minio results in creating one million inodes on
|
||||
|
@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
|
|||
will probably substantially impact the results you will observe; we know the
|
||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||
fsync for our metadata engine, minio has some fsync logic here slowing down the
|
||||
`fsync` for our metadata engine, whereas Minio has some `fsync` logic here that slows down the
|
||||
creation of objects. Finally, object storage is designed for big objects: this
|
||||
cost is negligible with bigger objects. In the end, again, we use Minio as a
|
||||
reference to understand what our performance budget is for each part of our
|
||||
|
@ -330,7 +331,7 @@ metadata engine and thus focus only on 16-byte objects.
|
|||
It appears that the performances of our metadata engine are acceptable, as we
|
||||
have a comfortable margin compared to Minio (Minio is between 3x and 4x
|
||||
slower per batch). We also note that, past 200k objects, Minio batch
|
||||
completion time is constant, whereas Garage's is still increasing in the observed range:
|
||||
it could be interesting to know if Garage's batch completion time would cross Minio's
|
||||
for a very large number of objects. If we reason per object, both Minio and
|
||||
Garage performances remain very good: it takes respectively around 20ms and
|
||||
|
@ -396,7 +397,7 @@ For example, on Garage, a GetObject request does two sequential calls: first,
|
|||
it asks for the descriptor of the requested object containing the block list of
|
||||
the requested object, then it retrieves its blocks. We can expect that the
|
||||
request duration of a small GetObject request will be close to twice the
|
||||
network latency.
|
||||
|
||||
We tested this theory with another benchmark of our own named
|
||||
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
||||
|
@ -417,7 +418,7 @@ RemoveObject). It is understandable: Minio has not been designed for
|
|||
environments with high latencies, you are expected to build your clusters in
|
||||
the same datacenter, and then possibly connect them with their asynchronous
|
||||
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||
feature.
|
||||
|
||||
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
|
||||
but it is even more sensitive to latency: "Multi-site replication has increased
|
||||
|
@ -454,7 +455,7 @@ that their load started to become non-negligible: it seems that we are not
|
|||
hitting a limit on the protocol side but we have simply exhausted the resource
|
||||
of our testing nodes. In the future, we would like to run this experiment
|
||||
again, but on way more physical nodes, to confirm our hypothesis. For now, we
|
||||
are confident that a Garage cluster with 100+ nodes should work.
|
||||
|
||||
|
||||
## Conclusion and Future work
|
||||
|
@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
|
|||
During this work, we identified some sensitive points on Garage we will
|
||||
continue working on: our data durability target and interaction with the
|
||||
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
|
||||
our components, our new metadata engines (lmdb, sqlite) still need some testing
|
||||
our components, our new metadata engines (LMDB, SQLite) still need some testing
|
||||
and tuning, and we know that raw I/O (GetObject, PutObject) have a small
|
||||
improvement margin.
|
||||
|
||||
|
@ -489,11 +490,11 @@ soon introduce officially a new API (as a technical preview) named K2V
|
|||
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
|
||||
|
||||
|
||||
## Notes
|
||||
|
||||
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)
|
||||
existence. This tool is far more complex than our set of scripts, but we know
|
||||
that it is also way more versatile.
|
||||
|
||||
[^ref2]: The program name contains the word "billion" and we only tested Garage
|
||||
up to 1 "million" objects; this is not a typo, we were just a little bit too
|
||||
|
|