New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 55 additions and 54 deletions
|
@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
|
||||||
For this analysis, we compare different metadata engines in Garage and see how
|
For this analysis, we compare different metadata engines in Garage and see how
|
||||||
well the best one scales to a million objects.
|
well the best one scales to a million objects.
|
||||||
|
|
||||||
**Testing metadata engines** - With Garage, we chose to not store metadata
|
**Testing metadata engines** - With Garage, we chose not to store metadata
|
||||||
directly on the filesystem, like Minio for example, but in an on-disk fancy
|
directly on the filesystem, like Minio for example, but in a specialized on-disk
|
||||||
B-Tree structure, in other words, in an embedded database engine. Until now,
|
B-Tree data structure; in other words, in an embedded database engine. Until now,
|
||||||
the only available option was [sled](https://sled.rs/), but we started having
|
the only supported option was [sled](https://sled.rs/), but we started having
|
||||||
serious issues with it, and we were not alone
|
serious issues with it - and we were not alone
|
||||||
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
||||||
v0.8, we introduce an abstraction semantic over the features we expect from our
|
v0.8, we introduce an abstraction semantic over the features we expect from our
|
||||||
database, allowing us to switch from one backend to another without touching
|
database, allowing us to switch from one backend to another without touching
|
||||||
the rest of our codebase. We added two additional backends: lmdb
|
the rest of our codebase. We added two additional backends: LMDB
|
||||||
([heed](https://github.com/meilisearch/heed)) and sqlite
|
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
||||||
([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||||
are both experimental: unlike sled, we have never run them in production
|
are both experimental: unlike sled, we have never run them in production
|
||||||
for a long time.**
|
for a long time.**
|
||||||
|
|
||||||
Similarly to the impact of fsync on block writing, each database engine we use
|
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||||
has its own policy with fsync. Sled flushes its write every 2 seconds by
|
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
|
||||||
default (this is
|
default (this is
|
||||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||||
lmdb by default does an `fsync` on each write, on early tests it led to very
|
LMDB by default does an `fsync` on each write, which in early tests led to very
|
||||||
slow resynchronizations between nodes. We added 2 flags:
|
slow resynchronizations between nodes. We thus added 2 flags,
|
||||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||||
and
|
and
|
||||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
|
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||||
which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
|
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
||||||
`pragma synchronous = off;`, but we did not start any optimization work on it:
|
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||||
our sqlite implementation fsync all the data on the disk. Additionally, we are
|
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
||||||
using these engines through a Rust binding that had to do some tradeoff on the
|
using these engines through Rust bindings that do not support async Rust,
|
||||||
concurrency part. **Our comparison will not reflect the raw performances of
|
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
|
||||||
these database engines, but instead, our integration choices.**
|
these database engines, but instead, our integration choices.**
|
||||||
|
|
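As a minimal sketch of the SQLite setting mentioned above, the snippet below
toggles `pragma synchronous` from Python's standard `sqlite3` module and reads
the value back. This is not Garage's code (Garage goes through Rusqlite); the
helper names `open_db` and `synchronous_level` are ours, chosen for
illustration only.

```python
import os
import sqlite3
import tempfile


def open_db(path, durable):
    """Open a SQLite database, optionally turning off fsync-style syncing."""
    conn = sqlite3.connect(path)
    # Set the pragma first, before any write opens a transaction.
    if durable:
        conn.execute("PRAGMA synchronous = FULL")
    else:
        conn.execute("PRAGMA synchronous = OFF")
    return conn


def synchronous_level(conn):
    """Return SQLite's numeric setting: 0 = OFF, 1 = NORMAL, 2 = FULL."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    fast = open_db(os.path.join(tmp, "fast.db"), durable=False)
    safe = open_db(os.path.join(tmp, "safe.db"), durable=True)
    fast_level = synchronous_level(fast)
    safe_level = synchronous_level(safe)
    fast.close()
    safe.close()
```

With `synchronous = OFF`, SQLite hands writes to the operating system and
returns without waiting for them to reach the disk, which is the trade-off
discussed in this section.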
||||||
Still, we think it makes sense to evaluate our implementations in their current
|
Still, we think it makes sense to evaluate our implementations in their current
|
||||||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||||
of the software, ie. handling tiny files. We chose again minio/warp but we
|
of the software, i.e. handling large numbers of tiny files. We again chose
|
||||||
configure it with the smallest possible object size supported by warp, 256
|
`minio/warp` as a benchmark tool but we
|
||||||
bytes, to put some pressure on the metadata engine. We evaluate sled twice:
|
configured it with the smallest possible object size it supported, 256
|
||||||
|
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
||||||
with its default configuration, and with a configuration where we set a flush
|
with its default configuration, and with a configuration where we set a flush
|
||||||
interval of 10 minutes to disable fsync.
|
interval of 10 minutes to disable `fsync`.
|
||||||
|
|
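To give an intuition of why the flush policy matters for such tiny objects,
here is a rough sketch that writes 256-byte records and counts how many
`fsync` calls each policy issues. It is an illustration only: sled's flush is
time-based, whereas this sketch batches by number of writes, and
`write_records` is a hypothetical helper, not something taken from Garage or
sled.

```python
import os
import tempfile


def write_records(path, records, sync_every):
    """Write records to path, calling fsync after every `sync_every` writes.

    Returns the number of fsync calls issued.
    """
    syncs = 0
    with open(path, "wb") as f:
        for i, record in enumerate(records, start=1):
            f.write(record)
            if i % sync_every == 0:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to persist it on disk
                syncs += 1
        # Final sync so that both policies end in a durable state.
        f.flush()
        os.fsync(f.fileno())
        syncs += 1
    return syncs


# 1000 records of 256 bytes, the object size used in the benchmark.
records = [bytes([i % 256]) * 256 for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    eager_path = os.path.join(tmp, "eager.log")
    batched_path = os.path.join(tmp, "batched.log")
    eager = write_records(eager_path, records, sync_every=1)
    batched = write_records(batched_path, records, sync_every=100)
    eager_size = os.path.getsize(eager_path)
    batched_size = os.path.getsize(batched_path)
```

Both files end up with the same content, but the first policy pays for an
`fsync` on every record while the second one amortizes it over a batch, at
the cost of possibly losing the last unsynced records on a power loss.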
||||||
*Note that S3 has not been designed for such small objects; a regular database,
|
*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
|
||||||
like Cassandra, would be more appropriate for such workloads. This test has
|
a regular database, like Cassandra, would be more appropriate. This test has
|
||||||
only been designed to stress our metadata engine, it is not indicative of
|
only been designed to stress our metadata engine, and is not indicative of
|
||||||
real-world performances.*
|
real-world performances.*
|
||||||
|
|
||||||
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
![Plot of our metadata engines comparison with Warp](db_engine.png)
|
||||||
|
|
||||||
Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
|
Unsurprisingly, we observe abysmal performances with SQLite, the engine we have
|
||||||
the less tested and kept fsync for each write. lmdb performs twice better than
|
tested the least and that still does an `fsync` for each write. Garage with LMDB performs twice as well as
|
||||||
sled in its default version and 60% better than the "no fsync" version in our
|
with sled in its default version and 60% better than the "no `fsync`" sled version in our
|
||||||
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
|
||||||
disk storage and RAM; we would like to quantify that in the future. As we are
|
disk storage and RAM; we would like to quantify that in the future. As we are
|
||||||
only at the very beginning of our work on metadata engines, it is hard to draw
|
only at the very beginning of our work on metadata engines, it is hard to draw
|
||||||
strong conclusions. Still, we can say that sqlite is not ready for production
|
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||||
workloads, LMDB looks very promising both in terms of performances and resource
|
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||||
usage, it is a very good candidate for Garage's default metadata engine in the
|
usage, and is a very good candidate for being Garage's default metadata engine in the
|
||||||
future, and we need to define a data policy for Garage that would help us
|
future. We will also need to define a data policy for Garage to help us
|
||||||
arbitrate between performances and durability.
|
arbitrate between performances and durability.
|
||||||
|
|
||||||
*To fsync or not to fsync? Performance is nothing without reliability, so we
|
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||||
need to better assess the impact of validating a write and then losing it.
|
need to better assess the impact of validating a write and then possibly losing it.
|
||||||
Because Garage is a distributed system, even if a node loses its write due to a
|
Because Garage is a distributed system, even if a node loses its write due to a
|
||||||
power loss, it will fetch it back from the 2 other nodes storing it. But rare
|
power loss, it will fetch it back from the 2 other nodes storing it. But rare
|
||||||
situations where 1 node is down and the 2 others validated the write and then
|
situations can occur, where 1 node is down and the 2 others validated the write and then
|
||||||
lost power can occur, what is our policy in this case? For storage durability,
|
lost power. What is our policy in this case? For storage durability,
|
||||||
we are already supposing that we never lose the storage of more than 2 nodes,
|
we are already supposing that we never lose the storage of more than 2 nodes,
|
||||||
should we also expect that we don't lose power on more than 2 nodes at the same
|
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
|
||||||
time? What should we think about people hosting all their nodes at the same
|
time? What should we think about people hosting all their nodes at the same
|
||||||
place without a UPS? Historically, it seems that Minio developers also accepted
|
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
|
||||||
some compromises on this side
|
some compromises on this side
|
||||||
([#3536](https://github.com/minio/minio/issues/3536),
|
([#3536](https://github.com/minio/minio/issues/3536),
|
||||||
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
|
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
|
||||||
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
|
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
|
||||||
only data and not metadata are persisted on disk - in combination with
|
only data and not metadata is persisted on disk - in combination with
|
||||||
`O_DIRECT` for direct I/O
|
`O_DIRECT` for direct I/O
|
||||||
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
|
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
|
||||||
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
|
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
|
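The two primitives named in the note above can be sketched as follows. This
is our own illustration, not Minio's or Garage's code: the function names are
hypothetical, and `O_DIRECT` is left out on purpose because it usually
requires aligned buffers, which would obscure the point.

```python
import os
import tempfile

# Neither primitive exists on every platform; fall back so the sketch still runs.
O_DSYNC = getattr(os, "O_DSYNC", 0)
fdatasync = getattr(os, "fdatasync", os.fsync)


def write_with_o_dsync(path, payload):
    """With O_DSYNC, each write() returns once the data is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | O_DSYNC, 0o644)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)


def write_then_fdatasync(path, payload):
    """Buffered write followed by one explicit data-only sync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
        fdatasync(fd)
    finally:
        os.close(fd)


payload = b"x" * 256

with tempfile.TemporaryDirectory() as tmp:
    a_path = os.path.join(tmp, "a.bin")
    b_path = os.path.join(tmp, "b.bin")
    write_with_o_dsync(a_path, payload)
    write_then_fdatasync(b_path, payload)
    with open(a_path, "rb") as f:
        a_content = f.read()
    with open(b_path, "rb") as f:
        b_content = f.read()
```

The first variant makes every write synchronous, while the second lets the
application decide when to pay for the sync, which is closer to the policy
question raised for Garage's metadata engines.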
||||||
|
@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
|
||||||
will probably substantially impact the results you will observe; we know the
|
will probably substantially impact the results you will observe; we know the
|
||||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||||
fsync for our metadata engine, minio has some fsync logic here slowing down the
|
`fsync` for our metadata engine, whereas Minio has some `fsync` logic here slowing down the
|
||||||
creation of objects. Finally, object storage is designed for big objects: this
|
creation of objects. Finally, object storage is designed for big objects: this
|
||||||
cost is negligible with bigger objects. In the end, again, we use Minio as a
|
cost is negligible with bigger objects. In the end, again, we use Minio as a
|
||||||
reference to understand what our performance budget is for each part of our
|
reference to understand what our performance budget is for each part of our
|
||||||
|
@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
|
||||||
During this work, we identified some sensitive points on Garage we will
|
During this work, we identified some sensitive points on Garage we will
|
||||||
continue working on: our data durability target and interaction with the
|
continue working on: our data durability target and interaction with the
|
||||||
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous across
|
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous across
|
||||||
our components, our new metadata engines (lmdb, sqlite) still need some testing
|
our components, our new metadata engines (LMDB, SQLite) still need some testing
|
||||||
and tuning, and we know that raw I/O (GetObject, PutObject) has a small
|
and tuning, and we know that raw I/O (GetObject, PutObject) has a small
|
||||||
improvement margin.
|
improvement margin.
|
||||||
|
|
||||||
|
|