Fixes until "millions of objects"

Alex 2022-09-28 16:12:45 +02:00
parent bacebcfbf1
commit 7a354483d7
Signed by: lx
GPG key ID: 0E496D15096376BE

@@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scales to a million objects.
-**Testing metadata engines** - With Garage, we chose to not store metadata
-directly on the filesystem, like Minio for example, but in an on-disk fancy
-B-Tree structure, in other words, in an embedded database engine. Until now,
-the only available option was [sled](https://sled.rs/), but we started having
-serious issues with it, and we were not alone
+**Testing metadata engines** - With Garage, we chose not to store metadata
+directly on the filesystem, like Minio for example, but in a specialized on-disk
+B-Tree data structure; in other words, in an embedded database engine. Until now,
+the only supported option was [sled](https://sled.rs/), but we started having
+serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction layer over the features we expect from our
database, allowing us to switch from one backend to another without touching
-the rest of our codebase. We added two additional backends: lmdb
-([heed](https://github.com/meilisearch/heed)) and sqlite
-([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
+the rest of our codebase. We added two additional backends: LMDB
+(through [heed](https://github.com/meilisearch/heed)) and SQLite
+(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrary to sled, we have never run them in production
for a long time.**
-Similarly to the impact of fsync on block writing, each database engine we use
-has its own policy with fsync. Sled flushes its write every 2 seconds by
+Similarly to the impact of `fsync` on block writing, each database engine we use
+has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
default (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
-lmdb by default does an `fsync` on each write, on early tests it led to very
-slow resynchronizations between nodes. We added 2 flags:
+LMDB by default does an `fsync` on each write, which in early tests led to very
+slow resynchronizations between nodes. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
-[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
-which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
-`pragma synchronous = off;`, but we did not start any optimization work on it:
-our sqlite implementation fsync all the data on the disk. Additionally, we are
-using these engines through a Rust binding that had to do some tradeoff on the
-concurrency part. **Our comparison will not reflect the raw performances of
+[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
+to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
+`pragma synchronous = off`, but we have not started any optimization work on it yet:
+our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
+using these engines through Rust bindings that do not support async Rust,
+with which Garage is built. **Our comparison will therefore not reflect the raw performances of
these database engines, but instead, our integration choices.**
Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
-of the software, ie. handling tiny files. We chose again minio/warp but we
-configure it with the smallest possible object size supported by warp, 256
-bytes, to put some pressure on the metadata engine. We evaluate sled twice:
+of the software, i.e. handling large numbers of tiny files. We chose again
+`minio/warp` as a benchmark tool but we
+configured it with the smallest possible object size it supported, 256
+bytes, to put some pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush
-interval of 10 minutes to disable fsync.
+interval of 10 minutes to disable `fsync`.
-*Note that S3 has not been designed for such small objects; a regular database,
-like Cassandra, would be more appropriate for such workloads. This test has
-only been designed to stress our metadata engine, it is not indicative of
+*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
+a regular database, like Cassandra, would be more appropriate. This test has
+only been designed to stress our metadata engine, and is not indicative of
real-world performances.*
![Plot of our metadata engines comparison with Warp](db_engine.png)
-Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
-the less tested and kept fsync for each write. lmdb performs twice better than
-sled in its default version and 60% better than the "no fsync" version in our
+Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
+tested the least and that still does an `fsync` for each write. Garage with LMDB performs twice as well as
+with sled in its default version and 60% better than with the "no `fsync`" sled version in our
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw
-strong conclusions. Still, we can say that sqlite is not ready for production
-workloads, LMDB looks very promising both in terms of performances and resource
-usage, it is a very good candidate for Garage's default metadata engine in the
-future, and we need to define a data policy for Garage that would help us
+strong conclusions. Still, we can say that SQLite is not ready for production
+workloads, and that LMDB looks very promising both in terms of performances and resource
+usage, and is a very good candidate for being Garage's default metadata engine in the
+future. We will also need to define a data policy for Garage to help us
arbitrate between performances and durability.
-*To fsync or not to fsync? Performance is nothing without reliability, so we
-need to better assess the impact of validating a write and then losing it.
+*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
+need to better assess the impact of validating a write and then possibly losing it.
Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare
-situations where 1 node is down and the 2 others validated the write and then
-lost power can occur, what is our policy in this case? For storage durability,
+situations can occur, where 1 node is down and the 2 others validated the write and then
+lost power. What is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes,
-should we also expect that we don't lose power on more than 2 nodes at the same
+so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same
-place without a UPS? Historically, it seems that Minio developers also accepted
+place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
-only data and not metadata are persisted on disk - in combination with
+only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
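
As an illustration of the SQLite setting mentioned in the hunk above, here is a minimal Python sketch of how `pragma synchronous` selects whether SQLite syncs to disk at each commit. This is not Garage code (Garage is written in Rust and uses Rusqlite); the helper names `open_metadata_db`, `sync_level` and `demo`, as well as the `objects` table, are hypothetical and only serve the example.

```python
import sqlite3
import tempfile
from pathlib import Path

# Names of the integer values reported by "PRAGMA synchronous".
SYNC_LEVELS = {0: "OFF", 1: "NORMAL", 2: "FULL", 3: "EXTRA"}


def open_metadata_db(path, fsync_enabled):
    """Open a SQLite file and choose its synchronous level (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # FULL syncs at each commit; OFF hands the data to the OS without syncing.
    # The pragma is set right after connecting, outside of any transaction.
    level = "FULL" if fsync_enabled else "OFF"
    conn.execute(f"PRAGMA synchronous = {level}")
    return conn


def sync_level(conn):
    """Return the current synchronous level as a readable name."""
    (value,) = conn.execute("PRAGMA synchronous").fetchone()
    return SYNC_LEVELS[value]


def demo():
    """Create one database per policy, insert a row, and report the level used."""
    levels = {}
    with tempfile.TemporaryDirectory() as tmp:
        for fsync_enabled in (True, False):
            db_path = Path(tmp) / f"meta-{fsync_enabled}.sqlite"
            conn = open_metadata_db(db_path, fsync_enabled)
            conn.execute("CREATE TABLE objects (key TEXT PRIMARY KEY, meta BLOB)")
            with conn:  # commits the transaction on success
                conn.execute(
                    "INSERT INTO objects VALUES (?, ?)", ("bucket/obj", b"\x00" * 16)
                )
            levels[fsync_enabled] = sync_level(conn)
            conn.close()
    return levels
```

Calling `demo()` returns the level in effect for each policy. The sketch deliberately reports no timings: how much each level costs depends on the disk and filesystem, which is what the benchmark discussed above measures for Garage's own integration.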
@@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
-fsync for our metadata engine, minio has some fsync logic here slowing down the
+`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a
reference to understand what is our performance budget for each part of our
@@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
During this work, we identified some sensitive points on Garage we will
continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
-our components, our new metadata engines (lmdb, sqlite) still need some testing
+our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) has a small
improvement margin.
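
To make the filesystem primitives named above more concrete, here is a minimal Python sketch of three durability policies for writing a single file: no sync, `fdatasync`, and `fsync`. It is an illustration under stated assumptions, not Garage's or Minio's implementation; the function `write_object` and its `durability` parameter are hypothetical names.

```python
import os

POLICIES = ("none", "fdatasync", "fsync")


def write_object(path, data, durability):
    """Write `data` to `path` under one of three durability policies.

    "none"      : rely on the OS page cache; data may be lost on power failure
    "fdatasync" : flush file data, skipping metadata not needed to read it back
    "fsync"     : flush file data and file metadata
    """
    if durability not in POLICIES:
        raise ValueError(f"unknown durability policy: {durability!r}")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        if durability == "fsync":
            os.fsync(fd)
        elif durability == "fdatasync":
            # os.fdatasync is not available on every platform; fall back to fsync.
            getattr(os, "fdatasync", os.fsync)(fd)
    finally:
        os.close(fd)
```

For example, `write_object("/tmp/block", b"x" * 256, "fdatasync")` writes a 256-byte object and waits for its data to reach the disk. Note that making a newly created file fully durable on POSIX systems also requires syncing its parent directory, which this sketch leaves out.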