Fixes until "millions of objects"
parent bacebcfbf1
commit 7a354483d7
1 changed file with 55 additions and 54 deletions
@@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
 For this analysis, we compare different metadata engines in Garage and see how
 well the best one scale to a million objects.
 
-**Testing metadata engines** - With Garage, we chose to not store metadata
-directly on the filesystem, like Minio for example, but in an on-disk fancy
-B-Tree structure, in other words, in an embedded database engine. Until now,
-the only available option was [sled](https://sled.rs/), but we started having
-serious issues with it, and we were not alone
+**Testing metadata engines** - With Garage, we chose not to store metadata
+directly on the filesystem, like Minio for example, but in a specialized on-disk
+B-Tree data structure; in other words, in an embedded database engine. Until now,
+the only supported option was [sled](https://sled.rs/), but we started having
+serious issues with it - and we were not alone
 ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
 v0.8, we introduce an abstraction semantic over the features we expect from our
 database, allowing us to switch from one backend to another without touching
-the rest of our codebase. We added two additional backends: lmdb
-([heed](https://github.com/meilisearch/heed)) and sqlite
-([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
+the rest of our codebase. We added two additional backends: LMDB
+(through [heed](https://github.com/meilisearch/heed)) and SQLite
+(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
 are both experimental: contrarily to sled, we have never run them in production
 for a long time.**
 
-Similarly to the impact of fsync on block writing, each database engine we use
-has its own policy with fsync. Sled flushes its write every 2 seconds by
+Similarly to the impact of `fsync` on block writing, each database engine we use
+has its own policy with `fsync`. Sled flushes its write every 2 seconds by
 default, this is
 [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
-lmdb by default does an `fsync` on each write, on early tests it led to very
-slow resynchronizations between nodes. We added 2 flags:
+LMDB by default does an `fsync` on each write, which on early tests led to very
+slow resynchronizations between nodes. We thus added 2 flags,
 [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
 and
-[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
-which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
-`pragma synchronous = off;`, but we did not start any optimization work on it:
-our sqlite implementation fsync all the data on the disk. Additionally, we are
-using these engines through a Rust binding that had to do some tradeoff on the
-concurrency part. **Our comparison will not reflect the raw performances of
+[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
+to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
+`pragma synchronous = off`, but we have not started any optimization work on it yet:
+our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
+using these engines through Rust bindings that do not support async Rust,
+with which Garage is built. **Our comparison will therefore not reflect the raw performances of
 these database engines, but instead, our integration choices.**
 
 Still, we think it makes sense to evaluate our implementations in their current
 state in Garage. We designed a benchmark that is intensive on the metadata part
-of the software, ie. handling tiny files. We chose again minio/warp but we
-configure it with the smallest possible object size supported by warp, 256
-bytes, to put some pressure on the metadata engine. We evaluate sled twice:
+of the software, i.e. handling large numbers of tiny files. We chose again
+`minio/warp` as a benchmark tool but we
+configured it with the smallest possible object size it supported, 256
+bytes, to put some pressure on the metadata engine. We evaluated sled twice:
 with its default configuration, and with a configuration where we set a flush
-interval of 10 minutes to disable fsync.
+interval of 10 minutes to disable `fsync`.
 
-*Note that S3 has not been designed for such small objects; a regular database,
-like Cassandra, would be more appropriate for such workloads. This test has
-only been designed to stress our metadata engine, it is not indicative of
+*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
+a regular database, like Cassandra, would be more appropriate. This test has
+only been designed to stress our metadata engine, and is not indicative of
 real-world performances.*
 
 ![Plot of our metadata engines comparison with Warp](db_engine.png)
 
-Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
-the less tested and kept fsync for each write. lmdb performs twice better than
-sled in its default version and 60% better than the "no fsync" version in our
+Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
+the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
+with sled in its default version and 60% better than the "no `fsync`" sled version in our
 benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
 disk storage and RAM; we would like to quantify that in the future. As we are
 only at the very beginning of our work on metadata engines, it is hard to draw
-strong conclusions. Still, we can say that sqlite is not ready for production
-workloads, LMDB looks very promising both in terms of performances and resource
-usage, it is a very good candidate for Garage's default metadata engine in the
-future, and we need to define a data policy for Garage that would help us
+strong conclusions. Still, we can say that SQLite is not ready for production
+workloads, and that LMDB looks very promising both in terms of performances and resource
+usage, and is a very good candidate for being Garage's default metadata engine in the
+future. In the future, we will need to define a data policy for Garage to help us
 arbitrate between performances and durability.
 
-*To fsync or not to fsync? Performance is nothing without reliability, so we
-need to better assess the impact of validating a write and then losing it.
+*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
+need to better assess the impact of validating a write and then possibly losing it.
 Because Garage is a distributed system, even if a node loses its write due to a
 power loss, it will fetch it back from the 2 other nodes storing it. But rare
-situations where 1 node is down and the 2 others validated the write and then
-lost power can occur, what is our policy in this case? For storage durability,
+situations can occur, where 1 node is down and the 2 others validated the write and then
+lost power. What is our policy in this case? For storage durability,
 we are already supposing that we never lose the storage of more than 2 nodes,
-should we also expect that we don't lose power on more than 2 nodes at the same
+so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
 time? What should we think about people hosting all their nodes at the same
-place without a UPS? Historically, it seems that Minio developers also accepted
+place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
 some compromises on this side
 ([#3536](https://github.com/minio/minio/issues/3536),
 [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
 use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
-only data and not metadata are persisted on disk - in combination with
+only data and not metadata is persisted on disk - in combination with
 `O_DIRECT` for direct I/O
 ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
 [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
@@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
 will probably substantially impact the results you will observe; we know the
 filesystem we used is not adapted at all for Minio (encryption layer, fixed
 number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
-fsync for our metadata engine, minio has some fsync logic here slowing down the
+`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
 creation of objects. Finally, object storage is designed for big objects: this
 cost is negligible with bigger objects. In the end, again, we use Minio as a
 reference to understand what is our performance budget for each part of our
@@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
 During this work, we identified some sensitive points on Garage we will
 continue working on: our data durability target and interaction with the
 filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
-our components, our new metadata engines (lmdb, sqlite) still need some testing
+our components, our new metadata engines (LMDB, SQLite) still need some testing
 and tuning, and we know that raw I/O (GetObject, PutObject) have a small
 improvement margin.
 
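
The first hunk states that SQLite's `fsync` can be deactivated with `pragma synchronous = off`. As a minimal sketch of what that pragma does, the following uses Python's standard `sqlite3` module. It is not Garage's code (Garage goes through Rusqlite), and the file name `meta.sqlite`, the `objects` table, and both helper names are hypothetical, chosen only for illustration.

```python
import os
import sqlite3
import tempfile


def open_without_fsync(path):
    """Open a SQLite database and turn off fsync via PRAGMA synchronous."""
    conn = sqlite3.connect(path)
    # OFF (0) hands writes to the OS without waiting for them to reach disk.
    # The pragma must be issued outside of a transaction, so run it first.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


def synchronous_level(conn):
    """Return the current setting: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file and table, standing in for a metadata store.
    conn = open_without_fsync(os.path.join(tmp, "meta.sqlite"))
    conn.execute("CREATE TABLE objects (key TEXT PRIMARY KEY, size INTEGER)")
    conn.execute("INSERT INTO objects VALUES (?, ?)", ("tiny-object", 256))
    conn.commit()
    level = synchronous_level(conn)
    count = conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
    conn.close()
```

The setting applies per connection, so it has to be issued each time the database is opened. With it set to OFF, a committed write can still be lost on power failure, which is the durability trade-off the post discusses.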