Fixes until "millions of objects"

Alex 2022-09-28 16:12:45 +02:00
parent bacebcfbf1
commit 7a354483d7

@@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scales to a million objects.

**Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in a specialized
on-disk B-Tree data structure; in other words, in an embedded database engine.
Until now, the only supported option was [sled](https://sled.rs/), but we
started having serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction layer over the features we expect from our
database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that
they are both experimental: unlike sled, we have never run them in production
for a long time.**
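
To give an idea of what such an abstraction looks like, here is a deliberately
simplified, hypothetical sketch of a backend-agnostic key-value interface;
Garage's actual interface is richer (transactions, iteration over ranges, etc.)
and uses different names.

```rust
// Hypothetical, simplified sketch of a backend-agnostic key-value interface.
// Garage's actual abstraction is richer and uses different names.
pub type Error = Box<dyn std::error::Error + Send + Sync>;

pub trait MetadataEngine: Send + Sync {
    /// Read the value stored under `key` in the named tree, if any.
    fn get(&self, tree: &str, key: &[u8]) -> Result<Option<Vec<u8>>, Error>;
    /// Insert or overwrite an entry.
    fn insert(&self, tree: &str, key: &[u8], value: &[u8]) -> Result<(), Error>;
    /// Remove an entry.
    fn remove(&self, tree: &str, key: &[u8]) -> Result<(), Error>;
}

// Each backend (sled, LMDB via heed, SQLite via rusqlite) provides its own
// implementation of this trait, and the rest of the codebase only ever talks
// to a `dyn MetadataEngine`.
```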

Similarly to the impact of `fsync` on block writing, each database engine we
use has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
default (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
LMDB by default does an `fsync` on each write, which on early tests led to very
slow resynchronizations between nodes. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync`
with `pragma synchronous = off`, but we have not started any optimization work
on it yet: our SQLite implementation currently calls `fsync` for all write
operations. Additionally, we are using these engines through Rust bindings that
do not support async Rust, with which Garage is built. **Our comparison will
therefore not reflect the raw performances of these database engines, but
instead, our integration choices.**
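
For illustration, here is a rough sketch of how these `fsync` policies can be
set from Rust through the crates mentioned above (sled, heed, rusqlite). It is
not Garage's actual initialization code: the path, map size and flush interval
are placeholders, and the exact heed flag names depend on the heed version.

```rust
use std::path::Path;

/// Illustrative only: open the three engines with the fsync policies
/// discussed above. Values are placeholders, not Garage's actual settings.
fn open_engines(dir: &Path) -> Result<(), Box<dyn std::error::Error>> {
    // sled: flush on a timer instead of on every write (10 minutes here,
    // as in the "no fsync" configuration of our benchmark).
    let _sled = sled::Config::new()
        .path(dir.join("sled"))
        .flush_every_ms(Some(600_000))
        .open()?;

    // LMDB through heed: MDB_NOSYNC and MDB_NOMETASYNC skip the fsync
    // normally done when committing a transaction. Flag names are those of
    // heed 0.11; newer heed versions expose them differently.
    let lmdb_path = dir.join("lmdb");
    std::fs::create_dir_all(&lmdb_path)?;
    let mut options = heed::EnvOpenOptions::new();
    options.map_size(1 << 30); // placeholder map size (1 GiB)
    unsafe {
        options.flag(heed::flags::Flags::MdbNoSync);
        options.flag(heed::flags::Flags::MdbNoMetaSync);
    }
    let _env = options.open(&lmdb_path)?;

    // SQLite through rusqlite: `synchronous = OFF` would skip fsync entirely
    // (we have not enabled this in Garage yet, hence the slow results below).
    let sqlite = rusqlite::Connection::open(dir.join("db.sqlite"))?;
    sqlite.execute_batch("PRAGMA synchronous = OFF;")?;

    Ok(())
}
```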

Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, i.e. handling large numbers of tiny files. We again chose
`minio/warp` as our benchmark tool, but configured it with the smallest object
size it supports, 256 bytes, to put some pressure on the metadata engine. We
evaluated sled twice: with its default configuration, and with a configuration
where we set a flush interval of 10 minutes to disable `fsync`.

*Note that S3 has not been designed for workloads that store huge numbers of
small objects; a regular database, like Cassandra, would be more appropriate.
This test has only been designed to stress our metadata engine, and is not
indicative of real-world performances.*

![Plot of our metadata engines comparison with Warp](db_engine.png)

Unsurprisingly, we observe abysmal performances with SQLite, the engine we have
tested the least and which still does an `fsync` for each write. Garage with
LMDB performs twice as well as with sled in its default configuration, and 60%
better than the "no `fsync`" sled configuration, in our benchmark. Furthermore,
and not depicted on these plots, LMDB uses way less disk storage and RAM; we
would like to quantify that in the future. As we are only at the very beginning
of our work on metadata engines, it is hard to draw strong conclusions. Still,
we can say that SQLite is not ready for production workloads, and that LMDB
looks very promising both in terms of performances and resource usage, making
it a very good candidate for Garage's default metadata engine in the future. We
will also need to define a data policy for Garage to help us arbitrate between
performances and durability.

*To `fsync` or not to `fsync`? Performance is nothing without reliability, so
we need to better assess the impact of validating a write and then possibly
losing it. Because Garage is a distributed system, even if a node loses its
write due to a power loss, it will fetch it back from the 2 other nodes storing
it. But rare situations can occur where 1 node is down and the 2 others
validated the write and then lost power. What is our policy in this case? For
storage durability, we are already supposing that we never lose the storage of
more than 2 nodes, so should we also make the hypothesis that we won't lose
power on more than 2 nodes at the same time? What should we think about people
hosting all their nodes at the same place without an uninterruptible power
supply (UPS)? Historically, it seems that Minio developers also accepted some
compromises on this side ([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem
to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative of `fsync`
that ensures only data, and not metadata, is persisted on disk - in combination
with `O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
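
As a refresher on what these primitives trade against each other, here is a
minimal sketch using only the Rust standard library: on Linux,
`File::sync_data()` maps to `fdatasync(2)` and `File::sync_all()` to
`fsync(2)`. It is illustrative only, not code from Garage or Minio.

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn write_then_persist() -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("/tmp/example.data")?; // placeholder path

    file.write_all(b"some metadata entry")?;

    // fdatasync(2): only the file data (and the metadata required to read it
    // back) is flushed to stable storage.
    file.sync_data()?;

    // fsync(2): the data *and* all file metadata (timestamps, ...) are flushed.
    file.sync_all()?;

    // O_DIRECT (not used here) would additionally bypass the page cache, at
    // the cost of strict buffer alignment requirements, and O_DSYNC would make
    // every write() behave as if it were followed by fdatasync().
    Ok(())
}
```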

@@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine, while Minio has some `fsync` logic here
slowing down the creation of objects. Finally, object storage is designed for
big objects: this cost is negligible with bigger objects. In the end, again, we
use Minio as a reference to understand what our performance budget is for each
part of our

@@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
During this work, we identified some sensitive points in Garage that we will
continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) has a small
improvement margin.