Fixes until "millions of objects"

2022-09-28 16:12:45 +02:00
commit 7a354483d7
parent bacebcfbf1
1 changed files with 55 additions and 54 deletions
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
 For this analysis, we compare different metadata engines in Garage and see how
 well the best one scale to a million objects.

-**Testing metadata engines** - With Garage, we chose to not store metadata
-directly on the filesystem, like Minio for example, but in an on-disk fancy
-B-Tree structure, in other words, in an embedded database engine. Until now,
-the only available option was [sled](https://sled.rs/), but we started having
-serious issues with it, and we were not alone
+**Testing metadata engines** - With Garage, we chose not to store metadata
+directly on the filesystem, like Minio for example, but in a specialized on-disk
+B-Tree data structure; in other words, in an embedded database engine. Until now,
+the only supported option was [sled](https://sled.rs/), but we started having
+serious issues with it - and we were not alone
 ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
 v0.8, we introduce an abstraction semantic over the features we expect from our
 database, allowing us to switch from one backend to another without touching
-the rest of our codebase. We added two additional backends: lmdb
-([heed](https://github.com/meilisearch/heed)) and sqlite
-([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
+the rest of our codebase. We added two additional backends: LMDB
+(through [heed](https://github.com/meilisearch/heed)) and SQLite
+(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
 are both experimental: contrarily to sled, we have never run them in production
 for a long time.**

-Similarly to the impact of fsync on block writing, each database engine we use
-has its own policy with fsync. Sled flushes its write every 2 seconds by
+Similarly to the impact of `fsync` on block writing, each database engine we use
+has its own policy with `fsync`. Sled flushes its write every 2 seconds by
 default, this is
 [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
-lmdb by default does an `fsync` on each write, on early tests it led to very
-slow resynchronizations between nodes. We added 2 flags:
+LMDB by default does an `fsync` on each write, which on early tests led to very
+slow resynchronizations between nodes. We thus added 2 flags,
 [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
 and
-[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
-which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
-`pragma synchronous = off;`, but we did not start any optimization work on it:
-our sqlite implementation fsync all the data on the disk. Additionally, we are
-using these engines through a Rust binding that had to do some tradeoff on the
-concurrency part. **Our comparison will not reflect the raw performances of
+[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
+to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
+`pragma synchronous = off`, but we have not started any optimization work on it yet:
+our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
+using these engines through Rust bindings that do not support async Rust,
+with which Garage is built.  **Our comparison will therefore not reflect the raw performances of
 these database engines, but instead, our integration choices.**

 Still, we think it makes sense to evaluate our implementations in their current
 state in Garage. We designed a benchmark that is intensive on the metadata part
-of the software, ie. handling tiny files. We chose again minio/warp but we
-configure it with the smallest possible object size supported by warp, 256
-bytes, to put some pressure on the metadata engine. We evaluate sled twice:
+of the software, i.e. handling large numbers of tiny files. We chose again
+`minio/warp` as a benchmark tool but we
+configured it with the smallest possible object size it supported, 256
+bytes, to put some pressure on the metadata engine. We evaluated sled twice:
 with its default configuration, and with a configuration where we set a flush
-interval of 10 minutes to disable fsync. 
+interval of 10 minutes to disable `fsync`.

-*Note that S3 has not been designed for such small objects; a regular database,
-like Cassandra, would be more appropriate for such workloads. This test has
-only been designed to stress our metadata engine, it is not indicative of
+*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
+a regular database, like Cassandra, would be more appropriate. This test has
+only been designed to stress our metadata engine, and is not indicative of
 real-world performances.*

 ![Plot of our metadata engines comparison with Warp](db_engine.png)

-Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
-the less tested and kept fsync for each write.  lmdb performs twice better than
-sled in its default version and 60% better than the "no fsync" version in our
+Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
+the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
+with sled in its default version and 60% better than the "no `fsync`" sled version in our
 benchmark.  Furthermore, and not depicted on these plots, LMDB uses way less
 disk storage and RAM; we would like to quantify that in the future.  As we are
 only at the very beginning of our work on metadata engines, it is hard to draw
-strong conclusions.  Still, we can say that sqlite is not ready for production
-workloads, LMDB looks very promising both in terms of performances and resource
-usage, it is a very good candidate for Garage's default metadata engine in the
-future, and we need to define a data policy for Garage that would help us
+strong conclusions.  Still, we can say that SQLite is not ready for production
+workloads, and that LMDB looks very promising both in terms of performances and resource
+usage, and is a very good candidate for being Garage's default metadata engine in the
+future. In the future, we will need to define a data policy for Garage to help us
 arbitrate between performances and durability.

-*To fsync or not to fsync? Performance is nothing without reliability, so we
-need to better assess the impact of validating a write and then losing it.
+*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
+need to better assess the impact of validating a write and then possibly losing it.
 Because Garage is a distributed system, even if a node loses its write due to a
 power loss, it will fetch it back from the 2 other nodes storing it. But rare
-situations where 1 node is down and the 2 others validated the write and then
-lost power can occur, what is our policy in this case? For storage durability,
+situations can occur, where 1 node is down and the 2 others validated the write and then
+lost power. What is our policy in this case? For storage durability,
 we are already supposing that we never lose the storage of more than 2 nodes,
-should we also expect that we don't lose power on more than 2 nodes at the same
+so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
 time? What should we think about people hosting all their nodes at the same
-place without a UPS? Historically, it seems that Minio developers also accepted
+place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
 some compromises on this side
 ([#3536](https://github.com/minio/minio/issues/3536),
 [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
 use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
-only data and not metadata are persisted on disk - in combination with
+only data and not metadata is persisted on disk - in combination with
 `O_DIRECT` for direct I/O
 ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
 [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
@@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
 will probably substantially impact the results you will observe; we know the
 filesystem we used is not adapted at all for Minio (encryption layer, fixed
 number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
-fsync for our metadata engine, minio has some fsync logic here slowing down the
+`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
 creation of objects. Finally, object storage is designed for big objects: this
 cost is negligible with bigger objects. In the end, again, we use Minio as a
 reference to understand what is our performance budget for each part of our
@@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
 During this work, we identified some sensitive points on Garage we will
 continue working on: our data durability target and interaction with the
 filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
-our components, our new metadata engines (lmdb, sqlite) still need some testing
+our components, our new metadata engines (LMDB, SQLite) still need some testing
 and tuning, and we know that raw I/O (GetObject, PutObject) have a small
 improvement margin.