Part about metadata engine

Quentin 2022-09-27 12:09:56 +02:00
parent da075684e8
commit 6d646201d1
Signed by: quentin
GPG Key ID: E9602264D639FF68
2 changed files with 19 additions and 2 deletions

Binary image file changed (not shown): 166 KiB before, 177 KiB after.

@ -45,6 +45,8 @@ As planned, Garage v0.7 that does not support block streaming features TTFB betw
**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performance to writing directly on the filesystem, and observed that Garage was slower (e.g. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we moved costly processing like hashing to a dedicated thread and did many compute optimizations ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to `v0.8 beta 1`. We also noted that the logic we had added to better control resource usage and detect errors (semaphores, timeouts) was artificially limiting performance: we made it less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive, so we made a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot.
*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when you create or update a file: your write is kept in memory and flushed later, in a batch with other writes. If a power loss occurs before the OS has had time to flush the writes to disk, they are lost. To ensure that a write has effectively reached the disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it blocks until your file or directory has been written from volatile RAM to your persistent storage device. Additionally, the exact semantics of fsync [differ from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even in battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that in Garage, we are still working on our fsync policy, so you should expect limited data durability for now, as we are aware of some inconsistencies on this point (which we describe below).*
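To make this concrete in code, here is a minimal Rust sketch showing where the fsync(2) call fits when writing a block. This is not Garage's actual block manager; the function name, path and data are made up for illustration.

```rust
use std::fs::File;
use std::io::Write;

// Illustration only (not Garage's code): write a data block to disk,
// optionally forcing it to be flushed with fsync.
fn write_block(path: &str, data: &[u8], do_fsync: bool) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    if do_fsync {
        // sync_all() maps to fsync(2): it blocks until the kernel has
        // pushed the data (and its metadata) from its caches to the device.
        file.sync_all()?;
    }
    // Without sync_all(), write_all() may return successfully while the
    // data still sits in the OS page cache, where a power loss can erase it.
    // (Strictly speaking, durably *creating* a file also requires an fsync
    // on its parent directory.)
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_block("/tmp/block.bin", b"some block data", true)
}
```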
To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small-scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of cluster performance with a standardized, mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)
@ -54,12 +56,27 @@ Minio, our ground truth, features the best performances in this test. Considerin
## A myriad of objects
**Testing metadata engines** - With Garage, we chose not to store metadata directly on the filesystem, as Minio does for example, but in an on-disk B-Tree structure, in other words, in an embedded database engine. Until now, the only available option was [sled](https://sled.rs/), but we started having serious issues with it, and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction layer over the features we expect from our database, allowing us to switch from one backend to another without touching the rest of our codebase. We added two additional backends: LMDB ([heed](https://github.com/meilisearch/heed)) and SQLite ([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they are both experimental: unlike sled, we have never run them in production for a long time.**
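To give an idea of what such an abstraction can look like, here is a deliberately simplified, hypothetical Rust sketch: the trait, method names and the toy in-memory backend below are not Garage's actual internal API, only an illustration of the general pattern of hiding the storage engine behind a trait.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Hypothetical abstraction: the small set of operations the rest of the
/// code base needs from the metadata store, whatever the backend is.
trait KvBackend: Send + Sync {
    fn get(&self, tree: &str, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&self, tree: &str, key: &[u8], value: &[u8]);
}

/// A toy in-memory backend, standing in for sled/LMDB/SQLite adapters.
#[derive(Default)]
struct MemBackend {
    trees: Mutex<BTreeMap<String, BTreeMap<Vec<u8>, Vec<u8>>>>,
}

impl KvBackend for MemBackend {
    fn get(&self, tree: &str, key: &[u8]) -> Option<Vec<u8>> {
        self.trees.lock().unwrap().get(tree)?.get(key).cloned()
    }
    fn put(&self, tree: &str, key: &[u8], value: &[u8]) {
        self.trees
            .lock()
            .unwrap()
            .entry(tree.to_string())
            .or_default()
            .insert(key.to_vec(), value.to_vec());
    }
}

fn main() {
    // The calling code only sees the trait object, so swapping the
    // backend does not require touching it.
    let db: Box<dyn KvBackend> = Box::new(MemBackend::default());
    db.put("objects", b"my-bucket/key", b"metadata");
    assert_eq!(db.get("objects", b"my-bucket/key"), Some(b"metadata".to_vec()));
}
```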
Similarly to the impact of fsync on block writes, each database engine we use has its own fsync policy. Sled flushes its writes every 2 seconds by default (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). LMDB does an `fsync` on each write by default, which in our early tests led to very slow resynchronizations between nodes; we thus added two flags, [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16), which basically deactivate fsync. On SQLite, it is also possible to deactivate fsync with `pragma synchronous = off;`, but we have not started any optimization work on it yet: our SQLite implementation still fsyncs all data to disk. Additionally, we use these engines through Rust bindings that had to make some trade-offs on concurrency. **Our comparison will not reflect the raw performance of these database engines, but rather our integration choices.**
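As a rough illustration of the knobs mentioned above, here is a hedged Rust sketch, assuming recent sled and rusqlite releases, with made-up paths and values and simplified error handling; it is not Garage's actual code, and LMDB's flags are only described in a comment since the exact binding API varies between heed versions.

```rust
// Sketch only: illustrates the fsync-related knobs discussed above.
// Assumes recent sled and rusqlite releases; not Garage's code.

fn open_sled(path: &str) -> sled::Result<sled::Db> {
    sled::Config::new()
        .path(path)
        // sled defaults to a periodic background flush (every 2 seconds);
        // raising the interval trades durability for throughput.
        .flush_every_ms(Some(600_000)) // e.g. 10 minutes, as in our test
        .open()
}

fn open_sqlite(path: &str) -> rusqlite::Result<rusqlite::Connection> {
    let conn = rusqlite::Connection::open(path)?;
    // Possible, but NOT done in our current implementation, which keeps
    // the default, fully fsync-ed behaviour:
    // conn.execute_batch("PRAGMA synchronous = OFF;")?;
    Ok(conn)
}

// For LMDB, the equivalent is to open the environment with the
// MDB_NOSYNC and MDB_NOMETASYNC flags through the heed binding,
// which disables fsync on data and metadata page writes.

fn main() {
    let _db = open_sled("/tmp/sled-db").expect("failed to open sled");
    let _conn = open_sqlite("/tmp/meta.sqlite").expect("failed to open sqlite");
}
```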
Still, we think it makes sense to evaluate our implementations in their current state in Garage. We designed a benchmark that is intensive on the metadata part of the software, i.e. handling tiny files. We again chose minio/warp, but this time configured it with the smallest object size supported by warp, 256 bytes, to put pressure on the metadata engine. We evaluate sled twice: with its default configuration, and with a configuration where we set a flush interval of 10 minutes to effectively disable fsync.
*Note that S3 has not been designed for such small objects; a regular database, like Cassandra, would be more appropriate for such workloads. This test has only been designed to stress our metadata engine; it is not indicative of real-world performance.*
![Plot of our metadata engines comparison with Warp](db_engine.png)
Unsurprisingly, we observe abysmal performance for SQLite, the engine we have tested the least and the only one that still does an fsync on each write.
In our benchmark, LMDB performs twice as well as sled in its default configuration, and 60% better than sled with fsync disabled.
Additionally, though not depicted on these plots, LMDB uses far less disk storage and RAM; we would like to quantify that in the future.
As we are only at the very beginning of our work on metadata engines, it is hard to draw strong conclusions.
Still, we can say that SQLite is not ready for production workloads,
that LMDB looks very promising both in terms of performance and resource usage
and is a very good candidate for Garage's default metadata engine in the future,
and that we need to define a durability policy for Garage that will help us arbitrate between performance and durability.
*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it can fetch it back from the 2 other nodes storing it. But rare situations can occur where 1 node is down and the 2 others validated the write and then lost power: what is our policy in this case? For storage durability, we already assume that we never lose the storage of more than 2 nodes; should we also assume that we never lose more than 2 nodes at the same time? And what about people hosting all their nodes in the same place without a UPS?*
**Storing millions of objects** -
![Plot about storing 1 million objects in Garage](1million-both.png)