diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index 535cd16..06bcd31 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -22,7 +22,7 @@ to reflect the high-level properties we are seeking.* The following results must be taken with a critical grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as -exhaustively as possible in this first section, but other limitations might exist. +exhaustively as possible in this first section, but other limitations might exist. Most of our tests were made on simulated networks, which by definition cannot represent all the diversity of real networks (dynamic drop, jitter, latency, all of which could be @@ -109,7 +109,7 @@ at a smaller granularity level than entire data blocks, which are 1MB chunks of Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of 1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET` request, the first block had to be fully retrieved by the gateway node from the -storage node before starting to send any data to the client. +storage node before starting to send any data to the client. With Garage v0.8, we integrated a block streaming logic that allows the gateway to send the beginning of a block without having to wait for the full block from @@ -125,7 +125,7 @@ thus adding at most 8ms of latency to a GetObject request (assuming no other data transfer is happening in parallel). However, on a very slow network, or a very congested link with many parallel requests handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds -to transfer our 1MB block, and streaming has the potential of heavily improving user experience. +to transfer our 1MB block, and streaming has the potential of heavily improving user experience. 
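As a quick sanity check of the figures above, the time needed to transfer one block is its size in bits divided by the link throughput. The sketch below is only illustrative: the function name is hypothetical, the 1MB block is taken as 10^6 bytes, and the 1Gbps link is an assumption chosen because it matches the 8ms figure quoted above.

```python
def transfer_time_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time to push `size_bytes` through a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Assumption: a Garage block of 1MB is counted as 10^6 bytes.
BLOCK_SIZE_BYTES = 1_000_000

# Assumed fast link (1Gbps), consistent with the "at most 8ms" figure.
fast_link = transfer_time_seconds(BLOCK_SIZE_BYTES, 1e9)

# Slow link from the text (5Mbps).
slow_link = transfer_time_seconds(BLOCK_SIZE_BYTES, 5e6)

print(f"1Gbps: {fast_link * 1000:.0f} ms, 5Mbps: {slow_link:.1f} s")
```

Without block streaming, this whole duration is spent before the client receives the first byte of the block, which is why the slow-link case is the one where streaming should matter most.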
We wanted to see if this theory holds in practice: we simulated a low latency but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and @@ -185,7 +185,7 @@ To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small-scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster -performance with a standardized and mixed workload. +performance with a standardized and mixed workload. ![Plot showing IO perf of Garage configs and Minio](io.png) @@ -194,7 +194,7 @@ Looking at Garage, we observe that each improvement we made has a visible impact on performances. We also note that we have a progress margin in terms of performances compared to Minio: additional benchmarks, tests, and monitoring could help better understand the remaining difference. - + ## A myriad of objects @@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them. For this analysis, we compare different metadata engines in Garage and see how well the best one scale to a million objects. -**Testing metadata engines** - With Garage, we chose to not store metadata -directly on the filesystem, like Minio for example, but in an on-disk fancy -B-Tree structure, in other words, in an embedded database engine. Until now, -the only available option was [sled](https://sled.rs/), but we started having -serious issues with it, and we were not alone +**Testing metadata engines** - With Garage, we chose not to store metadata +directly on the filesystem, like Minio for example, but in a specialized on-disk +B-Tree data structure; in other words, in an embedded database engine. 
Until now, +the only supported option was [sled](https://sled.rs/), but we started having +serious issues with it - and we were not alone ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage v0.8, we introduce an abstraction semantic over the features we expect from our database, allowing us to switch from one backend to another without touching -the rest of our codebase. We added two additional backends: lmdb -([heed](https://github.com/meilisearch/heed)) and sqlite -([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they +the rest of our codebase. We added two additional backends: LMDB +(through [heed](https://github.com/meilisearch/heed)) and SQLite +(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they are both experimental: contrarily to sled, we have never run them in production for a long time.** -Similarly to the impact of fsync on block writing, each database engine we use -has its own policy with fsync. Sled flushes its write every 2 seconds by +Similarly to the impact of `fsync` on block writing, each database engine we use +has its own policy with `fsync`. Sled flushes its write every 2 seconds by default, this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). -lmdb by default does an `fsync` on each write, on early tests it led to very -slow resynchronizations between nodes. We added 2 flags: +LMDB by default does an `fsync` on each write, which on early tests led to very +slow resynchronizations between nodes. We thus added 2 flags, [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) and -[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) -which deactivate fsync. 
On sqlite, it is also possible to deactivate fsync with -`pragma synchronous = off;`, but we did not start any optimization work on it: -our sqlite implementation fsync all the data on the disk. Additionally, we are -using these engines through a Rust binding that had to do some tradeoff on the -concurrency part. **Our comparison will not reflect the raw performances of +[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16), +to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with +`pragma synchronous = off`, but we have not started any optimization work on it yet: +our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are +using these engines through Rust bindings that do not support async Rust, +with which Garage is built. **Our comparison will therefore not reflect the raw performances of these database engines, but instead, our integration choices.** Still, we think it makes sense to evaluate our implementations in their current state in Garage. We designed a benchmark that is intensive on the metadata part -of the software, ie. handling tiny files. We chose again minio/warp but we -configure it with the smallest possible object size supported by warp, 256 -bytes, to put some pressure on the metadata engine. We evaluate sled twice: +of the software, i.e. handling large numbers of tiny files. We chose again +`minio/warp` as a benchmark tool but we +configured it with the smallest possible object size it supported, 256 +bytes, to put some pressure on the metadata engine. We evaluated sled twice: with its default configuration, and with a configuration where we set a flush -interval of 10 minutes to disable fsync. +interval of 10 minutes to disable `fsync`. -*Note that S3 has not been designed for such small objects; a regular database, -like Cassandra, would be more appropriate for such workloads. 
This test has -only been designed to stress our metadata engine, it is not indicative of +*Note that S3 has not been designed for such workloads that store huge numbers of small objects; +a regular database, like Cassandra, would be more appropriate. This test has +only been designed to stress our metadata engine, and is not indicative of real-world performances.* ![Plot of our metadata engines comparison with Warp](db_engine.png) -Unsurprisingly, we observe abysmal performances for sqlite, the engine we have -the less tested and kept fsync for each write. lmdb performs twice better than -sled in its default version and 60% better than the "no fsync" version in our +Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have +tested the least and that still does an `fsync` for each write. Garage with LMDB performs twice as well as +with sled in its default version and 60% better than the "no `fsync`" sled version in our benchmark. Furthermore, and not depicted on these plots, LMDB uses way less disk storage and RAM; we would like to quantify that in the future. As we are only at the very beginning of our work on metadata engines, it is hard to draw -strong conclusions. Still, we can say that sqlite is not ready for production -workloads, LMDB looks very promising both in terms of performances and resource -usage, it is a very good candidate for Garage's default metadata engine in the -future, and we need to define a data policy for Garage that would help us +strong conclusions. Still, we can say that SQLite is not ready for production +workloads, and that LMDB looks very promising both in terms of performances and resource +usage, and is a very good candidate for being Garage's default metadata engine in the +future. We will also need to define a data policy for Garage to help us arbitrate between performances and durability. -*To fsync or not to fsync? 
Performance is nothing without reliability, so we -need to better assess the impact of validating a write and then losing it. +*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we +need to better assess the impact of validating a write and then possibly losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare -situations where 1 node is down and the 2 others validated the write and then -lost power can occur, what is our policy in this case? For storage durability, +situations can occur, where 1 node is down and the 2 others validated the write and then +lost power. What is our policy in this case? For storage durability, we are already supposing that we never lose the storage of more than 2 nodes, -should we also expect that we don't lose power on more than 2 nodes at the same +so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same -place without a UPS? Historically, it seems that Minio developers also accepted +place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted some compromises on this side ([#3536](https://github.com/minio/minio/issues/3536), [HN Discussion](https://news.ycombinator.com/item?id=28135533)). 
Now, they seem to use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures -only data and not metadata are persisted on disk - in combination with +only data and not metadata is persisted on disk - in combination with `O_DIRECT` for direct I/O ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* @@ -301,10 +302,10 @@ number of times (128 by default) to effectively create a certain number of objects on the target cluster (1M by default). On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. In the following plot, we show how many times it took to Garage and Minio to handle -each batch. +each batch. Before looking at the plot, **you must keep in mind some important points about -Minio and Garage internals**. +Minio and Garage internals**. Minio has no metadata engine, it stores its objects directly on the filesystem. Sending 1 million objects on Minio results in creating one million inodes on @@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem will probably substantially impact the results you will observe; we know the filesystem we used is not adapted at all for Minio (encryption layer, fixed number of inodes, etc.). Additionally, we mentioned earlier that we deactivated -fsync for our metadata engine, minio has some fsync logic here slowing down the +`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the creation of objects. Finally, object storage is designed for big objects: this cost is negligible with bigger objects. In the end, again, we use Minio as a reference to understand what is our performance budget for each part of our @@ -330,7 +331,7 @@ metadata engine and thus focus only on 16-byte objects. 
It appears that the performances of our metadata engine are acceptable, as we have a comfortable margin compared to Minio (Minio is between 3x and 4x times slower per batch). We also note that, past 200k objects, Minio batch -completion time is constant as Garage's one is still increasing in the observed range: +completion time is constant as Garage's one is still increasing in the observed range: it could be interesting to know if Garage batch's completion time would cross Minio's one for a very large number of objects. If we reason per object, both Minio and Garage performances remain very good: it takes respectively around 20ms and @@ -396,7 +397,7 @@ For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks. We can expect that the request duration of a small GetObject request will be close to twice the -network latency. +network latency. We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) @@ -417,7 +418,7 @@ RemoveObject). It is understandable: Minio has not been designed for environments with high latencies, you are expected to build your clusters in the same datacenter, and then possibly connect them with their asynchronous [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) -feature. +feature. *Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased @@ -454,7 +455,7 @@ that their load started to become non-negligible: it seems that we are not hitting a limit on the protocol side but we have simply exhausted the resource of our testing nodes. 
In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis. For now, we -are confident that a Garage cluster with 100+ nodes should work. +are confident that a Garage cluster with 100+ nodes should work. ## Conclusion and Future work @@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work. During this work, we identified some sensitive points on Garage we will continue working on: our data durability target and interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across -our components, our new metadata engines (lmdb, sqlite) still need some testing +our components, our new metadata engines (LMDB, SQLite) still need some testing and tuning, and we know that raw I/O (GetObject, PutObject) have a small improvement margin. @@ -489,11 +490,11 @@ soon introduce officially a new API (as a technical preview) named K2V ([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). -## Notes +## Notes [^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know -that it is also way more versatile. +that it is also way more versatile. [^ref2]: The program name contains the word "billion" and we only tested Garage up to 1 "million" object, this is not a typo, we were just a little bit too