Fixes until "millions of objects"

This commit is contained in:
Alex 2022-09-28 16:12:45 +02:00
parent bacebcfbf1
commit 7a354483d7
Signed by: lx
GPG key ID: 0E496D15096376BE

View file

@ -22,7 +22,7 @@ to reflect the high-level properties we are seeking.*
The following results must be taken with a critical grain of salt due to some The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist. exhaustively as possible in this first section, but other limitations might exist.
Most of our tests were made on simulated networks, which by definition cannot represent all the Most of our tests were made on simulated networks, which by definition cannot represent all the
diversity of real networks (dynamic drop, jitter, latency, all of which could be diversity of real networks (dynamic drop, jitter, latency, all of which could be
@ -109,7 +109,7 @@ at a smaller granularity level than entire data blocks, which are 1MB chunks of
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET` 1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client. storage node before starting to send any data to the client.
With Garage v0.8, we integrated a block streaming logic that allows the gateway With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from to send the beginning of a block without having to wait for the full block from
@ -125,7 +125,7 @@ thus adding at most 8ms of latency to a GetObject request (assuming no other
data transfer is happening in parallel). However, data transfer is happening in parallel). However,
on a very slow network, or a very congested link with many parallel requests on a very slow network, or a very congested link with many parallel requests
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
to transfer our 1MB block, and streaming has the potential of heavily improving user experience. to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
We wanted to see if this theory holds in practice: we simulated a low latency We wanted to see if this theory holds in practice: we simulated a low latency
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
@ -185,7 +185,7 @@ To assess performance improvements, we used the benchmark tool
[minio/warp](https://github.com/minio/warp) in a non-standard configuration, [minio/warp](https://github.com/minio/warp) in a non-standard configuration,
adapted for small-scale tests, and we kept only the aggregated result named adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster "cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload. performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png) ![Plot showing IO perf of Garage configs and Minio](io.png)
@ -194,7 +194,7 @@ Looking at Garage, we observe that each improvement we made has a visible
impact on performances. We also note that we have a progress margin in impact on performances. We also note that we have a progress margin in
terms of performances compared to Minio: additional benchmarks, tests, and terms of performances compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference. monitoring could help better understand the remaining difference.
## A myriad of objects ## A myriad of objects
@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how For this analysis, we compare different metadata engines in Garage and see how
well the best one scale to a million objects. well the best one scale to a million objects.
**Testing metadata engines** - With Garage, we chose to not store metadata **Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in an on-disk fancy directly on the filesystem, like Minio for example, but in a specialized on-disk
B-Tree structure, in other words, in an embedded database engine. Until now, B-Tree data structure; in other words, in an embedded database engine. Until now,
the only available option was [sled](https://sled.rs/), but we started having the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it, and we were not alone serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage ([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction semantic over the features we expect from our v0.8, we introduce an abstraction semantic over the features we expect from our
database, allowing us to switch from one backend to another without touching database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: lmdb the rest of our codebase. We added two additional backends: LMDB
([heed](https://github.com/meilisearch/heed)) and sqlite (through [heed](https://github.com/meilisearch/heed)) and SQLite
([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they (using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrarily to sled, we have never run them in production are both experimental: contrarily to sled, we have never run them in production
for a long time.** for a long time.**
Similarly to the impact of fsync on block writing, each database engine we use Similarly to the impact of `fsync` on block writing, each database engine we use
has its own policy with fsync. Sled flushes its write every 2 seconds by has its own policy with `fsync`. Sled flushes its write every 2 seconds by
default, this is default, this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
lmdb by default does an `fsync` on each write, on early tests it led to very LMDB by default does an `fsync` on each write, which on early tests led to very
slow resynchronizations between nodes. We added 2 flags: slow resynchronizations between nodes. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16) [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
which deactivate fsync. On sqlite, it is also possible to deactivate fsync with to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
`pragma synchronous = off;`, but we did not start any optimization work on it: `pragma synchronous = off`, but we have not started any optimization work on it yet:
our sqlite implementation fsync all the data on the disk. Additionally, we are our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
using these engines through a Rust binding that had to do some tradeoff on the using these engines through Rust bindings that do not support async Rust,
concurrency part. **Our comparison will not reflect the raw performances of with which Garage is built. **Our comparison will therefore not reflect the raw performances of
these database engines, but instead, our integration choices.** these database engines, but instead, our integration choices.**
Still, we think it makes sense to evaluate our implementations in their current Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, ie. handling tiny files. We chose again minio/warp but we of the software, i.e. handling large numbers of tiny files. We chose again
configure it with the smallest possible object size supported by warp, 256 `minio/warp` as a benchmark tool but we
bytes, to put some pressure on the metadata engine. We evaluate sled twice: configured it with the smallest possible object size it supported, 256
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush with its default configuration, and with a configuration where we set a flush
interval of 10 minutes to disable fsync. interval of 10 minutes to disable `fsync`.
*Note that S3 has not been designed for such small objects; a regular database, *Note that S3 has not been designed for such workloads that store huge numbers of small objects;
like Cassandra, would be more appropriate for such workloads. This test has a regular database, like Cassandra, would be more appropriate. This test has
only been designed to stress our metadata engine, it is not indicative of only been designed to stress our metadata engine, and is not indicative of
real-world performances.* real-world performances.*
![Plot of our metadata engines comparison with Warp](db_engine.png) ![Plot of our metadata engines comparison with Warp](db_engine.png)
Unsurprisingly, we observe abysmal performances for sqlite, the engine we have Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
the less tested and kept fsync for each write. lmdb performs twice better than the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
sled in its default version and 60% better than the "no fsync" version in our with sled in its default version and 60% better than the "no `fsync`" sled version in our
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
disk storage and RAM; we would like to quantify that in the future. As we are disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that sqlite is not ready for production strong conclusions. Still, we can say that SQLite is not ready for production
workloads, LMDB looks very promising both in terms of performances and resource workloads, and that LMDB looks very promising both in terms of performances and resource
usage, it is a very good candidate for Garage's default metadata engine in the usage, and is a very good candidate for being Garage's default metadata engine in the
future, and we need to define a data policy for Garage that would help us future. In the future, we will need to define a data policy for Garage to help us
arbitrate between performances and durability. arbitrate between performances and durability.
*To fsync or not to fsync? Performance is nothing without reliability, so we *To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then losing it. need to better assess the impact of validating a write and then possibly losing it.
Because Garage is a distributed system, even if a node loses its write due to a Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare power loss, it will fetch it back from the 2 other nodes storing it. But rare
situations where 1 node is down and the 2 others validated the write and then situations can occur, where 1 node is down and the 2 others validated the write and then
lost power can occur, what is our policy in this case? For storage durability, lost power. What is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes, we are already supposing that we never lose the storage of more than 2 nodes,
should we also expect that we don't lose power on more than 2 nodes at the same so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same time? What should we think about people hosting all their nodes at the same
place without a UPS? Historically, it seems that Minio developers also accepted place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536), ([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to [HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata are persisted on disk - in combination with only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O `O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274), ([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
@ -301,10 +302,10 @@ number of times (128 by default) to effectively create a certain number of
objects on the target cluster (1M by default). On our local setup with 3 objects on the target cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how many times it took to Garage and Minio to handle following plot, we show how many times it took to Garage and Minio to handle
each batch. each batch.
Before looking at the plot, **you must keep in mind some important points about Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**. Minio and Garage internals**.
Minio has no metadata engine, it stores its objects directly on the filesystem. Minio has no metadata engine, it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on Sending 1 million objects on Minio results in creating one million inodes on
@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
fsync for our metadata engine, minio has some fsync logic here slowing down the `fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a cost is negligible with bigger objects. In the end, again, we use Minio as a
reference to understand what is our performance budget for each part of our reference to understand what is our performance budget for each part of our
@ -330,7 +331,7 @@ metadata engine and thus focus only on 16-byte objects.
It appears that the performances of our metadata engine are acceptable, as we It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x times have a comfortable margin compared to Minio (Minio is between 3x and 4x times
slower per batch). We also note that, past 200k objects, Minio batch slower per batch). We also note that, past 200k objects, Minio batch
completion time is constant as Garage's one is still increasing in the observed range: completion time is constant as Garage's one is still increasing in the observed range:
it could be interesting to know if Garage batch's completion time would cross Minio's one it could be interesting to know if Garage batch's completion time would cross Minio's one
for a very large number of objects. If we reason per object, both Minio and for a very large number of objects. If we reason per object, both Minio and
Garage performances remain very good: it takes respectively around 20ms and Garage performances remain very good: it takes respectively around 20ms and
@ -396,7 +397,7 @@ For example, on Garage, a GetObject request does two sequential calls: first,
it asks for the descriptor of the requested object containing the block list of it asks for the descriptor of the requested object containing the block list of
the requested object, then it retrieves its blocks. We can expect that the the requested object, then it retrieves its blocks. We can expect that the
request duration of a small GetObject request will be close to twice the request duration of a small GetObject request will be close to twice the
network latency. network latency.
We tested this theory with another benchmark of our own named We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
@ -417,7 +418,7 @@ RemoveObject). It is understandable: Minio has not been designed for
environments with high latencies, you are expected to build your clusters in environments with high latencies, you are expected to build your clusters in
the same datacenter, and then possibly connect them with their asynchronous the same datacenter, and then possibly connect them with their asynchronous
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) [Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature. feature.
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) *Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased but it is even more sensitive to latency: "Multi-site replication has increased
@ -454,7 +455,7 @@ that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side but we have simply exhausted the resource hitting a limit on the protocol side but we have simply exhausted the resource
of our testing nodes. In the future, we would like to run this experiment of our testing nodes. In the future, we would like to run this experiment
again, but on way more physical nodes, to confirm our hypothesis. For now, we again, but on way more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work. are confident that a Garage cluster with 100+ nodes should work.
## Conclusion and Future work ## Conclusion and Future work
@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
During this work, we identified some sensitive points on Garage we will During this work, we identified some sensitive points on Garage we will
continue working on: our data durability target and interaction with the continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
our components, our new metadata engines (lmdb, sqlite) still need some testing our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) have a small and tuning, and we know that raw I/O (GetObject, PutObject) have a small
improvement margin. improvement margin.
@ -489,11 +490,11 @@ soon introduce officially a new API (as a technical preview) named K2V
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). ([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
## Notes ## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) [^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)
existence. This tool is far more complex than our set of scripts, but we know existence. This tool is far more complex than our set of scripts, but we know
that it is also way more versatile. that it is also way more versatile.
[^ref2]: The program name contains the word "billion" and we only tested Garage [^ref2]: The program name contains the word "billion" and we only tested Garage
up to 1 "million" object, this is not a typo, we were just a little bit too up to 1 "million" object, this is not a typo, we were just a little bit too