Fixes until "millions of objects"

This commit is contained in:
Alex 2022-09-28 16:12:45 +02:00
parent bacebcfbf1
commit 7a354483d7
Signed by: lx
GPG key ID: 0E496D15096376BE

View file

@ -22,7 +22,7 @@ to reflect the high-level properties we are seeking.*
The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist.
exhaustively as possible in this first section, but other limitations might exist.
Most of our tests were made on simulated networks, which by definition cannot represent all the
diversity of real networks (dynamic drop, jitter, latency, all of which could be
@ -109,7 +109,7 @@ at a smaller granularity level than entire data blocks, which are 1MB chunks of
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client.
storage node before starting to send any data to the client.
With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from
@ -125,7 +125,7 @@ thus adding at most 8ms of latency to a GetObject request (assuming no other
data transfer is happening in parallel). However,
on a very slow network, or a very congested link with many parallel requests
handled, the impact can be much more important: on a 5Mbps network, it takes 1.6 seconds
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
to transfer our 1MB block, and streaming has the potential of heavily improving user experience.
We wanted to see if this theory holds in practice: we simulated a low latency
but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
@ -185,7 +185,7 @@ To assess performance improvements, we used the benchmark tool
[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload.
performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)
@ -194,7 +194,7 @@ Looking at Garage, we observe that each improvement we made has a visible
impact on performances. We also note that we have a progress margin in
terms of performances compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference.
## A myriad of objects
@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scale to a million objects.
**Testing metadata engines** - With Garage, we chose to not store metadata
directly on the filesystem, like Minio for example, but in an on-disk fancy
B-Tree structure, in other words, in an embedded database engine. Until now,
the only available option was [sled](https://sled.rs/), but we started having
serious issues with it, and we were not alone
**Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in a specialized on-disk
B-Tree data structure; in other words, in an embedded database engine. Until now,
the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction semantic over the features we expect from our
database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: lmdb
([heed](https://github.com/meilisearch/heed)) and sqlite
([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
the rest of our codebase. We added two additional backends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrarily to sled, we have never run them in production
for a long time.**
Similarly to the impact of fsync on block writing, each database engine we use
has its own policy with fsync. Sled flushes its write every 2 seconds by
Similarly to the impact of `fsync` on block writing, each database engine we use
has its own policy with `fsync`. Sled flushes its write every 2 seconds by
default, this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
lmdb by default does an `fsync` on each write, on early tests it led to very
slow resynchronizations between nodes. We added 2 flags:
LMDB by default does an `fsync` on each write, which on early tests led to very
slow resynchronizations between nodes. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
`pragma synchronous = off;`, but we did not start any optimization work on it:
our sqlite implementation fsync all the data on the disk. Additionally, we are
using these engines through a Rust binding that had to do some tradeoff on the
concurrency part. **Our comparison will not reflect the raw performances of
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
`pragma synchronous = off`, but we have not started any optimization work on it yet:
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
using these engines through Rust bindings that do not support async Rust,
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
these database engines, but instead, our integration choices.**
Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, ie. handling tiny files. We chose again minio/warp but we
configure it with the smallest possible object size supported by warp, 256
bytes, to put some pressure on the metadata engine. We evaluate sled twice:
of the software, i.e. handling large numbers of tiny files. We chose again
`minio/warp` as a benchmark tool but we
configured it with the smallest possible object size it supported, 256
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush
interval of 10 minutes to disable fsync.
interval of 10 minutes to disable `fsync`.
*Note that S3 has not been designed for such small objects; a regular database,
like Cassandra, would be more appropriate for such workloads. This test has
only been designed to stress our metadata engine, it is not indicative of
*Note that S3 has not been designed for such workloads that store huge numbers of small objects;
a regular database, like Cassandra, would be more appropriate. This test has
only been designed to stress our metadata engine, and is not indicative of
real-world performances.*
![Plot of our metadata engines comparison with Warp](db_engine.png)
Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
the less tested and kept fsync for each write. lmdb performs twice better than
sled in its default version and 60% better than the "no fsync" version in our
Unsurprisingly, we observe abysmal performances with SQLite, the engine which we have
the less tested and that still does an `fsync` for each write. Garage with LMDB performs twice better than
with sled in its default version and 60% better than the "no `fsync`" sled version in our
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that sqlite is not ready for production
workloads, LMDB looks very promising both in terms of performances and resource
usage, it is a very good candidate for Garage's default metadata engine in the
future, and we need to define a data policy for Garage that would help us
strong conclusions. Still, we can say that SQLite is not ready for production
workloads, and that LMDB looks very promising both in terms of performances and resource
usage, and is a very good candidate for being Garage's default metadata engine in the
future. In the future, we will need to define a data policy for Garage to help us
arbitrate between performances and durability.
*To fsync or not to fsync? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then losing it.
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then possibly losing it.
Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare
situations where 1 node is down and the 2 others validated the write and then
lost power can occur, what is our policy in this case? For storage durability,
situations can occur, where 1 node is down and the 2 others validated the write and then
lost power. What is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes,
should we also expect that we don't lose power on more than 2 nodes at the same
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same
place without a UPS? Historically, it seems that Minio developers also accepted
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata are persisted on disk - in combination with
only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
@ -301,10 +302,10 @@ number of times (128 by default) to effectively create a certain number of
objects on the target cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how many times it took to Garage and Minio to handle
each batch.
each batch.
Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**.
Minio and Garage internals**.
Minio has no metadata engine, it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on
@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
fsync for our metadata engine, minio has some fsync logic here slowing down the
`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a
reference to understand what is our performance budget for each part of our
@ -330,7 +331,7 @@ metadata engine and thus focus only on 16-byte objects.
It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x times
slower per batch). We also note that, past 200k objects, Minio batch
completion time is constant as Garage's one is still increasing in the observed range:
completion time is constant as Garage's one is still increasing in the observed range:
it could be interesting to know if Garage batch's completion time would cross Minio's one
for a very large number of objects. If we reason per object, both Minio and
Garage performances remain very good: it takes respectively around 20ms and
@ -396,7 +397,7 @@ For example, on Garage, a GetObject request does two sequential calls: first,
it asks for the descriptor of the requested object containing the block list of
the requested object, then it retrieves its blocks. We can expect that the
request duration of a small GetObject request will be close to twice the
network latency.
network latency.
We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
@ -417,7 +418,7 @@ RemoveObject). It is understandable: Minio has not been designed for
environments with high latencies, you are expected to build your clusters in
the same datacenter, and then possibly connect them with their asynchronous
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.
feature.
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
@ -454,7 +455,7 @@ that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side but we have simply exhausted the resource
of our testing nodes. In the future, we would like to run this experiment
again, but on way more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work.
are confident that a Garage cluster with 100+ nodes should work.
## Conclusion and Future work
@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
During this work, we identified some sensitive points on Garage we will
continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
our components, our new metadata engines (lmdb, sqlite) still need some testing
our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) have a small
improvement margin.
@ -489,11 +490,11 @@ soon introduce officially a new API (as a technical preview) named K2V
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
## Notes
## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)
existence. This tool is far more complex than our set of scripts, but we know
that it is also way more versatile.
that it is also way more versatile.
[^ref2]: The program name contains the word "billion" and we only tested Garage
up to 1 "million" object, this is not a typo, we were just a little bit too