Fixes until "millions of objects"
@@ -22,7 +22,7 @@ to reflect the high-level properties we are seeking.*

The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist.

Most of our tests were made on simulated networks, which by definition cannot represent all the
diversity of real networks (dynamic drop, jitter, latency, all of which could be
@@ -109,7 +109,7 @@ at a smaller granularity level than entire data blocks, which are 1MB chunks of
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client.
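
To make the arithmetic concrete, here is a small illustrative sketch (ours, not Garage's actual code) of how such an object maps to 1MB blocks, and why the old design delayed the first byte sent to the client:

```rust
// Illustrative sketch (not Garage's implementation): split an object into 1MB blocks.
const BLOCK_SIZE: u64 = 1_000_000; // 1MB, as in the example above

fn block_sizes(object_size: u64) -> Vec<u64> {
    let full_blocks = object_size / BLOCK_SIZE;
    let remainder = object_size % BLOCK_SIZE;
    let mut blocks = vec![BLOCK_SIZE; full_blocks as usize];
    if remainder > 0 {
        blocks.push(remainder);
    }
    blocks
}

fn main() {
    // A 4.5MB object becomes [1000000, 1000000, 1000000, 1000000, 500000]:
    // 4 full blocks plus one 0.5MB block.
    println!("{:?}", block_sizes(4_500_000));
    // Pre-v0.8, the gateway sent nothing to the client before the whole first
    // 1MB block had arrived from the storage node; with streaming it can
    // forward bytes as they come in.
}
```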

With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from
@@ -125,7 +125,7 @@ thus adding at most 8ms of latency to a GetObject request (assuming no other
data transfer is happening in parallel). However,
on a very slow network, or a very congested link with many parallel requests
handled, the impact can be much more significant: on a 5Mbps network, it takes 1.6 seconds
to transfer our 1MB block, and streaming has the potential to greatly improve the user experience.
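
The arithmetic behind these two numbers is simple; here is a quick sketch (pure arithmetic, nothing Garage-specific):

```rust
// Time to push one block through a link, ignoring latency and protocol overhead.
fn block_transfer_secs(block_bytes: f64, link_mbps: f64) -> f64 {
    (block_bytes * 8.0) / (link_mbps * 1_000_000.0)
}

fn main() {
    // 1MB over a 1Gbps link: ~0.008s, the "at most 8ms" mentioned above.
    println!("1 Gbps: {:.3} s", block_transfer_secs(1_000_000.0, 1_000.0));
    // 1MB over a 5Mbps link: 1.6s before the client sees anything without streaming.
    println!("5 Mbps: {:.1} s", block_transfer_secs(1_000_000.0, 5.0));
}
```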

We wanted to see if this theory holds in practice: we simulated a low-latency
but slow network using `mknet` and issued some requests with block streaming (Garage v0.8 beta) and
@@ -185,7 +185,7 @@ To assess performance improvements, we used the benchmark tool
[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload.

![Plot showing IO perf of Garage configs and Minio](io.png)

@@ -194,7 +194,7 @@ Looking at Garage, we observe that each improvement we made has a visible
impact on performances. We also note that we still have room for improvement
compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference.


## A myriad of objects

@@ -206,78 +206,79 @@ removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scales to a million objects.

**Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in a specialized on-disk
B-Tree data structure; in other words, in an embedded database engine. Until now,
the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction layer over the features we expect from our
database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrary to sled, we have never run them in production
for a long time.**
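
To illustrate the idea, here is a minimal sketch of such an abstraction; the trait and type names are ours, chosen for this example, and do not correspond to Garage's actual internal API:

```rust
use std::collections::HashMap;

// Minimal sketch of a metadata engine abstraction (illustrative only, not
// Garage's actual trait): callers depend on these primitives and never on a
// specific embedded database.
pub trait MetadataEngine {
    fn get(&self, tree: &str, key: &[u8]) -> Option<Vec<u8>>;
    fn insert(&mut self, tree: &str, key: &[u8], value: &[u8]);
    fn remove(&mut self, tree: &str, key: &[u8]);
}

// Toy in-memory backend: swapping it for a sled-, LMDB- or SQLite-backed one
// would not require touching any caller.
#[derive(Default)]
pub struct MemoryEngine {
    trees: HashMap<String, HashMap<Vec<u8>, Vec<u8>>>,
}

impl MetadataEngine for MemoryEngine {
    fn get(&self, tree: &str, key: &[u8]) -> Option<Vec<u8>> {
        self.trees.get(tree)?.get(key).cloned()
    }
    fn insert(&mut self, tree: &str, key: &[u8], value: &[u8]) {
        self.trees
            .entry(tree.to_string())
            .or_default()
            .insert(key.to_vec(), value.to_vec());
    }
    fn remove(&mut self, tree: &str, key: &[u8]) {
        if let Some(t) = self.trees.get_mut(tree) {
            t.remove(key);
        }
    }
}
```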

Similarly to the impact of `fsync` on block writing, each database engine we use
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
default (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
LMDB by default does an `fsync` on each write, which on early tests led to very
slow resynchronizations between nodes. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
`pragma synchronous = off`, but we have not started any optimization work on it yet:
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
using these engines through Rust bindings that do not support async Rust,
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
these database engines, but instead, our integration choices.**
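
As an illustration of the knobs involved, here is a hedged sketch showing how these sync policies can be tuned from Rust with two of the crates mentioned above (sled and rusqlite); it mirrors what the text describes, but it is not a copy of Garage's code:

```rust
// Illustrative sketch only: tuning the sync behaviour of sled and SQLite from
// Rust. LMDB's equivalents are the MDB_NOSYNC / MDB_NOMETASYNC environment
// flags mentioned above (exposed through the heed crate).
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // sled flushes to disk every 2 seconds by default; raising the interval to
    // 10 minutes, as in our benchmark, effectively removes fsync from the hot path.
    let db = sled::Config::new()
        .path("/tmp/meta-sled")
        .flush_every_ms(Some(600_000))
        .open()?;
    db.insert(b"object/key".to_vec(), b"metadata".to_vec())?;

    // SQLite can skip fsync on writes with this pragma; our current integration
    // does not set it and still syncs every write, hence the poor SQLite results
    // discussed below.
    let conn = rusqlite::Connection::open("/tmp/meta.sqlite")?;
    conn.execute_batch("PRAGMA synchronous = OFF;")?;
    Ok(())
}
```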

Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, i.e. handling large numbers of tiny files. We chose again
`minio/warp` as a benchmark tool, but we
configured it with the smallest possible object size it supported, 256
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush
interval of 10 minutes to disable `fsync`.

*Note that S3 has not been designed for workloads that store huge numbers of small objects;
a regular database, like Cassandra, would be more appropriate. This test has
only been designed to stress our metadata engine, and is not indicative of
real-world performances.*

![Plot of our metadata engines comparison with Warp](db_engine.png)

Unsurprisingly, we observe abysmal performances with SQLite, the engine we have
tested the least and the one that still does an `fsync` for each write. Garage with LMDB performs twice
as well as with sled in its default configuration, and 60% better than with the "no `fsync`" sled
configuration, in our benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that SQLite is not ready for production
workloads, and that LMDB looks very promising both in terms of performances and resource
usage, and is a very good candidate for being Garage's default metadata engine in the
future. We will also need to define a data policy for Garage to help us
arbitrate between performances and durability.

*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then possibly losing it.
Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare
situations can occur where 1 node is down and the 2 others validated the write and then
lost power. What is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes,
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
@@ -301,10 +302,10 @@ number of times (128 by default) to effectively create a certain number of
objects on the target cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how long it took Garage and Minio to handle
each batch.

Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**.

Minio has no metadata engine: it stores its objects directly on the filesystem.
Sending 1 million objects to Minio results in creating one million inodes on
@@ -312,7 +313,7 @@ the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine, whereas Minio has some `fsync` logic here, slowing down the
creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a
reference to understand what our performance budget is for each part of our
@@ -330,7 +331,7 @@ metadata engine and thus focus only on 16-byte objects.
It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3 and 4 times
slower per batch). We also note that, past 200k objects, Minio's batch
completion time is constant while Garage's is still increasing in the observed range:
it could be interesting to know whether Garage's batch completion time would cross Minio's
for a very large number of objects. If we reason per object, both Minio and
Garage performances remain very good: it takes respectively around 20ms and
@@ -396,7 +397,7 @@ For example, on Garage, a GetObject request does two sequential calls: first,
it asks for the descriptor of the requested object, which contains its block list,
then it retrieves its blocks. We can expect that the
request duration of a small GetObject request will be close to twice the
network latency.
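
As a rough model of the reasoning above (an illustration, not a measurement), the expected duration of a small GetObject is just the sum of two sequential round trips:

```rust
// Rough latency model for a small GetObject on Garage, as described above:
// one round trip to fetch the object descriptor (block list), then one round
// trip to fetch the single tiny block; transfer time is negligible for small objects.
fn small_get_object_ms(round_trip_ms: f64) -> f64 {
    let fetch_descriptor = round_trip_ms; // call 1: object metadata
    let fetch_block = round_trip_ms;      // call 2: the block itself
    fetch_descriptor + fetch_block
}

fn main() {
    // e.g. with 100ms of simulated latency (an arbitrary value for illustration),
    // we expect a small GetObject to take about 200ms.
    println!("{} ms", small_get_object_ms(100.0));
}
```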

We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
@@ -417,7 +418,7 @@ RemoveObject). It is understandable: Minio has not been designed for
environments with high latencies: you are expected to build your clusters in
the same datacenter, and then possibly connect them with their asynchronous
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.

*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
@@ -454,7 +455,7 @@ that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side but we have simply exhausted the resources
of our testing nodes. In the future, we would like to run this experiment
again, but on many more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work.


## Conclusion and Future work

@@ -462,7 +463,7 @@ are confident that a Garage cluster with 100+ nodes should work.
During this work, we identified some sensitive points in Garage that we will
continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous across
our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) still has a small
improvement margin.
@@ -489,11 +490,11 @@ soon introduce officially a new API (as a technical preview) named K2V
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).


## Notes

[^ref1]: Yes, we are aware of the existence of [Jepsen](https://github.com/jepsen-io/jepsen).
This tool is far more complex than our set of scripts, but we know
that it is also way more versatile.

[^ref2]: The program name contains the word "billion" and we only tested Garage
up to 1 "million" objects; this is not a typo, we were just a little bit too