New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 61 additions and 58 deletions
|
@ -33,7 +33,7 @@ cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
||||||
our results are thus not an evaluation of the performance of Garage as a whole.
|
our results are thus not an evaluation of the performance of Garage as a whole.
|
||||||
|
|
||||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||||
not try to optimize its configuration as we have done on Garage, and more
|
not try to optimize its configuration as we have done for Garage, and more
|
||||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
generally, we have way less knowledge on Minio than on Garage, which can lead
|
||||||
to underrated performance measurements for Minio. It must also be noted that
|
to underrated performance measurements for Minio. It must also be noted that
|
||||||
Garage and Minio are systems with different feature sets. For instance Minio supports
|
Garage and Minio are systems with different feature sets. For instance Minio supports
|
||||||
|
@ -41,7 +41,7 @@ erasure coding for higher data density, which Garage doesn't, Minio implements
|
||||||
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||||
that you must keep in mind when reading the plots we will present. You should consider
|
that you must keep in mind when reading the plots we will present. You should consider
|
||||||
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
||||||
are not artificial compared to existing object storage implementations.
|
are not artificial in the light of existing object storage implementations.
|
||||||
|
|
||||||
The impact of the testing environment is also not evaluated (kernel patches,
|
The impact of the testing environment is also not evaluated (kernel patches,
|
||||||
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||||
|
@ -66,7 +66,7 @@ We made a first batch of tests on
|
||||||
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
||||||
testbed for experiment-driven research in all areas of computer science,
|
testbed for experiment-driven research in all areas of computer science,
|
||||||
which has an
|
which has an
|
||||||
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
[open access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
||||||
During our tests, we used part of the following clusters:
|
During our tests, we used part of the following clusters:
|
||||||
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
||||||
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
||||||
|
@ -92,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
|
||||||
across the network, and the faster these two functions can be accomplished,
|
across the network, and the faster these two functions can be accomplished,
|
||||||
the more efficient the system as a whole will be. For this analysis, we focus on
|
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||||
2 aspects of performance. First, since many applications can start processing a file
|
2 aspects of performance. First, since many applications can start processing a file
|
||||||
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
|
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
|
||||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||||
and the moment where the first bytes of the returned object are received by the client.
|
and the moment where the first bytes of the returned object are received by the client.
|
||||||
Second, we will evaluate generic throughput, to understand how well
|
Second, we will evaluate generic throughput, to understand how well
|
||||||
|
@ -103,17 +103,17 @@ web endpoints, with the idea to make it a platform of choice to publish
|
||||||
static websites. When publishing a website, TTFB can be directly observed
|
static websites. When publishing a website, TTFB can be directly observed
|
||||||
by the end user, as it will impact the perceived reactivity of the websites.
|
by the end user, as it will impact the perceived reactivity of the websites.
|
||||||
|
|
||||||
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
|
Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
|
||||||
This can be explained by the fact that Garage was not able to handle data internally
|
This can be explained by the fact that Garage was not able to handle data internally
|
||||||
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
||||||
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||||
request, the first block had to be fully retrieved by the gateway node from the
|
request, the first block had to be fully retrieved by the gateway node from the
|
||||||
storage node before starting to send any data to the client.
|
storage node before starting to send any data to the client.
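
As a minimal sketch of the split described above (not Garage's actual code), the following Python snippet computes the block sizes for the 4.5MB example. The helper name `split_into_blocks` is hypothetical, and we assume here that 1MB means 1,000,000 bytes purely to keep the numbers readable.

```python
def split_into_blocks(object_size: int, block_size: int) -> list[int]:
    """Return the sizes (in bytes) of the blocks an object is cut into.

    Hypothetical helper for illustration only: every block is full-sized
    except possibly the last one, which holds the remainder.
    """
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


MB = 1_000_000  # assumption: decimal megabytes, for readability

# The 4.5MB object from the text, with 1MB blocks.
sizes = split_into_blocks(4_500_000, 1 * MB)
print(sizes)
```

With the old design, the whole first entry of that list had to reach the gateway before the client received anything.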
|
||||||
|
|
||||||
With Garage v0.8, we integrated a block streaming logic that allows the gateway
|
With Garage v0.8, we integrated a data streaming logic that allows the gateway
|
||||||
to send the beginning of a block without having to wait for the full block from
|
to send the beginning of a block without having to wait for the full block to be received from
|
||||||
the storage node. We can visually represent the difference as follows:
|
the storage node. We can visually represent the difference as follows:
|
||||||
|
|
||||||
<center>
|
<center>
|
||||||
|
@ -138,23 +138,23 @@ whose results are shown on the following figure:
|
||||||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||||
|
|
||||||
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
||||||
and 2s, which corresponds to the time to transfer the full block which we calculated before.
|
and 2s, which corresponds to the time to transfer the full block which we calculated above.
|
||||||
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
||||||
streaming feature (the lowest value is 43ms). Minio sits between the two
|
streaming feature (the lowest value is 43ms). Minio sits between the two
|
||||||
Garage versions: we suppose that it does some form of batching, but smaller
|
Garage versions: we suppose that it does some form of batching, but smaller
|
||||||
than 1MB.
|
than 1MB.
|
||||||
|
|
||||||
**Throughput** - As soon as we publicly released Garage, people started
|
**Throughput** - As soon as we publicly released Garage, people started
|
||||||
benchmarking it, comparing its performances with writing directly on the
|
benchmarking it, comparing its performances to writing directly on the
|
||||||
filesystem, and observed that Garage was slower (e.g.
|
filesystem, and observed that Garage was slower (e.g.
|
||||||
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
|
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
|
||||||
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
|
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread,
|
||||||
and many others
|
and many others
|
||||||
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
||||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
|
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
|
||||||
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
||||||
to better control resource usage
|
to better control resource usage
|
||||||
and detect errors, like semaphores or timeouts, was artificially limiting
|
and detect errors, including semaphores and timeouts, was artificially limiting
|
||||||
performances. In another iteration, we made this logic less restrictive at the
|
performances. In another iteration, we made this logic less restrictive at the
|
||||||
cost of higher resource consumption under load
|
cost of higher resource consumption under load
|
||||||
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
|
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
|
||||||
|
@ -170,16 +170,16 @@ write directly to the disk when a process creates or updates a file in your
|
||||||
filesystem. Instead, the write is kept in memory, and flushed later in a batch
|
filesystem. Instead, the write is kept in memory, and flushed later in a batch
|
||||||
with other writes. If a power loss occurs before the OS has time to flush
|
with other writes. If a power loss occurs before the OS has time to flush
|
||||||
data to disk, some writes will be lost. To ensure that a write is effectively
|
data to disk, some writes will be lost. To ensure that a write is effectively
|
||||||
written on disk, the
|
written to disk, the
|
||||||
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
||||||
which blocks until the file or directory on which it is called has been written from volatile
|
which blocks until the file or directory on which it is called has been flushed from volatile
|
||||||
memory to the persistent storage device. Additionally, the exact semantics of
|
memory to the persistent storage device. Additionally, the exact semantics of
|
||||||
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
||||||
and, even on battle-tested software like Postgres, it was
|
and, even on battle-tested software like Postgres, it was
|
||||||
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
||||||
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
||||||
now, you should expect limited data durability in case of power loss, as we are
|
now, you should expect limited data durability in case of power loss, as we are
|
||||||
aware of some inconsistency on this point (which we describe in the following
|
aware of some inconsistencies on this point (which we describe in the following
|
||||||
and plan to solve).*
|
and plan to solve).*
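
To make the write-then-`fsync` pattern concrete, here is a minimal Python sketch. It is a generic illustration for a POSIX system, not Garage's implementation: the function name `durable_write` and the file name are hypothetical, and calling `fsync` on the parent directory is an assumption that holds on Linux but not on every platform.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and block until the OS reports it as persisted."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # move Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush the file to the device

    # Assumption (POSIX): also fsync the parent directory so that the new
    # directory entry itself survives a power loss.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Usage example in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "block.bin")
    durable_write(target, b"some block data")
    with open(target, "rb") as f:
        read_back = f.read()
```

Skipping the two `os.fsync` calls makes the write return sooner, at the cost of the durability risk discussed above.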
|
||||||
|
|
||||||
To assess performance improvements, we used the benchmark tool
|
To assess performance improvements, we used the benchmark tool
|
||||||
|
@ -191,10 +191,10 @@ performance with a standardized and mixed workload.
|
||||||
![Plot showing IO performances of Garage configurations and Minio](io.png)
|
![Plot showing IO performances of Garage configurations and Minio](io.png)
|
||||||
|
|
||||||
Minio, our reference point, gives us the best performances in this test.
|
Minio, our reference point, gives us the best performances in this test.
|
||||||
Looking at Garage, we observe that each improvement we made has a visible
|
Looking at Garage, we observe that each improvement we made had a visible
|
||||||
impact on performances. We also note that we have a progress margin in
|
impact on performances. We also note that we have a progress margin in
|
||||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
terms of performances compared to Minio: additional benchmarks, tests, and
|
||||||
monitoring could help better understand the remaining difference.
|
monitoring could help us better understand the remaining difference.
|
||||||
|
|
||||||
|
|
||||||
## A myriad of objects
|
## A myriad of objects
|
||||||
|
@ -205,7 +205,7 @@ communicates about Ceph being able to handle 10 billion objects. All these
|
||||||
objects must be tracked efficiently in the system to be fetched, listed,
|
objects must be tracked efficiently in the system to be fetched, listed,
|
||||||
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
||||||
For this analysis, we compare different metadata engines in Garage and see how
|
For this analysis, we compare different metadata engines in Garage and see how
|
||||||
well the best one scale to a million objects.
|
well the best one scales to a million objects.
|
||||||
|
|
||||||
**Testing metadata engines** - With Garage, we chose not to store metadata
|
**Testing metadata engines** - With Garage, we chose not to store metadata
|
||||||
directly on the filesystem, like Minio for example, but in a specialized on-disk
|
directly on the filesystem, like Minio for example, but in a specialized on-disk
|
||||||
|
@ -222,11 +222,11 @@ are both experimental: contrarily to sled, we have never run them in production
|
||||||
for a long time.**
|
for a long time.**
|
||||||
|
|
||||||
Similarly to the impact of `fsync` on block writing, each database engine we use
|
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||||
has its own policy with `fsync`. Sled flushes its write every 2 seconds by
|
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
|
||||||
default, this is
|
default (this is
|
||||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||||
LMDB by default does an `fsync` on each write, which on early tests led to very
|
LMDB by default does an `fsync` on each write, which on early tests led to
|
||||||
slow resynchronizations between nodes. We thus added 2 flags,
|
abysmal performance. We thus added 2 flags,
|
||||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||||
and
|
and
|
||||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||||
|
@ -234,14 +234,15 @@ to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
||||||
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||||
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
||||||
using these engines through Rust bindings that do not support async Rust,
|
using these engines through Rust bindings that do not support async Rust,
|
||||||
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
|
with which Garage is built, which has an impact on performance as well.
|
||||||
|
**Our comparison will therefore not reflect the raw performances of
|
||||||
these database engines, but instead, our integration choices.**
|
these database engines, but instead, our integration choices.**
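
For readers unfamiliar with the SQLite setting mentioned above, here is a small sketch using Python's standard `sqlite3` module. Garage itself goes through Rust bindings, so this only illustrates the pragma, not our integration; the in-memory database is an assumption made to keep the example self-contained.

```python
import sqlite3

# Assumption: an in-memory database, only to keep the example self-contained.
conn = sqlite3.connect(":memory:")

# Ask SQLite to stop waiting for data to reach the disk on writes.
conn.execute("PRAGMA synchronous = OFF")

# Reading the pragma back returns its numeric value (0 means OFF).
relaxed_mode = conn.execute("PRAGMA synchronous").fetchone()[0]
print(relaxed_mode)

conn.close()
```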
|
||||||
|
|
||||||
Still, we think it makes sense to evaluate our implementations in their current
|
Still, we think it makes sense to evaluate our implementations in their current
|
||||||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||||
of the software, i.e. handling large numbers of tiny files. We chose again
|
of the software, i.e. handling large numbers of tiny files. We chose again
|
||||||
`minio/warp` as a benchmark tool but we
|
`minio/warp` as a benchmark tool, but we
|
||||||
configured it with the smallest possible object size it supported, 256
|
configured it here with the smallest possible object size it supported, 256
|
||||||
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
||||||
with its default configuration, and with a configuration where we set a flush
|
with its default configuration, and with a configuration where we set a flush
|
||||||
interval of 10 minutes to disable `fsync`.
|
interval of 10 minutes to disable `fsync`.
|
||||||
|
@ -261,19 +262,19 @@ disk storage and RAM; we would like to quantify that in the future. As we are
|
||||||
only at the very beginning of our work on metadata engines, it is hard to draw
|
only at the very beginning of our work on metadata engines, it is hard to draw
|
||||||
strong conclusions. Still, we can say that SQLite is not ready for production
|
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||||
workloads, and that LMDB looks very promising both in terms of performances and resource
|
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||||
usage, and is a very good candidate for being Garage's default metadata engine in the
|
usage, and is a very good candidate for being Garage's default metadata engine in
|
||||||
future. In the future, we will need to define a data policy for Garage to help us
|
future releases. In the future, we will need to define a data policy for Garage to help us
|
||||||
arbitrate between performances and durability.
|
arbitrate between performance and durability.
|
||||||
|
|
||||||
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||||
need to better assess the impact of validating a write and then possibly losing it.
|
need to better assess the impact of possibly losing a write after it has been validated.
|
||||||
Because Garage is a distributed system, even if a node loses its write due to a
|
Because Garage is a distributed system, even if a node loses its write due to a
|
||||||
power loss, it will fetch it back from the 2 other nodes storing it. But rare
|
power loss, it will fetch it back from the 2 other nodes that store it. But rare
|
||||||
situations can occur, where 1 node is down and the 2 others validated the write and then
|
situations can occur where 1 node is down and the 2 others validate the write and then
|
||||||
lost power. What is our policy in this case? For storage durability,
|
lose power before having time to flush to disk. What is our policy in this case? For storage durability,
|
||||||
we are already supposing that we never lose the storage of more than 2 nodes,
|
we are already supposing that we never lose the storage of more than 2 nodes,
|
||||||
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
|
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
|
||||||
time? What should we think about people hosting all their nodes at the same
|
time? What should we do about people hosting all of their nodes at the same
|
||||||
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
|
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
|
||||||
some compromises on this side
|
some compromises on this side
|
||||||
([#3536](https://github.com/minio/minio/issues/3536),
|
([#3536](https://github.com/minio/minio/issues/3536),
|
||||||
|
@ -286,7 +287,7 @@ only data and not metadata is persisted on disk - in combination with
|
||||||
|
|
||||||
**Storing a million objects** - Object storage systems are designed not only
|
**Storing a million objects** - Object storage systems are designed not only
|
||||||
for data durability and availability but also for scalability, so naturally,
|
for data durability and availability but also for scalability, so naturally,
|
||||||
some people asked us how scalable Garage is. If answering this
|
some people asked us how scalable Garage is. If giving a definitive answer to this
|
||||||
question is out of the scope of this study, we wanted to be sure that our
|
question is out of the scope of this study, we wanted to be sure that our
|
||||||
metadata engine would be able to scale to a million objects. To put this
|
metadata engine would be able to scale to a million objects. To put this
|
||||||
target in context, it remains small compared to other industrial solutions:
|
target in context, it remains small compared to other industrial solutions:
|
||||||
|
@ -300,10 +301,10 @@ We wrote our own benchmarking tool for this test,
|
||||||
The benchmark procedure consists in
|
The benchmark procedure consists in
|
||||||
concurrently sending a defined number of tiny objects (8192 objects of 16
|
concurrently sending a defined number of tiny objects (8192 objects of 16
|
||||||
bytes by default) and measuring the time it takes. This step is then repeated a given
|
bytes by default) and measuring the time it takes. This step is then repeated a given
|
||||||
number of times (128 by default) to effectively create a certain target number of
|
number of times (128 by default) to effectively create a target number of
|
||||||
objects on the cluster (1M by default). On our local setup with 3
|
objects on the cluster (1M by default). On our local setup with 3
|
||||||
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
||||||
following plot, we show how much time it took to Garage and Minio to handle
|
following plot, we show how much time it took Garage and Minio to handle
|
||||||
each batch.
|
each batch.
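
The procedure above can be sketched as follows. This is not our benchmarking tool: `run_benchmark`, the `put_object` callback, the key naming scheme and the number of worker threads are all hypothetical, and the demo at the bottom writes into a plain dictionary instead of an S3 endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(put_object, batches=128, objects_per_batch=8192,
                  object_size=16, workers=32):
    """Send `batches` batches of tiny objects and time each batch.

    `put_object(key, payload)` is a caller-supplied function (hypothetical
    here) that stores one object. `workers` is an arbitrary assumption.
    """
    payload = b"\0" * object_size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"batch-{batch}/obj-{i}" for i in range(objects_per_batch)]
            start = time.perf_counter()
            # list() waits for the whole batch and re-raises any error.
            list(pool.map(lambda key: put_object(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# Default parameters give 128 * 8192 objects, i.e. the "1M" target.
total_objects = 128 * 8192

# Small demo against an in-memory dictionary standing in for the cluster.
store = {}
durations = run_benchmark(store.__setitem__, batches=4, objects_per_batch=100)
print(len(store), len(durations))
```

Each entry of `durations` corresponds to one point on the plot: the time needed to complete one batch.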
|
||||||
|
|
||||||
Before looking at the plot, **you must keep in mind some important points about
|
Before looking at the plot, **you must keep in mind some important points about
|
||||||
|
@ -312,14 +313,14 @@ the internals of both Minio and Garage**.
|
||||||
Minio has no metadata engine: it stores its objects directly on the filesystem.
|
Minio has no metadata engine: it stores its objects directly on the filesystem.
|
||||||
Sending 1 million objects on Minio results in creating one million inodes on
|
Sending 1 million objects on Minio results in creating one million inodes on
|
||||||
the storage server in our current setup. So the performances of the filesystem
|
the storage server in our current setup. So the performances of the filesystem
|
||||||
will probably substantially impact the results we will observe.
|
probably have a substantial impact on the results we observe.
|
||||||
In our precise setup, we know that the
|
In our precise setup, we know that the
|
||||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||||
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
||||||
creation of objects. Finally, object storage is designed for big objects: the
|
creation of objects. Finally, object storage is designed for big objects, for which the
|
||||||
costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
|
costs measured here are negligible. In the end, again, we use Minio as a
|
||||||
reference to understand what is our performance budget for each part of our
|
reference point to understand what performance budget we have for each part of our
|
||||||
software.
|
software.
|
||||||
|
|
||||||
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
||||||
|
@ -334,8 +335,8 @@ metadata engine and thus focus only on 16-byte objects.
|
||||||
It appears that the performances of our metadata engine are acceptable, as we
|
It appears that the performances of our metadata engine are acceptable, as we
|
||||||
have a comfortable margin compared to Minio (Minio is between 3x and 4x
|
have a comfortable margin compared to Minio (Minio is between 3x and 4x
|
||||||
slower per batch). We also note that, past the 200k objects mark, Minio's
|
slower per batch). We also note that, past the 200k objects mark, Minio's
|
||||||
time to complete a batch of inserts is constant, while on Garage it is still increasing on the observed range.
|
time to complete a batch of inserts is constant, while on Garage it still increases on the observed range.
|
||||||
It could be interesting to know if Garage batch's completion time would cross Minio's one
|
It could be interesting to know if Garage's batch completion time would cross Minio's
|
||||||
for a very large number of objects. If we reason per object, both Minio's and
|
for a very large number of objects. If we reason per object, both Minio's and
|
||||||
Garage's performances remain very good: it takes respectively around 20ms and
|
Garage's performances remain very good: it takes respectively around 20ms and
|
||||||
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
|
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
|
||||||
|
@ -348,17 +349,17 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
|
||||||
|
|
||||||
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
||||||
|
|
||||||
Two effects are now more visible: 1., increasing batch completion time increases with the
|
Two effects are now more visible: 1., batch completion time increases with the
|
||||||
number of objects in the bucket and 2., measurements are dispersed, at least
|
number of objects in the bucket and 2., measurements are dispersed, at least
|
||||||
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
||||||
but we don't have enough data points to conclude safely: additional
|
but we don't have enough data points to conclude safely: additional
|
||||||
measurements are needed. Concerning the observed instability, it could
|
measurements are needed. Concerning the observed instability, it could
|
||||||
be a symptom of what we saw with some other experiments in this machine,
|
be a symptom of what we saw with some other experiments in this machine,
|
||||||
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
||||||
request timeouts and failures. If this occurs on our testing computer, it will
|
request timeouts and failures. If this occurs on our testing computer, it might
|
||||||
occur on other servers too: it would be interesting to better understand this
|
occur on other servers as well: it would be interesting to better understand this
|
||||||
issue, document how to avoid it, and potentially change how we handle our I/O. At the same
|
issue, document how to avoid it, and potentially change how we handle our I/O
|
||||||
time, this was a very stressful test that will probably not be encountered in
|
internally in Garage. But still, this was a very stressful test that will probably not be encountered in
|
||||||
many setups: we were adding 273 objects per second for 30 minutes straight!
|
many setups: we were adding 273 objects per second for 30 minutes straight!
|
||||||
|
|
||||||
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
||||||
|
@ -366,16 +367,16 @@ usable on our local setup. To put this result in perspective, our production
|
||||||
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
|
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
|
||||||
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
|
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
|
||||||
to store people's media files (profile pictures, shared pictures, videos,
|
to store people's media files (profile pictures, shared pictures, videos,
|
||||||
audios, documents...). Thanks to this benchmark, we have identified two points
|
audio files, documents...). Thanks to this benchmark, we have identified two points
|
||||||
of vigilance: the increase of batch insert time with the number of existing
|
of vigilance: the increase of batch insert time with the number of existing
|
||||||
objects in the cluster in the observed range, and the volatility in our measured data that
|
objects in the cluster in the observed range, and the volatility in our measured data that
|
||||||
could be a symptom of our system freezing under the load. Despite these two
|
could be a symptom of our system freezing under the load. Despite these two
|
||||||
points, we are confident that Garage could scale way above 1M+ objects, although
|
points, we are confident that Garage could scale way above 1M objects, although
|
||||||
that remains to be proven.
|
that remains to be proven.
|
||||||
|
|
||||||
## In an unpredictable world, stay resilient
|
## In an unpredictable world, stay resilient
|
||||||
|
|
||||||
Supporting a variety of network properties and computers, especially ones that
|
Supporting a variety of real-world networks and computers, especially ones that
|
||||||
were not designed for software-defined storage or even for server purposes, is the
|
were not designed for software-defined storage or even for server purposes, is the
|
||||||
core value proposition of Garage. For example, our production cluster is
|
core value proposition of Garage. For example, our production cluster is
|
||||||
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
||||||
|
@ -387,8 +388,8 @@ latency and the number of nodes in the cluster impact the duration of the most
|
||||||
important kinds of S3 requests.
|
important kinds of S3 requests.
|
||||||
|
|
||||||
**Latency amplification** - With the kind of networks we use (consumer-grade
|
**Latency amplification** - With the kind of networks we use (consumer-grade
|
||||||
fiber links across the EU), the observed latency is in the 50ms range between
|
fiber links across the EU), the observed latency between nodes is in the 50ms range.
|
||||||
nodes. When latency is not negligible, you will observe that request completion
|
When latency is not negligible, you will observe that request completion
|
||||||
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
||||||
node of the cluster you are contacting cannot directly answer your request, and
|
node of the cluster you are contacting cannot directly answer your request, and
|
||||||
has to reach other nodes of the cluster to get the requested information. Each
|
has to reach other nodes of the cluster to get the requested information. Each
|
||||||
|
@ -403,7 +404,7 @@ from storage nodes. We can therefore expect that the
|
||||||
request duration of a small GetObject request will be close to twice the
|
request duration of a small GetObject request will be close to twice the
|
||||||
network latency.
|
network latency.
|
||||||
|
|
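The latency amplification reasoning above can be written as a back-of-the-envelope model. This is only a sketch: the function name `expected_duration_ms` and the constant `GET_OBJECT_STEPS` are hypothetical names introduced for illustration, and the model assumes that each sequential inter-node step costs exactly one network latency, ignoring disk access and processing time.

```python
def expected_duration_ms(latency_ms: float, sequential_steps: int) -> float:
    """Rough lower bound on request duration.

    Assumes every sequential inter-node step costs one network latency
    and that nothing else (disk, CPU, queuing) takes any time.
    """
    if latency_ms < 0 or sequential_steps < 0:
        raise ValueError("latency and step count must be non-negative")
    return latency_ms * sequential_steps


# Assumption taken from the text: a small GetObject request takes close to
# twice the network latency, i.e. two sequential inter-node steps.
GET_OBJECT_STEPS = 2

if __name__ == "__main__":
    # With the ~50 ms latency mentioned above for consumer-grade fiber links.
    print(expected_duration_ms(50.0, GET_OBJECT_STEPS))
```

Under these assumptions, a 50 ms inter-node latency gives an expected GetObject duration of about 100 ms. Any measured value noticeably above that bound suggests additional sequential steps or non-network costs.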
||||||
We tested this theory with another benchmark of our own named
|
We tested the latency amplification theory with another benchmark of our own named
|
||||||
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
||||||
which does a single request at a time on an endpoint and measures its response
|
which does a single request at a time on an endpoint and measures its response
|
||||||
time. As we are not interested in bandwidth but latency, all our requests
|
time. As we are not interested in bandwidth but latency, all our requests
|
||||||
|
@ -420,7 +421,7 @@ v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
||||||
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
||||||
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
||||||
environments with high latencies. Instead, it expects to run on clusters that are built
|
environments with high latencies. Instead, it expects to run on clusters that are built
|
||||||
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous
|
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
||||||
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||||
feature.
|
feature.
|
||||||
|
|
||||||
|
@ -443,7 +444,7 @@ This test was run directly on Grid5000 with 6 physical servers spread
|
||||||
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
|
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
|
||||||
to 65 instances of Garage simultaneously, for a total of 390 nodes. The
|
to 65 instances of Garage simultaneously, for a total of 390 nodes. The
|
||||||
network between physical servers is the dedicated network provided by
|
network between physical servers is the dedicated network provided by
|
||||||
the Grid5000 operators. Nodes on the same physical machine communicate directly
|
the Grid5000 community. Nodes on the same physical machine communicate directly
|
||||||
through the Linux network stack without any limitation. We are aware that this is a
|
through the Linux network stack without any limitation. We are aware that this is a
|
||||||
weakness of this test, but we still think that this test can be relevant as, at
|
weakness of this test, but we still think that this test can be relevant as, at
|
||||||
each step in the test, each instance of Garage has 83% (5/6) of its connections
|
each step in the test, each instance of Garage has 83% (5/6) of its connections
|
||||||
|
@ -487,8 +488,10 @@ terabytes of data and billions of objects on long-lasting experiments.
|
||||||
|
|
||||||
In the meantime, stay tuned: we have released
|
In the meantime, stay tuned: we have released
|
||||||
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
||||||
and we are working on proving and explaining our layout algorithm
|
and are already working on a number of features for the next version.
|
||||||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also
|
For instance, we are working on a new layout that will have enhanced optimality properties,
|
||||||
|
as well as a theoretical proof of correctness
|
||||||
|
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
|
||||||
working on a Python SDK for Garage's administration API
|
working on a Python SDK for Garage's administration API
|
||||||
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
|
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
|
||||||
soon officially introduce a new API (as a technical preview) named K2V
|
soon officially introduce a new API (as a technical preview) named K2V
|
||||||
|
|