New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 61 additions and 58 deletions
|
@ -33,7 +33,7 @@ cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
|
|||
our results are thus not an evaluation of the performance of Garage as a whole.
|
||||
|
||||
For some benchmarks, we used Minio as a reference. It must be noted that we did
|
||||
not try to optimize its configuration as we have done on Garage, and more
|
||||
not try to optimize its configuration as we have done for Garage, and more
|
||||
generally, we have way less knowledge on Minio than on Garage, which can lead
|
||||
to underestimated performance measurements for Minio. It must also be noted that
|
||||
Garage and Minio are systems with different feature sets. For instance Minio supports
|
||||
|
@ -41,7 +41,7 @@ erasure coding for higher data density, which Garage doesn't, Minio implements
|
|||
way more S3 endpoints than Garage, etc. Such features necessarily have a cost
|
||||
that you must keep in mind when reading the plots we will present. You should consider
|
||||
results on Minio as a way to contextualize our results on Garage, to see that our improvements
|
||||
are not artificial compared to existing object storage implementations.
|
||||
are not artificial in the light of existing object storage implementations.
|
||||
|
||||
The impact of the testing environment is also not evaluated (kernel patches,
|
||||
configuration, parameters, filesystem, hardware configuration, etc.). Some of
|
||||
|
@ -66,7 +66,7 @@ We made a first batch of tests on
|
|||
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
|
||||
testbed for experiment-driven research in all areas of computer science,
|
||||
which has an
|
||||
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
||||
[open access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
|
||||
During our tests, we used part of the following clusters:
|
||||
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
|
||||
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
|
||||
|
@ -92,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
|
|||
across the network, and the faster these two functions can be accomplished,
|
||||
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||
2 aspects of performance. First, since many applications can start processing a file
|
||||
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
|
||||
before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
|
||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||
and the moment where the first bytes of the returned object are received by the client.
|
||||
Second, we will evaluate generic throughput, to understand how well
|
||||
|
@ -103,17 +103,17 @@ web endpoints, with the idea to make it a platform of choice to publish
|
|||
static websites. When publishing a website, TTFB can be directly observed
|
||||
by the end user, as it will impact the perceived responsiveness of the websites.
|
||||
|
||||
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high.
|
||||
Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
|
||||
This can be explained by the fact that Garage was not able to handle data internally
|
||||
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
|
||||
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||
(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
|
||||
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
|
||||
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
|
||||
request, the first block had to be fully retrieved by the gateway node from the
|
||||
storage node before starting to send any data to the client.
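
To make the block layout concrete, here is a minimal Python sketch of how an object's size maps to block sizes. It assumes the 1MB default block size mentioned above and counts 1MB as 10^6 bytes for readability. The function name `split_into_blocks` is hypothetical: this is an illustration of the arithmetic, not Garage's actual (Rust) code.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes, in bytes, of the blocks an object would be split into.

    `block_size` defaults to 1MB (taken here as 10**6 bytes, an assumption
    made for readability).
    """
    if object_size < 0 or block_size <= 0:
        raise ValueError("object_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left of the object.
        sizes.append(remainder)
    return sizes


# The 4.5MB example from the text: 4 blocks of 1MB and 1 block of 0.5MB.
print(split_into_blocks(4_500_000))
# → [1000000, 1000000, 1000000, 1000000, 500000]
```

With the old design, the gateway had to receive the whole first entry of this list before the client saw any byte.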
|
||||
|
||||
With Garage v0.8, we integrated a block streaming logic that allows the gateway
|
||||
to send the beginning of a block without having to wait for the full block from
|
||||
With Garage v0.8, we integrated a data streaming logic that allows the gateway
|
||||
to send the beginning of a block without having to wait for the full block to be received from
|
||||
the storage node. We can visually represent the difference as follows:
|
||||
|
||||
<center>
|
||||
|
@ -138,23 +138,23 @@ whose results are shown on the following figure:
|
|||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||
|
||||
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
||||
and 2s, which corresponds to the time to transfer the full block which we calculated before.
|
||||
and 2s, which corresponds to the time to transfer the full block which we calculated above.
|
||||
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
||||
streaming feature (the lowest value is 43ms). Minio sits between the two
|
||||
Garage versions: we suppose that it does some form of batching, but smaller
|
||||
than 1MB.
|
||||
|
||||
**Throughput** - As soon as we publicly released Garage, people started
|
||||
benchmarking it, comparing its performances with writing directly on the
|
||||
benchmarking it, comparing its performances to writing directly on the
|
||||
filesystem, and observed that Garage was slower (e.g.
|
||||
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
|
||||
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
|
||||
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread,
|
||||
and many others
|
||||
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
|
||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
|
||||
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
||||
to better control resource usage
|
||||
and detect errors, like semaphores or timeouts, was artificially limiting
|
||||
and detect errors, including semaphores and timeouts, was artificially limiting
|
||||
performances. In another iteration, we made this logic less restrictive at the
|
||||
cost of higher resource consumption under load
|
||||
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
|
||||
|
@ -170,16 +170,16 @@ write directly to the disk when a process creates or updates a file in your
|
|||
filesystem. Instead, the write is kept in memory, and flushed later in a batch
|
||||
with other writes. If a power loss occurs before the OS has time to flush
|
||||
data to disk, some writes will be lost. To ensure that a write is effectively
|
||||
written on disk, the
|
||||
written to disk, the
|
||||
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
||||
which blocks until the file or directory on which it is called has been written from volatile
|
||||
which blocks until the file or directory on which it is called has been flushed from volatile
|
||||
memory to the persistent storage device. Additionally, the exact semantics of
|
||||
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
||||
and, even on battle-tested software like Postgres, it was
|
||||
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
||||
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
||||
now, you should expect limited data durability in case of power loss, as we are
|
||||
aware of some inconsistency on this point (which we describe in the following
|
||||
aware of some inconsistencies on this point (which we describe in the following
|
||||
and plan to solve).*
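
As an illustration of what "ensuring a write is on disk" involves, here is a minimal Python sketch of a durable file write on a POSIX system. It is not Garage's code (Garage is written in Rust) and the helper name `durable_write` is hypothetical. It only uses `os.fsync` from the Python standard library, which wraps the `fsync(2)` call discussed above.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and block until the OS reports it as flushed."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush the file to the device
    # Also fsync the parent directory so that the new directory entry
    # itself is persisted (opening a directory this way is POSIX-only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


# Usage: write a small file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    durable_write(os.path.join(tmp, "block.bin"), b"some block data")
```

Skipping the two `os.fsync` calls makes the function return much faster, but the data may then still sit in volatile memory when it returns, which is exactly the trade-off described above.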
|
||||
|
||||
To assess performance improvements, we used the benchmark tool
|
||||
|
@ -191,10 +191,10 @@ performance with a standardized and mixed workload.
|
|||
![Plot showing IO performances of Garage configurations and Minio](io.png)
|
||||
|
||||
Minio, our reference point, gives us the best performances in this test.
|
||||
Looking at Garage, we observe that each improvement we made has a visible
|
||||
Looking at Garage, we observe that each improvement we made had a visible
|
||||
impact on performances. We also note that we have a progress margin in
|
||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
||||
monitoring could help better understand the remaining difference.
|
||||
monitoring could help us better understand the remaining difference.
|
||||
|
||||
|
||||
## A myriad of objects
|
||||
|
@ -205,7 +205,7 @@ communicates about Ceph being able to handle 10 billion objects. All these
|
|||
objects must be tracked efficiently in the system to be fetched, listed,
|
||||
removed, etc. In Garage, we use a "metadata engine" component to track them.
|
||||
For this analysis, we compare different metadata engines in Garage and see how
|
||||
well the best one scale to a million objects.
|
||||
well the best one scales to a million objects.
|
||||
|
||||
**Testing metadata engines** - With Garage, we chose not to store metadata
|
||||
directly on the filesystem, like Minio for example, but in a specialized on-disk
|
||||
|
@ -222,11 +222,11 @@ are both experimental: contrarily to sled, we have never run them in production
|
|||
for a long time.**
|
||||
|
||||
Similarly to the impact of `fsync` on block writing, each database engine we use
|
||||
has its own policy with `fsync`. Sled flushes its write every 2 seconds by
|
||||
default, this is
|
||||
has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
|
||||
default (this is
|
||||
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
|
||||
LMDB by default does an `fsync` on each write, which on early tests led to very
|
||||
slow resynchronizations between nodes. We thus added 2 flags,
|
||||
LMDB by default does an `fsync` on each write, which on early tests led to
|
||||
abysmal performance. We thus added 2 flags,
|
||||
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
|
||||
and
|
||||
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
|
||||
|
@ -234,14 +234,15 @@ to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
|
|||
`pragma synchronous = off`, but we have not started any optimization work on it yet:
|
||||
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
|
||||
using these engines through Rust bindings that do not support async Rust,
|
||||
with which Garage is built. **Our comparison will therefore not reflect the raw performances of
|
||||
with which Garage is built, which has an impact on performance as well.
|
||||
**Our comparison will therefore not reflect the raw performances of
|
||||
these database engines, but instead, our integration choices.**
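
For readers unfamiliar with the SQLite setting mentioned above, here is a small Python sketch that turns it off using the standard `sqlite3` module. This is only an illustration of the pragma itself: Garage uses Rust bindings, and, as stated above, its SQLite integration does not apply this setting yet. The helper name `open_without_fsync` is hypothetical.

```python
import os
import sqlite3
import tempfile


def open_without_fsync(path: str) -> sqlite3.Connection:
    """Open a SQLite database and ask it not to wait for data to reach disk."""
    conn = sqlite3.connect(path)
    # With synchronous = OFF, SQLite hands writes to the OS and continues
    # without syncing: faster, but recent writes can be lost on power loss.
    conn.execute("PRAGMA synchronous = OFF")
    return conn


# Usage: open a database in a temporary directory and read the setting back.
with tempfile.TemporaryDirectory() as tmp:
    conn = open_without_fsync(os.path.join(tmp, "meta.sqlite"))
    level = conn.execute("PRAGMA synchronous").fetchone()[0]
    print(level)  # 0 means OFF
    conn.close()
```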
|
||||
|
||||
Still, we think it makes sense to evaluate our implementations in their current
|
||||
state in Garage. We designed a benchmark that is intensive on the metadata part
|
||||
of the software, i.e. handling large numbers of tiny files. We chose again
|
||||
`minio/warp` as a benchmark tool but we
|
||||
configured it with the smallest possible object size it supported, 256
|
||||
`minio/warp` as a benchmark tool, but we
|
||||
configured it here with the smallest possible object size it supported, 256
|
||||
bytes, to put some pressure on the metadata engine. We evaluated sled twice:
|
||||
with its default configuration, and with a configuration where we set a flush
|
||||
interval of 10 minutes to disable `fsync`.
|
||||
|
@ -261,19 +262,19 @@ disk storage and RAM; we would like to quantify that in the future. As we are
|
|||
only at the very beginning of our work on metadata engines, it is hard to draw
|
||||
strong conclusions. Still, we can say that SQLite is not ready for production
|
||||
workloads, and that LMDB looks very promising both in terms of performances and resource
|
||||
usage, and is a very good candidate for being Garage's default metadata engine in the
|
||||
future. In the future, we will need to define a data policy for Garage to help us
|
||||
arbitrate between performances and durability.
|
||||
usage, and is a very good candidate for being Garage's default metadata engine in
|
||||
future releases. In the future, we will need to define a data policy for Garage to help us
|
||||
arbitrate between performance and durability.
|
||||
|
||||
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
|
||||
need to better assess the impact of validating a write and then possibly losing it.
|
||||
need to better assess the impact of possibly losing a write after it has been validated.
|
||||
Because Garage is a distributed system, even if a node loses its write due to a
|
||||
power loss, it will fetch it back from the 2 other nodes storing it. But rare
|
||||
situations can occur, where 1 node is down and the 2 others validated the write and then
|
||||
lost power. What is our policy in this case? For storage durability,
|
||||
power loss, it will fetch it back from the 2 other nodes that store it. But rare
|
||||
situations can occur where 1 node is down and the 2 others validate the write and then
|
||||
lose power before having time to flush to disk. What is our policy in this case? For storage durability,
|
||||
we are already supposing that we never lose the storage of more than 2 nodes,
|
||||
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
|
||||
time? What should we think about people hosting all their nodes at the same
|
||||
time? What should we do about people hosting all of their nodes at the same
|
||||
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
|
||||
some compromises on this side
|
||||
([#3536](https://github.com/minio/minio/issues/3536),
|
||||
|
@ -286,7 +287,7 @@ only data and not metadata is persisted on disk - in combination with
|
|||
|
||||
**Storing a million objects** - Object storage systems are designed not only
|
||||
for data durability and availability but also for scalability, so naturally,
|
||||
some people asked us how scalable Garage is. If answering this
|
||||
some people asked us how scalable Garage is. While giving a definitive answer to this
|
||||
question is out of the scope of this study, we wanted to be sure that our
|
||||
metadata engine would be able to scale to a million objects. To put this
|
||||
target in context, it remains small compared to other industrial solutions:
|
||||
|
@ -300,10 +301,10 @@ We wrote our own benchmarking tool for this test,
|
|||
The benchmark procedure consists of
|
||||
concurrently sending a defined number of tiny objects (8192 objects of 16
|
||||
bytes by default) and measuring the time it takes. This step is then repeated a given
|
||||
number of times (128 by default) to effectively create a certain target number of
|
||||
number of times (128 by default) to effectively create a target number of
|
||||
objects on the cluster (1M by default). On our local setup with 3
|
||||
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
|
||||
following plot, we show how much time it took to Garage and Minio to handle
|
||||
following plot, we show how much time it took Garage and Minio to handle
|
||||
each batch.
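
The procedure described above can be sketched in a few lines of Python. This is not the actual benchmarking tool: `put_object` is a hypothetical stand-in for an S3 PutObject call, and the number of worker threads is an assumption. Only the defaults (8192 objects of 16 bytes, 128 batches) come from the text.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_benchmark(
    put_object: Callable[[str, bytes], None],
    objects_per_batch: int = 8192,
    batches: int = 128,
    object_size: int = 16,
    workers: int = 32,  # assumption: the real tool's concurrency is not stated
) -> list[float]:
    """Send `batches` batches of tiny objects, returning each batch's duration in seconds."""
    payload = b"x" * object_size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"batch-{batch}/obj-{i}" for i in range(objects_per_batch)]
            start = time.perf_counter()
            # Send the whole batch concurrently and wait for it to complete.
            list(pool.map(lambda key: put_object(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# With the defaults, the target is 8192 * 128 objects, i.e. about 1M.
print(8192 * 128)  # → 1048576

# Small demo against an in-memory dict standing in for the cluster.
store: dict[str, bytes] = {}
timings = run_benchmark(store.__setitem__, objects_per_batch=100, batches=4)
print(len(store), len(timings))  # → 400 4
```

Plotting the returned durations against the batch index gives one curve per system under test.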
|
||||
|
||||
Before looking at the plot, **you must keep in mind some important points about
|
||||
|
@ -312,14 +313,14 @@ the internals of both Minio and Garage**.
|
|||
Minio has no metadata engine, it stores its objects directly on the filesystem.
|
||||
Sending 1 million objects on Minio results in creating one million inodes on
|
||||
the storage server in our current setup. So the performances of the filesystem
|
||||
will probably substantially impact the results we will observe.
|
||||
probably have a substantial impact on the results we observe.
|
||||
In our precise setup, we know that the
|
||||
filesystem we used is not adapted at all for Minio (encryption layer, fixed
|
||||
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
|
||||
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
|
||||
creation of objects. Finally, object storage is designed for big objects: the
|
||||
costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
|
||||
reference to understand what is our performance budget for each part of our
|
||||
creation of objects. Finally, object storage is designed for big objects, for which the
|
||||
costs measured here are negligible. In the end, again, we use Minio as a
|
||||
reference point to understand what performance budget we have for each part of our
|
||||
software.
|
||||
|
||||
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
|
||||
|
@ -334,8 +335,8 @@ metadata engine and thus focus only on 16-byte objects.
|
|||
It appears that the performances of our metadata engine are acceptable, as we
|
||||
have a comfortable margin compared to Minio (Minio is between 3x and 4x
|
||||
slower per batch). We also note that, past the 200k objects mark, Minio's
|
||||
time to complete a batch of inserts is constant, while on Garage it is still increasing on the observed range.
|
||||
It could be interesting to know if Garage batch's completion time would cross Minio's one
|
||||
time to complete a batch of inserts is constant, while on Garage it still increases on the observed range.
|
||||
It would be interesting to know whether Garage's batch completion time would cross Minio's
|
||||
for a very large number of objects. If we reason per object, both Minio's and
|
||||
Garage's performances remain very good: it takes respectively around 20ms and
|
||||
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
|
||||
|
@ -348,17 +349,17 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
|
|||
|
||||
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
||||
|
||||
Two effects are now more visible: 1., increasing batch completion time increases with the
|
||||
Two effects are now more visible: 1., batch completion time increases with the
|
||||
number of objects in the bucket and 2., measurements are dispersed, at least
|
||||
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
||||
but we don't have enough data points to conclude safely: additional
|
||||
measurements are needed. Concerning the observed instability, it could
|
||||
be a symptom of what we saw with some other experiments on this machine,
|
||||
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
||||
request timeouts and failures. If this occurs on our testing computer, it will
|
||||
occur on other servers too: it would be interesting to better understand this
|
||||
issue, document how to avoid it, and potentially change how we handle our I/O. At the same
|
||||
time, this was a very stressful test that will probably not be encountered in
|
||||
request timeouts and failures. If this occurs on our testing computer, it might
|
||||
occur on other servers as well: it would be interesting to better understand this
|
||||
issue, document how to avoid it, and potentially change how we handle our I/O
|
||||
internally in Garage. But still, this was a very stressful test that will probably not be encountered in
|
||||
many setups: we were adding 273 objects per second for 30 minutes straight!
|
||||
|
||||
To conclude this part, Garage can ingest 1 million tiny objects while remaining
|
||||
|
@ -366,16 +367,16 @@ usable on our local setup. To put this result in perspective, our production
|
|||
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
|
||||
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
|
||||
to store people's media files (profile pictures, shared pictures, videos,
|
||||
audios, documents...). Thanks to this benchmark, we have identified two points
|
||||
audio files, documents...). Thanks to this benchmark, we have identified two points
|
||||
of vigilance: the increase of batch insert time with the number of existing
|
||||
objects in the cluster in the observed range, and the volatility in our measured data that
|
||||
could be a symptom of our system freezing under the load. Despite these two
|
||||
points, we are confident that Garage could scale way above 1M+ objects, although
|
||||
points, we are confident that Garage could scale way above 1M objects, although
|
||||
that remains to be proven.
|
||||
|
||||
## In an unpredictable world, stay resilient
|
||||
|
||||
Supporting a variety of network properties and computers, especially ones that
|
||||
Supporting a variety of real-world networks and computers, especially ones that
|
||||
were not designed for software-defined storage or even for server purposes, is the
|
||||
core value proposition of Garage. For example, our production cluster is
|
||||
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
|
||||
|
@ -387,8 +388,8 @@ latency and the number of nodes in the cluster impact the duration of the most
|
|||
important kinds of S3 requests.
|
||||
|
||||
**Latency amplification** - With the kind of networks we use (consumer-grade
|
||||
fiber links across the EU), the observed latency is in the 50ms range between
|
||||
nodes. When latency is not negligible, you will observe that request completion
|
||||
fiber links across the EU), the observed latency between nodes is in the 50ms range.
|
||||
When latency is not negligible, you will observe that request completion
|
||||
time is a multiple of the observed latency. That's to be expected: in many cases, the
|
||||
node of the cluster you are contacting cannot directly answer your request, and
|
||||
has to reach other nodes of the cluster to get the requested information. Each
|
||||
|
@ -403,7 +404,7 @@ from storage nodes. We can therefore expect that the
|
|||
request duration of a small GetObject request will be close to twice the
|
||||
network latency.
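
This expectation can be written down as a very simple model. The Python sketch below is an illustration under strong assumptions: it ignores processing time and bandwidth, and treats the "network latency" of the text as the cost of one sequential exchange between nodes. The function name and its `sequential_exchanges` parameter are hypothetical.

```python
def expected_request_duration_ms(latency_ms: float, sequential_exchanges: int = 2) -> float:
    """Estimate a request's duration as a multiple of the inter-node latency.

    Deliberately naive: only network exchanges that must happen one after
    the other are counted; everything else is assumed negligible.
    """
    if latency_ms < 0 or sequential_exchanges < 0:
        raise ValueError("latency_ms and sequential_exchanges must be >= 0")
    return latency_ms * sequential_exchanges


# With the 50ms latency mentioned above, a small GetObject request
# would be expected to take around 100ms.
print(expected_request_duration_ms(50))  # → 100
```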
|
||||
|
||||
We tested this theory with another benchmark of our own named
|
||||
We tested the latency amplification theory with another benchmark of our own named
|
||||
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
|
||||
which does a single request at a time on an endpoint and measures its response
|
||||
time. As we are not interested in bandwidth but latency, all our requests
|
||||
|
@ -420,7 +421,7 @@ v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
|||
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
||||
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
||||
environments with high latencies. Instead, it expects to run on clusters that are built
|
||||
in a singe data center. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous
|
||||
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
||||
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||
feature.
|
||||
|
||||
|
@ -443,7 +444,7 @@ This test was ran directly on Grid5000 with 6 physical servers spread
|
|||
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
|
||||
to 65 instances of Garage simultaneously, for a total of 390 nodes. The
|
||||
network between physical servers is the dedicated network provided by
|
||||
the Grid5000 operators. Nodes on the same physical machine communicate directly
|
||||
the Grid5000 community. Nodes on the same physical machine communicate directly
|
||||
through the Linux network stack without any limitation. We are aware that this is a
|
||||
weakness of this test, but we still think that this test can be relevant as, at
|
||||
each step in the test, each instance of Garage has 83% (5/6) of its connections
|
||||
|
@ -487,8 +488,10 @@ terabytes of data and billions of objects on long-lasting experiments.
|
|||
|
||||
In the meantime, stay tuned: we have released
|
||||
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
|
||||
and we are working on proving and explaining our layout algorithm
|
||||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also
|
||||
and are already working on a number of features for the next version.
|
||||
For instance, we are working on a new layout that will have enhanced optimality properties,
|
||||
as well as a theoretical proof of correctness
|
||||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
|
||||
working on a Python SDK for Garage's administration API
|
||||
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
|
||||
soon officially introduce a new API (as a technical preview) named K2V
|
||||
|
|