garagehq.deuxfleurs.fr/content/blog/2022-perf/index.md

492 lines
29 KiB
Markdown

+++
title="Bringing theoretical design and observed performances face to face"
date=2022-09-26
+++
*For the past years, we have extensively analyzed possible design decisions and
their theoretical tradeoffs on Garage, especially on the network, data
structure, or scheduling side. And it worked well enough for our production
cluster at Deuxfleurs, but we also knew that people started discovering some
unexpected behaviors. We thus started a round of benchmark and performance
measurements to see how Garage behaves compared to our expectations. We split
them into 3 categories: "efficient I/O", "myriads of objects" and "resiliency"
to reflect the high-level properties we are seeking.*
<!-- more -->
---
## ⚠️ Disclaimer
The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as
exhaustively as possible in this section, but other limitations might exist.
Most of our tests are done on simulated networks that can not represent all the
diversity of real networks (dynamic drop, jitter, latency, all of them could be
correlated with throughput or any other external event). We also limited
ourselves to very small workloads that are not representative of a production
cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
our results are thus not an overview of the whole software performance.
For some benchmarks, we used Minio as a reference. It must be noted that we did
not try to optimize its configuration as we have done on Garage, and more
generally, we have way less knowledge on Minio than on Garage, which can lead
to underrated performance measurements for Minio. It must also be noted that
Garage and Minio are systems with different feature sets, *eg.* Minio supports
erasure coding for better data density while Garage doesn't, Minio implements
way more S3 endpoints than Garage, etc. Such features have necessarily a cost
that you must keep in mind when reading plots. You should consider Minio
results as a way to contextualize our results, to check that our improvements
are not artificials compared to existing object storage implementations.
The impact of the testing environment is also not evaluated (kernel patches,
configuration, parameters, filesystem, hardware configuration, etc.), some of
these configurations could favor one configuration/software over another.
Especially, it must be noted that most of the tests were done on a
consumer-grade computer and SSD only, which will be different from most
production setups. Finally, our results are also provided without statistical
tests to check their significance, and thus might be statistically not
significant.
When reading this post, please keep in mind that **we are not making any
business or technical recommendations here, this is not a scientific paper
either**; we only share bits of our development process as honestly as
possible. Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
make your own
tests if you need to take a decision, and remain supportive and caring with
your peers...
## About our testing environment
We started a batch of tests on
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
testbed for experiment-driven research in all areas of computer science, under
the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program.
During our tests, we used part of the following clusters:
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a
geo-distributed topology. We used the Grid5000 testbed only during our
preliminary tests to identify issues when running Garage on many powerful
servers, issues that we then reproduced in a controlled environment; don't be
surprised then if Grid5000 is not mentioned often on our plots.
To reproduce some environments locally, we have a small set of Python scripts
named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
needs[^ref1]. Most of the following tests were thus run locally with mknet on a
single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
RAM, a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
`vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default
values, the system tends to freeze when it is under heavy I/O load.
## Efficient I/O
The main goal of an object storage system is to store or retrieve an object
across the network, and the faster, the better. For this analysis, we focus on
2 aspects: time to first byte, as many applications can start processing a file
before receiving it completely, and generic throughput, to understand how well
Garage can leverage the underlying machine performances.
**Time To First Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea to make it the platform of choice to publish your
static website. When publishing a website, one metric you observe is Time To
First Byte (TTFB), as it will impact the perceived reactivity of your website.
On Garage, time to first byte was a bit high.
This is not surprising as, until now, the smallest level of granularity
internally was handling full blocks. Blocks are 1MB chunks (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size))
of a given object. For example, a 4.5MB object will be split into 4 blocks of
1MB and 1 block of 0.5MB. With this design, when you were sending a GET
request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client.
With Garage v0.8, we integrated a block streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from
the storage node. We can visually represent the difference as follow:
![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
As our default block size is only 1MB, the difference will be very small on
fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However,
on a very slow network (or a very congested link with many parallel requests
handled), the impact can be much more important: at 5Mbps, it takes 1.6 seconds
to transfer our 1MB block, and streaming could heavily improve user experience.
We wanted to see if this theory holds in practice: we simulated a low latency
but slow network on mknet and did some requests with (garage v0.8 beta) and
without (garage v0.7) block streaming. We also added Minio as a reference. To
benchmark this behavior, we wrote a small test named
[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
its results are depicted in the following figure.
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
Garage v0.7, which does not support block streaming, features TTFB between 1.6s
and 2s, which corresponds to the theoretical time to transfer the full block.
On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the
streaming feature (the lowest value is 43 ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but smaller
than 1MB.
**Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performances to writing directly on the
filesystem, and observed that Garage was slower (eg.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we put costly processing like hashing on a dedicated thread and did
many compute optimization
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
`v0.8 beta 1`. We also noted logic we wrote (to better control resource usage
and detect errors, like semaphores or timeouts) was artificially limiting
performances. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
`v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we
write a block. We know that this is expensive and did a test build without any
`fsync` call ([see the
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
that will not be merged, just to assess the impact of `fsync`. We refer to it
as `no-fsync` in the following plot.
*A note about fsync: for performance reasons, operating systems often do not
write directly to the disk when a process creates or updates a file in your
filesystem, instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush the
writes on the disk, data will be lost. To ensure that a write is effectively
written on disk, you must use the
[fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it
will block until your file or directory has been written from your volatile
memory to your persisting storage device. Additionally, the exact semantic of
fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres,
[they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that on Garage, we are currently working on our "fsync" policy and thus, for
now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistency on this point (which we describe in the following
and plan to solve).*
To assess performance improvements, we used the benchmark tool
[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)
Minio, our ground truth, features the best performances in this test.
Considering Garage, we observe that each improvement we made has a visible
impact on its performances. We also note that we have a progress margin in
terms of performances compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference.
## A myriad of objects
Object storage systems do not handle a single object but a myriad of them:
Amazon claims to handle trillions of objects on their platform, and Red Hat
communicates about Ceph being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed,
removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how
well the best one scale to a million objects.
**Testing metadata engines** - With Garage, we chose to not store metadata
directly on the filesystem, like Minio for example, but in an on-disk fancy
B-Tree structure, in other words, in an embedded database engine. Until now,
the only available option was [sled](https://sled.rs/), but we started having
serious issues with it, and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction semantic over the features we expect from our
database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: lmdb
([heed](https://github.com/meilisearch/heed)) and sqlite
([rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrarily to sled, we have never run them in production
for a long time.**
Similarly to the impact of fsync on block writing, each database engine we use
has its own policy with fsync. Sled flushes its write every 2 seconds by
default, this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
lmdb by default does an `fsync` on each write, on early tests it led to very
slow resynchronizations between nodes. We added 2 flags:
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16)
which deactivate fsync. On sqlite, it is also possible to deactivate fsync with
`pragma synchronous = off;`, but we did not start any optimization work on it:
our sqlite implementation fsync all the data on the disk. Additionally, we are
using these engines through a Rust binding that had to do some tradeoff on the
concurrency part. **Our comparison will not reflect the raw performances of
these database engines, but instead, our integration choices.**
Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, ie. handling tiny files. We chose again minio/warp but we
configure it with the smallest possible object size supported by warp, 256
bytes, to put some pressure on the metadata engine. We evaluate sled twice:
with its default configuration, and with a configuration where we set a flush
interval of 10 minutes to disable fsync.
*Note that S3 has not been designed for such small objects; a regular database,
like Cassandra, would be more appropriate for such workloads. This test has
only been designed to stress our metadata engine, it is not indicative of
real-world performances.*
![Plot of our metadata engines comparison with Warp](db_engine.png)
Unsurprisingly, we observe abysmal performances for sqlite, the engine we have
the less tested and kept fsync for each write. lmdb performs twice better than
sled in its default version and 60% better than the "no fsync" version in our
benchmark. Furthermore, and not depicted on these plots, LMDB uses way less
disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that sqlite is not ready for production
workloads, LMDB looks very promising both in terms of performances and resource
usage, it is a very good candidate for Garage's default metadata engine in the
future, and we need to define a data policy for Garage that would help us
arbitrate between performances and durability.
*To fsync or not to fsync? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then losing it.
Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare
situations where 1 node is down and the 2 others validated the write and then
lost power can occur, what is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes,
should we also expect that we don't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same
place without a UPS? Historically, it seems that Minio developers also accepted
some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536),
[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata are persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability. Following this
observation, some people asked us how scalable Garage is. If answering this
question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions:
Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview),
which is 4 orders of magnitude more than our current target. Of course, their
benchmarking setup has nothing in common with ours, and their tests are way
more exhaustive.
We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
It concurrently sends a defined number of very tiny objects (8192 objects of 16
bytes by default) and measures the time it took. It repeats this step a given
number of times (128 by default) to effectively create a certain number of
objects on the target cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how many times it took to Garage and Minio to handle
each batch.
Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**.
Minio has no metadata engine, it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on
the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
fsync for our metadata engine, minio has some fsync logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a
reference to understand what is our performance budget for each part of our
software.
Conversely, Garage has an optimization for small objects. Below 3KB, a block is
not created on the filesystem but the object is directly stored inline in the
metadata engine. In the future, we plan to evaluate how Garage behaves with
3KB+ objects at scale, probably way closer to Minio, as it will have to create
an inode for each object. For now, we limit ourselves to evaluating our
metadata engine and thus focus only on 16-byte objects.
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x times
slower per batch). We also note that, past 200k objects, Minio batch
completion time is constant as Garage's one remains linear: it could be
interesting to know if Garage batch's completion time would cross Minio's one
for a very large number of objects. If we reason per object, both Minio and
Garage performances remain very good: it takes respectively around 20ms and
5ms to create an object. At 100 Mbps, if you upload a 10MB file, the
upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases
handling the object metadata is only a fraction of the upload time. The
only cases where you could notice it would be if you upload a lot of very
small files at once, which again, is an unusual usage of the S3 API.
Next, we focus on Garage's data only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1. batch completion time is linear with the
number of objects in the bucket and 2. measurements are dispersed, at least
more than Minio. We discussed the first point previously but not the second
one on measurement dispersion. This instability could be an issue as it could
be a symptom of what we saw with some other experiments in this machine:
sometimes it freezes under heavy I/O operations. Such freezes could lead to
request timeouts and failures. If it occurs on our testing computer, it will
occur on other servers too: it could be interesting to better understand this
issue, document how to avoid it, or change how we handle our I/O. At the same
time, this was a very stressful test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes!
To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
116k objects. This bucket contains real data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos,
audios, documents...). Thanks to this benchmark, we have identified two points
of vigilance: putting object duration seems linear with the number of existing
objects in the cluster, and we have some volatility in our measured data that
could be a symptom of our system freezing under the load. Despite these two
points, we are confident that Garage could scale way above 1M+ objects, but it
remains to be proved!
## In an unpredictable world, stay resilient
Supporting a variety of network properties and computers, especially ones that
were not designed for software-defined storage or even server purposes, is the
core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium - if you are reading this,
congratulation, you fetched this webpage from it! That's why we are very
careful that our internal protocol (named RPC protocol in our documentation)
remains as lightweight as possible. For this analysis, we quantify how network
latency and the number of nodes in the cluster impact S3 main requests
duration.
**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency is in the 50ms range between
nodes. When latency is not negligible, you will observe that request completion
time is a factor of the observed latency. That's expected: in many cases, the
node of the cluster you are contacting can not directly answer your request, it
needs to reach other nodes of the cluster to get your information. Each
sequential RPC adds to the final S3 request duration, which can quickly become
expensive. This ratio between request duration and network latency is what we
refer to as *latency amplification*.
For example, on Garage, a GetObject request does two sequential calls: first,
it asks for the descriptor of the requested object containing the block list of
the requested object, then it retrieves its blocks. We can expect that the
request duration of a small GetObject request will be close to twice the
network latency.
We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but latency, all our requests
involving an object are made on a tiny file of around 16 bytes. Our benchmark
tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and
RemoveObject. Its results are plotted here:
![Latency amplification](amplification.png)
As Garage has been optimized for this use case from the beginning, we don't see
any significant evolution from one version to another (garage v0.7.3 and garage
v0.8.0 beta here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). It is understandable: Minio has not been designed for
environments with high latencies, you are expected to build your clusters in
the same datacenter, and then possibly connect them with their asynchronous
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*
**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or experiment with unusual
system architecture, you might want to deploy a hundredth of Garage nodes.
However, in some distributed systems, the number of nodes in the cluster will
impact performance. Theoretically, our protocol inspired by distributed
hashtables (DHT) should scale fairly well but we never took the time to test it
with a hundredth of nodes before.
This time, we did our test directly on Grid5000 with 6 physical servers spread
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
to 65 instances of Garage simultaneously (for a total of 390 nodes). The
network between the physical server is the dedicated network provided by
Grid5000 operators. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation: we are aware this is a
weakness of this test. We still think that this test can be relevant as, at
each step in the test, each instance of Garage has 83% (5/6) of its connections
that are made over a real network. To benchmark each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:
![Impact of response time with bigger clusters](complexity.png)
Up to 250 nodes observed response times remain constant. After this threshold,
results become very noisy. By looking at the server resource usage, we saw
that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side but we have simply exhausted the resource
of our testing nodes. In the future, we would like to run this experiment
again, but on way more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work.
## Conclusion and Future work
During this work, we identified some sensitive points on Garage we will
continue working on: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
our components, our new metadata engines (lmdb, sqlite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) have a small
improvement margin.
At the same time, Garage has never been better: its next version (v0.8) will
see drastic improvements in terms of performance and reliability. We are
confident that it is already able to cover a wide range of deployment needs, up
to a hundredth of nodes and millions of objects.
In the future, on the performance aspect, we would like to evaluate the impact
of introducing an SRPT scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, and make a deeper and larger review of the
state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to
learn from them, and finally, benchmark Garage at scale with possibly multiple
terabytes of data and billions of objects on long-lasting experiments.
In the meantime, stay tuned: we have released
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
and we are working on proving and explaining our layout algorithm
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also
working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon introduce officially a new API (as a technical preview) named K2V
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)
existence. This tool is far more complex than our set of scripts, but we know
that it is also way more versatile.
[^ref2]: The program name contains the word "billion" and we only tested Garage
up to 1 "million" object, this is not a typo, we were just a little bit too
enthusiast when we wrote it.
<style>
.footnote-definition p { display: inline; }
</style>