New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit 2ccde811e9 - Show all commits

View file

@ -375,29 +375,30 @@ that remains to be proven.
## In an unpredictable world, stay resilient
Supporting a variety of network properties and computers, especially ones that
were not designed for software-defined storage or even server purposes, is the
were not designed for software-defined storage or even for server purposes, is the
core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium - if you are reading this,
congratulation, you fetched this webpage from it! That's why we are very
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium (if you are reading this,
congratulation, you fetched this webpage from it!). That's why we are very
careful that our internal protocol (named RPC protocol in our documentation)
remains as lightweight as possible. For this analysis, we quantify how network
latency and the number of nodes in the cluster impact S3 main requests
duration.
latency and the number of nodes in the cluster impact the duration of the most
important kinds of S3 requests.
**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency is in the 50ms range between
nodes. When latency is not negligible, you will observe that request completion
time is a factor of the observed latency. That's expected: in many cases, the
node of the cluster you are contacting can not directly answer your request, it
needs to reach other nodes of the cluster to get your information. Each
sequential RPC adds to the final S3 request duration, which can quickly become
time is a factor of the observed latency. That's to be expected: in many cases, the
node of the cluster you are contacting can not directly answer your request, and
has to reach other nodes of the cluster to get the requested information. Each
of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
expensive. This ratio between request duration and network latency is what we
refer to as *latency amplification*.
For example, on Garage, a GetObject request does two sequential calls: first,
it asks for the descriptor of the requested object containing the block list of
the requested object, then it retrieves its blocks. We can expect that the
it fetches the descriptor of the requested object, which contains a reference
to the first block of data, and then only in a second step it can start retrieving data blocks
from storage nodes. We can therefore expect that the
request duration of a small GetObject request will be close to twice the
network latency.
@ -406,80 +407,81 @@ We tested this theory with another benchmark of our own named
which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but latency, all our requests
involving an object are made on a tiny file of around 16 bytes. Our benchmark
tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and
RemoveObject. Its results are plotted here:
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
RemoveObject. Here are the results:
![Latency amplification](amplification.png)
As Garage has been optimized for this use case from the beginning, we don't see
any significant evolution from one version to another (garage v0.7.3 and garage
v0.8.0 beta here). Compared to Minio, these values are either similar (for
any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). It is understandable: Minio has not been designed for
environments with high latencies, you are expected to build your clusters in
the same datacenter, and then possibly connect them with their asynchronous
[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that are buil allt
the same datacenter. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/)
*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*
**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or experiment with unusual
system architecture, you might want to deploy a hundredth of Garage nodes.
However, in some distributed systems, the number of nodes in the cluster will
impact performance. Theoretically, our protocol inspired by distributed
hashtables (DHT) should scale fairly well but we never took the time to test it
with a hundredth of nodes before.
This time, we did our test directly on Grid5000 with 6 physical servers spread
**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or are experimenting with unusual
system architectures, you might be interested in deploying over a hundred Garage nodes.
However, in some distributed systems, the number of nodes in the cluster will
have an impact on performance. Theoretically, our protocol, which is inspired by distributed
hash tables (DHT), should scale fairly well, but until now, we never took the time to test it
with a hundred nodes or more.
This test was ran directly on Grid5000 with 6 physical servers spread
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
to 65 instances of Garage simultaneously (for a total of 390 nodes). The
network between the physical server is the dedicated network provided by
Grid5000 operators. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation: we are aware this is a
weakness of this test. We still think that this test can be relevant as, at
to 65 instances of Garage simultaneously, for a total of 390 nodes. The
network between physical servers is the dedicated network provided by
the Grid5000 operators. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation. We are aware that this is a
weakness of this test, but we still think that this test can be relevant as, at
each step in the test, each instance of Garage has 83% (5/6) of its connections
that are made over a real network. To benchmark each cluster size, we used
that are made over a real network. To measure performances for each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:
![Impact of response time with bigger clusters](complexity.png)
Up to 250 nodes observed response times remain constant. After this threshold,
Up to 250 nodes, we observed response times that remain constant. After this threshold,
results become very noisy. By looking at the server resource usage, we saw
that their load started to become non-negligible: it seems that we are not
hitting a limit on the protocol side but we have simply exhausted the resource
hitting a limit on the protocol side, but have simply exhausted the resource
of our testing nodes. In the future, we would like to run this experiment
again, but on way more physical nodes, to confirm our hypothesis. For now, we
again, but on many more physical nodes, to confirm our hypothesis. For now, we
are confident that a Garage cluster with 100+ nodes should work.
## Conclusion and Future work
During this work, we identified some sensitive points on Garage we will
continue working on: our data durability target and interaction with the
During this work, we identified some sensitive points on Garage,
on which we will have to continue working: our data durability target and interaction with the
filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
our components, our new metadata engines (LMDB, SQLite) still need some testing
and tuning, and we know that raw I/O (GetObject, PutObject) have a small
our components; our new metadata engines (LMDB, SQLite) still need some testing
and tuning; and we know that raw I/O performances (GetObject and PutObject for large objects) have a small
improvement margin.
At the same time, Garage has never been better: its next version (v0.8) will
At the same time, Garage has never been in better shape: its next version (version 0.8) will
see drastic improvements in terms of performance and reliability. We are
confident that it is already able to cover a wide range of deployment needs, up
to a hundredth of nodes and millions of objects.
confident that Garage is already able to cover a wide range of deployment needs, up
to a over a hundred nodes and millions of objects.
In the future, on the performance aspect, we would like to evaluate the impact
of introducing an SRPT scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, and make a deeper and larger review of the
state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to
learn from them, and finally, benchmark Garage at scale with possibly multiple
durability policy and implement it, make a deeper and larger review of the
state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
learn from them and, lastly, benchmark Garage at scale with possibly multiple
terabytes of data and billions of objects on long-lasting experiments.
In the meantime, stay tuned: we have released
@ -488,19 +490,19 @@ and we are working on proving and explaining our layout algorithm
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also
working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon introduce officially a new API (as a technical preview) named K2V
soon officially introduce a new API (as a technical preview) named K2V
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)
existence. This tool is far more complex than our set of scripts, but we know
that it is also way more versatile.
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
existence. Jepsen is far more complex than our set of scripts, but
it is also way more versatile.
[^ref2]: The program name contains the word "billion" and we only tested Garage
up to 1 "million" object, this is not a typo, we were just a little bit too
enthusiast when we wrote it.
[^ref2]: The program name contains the word "billion", although we only tested Garage
up to 1 million objects: this is not a typo, we were just a little bit too
enthusiastic when we wrote it ;)
<style>
.footnote-definition p { display: inline; }