that remains to be proven.

## In an unpredictable world, stay resilient
Supporting a variety of network properties and computers, especially ones that
were not designed for software-defined storage or even for server purposes, is the
core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium (if you are reading this,
congratulations, you fetched this webpage from it!). That's why we are very
careful that our internal protocol (named "RPC protocol" in our documentation)
remains as lightweight as possible. For this analysis, we quantify how network
latency and the number of nodes in the cluster impact the duration of the most
important kinds of S3 requests.

**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency between nodes is in the 50 ms
range. When latency is not negligible, you will observe that request completion
time is a multiple of the observed latency. That's to be expected: in many
cases, the node of the cluster you are contacting cannot directly answer your
request, and has to reach other nodes of the cluster to get the requested
information. Each of these sequential remote procedure calls - or RPCs - adds
to the final S3 request duration, which can quickly become expensive. This
ratio between request duration and network latency is what we refer to as
*latency amplification*.

For example, on Garage, a GetObject request does two sequential calls: first,
it fetches the descriptor of the requested object, which contains a reference
to the first block of data, and only then, in a second step, can it start
retrieving data blocks from storage nodes. We can therefore expect that the
duration of a small GetObject request will be close to twice the network
latency.
We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat),
which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but in latency, all our requests
involving an object are made on a tiny file of around 16 bytes. Our benchmark
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject,
GetObject and RemoveObject.
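
If you want to reproduce a similar measurement yourself, a minimal version of
such a probe fits in a few lines of Python with boto3. This is not the actual
s3lat code; the endpoint URL, credentials, bucket, and key below are
placeholders to adapt to your own deployment:

```python
import time
import boto3  # pip install boto3

# Placeholders: point these at your own S3-compatible endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",  # e.g. a local Garage node
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)
BUCKET, KEY, BODY = "test-bucket", "tiny-object", b"0" * 16  # ~16-byte object

def timed(label, fn):
    t0 = time.monotonic()
    fn()
    print(f"{label}: {(time.monotonic() - t0) * 1000:.1f} ms")

# One request at a time, on a tiny object: we measure latency, not bandwidth.
timed("ListBuckets",  lambda: s3.list_buckets())
timed("PutObject",    lambda: s3.put_object(Bucket=BUCKET, Key=KEY, Body=BODY))
timed("ListObjects",  lambda: s3.list_objects_v2(Bucket=BUCKET))
timed("GetObject",    lambda: s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())
timed("RemoveObject", lambda: s3.delete_object(Bucket=BUCKET, Key=KEY))
```
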
Here are the results:

![Latency amplification](amplification.png)

As Garage has been optimized for this use case from the beginning, we don't see
any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This is easily understood: Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that
are all built within the same datacenter. In a multi-DC setup, the different
clusters can then be interconnected with its asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.

*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/),
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*
**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or are experimenting with
unusual system architectures, you might be interested in deploying over a
hundred Garage nodes. However, in some distributed systems, the number of nodes
in the cluster will have an impact on performance. Theoretically, our protocol,
which is inspired by distributed hash tables (DHT), should scale fairly well,
but until now, we had never taken the time to test it with a hundred nodes or
more.
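
As an illustration of why a DHT-inspired design can stay flat as the cluster
grows, here is a hedged sketch of rendezvous hashing in Python. Garage's actual
layout algorithm is more sophisticated than this, but the key property is the
same: any node can compute an object's replica set locally, so routing a
request costs no extra RPC round trips as nodes are added.

```python
import hashlib

def replica_nodes(key: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Rendezvous (highest-random-weight) hashing: score every node against
    the key and keep the top `replicas`. This is a purely local computation,
    with no lookup RPCs, so the request path does not grow with cluster size."""
    return sorted(
        nodes,
        key=lambda node: hashlib.blake2b(f"{node}/{key}".encode()).digest(),
        reverse=True,
    )[:replicas]

cluster = [f"node{i}" for i in range(390)]
print(replica_nodes("my-bucket/my-object", cluster))
```
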
This test was run directly on Grid5000 with 6 physical servers spread
across 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran
up to 65 instances of Garage simultaneously, for a total of 390 nodes. The
network between physical servers is the dedicated network provided by
the Grid5000 operators. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation. We are aware that this
is a weakness of the test, but we still think it is relevant because, at each
step of the test, 83% (5/6) of each Garage instance's connections are made over
a real network. To measure performance for each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:

![Impact of cluster size on response time](complexity.png)
Up to 250 nodes, the observed response times remain constant. After this
threshold, results become very noisy. Looking at the resource usage of our
servers, we saw that their load started to become non-negligible: it seems that
we are not hitting a limit on the protocol side, but have simply exhausted the
resources of our testing machines. In the future, we would like to run this
experiment again, but on many more physical nodes, to confirm our hypothesis.
For now, we are confident that a Garage cluster with 100+ nodes should work.

## Conclusion and Future work

During this work, we identified some sensitive points of Garage on which we
will have to keep working: our data durability target and its interaction with
the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet homogeneous
across our components; our new metadata engines (LMDB, SQLite) still need some
testing and tuning; and we know that raw I/O performance (GetObject and
PutObject for large objects) still has a small margin for improvement.
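
To give an idea of what is at stake with these flags, here is a minimal,
Linux-specific sketch contrasting a buffered write with an `O_DSYNC` write;
the file path is a placeholder and the measured gap depends entirely on your
hardware:

```python
import os
import time

def timed_write(label: str, flags: int, payload: bytes = b"x" * 4096) -> None:
    """Write one 4 KiB block and time it. With O_DSYNC, write() only returns
    once the data has reached stable storage; a plain buffered write may
    return while the data still sits in the page cache."""
    fd = os.open(f"/tmp/durability-{label}", os.O_WRONLY | os.O_CREAT | flags)
    t0 = time.monotonic()
    os.write(fd, payload)
    elapsed = time.monotonic() - t0
    os.close(fd)
    print(f"{label}: {elapsed * 1000:.2f} ms")

timed_write("buffered", 0)          # fast, durable only after a later flush
timed_write("o_dsync", os.O_DSYNC)  # durable before returning, much slower
```
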
At the same time, Garage has never been in better shape: its next version
(v0.8) will bring drastic improvements in terms of performance and reliability.
We are confident that Garage is already able to cover a wide range of
deployment needs, up to over a hundred nodes and millions of objects.
In the future, on the performance side, we would like to evaluate the impact
of introducing an SRPT scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, make a deeper and larger review of the
state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
learn from them and, lastly, benchmark Garage at scale with possibly multiple
terabytes of data and billions of objects in long-lasting experiments.
|
||||
|
||||
In the meantime, stay tuned: we have released
|
||||
|
@ -488,19 +490,19 @@ and we are working on proving and explaining our layout algorithm
|
|||
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also
|
||||
working on a Python SDK for Garage's administration API
|
||||
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
|
||||
soon introduce officially a new API (as a technical preview) named K2V
|
||||
soon officially introduce a new API (as a technical preview) named K2V
|
||||
([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
|
||||
|
||||
|
||||
## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
existence. Jepsen is far more complex than our set of scripts, but
it is also way more versatile.

[^ref2]: The program name contains the word "billion", although we only tested
Garage up to 1 million objects: this is not a typo, we were just a little bit
too enthusiastic when we wrote it ;)
<style>
.footnote-definition p { display: inline; }
</style>