## In an unpredictable world, stay resilient

Supporting a variety of network properties and computers, especially ones that
were not designed for software-defined storage or even for server purposes, is
the core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium (if you are reading
this, congratulations, you fetched this webpage from it!). That's why we are
very careful that our internal protocol (named RPC protocol in our
documentation) remains as lightweight as possible. For this analysis, we
quantify how network latency and the number of nodes in the cluster impact the
duration of the most important kinds of S3 requests.

**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency is in the 50ms range between
nodes. When latency is not negligible, you will observe that request completion
time is a multiple of the observed latency. That's to be expected: in many
cases, the node of the cluster you are contacting cannot directly answer your
request, and has to reach other nodes of the cluster to get the requested
information. Each of these sequential remote procedure calls - or RPCs - adds
to the final S3 request duration, which can quickly become expensive. This
ratio between request duration and network latency is what we refer to as
*latency amplification*.

For example, on Garage, a GetObject request does two sequential calls: first,
it fetches the descriptor of the requested object, which contains a reference
to the first block of data, and only then, in a second step, can it start
retrieving data blocks from storage nodes. We can therefore expect that the
duration of a small GetObject request will be close to twice the network
latency.

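As a back-of-the-envelope model (a sketch of the reasoning above, not actual
Garage code), each *sequential* RPC round-trip adds roughly one network latency
to the total request duration:

```python
# Back-of-the-envelope model of latency amplification, not actual Garage code:
# each sequential RPC round-trip adds roughly one network latency.

RTT_MS = 50  # observed inter-node latency on our consumer-grade fiber links

def expected_duration_ms(sequential_rpcs: int, rtt_ms: float = RTT_MS) -> float:
    """Lower bound on request duration; transfer and compute times ignored."""
    return sequential_rpcs * rtt_ms

# A small GetObject does ~2 sequential calls (descriptor, then data block):
print(expected_duration_ms(2))  # -> 100 ms, an amplification factor of 2
```
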
We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat),
which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but latency, all our requests
involving an object are made on a tiny file of around 16 bytes. Our benchmark
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject,
GetObject and RemoveObject. Here are the results:

![Latency amplification](amplification.png)

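For reference, here is a minimal sketch of how such a single-in-flight probe
can be reproduced with `boto3` (this is not the actual s3lat code; the endpoint
URL, credentials, and the pre-existing `test` bucket below are placeholders for
your own deployment):

```python
import statistics
import time

import boto3

# Placeholder endpoint and credentials: point this at your own S3 deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GKxxxx",
    aws_secret_access_key="xxxx",
)

BODY = b"0" * 16  # a tiny object: we measure latency, not bandwidth

def timed_ms(fn, repeat=64):
    """Issue fn one call at a time (no concurrency), return the median duration."""
    samples = []
    for _ in range(repeat):
        t0 = time.monotonic()
        fn()
        samples.append((time.monotonic() - t0) * 1000)
    return statistics.median(samples)

results = {
    "ListBuckets": timed_ms(lambda: s3.list_buckets()),
    "ListObjects": timed_ms(lambda: s3.list_objects_v2(Bucket="test")),
    "PutObject": timed_ms(lambda: s3.put_object(Bucket="test", Key="tiny", Body=BODY)),
    "GetObject": timed_ms(lambda: s3.get_object(Bucket="test", Key="tiny")["Body"].read()),
    # "RemoveObject" corresponds to DeleteObject in the S3 API.
    "RemoveObject": timed_ms(lambda: s3.delete_object(Bucket="test", Key="tiny")),
}
for endpoint, ms in results.items():
    print(f"{endpoint:>12}: {ms:.1f} ms")
```
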
As Garage has been optimized for this use case from the beginning, we don't see
any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This is easily understood: Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that
are all built in the same datacenter. In a multi-DC setup, the different
clusters could then be interconnected with Minio's asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.

*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*

**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or are experimenting with
unusual system architectures, you might be interested in deploying over a
hundred Garage nodes. However, in some distributed systems, the number of nodes
in the cluster will have an impact on performance. Theoretically, our protocol,
which is inspired by distributed hash tables (DHT), should scale fairly well,
but until now, we had never taken the time to test it with a hundred nodes or
more.

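To give an intuition of why DHT-inspired designs are expected to scale, here is
a toy consistent-hashing sketch. This is explicitly *not* Garage's actual
layout algorithm (which optimizes partition placement across nodes and zones),
but it shares the key property: every node knows the full ring, so locating the
replicas of an object is a cheap local computation, and the number of RPCs per
request does not grow with the cluster size.

```python
import bisect
import hashlib

def ring_hash(value: bytes) -> int:
    """Map a value onto the ring (a 128-bit hash space)."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=16).digest(), "big")

class Ring:
    """Toy consistent-hashing ring, for illustration only."""

    def __init__(self, nodes: list[str]):
        # Every node holds the full ring: there is no multi-hop routing.
        self.points = sorted((ring_hash(n.encode()), n) for n in nodes)

    def nodes_for(self, key: str, replicas: int = 3) -> list[str]:
        """Locating replicas is a local O(log n) binary search: the number of
        RPCs needed to serve a request does not depend on the cluster size."""
        i = bisect.bisect(self.points, (ring_hash(key.encode()), ""))
        return [self.points[(i + k) % len(self.points)][1] for k in range(replicas)]

ring = Ring([f"node{i}" for i in range(390)])  # same size as our Grid5000 test
print(ring.nodes_for("my-bucket/my-object"))
```
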
This test was run directly on Grid5000 with 6 physical servers spread across
3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to
65 instances of Garage simultaneously, for a total of 390 nodes. The network
between physical servers is the dedicated network provided by the Grid5000
operators. Nodes on the same physical machine communicate directly through the
Linux network stack without any limitation. We are aware that this is a
weakness of the test, but we still think it is relevant because, at each step
of the test, each instance of Garage makes 83% (5/6) of its connections over a
real network. To measure performance for each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:

![Impact of cluster size on response time](complexity.png)

Up to 250 nodes, the observed response times remained constant. After this
threshold, the results became very noisy. By looking at the server resource
usage, we saw that their load started to become non-negligible: it seems that
we are not hitting a limit on the protocol side, but have simply exhausted the
resources of our testing nodes. In the future, we would like to run this
experiment again, but on many more physical nodes, to confirm our hypothesis.
For now, we are confident that a Garage cluster with 100+ nodes should work.

## Conclusion and future work

During this work, we identified some sensitive points in Garage on which we
will have to continue working: our data durability target and our interaction
with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet
homogeneous across our components; our new metadata engines (LMDB, SQLite)
still need some testing and tuning; and we know that raw I/O performance
(GetObject and PutObject for large objects) still has a small margin for
improvement.

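To illustrate the kind of filesystem interaction we want to homogenize, here is
a minimal sketch in Python (Garage itself is written in Rust; this only shows
the general pattern, not our code) of what it takes for a write to actually
survive a power loss:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write a file so that it survives a crash: flush both the file contents
    and its directory entry to disk. Skipping either fsync weakens durability,
    which is why this behavior must be consistent across all components."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # flush the file's data and metadata to the device
    finally:
        os.close(fd)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)     # flush the directory entry itself
    finally:
        os.close(dirfd)

durable_write("block.bin", b"some data block")
```
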
At the same time, Garage has never been in better shape: its next version
(version 0.8) will see drastic improvements in terms of performance and
reliability. We are confident that Garage is already able to cover a wide range
of deployment needs, up to over a hundred nodes and millions of objects.

In the future, on the performance side, we would like to evaluate the impact of
introducing an SRPT (Shortest Remaining Processing Time) scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, make a deeper and larger review of the
state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
learn from them and, lastly, benchmark Garage at scale in long-lasting
experiments with possibly multiple terabytes of data and billions of objects.

In the meantime, stay tuned: we have released […], and we are working on
proving and explaining our layout algorithm
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)); we are also
working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon officially introduce a new API (as a technical preview) named K2V
([see K2V in our documentation for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).

## Notes

[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
existence. Jepsen is far more complex than our set of scripts, but
it is also way more versatile.

[^ref2]: The program name contains the word "billion", although we only tested
Garage up to 1 million objects: this is not a typo, we were just a little bit
too enthusiastic when we wrote it ;)

<style>
.footnote-definition p { display: inline; }
</style>