## In an unpredictable world, stay resilient

Supporting a variety of network properties and computers, especially ones that
were not designed for software-defined storage or even for server purposes, is
the core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
behind consumer-grade fiber links across France and Belgium (if you are reading
this, congratulations, you fetched this webpage from it!). That's why we are
very careful that our internal protocol (named RPC protocol in our
documentation) remains as lightweight as possible. For this analysis, we
quantify how network latency and the number of nodes in the cluster impact the
duration of the most important kinds of S3 requests.

**Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency is in the 50ms range between
nodes. When latency is not negligible, you will observe that request completion
time is a multiple of the observed latency. That's to be expected: in many
cases, the node of the cluster you are contacting cannot directly answer your
request, and has to reach other nodes of the cluster to get the requested
information. Each of these sequential remote procedure calls - or RPCs - adds
to the final S3 request duration, which can quickly become expensive. This
ratio between request duration and network latency is what we refer to as
*latency amplification*.

For example, on Garage, a GetObject request does two sequential calls: first,
it fetches the descriptor of the requested object, which contains a reference
to the first block of data, and only then, in a second step, can it start
retrieving data blocks from storage nodes. We can therefore expect that the
duration of a small GetObject request will be close to twice the network
latency.

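As a back-of-the-envelope model (a sketch of the reasoning above, not actual
Garage code), each *sequential* RPC round-trip adds roughly one network latency
to the total request duration:

```python
# Back-of-the-envelope model of latency amplification, not actual Garage code:
# each sequential RPC round-trip adds roughly one network latency.

RTT_MS = 50  # observed inter-node latency on our consumer-grade fiber links

def expected_duration_ms(sequential_rpcs: int, rtt_ms: float = RTT_MS) -> float:
    """Lower bound on request duration; transfer and compute times ignored."""
    return sequential_rpcs * rtt_ms

# A small GetObject does ~2 sequential calls (descriptor, then data block):
print(expected_duration_ms(2))  # -> 100 ms, an amplification factor of 2
```
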
We tested this theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat),
which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but latency, all our requests
involving an object are made on a tiny file of around 16 bytes. Our benchmark
tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject,
GetObject and RemoveObject. Here are the results:

![Latency amplification](amplification.png)

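For reference, here is a minimal sketch of how such a single-in-flight probe
can be reproduced with `boto3` (this is not the actual s3lat code; the endpoint
URL, credentials, and the pre-existing `test` bucket below are placeholders for
your own deployment):

```python
import statistics
import time

import boto3

# Placeholder endpoint and credentials: point this at your own S3 deployment.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GKxxxx",
    aws_secret_access_key="xxxx",
)

BODY = b"0" * 16  # a tiny object: we measure latency, not bandwidth

def timed_ms(fn, repeat=64):
    """Issue fn one call at a time (no concurrency), return the median duration."""
    samples = []
    for _ in range(repeat):
        t0 = time.monotonic()
        fn()
        samples.append((time.monotonic() - t0) * 1000)
    return statistics.median(samples)

results = {
    "ListBuckets": timed_ms(lambda: s3.list_buckets()),
    "ListObjects": timed_ms(lambda: s3.list_objects_v2(Bucket="test")),
    "PutObject": timed_ms(lambda: s3.put_object(Bucket="test", Key="tiny", Body=BODY)),
    "GetObject": timed_ms(lambda: s3.get_object(Bucket="test", Key="tiny")["Body"].read()),
    # "RemoveObject" corresponds to DeleteObject in the S3 API.
    "RemoveObject": timed_ms(lambda: s3.delete_object(Bucket="test", Key="tiny")),
}
for endpoint, ms in results.items():
    print(f"{endpoint:>12}: {ms:.1f} ms")
```
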
As Garage has been optimized for this use case from the beginning, we don't see
any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This is easily understood: Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that
are all built in the same datacenter. In a multi-DC setup, the different
clusters could then be interconnected with Minio's asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.

*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*

**A cluster with many nodes** - Whether you already have many compute nodes
with unused storage, need to store a lot of data, or are experimenting with
unusual system architectures, you might be interested in deploying over a
hundred Garage nodes. However, in some distributed systems, the number of nodes
in the cluster will have an impact on performance. Theoretically, our protocol,
which is inspired by distributed hash tables (DHT), should scale fairly well,
but until now, we had never taken the time to test it with a hundred nodes or
more.

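To give an intuition of why DHT-inspired designs are expected to scale, here is
a toy consistent-hashing sketch. This is explicitly *not* Garage's actual
layout algorithm (which optimizes partition placement across nodes and zones),
but it shares the key property: every node knows the full ring, so locating the
replicas of an object is a cheap local computation, and the number of RPCs per
request does not grow with the cluster size.

```python
import bisect
import hashlib

def ring_hash(value: bytes) -> int:
    """Map a value onto the ring (a 128-bit hash space)."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=16).digest(), "big")

class Ring:
    """Toy consistent-hashing ring, for illustration only."""

    def __init__(self, nodes: list[str]):
        # Every node holds the full ring: there is no multi-hop routing.
        self.points = sorted((ring_hash(n.encode()), n) for n in nodes)

    def nodes_for(self, key: str, replicas: int = 3) -> list[str]:
        """Locating replicas is a local O(log n) binary search: the number of
        RPCs needed to serve a request does not depend on the cluster size."""
        i = bisect.bisect(self.points, (ring_hash(key.encode()), ""))
        return [self.points[(i + k) % len(self.points)][1] for k in range(replicas)]

ring = Ring([f"node{i}" for i in range(390)])  # same size as our Grid5000 test
print(ring.nodes_for("my-bucket/my-object"))
```
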
This test was run directly on Grid5000 with 6 physical servers spread across
3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to
65 instances of Garage simultaneously, for a total of 390 nodes. The network
between physical servers is the dedicated network provided by the Grid5000
operators. Nodes on the same physical machine communicate directly through the
Linux network stack without any limitation. We are aware that this is a
weakness of the test, but we still think it is relevant because, at each step
of the test, each instance of Garage makes 83% (5/6) of its connections over a
real network. To measure performance for each cluster size, we used
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
again:

![Impact of cluster size on response time](complexity.png)

Up to 250 nodes, the observed response times remained constant. After this
threshold, the results became very noisy. By looking at the server resource
usage, we saw that their load started to become non-negligible: it seems that
we are not hitting a limit on the protocol side, but have simply exhausted the
resources of our testing nodes. In the future, we would like to run this
experiment again, but on many more physical nodes, to confirm our hypothesis.
For now, we are confident that a Garage cluster with 100+ nodes should work.

## Conclusion and future work

During this work, we identified some sensitive points in Garage on which we
will have to continue working: our data durability target and our interaction
with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) are not yet
homogeneous across our components; our new metadata engines (LMDB, SQLite)
still need some testing and tuning; and we know that raw I/O performance
(GetObject and PutObject for large objects) still has a small margin for
improvement.

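To illustrate the kind of filesystem interaction we want to homogenize, here is
a minimal sketch in Python (Garage itself is written in Rust; this only shows
the general pattern, not our code) of what it takes for a write to actually
survive a power loss:

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write a file so that it survives a crash: flush both the file contents
    and its directory entry to disk. Skipping either fsync weakens durability,
    which is why this behavior must be consistent across all components."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # flush the file's data and metadata to the device
    finally:
        os.close(fd)
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)     # flush the directory entry itself
    finally:
        os.close(dirfd)

durable_write("block.bin", b"some data block")
```
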
At the same time, Garage has never been in better shape: its next version
(version 0.8) will see drastic improvements in terms of performance and
reliability. We are confident that Garage is already able to cover a wide range
of deployment needs, up to over a hundred nodes and millions of objects.

In the future, on the performance side, we would like to evaluate the impact of
introducing an SRPT (Shortest Remaining Processing Time) scheduler
([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
durability policy and implement it, make a deeper and larger review of the
state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
learn from them and, lastly, benchmark Garage at scale in long-lasting
experiments with possibly multiple terabytes of data and billions of objects.

In the meantime, stay tuned: we have released […], and we are working on
proving and explaining our layout algorithm
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)); we are also
working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon officially introduce a new API (as a technical preview) named K2V
([see K2V in our documentation for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).

## Notes

[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
existence. Jepsen is far more complex than our set of scripts, but
it is also way more versatile.

[^ref2]: The program name contains the word "billion", although we only tested
Garage up to 1 million objects: this is not a typo, we were just a little bit
too enthusiastic when we wrote it ;)

<style>
.footnote-definition p { display: inline; }
</style>