diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md index e240c18..d020be4 100644 --- a/content/blog/2022-perf/index.md +++ b/content/blog/2022-perf/index.md @@ -375,29 +375,30 @@ that remains to be proven. ## In an unpredictable world, stay resilient Supporting a variety of network properties and computers, especially ones that -were not designed for software-defined storage or even server purposes, is the +were not designed for software-defined storage or even for server purposes, is the core value proposition of Garage. For example, our production cluster is -hosted [on refurbished Lenovo Thinkcentre Tiny Desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) -behind consumer-grade fiber links across France and Belgium - if you are reading this, -congratulation, you fetched this webpage from it! That's why we are very +hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) +behind consumer-grade fiber links across France and Belgium (if you are reading this, +congratulation, you fetched this webpage from it!). That's why we are very careful that our internal protocol (named RPC protocol in our documentation) remains as lightweight as possible. For this analysis, we quantify how network -latency and the number of nodes in the cluster impact S3 main requests -duration. +latency and the number of nodes in the cluster impact the duration of the most +important kinds of S3 requests. **Latency amplification** - With the kind of networks we use (consumer-grade fiber links across the EU), the observed latency is in the 50ms range between nodes. When latency is not negligible, you will observe that request completion -time is a factor of the observed latency. That's expected: in many cases, the -node of the cluster you are contacting can not directly answer your request, it -needs to reach other nodes of the cluster to get your information. Each -sequential RPC adds to the final S3 request duration, which can quickly become +time is a factor of the observed latency. That's to be expected: in many cases, the +node of the cluster you are contacting can not directly answer your request, and +has to reach other nodes of the cluster to get the requested information. Each +of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become expensive. This ratio between request duration and network latency is what we refer to as *latency amplification*. For example, on Garage, a GetObject request does two sequential calls: first, -it asks for the descriptor of the requested object containing the block list of -the requested object, then it retrieves its blocks. We can expect that the +it fetches the descriptor of the requested object, which contains a reference +to the first block of data, and then only in a second step it can start retrieving data blocks +from storage nodes. We can therefore expect that the request duration of a small GetObject request will be close to twice the network latency. @@ -406,80 +407,81 @@ We tested this theory with another benchmark of our own named which does a single request at a time on an endpoint and measures its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark -tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and -RemoveObject. Its results are plotted here: +tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and +RemoveObject. Here are the results: ![Latency amplification](amplification.png) As Garage has been optimized for this use case from the beginning, we don't see -any significant evolution from one version to another (garage v0.7.3 and garage -v0.8.0 beta here). Compared to Minio, these values are either similar (for +any significant evolution from one version to another (Garage v0.7.3 and Garage +v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for ListObjects and ListBuckets) or way better (for GetObject, PutObject, and -RemoveObject). It is understandable: Minio has not been designed for -environments with high latencies, you are expected to build your clusters in -the same datacenter, and then possibly connect them with their asynchronous -[Bucket Replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) +RemoveObject). This can be easily understood by the fact that Minio has not been designed for +environments with high latencies. Instead, it expects to run on clusters that are buil allt +the same datacenter. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous +[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) feature. -*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) +*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."* -**A cluster with many nodes** - Whether you already have many compute nodes -with unused storage, need to store a lot of data, or experiment with unusual -system architecture, you might want to deploy a hundredth of Garage nodes. -However, in some distributed systems, the number of nodes in the cluster will -impact performance. Theoretically, our protocol inspired by distributed -hashtables (DHT) should scale fairly well but we never took the time to test it -with a hundredth of nodes before. -This time, we did our test directly on Grid5000 with 6 physical servers spread +**A cluster with many nodes** - Whether you already have many compute nodes +with unused storage, need to store a lot of data, or are experimenting with unusual +system architectures, you might be interested in deploying over a hundred Garage nodes. +However, in some distributed systems, the number of nodes in the cluster will +have an impact on performance. Theoretically, our protocol, which is inspired by distributed +hash tables (DHT), should scale fairly well, but until now, we never took the time to test it +with a hundred nodes or more. + +This test was ran directly on Grid5000 with 6 physical servers spread in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up -to 65 instances of Garage simultaneously (for a total of 390 nodes). The -network between the physical server is the dedicated network provided by -Grid5000 operators. Nodes on the same physical machine communicate directly -through the Linux network stack without any limitation: we are aware this is a -weakness of this test. We still think that this test can be relevant as, at +to 65 instances of Garage simultaneously, for a total of 390 nodes. The +network between physical servers is the dedicated network provided by +the Grid5000 operators. Nodes on the same physical machine communicate directly +through the Linux network stack without any limitation. We are aware that this is a +weakness of this test, but we still think that this test can be relevant as, at each step in the test, each instance of Garage has 83% (5/6) of its connections -that are made over a real network. To benchmark each cluster size, we used +that are made over a real network. To measure performances for each cluster size, we used [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) again: ![Impact of response time with bigger clusters](complexity.png) -Up to 250 nodes observed response times remain constant. After this threshold, +Up to 250 nodes, we observed response times that remain constant. After this threshold, results become very noisy. By looking at the server resource usage, we saw that their load started to become non-negligible: it seems that we are not -hitting a limit on the protocol side but we have simply exhausted the resource +hitting a limit on the protocol side, but have simply exhausted the resource of our testing nodes. In the future, we would like to run this experiment -again, but on way more physical nodes, to confirm our hypothesis. For now, we +again, but on many more physical nodes, to confirm our hypothesis. For now, we are confident that a Garage cluster with 100+ nodes should work. ## Conclusion and Future work -During this work, we identified some sensitive points on Garage we will -continue working on: our data durability target and interaction with the +During this work, we identified some sensitive points on Garage, +on which we will have to continue working: our data durability target and interaction with the filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across -our components, our new metadata engines (LMDB, SQLite) still need some testing -and tuning, and we know that raw I/O (GetObject, PutObject) have a small +our components; our new metadata engines (LMDB, SQLite) still need some testing +and tuning; and we know that raw I/O performances (GetObject and PutObject for large objects) have a small improvement margin. -At the same time, Garage has never been better: its next version (v0.8) will +At the same time, Garage has never been in better shape: its next version (version 0.8) will see drastic improvements in terms of performance and reliability. We are -confident that it is already able to cover a wide range of deployment needs, up -to a hundredth of nodes and millions of objects. +confident that Garage is already able to cover a wide range of deployment needs, up +to a over a hundred nodes and millions of objects. In the future, on the performance aspect, we would like to evaluate the impact of introducing an SRPT scheduler ([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data -durability policy and implement it, and make a deeper and larger review of the -state of the art (minio, ceph, swift, openio, riak cs, seaweedfs, etc.) to -learn from them, and finally, benchmark Garage at scale with possibly multiple +durability policy and implement it, make a deeper and larger review of the +state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to +learn from them and, lastly, benchmark Garage at scale with possibly multiple terabytes of data and billions of objects on long-lasting experiments. In the meantime, stay tuned: we have released @@ -488,19 +490,19 @@ and we are working on proving and explaining our layout algorithm ([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also working on a Python SDK for Garage's administration API ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will -soon introduce officially a new API (as a technical preview) named K2V +soon officially introduce a new API (as a technical preview) named K2V ([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)). ## Notes -[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) -existence. This tool is far more complex than our set of scripts, but we know -that it is also way more versatile. +[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s +existence. Jepsen is far more complex than our set of scripts, but +it is also way more versatile. -[^ref2]: The program name contains the word "billion" and we only tested Garage -up to 1 "million" object, this is not a typo, we were just a little bit too -enthusiast when we wrote it. +[^ref2]: The program name contains the word "billion", although we only tested Garage +up to 1 million objects: this is not a typo, we were just a little bit too +enthusiastic when we wrote it ;)