New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit 26d396ba85 - Show all commits

View file

@ -27,7 +27,7 @@ Read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html),
We started a batch of tests on [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible testbed for experiment-driven research in all areas of computer science, under the [Open Access](https://www.grid5000.fr/w/Grid5000:Open-Access) program. During our tests, we used part of the following clusters: [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [econome](https://www.grid5000.fr/w/Nantes:Hardware#econome) to make a geo-distributed topology. We used the Grid5000 testbed only during our preliminary tests to identify issues when running Garage on many powerful servers, issues that we then reproduced in a controlled environment; don't be surprised then if Grid5000 is not mentioned often on our plots.
To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load.
To reproduce some environments locally, we have a small set of Python scripts named [mknet](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our needs[^ref1]. Most of the following tests where thus run locally with mknet on a single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of RAM, a 512GB SSD. In term of software, NixOS 22.05 with the 5.15.50 kernel is used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and `vm.dirty_ratio` have been reduced to `2` and `1` respectively as, with default values, the system tends to freeze when it is under heavy I/O load.
## Efficient I/O
@ -41,11 +41,11 @@ With Garage v0.8, we integrated a block streaming logic which allows the gateway
As our default block size is only 1MB, the difference will be very small on fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network. However, on a very slow network (or a very congested link with many parallel requests handled), the impact can be much more important: at 5Mbps, it takes 1.6 second to transfer our 1MB block, and streaming could heavily improve user experience.
We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference.
We wanted to see if this theory helds in practise: we simulated a low latency but slow network on mknet and did some request with (garage v0.8 beta) and without (garage v0.7) block streaming. We also added Minio as a reference. To benchmark this behavior, we wrote a small test named [s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb), its results are depicted on the following figure.
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
As planned, Garage v0.7 that does not support block streaming features TTFB between 1.6s and 2s, which correspond to the computed time to transfer the full block. On the other side of the plot, we see Garage v0.8 with very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our 2 implementations: we suppose that it does some form of batching, but on less than 1MB.
Garage v0.7, that does not support block streaming, features TTFB between 1.6s and 2s, which correspond to the theoretical time to transfer the full block. On the other side of the plot, Garage v0.8 has very low TTFB thanks to the streaming feature (the lowest value is 43 ms). Minio sits between the two Garage versions: we suppose that it does some form of batching, but smaller than 1MB.
**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performances to writing directly on the filesystem, and observed that Garage was slower (eg. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we put costly processing like hashing on a dedicated thread and did many compute optimization ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to `v0.8 beta 1`. We also noted logic we wrote (to better control resources usage and detect errors, like semaphore or timeouts) were artificially limiting performances. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive and did a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot.
@ -86,7 +86,7 @@ Following this observation, some people asked us how scalable Garage is. If answ
be sure that our metadata engine would be able to scale to million of objects. To put this target in context, it remains small compared to other industrial solutions:
Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 order of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive.
We wrote our own [benchmarking tool](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default).
We wrote our own benchmarking tool [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2] for this test. It concurrently sends a defined number of very tiny object (8192 objects of 16 bytes by default) and measures the time it took. It repeats this step a given number of time (128 by default) to effectively create a certain number of objects on the target cluster (1M by default).
On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. On the following plot, we show how many times it took to Garage and Minio to handle each batch.
Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**.
@ -129,7 +129,8 @@ This ratio between request duration and network latency is what we refer as *lat
For example, on Garage, a GetObject request does two sequential calls: first, it asks for the descriptor of the requested object containing the block list of the requested object, then it retrieves its blocks.
We can expect that the request duration of a small GetObject request will be close to twice the network latency.
On the following graph, we test experimentally some standard endpoints, including GetObject:
We tested this theory with another benchmark of our own named [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) that does a single request at a time on an endpoint and measure its response time. As we are not interested in bandwidth but latency, all our requests involving an object are made on a tiny file of around 16 bytes. Our benchmark tests 5 standard endpoints: ListBuckets, ListObjects, PutObject, GetObject and RemoveObject. Its results are plotted here:
![Latency amplification](amplification.png)
@ -139,9 +140,16 @@ It is understandable: Minio has not been designed for environment with high late
*Minio also has a [Multi-Site Active-Active Replication System](https://blog.min.io/minio-multi-site-active-active-replication/) but it is even more sensitive to latency: "Multi-site replication has increased latency sensitivity, as MinIO does not consider an object as replicated until it has synchronized to all configured remote targets. Replication latency is therefore dictated by the slowest link in the replication mesh."*
**Node count amplification** -
**A cluster with many nodes** - Whether you already have many compute nodes with unused storage, need to store lot of data, or experiment with unusual system architecture, you might want to deploy hundredth of Garage nodes. However, on some distributed systems, the number of nodes in the cluster will impact performances. Theoretically, our protocol inspired by distributed hashtables (DHT) should scale fairly well but we never took the time to test it with hundredth of nodes before.
![](complexity.png)
This time, we did our test directly on Grid5000 with 6 physical servers spread in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up to 65 instances of Garage simultaneously (for a total of 390 nodes). The network between the physical server is the dedicated network provided by Grid5000 operators. Nodes on the same physical machine communicate directly through the Linux network stack without any limitation: we are aware this is a weakness of this test. We still think that this test can be relevant as, at each step in the test, each instance of Garage has 83% (5/6) of its connections that are made over a real network. To benchmark each cluster size, we used [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) again:
![Impact of response time with bigger clusters](complexity.png)
Up to 250 nodes observed response times remain constant. After this threshold, results become very noisy.
By looking at the server resource usage, we saw that their load started to become non negligible: it seems that we are not hitting a limit at the protocol side but we have simply exhausted the ressource of our testing nodes. In the future, we would like to run this experiment again, but on way more physical nodes, to confirm our hypothesis.
For now, we are confident that a Garage cluster with 100+ nodes should definitely work.
## Future work
@ -154,4 +162,12 @@ It is understandable: Minio has not been designed for environment with high late
- try to better understand ecosystem (riak cs, minio, ceph, swift) -> some knowledge to get
[^1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile.
## Notes
[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen) existence. This tool is far more complex than our set of scripts, but we know that it is also way more versatile.
[^ref2]: The program name contains the word "billion" and we only tested Garage up to 1 "million" object, this is not a typo, we were just a little bit too enthusiast when we wrote it.
<style>
.footnote-definition p { display: inline; }
</style>