New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit af589aacd6


@@ -11,7 +11,7 @@ date=2022-09-26
---
## ⚠️ Disclaimer
The following results must be taken with a grain of salt due to some limitations that are inherent to any benchmark. We try to reference them as exhaustively as possible in this section, but other limitations might exist.
Most of our tests are done on simulated networks that cannot represent all the diversity of real networks (dynamic packet drop, jitter, latency, all possibly correlated with throughput or any other external event). We also limited ourselves to very small workloads that are not representative of a production cluster. Furthermore, we only benchmarked some very specific aspects of Garage, on which we are currently working: our results are thus not an overview of the performance of the software as a whole.
@@ -31,7 +31,11 @@ To reproduce some environments locally, we have a small set of Python scripts na
## Efficient I/O
**Time To First Byte** - One specificity of Garage is that we implemented S3 web endpoints, with the idea of making it the platform of choice to publish your static website. When publishing a website, one metric you observe is Time To First Byte (TTFB), as it impacts the perceived reactivity of your website. On Garage, time to first byte was a bit high.
This is not surprising: until now, the smallest level of granularity was the block. Blocks are 1MB chunks (this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)) of a given object. For example, a 4.5MB object will be split into 4 blocks of 1MB and 1 block of 0.5MB. With this design, when you sent a GET request, the first block had to be fully retrieved by the gateway node from the storage node before it could start sending any data to the client.
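To make the block layout concrete, here is a minimal Python sketch of the splitting rule described above (the function name and the exact 1MB constant are illustrative assumptions, not Garage's actual code, which is written in Rust):

```python
BLOCK_SIZE = 1024 * 1024  # default block size: 1MB (configurable in Garage)

def split_into_blocks(object_size: int) -> list[int]:
    """Return the sizes of the blocks a given object is cut into."""
    full_blocks, remainder = divmod(object_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full_blocks + ([remainder] if remainder else [])

# A 4.5MB object gives 4 blocks of 1MB plus 1 block of 0.5MB:
print(split_into_blocks(int(4.5 * BLOCK_SIZE)))
```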
With Garage v0.8, we integrated a block streaming logic that allows the gateway to send the beginning of a block without having to wait for the full block from the storage node. We can visually represent the difference as follows:
![A schema depicting how streaming improves the delivery of a block](schema-streaming.png)
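The same difference can be sketched in a few lines of Python (Garage itself is written in Rust; the 16KB chunk size and the function names below are purely illustrative):

```python
BLOCK_SIZE = 1024 * 1024  # one full block
CHUNK_SIZE = 16 * 1024    # hypothetical streaming granularity, for illustration only

def forward_block_buffered(storage_node, client):
    """Garage <= v0.7 (simplified): the gateway buffers the whole block
    before sending anything, so TTFB is at least one full block transfer."""
    block = storage_node.read(BLOCK_SIZE)
    client.write(block)

def forward_block_streamed(storage_node, client):
    """Garage v0.8 (simplified): each chunk is forwarded as soon as it
    arrives, so the client receives its first bytes almost immediately."""
    remaining = BLOCK_SIZE
    while remaining > 0:
        chunk = storage_node.read(min(CHUNK_SIZE, remaining))
        if not chunk:
            break
        client.write(chunk)
        remaining -= len(chunk)
```

Here `storage_node` and `client` stand for any file-like objects (sockets, pipes...): the point is only that the second version never holds a full block in memory before answering.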
@@ -43,9 +47,9 @@ We wanted to see if this theory holds in practice: we simulated a low latency bu
As planned, Garage v0.7, which does not support block streaming, features a TTFB between 1.6s and 2s, which corresponds to the computed time needed to transfer a full block. On the other side of the plot, we see Garage v0.8 with a very low TTFB thanks to our streaming approach (the lowest value is 43 ms). Minio sits between our two implementations: we suppose that it does some form of batching, but on less than 1MB.
**Read/write throughput** - As soon as we publicly released Garage, people started benchmarking it, comparing its performance to writing directly on the filesystem, and observed that Garage was slower (e.g. [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the situation, we moved costly processing like hashing to a dedicated thread and made many compute optimizations ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to `v0.8 beta 1`. We also noted that logic we wrote (to better control resource usage and detect errors, like semaphores or timeouts) was artificially limiting performance. In another iteration, we made this logic less restrictive at the cost of higher resource consumption under load ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in `v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we write a block. We know that this is expensive, so we made a test build without any `fsync` call ([see the commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7)) that will not be merged, just to assess the impact of `fsync`. We refer to it as `no-fsync` in the following plot.
*A note about fsync: for performance reasons, operating systems often do not write directly to the disk when a process creates or updates a file on your filesystem; instead, the write is kept in memory and flushed later in a batch with other writes. If a power loss occurs before the OS has time to flush the writes to the disk, data will be lost. To ensure that a write is effectively written on disk, you must use the [fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it will block until your file or directory has been written from volatile RAM to your persistent storage device. Additionally, the exact semantics of fsync [differ from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) and, even on battle-tested software like Postgres, [they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). Note that on Garage, we are currently working on our "fsync" policy and thus, for now, you should expect limited data durability in case of power loss, as we are aware of some inconsistencies on this point (which we describe in the following and plan to solve).*
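For readers unfamiliar with the system call, here is roughly what a durable block write looks like, as a Python sketch (Garage does the equivalent in Rust; the helper name and flags below are only an illustration):

```python
import os

def write_block_durably(path: str, data: bytes) -> None:
    """Write a block and return only once it has reached stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush the file content from the page cache to the device
    finally:
        os.close(fd)
    # Syncing the parent directory is also needed for the new entry to survive a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the two `os.fsync` calls is essentially what the `no-fsync` build does: the write returns as soon as the data sits in the OS page cache.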
To assess performance improvements, we used the benchmark tool [minio/warp](https://github.com/minio/warp) in a non-standard configuration, adapted for small scale tests, and we kept only the aggregated result named "cluster total". The goal of this experiment is to get an idea of the cluster performance with a standardized and mixed workload.
@@ -77,13 +81,45 @@ and we need to define a data policy for Garage that would help us arbitrate betw
*To fsync or not to fsync? Performance is nothing without reliability, so we need to better assess the impact of validating a write and then losing it. Because Garage is a distributed system, even if a node loses its write due to a power loss, it will fetch it back from the 2 other nodes storing it. But rare situations where 1 node is down and the 2 others validated the write and then lost power can occur: what is our policy in this case? For storage durability, we already assume that we never lose the storage of more than 2 nodes; should we also expect that we don't lose more than 2 nodes at the same time? What should we think about people hosting all their nodes at the same place without a UPS?*
**Storing millions of objects** - Object storage systems are designed not only for data durability and availability, but also for scalability.
Following this observation, some people asked us how scalable Garage is. While answering this question in full is out of the scope of this study, we wanted to
be sure that our metadata engine would be able to scale to millions of objects. To put this target in context, it remains small compared to other industrial solutions:
Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview), which is 4 orders of magnitude more than our current target. Of course, their benchmarking setup has nothing in common with ours, and their tests are way more exhaustive.
We wrote our own [benchmarking tool](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) for this test. It concurrently sends a defined number of very tiny objects (8192 objects of 16 bytes by default) and measures the time it takes. It repeats this step a given number of times (128 by default) to effectively create a target number of objects on the cluster (1M by default).
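The core of such a benchmark fits in a few lines. The sketch below only mimics the spirit of our tool using boto3, with a hypothetical local endpoint and placeholder credentials (the real script lives in the mknet repository and may differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3  # any S3-compatible client would do

# Hypothetical endpoint and credentials: adapt them to your own cluster.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="<access key>",
    aws_secret_access_key="<secret key>",
)

BATCHES = 128      # number of batches
BATCH_SIZE = 8192  # objects per batch, i.e. ~1M objects in total
BODY = b"x" * 16   # 16-byte objects

def put_one(i: int) -> None:
    s3.put_object(Bucket="bench", Key=f"obj-{i:08d}", Body=BODY)

with ThreadPoolExecutor(max_workers=64) as pool:
    for batch in range(BATCHES):
        start = time.monotonic()
        first = batch * BATCH_SIZE
        list(pool.map(put_one, range(first, first + BATCH_SIZE)))
        print(f"batch {batch}: {time.monotonic() - start:.1f}s")
```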
On our local setup with 3 nodes, both Minio and Garage with LMDB were able to achieve this target. On the following plot, we show how much time it took Garage and Minio to handle each batch.
Before looking at the plot, **you must keep in mind some important points about Minio and Garage internals**.
Minio has no metadata engine: it stores its objects directly on the filesystem. Sending 1 million objects to Minio results in creating one million inodes on the storage node in our current setup. So the performance of your filesystem will probably have a large impact on the results you observe; we know the filesystem we used is not at all adapted to Minio (encryption layer, fixed number of inodes, etc.). Additionally, we mentioned earlier that we deactivated fsync for our metadata engine, while Minio might have some fsync logic here slowing down the creation of objects. Finally, object storage is designed for big objects, for which this cost is negligible. In the end, again, we use Minio as a reference to understand what our performance budget is for each part of our software.
Conversely, Garage has a metadata engine with a special optimization for small objects. Below 3KB, a block is not created on the filesystem but the object is directly stored inline in the metadata engine.
In the future, we plan to evaluate how Garage behaves at scale with 3KB+ objects, which will probably be much closer to Minio's behavior, as Garage will then have to create an inode for each object.
For now, we limit ourselves to evaluating our metadata engine, and thus focus only on 16-byte objects.
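In a simplified model, the decision rule looks like this (the threshold constant and the function below are illustrative only, not Garage's actual code):

```python
INLINE_THRESHOLD = 3 * 1024  # ~3KB: below this, no block is written to the filesystem

def placement(object_size: int) -> str:
    """Where the object's data ends up, in a simplified model."""
    if object_size <= INLINE_THRESHOLD:
        return "inline in the metadata engine"
    return "one or more blocks on the filesystem"

print(placement(16))         # our 16-byte benchmark objects never touch the filesystem
print(placement(5 * 2**20))  # a 5MB object gets real blocks (and inodes)
```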
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
It appears that the performance of our metadata engine is acceptable, as we have a comfortable margin compared to Minio (Minio is between 3 and 4 times slower per batch).
We also note that, past 200k objects, Minio's batch completion time is constant, while Garage's remains linear: it could be interesting to know whether Garage's batch completion time
would cross Minio's for a very large number of objects.
If we reason per object, both Minio's and Garage's performance remain very good: it takes around 20ms and 5ms respectively to create an object.
At 100 Mbps, if you upload a 10MB file, the upload will take 800ms; for a 100MB file, it goes up to 8s. In both cases, handling the object metadata is only a fraction of the upload time.
The only case where you could notice it would be if you uploaded a lot of very small files at once, which, again, is an unusual usage of the S3 API.
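As a quick sanity check of these numbers (taking the ~5ms per-object metadata cost measured above for Garage):

```python
LINK_MBPS = 100   # network speed in megabits per second
META_S = 0.005    # ~5ms of metadata handling per object for Garage

for size_mb in (10, 100):
    transfer_s = size_mb * 8 / LINK_MBPS  # 10MB -> 0.8s, 100MB -> 8s
    share = META_S / (transfer_s + META_S)
    print(f"{size_mb}MB upload: {transfer_s:.1f}s of transfer, metadata is {share:.1%} of the total")
```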
Next, we focus on Garage's data only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1. batch completion time grows linearly with the number of objects in the bucket, and 2. measurements are dispersed, at least more than for Minio.
We discussed the first point previously, but not the second one, measurement dispersion.
This instability could be an issue, as it could be a symptom of what we saw in some other experiments on this machine: it sometimes freezes under heavy I/O.
Such freezes could lead to request timeouts and failures. If it occurs on our testing computer, it may occur on other servers too: it would be interesting to better understand this issue,
document how to avoid it, or change how we handle our I/O.
At the same time, this was a very stressful test that will probably not be encountered in many setups: we were adding 273 objects per second for 30 minutes!
As a conclusion, Garage can ingest 1 million tiny objects in 30 minutes in a very restricted environment. As a comparison, our production cluster at [deuxfleurs.fr](https://deuxfleurs.fr) manages a bucket with 116k objects. This bucket contains real data: it is used by our Matrix instance to store people's media files (profile pictures, shared pictures, videos, audio, documents...). Thanks to this benchmark, we have identified two points of vigilance: object insertion time seems to grow linearly with the number of existing objects in the cluster, and we see some volatility in our measurements that could be a symptom of our system freezing under load. Despite these 2 points, we are confident that Garage could scale way above 1M objects, but it remains to be proven!
## In an unpredictable world, stay resilient
- low bandwidth