New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit e57f5c727a

@ -284,8 +284,8 @@ only data and not metadata is persisted on disk - in combination with
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).* [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
**Storing a million objects** - Object storage systems are designed not only **Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability. Following this for data durability and availability but also for scalability, so naturally,
observation, some people asked us how scalable Garage is. Although answering this some people asked us how scalable Garage is. Although answering this
question is out of the scope of this study, we wanted to be sure that our question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions: target in context, it remains small compared to other industrial solutions:
@ -296,79 +296,81 @@ more exhaustive.
We wrote our own benchmarking tool for this test, We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2]. [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
It concurrently sends a defined number of very tiny objects (8192 objects of 16 The benchmark procedure consists in
bytes by default) and measures the time it took. It repeats this step a given concurrently sending a defined number of tiny objects (8192 objects of 16
number of times (128 by default) to effectively create a certain number of bytes by default) and measuring the time it takes. This step is then repeated a given
objects on the target cluster (1M by default). On our local setup with 3 number of times (128 by default) to effectively create a certain target number of
objects on the cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how many times it took to Garage and Minio to handle following plot, we show how much time it took Garage and Minio to handle
each batch. each batch.
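The procedure described above can be sketched as follows. This is only a minimal illustration, not the actual s3billion code: the `run_benchmark` function, its `workers` parameter, and the in-memory dictionary standing in for the S3 endpoint are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(put_object, batches=128, objects_per_batch=8192, size=16, workers=32):
    """Send `batches` batches of tiny objects; return each batch duration in seconds.

    `put_object(key, payload)` is a hypothetical callable standing in for an S3 PUT.
    """
    payload = b"x" * size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"batch-{batch}/obj-{i}" for i in range(objects_per_batch)]
            start = time.perf_counter()
            # list() waits for every upload of the batch and re-raises any error.
            list(pool.map(lambda key: put_object(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# Stand-in target: a plain dict instead of a real cluster (assumption for the sketch).
store = {}
durations = run_benchmark(store.__setitem__, batches=4, objects_per_batch=256)
print(len(store), "objects created in", len(durations), "batches")
```

With the defaults (128 batches of 8192 objects), such a loop creates 1,048,576 objects, which is the "1M" target mentioned above.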
Before looking at the plot, **you must keep in mind some important points about Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**. the internals of both Minio and Garage**.
Minio has no metadata engine, it stores its objects directly on the filesystem. Minio has no metadata engine, it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on Sending 1 million objects on Minio results in creating one million inodes on
the storage node in our current setup. So the performance of your filesystem the storage server in our current setup. So the performance of the filesystem
will probably substantially impact the results you will observe; we know the will probably substantially impact the results we will observe.
In our precise setup, we know that the
filesystem we used is not adapted at all for Minio (encryption layer, fixed filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the `fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this creation of objects. Finally, object storage is designed for big objects: the
cost is negligible with bigger objects. In the end, again, we use Minio as a costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
reference to understand what our performance budget is for each part of our reference to understand what our performance budget is for each part of our
software. software.
Conversely, Garage has an optimization for small objects. Below 3KB, a block is Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
not created on the filesystem but the object is directly stored inline in the not created on the filesystem but the object is directly stored inline in the
metadata engine. In the future, we plan to evaluate how Garage behaves with metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
3KB+ objects at scale, probably way closer to Minio, as it will have to create >3KB objects, which we expect to be way closer to Minio, as it will have to create
an inode for each object. For now, we limit ourselves to evaluating our at least one inode per object. For now, we limit ourselves to evaluating our
metadata engine and thus focus only on 16-byte objects. metadata engine and thus focus only on 16-byte objects.
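To make the small-object optimization concrete, here is a toy sketch of the decision described above. It is not Garage's implementation: the `store_object` function, the two dictionaries standing in for the metadata engine and the block store, and the reading of "3KB" as 3072 bytes are all assumptions.

```python
INLINE_THRESHOLD = 3072  # assumption: "3KB" read as 3 * 1024 bytes


def store_object(metadata, blocks, key, data):
    """Store `data` inline in `metadata` when small, otherwise as a separate block."""
    if len(data) < INLINE_THRESHOLD:
        # Small object: no separate file, the bytes live in the metadata entry.
        metadata[key] = {"inline": data}
    else:
        # Larger object: the metadata entry only points to a block stored elsewhere.
        block_id = f"block-{len(blocks)}"
        blocks[block_id] = data
        metadata[key] = {"block": block_id}


metadata, blocks = {}, {}
store_object(metadata, blocks, "tiny", b"x" * 16)
store_object(metadata, blocks, "large", b"x" * 4096)
print(len(metadata), "metadata entries,", len(blocks), "block(s)")
```

In this model, the 16-byte objects of the benchmark never touch the block store, which is why the test only exercises the metadata engine.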
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png) ![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
It appears that the performances of our metadata engine are acceptable, as we It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x have a comfortable margin compared to Minio (Minio is between 3x and 4x
slower per batch). We also note that, past 200k objects, Minio batch slower per batch). We also note that, past the 200k objects mark, Minio's
completion time is constant as Garage's one is still increasing in the observed range: time to complete a batch of inserts is constant, while on Garage it is still increasing on the observed range.
it could be interesting to know if Garage batch's completion time would cross Minio's one It could be interesting to know if Garage's batch completion time would cross Minio's
for a very large number of objects. If we reason per object, both Minio and for a very large number of objects. If we reason per object, both Minio's and
Garage performances remain very good: it takes respectively around 20ms and Garage's performances remain very good: it takes respectively around 20ms and
5ms to create an object. At 100 Mbps, if you upload a 10MB file, the 5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases 800ms, and goes up to 8sec for a 100MB file: in both cases
handling the object metadata is only a fraction of the upload time. The handling the object metadata is only a fraction of the upload time. The
only cases where you could notice it would be if you upload a lot of very only cases where a difference would be noticeable would be when uploading a lot of very
small files at once, which again, is an unusual usage of the S3 API. small files at once, which again is an unusual usage of the S3 API.
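The orders of magnitude above can be checked with a few lines of arithmetic. The `upload_seconds` helper is a name introduced for this example, and the calculation assumes decimal units (1 MB = 8 Mb) and a link used at its full nominal rate.

```python
def upload_seconds(size_megabytes, link_mbps=100):
    """Transfer time of a file, assuming the link is fully used (1 MB = 8 Mb)."""
    return size_megabytes * 8 / link_mbps


# Per-object creation times quoted in the text, in seconds.
metadata_cost = {"Minio": 0.020, "Garage": 0.005}

for size_mb in (10, 100):
    transfer = upload_seconds(size_mb)
    for name, cost in metadata_cost.items():
        share = cost / (transfer + cost)
        print(f"{size_mb} MB on {name}: transfer {transfer:.1f}s, metadata share {share:.2%}")
```

Even in the least favorable case here (a 10MB file on Minio), the per-object cost stays at a few percent of the total upload time.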
Next, we focus on Garage's data only to better see its specific behavior: Let us now focus on Garage's metrics only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png) ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1. increasing batch completion time with the Two effects are now more visible: 1. batch completion time increases with the
number of objects in the bucket and 2. measurements are dispersed, at least number of objects in the bucket and 2. measurements are dispersed, at least
more than Minio. We don't know for sure if this increasing batch completion more than for Minio. We expect this batch completion time increase to be logarithmic,
time is linear or logarithmic as we don't have enough datapoint; additinal but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw with some other experiments in this machine: be a symptom of what we saw with some other experiments in this machine,
sometimes it freezes under heavy I/O operations. Such freezes could lead to which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If it occurs on our testing computer, it will request timeouts and failures. If this occurs on our testing computer, it will
occur on other servers too: it could be interesting to better understand this occur on other servers too: it would be interesting to better understand this
issue, document how to avoid it, or change how we handle our I/O. At the same issue, document how to avoid it, and potentially change how we handle our I/O. At the same
time, this was a very stressful test that will probably not be encountered in time, this was a very stressful test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes! many setups: we were adding 273 objects per second for 30 minutes straight!
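One possible way to tell a linear trend from a logarithmic one, once more measurements are available, is to fit both models and compare their residuals. The sketch below is only an illustration of that idea: the helper names are ours, and the data is synthetic (generated from a known logarithmic curve to check the method), not taken from the benchmark.

```python
import math


def least_squares(xs, ys):
    """Fit y = a + b * x by ordinary least squares; return (a, b, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def compare_trends(object_counts, batch_times):
    """Return the residual sum of squares of a linear and of a logarithmic fit."""
    _, _, rss_linear = least_squares(object_counts, batch_times)
    log_counts = [math.log(n) for n in object_counts]
    _, _, rss_log = least_squares(log_counts, batch_times)
    return {"linear": rss_linear, "logarithmic": rss_log}


# Synthetic data only: 128 batches of 8192 objects following a made-up log curve.
counts = [8192 * (i + 1) for i in range(128)]
synthetic_times = [2.0 + 3.0 * math.log(n) for n in counts]
result = compare_trends(counts, synthetic_times)
print(min(result, key=result.get), "model fits the synthetic data best")
```

On real measurements the dispersion noted above would blur the comparison, so a clear gap between the two residuals would be needed before drawing a conclusion.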
To conclude this part, Garage can ingest 1 million tiny objects while remaining To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
116k objects. This bucket contains real data: it is used by our Matrix instance 116k objects. This bucket contains real-world production data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos, to store people's media files (profile pictures, shared pictures, videos,
audios, documents...). Thanks to this benchmark, we have identified two points audios, documents...). Thanks to this benchmark, we have identified two points
of vigilance: batch duration increases with the number of existing of vigilance: the increase of batch insert time with the number of existing
objects in the cluster in the observed range, and we have some volatility in our measured data that objects in the cluster in the observed range, and the volatility in our measured data that
could be a symptom of our system freezing under the load. Despite these two could be a symptom of our system freezing under the load. Despite these two
points, we are confident that Garage could scale way above 1M+ objects, but it points, we are confident that Garage could scale way above 1M+ objects, although
remains to be proved! that remains to be proven.
## In an unpredictable world, stay resilient ## In an unpredictable world, stay resilient