New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Changes of commit e57f5c727a

@ -284,8 +284,8 @@ only data and not metadata is persisted on disk - in combination with
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability. Following this
observation, some people asked us how scalable Garage is. If answering this
for data durability and availability but also for scalability, so naturally,
some people asked us how scalable Garage is. Although answering this
question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions:
@ -296,79 +296,81 @@ more exhaustive.
We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
It concurrently sends a defined number of very tiny objects (8192 objects of 16
bytes by default) and measures the time it took. It repeats this step a given
number of times (128 by default) to effectively create a certain number of
objects on the target cluster (1M by default). On our local setup with 3
The benchmark procedure consists in
concurrently sending a defined number of tiny objects (8192 objects of 16
bytes by default) and measuring the time it takes. This step is then repeated a given
number of times (128 by default) to effectively create a certain target number of
objects on the cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how many times it took to Garage and Minio to handle
following plot, we show how much time it took Garage and Minio to handle
each batch.
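The procedure described above can be sketched in a few lines. This is only an illustration of the batching and timing logic, not the actual s3billion code: `run_benchmark`, its parameters, and the in-memory `store` that stands in for an S3 endpoint are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(put_object, batches=128, batch_size=8192,
                  payload=b"x" * 16, workers=64):
    """Send `batches` batches of `batch_size` tiny objects.

    `put_object(key, data)` is a hypothetical upload callable.
    Returns the wall-clock duration of each batch, in seconds.
    """
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"batch-{batch:04d}/obj-{i:05d}" for i in range(batch_size)]
            start = time.perf_counter()
            # Consuming the iterator waits for every upload of the batch
            # and re-raises any error raised by put_object.
            for _ in pool.map(lambda key: put_object(key, payload), keys):
                pass
            durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a real S3 PUT: a plain dict instead of a cluster.
store = {}
timings = run_benchmark(store.__setitem__, batches=4, batch_size=256)
print(len(store), "objects created in", len(timings), "batches")
```

With the defaults (128 batches of 8192 objects), the loop creates 1,048,576 objects, which is the "1M" target mentioned above. Against a real cluster, `put_object` would wrap an S3 client call instead of a dictionary write.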
Before looking at the plot, **you must keep in mind some important points about
Minio and Garage internals**.
the internals of both Minio and Garage**.
Minio has no metadata engine: it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on
the storage node in our current setup. So the performance of your filesystem
will probably substantially impact the results you will observe; we know the
the storage server in our current setup. So the performance of the filesystem
will probably substantially impact the results we observe.
In our precise setup, we know that the
filesystem we used is not suited at all to Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: this
cost is negligible with bigger objects. In the end, again, we use Minio as a
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here that slows down the
creation of objects. Finally, object storage is designed for big objects: the
costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
reference to understand what our performance budget is for each part of our
software.
Conversely, Garage has an optimization for small objects. Below 3KB, a block is
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
not created on the filesystem but the object is directly stored inline in the
metadata engine. In the future, we plan to evaluate how Garage behaves with
3KB+ objects at scale, probably way closer to Minio, as it will have to create
an inode for each object. For now, we limit ourselves to evaluating our
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
objects larger than 3KB, which we expect to be much closer to Minio, as it will have to create
at least one inode per object. For now, we limit ourselves to evaluating our
metadata engine and thus focus only on 16-byte objects.
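To make the small-object optimization concrete, here is a toy model of the decision it implies. It is not Garage's code: the function `place_object`, the two dictionaries standing in for the metadata engine and the block store, and the reading of "3KB" as 3072 bytes are all assumptions made for the illustration.

```python
INLINE_THRESHOLD = 3 * 1024  # assumption: "3KB" read as 3072 bytes


def place_object(key, data, metadata, blocks):
    """Hypothetical write path: small payloads live in the metadata entry."""
    if len(data) < INLINE_THRESHOLD:
        # No separate file: the bytes are stored inline with the metadata.
        metadata[key] = {"inline": data}
        return "inline"
    # Larger payloads go to a separate block, which stands in here for
    # a file (and therefore at least one inode) on the filesystem.
    block_id = f"block-{len(blocks)}"
    blocks[block_id] = data
    metadata[key] = {"block": block_id}
    return "block"


metadata, blocks = {}, {}
print(place_object("tiny", b"x" * 16, metadata, blocks))
print(place_object("large", b"x" * 4096, metadata, blocks))
```

In this model, the 16-byte objects of the benchmark never touch the block store, which is why the test exercises the metadata engine only.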
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
It appears that the performance of our metadata engine is acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x
slower per batch). We also note that, past 200k objects, Minio batch
completion time is constant as Garage's one is still increasing in the observed range:
it could be interesting to know if Garage batch's completion time would cross Minio's one
for a very large number of objects. If we reason per object, both Minio and
Garage performances remain very good: it takes respectively around 20ms and
5ms to create an object. At 100 Mbps, if you upload a 10MB file, the
upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases
slower per batch). We also note that, past the 200k objects mark, Minio's
time to complete a batch of inserts is constant, while Garage's is still increasing over the observed range.
It could be interesting to know whether Garage's batch completion time would cross Minio's
for a very large number of objects. If we reason per object, both Minio's and
Garage's performance remains very good: it takes respectively around 20ms and
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
800ms, and goes up to 8s for a 100MB file: in both cases
handling the object metadata is only a fraction of the upload time. The
only cases where you could notice it would be if you upload a lot of very
small files at once, which again, is an unusual usage of the S3 API.
only cases where a difference would be noticeable would be when uploading a lot of very
small files at once, which again is an unusual usage of the S3 API.
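The upload times quoted above follow from a simple conversion, sketched below. The helper `transfer_time_ms` is a name chosen for this example, and the calculation assumes decimal units (1 MB = 8 Mb) and ignores protocol overhead. The per-object costs are the approximate 20ms and 5ms figures from the text.

```python
def transfer_time_ms(size_mb, link_mbps):
    """Time, in ms, to push `size_mb` megabytes through a `link_mbps` link."""
    # 1 megabyte = 8 megabits (decimal units, no protocol overhead).
    return size_mb * 8 * 1000 / link_mbps


# Approximate per-object metadata costs quoted in the text, in ms.
METADATA_MS = {"Minio": 20, "Garage": 5}

for size_mb in (10, 100):
    transfer = transfer_time_ms(size_mb, 100)
    for name, cost in METADATA_MS.items():
        share = cost / (transfer + cost)
        print(f"{size_mb} MB on {name}: {transfer:.0f} ms transfer, "
              f"metadata share {share:.1%}")
```

For both file sizes, the metadata cost stays a small share of the total, which is the point made in the paragraph.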
Next, we focus on Garage's data only to better see its specific behavior:
Let us now focus on Garage's metrics only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1. increasing batch completion time with the
Two effects are now more visible: 1. batch completion time increases with the
number of objects in the bucket and 2. measurements are dispersed, at least
more than Minio. We don't know for sure if this increasing batch completion
time is linear or logarithmic as we don't have enough datapoint; additinal
more than for Minio. We expect this batch completion time increase to be logarithmic,
but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw with some other experiments in this machine:
sometimes it freezes under heavy I/O operations. Such freezes could lead to
request timeouts and failures. If it occurs on our testing computer, it will
occur on other servers too: it could be interesting to better understand this
issue, document how to avoid it, or change how we handle our I/O. At the same
be a symptom of what we saw with some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will
occur on other servers too: it would be interesting to better understand this
issue, document how to avoid it, and potentially change how we handle our I/O. At the same
time, this was a very stressful test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes!
many setups: we were adding 273 objects per second for 30 minutes straight!
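One way to settle the linear versus logarithmic question, once more measurements are available, would be to fit both models and compare their residuals. The sketch below is only a suggestion, not part of the study: the helper names are invented for the example, and the timings it runs on are synthetic values generated from a logarithmic formula, not measurements.

```python
import math


def least_squares(xs, ys):
    """Fit y = a + b*x; return (a, b, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def compare_growth(object_counts, batch_seconds):
    """Return the residual sum of squares of a linear and a logarithmic fit."""
    _, _, linear_rss = least_squares(object_counts, batch_seconds)
    log_counts = [math.log(n) for n in object_counts]
    _, _, log_rss = least_squares(log_counts, batch_seconds)
    return {"linear": linear_rss, "logarithmic": log_rss}


# Synthetic, made-up timings (NOT measurements): one point per batch.
counts = [8192 * (i + 1) for i in range(128)]
fake_seconds = [10 + 3 * math.log(n) for n in counts]
print(compare_growth(counts, fake_seconds))
```

The model with the smaller residual is the better description of the data. With noisy real measurements the gap may be small, which is another way of saying that more data points are needed.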
To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
116k objects. This bucket contains real data: it is used by our Matrix instance
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos,
audios, documents...). Thanks to this benchmark, we have identified two points
of vigilance: batch duration increases with the number of existing
objects in the cluster in the observed range, and we have some volatility in our measured data that
of vigilance: the increase of batch insert time with the number of existing
objects in the cluster in the observed range, and the volatility in our measured data that
could be a symptom of our system freezing under the load. Despite these two
points, we are confident that Garage could scale way above 1M+ objects, but it
remains to be proved!
points, we are confident that Garage could scale well beyond 1M objects, although
that remains to be proven.
## In an unpredictable world, stay resilient