New article: Bringing theoretical design and observed performances face to face #12
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*

**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally,
some people asked us how scalable Garage is. While answering this question is
out of the scope of this study, we wanted to be sure that our metadata engine
would be able to scale to a million objects. To put this target in context, it
remains small compared to other industrial solutions:

We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
The benchmark procedure consists of concurrently sending a defined number of
tiny objects (8192 objects of 16 bytes by default) and measuring the time it
takes. This step is then repeated a given number of times (128 by default) to
effectively create a certain target number of objects on the cluster (1M by
default). On our local setup with 3 nodes, both Minio and Garage with LMDB were
able to achieve this target. In the following plot, we show how much time it
took Garage and Minio to handle each batch.

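To make this procedure concrete, here is a minimal sketch of such a benchmark
in Python with `boto3`. It is not the actual `s3billion` tool: the endpoint,
credentials, bucket name and concurrency level are placeholders to adapt to
your own setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

# Placeholder endpoint and credentials: point them at your own cluster.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="<access key>",
    aws_secret_access_key="<secret key>",
)

BUCKET = "benchmark"      # must already exist
BATCHES = 128             # number of batches
BATCH_SIZE = 8192         # objects per batch
PAYLOAD = b"x" * 16       # 16-byte objects

def put_object(i: int) -> None:
    s3.put_object(Bucket=BUCKET, Key=f"obj-{i:08d}", Body=PAYLOAD)

with ThreadPoolExecutor(max_workers=64) as pool:
    for batch in range(BATCHES):
        start = time.monotonic()
        keys = range(batch * BATCH_SIZE, (batch + 1) * BATCH_SIZE)
        list(pool.map(put_object, keys))  # wait for the whole batch
        print(f"batch {batch:3d}: {time.monotonic() - start:.2f}s")
```
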
Before looking at the plot, **you must keep in mind some important points about
the internals of both Minio and Garage**.

Minio has no metadata engine: it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on
the storage server in our current setup. So the performance of the filesystem
will probably substantially impact the results we will observe. In our specific
setup, we know that the filesystem we used is not adapted at all for Minio
(encryption layer, fixed number of inodes, etc.). Additionally, we mentioned
earlier that we deactivated `fsync` for our metadata engine in Garage, whereas
Minio has some `fsync` logic here, slowing down the creation of objects.
Finally, object storage is designed for big objects: the costs measured here
are negligible for bigger objects. In the end, again, we use Minio as a
reference to understand what our performance budget is for each part of our
software.

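Since a fixed inode budget is one of the limitations at play here, a quick and
purely illustrative way to check the inode headroom of a data directory before
such a test (the path is a placeholder):

```python
import os

# Placeholder path: use the data directory backing your object store.
st = os.statvfs("/var/lib/minio")

print(f"total inodes: {st.f_files}")
print(f"free inodes:  {st.f_ffree}")
# Creating ~1M objects as individual files needs at least ~1M free inodes.
```
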
Conversely, Garage has an optimization for small objects. Below 3KB, a separate
file is not created on the filesystem: the object is directly stored inline in
the metadata engine. In the future, we plan to evaluate how Garage behaves at
scale with objects larger than 3KB, which we expect to be way closer to Minio,
as it will have to create at least one inode per object. For now, we limit
ourselves to evaluating our metadata engine and thus focus only on 16-byte
objects.

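As a rough illustration of this inline-storage idea (a simplified sketch, not
Garage's actual Rust implementation; the names and data layout are made up):

```python
INLINE_THRESHOLD = 3 * 1024  # bytes, the 3KB limit mentioned above

def store_object(metadata_db, block_store, key, payload):
    """Store small payloads inline in the metadata entry,
    larger ones as separate data blocks on disk."""
    if len(payload) <= INLINE_THRESHOLD:
        # Single metadata write: no extra file/inode on the data partition.
        metadata_db[key] = {"inline": payload}
    else:
        # Writing a block creates at least one file (hence one inode).
        block_ref = block_store.write(payload)
        metadata_db[key] = {"blocks": [block_ref]}
```
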
![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)

It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3 and 4 times
slower per batch). We also note that, past the 200k objects mark, Minio's time
to complete a batch of inserts is constant, while on Garage it is still
increasing over the observed range. It could be interesting to know if Garage's
batch completion time would cross Minio's for a very large number of objects.
If we reason per object, both Minio's and Garage's performances remain very
good: it takes around 20ms and 5ms respectively to create an object. At 100
Mbps, the upload of a 10MB file takes 800ms, and goes up to 8 seconds for a
100MB file: in both cases, handling the object metadata is only a fraction of
the upload time. The only cases where a difference would be noticeable would be
when uploading a lot of very small files at once, which again is an unusual
usage of the S3 API.

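For reference, these upload-time figures follow from straightforward
arithmetic, with file sizes counted as decimal megabytes:

```python
def upload_time_seconds(size_mb: float, link_mbps: float = 100.0) -> float:
    """Time to push `size_mb` decimal megabytes over a `link_mbps` link."""
    return size_mb * 8 / link_mbps

print(upload_time_seconds(10))    # 0.8 -> 800ms for a 10MB file
print(upload_time_seconds(100))   # 8.0 -> 8 seconds for a 100MB file
# Compare with the ~0.005s to 0.020s of per-object metadata handling.
```
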
Let us now focus on Garage's metrics only to better see its specific behavior:

![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)

Two effects are now more visible: 1. the batch completion time increases with
the number of objects in the bucket, and 2. the measurements are dispersed, at
least more than for Minio. We expect this batch completion time increase to be
logarithmic, but we don't have enough data points to conclude safely:
additional measurements are needed. Concerning the observed instability, it
could be a symptom of what we saw with some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will
occur on other servers too: it would be interesting to better understand this
issue, document how to avoid it, and potentially change how we handle our I/O.
At the same time, this was a very stressful test that will probably not be
encountered in many setups: we were adding 273 objects per second for 30
minutes straight!

To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs.fr) smoothly manages a bucket
with 116k objects. This bucket contains real-world production data: it is used
by our Matrix instance to store people's media files (profile pictures, shared
pictures, videos, audio files, documents...). Thanks to this benchmark, we have
identified two points of vigilance: the increase of batch insert time with the
number of existing objects in the cluster over the observed range, and the
volatility of our measurements, which could be a symptom of our system freezing
under the load. Despite these two points, we are confident that Garage could
scale well above 1 million objects, although that remains to be proven.

## In an unpredictable world, stay resilient