New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 43 additions and 41 deletions
@@ -284,8 +284,8 @@ only data and not metadata is persisted on disk - in combination with
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*

**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally,
some people asked us how scalable Garage is. While answering this
question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions:
@@ -296,79 +296,81 @@ more exhaustive.

We wrote our own benchmarking tool for this test,
[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
The benchmark procedure consists of
concurrently sending a defined number of tiny objects (8192 objects of 16
bytes by default) and measuring the time it takes. This step is then repeated a given
number of times (128 by default) to effectively create a certain target number of
objects on the cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how much time it took Garage and Minio to handle
each batch.
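
To make this procedure concrete, here is a minimal sketch of such a batch-insertion
loop. This is not the actual s3billion code: it is written in Python with boto3, and
the endpoint URL, credentials, and bucket name are placeholders to adapt to your own
cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

# Placeholders: point this at your own S3-compatible endpoint and bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
BUCKET = "benchmark"
OBJECTS_PER_BATCH = 8192  # tiny 16-byte objects, as in the benchmark
BATCHES = 128             # 128 * 8192 ~= 1M objects in total

def put_tiny_object(index: int) -> None:
    # Each object is a 16-byte payload under a unique key.
    s3.put_object(Bucket=BUCKET, Key=f"obj-{index:08d}", Body=b"0123456789abcdef")

with ThreadPoolExecutor(max_workers=64) as pool:
    for batch in range(BATCHES):
        start = time.monotonic()
        first = batch * OBJECTS_PER_BATCH
        # Send one batch of tiny objects concurrently and wait for all of them.
        list(pool.map(put_tiny_object, range(first, first + OBJECTS_PER_BATCH)))
        print(f"batch {batch}: {time.monotonic() - start:.1f}s")
```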

Before looking at the plot, **you must keep in mind some important points about
the internals of both Minio and Garage**.

Minio has no metadata engine: it stores its objects directly on the filesystem.
Sending 1 million objects to Minio results in creating one million inodes on
the storage server in our current setup. So the performance of the filesystem
will probably substantially impact the results we observe.
In our specific setup, we know that the
filesystem we used is not at all suited to Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here that slows down the
creation of objects. Finally, object storage is designed for big objects: the
costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
reference to understand what our performance budget is for each part of our
software.

Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
not created on the filesystem but the object is directly stored inline in the
metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
>3KB objects, which we expect to be way closer to Minio, as it will have to create
at least one inode per object. For now, we limit ourselves to evaluating our
metadata engine and thus focus only on 16-byte objects.
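
As an illustration of the behavior described above, here is a minimal, hypothetical
sketch of an inline-versus-block write path. This is not Garage's actual code: the
`store_object` helper and the in-memory dictionaries are stand-ins for the real
metadata engine and block manager.

```python
# Hypothetical sketch, not Garage's actual implementation.
INLINE_THRESHOLD = 3 * 1024  # ~3KB, the threshold mentioned above

metadata_engine = {}  # stands in for the metadata store (e.g. LMDB)
block_store = {}      # stands in for data blocks written on the filesystem

def store_object(key: str, data: bytes) -> None:
    if len(data) < INLINE_THRESHOLD:
        # Tiny object: the payload is stored inline in the metadata entry,
        # so no extra file (and thus no extra inode) is created.
        metadata_engine[key] = {"inline": data}
    else:
        # Larger object: the payload goes to the block store and the metadata
        # entry only keeps a reference to it.
        block_id = f"block-{len(block_store)}"
        block_store[block_id] = data
        metadata_engine[key] = {"blocks": [block_id]}

store_object("tiny", b"0123456789abcdef")  # 16 bytes -> stored inline
store_object("big", b"x" * 1_000_000)      # 1MB -> stored as a block
```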

![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)

It appears that the performance of our metadata engine is acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3 and 4 times
slower per batch). We also note that, past the 200k objects mark, Minio's
time to complete a batch of inserts is constant, while on Garage it is still increasing in the observed range.
It would be interesting to know whether Garage's batch completion time would cross Minio's
for a very large number of objects. If we reason per object, both Minio's and
Garage's performance remains very good: it takes around 20ms and 5ms
respectively to create an object. At 100 Mbps, the upload of a 10MB file takes
800ms, and goes up to 8sec for a 100MB file: in both cases,
handling the object metadata is only a fraction of the upload time. The
only cases where a difference would be noticeable would be when uploading a lot of very
small files at once, which again is an unusual usage of the S3 API.
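
For the record, the back-of-the-envelope computation behind these numbers can be
written down as follows; the 100 Mbps link and the 20ms/5ms per-object figures are
the ones quoted above.

```python
LINK_MBPS = 100                           # network throughput, in megabits per second
METADATA_MS = {"Minio": 20, "Garage": 5}  # per-object creation time quoted above

for size_mb in (10, 100):
    transfer_ms = size_mb * 8 / LINK_MBPS * 1000  # 10MB -> 800ms, 100MB -> 8000ms
    for system, md_ms in METADATA_MS.items():
        overhead = md_ms / (transfer_ms + md_ms) * 100
        print(f"{system}, {size_mb}MB object: transfer {transfer_ms:.0f}ms, "
              f"metadata {md_ms}ms ({overhead:.1f}% of the total)")
```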

Let us now focus on Garage's metrics only to better see its specific behavior:

![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)

Two effects are now more visible: 1. the batch completion time increases with the
number of objects in the bucket, and 2. measurements are dispersed, at least
more than for Minio. We expect this increase in batch completion time to be logarithmic,
but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw in some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will
occur on other servers too: it would be interesting to better understand this
issue, document how to avoid it, and potentially change how we handle our I/O. At the
same time, this was a very stressful test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes straight!
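
As a sketch of how the linear-versus-logarithmic question could be settled once more
measurements are available, one could fit both models to the per-batch timings and
compare their residuals. The arrays below are placeholders for the measured data, not
our actual results.

```python
import numpy as np

# Placeholders: replace with the measured batch completion times.
batch = np.arange(1, 129)                               # 128 batches
seconds = np.random.default_rng(0).normal(40, 5, 128)   # stand-in measurements

objects = batch * 8192  # objects already in the bucket after each batch

def residual(x: np.ndarray, y: np.ndarray) -> float:
    # Least-squares fit y = a*x + b; return the sum of squared residuals.
    _, res, *_ = np.polyfit(x, y, 1, full=True)
    return float(res[0]) if len(res) else 0.0

lin = residual(objects, seconds)          # linear model:      t = a*n + b
log = residual(np.log(objects), seconds)  # logarithmic model: t = a*ln(n) + b
print("better fit:", "logarithmic" if log < lin else "linear")
```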

To conclude this part, Garage can ingest 1 million tiny objects while remaining
usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs.fr) smoothly manages a bucket with
116k objects. This bucket contains real-world production data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos,
audio files, documents...). Thanks to this benchmark, we have identified two points
of vigilance: the increase of batch insert time with the number of existing
objects in the cluster in the observed range, and the volatility in our measured data that
could be a symptom of our system freezing under load. Despite these two
points, we are confident that Garage could scale way above 1M+ objects, although
that remains to be proven.

## In an unpredictable world, stay resilient