diff --git a/content/blog/2022-perf/index.md b/content/blog/2022-perf/index.md
index 06bcd31..e240c18 100644
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@@ -284,8 +284,8 @@ only data and not metadata is persisted on disk - in combination with
 [example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
 
 **Storing a million objects** - Object storage systems are designed not only
-for data durability and availability but also for scalability. Following this
-observation, some people asked us how scalable Garage is. If answering this
+for data durability and availability but also for scalability, so naturally,
+some people asked us how scalable Garage is. If answering this
 question is out of the scope of this study, we wanted to be sure that our
 metadata engine would be able to scale to a million objects. To put this
 target in context, it remains small compared to other industrial solutions:
@@ -296,79 +296,81 @@ more exhaustive.
 
 We wrote our own benchmarking tool for this test,
 [s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
-It concurrently sends a defined number of very tiny objects (8192 objects of 16
-bytes by default) and measures the time it took. It repeats this step a given
-number of times (128 by default) to effectively create a certain number of
-objects on the target cluster (1M by default). On our local setup with 3
+The benchmark procedure consists in
+concurrently sending a defined number of tiny objects (8192 objects of 16
+bytes by default) and measuring the time it takes. This step is then repeated a given
+number of times (128 by default) to effectively create a certain target number of
+objects on the cluster (1M by default). On our local setup with 3
 nodes, both Minio and Garage with LMDB were able to achieve this target. In the
-following plot, we show how many times it took to Garage and Minio to handle
+following plot, we show how much time it took to Garage and Minio to handle
 each batch.
 
 Before looking at the plot, **you must keep in mind some important points about
-Minio and Garage internals**.
+the internals of both Minio and Garage**.
 
 Minio has no metadata engine, it stores its objects directly on the filesystem.
 Sending 1 million objects on Minio results in creating one million inodes on
-the storage node in our current setup. So the performance of your filesystem
-will probably substantially impact the results you will observe; we know the
+the storage server in our current setup. So the performances of the filesystem
+will probably substantially impact the results we will observe.
+In our precise setup, we know that the
 filesystem we used is not adapted at all for Minio (encryption layer, fixed
 number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
-`fsync` for our metadata engine, Minio has some `fsync` logic here slowing down the
-creation of objects. Finally, object storage is designed for big objects: this
-cost is negligible with bigger objects. In the end, again, we use Minio as a
+`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
+creation of objects. Finally, object storage is designed for big objects: the
+costs measured here are negligible for bigger objects. In the end, again, we use Minio as a
 reference to understand what is our performance budget for each part of our
 software.
 
-Conversely, Garage has an optimization for small objects. Below 3KB, a block is
+Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
 not created on the filesystem but the object is directly stored inline in the
-metadata engine. In the future, we plan to evaluate how Garage behaves with
-3KB+ objects at scale, probably way closer to Minio, as it will have to create
-an inode for each object. For now, we limit ourselves to evaluating our
+metadata engine. In the future, we plan to evaluate how Garage behaves at scale with
+>3KB objects, which we expect to be way closer to Minio, as it will have to create
+at least one inode per object. For now, we limit ourselves to evaluating our
 metadata engine and thus focus only on 16-byte objects.
 
 ![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
 
 It appears that the performances of our metadata engine are acceptable, as we
 have a comfortable margin compared to Minio (Minio is between 3x and 4x times
-slower per batch). We also note that, past 200k objects, Minio batch
-completion time is constant as Garage's one is still increasing in the observed range:
-it could be interesting to know if Garage batch's completion time would cross Minio's one
-for a very large number of objects. If we reason per object, both Minio and
-Garage performances remain very good: it takes respectively around 20ms and
-5ms to create an object. At 100 Mbps, if you upload a 10MB file, the
-upload will take 800ms, for a 100MB file, it goes up to 8sec; in both cases
+slower per batch). We also note that, past the 200k objects mark, Minio's
+time to complete a batch of inserts is constant, while on Garage it is still increasing on the observed range.
+It could be interesting to know if Garage batch's completion time would cross Minio's one
+for a very large number of objects. If we reason per object, both Minio's and
+Garage's performances remain very good: it takes respectively around 20ms and
+5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
+800ms, and goes up to 8sec for a 100MB file: in both cases
 handling the object metadata is only a fraction of the upload time. The
-only cases where you could notice it would be if you upload a lot of very
-small files at once, which again, is an unusual usage of the S3 API.
+only cases where a difference would be noticeable would be when uploading a lot of very
+small files at once, which again is an unusual usage of the S3 API.
 
-Next, we focus on Garage's data only to better see its specific behavior:
+Let us now focus on Garage's metrics only to better see its specific behavior:
 
 ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
 
-Two effects are now more visible: 1. increasing batch completion time with the
+Two effects are now more visible: 1. batch completion time increases with the
 number of objects in the bucket and 2. measurements are dispersed, at least
-more than Minio. We don't know for sure if this increasing batch completion
-time is linear or logarithmic as we don't have enough datapoint; additinal
+more than for Minio. We expect this batch completion time increase to be logarithmic,
+but we don't have enough datapoints to conclude safely: additional
 measurements are needed. Concercning the observed instability, it could
-be a symptom of what we saw with some other experiments in this machine:
-sometimes it freezes under heavy I/O operations. Such freezes could lead to
-request timeouts and failures. If it occurs on our testing computer, it will
-occur on other servers too: it could be interesting to better understand this
-issue, document how to avoid it, or change how we handle our I/O. At the same
+be a symptom of what we saw with some other experiments in this machine,
+which sometimes freezes under heavy I/O load. Such freezes could lead to
+request timeouts and failures. If this occurs on our testing computer, it will
+occur on other servers too: it would be interesting to better understand this
+issue, document how to avoid it, and potentially change how we handle our I/O. At the same
 time, this was a very stressful test that will probably not be encountered in
-many setups: we were adding 273 objects per second for 30 minutes!
+many setups: we were adding 273 objects per second for 30 minutes straight!
 
 To conclude this part, Garage can ingest 1 million tiny objects while remaining
 usable on our local setup. To put this result in perspective, our production
 cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
-116k objects. This bucket contains real data: it is used by our Matrix instance
+116k objects. This bucket contains real-world production data: it is used by our Matrix instance
 to store people's media files (profile pictures, shared pictures, videos,
 audios, documents...). Thanks to this benchmark, we have identified two points
-of vigilance: batch duration increases with the number of existing
-objects in the cluster in the observed range, and we have some volatility in our measured data that
+of vigilance: the increase of batch insert time with the number of existing
+objects in the cluster in the observed range, and the volatility in our measured data that
 could be a symptom of our system freezing under the load. Despite these two
-points, we are confident that Garage could scale way above 1M+ objects, but it
-remains to be proved!
+points, we are confident that Garage could scale way above 1M+ objects, although
+that remains to be proven.
 
 ## In an unpredictable world, stay resilient
 