Question: best approach for low latency #893

Closed
opened 2024-10-25 10:04:07 +00:00 by sebadob · 2 comments

Hey,

I migrated from Minio over to Garage some time ago and I am very happy with it, you did awesome work on this project!

This is only a question, not an issue.

I am designing a new application that will use Garage as the file storage backend for, eventually, millions of objects.
The main goal is the lowest possible read latency. I can split the data logically in a way that probably 3 or 4 different buckets would still make sense, but I need to change object names from time to time and would therefore need to use copy internal to do this.

The question is: what would be the best approach from the start, performance- and latency-wise?

  • split data into multiple buckets and probably stream an object through the requesting host when copying objects between buckets (copy internal works inside the same bucket only, right?)
  • keep everything inside the same bucket (object names will never conflict) to be able to use copy internal everywhere
  • does it make a difference performance-wise if I use prefixes for the files instead of a flat structure with all objects under a unique ID in the bucket? The question here is whether Garage can internally do a more efficient and faster lookup if I split the unique object ID into a prefix-based approach.

I am asking this upfront because migrating to another approach later on, after benchmarking, when I finally have all these files on the storage, could be a lot of work.

Thanks!

Edit:

The copy internal is not an issue anymore; it works fine between different buckets. So there is no reason not to partition the data and separate it into different buckets.
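
For anyone finding this later: a server-side copy between two buckets could look roughly like the sketch below, here with boto3 as an example client. The endpoint, credentials, bucket and key names are all placeholders, not values from this thread.

```python
# Sketch only: endpoint, credentials, bucket and object names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",  # Garage S3 API endpoint (assumed default port)
    aws_access_key_id="GK...",             # access key created for the buckets
    aws_secret_access_key="...",
    region_name="garage",
)

# Server-side copy between two different buckets: no object data passes through the client.
s3.copy_object(
    Bucket="bucket-b",                                       # destination bucket
    Key="renamed/object-id",                                 # new object name
    CopySource={"Bucket": "bucket-a", "Key": "old/object-id"},
)

# Optionally delete the old name to complete the "rename".
s3.delete_object(Bucket="bucket-a", Key="old/object-id")
```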

The only question left is whether I would get a latency / speed advantage from using a path prefix vs no prefix.

Owner

I don't think there will be: there is no specific prefix sharing depending on the object path; it depends on the hash of the actual data. The only thing that might make a difference is the block size you configure on the cluster, as well as the size of objects.
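
For reference, the block size is a cluster-wide setting in garage.toml. A minimal excerpt, assuming the documented block_size option; the paths and the value shown are illustrative, not tuning advice:

```toml
# garage.toml (excerpt) -- paths and value are placeholders
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"

# Objects are stored as consecutive chunks of at most this size.
# Larger blocks mean fewer chunks per object; smaller blocks spread
# data more evenly across the cluster.
block_size = 1048576  # 1 MiB, the default
```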

Author

I don't think there will be: there is no specific prefix sharing depending on the object path; it depends on the hash of the actual data. The only thing that might make a difference is the block size you configure on the cluster, as well as the size of objects.

Thank you!

I was able to do some first, simple benchmarks and I did not notice any real difference between the two approaches so far.
A big vote for using path prefixes, though, is that they make debugging and resuming longer-running jobs easier, for instance if the main application restarts in between, because I can limit ListObjects to a prefix, which is very helpful in these situations.
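
For illustration, limiting the listing to one prefix with boto3 could look roughly like this; the endpoint, credentials, bucket name and prefix are made up:

```python
# Sketch only: endpoint, credentials, bucket and prefix are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
    region_name="garage",
)

# Walk only the keys under one prefix, e.g. to resume a half-finished job
# after the application restarted.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket-a", Prefix="jobs/2024-10/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```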

So what I am doing now is sharding the data between different buckets where it is no issue (logically, all data is "the same", but I can do some rough grouping) and using path prefixes for easier maintenance. The performance is great in any case and I can super easily serve small files from S3 directly, which is awesome.

So I guess this is answered.
