Partition the key list of heavy buckets by prefix #200

Open
opened 2022-01-24 17:35:33 +00:00 by quentin · 0 comments

Reference

Old AWS S3

In the past, Amazon published documentation on how to name your keys to benefit from efficient partitioning:

  • https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/

Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.

https://stackoverflow.com/questions/38930846/performance-of-listing-s3-bucket-with-prefix-and-delimiter
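
As a toy illustration of that old advice (not AWS's or Garage's actual scheme; the hash choice and prefix length here are arbitrary), sequential keys can be spread across partitions by prepending a short hash-derived prefix:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustration only: prepend a 2-hex-char hash prefix (256 possible values)
// so that sequential keys land in different index partitions.
fn randomized_key(original: &str) -> String {
    let mut h = DefaultHasher::new();
    original.hash(&mut h);
    format!("{:02x}/{}", h.finish() & 0xff, original)
}

fn main() {
    for key in ["2022-01-24/log-0001", "2022-01-24/log-0002"] {
        println!("{} -> {}", key, randomized_key(key));
    }
}
```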

New AWS S3

Apparently, they are now able to make this partitioning invisible to their end users:

For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Proposed solution

Table

Currently we have this logic:

  • objects: (partition key: bucket, sorting key: object_key)

But Amazon S3, to support a very large number of keys, switches (automagically?) to an unknown number of partitions for keys, which is probably something like:

  • shards: (partition key: bucket, sorting key: shard)
  • objects: (partition key: shard, sorting key: object_key)
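
A rough sketch of what those two layouts could look like, with hypothetical Rust types (all names are illustrative; Garage's real table schema is more involved):

```rust
#![allow(dead_code)]

/// Today: one table, partitioned by bucket.
/// (partition key: bucket, sorting key: object_key)
struct ObjectToday {
    bucket: String,     // partition key
    object_key: String, // sorting key
}

/// Proposed: a small table mapping each bucket to its shards...
/// (partition key: bucket, sorting key: shard)
struct ShardEntry {
    bucket: String, // partition key
    shard: String,  // sorting key: shard identifier / lower bound of its key range
}

/// ...and the object table, now partitioned by shard instead of by bucket.
/// (partition key: shard, sorting key: object_key)
struct ObjectSharded {
    shard: String,      // partition key (was: bucket)
    object_key: String, // sorting key
}

fn main() {} // nothing to run; this only sketches the table keys
```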

Choosing partition number + range

It is probably a change to this algorithm that enables Amazon to partition data efficiently even when sequential keys are used.

A first iteration for Garage could propose manually sharded buckets: something like `garage bucket shard my-bucket --shards=a,b,c,d,e,f` would create 7 shards (see the routing sketch after this list):

  1. Prefixes from '\x00' to 'a'
  2. Prefixes from 'a' to 'b'
  3. Prefixes from 'b' to 'c'
  4. Prefixes from 'c' to 'd'
  5. Prefixes from 'd' to 'e'
  6. Prefixes from 'e' to 'f'
  7. Prefixes from 'f' to '\xff'
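
Routing a key to its shard is then a binary search over the sorted split points. A minimal sketch, assuming the `--shards` flag semantics above (function name hypothetical; ranges are 0-indexed here while the list above is 1-indexed):

```rust
// Given the split points from `--shards=a,b,c,d,e,f`, find which of the
// 7 half-open ranges ['\x00','a'), ['a','b'), ..., ['f','\xff') a key falls into.
fn shard_for_key(split_points: &[&str], key: &str) -> usize {
    // The number of split points <= key is exactly the index of the
    // range that contains the key.
    split_points.partition_point(|&p| p <= key)
}

fn main() {
    let splits = ["a", "b", "c", "d", "e", "f"];
    assert_eq!(shard_for_key(&splits, "0001/logs"), 0); // before 'a'
    assert_eq!(shard_for_key(&splits, "archive/x"), 1); // in ['a','b')
    assert_eq!(shard_for_key(&splits, "zz/backup"), 6); // after 'f'
}
```

Because the shards partition the key space into contiguous, sorted ranges, listing a whole bucket could iterate the shards in order and concatenate per-shard listings, which would preserve the global key ordering that ListObjects requires.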

Note on "low priority"

We are not yet aware of any case where such a scheme would be required for Garage. In other words, we have never seen any deployment of Garage that would require this optimization.

Please let us know if you have encountered (or think that you will encounter) such a bottleneck with Garage.

quentin added this to the Speculative milestone 2022-01-24 17:35:33 +00:00
quentin added the Low priority and Performance labels 2022-01-24 17:35:33 +00:00
quentin changed title from Partition key list of heavy buckets by prefix to Partition the key list of heavy buckets by prefix 2022-01-24 17:37:36 +00:00