better consistent hashing #31

Closed
opened 2021-02-21 12:14:34 +00:00 by lx · 3 comments
Owner

Current technique:

  1. Hash(node id ; 1, 2, ..., n_tokens)
  2. Sort those hashes (gives a ring)
  3. To find where an item goes, look up position in the ring and walk from there
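
For reference, a minimal sketch of the current technique in Python (hypothetical helper names, not Garage's actual code):

```python
import hashlib
from bisect import bisect_right

def h(*parts):
    # Stand-in for Garage's hash function: hash "p1;p2;..." to a 64-bit int.
    data = ";".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def build_ring(peers, n_tokens):
    # Steps 1-2: hash (node id ; j) for j = 1..n_tokens, then sort the hashes.
    return sorted((h(peer, j), peer) for peer in peers for j in range(1, n_tokens + 1))

def lookup(ring, item):
    # Step 3: binary-search the item's hash and walk clockwise (with wrap-around).
    i = bisect_right([t[0] for t in ring], h(item)) % len(ring)
    return ring[i][1]
```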

New technique:

  1. Partitions = [0, ..., 2^partition_bits-1]
  2. For i = 0 to 2^partition_bits-1:
    • partition_peer[i] = argmin_peer(min_j=1 to peer.n_tokens (hash(peer ; i ; j)))
  3. To find where an item goes, look up partition (hash & (2^partition_bits-1)) and walk from there

partition_bits is a constant of the system that defines how many partitions of the data exist; typically partition_bits = 10, giving 1024 partitions.
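
A sketch of the new assignment, reusing the hypothetical h from the sketch above and assuming, for simplicity, that all peers share the same n_tokens:

```python
def assign_partitions(peers, n_tokens, partition_bits=10):
    # Step 2: each partition i goes to the peer whose best (smallest) token
    # hash(peer ; i ; j), j = 1..n_tokens, is minimal among all peers.
    n_partitions = 1 << partition_bits
    return [
        min(peers, key=lambda peer: min(h(peer, i, j) for j in range(1, n_tokens + 1)))
        for i in range(n_partitions)
    ]

def lookup_partition(partition_peer, item, partition_bits=10):
    # Step 3: the partition index is just the low bits of the item's hash.
    return partition_peer[h(item) & ((1 << partition_bits) - 1)]
```

Note that when a peer is added, a partition moves only if the new peer's best token beats the current owner's (and when a peer is removed, only its own partitions move), which is why partition boundaries stay fixed.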

Three advantages to the new technique:

  • better load distribution
  • partitions keep the same boundaries when peers are added/removed
  • much faster lookup (the binary search on the ring is replaced by a direct array index)

One drawback:

  • partition_bits has to be chosen large enough initially to accommodate future system growth (it can be increased later, but that requires rehashing the whole data set)

EDIT: [Maglev](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/) is even better: the load distribution is perfect, and it has the same other properties as the method described above.
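
For comparison, a minimal sketch of Maglev's table-population step as described in the paper (prime table size; h reused from the sketches above; names are illustrative):

```python
def maglev_table(backends, table_size=1021):  # table size should be prime
    n = len(backends)
    # Each backend gets a pseudo-random permutation of slots:
    # perm_b(k) = (offset_b + k * skip_b) mod table_size.
    offset = [h(b, "offset") % table_size for b in backends]
    skip = [h(b, "skip") % (table_size - 1) + 1 for b in backends]
    next_k = [0] * n
    table = [None] * table_size
    filled = 0
    while filled < table_size:
        for b in range(n):
            # Backend b claims the next still-free slot in its permutation.
            while True:
                slot = (offset[b] + next_k[b] * skip[b]) % table_size
                next_k[b] += 1
                if table[slot] is None:
                    table[slot] = backends[b]
                    filled += 1
                    break
            if filled == table_size:
                break
    return table
```

Lookup is then table[h(item) % table_size], and since backends claim slots in strict round-robin, their slot counts differ by at most one, which is where the near-perfect load distribution comes from.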

lx changed title from better consistent to better consistent hashing 2021-02-21 12:17:22 +00:00
Author
Owner

Script `simulate_ring.py` shows a simulation of four hashing algorithms:

  1. Consistent hashing as currently implemented in Garage
  2. The adaptation from previous post
  3. Maglev, with our current ring-walking strategy for multi-dc
  4. Maglev customized to integrate multi-dc in its base principle

The last one is the best, although it uses 3x more RAM, since the list of replicas for a partition is stored in the partition itself and not obtained by walking the ring of partitions (i.e. a partition stores 3 node IDs instead of just one, but it is totally worth it because the walking strategy was a major source of imbalance). Also note that the last method provides very good balancing with only 256 partitions, even in a test with 10 nodes, whereas methods 2 and 3 need many more partitions (typically 4096) to counter the imbalance due to ring walking.
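
A hedged illustration of the trade-off (the real multi-dc walking rule also considers datacenters, which this sketch omits; names are illustrative):

```python
REPLICATION = 3  # replication factor assumed in the discussion above

def replicas_by_walking(partition_peer, i):
    # Methods 2/3: one node stored per partition; walk the ring of partitions
    # until REPLICATION distinct nodes are seen (assumes >= REPLICATION nodes).
    seen = []
    for j in range(i, i + len(partition_peer)):
        node = partition_peer[j % len(partition_peer)]
        if node not in seen:
            seen.append(node)
        if len(seen) == REPLICATION:
            break
    return seen

def replicas_by_table(partition_replicas, i):
    # Method 4: the full replica list is stored per partition, so lookup is a
    # single indexing step, at the cost of REPLICATION node IDs per partition.
    return partition_replicas[i]
```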

lx added the
Improvement
label 2021-02-22 09:16:45 +00:00
Author
Owner

Mostly done. Todo: rename `n_tokens` into `capacity`
Author
Owner

Done
lx closed this issue 2021-03-10 13:52:25 +00:00
lx added this to the 0.2 milestone 2021-03-10 16:48:19 +00:00