better consistent hashing #31

Closed
opened 2021-02-21 12:14:34 +00:00 by lx · 3 comments
Owner

Current technique:

  1. Hash(node id ; 1, 2, ..., n_tokens)
  2. Sort those hashes (gives a ring)
  3. To find where an item goes, look up position in the ring and walk from there

New technique:

  1. Partitions = [0, ..., 2^partition_bits-1]
  2. For i = 0 to 2^partition_bits-1:
    • partition_peer[i] = argmin_peer(min_j=1 to peer.n_tokens (hash(peer ; i ; j)))
  3. To find where an item goes, look up partition (hash & (2^partition_bits-1)) and walk from there

partition_bits is a constant of the system that defines how many partitions of the data exist, typically partition_bits = 10 and we have 1024 partitions.

Two advantages to the new technique:

  • better load distribution
  • partitions keep the same boundaries when peers are added/removed
  • much faster lookup on the ring (replace binary search by a known index)

One inconvenient:

  • partition_bits has to be selected big enough initially to accomodate future system growth (we can change it to make it bigger but it means rehashing all the data set)

EDIT: Maglev is even better, the load distribution is perfect and it has the same other properties as the method described above

Current technique: 1. Hash(node id ; 1, 2, ..., n_tokens) 2. Sort those hashes (gives a ring) 3. To find where an item goes, look up position in the ring and walk from there New technique: 1. Partitions = [0, ..., 2^partition_bits-1] 2. For i = 0 to 2^partition_bits-1: - partition_peer[i] = argmin_peer(min_j=1 to peer.n_tokens (hash(peer ; i ; j))) 3. To find where an item goes, look up partition (hash & (2^partition_bits-1)) and walk from there partition_bits is a constant of the system that defines how many partitions of the data exist, typically partition_bits = 10 and we have 1024 partitions. Two advantages to the new technique: - better load distribution - partitions keep the same boundaries when peers are added/removed - much faster lookup on the ring (replace binary search by a known index) One inconvenient: - partition_bits has to be selected big enough initially to accomodate future system growth (we can change it to make it bigger but it means rehashing all the data set) EDIT: [Maglev](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/) is even better, the load distribution is perfect and it has the same other properties as the method described above
lx changed title from better consistent to better consistent hashing 2021-02-21 12:17:22 +00:00
Author
Owner

Script simulate_ring.py shows a simulation of four hashing algorihtms:

  1. Consistent hashing as currently implemented in Garage
  2. The adaptation from previous post
  3. Maglev, with our current ring-walking strategy for multi-dc
  4. Maglev customized to integrate multi-dc in its base principle

The last one is the best, although it uses 3x more RAM since the list of replicas for a partition is stored in the partition and not obtained by walking the ring of partitions. (i.e. a partition stores 3 node IDs instead of just one, but it is totally worth it because the walking strategy was a major source of imbalance). Also note that the last method provides very good balancing for 256 partitions only, even in a test with 10 nodes, whereas methods 2 and 3 need much more partitions (typically 4096) to counter the imbalance due to the ring walking.

Script `simulate_ring.py` shows a simulation of four hashing algorihtms: 1. Consistent hashing as currently implemented in Garage 2. The adaptation from previous post 3. Maglev, with our current ring-walking strategy for multi-dc 4. Maglev customized to integrate multi-dc in its base principle The last one is the best, although it uses 3x more RAM since the list of replicas for a partition is stored in the partition and not obtained by walking the ring of partitions. (i.e. a partition stores 3 node IDs instead of just one, but it is totally worth it because the walking strategy was a major source of imbalance). Also note that the last method provides very good balancing for 256 partitions only, even in a test with 10 nodes, whereas methods 2 and 3 need much more partitions (typically 4096) to counter the imbalance due to the ring walking.
lx added the
kind
improvement
label 2021-02-22 09:16:45 +00:00
Author
Owner

Mostly done. Todo: rename n_tokens into capacity

Mostly done. Todo: rename `n_tokens` into `capacity`
Author
Owner

Done

Done
lx closed this issue 2021-03-10 13:52:25 +00:00
lx added this to the 0.2 milestone 2021-03-10 16:48:19 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#31
No description provided.