Add write-up about load-balancing
commit 2b4b69938f (parent 49c25a1509)
1 changed file with 175 additions and 0 deletions
doc/Load_Balancing.md (normal file)
@@ -0,0 +1,175 @@

I have conducted a quick study of different methods to load-balance data over different Garage nodes using consistent hashing.

### Requirements

- good balancing: two nodes that have the same announced capacity should receive close to the same number of items
- multi-datacenter: the replicas of a partition should be distributed over as many datacenters as possible
- minimal disruption: when adding or removing a node, as few partitions as possible should have to move around

### Methods

#### Naive multi-DC ring walking strategy

This strategy can be used with any ring-like algorithm to make it aware of the *multi-datacenter* requirement:

- the ring is a list of positions, each associated with a single node in the cluster
- look up the position of the item on the ring
- select the node for that position
- go clockwise, skipping nodes that:
  - we have already selected
  - are in a datacenter of a node we have already selected, except if we already have nodes from all available datacenters

In this way the selected nodes will always be distributed over
`min(n_datacenters, n_replicas)` different datacenters, which is the best we
can do.

This method was implemented in the first iteration of Garage, with the basic
ring construction that associates `n_token` random positions with
each node.

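The walk described above, combined with the basic token ring, can be sketched as follows. This is a minimal illustration and not the code from Garage or from the simulation script: the function names (`position`, `build_ring`, `walk_ring`), the use of SHA-256, and the derivation of token positions from a hash (rather than from a random generator) are all assumptions made for the example.

```python
import bisect
import hashlib


def position(key: str) -> int:
    """Hash a string to a 64-bit ring position (the hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def build_ring(n_token: dict) -> list:
    """Basic construction: n_token[node] pseudo-random positions per node."""
    return sorted((position(f"{node}/{t}"), node)
                  for node, count in n_token.items() for t in range(count))


def walk_ring(ring: list, node_dc: dict, item: str, n_replicas: int = 3) -> list:
    """Select replicas by walking clockwise from the item's position."""
    ring_nodes = {node for _, node in ring}
    all_dcs = {node_dc[node] for node in ring_nodes}
    n_replicas = min(n_replicas, len(ring_nodes))
    selected, used_dcs = [], set()
    # First ring entry at or after the item's position (wraps around via modulo).
    i = bisect.bisect_left(ring, (position(item), ""))
    while len(selected) < n_replicas:
        _, node = ring[i % len(ring)]
        i += 1
        if node in selected:
            continue  # node already selected
        if node_dc[node] in used_dcs and used_dcs != all_dcs:
            continue  # datacenter already used and some datacenter is still unused
        selected.append(node)
        used_dcs.add(node_dc[node])
    return selected


# Hypothetical four-node cluster spread over three datacenters.
node_dc = {"datura": "atuin", "io": "jupiter", "isou": "jupiter", "mini": "grog"}
ring = build_ring({"datura": 8, "io": 16, "isou": 8, "mini": 4})
print(walk_ring(ring, node_dc, "some-object-key"))
```

With three datacenters and three replicas, every selection in this sketch contains exactly one node per datacenter, which matches the `min(n_datacenters, n_replicas)` bound stated above.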
#### Better rings

The ring construction that selects `n_token` random positions for each node gives a ring of positions that
is not well-balanced: the space between the tokens varies a lot, and some partitions are thus bigger than others.
This problem was demonstrated in the original Dynamo paper.

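To see the imbalance concretely, one can measure which fraction of the ring each node ends up owning under the random-token construction. The sketch below is only an illustration: `token_ring_shares` is a hypothetical helper, positions are drawn in `[0, 1)` from a seeded generator, and each token is assumed to own the interval that precedes it on the ring.

```python
import random


def token_ring_shares(nodes: list, n_token: int, seed: int = 0) -> dict:
    """Fraction of the ring owned by each node with n_token random tokens each.

    Assumes at least two tokens in total.
    """
    rng = random.Random(seed)
    ring = sorted((rng.random(), node) for node in nodes for _ in range(n_token))
    shares = {node: 0.0 for node in nodes}
    for k, (pos, node) in enumerate(ring):
        prev_pos = ring[k - 1][0]  # k == 0 wraps around to the last token
        shares[node] += (pos - prev_pos) % 1.0
    return shares


shares = token_ring_shares(["datura", "digitale", "drosera", "isou"], n_token=8)
for node, share in sorted(shares.items()):
    print(f"{node:10s} {share:.3f}")
```

All four nodes announce the same number of tokens here, so a perfectly balanced ring would give each of them a share of 0.25; the spread of the printed values illustrates the problem described above.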
To solve this, we want to apply a second method for partitioning our dataset:

1. fix an initially large number of partitions (say 1024) with evenly-spaced delimiters,

2. attribute each partition randomly to a node, with a probability
   proportional to its capacity (which `n_tokens` represented in the first
   method)

I have studied two deterministic ways to do the attribution:

- Custom: take `argmin_node(hash(node, partition_number))`
- MagLev: see [here](https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/)

MagLev provided significantly better balancing, as it guarantees that the exact
same number of partitions is attributed to all nodes that have the same
capacity (and that this number is proportional to the node's capacity, except
for large values). However, in both cases:

- the distribution is still bad, because we use the naive multi-DC ring walking
  that behaves strangely due to interactions between consecutive positions on
  the ring

- the disruption in case of adding/removing a node is not as low as it can be,
  as we show with the following method.

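The "Custom" attribution listed above, `argmin_node(hash(node, partition_number))`, can be sketched as below. The text does not say how capacity enters the hash, so this example assumes one hash draw per unit of capacity, which makes a node's chance of holding the minimum proportional to its capacity. The names `h` and `assign_min_hash` and the use of SHA-256 are likewise assumptions.

```python
import hashlib


def h(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (hash choice is an assumption)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def assign_min_hash(capacities: dict, n_partitions: int = 1024) -> list:
    """Return, for each partition, the node holding the smallest hash.

    capacities maps node -> positive integer capacity. Each unit of capacity
    gets its own hash draw (assumption), so larger nodes win more often.
    """
    table = []
    for p in range(n_partitions):
        _, winner = min((h(node, unit, p), node)
                        for node, cap in capacities.items()
                        for unit in range(cap))
        table.append(winner)
    return table


table = assign_min_hash({"geant": 16, "datura": 8, "mini": 4}, n_partitions=1024)
print({node: table.count(node) for node in ("geant", "datura", "mini")})
```

In this sketch, removing a node only reassigns the partitions that node was holding, since the minimum over the remaining nodes is unchanged everywhere else. The sketch assigns a single node per partition and does not model replicas.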
A quick description of MagLev:

> The basic idea of Maglev hashing is to assign a preference list of all the
> lookup table positions to each backend. Then all the backends take turns
> filling their most-preferred table positions that are still empty, until the
> lookup table is completely filled in. Hence, Maglev hashing gives an almost
> equal share of the lookup table to each of the backends. Heterogeneous
> backend weights can be achieved by altering the relative frequency of the
> backends’ turns…

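The table-filling idea in the quote can be sketched as follows. This is a simplified illustration, not the exact algorithm from the Maglev paper nor the one in the simulation script: preference lists are obtained here by sorting table positions by a per-node hash (instead of an offset/skip permutation), which avoids requiring a prime table size, and a node with capacity `c` simply takes `c` turns per round. The names `h` and `maglev_table` are hypothetical.

```python
import hashlib


def h(*parts) -> int:
    """Deterministic 64-bit hash of the given parts (hash choice is an assumption)."""
    data = "/".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def maglev_table(capacities: dict, n_partitions: int = 1024) -> list:
    """Fill a lookup table: nodes take turns claiming their preferred free slot."""
    if not capacities or any(c <= 0 for c in capacities.values()):
        raise ValueError("need at least one node, all with positive capacity")
    # Preference list per node: every table position, ordered by a per-node hash.
    prefs = {node: sorted(range(n_partitions), key=lambda p, n=node: h(n, p))
             for node in capacities}
    cursor = dict.fromkeys(capacities, 0)
    table = [None] * n_partitions
    filled = 0
    while filled < n_partitions:
        for node in sorted(capacities):
            # Weighted turns: capacity c means c picks per round (assumption).
            for _ in range(capacities[node]):
                if filled == n_partitions:
                    return table
                # Advance to this node's most-preferred position that is still empty.
                while table[prefs[node][cursor[node]]] is not None:
                    cursor[node] += 1
                table[prefs[node][cursor[node]]] = node
                filled += 1
    return table


table = maglev_table({"geant": 16, "datura": 8, "mini": 4}, n_partitions=1024)
print({node: table.count(node) for node in ("geant", "datura", "mini")})
```

Because every node takes a fixed number of turns per round, nodes with equal capacity end up with the same number of table entries, up to the remainder of the last round.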
Here are some stats (run `scripts/simulate_ring.py` to reproduce):

```
##### Custom-ring (min-hash) #####

#partitions per node (capacity in parenthesis):
- datura (8)   : 227
- digitale (8) : 351
- drosera (8)  : 259
- geant (16)   : 476
- gipsie (16)  : 410
- io (16)      : 495
- isou (8)     : 231
- mini (4)     : 149
- mixi (4)     : 188
- modi (4)     : 127
- moxi (4)     : 159

Variance of load distribution for load normalized to intra-class mean
(a class being the set of nodes with the same announced capacity): 2.18% <-- REALLY BAD

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 63.09% 30.18%  6.64% 0.10%
removing atuin drosera  : 72.36% 23.44%  4.10% 0.10%
removing atuin datura   : 73.24% 21.48%  5.18% 0.10%
removing jupiter io     : 48.34% 38.48% 12.30% 0.88%
removing jupiter isou   : 74.12% 19.73%  6.05% 0.10%
removing grog mini      : 84.47% 12.40%  2.93% 0.20%
removing grog mixi      : 80.76% 16.60%  2.64% 0.00%
removing grog moxi      : 83.59% 14.06%  2.34% 0.00%
removing grog modi      : 87.01% 11.43%  1.46% 0.10%
removing grisou geant   : 48.24% 37.40% 13.67% 0.68%
removing grisou gipsie  : 53.03% 33.59% 13.09% 0.29%
on average:               69.84% 23.53%  6.40% 0.23% <-- COULD BE BETTER

--------

##### MagLev #####

#partitions per node:
- datura (8)   : 273
- digitale (8) : 256
- drosera (8)  : 267
- geant (16)   : 452
- gipsie (16)  : 427
- io (16)      : 483
- isou (8)     : 272
- mini (4)     : 184
- mixi (4)     : 160
- modi (4)     : 144
- moxi (4)     : 154

Variance of load distribution: 0.37% <-- Already much better, but not optimal

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 62.60% 29.20%  7.91% 0.29%
removing atuin drosera  : 65.92% 26.56%  7.23% 0.29%
removing atuin datura   : 63.96% 27.83%  7.71% 0.49%
removing jupiter io     : 44.63% 40.33% 14.06% 0.98%
removing jupiter isou   : 63.38% 27.25%  8.98% 0.39%
removing grog mini      : 72.46% 21.00%  6.35% 0.20%
removing grog mixi      : 72.95% 22.46%  4.39% 0.20%
removing grog moxi      : 74.22% 20.61%  4.98% 0.20%
removing grog modi      : 75.98% 18.36%  5.27% 0.39%
removing grisou geant   : 46.97% 36.62% 15.04% 1.37%
removing grisou gipsie  : 49.22% 36.52% 12.79% 1.46%
on average:               62.94% 27.89%  8.61% 0.57% <-- Worse than custom method
```
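The "Disruption when removing nodes" rows above report, for each removal, the share of partitions for which 0, 1, 2 or 3 replicas had to move. The exact definition used by the script is not shown here; the sketch below is one plausible reading, where a replica counts as moved if the node holding it after the change was not among the partition's replicas before. The function name `disruption` is hypothetical.

```python
from collections import Counter


def disruption(before: list, after: list, n_replicas: int = 3) -> list:
    """Fractions of partitions with 0, 1, ..., n_replicas replicas moved.

    before/after: one list of replica nodes per partition, in the same order.
    """
    moved = Counter(len(set(new) - set(old)) for old, new in zip(before, after))
    total = len(before)
    return [moved.get(k, 0) / total for k in range(n_replicas + 1)]


# Hypothetical layout of four partitions before and after removing node "a".
before = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "d"], ["a", "c", "d"]]
after = [["e", "b", "c"], ["e", "b", "d"], ["b", "c", "d"], ["e", "c", "d"]]
print(disruption(before, after))  # [0.25, 0.75, 0.0, 0.0]
```

Under this reading, a lower value in the last columns means fewer partitions need more than one replica transferred when the cluster layout changes.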
#### The magical solution: multi-DC aware MagLev

(insert algorithm description here; in the meantime, refer to `method4` in the simulation script)

```
##### Multi-DC aware MagLev #####

#partitions per node:
- datura (8)   : 268 <-- NODES WITH THE SAME CAPACITY
- digitale (8) : 267     HAVE THE SAME NUM OF PARTITIONS
- drosera (8)  : 267     (+- 1)
- geant (16)   : 470
- gipsie (16)  : 472
- io (16)      : 516
- isou (8)     : 268
- mini (4)     : 136
- mixi (4)     : 136
- modi (4)     : 136
- moxi (4)     : 136

Variance of load distribution: 0.06% <-- CAN'T DO BETTER THAN THIS

Disruption when removing nodes (partitions moved on 0/1/2/3 nodes):
removing atuin digitale : 65.72% 33.01% 1.27% 0.00%
removing atuin drosera  : 64.65% 33.89% 1.37% 0.10%
removing atuin datura   : 66.11% 32.62% 1.27% 0.00%
removing jupiter io     : 42.97% 53.42% 3.61% 0.00%
removing jupiter isou   : 66.11% 32.32% 1.56% 0.00%
removing grog mini      : 80.47% 18.85% 0.68% 0.00%
removing grog mixi      : 80.27% 18.85% 0.88% 0.00%
removing grog moxi      : 80.18% 19.04% 0.78% 0.00%
removing grog modi      : 79.69% 19.92% 0.39% 0.00%
removing grisou geant   : 44.63% 52.15% 3.22% 0.00%
removing grisou gipsie  : 43.55% 52.54% 3.91% 0.00%
on average:               64.94% 33.33% 1.72% 0.01% <-- VERY GOOD
```