Documentation on topology change #217

Open
opened 2022-02-04 07:38:31 +00:00 by quentin · 3 comments
Owner

We do not document topology changes in Garage, only disaster recovery.
Documenting them is particularly worthwhile because topology flexibility is one of Garage's strengths.
We should explain how to add multiple nodes or remove one or more nodes from the cluster.
In particular, we have interesting things to say about removing a whole region at once.

An example of someone wishing storage systems were more flexible: https://twitter.com/dave_universetf/status/1489483755630645248

quentin added the
Documentation
label 2022-02-04 07:38:31 +00:00

Failure of a disk (or hot removal and replacement with a blank disk) should also be covered (related: #218).


Hi, I'm very interested in learning about the flexible topology features. In the meantime, could you point me to the relevant code so I can understand how it works in Garage?

Owner

Hi @Chosto, the core of this feature is implemented in src/rpc/ring.rs and src/rpc/layout.rs. The two are separate mostly for historical reasons; nowadays the ring is just a copy of the layout data in a slightly different format that is easier to use.

Data in Garage is divided into 256 partitions. Think of slices of a cake: the entirety of the data in your cluster is the whole cake, which is called the "ring" in technical terms. Each partition has three copies: for each partition, Garage builds a list of three nodes that each store one copy. We call this list the assignation of the partition to nodes. The assignation of all partitions to three nodes each is what we call the layout; in the code, the ring data structure also holds a copy of it.
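To make the vocabulary above concrete, here is a minimal Python sketch of a layout as a table of 256 partitions, each assigned to three nodes. It is an illustration only and not Garage's Rust code: the way a key is mapped to a partition (first byte of a SHA-256 digest), the round-robin assignation, and the names `partition_of`, `make_toy_layout` and `nodes_for_key` are all assumptions made for the example.

```python
import hashlib
from typing import List

N_PARTITIONS = 256  # number of partitions, as described above
REPLICATION = 3     # each partition has three copies


def partition_of(key: bytes) -> int:
    """Map a key to a partition index in [0, 256).

    Assumption for this sketch: use the first byte of a SHA-256 digest.
    """
    return hashlib.sha256(key).digest()[0]


def make_toy_layout(nodes: List[str]) -> List[List[str]]:
    """Build a toy layout: partition p is assigned to three consecutive nodes.

    This round-robin rule is hypothetical; it ignores zones and capacities.
    """
    if len(nodes) < REPLICATION:
        raise ValueError("need at least three nodes")
    return [
        [nodes[(p + i) % len(nodes)] for i in range(REPLICATION)]
        for p in range(N_PARTITIONS)
    ]


def nodes_for_key(layout: List[List[str]], key: bytes) -> List[str]:
    """Return the assignation (list of three nodes) of the key's partition."""
    return layout[partition_of(key)]


layout = make_toy_layout(["node-a", "node-b", "node-c", "node-d"])
print(nodes_for_key(layout, b"my-bucket/my-object"))
```

In this picture, the layout is the whole `layout` table and an assignation is one of its rows.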

Garage decides on the assignation of each partition by trying to satisfy the following constraints:

  • The three nodes assigned to each partition MUST be in different zones if there are enough zones (at least three).
  • Each node must be assigned a number of partitions proportional to its announced capacity. This sometimes contradicts the previous constraint, for instance when the total capacity differs between zones. When both cannot be satisfied, Garage does something (I'm working on PR #266, which tries to ensure that data is at least well balanced between nodes within the same zone).
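The two constraints, and the way they can conflict, can be checked on a small example. The sketch below is a hypothetical checker written for this thread, not code from Garage: the function names, the `zone_of` and `capacity` dictionaries, and the four-partition toy layout are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def zone_violations(layout: List[List[str]], zone_of: Dict[str, str]) -> List[int]:
    """Return the partitions whose nodes are not all in different zones.

    As stated above, the constraint only applies when there are enough
    zones (at least as many as copies); otherwise nothing is reported.
    """
    n_zones = len(set(zone_of.values()))
    violations = []
    for p, nodes in enumerate(layout):
        if n_zones < len(nodes):
            continue  # not enough zones: the constraint does not apply
        if len({zone_of[n] for n in nodes}) < len(nodes):
            violations.append(p)
    return violations


def partition_counts(layout: List[List[str]]) -> Counter:
    """Count how many partition copies each node actually stores."""
    return Counter(n for nodes in layout for n in nodes)


def ideal_counts(layout: List[List[str]], capacity: Dict[str, float]) -> Dict[str, float]:
    """Number of copies each node would store if perfectly proportional to capacity."""
    total_slots = sum(len(nodes) for nodes in layout)
    total_capacity = sum(capacity.values())
    return {n: total_slots * c / total_capacity for n, c in capacity.items()}


# Toy cluster: three zones with one node each; node "c" has twice the capacity.
zone_of = {"a": "z1", "b": "z2", "c": "z3"}
capacity = {"a": 1, "b": 1, "c": 2}
layout = [["a", "b", "c"] for _ in range(4)]  # only 4 partitions, for brevity

print(zone_violations(layout, zone_of))
print(dict(partition_counts(layout)))
print(ideal_counts(layout, capacity))
```

With one node per zone, the zone constraint forces every partition onto all three nodes, so each node stores 4 copies, whereas a capacity-proportional split would be 3, 3 and 6. This is the kind of contradiction described in the second bullet.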

The code in src/rpc/layout.rs is able to compute such an assignation. It can also compute a new assignation from an old one when nodes are added or removed (this is where flexible topologies come in). When updating an old assignation, the code tries to minimize the number of partition copies that move between nodes, so as to minimize the amount of data transferred to rebalance the dataset.
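The quantity being minimized can be sketched as follows. This only illustrates the cost of a layout change, counted as partition copies that land on a node that did not previously hold that partition; it does not reproduce the optimization in src/rpc/layout.rs, and the function name `moved_copies` and the two-partition example are assumptions.

```python
from typing import List


def moved_copies(old: List[List[str]], new: List[List[str]]) -> int:
    """Count partition copies that must be transferred to a new node.

    For each partition, a copy counts as moved if its node in the new
    layout did not store that partition in the old layout.
    """
    if len(old) != len(new):
        raise ValueError("layouts must have the same number of partitions")
    return sum(len(set(n) - set(o)) for o, n in zip(old, new))


# Toy example with two partitions; node "d" joins the cluster.
old = [["a", "b", "c"], ["a", "b", "c"]]
candidate_1 = [["a", "b", "d"], ["a", "b", "c"]]  # d takes over one copy
candidate_2 = [["d", "a", "b"], ["b", "c", "d"]]  # d takes over two copies

print(moved_copies(old, candidate_1))
print(moved_copies(old, candidate_2))
```

Here `candidate_1` moves one copy and `candidate_2` moves two. Among candidate layouts that satisfy the constraints, one that moves fewer copies requires less data transfer during rebalancing.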

Reference: Deuxfleurs/garage#217