improve how nodes roles are assigned in garage #152

Merged
lx merged 1 commit from node-configure into main 2021-11-16 15:33:23 +00:00
Owner
- [x] change the terminology: the network configuration becomes the role table, and the configuration of a node becomes the node's role
- [x] the modification of the role table takes place in two steps: first, changes are staged in a CRDT data structure; then, once the user is happy with the changes, they can commit them all at once (or revert them) (see the sketch below)
- [x] update documentation
- [x] fix tests
- [x] implement smarter partition assignation algorithm
- [ ] improve re-assignation algorithm when using `--replace`
- [x] improve the display of how many partitions are displaced

Currently, this PR breaks the format of the network configuration: when migrating, the cluster will be in a state where no roles are assigned. All roles must be re-assigned and committed at once. This migration should not pose an issue.

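A minimal Rust sketch of the staged-then-committed workflow described in the second checklist item; the `RoleTable` type and its `stage`/`commit`/`revert` methods are illustrative assumptions (a plain `HashMap` stands in for the CRDT maps), not the actual Garage data structures.

```rust
use std::collections::HashMap;

/// Hypothetical identifier and role types, for illustration only.
type NodeId = String;

#[derive(Clone, Debug, PartialEq)]
struct NodeRole {
    zone: String,
    capacity: Option<u32>, // None = gateway node
    tags: Vec<String>,
}

/// Sketch of a role table with a staging area. In a real CRDT, `roles` and
/// `staging` would be last-writer-wins maps merged across nodes; plain
/// HashMaps are used here to keep the example self-contained.
#[derive(Default)]
struct RoleTable {
    version: u64,
    roles: HashMap<NodeId, NodeRole>,
    staging: HashMap<NodeId, Option<NodeRole>>, // None = staged removal
}

impl RoleTable {
    /// Stage a role change without applying it to the live layout.
    fn stage(&mut self, node: NodeId, role: Option<NodeRole>) {
        self.staging.insert(node, role);
    }

    /// Apply all staged changes at once and bump the version number.
    fn commit(&mut self) {
        for (node, role) in self.staging.drain() {
            match role {
                Some(r) => { self.roles.insert(node, r); }
                None => { self.roles.remove(&node); }
            }
        }
        self.version += 1;
    }

    /// Drop all staged changes, leaving the live layout untouched.
    fn revert(&mut self) {
        self.staging.clear();
    }
}
```

In this shape, the `garage layout apply` step shown later in the diff would correspond to `commit()`, while intermediate role changes only ever touch `staging`.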
lx changed title from WIP: improve how nodes roles are assigned in garage to improve how nodes roles are assigned in garage 2021-11-15 13:56:34 +00:00
quentin approved these changes 2021-11-16 10:25:06 +00:00
quentin left a comment
Owner

I reviewed your code and tested it locally on my computer.
I did not review your partition algorithm in depth, nor did I try a bigger deployment.

I have some minor remarks/questions, the main one being that the `Number of partitions that move` entry is not crystal clear to me, especially during the initialization phase.

Thanks for this PR :)

@ -5,12 +5,12 @@
- [Quick start](./quick_start/index.md)
- [Cookbook](./cookbook/index.md)
- [Production Deployment](./cookbook/real_world.md)
Owner

Interesting that you changed the order here.
I put this guide way below because I thought that it was important to master the different concepts of Garage and its deployment before considering a production deployment ;)

Author
Owner

The reasoning to me is that this is one of the most important pages in the documentation, because it's not just about "production deployments" but more generally about how to run Garage in a multi-node setup, a core feature of Garage. So it makes sense to put it on top just to make it more visible. Maybe we should rename it to "multi-node deployment" if that makes more sense.

@ -30,2 +32,4 @@
aws --endpoint-url http://127.0.0.1:3900 s3 ls
```
If a newly added gateway node seems to not be working, do a full table resync to ensure that bucket and key list are correctly propagated:
Owner

What are the conditions that require manually triggering a resync?
Is it only in case of a bug?

Author
Owner

The tables should be resynced regularly, so if you just let the nodes do their thing, it will eventually work. What can happen is that a newly added node does not receive the content of these tables before the first resync, so for some time the gateway node might be inoperable. (This is probably worth opening an issue about.)

@ -0,0 +38,4 @@
that either takes into account the proposed changes or cancels them:
```bash
garage layout apply --version <new_version_number>
Owner

For a future release, could we consider a system similar to Nomad's?
We could compute an ID for each layout, either random or a hash of the data structure.
Then, when you run apply, you pass the ID of the layout configuration you want to apply.
This ID could be obtained from garage layout show.
It would prevent the following kind of bug: unlike a version number, the ID would necessarily be bound to a specific layout.

Author
Owner

True, but we also need a way to make sure that version numbers form an increasing sequence; otherwise nodes have no way to know which version is the latest (the one they should use).

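One way to reconcile the two needs discussed above is to keep the monotonically increasing version number for ordering while also deriving a content fingerprint that apply can check. The following Rust sketch only illustrates that idea; the names, the `StagedLayout` type and the use of `DefaultHasher` are assumptions for the example, not what Garage actually does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative layout description; the real structure in Garage is richer.
#[derive(Hash)]
struct StagedLayout {
    version: u64,
    // (node id, zone, capacity) triples, kept sorted so the hash is stable
    roles: Vec<(String, String, Option<u32>)>,
}

impl StagedLayout {
    /// Short fingerprint of the staged layout, as a `show` command could print it.
    /// DefaultHasher keeps the sketch dependency-free; a real implementation
    /// would use a cryptographic hash of the serialized structure.
    fn fingerprint(&self) -> String {
        let mut h = DefaultHasher::new();
        self.hash(&mut h);
        format!("{:016x}", h.finish())
    }

    /// Apply only if both the expected version and the fingerprint match,
    /// so the operator confirms exactly the layout they reviewed.
    fn apply(&self, expected_version: u64, expected_fingerprint: &str) -> Result<(), String> {
        if self.version != expected_version {
            return Err(format!("version mismatch: staged version is {}", self.version));
        }
        if self.fingerprint() != expected_fingerprint {
            return Err("staged layout changed since it was reviewed".to_string());
        }
        Ok(()) // here the staged layout would become the live one
    }
}
```

The version number keeps giving nodes a total order to pick the latest layout, while the fingerprint ties the apply step to the exact content the operator inspected.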
@ -0,0 +164,4 @@
true
}
/// Calculate an assignation of partitions to nodes
Owner

So this is our new stateful partition algorithm, the one used when a modification is made to the cluster?

Author
Owner

Yes :)

@ -0,0 +224,4 @@
// Shuffle partitions between nodes so that nodes will reach (or better approach)
// their target number of stored partitions
loop {
let mut usefull = false;
Owner

Is there any reason you wrote useful with two `l`s (it seems to be an archaic spelling)?

Author
Owner

I don't speak English very well, sorry.

@ -0,0 +303,4 @@
}
println!("Number of partitions that move:");
for ((nminus, nplus), npart) in diffcount {
println!("\t-{}\t+{}\t{}", nminus, nplus, npart);
Owner

When I initialize a cluster, I have:

```
Number of partitions that move:
	-0	+1	256
```

I do not understand what these values mean :s

I think this line is missing at least the ID of the affected node as the first value, something like:

```
Number of partitions that move:
65edae1c553f7935…	-0	+1	256
```

And I think that the `+1` is a strange edge case of the initial algorithm; I think it should be either:

```
Number of partitions that move:
65edae1c553f7935…	-0	+256	256
```

or:

```
Number of partitions that move:
65edae1c553f7935…	-0	+0	256
```
Author
Owner

OK, I will make this part more readable and include more information.

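As a concrete illustration of the per-node display suggested in the review, here is a hedged Rust sketch that counts, for each node, how many partitions it loses and gains between the old and new assignations, and prints the node ID on each line. The types and names are assumptions for the example, and it prints one line per node rather than grouping by (lost, gained) pattern as the original code does.

```rust
use std::collections::HashMap;

type NodeId = String;
type Partition = usize;

/// For every node, count how many partitions it loses (present in the old
/// assignation but not the new one) and gains (the opposite), then print one
/// line per node with its ID.
fn print_partition_moves(
    old: &HashMap<Partition, Vec<NodeId>>,
    new: &HashMap<Partition, Vec<NodeId>>,
) {
    let mut moves: HashMap<&NodeId, (u32, u32)> = HashMap::new(); // (lost, gained)

    for (partition, new_nodes) in new {
        let old_nodes: &[NodeId] = old.get(partition).map(Vec::as_slice).unwrap_or(&[]);
        for n in new_nodes {
            if !old_nodes.contains(n) {
                moves.entry(n).or_default().1 += 1;
            }
        }
        for n in old_nodes {
            if !new_nodes.contains(n) {
                moves.entry(n).or_default().0 += 1;
            }
        }
    }

    println!("Number of partitions that move:");
    for (node, (lost, gained)) in &moves {
        println!("\t{}\t-{}\t+{}", node, lost, gained);
    }
}
```

On a freshly initialized cluster, every node would then show `-0` and a positive gained count, which directly answers the question about what the columns mean.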
@ -0,0 +317,4 @@
true
}
fn initial_partition_assignation(&self) -> Option<Vec<PartitionAss<'_>>> {
Owner

And this one is our initial partition algorithm, used when the cluster is initialized for the first time?

Author
Owner

Yes, it is used:

  • when the cluster is initialized for the first time
  • when a node is removed, to re-assign new nodes to the partitions that the old node stored, even if this assignation is not well balanced

Basically, we use this function to ensure that we have an initial assignation of three (or n) nodes to each partition. Once we have it, we just run an iterative optimization algorithm (the other function, above) that tries to better balance the number of partitions between nodes by doing elementary operations that consist only of replacing one node by another somewhere in the assignation.

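To make the two-phase approach described above easier to follow, here is a simplified Rust sketch: a naive initial assignation that gives each partition its full set of nodes, followed by an iterative pass that replaces single nodes to move every node closer to its target partition count. The structure and names are an illustration of the idea only; the real algorithm also accounts for zones, capacities and the previous assignation.

```rust
use std::collections::HashMap;

/// Round-robin initial assignation: give each partition `replication` nodes,
/// ignoring zones and capacities for simplicity.
/// Assumes nodes.len() >= replication so the chosen nodes are distinct.
fn initial_assignation(nodes: &[u64], n_partitions: usize, replication: usize) -> Vec<Vec<u64>> {
    (0..n_partitions)
        .map(|p| (0..replication).map(|r| nodes[(p + r) % nodes.len()]).collect())
        .collect()
}

/// Count how many partitions each node currently stores.
fn load(assignation: &[Vec<u64>]) -> HashMap<u64, usize> {
    let mut count = HashMap::new();
    for part in assignation {
        for &n in part {
            *count.entry(n).or_insert(0) += 1;
        }
    }
    count
}

/// Iterative optimization: repeatedly replace one node by another in a single
/// partition when that moves both nodes closer to the average load, and stop
/// when no such elementary replacement is useful anymore.
fn optimize(assignation: &mut [Vec<u64>], nodes: &[u64]) {
    let target = assignation.len() * assignation[0].len() / nodes.len();
    loop {
        let mut useful = false;
        let mut count = load(assignation);
        for part in assignation.iter_mut() {
            for slot in 0..part.len() {
                let current = part[slot];
                if count[&current] <= target {
                    continue; // this node is not overloaded
                }
                // Find an underloaded node not already present in this partition.
                let candidate = nodes
                    .iter()
                    .copied()
                    .find(|&n| *count.get(&n).unwrap_or(&0) < target && !part.contains(&n));
                if let Some(candidate) = candidate {
                    part[slot] = candidate;
                    *count.get_mut(&current).unwrap() -= 1;
                    *count.entry(candidate).or_insert(0) += 1;
                    useful = true;
                }
            }
        }
        if !useful {
            break;
        }
    }
}
```

This mirrors the structure described in the reply: first ensure every partition has a full set of nodes, then improve the balance through elementary single-node replacements until no further improvement is found.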
@ -0,0 +391,4 @@
Some(partitions)
}
Owner

We might want to put the pseudo-code of these two partition computation algorithms in the design page.
Ideally it would be someone other than you, LX, who writes it: that would allow this part to be proof-read more in depth and spread the knowledge a bit more :)

Author
Owner

True

@ -6,3 +6,4 @@
license = "AGPL-3.0"
description = "Utility crate for the Garage object store"
repository = "https://git.deuxfleurs.fr/Deuxfleurs/garage"
readme = "../../README.md"
Owner

:P

lx force-pushed node-configure from 971c5ca66b to 4752046990 2021-11-16 14:39:46 +00:00 Compare
lx force-pushed node-configure from 4752046990 to cd378622b4 2021-11-16 14:42:21 +00:00 Compare
lx force-pushed node-configure from cd378622b4 to 3685bd91e9 2021-11-16 14:45:11 +00:00 Compare
lx force-pushed node-configure from 3685bd91e9 to a3871f2251 2021-11-16 14:45:51 +00:00 Compare
lx force-pushed node-configure from a3871f2251 to c94406f428 2021-11-16 15:06:00 +00:00 Compare
lx merged commit c94406f428 into main 2021-11-16 15:33:23 +00:00