improve how nodes roles are assigned in garage #152

Merged
lx merged 1 commit from node-configure into main 2021-11-16 15:33:23 +00:00
Owner
- [x] change the terminology: the network configuration becomes the role table, and the configuration of a node becomes the node's role
- [x] the modification of the role table takes place in two steps: first, changes are staged in a CRDT data structure; then, once the user is happy with the changes, they can commit them all at once (or revert them) (see the sketch below)
- [x] update documentation
- [x] fix tests
- [x] implement smarter partition assignation algorithm
- [ ] improve re-assignation algorithm when using `--replace`
- [x] improve the display of how many partitions are displaced

Currently, this PR breaks the format of the network configuration: when migrating, the cluster will be in a state where no roles are assigned. All roles must be re-assigned and committed at once. This migration should not pose an issue.

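A minimal Rust sketch of the staged-then-committed workflow described in the second checklist item; the `RoleTable` type and its `stage`/`commit`/`revert` methods are illustrative assumptions (a plain `HashMap` stands in for the CRDT maps), not the actual Garage data structures.

```rust
use std::collections::HashMap;

/// Hypothetical identifier and role types, for illustration only.
type NodeId = String;

#[derive(Clone, Debug, PartialEq)]
struct NodeRole {
    zone: String,
    capacity: Option<u32>, // None = gateway node
    tags: Vec<String>,
}

/// Sketch of a role table with a staging area. In a real CRDT, `roles` and
/// `staging` would be last-writer-wins maps merged across nodes; plain
/// HashMaps are used here to keep the example self-contained.
#[derive(Default)]
struct RoleTable {
    version: u64,
    roles: HashMap<NodeId, NodeRole>,
    staging: HashMap<NodeId, Option<NodeRole>>, // None = staged removal
}

impl RoleTable {
    /// Stage a role change without applying it to the live layout.
    fn stage(&mut self, node: NodeId, role: Option<NodeRole>) {
        self.staging.insert(node, role);
    }

    /// Apply all staged changes at once and bump the version number.
    fn commit(&mut self) {
        for (node, role) in self.staging.drain() {
            match role {
                Some(r) => { self.roles.insert(node, r); }
                None => { self.roles.remove(&node); }
            }
        }
        self.version += 1;
    }

    /// Drop all staged changes, leaving the live layout untouched.
    fn revert(&mut self) {
        self.staging.clear();
    }
}
```

In this shape, the `garage layout apply` step shown later in the diff would correspond to `commit()`, while intermediate role changes only ever touch `staging`.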
lx changed title from WIP: improve how nodes roles are assigned in garage to improve how nodes roles are assigned in garage 2021-11-15 13:56:34 +00:00
quentin approved these changes 2021-11-16 10:25:06 +00:00
quentin left a comment
Owner

I reviewed your code and tested it locally on my computer.
I did not review your partition algorithm in depth, nor did I try a bigger deployment.

I have some minor remarks/questions, the main one being that the `Number of partitions that move` entry is not crystal clear to me, especially during the initialization phase.

Thanks for this PR :)

@ -5,12 +5,12 @@
- [Quick start](./quick_start/index.md)
- [Cookbook](./cookbook/index.md)
- [Production Deployment](./cookbook/real_world.md)
Owner

Interesting that you changed the order here.
I put this guide way below because I thought that it was important to master the different concepts of Garage and its deployment before considering a production deployment ;)

Author
Owner

The reasoning to me is that this is one of the most important pages in the documentation, because it's not just about "production deployments" but more generally about how to run Garage in a multi-node setup, a core feature of Garage. So it makes sense to put it on top just to make it more visible. Maybe we should rename it to "multi-node deployment" if that makes more sense.

@ -30,2 +32,4 @@
aws --endpoint-url http://127.0.0.1:3900 s3 ls
```
If a newly added gateway node seems to not be working, do a full table resync to ensure that bucket and key list are correctly propagated:
Owner

What are the conditions that require manually triggering a resync?
Is it only in case of a bug?

Author
Owner

The tables should be resynced regularly, so if you just let the nodes do their thing, it will eventually work. What can happen is that a newly added node does not receive the content of these tables before the first resync, so for some time the gateway node might be inoperable. (This is probably worth opening an issue about.)

@ -0,0 +38,4 @@
that either takes into account the proposed changes or cancels them:
```bash
garage layout apply --version <new_version_number>
Owner

For a future release, could we consider a system similar to Nomad's?
We could compute an ID for each layout, either random or a hash of the data structure.
Then, when you run apply, you pass the ID of the layout configuration you want to apply.
This ID could be obtained from garage layout show.
It would prevent the following kind of bug: unlike a version number, the ID would necessarily be bound to a specific layout.

Author
Owner

True, but we also need a way to make sure that version numbers form an increasing sequence; otherwise nodes have no way to know which version is the latest (the one they should use).

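One way to reconcile the two needs discussed above is to keep the monotonically increasing version number for ordering while also deriving a content fingerprint that apply can check. The following Rust sketch only illustrates that idea; the names, the `StagedLayout` type and the use of `DefaultHasher` are assumptions for the example, not what Garage actually does.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative layout description; the real structure in Garage is richer.
#[derive(Hash)]
struct StagedLayout {
    version: u64,
    // (node id, zone, capacity) triples, kept sorted so the hash is stable
    roles: Vec<(String, String, Option<u32>)>,
}

impl StagedLayout {
    /// Short fingerprint of the staged layout, as a `show` command could print it.
    /// DefaultHasher keeps the sketch dependency-free; a real implementation
    /// would use a cryptographic hash of the serialized structure.
    fn fingerprint(&self) -> String {
        let mut h = DefaultHasher::new();
        self.hash(&mut h);
        format!("{:016x}", h.finish())
    }

    /// Apply only if both the expected version and the fingerprint match,
    /// so the operator confirms exactly the layout they reviewed.
    fn apply(&self, expected_version: u64, expected_fingerprint: &str) -> Result<(), String> {
        if self.version != expected_version {
            return Err(format!("version mismatch: staged version is {}", self.version));
        }
        if self.fingerprint() != expected_fingerprint {
            return Err("staged layout changed since it was reviewed".to_string());
        }
        Ok(()) // here the staged layout would become the live one
    }
}
```

The version number keeps giving nodes a total order to pick the latest layout, while the fingerprint ties the apply step to the exact content the operator inspected.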
@ -0,0 +164,4 @@
true
}
/// Calculate an assignation of partitions to nodes
Owner

So this is our new stateful partition algorithm, the one used when a modification is made to the cluster?

Author
Owner

Yes :)

@ -0,0 +224,4 @@
// Shuffle partitions between nodes so that nodes will reach (or better approach)
// their target number of stored partitions
loop {
let mut usefull = false;
Owner

Is there any reason you wrote useful with two `l`s (it seems to be an archaic spelling)?

Author
Owner

I don't speak English very well, sorry.

@ -0,0 +303,4 @@
}
println!("Number of partitions that move:");
for ((nminus, nplus), npart) in diffcount {
println!("\t-{}\t+{}\t{}", nminus, nplus, npart);
Owner

When I initialize a cluster, I have:

```
Number of partitions that move:
	-0	+1	256
```

I do not understand what these values mean :s

I think this line is missing at least the ID of the affected node as the first value, something like:

```
Number of partitions that move:
65edae1c553f7935…	-0	+1	256
```

And I think that the `+1` is a strange edge case of the initial algorithm; I think it should be either:

```
Number of partitions that move:
65edae1c553f7935…	-0	+256	256
```

or:

```
Number of partitions that move:
65edae1c553f7935…	-0	+0	256
```
Author
Owner

OK, I will make this part more readable and include more information.

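As a concrete illustration of the per-node display suggested in the review, here is a hedged Rust sketch that counts, for each node, how many partitions it loses and gains between the old and new assignations, and prints the node ID on each line. The types and names are assumptions for the example, and it prints one line per node rather than grouping by (lost, gained) pattern as the original code does.

```rust
use std::collections::HashMap;

type NodeId = String;
type Partition = usize;

/// For every node, count how many partitions it loses (present in the old
/// assignation but not the new one) and gains (the opposite), then print one
/// line per node with its ID.
fn print_partition_moves(
    old: &HashMap<Partition, Vec<NodeId>>,
    new: &HashMap<Partition, Vec<NodeId>>,
) {
    let mut moves: HashMap<&NodeId, (u32, u32)> = HashMap::new(); // (lost, gained)

    for (partition, new_nodes) in new {
        let old_nodes: &[NodeId] = old.get(partition).map(Vec::as_slice).unwrap_or(&[]);
        for n in new_nodes {
            if !old_nodes.contains(n) {
                moves.entry(n).or_default().1 += 1;
            }
        }
        for n in old_nodes {
            if !new_nodes.contains(n) {
                moves.entry(n).or_default().0 += 1;
            }
        }
    }

    println!("Number of partitions that move:");
    for (node, (lost, gained)) in &moves {
        println!("\t{}\t-{}\t+{}", node, lost, gained);
    }
}
```

On a freshly initialized cluster, every node would then show `-0` and a positive gained count, which directly answers the question about what the columns mean.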
@ -0,0 +317,4 @@
true
}
fn initial_partition_assignation(&self) -> Option<Vec<PartitionAss<'_>>> {
Owner

And this one is our initial partition algorithm, used when the cluster is initialized for the first time?

Author
Owner

Yes, it is used:

  • when the cluster is initialized for the first time
  • when a node is removed, to re-assign new nodes to the partitions that the old node stored, even if this assignation is not well balanced

Basically, we use this function to ensure that we have an initial assignation of three (or n) nodes to each partition. Once we have it, we just run an iterative optimization algorithm (the other function, above) that tries to better balance the number of partitions between nodes by doing elementary operations that consist only of replacing one node by another somewhere in the assignation.

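To make the two-phase approach described above easier to follow, here is a simplified Rust sketch: a naive initial assignation that gives each partition its full set of nodes, followed by an iterative pass that replaces single nodes to move every node closer to its target partition count. The structure and names are an illustration of the idea only; the real algorithm also accounts for zones, capacities and the previous assignation.

```rust
use std::collections::HashMap;

/// Round-robin initial assignation: give each partition `replication` nodes,
/// ignoring zones and capacities for simplicity.
/// Assumes nodes.len() >= replication so the chosen nodes are distinct.
fn initial_assignation(nodes: &[u64], n_partitions: usize, replication: usize) -> Vec<Vec<u64>> {
    (0..n_partitions)
        .map(|p| (0..replication).map(|r| nodes[(p + r) % nodes.len()]).collect())
        .collect()
}

/// Count how many partitions each node currently stores.
fn load(assignation: &[Vec<u64>]) -> HashMap<u64, usize> {
    let mut count = HashMap::new();
    for part in assignation {
        for &n in part {
            *count.entry(n).or_insert(0) += 1;
        }
    }
    count
}

/// Iterative optimization: repeatedly replace one node by another in a single
/// partition when that moves both nodes closer to the average load, and stop
/// when no such elementary replacement is useful anymore.
fn optimize(assignation: &mut [Vec<u64>], nodes: &[u64]) {
    let target = assignation.len() * assignation[0].len() / nodes.len();
    loop {
        let mut useful = false;
        let mut count = load(assignation);
        for part in assignation.iter_mut() {
            for slot in 0..part.len() {
                let current = part[slot];
                if count[&current] <= target {
                    continue; // this node is not overloaded
                }
                // Find an underloaded node not already present in this partition.
                let candidate = nodes
                    .iter()
                    .copied()
                    .find(|&n| *count.get(&n).unwrap_or(&0) < target && !part.contains(&n));
                if let Some(candidate) = candidate {
                    part[slot] = candidate;
                    *count.get_mut(&current).unwrap() -= 1;
                    *count.entry(candidate).or_insert(0) += 1;
                    useful = true;
                }
            }
        }
        if !useful {
            break;
        }
    }
}
```

This mirrors the structure described in the reply: first ensure every partition has a full set of nodes, then improve the balance through elementary single-node replacements until no further improvement is found.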
@ -0,0 +391,4 @@
Some(partitions)
}
Owner

We might want to put the pseudo-code of these two partition computation algorithms in the design page.
Ideally it would be someone other than you, LX, who writes it: that would allow this part to be proof-read more in depth and spread the knowledge a bit more :)

Author
Owner

True

@ -6,3 +6,4 @@
license = "AGPL-3.0"
description = "Utility crate for the Garage object store"
repository = "https://git.deuxfleurs.fr/Deuxfleurs/garage"
readme = "../../README.md"
Owner

:P

lx force-pushed node-configure from 971c5ca66b to 4752046990 2021-11-16 14:39:46 +00:00 Compare
lx force-pushed node-configure from 4752046990 to cd378622b4 2021-11-16 14:42:21 +00:00 Compare
lx force-pushed node-configure from cd378622b4 to 3685bd91e9 2021-11-16 14:45:11 +00:00 Compare
lx force-pushed node-configure from 3685bd91e9 to a3871f2251 2021-11-16 14:45:51 +00:00 Compare
lx force-pushed node-configure from a3871f2251 to c94406f428 2021-11-16 15:06:00 +00:00 Compare
lx merged commit c94406f428 into main 2021-11-16 15:33:23 +00:00