# WIP: Optimal layout assignation algorithm #296

Open

lx wants to merge 31 commits from `optimal-layout` into `main`


Change the way new layout assignations are computed.

The function now computes an optimal assignation (with respect to partition size) that minimizes the distance to the former assignation, using flow algorithms.

This commit was written by Mendes Oulamara mendes.oulamara@pm.me

Hi @Mendes, thanks a lot for your amazing contribution!

In order to make your PR more approachable, would you agree to share some high-level information about your algorithm in a comment of this PR?

For example, what are the state-of-the-art algorithms/data structures/results/mathematical concepts that you used? I have seen a bipartite graph and "Dinic's max flow algorithm" in the comments, for example; do you use other ones? How do they interact together?

Based on this state of the art, could you also shortly say how you leveraged them to build the final algorithm, and how you split the original layout assignation problem into smaller problems? For example, LX identified 3 major components: 1. assigning partitions to zones, 2. assigning zone partitions to nodes, and 3. minimizing the amount of rebalancing between the old and new layout by detecting cycles in this bipartite graph. Could you confirm that this is how your algorithm works? Can you detail what each component does?

From an engineering perspective, being able to reflect and isolate these different concerns in the code would help us understand the code, test it, and feel confident enough to maintain it!

I only quickly reviewed your file `src/util/bipartite.rs` and have some suggestions to illustrate this proposition.

First, it seems this file contains logic for 2 different concerns, if I understand correctly. The first is a bipartite graph built with the `EdgeFlow` structure that serves for the zone assignation. The second is a bipartite graph built with the `WeightedEdge` structure. I would say that things that go together should be grouped together, so each structure should be next to the functions that use it. I would even say they should be declared in different files.

Next, focusing on one of your functions, like `optimize_matching`, as an engineer my first thought is that your functions are long! This is not a problem per se, but it is often good to question why; see for example the Rule of 30 and Applications of Cyclomatic Complexity. Do not apply them strictly to Rust: I think it starts being a problem after 60/80 lines, and there are always exceptions. If we take the ~30 first lines, they are here to build your bipartite graph. Instead of putting this logic here, you could 1) define a structure for your bipartite graph outside of this function that contains the vector, `nb_left` and `nb_right`, and 2) define a `new` or `build` function that encapsulates these 30 initialization lines. Then you could write a function for the Bellman-Ford algorithm, and maybe a function that outputs the optimal matching from the graph. Your `positive_cycle` function can be attached to your graph structure as a method too. Your root function `optimize_matching` would then be quite simple.

What I have described here is an OOP approach to solve the problem of modularization and encapsulation. In my opinion, such properties help make the code more readable by forcing the author to name and split things; of course, if names are meaningless and the code is arbitrarily split, this is useless. This last part is just an example and probably not the best way to do it, especially because I am not familiar with the concepts. My example is just here to convey an intent; other programming paradigms are possible and great, and like all approaches it should not be abused.
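A sketch of the decomposed structure described above (all names, fields and signatures are illustrative, not the actual Garage code):

```rust
// Illustrative sketch only: `Bipartite`, `WeightedEdge`, etc. are
// assumed from the review discussion, not taken from the real code.
#[derive(Clone, Copy)]
struct WeightedEdge {
    w: i32,
    u: usize,
    v: usize,
}

struct Bipartite {
    nb_left: usize,
    nb_right: usize,
    edges: Vec<WeightedEdge>,
}

impl Bipartite {
    /// Encapsulates the ~30 initialization lines of the original function.
    fn new(nb_left: usize, nb_right: usize) -> Self {
        let edges = (0..nb_left)
            .flat_map(|u| {
                (0..nb_right).map(move |j| WeightedEdge { w: 0, u, v: nb_left + j })
            })
            .collect();
        Bipartite { nb_left, nb_right, edges }
    }

    /// Placeholder for the Bellman-Ford based cycle search of the real code.
    fn positive_cycle(&self) -> Option<Vec<usize>> {
        None
    }
}

/// The root function becomes a short orchestration of the pieces above.
fn optimize_matching(nb_left: usize, nb_right: usize) -> Bipartite {
    let graph = Bipartite::new(nb_left, nb_right);
    while let Some(_cycle) = graph.positive_cycle() {
        // swap edges along the cycle to improve the matching
    }
    graph
}
```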

This looks extremely promising and I'm hyped to have this in Garage soon :)

I think we identified three high-level concerns, in addition to the comments I made in the code below:

- error reporting (and general reporting of what is happening)
- testing on small toy examples
- clarity of the code

**Error reporting.** For now, this code uses asserts, unwraps and panics in many places. The consistency of the results is checked at several steps using assertions, which is good. However, if we want to report to the user what is going wrong, it would be better if all of these functions could return `Result` types, maybe with a custom `Error` enum for all of the different kinds of errors that can happen at any step, or just using `String` as an error type. Globally, this means the following transformations:

- replacing `panic!()` by `return Err(format!("my error message"))`
- replacing `assert!(x)` by `if !x { return Err(format!("my error message")) }`
- replacing `some_result.unwrap()` by `some_result?`
- and `some_option.unwrap()` by `some_option.ok_or_else(|| format!("my error message"))?`

We should take this opportunity to write explicit error messages that try to tell precisely what is going wrong. If it's a situation that isn't supposed to happen (a failed assert for instance), the error message should clearly say so.
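A hypothetical illustration of these transformations together; the function name and logic are made up for the example, not taken from the patch:

```rust
// Hypothetical example of the suggested `Result`-based style; NOT the
// actual Garage code.
fn assign_node(nod: Option<usize>, node_capacity: &[u32]) -> Result<usize, String> {
    // was: let id = nod.unwrap();
    let id = nod.ok_or_else(|| {
        "no node available for this partition (this is a bug)".to_string()
    })?;
    // was: assert!(id < node_capacity.len());
    if id >= node_capacity.len() {
        return Err(format!("node id {} out of range (this is a bug)", id));
    }
    Ok(id)
}
```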

**General reporting of what is happening.** In the previous version, when a layout change was calculated, a list of how many partitions were moved between nodes was printed on screen by the algorithm that does the calculation, so that the sysadmin could predict how much data would be moved on the network. Having this information is important, and it is missing in the new version. Moreover, instead of printing this information on the console, we would ultimately need to have it returned in a struct of some kind, so that it can also be returned to users that are calling this algorithm over the network.

**Testing on small toy examples.** We need a high level of assurance that this algorithm actually does what it's expected to do. To do so, we need to run it on many real-world cluster configurations, in the various settings that may appear in real-world deployments:

- small clusters in only 1 zone, or bigger ones
- small clusters in only 2 zones, or bigger ones
- clusters in 3 zones
- clusters in 4 or more zones

We need a way to run the algorithm on such small toy examples and print the results somewhere, so that we can check that the way capacities are assigned to different nodes corresponds to what we would intuitively expect (i.e. verify that the equations implemented here actually correspond to what we want). I think we won't be convinced to integrate this work until we have such toy visualizations to convince ourselves that everything is working :)

**Clarity of the code.** What the code is doing is not always very clear. Comments are present, but a high-level view is sometimes missing. I like to have, at the beginning of a function, a high-level comment (that can be quite long) that describes what the function does and the steps it takes to do so. It would also be nice to have comments for data structures and intermediate variables that explain the meaning of the different possible values stored in them (e.g. if we have a vector whose values are positions in another vector, it would be nice to have a comment saying so).

```rust
//We compute the optimal number of partition to assign to
//every node and zone.
if let Some((part_per_nod, part_per_zone)) = self.optimal_proportions() {
```

A whole indentation level for everything that follows can be removed by doing:
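For instance, a sketch assuming the function is converted to return a `Result` as suggested earlier (the stand-in `optimal_proportions` below is illustrative; names come from the quoted snippet):

```rust
// Illustrative stand-in for the real method; names from the quoted snippet.
fn optimal_proportions() -> Option<(Vec<usize>, Vec<usize>)> {
    Some((vec![1], vec![2]))
}

fn assign() -> Result<(), String> {
    // Instead of wrapping the whole body in
    // `if let Some((part_per_nod, part_per_zone)) = ... { ... }`,
    // unwrap early with `?` and keep the body at the top level:
    let (part_per_nod, part_per_zone) = optimal_proportions()
        .ok_or_else(|| "optimal_proportions() returned None".to_string())?;
    // ... the rest of the function stays un-indented here
    let _ = (part_per_nod, part_per_zone);
    Ok(())
}
```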

```rust
self.ring_assignation_data.push(id as CompactNodeType);
} else {
panic!()
}
```

The if-else block can be written more simply as:
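For instance, a minimal sketch assuming the original block was `if let Some(id) = nod { push } else { panic!() }` (types and surrounding names are stand-ins from the quoted snippet):

```rust
type CompactNodeType = u8;

fn push_assignation(nod: Option<usize>, ring_assignation_data: &mut Vec<CompactNodeType>) {
    // The if/else collapses into one line; `.expect` keeps the panic
    // but with an explicit error message when `nod` is `None`:
    let id = nod.expect("no node was assigned to this partition");
    ring_assignation_data.push(id as CompactNodeType);
}
```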

You can also use `.expect("message")` to add a custom error message when `nod` is `None`.

```rust
///This function compute the number of partition to assign to
///every node and zone, so that every partition is replicated
///self.replication_factor times and the capacity of a partition
///is maximized.
```

If we have nodes in only 2 zones but the replication factor is 3, does this do the trick of adding a "ghost zone" that takes some of the capacity?

And what if we have nodes in only 1 zone?

```rust
///every node and zone, so that every partition is replicated
///self.replication_factor times and the capacity of a partition
///is maximized.
fn optimal_proportions(&mut self) -> Option<(Vec<usize>, HashMap<String, usize>)> {
```

This whole function is very obscure to me :( I tried to put some comments but I'm not sure I understood everything. Probably a long comment at the beginning with the high-level math would help.

```rust
);
} else {
zone_capacity.insert(node_zone[i].clone(), node_capacity[i]);
}
```

More concise and idiomatic version:
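One idiomatic possibility (a guess at the intent, assuming the original `if`/`else` distinguishes present and absent keys in a `HashMap`): the `entry` API covers both cases in one statement.

```rust
use std::collections::HashMap;

// Assumed shapes, for illustration only:
fn add_capacity(zone_capacity: &mut HashMap<String, u32>, zone: &str, capacity: u32) {
    // Replaces the `contains_key` / `insert` if-else pattern:
    *zone_capacity.entry(zone.to_string()).or_insert(0) += capacity;
}
```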

```rust
),
)
})
.collect();
```

Not convinced this works: if I have one very big zone and two smaller zones, the largest one will be capped to exactly nb_partitions, but the smaller ones will have less than nb_partitions. But the only correct solution would have been to have exactly nb_partitions partitions in each zone. Am I correct?

```rust
.keys()
.filter(|k| part_per_zone[*k] < nb_partitions)
.map(|k| zone_capacity[k])
.sum();
```

Rewrite as:
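A hedged guess at the kind of rewrite meant here: iterate the map's entries once instead of looking each key back up (shapes of the maps are assumptions):

```rust
use std::collections::HashMap;

// Assumed types, for illustration only:
fn remaining_capacity(
    zone_capacity: &HashMap<String, u32>,
    part_per_zone: &HashMap<String, u32>,
    nb_partitions: u32,
) -> u32 {
    zone_capacity
        .iter()
        .filter(|(zone, _)| part_per_zone[*zone] < nb_partitions)
        .map(|(_, capacity)| *capacity)
        .sum()
}
```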

```rust
rf == cl.ring_assignation_data[rf * i..rf * (i + 1)]
    .iter()
    .map(|nod| node_zone[*nod as usize].clone())
    .unique()
```

ALERT ALERT ALERT: I think this is wrong: `.unique()` only works on a sorted iterator, because it only de-duplicates consecutive items! If I have an iterator that gives me A, B, A, then `.unique().count()` will be 3, but this is clearly not how we want to count here.

```rust
.filter(|x| **x == i as u8)
.count()
})
.collect::<Vec<_>>();
```

I think this can be made linear complexity instead of n * m by doing:
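One possible shape of the linear version, assuming the original computes, for each node `i`, how many entries of the assignation array equal `i` (function name is illustrative):

```rust
fn count_partitions_per_node(ring_assignation_data: &[u8], nb_nodes: usize) -> Vec<usize> {
    // One pass over the data (O(n)) instead of one `.filter().count()`
    // per node (O(n * m)):
    let mut node_nb_part = vec![0usize; nb_nodes];
    for nod in ring_assignation_data {
        node_nb_part[*nod as usize] += 1;
    }
    node_nb_part
}
```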

I didn't put comments everywhere, but I think I saw other places with double nested iterators (n * m complexity); are there any others that can be rewritten to be linear?

Note that `a.iter().filter(|x| b.contains(x))` is such a double nested iterator with n * m complexity; depending on the context, it might be possible to rewrite it as above.

```rust
let mut part = PartitionAss::new();
for node_i in self.ring_assignation_data
    [i * self.replication_factor..(i + 1) * self.replication_factor]
let zone_vec = node_zone.iter().unique().collect::<Vec<_>>();
```

ALERT ALERT ALERT: Same remark here on `.unique()` (see above). What has to be done instead is to de-duplicate without relying on the order of the items.

```rust
.filter(|x| node_zone[**x as usize] == **z)
.count()
})
.collect::<Vec<_>>();
```
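For the `.unique()` alerts, one order-independent alternative is to collect into a `HashSet` (a sketch; the function name is illustrative):

```rust
use std::collections::HashSet;

// Counts distinct zones regardless of the order in which they appear,
// unlike a consecutive-only de-duplication:
fn nb_distinct_zones(node_zone: &[String]) -> usize {
    node_zone.iter().collect::<HashSet<_>>().len()
}
```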

This is a nested loop with n * m complexity. Fortunately, zone_vec is very small. However for cache locality it would have been better to rewrite it the other way:
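A hedged sketch of the inverted loop, assuming the goal is to count, for each zone, the nodes of the partition that live in it (names are illustrative):

```rust
use std::collections::HashMap;

fn nodes_per_zone(part_nodes: &[usize], node_zone: &[String]) -> HashMap<String, usize> {
    // Single pass over the nodes, incrementing the count of their zone,
    // instead of scanning all nodes once per zone:
    let mut counts = HashMap::new();
    for nod in part_nodes {
        *counts.entry(node_zone[*nod].clone()).or_insert(0) += 1;
    }
    counts
}
```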

This is probably a very tiny optimization and not important at all. In fact, the other version has other advantages, such as not necessitating the use of `.unwrap()` and, depending on taste, being more readable (I don't necessarily agree).

```rust
assert!(
    node_capacity[idmin] * (node_nb_part[idnew] as u32 + 1)
        >= node_capacity[idnew] * node_nb_part[idmin] as u32
);
```

I think we should avoid writing `if /* a very long multiline condition */ { /* sth */ }` if we can. Here I would give a temporary name to the value `(0..nb_nodes).filter(...).max_by(...)` and then just do `if let Some(idnew) = tempvalue { ... }` (don't just call it `tempvalue` though, that's an example).

```rust
if let Some(idnew) = node_of_z_iter.min_by(|i, j| {
    (node_capacity[*i] * (node_nb_part[*j] as u32 + 1))
        .cmp(&(node_capacity[*j] * (node_nb_part[*i] as u32 + 1)))
}) {
```

Same here, the value matched in the `if` condition should probably have its own `let` binding above.

```rust
* This module deals with graph algorithm in complete bipartite
* graphs. It is used in layout.rs to build the partition to node
* assignation.
* */
```

For a documentation comment that concerns the whole module, use the following syntax:
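Inner doc comments use `//!` and attach to the enclosing module or crate, so `cargo doc` renders them on the module's page. Reusing the quoted text (the trailing function is just a placeholder item so the snippet compiles on its own):

```rust
//! This module deals with graph algorithms in complete bipartite
//! graphs. It is used in layout.rs to build the partition to node
//! assignation.

// A `//!` comment must appear before any item in the file; it
// documents the module itself, not the item that follows it.
pub fn placeholder() -> u32 {
    1
}
```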

```rust
c: i32,
flow: i32,
v: usize,
rev: usize,
```

What are the meanings of the different fields? Add them as comments.

```rust
struct WeightedEdge {
    w: i32,
    u: usize,
    v: usize,
```

Same

```rust
* complete bipartite graph. It returns a matching that has the
* same degree as new_match at every vertex, and that is as close
* as possible to old_match.
* */
```

For a doc comment attached to the item that follows it, use the following syntax:
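Outer doc comments use `///` and document the item that follows them. Reusing the quoted text over a hypothetical signature (illustrative only; the real one differs):

```rust
/// Computes a maximal matching in the complete bipartite graph.
/// It returns a matching that has the same degree as `new_match`
/// at every vertex, and that is as close as possible to `old_match`.
fn optimize_matching(old_match: &[Vec<usize>], new_match: &[Vec<usize>]) -> Vec<Vec<usize>> {
    // Placeholder body; the real implementation lives in bipartite.rs.
    let _ = old_match;
    new_match.to_vec()
}
```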

When using `cargo doc` to generate crate documentation, these comments will then appear as documentation text. (Other comments might also be concerned by this remark.)

```rust
for j in 0..nb_right {
    edge_vec[i * nb_right + j].u = i;
    edge_vec[i * nb_right + j].v = nb_left + j;
}
```

I think this whole loop would be clearer as:
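A sketch using `chunks_mut` and `enumerate`, so no index arithmetic is needed (the `Edge` struct shape is assumed from the snippet):

```rust
struct Edge {
    u: usize,
    v: usize,
}

fn link_edges(edge_vec: &mut [Edge], nb_left: usize, nb_right: usize) {
    // Each chunk of `nb_right` edges belongs to one left vertex `i`,
    // so no `i * nb_right + j` arithmetic is required:
    for (i, row) in edge_vec.chunks_mut(nb_right).enumerate() {
        for (j, edge) in row.iter_mut().enumerate() {
            edge.u = i;
            edge.v = nb_left + j;
        }
    }
}
```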

(Each time we can avoid calculating array indices as `a * x + b`, it is a good thing.)

```rust
graph[1][i].c = *c as i32;
graph[1][i].v = i + 2 + left_cap_vec.len();
graph[1][i].rev = 0;
}
```

This might be a stretch, but instead of using integers that have special meanings (0 = source, 1 = sink, etc) as vertex identifiers, why not use an enum as the following?
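For example (a sketch; the variant names are illustrative), with derives giving the type the same total order and hashability as an integer:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
enum Vertex {
    /// The flow source (was the special integer 0).
    Source,
    /// The flow sink (was the special integer 1).
    Sink,
    /// Left-side vertex (e.g. a partition), with its index.
    Left(usize),
    /// Right-side vertex (e.g. a node), with its index.
    Right(usize),
}
```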

Values of this kind behave exactly the same as integers thanks to the derive statement, so the flow algorithm's body doesn't need to be changed. It's just the end of the algorithm that needs to change, when we reconstruct the association table from the flow graph, where we need to match on the two sides of each edge, but that's probably quite simple to do.

Using such a strategy would reduce by A LOT the risk of doing something wrong with vertex identifiers, and would also make the code a lot more elegant.

This could also be done for the bipartite cycle optimization algorithm, with a vertex type as follows:

The edge types could be parametric on the vertex type:
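A sketch of both ideas together: a vertex type for the cycle optimization, and the edge structs made generic over the vertex type (fields taken from the structs quoted earlier; the rest is illustrative):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum MatchVertex {
    Left(usize),
    Right(usize),
}

// The edge structs keep their fields but become parametric in the
// vertex identifier type:
struct EdgeFlow<V> {
    c: i32,
    flow: i32,
    v: V,
    rev: usize,
}

struct WeightedEdge<V> {
    w: i32,
    u: V,
    v: V,
}
```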

```rust
}
//otherwise, we send flow to nbd.
lifo.push_back((graph[id][nbd].v, new_flow));
}
```

I'll trust you that Dinic's flow algorithm is correct here ;)
