Optimal layout assignation algorithm #296
Change the way new layout assignations are computed.
The function now computes an optimal assignation (with respect to partition size) that minimizes the distance to the former assignation, using flow algorithms.
This commit was written by Mendes Oulamara mendes.oulamara@pm.me
Hi @Mendes, thanks a lot for your amazing contribution!
In order to make your PR more approachable, would you agree to share some high-level information about your algorithm in a comment of this PR?
For example, what are the state-of-the-art algorithms/data structures/results/mathematical concepts that you used? I have seen a bipartite graph and "Dinic's max flow algorithm" in the comments, for example; do you use other ones? How do they interact together?
Based on this state of the art, could you also briefly say how you leveraged them to build the final algorithm? How did you split the original layout assignation problem into smaller problems? For example, LX identified 3 major components: 1. assigning partitions to zones, 2. assigning zone partitions to nodes, and 3. minimizing the amount of rebalancing between the old and new layout by detecting cycles in this bipartite graph. Could you confirm that this is how your algorithm works? Can you detail what each component does?
From an engineering perspective, being able to reflect and isolate these different concerns in the code would help us understand the code, test it, and feel confident enough to maintain it!
I only quickly reviewed your file src/util/bipartite.rs and have some suggestions to illustrate this proposition. First, it seems this file contains logic on 2 different concerns, if I understand correctly. The first is for a bipartite graph built with the `EdgeFlow` structure, which serves for the zone assignation. The second is for a bipartite graph built with the `WeightedEdge` structure. I would say that things that go together must be grouped together, so each structure should be next to the functions that use it. I would even say they should be declared in different files.

Next, focusing on one of your functions, like `optimize_matching`, as an engineer my first thought is that your functions are long! This is not a problem per se, but it is often good to question why; see for example Rule of 30 and Applications of Cyclomatic Complexity. Do not apply them strictly to Rust: I think it starts being a problem after 60/80 lines, and there are always exceptions. If we take the first ~30 lines, they are there to build your bipartite graph. Instead of putting this logic here, you could 1) define a structure for your bipartite graph outside of this function that contains the vector, `nb_left` and `nb_right`, and 2) define a `new` or `build` function that will encapsulate these 30 initialization lines. Then you could write a function for the Bellman-Ford algorithm, and maybe a function that outputs the optimal matching from the graph. Your `positive_cycle` function can be attached to your graph structure as a method too.

Your root function `optimize_matching` would then be as simple as:
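Something along these lines, where every name (`Matching`, `BipartiteGraph`, `build`, `positive_cycle`, `apply_cycle`, `extract_matching`) is illustrative and not taken from the actual PR:

```rust
// Sketch only: a root function that composes named building blocks.
fn optimize_matching(old_match: &Matching, new_match: &Matching) -> Matching {
    let mut graph = BipartiteGraph::build(old_match, new_match);
    // Repeatedly cancel positive cycles (found with Bellman-Ford)
    // to get closer to old_match without changing vertex degrees.
    while let Some(cycle) = graph.positive_cycle() {
        graph.apply_cycle(&cycle);
    }
    graph.extract_matching()
}
```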
What I have described here is an OOP approach to solve the problem of modularization and encapsulation. In my opinion, such properties help make the code more readable by forcing the author to name and split things; of course, if names are meaningless and code is arbitrarily split, this is useless. This last part is just an example and probably not the best way to do it, especially because I am not familiar with the concepts. My example is just here to convey an intent; other programming paradigms are possible and great, and like all approaches it should not be abused.
This looks extremely promising and I'm hyped to have this in Garage soon :)
I think we identified three high-level concerns, in addition to the comments I made in the code below:
error reporting (and general reporting of what is happening)
testing on small toy examples
clarity of the code
Error reporting. For now, this code uses asserts, unwraps and panics in many places. The consistency of the results is checked at several steps using assertions, which is good. However, if we want to report to the user what is going wrong, it would be better if all of these functions could return Result types, maybe with a custom Error enum for all of the different kinds of errors that can happen at any step, or just using String as an error type. Globally, this means the following transformations:

- replacing `panic!()` by `return Err(format!("my error message"))`
- replacing `assert!(x)` by `if !x { return Err(format!("my error message")) }`
- replacing `some_result.unwrap()` by `some_result?` and `some_option.unwrap()` by `some_option.ok_or_else(|| format!("my error message"))?`
We should take this opportunity to write explicit error messages that try to tell precisely what is going wrong. If it's a situation that isn't supposed to happen (a failed assert for instance), the error message should clearly say so.
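As an illustration, here is what the transformation could look like on a made-up function (the names and signature are invented for the example):

```rust
// Before: panics on unexpected input.
fn pick_node_old(candidates: &[u32], idx: Option<usize>) -> u32 {
    let i = idx.unwrap();
    assert!(i < candidates.len());
    candidates[i]
}

// After: returns a Result with an explicit error message.
fn pick_node(candidates: &[u32], idx: Option<usize>) -> Result<u32, String> {
    let i = idx.ok_or_else(|| format!("no candidate index was selected"))?;
    if i >= candidates.len() {
        return Err(format!(
            "candidate index {} out of range: this is not supposed to happen",
            i
        ));
    }
    Ok(candidates[i])
}
```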
General reporting of what is happening: In the previous version, when a layout change was calculated, a list of how many partitions were moved between nodes was printed on screen by the algorithm that does the calculation, so that the sysadmin could predict how much data will be moved on the network. Having this information is important, and is missing in the new version. Moreover, instead of printing this information on the console, we would ultimately need to have it returned in a struct of some kind, so that it can also be returned to users that are calling this algorithm over the network.
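For instance, a struct of roughly this shape (hypothetical, just to convey the idea):

```rust
// Hypothetical summary returned by the layout calculation, so that
// both the CLI and network callers can display the rebalance cost.
pub struct RebalanceSummary {
    /// For each (source node, destination node) pair, the number
    /// of partitions that will move between them.
    pub partitions_moved: Vec<(String, String, usize)>,
    /// Total number of partition replicas that change node.
    pub total_moved: usize,
}
```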
Testing on small toy examples. We need a high level of assurance that this algorithm actually does what it's expected to do. To do so, we need to run it on many real-world cluster configurations, in the various settings that may appear in real-world deployments:
small clusters in only 1 zone, or bigger ones
small clusters in only 2 zones, or bigger ones
clusters in 3 zones
clusters in 4 or more zones
We need a way to run the algorithm on such small toy examples and print the results somewhere so that we can check that the way capacities are assigned to different nodes corresponds to what we would intuitively expect (i.e. verify that the equations implemented here actually correspond to what we want). I think we won't be convinced to integrate this work until we have such toy visualizations to convince ourselves that everything is working :)
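Even a very small test like the following would help (all helper names here are hypothetical; the point is only the shape of the check):

```rust
// Toy example: 3 equal nodes in one zone, replication factor 3.
// We expect every node to carry every partition exactly once.
#[test]
fn toy_three_nodes_one_zone() {
    let nodes = vec![
        ("node1", "zone1", 100),
        ("node2", "zone1", 100),
        ("node3", "zone1", 100),
    ];
    let layout = compute_test_layout(&nodes, 3);
    for node_parts in layout.partitions_per_node() {
        assert_eq!(node_parts, layout.nb_partitions());
    }
}
```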
Clarity of the code. What the code is doing is not always very clear. Comments are present but a high-level view is sometimes missing. I like to have, at the beginning of a function, a high-level comment (that can be quite long) that describes what the function does, and the steps it takes to do so. It would also be nice to have comments for data structures and intermediate variables, that explain the meaning of the different possible values that are stored in the variables that are defined. (e.g. if we have a vector whose values are positions in another vector, it would be nice to have a comment saying so).
@ -177,0 +185,4 @@
//We compute the optimal number of partition to assign to
//every node and zone.
if let Some((part_per_nod, part_per_zone)) = self.optimal_proportions() {
A whole indentation level for everything that follows can be removed by doing:
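For example (assuming the `None` case can simply return early; adapt to the function's actual return type):

```rust
// Sketch: bind early and return, instead of nesting the whole
// function body inside the `if let`.
let (part_per_nod, part_per_zone) = match self.optimal_proportions() {
    Some(parts) => parts,
    None => return, // or an Err(...) once Result-based reporting lands
};
// ... the rest of the function, one indentation level shallower
```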
@ -213,0 +283,4 @@
self.ring_assignation_data.push(id as CompactNodeType);
} else {
panic!()
}
The if-else block can be written more simply as:
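For example (assuming `nod` is an `Option`):

```rust
self.ring_assignation_data
    .push(nod.unwrap() as CompactNodeType);
```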
You can also use `.expect("message")` to add a custom error message when `nod` is `None`.
@ -239,0 +324,4 @@
///This function compute the number of partition to assign to
///every node and zone, so that every partition is replicated
///self.replication_factor times and the capacity of a partition
///is maximized.
If we have nodes in only 2 zones but the replication factor is 3, does this do the trick of adding a "ghost zone" that takes some of the capacity?
And what if we have nodes in only 1 zone?
@ -239,0 +325,4 @@
///every node and zone, so that every partition is replicated
///self.replication_factor times and the capacity of a partition
///is maximized.
fn optimal_proportions(&mut self) -> Option<(Vec<usize>, HashMap<String, usize>)> {
This whole function is very obscure to me :(
I tried to put some comments but I'm not sure I understood everything.
Probably a long comment at the beginning with the high-level math would help.
@ -239,0 +339,4 @@
);
} else {
zone_capacity.insert(node_zone[i].clone(), node_capacity[i]);
}
More concise and idiomatic version:
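Presumably with the `HashMap` entry API, along these lines (assuming the `if` branch was adding to an existing total):

```rust
// Sketch: one call instead of contains_key + insert.
*zone_capacity
    .entry(node_zone[i].clone())
    .or_insert(0) += node_capacity[i];
```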
@ -288,0 +369,4 @@
),
)
})
.collect();
Not convinced this works: if I have one very big zone and two smaller zones, the largest one will be capped to exactly nb_partitions, but the smaller ones will have less than nb_partitions. But the only correct solution would have been to have exactly nb_partitions partitions in each zone. Am I correct?
@ -288,0 +380,4 @@
.keys()
.filter(|k| part_per_zone[*k] < nb_partitions)
.map(|k| zone_capacity[k])
.sum();
Rewrite as:
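One plausible version iterates over the entries, avoiding the second lookup (the variable name is invented here):

```rust
let remaining_capacity: u32 = zone_capacity
    .iter()
    .filter(|(k, _)| part_per_zone[*k] < nb_partitions)
    .map(|(_, cap)| *cap)
    .sum();
```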
@ -476,0 +524,4 @@
rf == cl.ring_assignation_data[rf * i..rf * (i + 1)]
.iter()
.map(|nod| node_zone[*nod as usize].clone())
.unique()
ALERT ALERT ALERT

I think this is wrong: `.unique()` only works on a sorted iterator, because it only de-duplicates consecutive items! If I have an iterator that gives me A, B, A, then `.unique().count()` will be 3, but this is clearly not how we want to count here.
@ -487,0 +538,4 @@
.filter(|x| **x == i as u8)
.count()
})
.collect::<Vec<_>>();
I think this can be made linear complexity instead of n * m by doing:
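For example, a single pass that counts occurrences (names are taken from the surrounding context, so to be adapted):

```rust
// Sketch: count partitions per node in one pass over the data.
let mut node_nb_part = vec![0usize; nb_nodes];
for nod in cl.ring_assignation_data.iter() {
    node_nb_part[*nod as usize] += 1;
}
```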
I didn't put comments everywhere, but I think I saw other places with double nested iterators (n * m complexity); are there any others that can be rewritten to be linear complexity?

Note that `a.iter().filter(|x| b.contains(x))` is such a double nested iterator with n * m complexity; depending on the context it might be possible to rewrite them as above.
@ -493,3 +543,1 @@
let mut part = PartitionAss::new();
for node_i in self.ring_assignation_data
[i * self.replication_factor..(i + 1) * self.replication_factor]
let zone_vec = node_zone.iter().unique().collect::<Vec<_>>();
ALERT ALERT ALERT

Same remark here on `.unique()` (see above). What has to be done instead:
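For example, sort before de-duplicating so that equal zones are adjacent (or collect into a `HashSet`):

```rust
let mut zone_vec = node_zone.iter().collect::<Vec<_>>();
zone_vec.sort();
zone_vec.dedup();
```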
@ -499,0 +549,4 @@
.filter(|x| node_zone[**x as usize] == **z)
.count()
})
.collect::<Vec<_>>();
This is a nested loop with n * m complexity. Fortunately, zone_vec is very small. However for cache locality it would have been better to rewrite it the other way:
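That is, something like a single pass over the partition's nodes, incrementing per-zone counters (a sketch; the exact types are assumed from the quoted lines):

```rust
let mut zone_nb_part = vec![0usize; zone_vec.len()];
for nod in part.iter() {
    let zi = zone_vec
        .iter()
        .position(|z| **z == node_zone[*nod as usize])
        .unwrap();
    zone_nb_part[zi] += 1;
}
```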
This is probably a very tiny optimization and not important at all. In fact the other version has other advantages, such as not necessitating the use of `.unwrap()`, and, depending on taste, being more readable (I don't necessarily agree).
@ -507,0 +573,4 @@
assert!(
node_capacity[idmin] * (node_nb_part[idnew] as u32 + 1)
>= node_capacity[idnew] * node_nb_part[idmin] as u32
);
I think we should avoid writing `if /* a very long multiline condition */ { /* sth */ }` if we can. Here I would give a temporary name to the value `(0..nb_nodes).filter(...).max_by(...)` and then just do `if let Some(idnew) = tempvalue { ... }` (don't just call it `tempvalue` though, that's an example).
@ -522,0 +587,4 @@
if let Some(idnew) = node_of_z_iter.min_by(|i, j| {
(node_capacity[*i] * (node_nb_part[*j] as u32 + 1))
.cmp(&(node_capacity[*j] * (node_nb_part[*i] as u32 + 1)))
}) {
Same here, the value matched in the `if` condition should probably have its own `let` binding above.
@ -0,0 +2,4 @@
* This module deals with graph algorithm in complete bipartite
* graphs. It is used in layout.rs to build the partition to node
* assignation.
* */
For a documentation comment that concerns the whole module, use the following syntax:
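That is, `//!` inner doc comments at the top of the file:

```rust
//! This module deals with graph algorithms in complete bipartite
//! graphs. It is used in layout.rs to build the partition to node
//! assignation.
```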
@ -0,0 +14,4 @@
c: i32,
flow: i32,
v: usize,
rev: usize,
What are the meanings of the different fields? Add this as comments.
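For example (the meanings below are my guesses from a typical max-flow implementation, to be corrected if wrong):

```rust
struct EdgeFlow {
    c: i32,     // capacity of the edge
    flow: i32,  // flow currently pushed along the edge
    v: usize,   // index of the vertex this edge points to
    rev: usize, // index of the reverse edge in v's adjacency list
}
```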
@ -0,0 +22,4 @@
struct WeightedEdge {
w: i32,
u: usize,
v: usize,
Same
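Again a guessed example, to be corrected if wrong:

```rust
struct WeightedEdge {
    w: i32,   // weight (cost) of the edge
    u: usize, // index of the source vertex
    v: usize, // index of the destination vertex
}
```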
@ -0,0 +29,4 @@
* complete bipartite graph. It returns a matching that has the
* same degree as new_match at every vertex, and that is as close
* as possible to old_match.
* */
For a doc comment for the item that follows the comment, use the following syntax:
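That is, `///` outer doc comments placed just before the item:

```rust
/// Computes a new matching in a complete bipartite graph. It returns
/// a matching that has the same degree as new_match at every vertex,
/// and that is as close as possible to old_match.
fn optimize_matching(/* ... */) {
    /* ... */
}
```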
When using `cargo doc` to generate crate documentation, these comments will then appear as documentation text. (Other comments might also be concerned by this remark.)
@ -0,0 +45,4 @@
for j in 0..nb_right {
edge_vec[i * nb_right + j].u = i;
edge_vec[i * nb_right + j].v = nb_left + j;
}
I think this whole loop would be clearer as:
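Presumably something like iterating with `chunks_mut` and `enumerate` (a sketch):

```rust
// No manual i * nb_right + j index arithmetic.
for (i, row) in edge_vec.chunks_mut(nb_right).enumerate() {
    for (j, edge) in row.iter_mut().enumerate() {
        edge.u = i;
        edge.v = nb_left + j;
    }
}
```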
(Each time we can avoid calculating array indices as `a * x + b` is a good thing.)

@ -0,0 +185,4 @@
graph[1][i].c = *c as i32;
graph[1][i].v = i + 2 + left_cap_vec.len();
graph[1][i].rev = 0;
}
This might be a stretch, but instead of using integers that have special meanings (0 = source, 1 = sink, etc) as vertex identifiers, why not use an enum as the following?
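Presumably something like this (a guessed shape):

```rust
// Vertex identifiers with explicit meaning instead of magic
// integers (0 = source, 1 = sink, ...).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Vertex {
    Source,
    Sink,
    Left(usize),  // i-th vertex on the left side
    Right(usize), // i-th vertex on the right side
}
```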
Values of this kind behave exactly the same as integers thanks to the derive statement, so the flow algorithm's body doesn't need to be changed. It's just the end of the algorithm that needs to change, when we reconstruct the association table from the flow graph, where we need to match on the two sides of each edge, but that's probably quite simple to do.
Using such a strategy would reduce by A LOT the risk of doing something wrong with vertex identifiers, and would also make the code a lot more elegant.
This could also be done for the bipartite cycle optimization algorithm, with a vertex type as follows:
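Again a guessed shape:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum MatchingVertex {
    Left(usize),
    Right(usize),
}
```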
The edge types could be parametric on the vertex type:
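For instance:

```rust
// Sketch: the edge structs take the vertex type as a parameter.
struct WeightedEdge<V> {
    w: i32,
    u: V,
    v: V,
}
```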
@ -0,0 +306,4 @@
}
//otherwise, we send flow to nbd.
lifo.push_back((graph[id][nbd].v, new_flow));
}
I'll trust you that Dinic's flow algorithm is correct here ;)
The pull request was renamed from "WIP: Optimal layout assignation algorithm" to "Optimal layout assignation algorithm".

Anything that needs to be finalized on this feature before release of v0.9 will now be done on branch `next`.