2022-03-28 15:36:16 +00:00
2 changed files with 81 additions and 19 deletions
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -48,7 +48,6 @@ root_domain = ".web.garage"
 [admin]
 api_bind_addr = "0.0.0.0:3903"
 trace_sink = "http://localhost:4317"
-
 ```

 The following gives details about each available configuration option.
@@ -89,20 +88,47 @@ might use more storage space that is optimally possible.

 Garage supports the following replication modes:

- `none` or `1`: data stored on Garage is stored on a single node. There is no redundancy,
-  and data will be unavailable as soon as one node fails or its network is disconnected.
-  Do not use this for anything else than test deployments.
+- `none` or `1`: data stored on Garage is stored on a single node. There is no
+  redundancy, and data will be unavailable as soon as one node fails or its
+  network is disconnected.  Do not use this for anything other than test
+  deployments.

- `2`: data stored on Garage will be stored on two different nodes, if possible in different
-  zones. Garage tolerates one node failure before losing data. Data should be available
-  read-only when one node is down, but write operations will fail.
-  Use this only if you really have to.
+- `2`: data stored on Garage will be stored on two different nodes, if possible
+  in different zones. Garage tolerates one node failure, or several nodes
+  failing but all in a single zone (in a deployment with at least two zones),
+  before losing data. Data remains available in read-only mode when one node is
+  down, but write operations will fail.

- `3`: data stored on Garage will be stored on three different nodes, if possible each in
-  a different zones.
-  Garage tolerates two node failure before losing data. Data should be available
-  read-only when two nodes are down, and writes should be possible if only a single node
-  is down.
+  - `2-dangerous`: a variant of mode `2`, where written objects are written to
+    the second replica asynchronously. Garage will return `200 OK` to a
+    PutObject request before the second copy is fully written (or even
+    before it starts being written). As a result, data can more easily be
+    lost if the node crashes before the second copy is complete, and
+    written objects might not be visible immediately in read operations.
+    In other words, this mode severely breaks the consistency and
+    durability guarantees of standard Garage cluster operation. The
+    benefit of this mode is that you can still write to your cluster when
+    one node is unavailable.
+
+- `3`: data stored on Garage will be stored on three different nodes, if
+  possible each in a different zone.  Garage tolerates two node failures, or
+  several node failures but in no more than two zones (in a deployment with at
+  least three zones), before losing data. As long as only a single node fails,
+  or node failures are only in a single zone, reading and writing data to
+  Garage can continue normally.
+
+  - `3-degraded`: a variant of replication mode `3` that lowers the read
+    quorum to `1`, to allow you to read data from your cluster when several
+    nodes (or nodes in several zones) are unavailable.  In this mode, Garage
+    no longer provides read-after-write consistency.  The write quorum is
+    still 2, ensuring that data successfully written to Garage is stored on at
+    least two nodes.
+
+  - `3-dangerous`: a variant of replication mode `3` that lowers both the read
+    and write quorums to `1`, to allow you to both read and write to your
+    cluster when several nodes (or nodes in several zones) are unavailable.  It
+    is the least consistent mode of operation offered by Garage, and also one
+    that should probably never be used.

 Note that in modes `2` and `3`,
 if at least the same number of zones are available, an arbitrary number of failures in 
@@ -111,8 +137,35 @@ any given zone is tolerated as copies of data will be spread over several zones.
 **Make sure `replication_mode` is the same in the configuration files of all nodes.
 Never run a Garage cluster where that is not the case.**

-Changing the `replication_mode` of a cluster might work (make sure to shut down all nodes
-and changing it everywhere at the time), but is not officially supported.
+The quorums associated with each replication mode are described below:
+
+| `replication_mode` | Number of replicas | Write quorum | Read quorum | Read-after-write consistency? |
+| ------------------ | ------------------ | ------------ | ----------- | ----------------------------- |
+| `none` or `1`      | 1                  | 1            | 1           | yes                           |
+| `2`                | 2                  | 2            | 1           | yes                           |
+| `2-dangerous`      | 2                  | 1            | 1           | NO                            |
+| `3`                | 3                  | 2            | 2           | yes                           |
+| `3-degraded`       | 3                  | 2            | 1           | NO                            |
+| `3-dangerous`      | 3                  | 1            | 1           | NO                            |
+
+Changing the `replication_mode` between modes with the same number of replicas
+(e.g. from `3` to `3-degraded`, or from `2-dangerous` to `2`) can be done simply
+by changing the `replication_mode` parameter in your config files and restarting
+all your Garage nodes.
+
+It is also technically possible to change the replication mode to a mode with a
+different number of replicas, although it's a dangerous operation that is not
+officially supported.  This requires you to delete the existing cluster layout
+and create a new layout from scratch, meaning that a full rebalancing of your
+cluster's data will be needed.  To do it, shut down your cluster entirely,
+delete the `cluster_layout` files in the meta directories of all your nodes,
+update all your configuration files with the new `replication_mode` parameter,
+restart your cluster, and then create a new layout with all the nodes you want
+to keep.  Rebalancing data will take some time, and data might temporarily
+appear unavailable to your users.  It is recommended to shut down public access
+to the cluster while rebalancing is in progress.  In theory, no data should be
+lost as rebalancing is a routine operation for Garage, although we cannot
+guarantee that everything will go right in such an extreme scenario.

 ### `compression_level`

--- a/src/table/replication/mode.rs
+++ b/src/table/replication/mode.rs
@@ -1,7 +1,10 @@
 pub enum ReplicationMode {
 	None,
 	TwoWay,
+	TwoWayDangerous,
 	ThreeWay,
+	ThreeWayDegraded,
+	ThreeWayDangerous,
 }

 impl ReplicationMode {
@@ -9,7 +12,10 @@ impl ReplicationMode {
 		match v {
 			"none" | "1" => Some(Self::None),
 			"2" => Some(Self::TwoWay),
+			"2-dangerous" => Some(Self::TwoWayDangerous),
 			"3" => Some(Self::ThreeWay),
+			"3-degraded" => Some(Self::ThreeWayDegraded),
+			"3-dangerous" => Some(Self::ThreeWayDangerous),
 			_ => None,
 		}
 	}
@@ -24,16 +30,17 @@ impl ReplicationMode {
 	pub fn replication_factor(&self) -> usize {
 		match self {
 			Self::None => 1,
-			Self::TwoWay => 2,
-			Self::ThreeWay => 3,
+			Self::TwoWay | Self::TwoWayDangerous => 2,
+			Self::ThreeWay | Self::ThreeWayDegraded | Self::ThreeWayDangerous => 3,
 		}
 	}

 	pub fn read_quorum(&self) -> usize {
 		match self {
 			Self::None => 1,
-			Self::TwoWay => 1,
+			Self::TwoWay | Self::TwoWayDangerous => 1,
 			Self::ThreeWay => 2,
+			Self::ThreeWayDegraded | Self::ThreeWayDangerous => 1,
 		}
 	}

@@ -41,7 +48,9 @@ impl ReplicationMode {
 		match self {
 			Self::None => 1,
 			Self::TwoWay => 2,
-			Self::ThreeWay => 2,
+			Self::TwoWayDangerous => 1,
+			Self::ThreeWay | Self::ThreeWayDegraded => 2,
+			Self::ThreeWayDangerous => 1,
 		}
 	}
 }
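
Review note: the "Read-after-write consistency?" column of the documentation table appears to follow from the quorum values in `mode.rs`. A read is guaranteed to overlap the latest acknowledged write when the read quorum plus the write quorum exceeds the number of replicas. The standalone sketch below reproduces the enum and quorum values from the diff and derives that column. The function names `parse` and `is_read_after_write_consistent` are assumptions made for illustration (the diff does not show the parser's name, and the consistency helper is not part of this change).

```rust
// Standalone sketch, not Garage's actual module: the variants and quorum
// values mirror the diff above; `parse` and `is_read_after_write_consistent`
// are hypothetical names introduced for this example.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ReplicationMode {
    None,
    TwoWay,
    TwoWayDangerous,
    ThreeWay,
    ThreeWayDegraded,
    ThreeWayDangerous,
}

impl ReplicationMode {
    /// Maps a `replication_mode` config string to a mode (assumed name).
    pub fn parse(v: &str) -> Option<Self> {
        match v {
            "none" | "1" => Some(Self::None),
            "2" => Some(Self::TwoWay),
            "2-dangerous" => Some(Self::TwoWayDangerous),
            "3" => Some(Self::ThreeWay),
            "3-degraded" => Some(Self::ThreeWayDegraded),
            "3-dangerous" => Some(Self::ThreeWayDangerous),
            _ => None,
        }
    }

    pub fn replication_factor(&self) -> usize {
        match self {
            Self::None => 1,
            Self::TwoWay | Self::TwoWayDangerous => 2,
            Self::ThreeWay | Self::ThreeWayDegraded | Self::ThreeWayDangerous => 3,
        }
    }

    pub fn read_quorum(&self) -> usize {
        match self {
            Self::None => 1,
            Self::TwoWay | Self::TwoWayDangerous => 1,
            Self::ThreeWay => 2,
            Self::ThreeWayDegraded | Self::ThreeWayDangerous => 1,
        }
    }

    pub fn write_quorum(&self) -> usize {
        match self {
            Self::None => 1,
            Self::TwoWay => 2,
            Self::TwoWayDangerous => 1,
            Self::ThreeWay | Self::ThreeWayDegraded => 2,
            Self::ThreeWayDangerous => 1,
        }
    }

    /// Hypothetical helper: any read set intersects any write set exactly
    /// when read quorum + write quorum > number of replicas.
    pub fn is_read_after_write_consistent(&self) -> bool {
        self.read_quorum() + self.write_quorum() > self.replication_factor()
    }
}

fn main() {
    // Print one row per documented mode, in the same order as the table.
    for name in ["none", "2", "2-dangerous", "3", "3-degraded", "3-dangerous"].iter() {
        let mode = ReplicationMode::parse(name).expect("known mode");
        println!(
            "{:<12} replicas={} write={} read={} read-after-write={}",
            name,
            mode.replication_factor(),
            mode.write_quorum(),
            mode.read_quorum(),
            mode.is_read_after_write_consistent()
        );
    }
}
```

Under these assumptions, the derived values agree with the table: `none`, `2`, and `3` satisfy the overlap condition, while `2-dangerous`, `3-degraded`, and `3-dangerous` do not. A check like this could serve as a unit test to keep the documentation table and the code in sync.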