diff --git a/doc/book/reference-manual/configuration.md b/doc/book/reference-manual/configuration.md
index b8881795..bb04650c 100644
--- a/doc/book/reference-manual/configuration.md
+++ b/doc/book/reference-manual/configuration.md
@@ -44,6 +44,10 @@ root_domain = ".s3.garage"
 [s3_web]
 bind_addr = "[::]:3902"
 root_domain = ".web.garage"
+
+[admin]
+api_bind_addr = "0.0.0.0:3903"
+trace_sink = "http://localhost:4317"
 ```
 
 The following gives details about each available configuration option.
@@ -84,20 +88,47 @@ might use more storage space that is optimally possible.
 
 Garage supports the following replication modes:
 
-- `none` or `1`: data stored on Garage is stored on a single node. There is no redundancy,
-  and data will be unavailable as soon as one node fails or its network is disconnected.
-  Do not use this for anything else than test deployments.
+- `none` or `1`: data stored on Garage is stored on a single node. There is no
+  redundancy, and data will be unavailable as soon as one node fails or its
+  network is disconnected. Do not use this for anything other than test
+  deployments.
 
-- `2`: data stored on Garage will be stored on two different nodes, if possible in different
-  zones. Garage tolerates one node failure before losing data. Data should be available
-  read-only when one node is down, but write operations will fail.
-  Use this only if you really have to.
+- `2`: data stored on Garage will be stored on two different nodes, if possible
+  in different zones. Garage tolerates one node failure, or several nodes
+  failing but all in a single zone (in a deployment with at least two zones),
+  before losing data. Data remains available in read-only mode when one node is
+  down, but write operations will fail.
 
-- `3`: data stored on Garage will be stored on three different nodes, if possible each in
-  a different zones.
-  Garage tolerates two node failure before losing data. Data should be available
-  read-only when two nodes are down, and writes should be possible if only a single node
-  is down.
+  - `2-dangerous`: a variant of mode `2`, where written objects are written to
+    the second replica asynchronously. This means that Garage will return `200
+    OK` to a PutObject request before the second copy is fully written (or even
+    before it starts being written). As a result, data can more easily
+    be lost if the node crashes before a second copy can be completed. This
+    also means that written objects might not be visible immediately in read
+    operations. In other words, this mode severely breaks the consistency and
+    durability guarantees of standard Garage cluster operation. Benefits of
+    this mode: you can still write to your cluster when one node is
+    unavailable.
+
+- `3`: data stored on Garage will be stored on three different nodes, if
+  possible each in a different zone. Garage tolerates two node failures, or
+  several node failures but in no more than two zones (in a deployment with at
+  least three zones), before losing data. As long as only a single node fails,
+  or node failures are only in a single zone, reading and writing data to
+  Garage can continue normally.
+
+  - `3-degraded`: a variant of replication mode `3` that lowers the read
+    quorum to `1`, to allow you to read data from your cluster when several
+    nodes (or nodes in several zones) are unavailable. In this mode, Garage
+    does not provide read-after-write consistency anymore. The write quorum is
+    still 2, ensuring that data successfully written to Garage is stored on at
+    least two nodes.
+
+  - `3-dangerous`: a variant of replication mode `3` that lowers both the read
+    and write quorums to `1`, to allow you to both read from and write to your
+    cluster when several nodes (or nodes in several zones) are unavailable. It
+    is the least consistent mode of operation offered by Garage, and also one
+    that should probably never be used.
 
 Note that in modes `2` and `3`,
 if at least the same number of zones are available, an arbitrary number of failures in
@@ -106,8 +137,35 @@ any given zone is tolerated as copies of data will be spread over several zones.
 **Make sure `replication_mode` is the same in the configuration files of all nodes.
 Never run a Garage cluster where that is not the case.**
 
-Changing the `replication_mode` of a cluster might work (make sure to shut down all nodes
-and changing it everywhere at the time), but is not officially supported.
+The quorums associated with each replication mode are described below:
+
+| `replication_mode` | Number of replicas | Write quorum | Read quorum | Read-after-write consistency? |
+| ------------------ | ------------------ | ------------ | ----------- | ----------------------------- |
+| `none` or `1`      | 1                  | 1            | 1           | yes                           |
+| `2`                | 2                  | 2            | 1           | yes                           |
+| `2-dangerous`      | 2                  | 1            | 1           | NO                            |
+| `3`                | 3                  | 2            | 2           | yes                           |
+| `3-degraded`       | 3                  | 2            | 1           | NO                            |
+| `3-dangerous`      | 3                  | 1            | 1           | NO                            |
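+
+For example, to run a cluster in the standard `3` mode, every node's
+configuration file would contain the following top-level parameter (a minimal
+illustration of the syntax, not a complete configuration file):
+
+```
+replication_mode = "3"
+```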
+
+Changing the `replication_mode` between modes with the same number of replicas
+(e.g. from `3` to `3-degraded`, or from `2-dangerous` to `2`) can be done easily by
+just changing the `replication_mode` parameter in your config files and restarting all your
+Garage nodes.
+
+It is also technically possible to change the replication mode to a mode with a
+different number of replicas, although it's a dangerous operation that is not
+officially supported. This requires you to delete the existing cluster layout
+and create a new layout from scratch, meaning that a full rebalancing of your
+cluster's data will be needed. To do it, shut down your cluster entirely,
+delete the `cluster_layout` files in the meta directories of all your nodes,
+update all your configuration files with the new `replication_mode` parameter,
+restart your cluster, and then create a new layout with all the nodes you want
+to keep. Rebalancing data will take some time, and data might temporarily
+appear unavailable to your users. It is recommended to shut down public access
+to the cluster while rebalancing is in progress. In theory, no data should be
+lost as rebalancing is a routine operation for Garage, although we cannot
+guarantee you that everything will go right in such an extreme scenario.
 
 ### `compression_level`
 
@@ -260,3 +318,21 @@ For instance, if `root_domain` is `web.garage.eu`, a bucket called `deuxfleurs.f
 will be accessible either with hostname `deuxfleurs.fr.web.garage.eu`
 or with hostname `deuxfleurs.fr`.
 
+
+## The `[admin]` section
+
+Garage has a few administration capabilities, in particular to allow remote monitoring. These features are detailed below.
+
+### `api_bind_addr`
+
+If specified, Garage will bind an HTTP server to this port and address, on
+which it will listen to requests for administration features. Currently,
+this endpoint only exposes Garage metrics in the Prometheus format at
+`/metrics`. This endpoint is not authenticated. In the future, bucket and
+access key management might be possible via REST calls to this endpoint.
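+
+For instance, a minimal `[admin]` section that only enables the metrics
+endpoint could look like this (the `trace_sink` parameter described below is
+simply omitted); a Prometheus server can then scrape
+`http://<node address>:3903/metrics` directly:
+
+```
+[admin]
+api_bind_addr = "0.0.0.0:3903"
+```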
+
+### `trace_sink`
+
+Optionally, the address of an OpenTelemetry collector. If specified,
+Garage will send traces in the OpenTelemetry format to this endpoint. These
+traces allow you to inspect Garage's operation when it handles S3 API requests.
diff --git a/src/admin/metrics.rs b/src/admin/metrics.rs
index cbc737d3..7edc36c6 100644
--- a/src/admin/metrics.rs
+++ b/src/admin/metrics.rs
@@ -28,7 +28,7 @@ async fn serve_req(
 	req: Request<Body>,
 	admin_server: Arc<AdminServer>,
 ) -> Result<Response<Body>, hyper::Error> {
-	info!("Receiving request at path {}", req.uri());
+	debug!("Receiving request at path {}", req.uri());
 	let request_start = SystemTime::now();
 
 	admin_server.metrics.http_counter.add(1);
diff --git a/src/api/signature/payload.rs b/src/api/signature/payload.rs
index 88ec1f00..2a41b307 100644
--- a/src/api/signature/payload.rs
+++ b/src/api/signature/payload.rs
@@ -235,7 +235,7 @@ pub fn canonical_request(
 ) -> String {
 	[
 		method.as_str(),
-		&uri.path().to_string(),
+		uri.path(),
 		&canonical_query_string(uri),
 		&canonical_header_string(headers, signed_headers),
 		"",
diff --git a/src/block/manager.rs b/src/block/manager.rs
index 9cf72019..a3c4ec0d 100644
--- a/src/block/manager.rs
+++ b/src/block/manager.rs
@@ -265,6 +265,11 @@ impl BlockManager {
 		self.resync_queue.len()
 	}
 
+	/// Get number of blocks that have a resync error
+	pub fn resync_errors_len(&self) -> usize {
+		self.resync_errors.len()
+	}
+
 	/// Get number of items in the refcount table
 	pub fn rc_len(&self) -> usize {
 		self.rc.rc.len()
diff --git a/src/garage/admin.rs b/src/garage/admin.rs
index f9ec1593..0b20bb20 100644
--- a/src/garage/admin.rs
+++ b/src/garage/admin.rs
@@ -728,6 +728,12 @@ impl AdminRpcHandler {
 			self.garage.block_manager.resync_queue_len()
 		)
 		.unwrap();
+		writeln!(
+			&mut ret,
+			"  blocks with resync errors: {}",
+			self.garage.block_manager.resync_errors_len()
+		)
+		.unwrap();
 
 		ret
 	}
diff --git a/src/table/replication/mode.rs b/src/table/replication/mode.rs
index 32687288..c6f84c45 100644
--- a/src/table/replication/mode.rs
+++ b/src/table/replication/mode.rs
@@ -1,7 +1,10 @@
 pub enum ReplicationMode {
 	None,
 	TwoWay,
+	TwoWayDangerous,
 	ThreeWay,
+	ThreeWayDegraded,
+	ThreeWayDangerous,
 }
 
 impl ReplicationMode {
@@ -9,7 +12,10 @@
 		match v {
 			"none" | "1" => Some(Self::None),
 			"2" => Some(Self::TwoWay),
+			"2-dangerous" => Some(Self::TwoWayDangerous),
 			"3" => Some(Self::ThreeWay),
+			"3-degraded" => Some(Self::ThreeWayDegraded),
+			"3-dangerous" => Some(Self::ThreeWayDangerous),
 			_ => None,
 		}
 	}
@@ -24,16 +30,17 @@
 	pub fn replication_factor(&self) -> usize {
 		match self {
 			Self::None => 1,
-			Self::TwoWay => 2,
-			Self::ThreeWay => 3,
+			Self::TwoWay | Self::TwoWayDangerous => 2,
+			Self::ThreeWay | Self::ThreeWayDegraded | Self::ThreeWayDangerous => 3,
 		}
 	}
 
 	pub fn read_quorum(&self) -> usize {
 		match self {
 			Self::None => 1,
-			Self::TwoWay => 1,
+			Self::TwoWay | Self::TwoWayDangerous => 1,
 			Self::ThreeWay => 2,
+			Self::ThreeWayDegraded | Self::ThreeWayDangerous => 1,
 		}
 	}
 
@@ -41,7 +48,9 @@
 		match self {
 			Self::None => 1,
 			Self::TwoWay => 2,
-			Self::ThreeWay => 2,
+			Self::TwoWayDangerous => 1,
+			Self::ThreeWay | Self::ThreeWayDegraded => 2,
+			Self::ThreeWayDangerous => 1,
 		}
 	}
 }