Add documentation on durability and repair procedures (fix #219)

2023-06-14 11:54:21 +02:00 · 2023-06-14 11:54:21 +02:00 · 249a7a66df
commit 249a7a66df
parent 733a7c314e
3 changed files with 116 additions and 2 deletions
--- a/doc/book/cookbook/durability-repairs.md
+++ b/doc/book/cookbook/durability-repairs.md
@@ -0,0 +1,114 @@
+++
+title = "Durability & Repairs"
+weight = 50
+++
+
+To ensure the best durability of your data and to fix any inconsistencies that may
+arise in a distributed system, Garage provides a series of repair operations.
+This guide will explain the meaning of each of them and when they should be applied.
+
+
+# General syntax of repair operations
+
+Repair operations described below are of the form `garage repair <repair_name>`.
+These repairs will not launch without the `--yes` flag, which should
+be added as follows: `garage repair --yes <repair_name>`.
+By default these repair procedures will only run on the Garage node your CLI is
+connecting to. To run on all nodes, add the `-a` flag as follows:
+`garage repair -a --yes <repair_name>`.
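+
+As a minimal sketch, the two invocation forms above can be wrapped in small
+shell helpers. The `GARAGE` variable and the function names are illustrative
+and not part of Garage. By default the sketch only prints the commands (dry run);
+setting `GARAGE=garage` would execute them, assuming `garage` is on your `PATH`.
+
+```bash
+# Dry run by default: commands are printed, not executed.
+# Set GARAGE=garage in the environment to actually run them.
+GARAGE="${GARAGE:-echo garage}"
+
+# Launch one repair on the node the CLI is connected to.
+repair_local() { $GARAGE repair --yes "$1"; }
+
+# Launch one repair on all nodes of the cluster.
+repair_all() { $GARAGE repair -a --yes "$1"; }
+
+repair_local tables
+repair_all blocks
+```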
+
+# Data block operations
+
+## Data store scrub
+
+Scrubbing the data store means examining each individual data block to check that
+its content is correct, by verifying its hash. Any block found to be corrupted
+(e.g. by bitrot or by an accidental manipulation of the datastore) will be
+restored from another node that holds a valid copy.
+
+A scrub is run automatically by Garage every 30 days. It can also be launched
+manually using `garage repair scrub start`.
+
+To view the status of an ongoing scrub, first find the task ID of the scrub worker
+using `garage worker list`. Then, run `garage worker info <scrub_task_id>` to
+view detailed runtime statistics of the scrub. To gather cluster-wide information,
+this command has to be run on each individual node.
+
+A scrub is a very disk-intensive operation that might slow down your cluster.
+You may pause an ongoing scrub using `garage repair scrub pause`, but note that
+the scrub will resume automatically 24 hours later as Garage will not let your
+cluster run without a regular scrub. If the scrub procedure is too intensive
+for your servers and is slowing down your workload, the recommended solution
+is to increase the "scrub tranquility" using `garage repair scrub set-tranquility`.
+A higher tranquility value will make Garage take longer pauses between two block
+verifications. Of course, scrubbing the entire data store will also take longer.
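+
+The scrub commands above could be combined as in the following sketch. It is a
+dry run by default (commands are only printed). The `GARAGE` variable and the
+helper names are illustrative, the `--yes` flag is added as described in the
+general syntax section, and both the tranquility value `4` and the assumption
+that it is passed as a positional argument should be checked against
+`garage repair scrub --help` on your version.
+
+```bash
+# Dry run by default: set GARAGE=garage to actually run the commands.
+GARAGE="${GARAGE:-echo garage}"
+
+# Scrub subcommands named in this guide: start, pause, set-tranquility.
+scrub() { $GARAGE repair --yes scrub "$@"; }
+
+# Show runtime statistics for one worker; the task ID has to be read
+# from the output of `garage worker list` on the same node.
+worker_info() { $GARAGE worker info "$1"; }
+
+scrub start
+$GARAGE worker list
+worker_info "<scrub_task_id>"
+
+# If the scrub slows down your workload, prefer a higher tranquility
+# over pausing (example value; positional argument is an assumption).
+scrub set-tranquility 4
+```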
+
+## Block check and resync
+
+In some cases, nodes hold a reference to a block but do not actually have the block
+stored on disk. Conversely, they may also have blocks on disk that are no longer
+referenced. To fix both cases, a block repair may be run with `garage repair blocks`.
+This will scan the entire block reference counter table to check that the blocks
+exist on disk, and will scan the entire disk store to check that stored blocks
+are referenced.
+
+It is recommended to run this procedure when changing your cluster layout,
+after the metadata tables have finished synchronizing between nodes
+(usually a few hours after `garage layout apply`).
+
+## Inspecting lost blocks
+
+In extremely rare situations, data blocks may be unavailable from the entire cluster.
+This means that even using `garage repair blocks`, some nodes may be unable
+to fetch data blocks for which they hold a reference.
+
+These errors are stored on each node in a list of "block resync errors", i.e.
+blocks for which the last resync operation failed.
+This list can be inspected using `garage block list-errors`.
+These errors usually fall into one of the following categories:
+
+1. a block is still referenced but the object was deleted: this is a case
+   of metadata reference inconsistency (see below for the fix)
+2. a block is referenced by a non-deleted object, but could not be fetched due
+   to a transient error such as a network failure
+3. a block is referenced by a non-deleted object, but could not be fetched due
+   to a permanent error such as there not being any valid copy of the block on the
+   entire cluster
+
+To distinguish case 1 from cases 2 and 3, you may use the
+`garage block info` command to see which objects hold a reference to each block.
+
+In the second case (transient errors), Garage will try to fetch the block again
+after a certain time, so the error should disappear naturally. Once you have
+fixed the transient issue, you can also ask Garage to retry fetching the block
+immediately using `garage block retry-now`.
+
+If you are confident that you are in the third scenario and that your data block
+is definitely lost, then there is no other choice than to declare your S3 objects
+as unrecoverable, and to delete them properly from the data store. This can be done
+using the `garage block purge` command.
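+
+The triage described above might look like the following sketch. It is a dry
+run by default (commands are only printed). The `GARAGE` variable and the
+helper names are illustrative, `<block_hash>` is a placeholder for a hash taken
+from the error list, and passing the hash as a positional argument is an
+assumption; check `garage block --help` for the exact arguments and for any
+confirmation flag that `purge` may require.
+
+```bash
+# Dry run by default: set GARAGE=garage to actually run the commands.
+GARAGE="${GARAGE:-echo garage}"
+
+# List the blocks whose last resync failed on this node.
+list_errors() { $GARAGE block list-errors; }
+
+# Show which objects still reference a block (tells case 1 from cases 2 and 3).
+block_info() { $GARAGE block info "$1"; }
+
+# Case 2: the transient issue is fixed, retry fetching the block right away.
+retry_block() { $GARAGE block retry-now "$1"; }
+
+# Case 3: the block is definitely lost, delete the objects that reference it.
+purge_block() { $GARAGE block purge "$1"; }
+
+list_errors
+block_info "<block_hash>"
+```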
+
+
+# Metadata operations
+
+## Metadata table resync
+
+Garage automatically resyncs all entries stored in the metadata tables every hour,
+to ensure that all nodes have the most up-to-date version of all the information
+they should be holding.
+The resync procedure is based on a Merkle tree, which makes it possible to
+efficiently find differences between nodes.
+
+In some special cases, e.g. before an upgrade, you might want to run a table
+resync manually. This can be done using `garage repair tables`.
+
+## Metadata table reference fixes
+
+In some very rare cases where nodes are unavailable, some references between objects
+may become broken. For instance, if an object is deleted, the underlying versions or data
+blocks may still be held by Garage. If you suspect that such corruption has occurred
+in your cluster, you can run one of the following repair procedures:
+
+- `garage repair versions`: checks that all versions belong to a non-deleted object, and purges any orphan version
+- `garage repair block_refs`: checks that all block references belong to a non-deleted object version, and purges any orphan block reference (this will then allow the blocks to be garbage-collected)
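+
+As a sketch, both reference repairs can be launched on all nodes with a small
+loop. It is a dry run by default (commands are only printed). The `GARAGE`
+variable and the function name are illustrative, and running `versions` before
+`block_refs` is only a suggested order, not a requirement stated by Garage.
+
+```bash
+# Dry run by default: set GARAGE=garage to actually run the commands.
+GARAGE="${GARAGE:-echo garage}"
+
+# Launch both metadata reference repairs on all nodes, stopping at the
+# first command that fails.
+reference_repairs() {
+    for name in versions block_refs; do
+        $GARAGE repair -a --yes "$name" || return 1
+    done
+}
+
+reference_repairs
+```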
+
--- a/doc/book/cookbook/recovering.md
+++ b/doc/book/cookbook/recovering.md
@@ -1,6 +1,6 @@
 +++
 title = "Recovering from failures"
-weight = 50
+weight = 60
 +++

 Garage is meant to work on old, second-hand hardware.
--- a/doc/book/cookbook/upgrading.md
+++ b/doc/book/cookbook/upgrading.md
@@ -1,6 +1,6 @@
 +++
 title = "Upgrading Garage"
-weight = 60
+weight = 70
 +++

 Garage is a stateful clustered application, where all nodes are communicating together and share data structures.