Add an "upgrading" section, add a guide for 0.7

2022-04-05 10:05:44 +02:00 · 2022-04-05 10:05:44 +02:00 · a122a8cb46
commit a122a8cb46
parent 9fd8ec1dee
3 changed files with 89 additions and 2 deletions
--- a/doc/book/cookbook/upgrading.md
+++ b/doc/book/cookbook/upgrading.md
@@ -0,0 +1,56 @@
++++
+title = "Upgrading Garage"
+weight = 40
++++
+
+Garage is a stateful clustered application in which all nodes communicate with each other and share data structures.
+This makes upgrades more difficult than for stateless applications, so you must be more careful when upgrading.
+When a new version is released, there are 2 possibilities:
+  - protocols and data structures remained the same ➡️ this is a **straightforward upgrade**
+  - protocols or data structures changed ➡️ this is an **advanced upgrade**
+
+You can quickly know what type of upgrade you will have to perform by looking at the version identifier.
+Following the [SemVer](https://semver.org/) terminology, if only the *patch* number changed, only a straightforward upgrade is needed.
+Example: an upgrade from v0.6.0 to v0.6.1 is a straightforward upgrade.
+If the *minor* or *major* number changed, however, you will have to do an advanced upgrade. Example: from v0.6.1 to v0.7.0.
+
+Migrations are tested only from one stable version to the next. To minimize your risks, you should ideally upgrade only to the direct next stable version, including patch ones.
+Example: to go from v0.6.0 to v0.7.0, upgrade from v0.6.0 to v0.6.1 and then from v0.6.1 to v0.7.0.
+
+Migrating from one minor version to another without installing the intermediate patch versions could work but is not tested; you must do your own tests in advance.
+Example: going from v0.6.0 directly to v0.7.0 should work but is untested. Never skip a minor or a major version.
+Example: going from v0.6.0 directly to v0.8.0 is forbidden.
+If you are very late in your upgrades, you should consider spawning a new cluster with the latest version and performing application-level migrations
+from the old cluster to the new one.
+
+## Straightforward upgrades
+
+Straightforward upgrades do not imply cluster downtime.
+Before upgrading, you should still read [the changelog](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases) and ideally test your deployment on a staging cluster.
+
+When you are ready, start by checking the health of your cluster.
+You can force some checks with `garage repair`; we recommend at least running `garage repair --all-nodes --yes`, which is very quick to run (less than a minute).
+The logs of your daemon will show that the command terminated correctly.
+
+Finally, you can simply upgrade nodes one by one.
+For each node: stop it, install the new binary, edit the configuration if needed, restart it.
+
+## Advanced upgrades
+
+Advanced upgrades will imply cluster downtime.
+Before upgrading, you must read [the changelog](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases) and test your deployment on a staging cluster.
+
+From a high-level perspective, an advanced upgrade looks like this:
+  1. Make sure the health of your cluster is good (see `garage repair`)
+  2. Disable API access (comment out the configuration in your reverse proxy)
+  3. Check that your cluster is idle
+  4. Stop the whole cluster
+  5. Back up the metadata folder of all your nodes, so that you will be able to restore it quickly if the upgrade fails (blocks being immutable, they should not be impacted)
+  6. Install the new binary, update the configuration
+  7. Start the whole cluster
+  8. If needed, run the corresponding migration from `garage migrate`
+  9. Make sure the health of your cluster is good
+  10. Enable API access (uncomment the configuration in your reverse proxy)
+  11. Monitor your cluster while load comes back, and check that all your applications are happy with this new version
+
+We write a guide for each advanced upgrade; they are stored under the "Working Documents" section of this documentation.
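
The version rules stated in `upgrading.md` above can be summarized in a small helper. This is only an illustrative sketch and not part of Garage: the names `parse_version` and `classify_upgrade` are hypothetical, tags are assumed to look like `v0.6.1`, and a major bump of exactly one is assumed to count as an advanced upgrade, because the version number alone cannot tell whether intermediate minor versions were skipped.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse a tag such as 'v0.6.1' into (major, minor, patch)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_upgrade(current: str, target: str) -> str:
    """Return 'straightforward', 'advanced' or 'forbidden' for an upgrade.

    Hypothetical helper that mirrors the rules of the guide:
    - only the patch number changed: straightforward upgrade
    - the next minor (or, by assumption, the next major): advanced upgrade
    - anything that skips a minor or a major version: forbidden
    """
    cur = parse_version(current)
    tgt = parse_version(target)
    if tgt <= cur:
        raise ValueError("target version must be newer than current version")
    if cur[:2] == tgt[:2]:
        return "straightforward"
    if tgt[0] == cur[0] and tgt[1] == cur[1] + 1:
        return "advanced"
    # Assumption: a major bump of exactly one is treated as advanced.
    if tgt[0] == cur[0] + 1:
        return "advanced"
    return "forbidden"
```

With the examples from the guide, `classify_upgrade("v0.6.0", "v0.6.1")` is a straightforward upgrade, `classify_upgrade("v0.6.1", "v0.7.0")` is an advanced one, and `classify_upgrade("v0.6.0", "v0.8.0")` is forbidden. Note that going from v0.6.0 directly to v0.7.0 is also classified as advanced, although the guide describes that path as untested.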
--- a/doc/book/working-documents/migration-06.md
+++ b/doc/book/working-documents/migration-06.md
@@ -4,12 +4,12 @@ weight = 15
 +++

 **This guide explains how to migrate to 0.6 if you have an existing 0.5 cluster.
-We don't recommend trying to migrate directly from 0.4 or older to 0.6.**
+We don't recommend trying to migrate directly from versions older than 0.5.**

 **We make no guarantee that this migration will work perfectly:
 back up all your data before attempting it!**

-Garage v0.6 (not yet released) introduces a new data model for buckets,
+Garage v0.6 introduces a new data model for buckets,
 that allows buckets to have many names (aliases).
 Buckets can also have "private" aliases (called local aliases),
 which are only visible when using a certain access key.
--- a/doc/book/working-documents/migration-07.md
+++ b/doc/book/working-documents/migration-07.md
@@ -0,0 +1,31 @@
++++
+title = "Migrating from 0.6 to 0.7"
+weight = 14
++++
+**This guide explains how to migrate to 0.7 if you have an existing 0.6 cluster.
+We don't recommend trying to migrate directly from versions older than 0.6.**
+
+**We make no guarantee that this migration will work perfectly:
+back up all your data before attempting it!**
+
+Garage v0.7 introduces a cluster protocol change to support request tracing through OpenTelemetry.
+No data structure is changed, so no data migration is required.
+
+The migration steps are as follows:
+
+1. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`,
+   check the logs and check that all data seems to be synced correctly between
+   nodes. If you have time, do additional checks (`scrub`, `block_refs`, etc.)
+2. Disable API and web access. Garage does not support disabling
+   these endpoints but you can change the port number or stop your reverse
+   proxy for instance.
+3. Check once again that your cluster is healthy. Run `garage repair --all-nodes --yes tables` again, which is quick.
+   Also check that your queues are empty; run `garage stats` to query them.
+4. Turn off Garage v0.6
+5. Back up the metadata folder of all your nodes: `cd /var/lib/garage ; tar -acf meta-v0.6.tar.zst meta/`
+6. Install Garage v0.7, edit the configuration if you plan to use OpenTelemetry or the Kubernetes integration
+7. Turn on Garage v0.7
+8. Do `garage repair --all-nodes --yes tables` and `garage repair --all-nodes --yes blocks`
+9. Your upgraded cluster should be in a working state. Re-enable API and Web
+    access and check that everything went well.
+10. Monitor your cluster over the next few hours to see if it works well under your production load, and report any issues.
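
The 0.6 to 0.7 migration steps above can also be written down as an ordered checklist that separates the commands quoted in the guide from the actions an operator performs by hand. This is a hedged sketch, not a Garage tool: the `Step` type and the `migration_plan_06_to_07` function are hypothetical names, the script only prints the plan and executes nothing, and the `/var/lib/garage` path is the one used in the guide, which may differ on your nodes.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    kind: str    # "command" (copied from the guide) or "manual" (operator action)
    action: str


def migration_plan_06_to_07() -> List[Step]:
    """Return the migration steps of the guide in order (dry run only)."""
    repair_tables = "garage repair --all-nodes --yes tables"
    repair_blocks = "garage repair --all-nodes --yes blocks"
    return [
        Step("command", repair_tables),
        Step("command", repair_blocks),
        Step("manual", "disable API and web access (e.g. stop the reverse proxy)"),
        Step("command", repair_tables),
        Step("command", "garage stats"),
        Step("manual", "turn off Garage v0.6 on every node"),
        Step("command", "cd /var/lib/garage ; tar -acf meta-v0.6.tar.zst meta/"),
        Step("manual", "install Garage v0.7 and edit the configuration if needed"),
        Step("manual", "turn on Garage v0.7 on every node"),
        Step("command", repair_tables),
        Step("command", repair_blocks),
        Step("manual", "re-enable API and web access"),
        Step("manual", "monitor the cluster under production load"),
    ]


if __name__ == "__main__":
    # Print a numbered checklist; nothing is executed.
    for number, step in enumerate(migration_plan_06_to_07(), start=1):
        print(f"{number:2d}. [{step.kind}] {step.action}")
```

Keeping the plan as plain data makes it easy to review the order (for example, that the metadata backup happens after Garage v0.6 is stopped and before v0.7 is started) before touching a production cluster.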