Upgrade without downtime #436

Closed
opened 2022-12-02 13:55:12 +00:00 by lx · 2 comments
Owner

Currently, in the documentation, our suggested upgrade procedure is to take all nodes offline for the duration of the upgrade. This is not a viable solution for production clusters that need to be continuously serving traffic. This issue covers the two aspects of solving this problem:

  1. Making sure that we provide the correct tools to allow upgrades without downtime
  2. Documenting the correct procedure for upgrades without downtime, and making sure it is reliable through testing

Concerning point 1, here are the current constraints:

  • Nodes running incompatible versions cannot communicate (the RPC protocols are made to be incompatible on purpose to avoid any undefined behavior)
  • Once a metadata table entry has been migrated to a new schema, it can no longer be expressed in the old schema, and cannot be transmitted over the old RPC protocol
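
To make the second constraint concrete, here is a minimal sketch of a one-way schema migration. The types are purely illustrative (they are not Garage's actual table schemas): the new schema carries a field the old one has no place for, so the conversion cannot be reversed losslessly.

```rust
// Illustrative only: hypothetical old and new schemas for a metadata entry.
struct ObjectVersionOld {
    uuid: [u8; 16],
    size: u64,
}

struct ObjectVersionNew {
    uuid: [u8; 16],
    size: u64,
    etag: Option<String>, // new field with no counterpart in the old schema
}

// Forward migration is straightforward...
fn migrate(old: ObjectVersionOld) -> ObjectVersionNew {
    ObjectVersionNew {
        uuid: old.uuid,
        size: old.size,
        etag: None,
    }
}

fn main() {
    let migrated = migrate(ObjectVersionOld { uuid: [0; 16], size: 42 });
    // ...but once an entry carries an etag, there is no lossless way back:
    // a migrated entry can no longer be expressed in the old schema, and
    // therefore cannot be sent over the old RPC protocol.
    assert!(migrated.etag.is_none());
}
```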

Note that there is technically no need for the whole cluster to be taken offline at once for the upgrade. Here is what we could easily relax in the current migration procedure:

  • for the backup procedure, it can be done node-by-node or zone-by-zone, although there is a risk of ending up with inconsistent snapshots across the different nodes
  • for the offline repair procedures, there is no need to do all nodes at once: nodes can be taken offline one after the other to run the procedure without impacting the rest of the cluster

For the upgrade itself, and without further tools, i.e. with Garage in its current state (v0.8.0), the following two upgrade procedures are the best we can achieve:

  • *By using dangerous mode* (this reduces the guarantees on writes during the upgrade, as they only need to be validated by a single node, but it should provide zero downtime if the switchover is done properly): switch the cluster to replication mode 3-dangerous, then reboot nodes into the new version one by one or in small groups, ensuring that there is always at least one available node for each partition. For instance, in a 3-zone layout, this means upgrading all nodes of each zone, one zone after the other. In a 4-zone layout, it can be achieved by upgrading the nodes of two selected zones together, and then the nodes of the two remaining zones together. Once all nodes are done, put the cluster back in standard replication mode 3. Note that changing the replication mode of the cluster itself already requires rebooting all nodes one after the other. (This approach cannot be used in a layout with 5 or more zones; see the sketch after this list.)

  • *Without dangerous mode*: simply reboot all nodes into the new version at the same time. This is not zero-downtime, but it can probably be done with less than a minute of downtime.
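
To make the zone-grouping constraint in the dangerous-mode procedure concrete, here is a small illustrative check (hypothetical code, not Garage's layout logic). With replication factor 3, the replicas of a partition live in 3 distinct zones, so a group of zones may only be upgraded together if it does not contain all 3 replica zones of any partition. This works for the 2+2 split of a 4-zone layout, but with 5 or more zones any split into two groups puts at least 3 zones in one group, which may hold every replica of some partition.

```rust
use std::collections::HashSet;

// Illustrative check (not Garage code): given each partition's 3 replica zones,
// verify that a group of zones can be rebooted together without taking all
// replicas of some partition offline at once.
fn group_is_safe(partitions: &[[&str; 3]], offline: &HashSet<&str>) -> bool {
    partitions
        .iter()
        .all(|zones| zones.iter().any(|z| !offline.contains(z)))
}

fn main() {
    // 4-zone layout: every partition is spread over 3 of the 4 zones.
    let parts4 = [
        ["z1", "z2", "z3"],
        ["z1", "z2", "z4"],
        ["z1", "z3", "z4"],
        ["z2", "z3", "z4"],
    ];
    // Upgrading zones two by two always leaves one replica per partition online.
    assert!(group_is_safe(&parts4, &HashSet::from(["z1", "z2"])));
    assert!(group_is_safe(&parts4, &HashSet::from(["z3", "z4"])));

    // 5-zone layout split into two groups: one group has at least 3 zones,
    // and a partition placed exactly on those zones loses all its replicas.
    let parts5 = [["z3", "z4", "z5"], ["z1", "z2", "z3"]];
    assert!(!group_is_safe(&parts5, &HashSet::from(["z3", "z4", "z5"])));
    println!("2+2 split of 4 zones: ok; two-group split of 5+ zones: unsafe");
}
```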

The following features could be provided by Garage to ease the process:

  • (easy) make the replication mode configurable in the layout, so that the cluster can switch very rapidly to 3-dangerous and then back to 3, without having to reboot all nodes.
  • (hard) embed both version n-1 and version n of the data model in each Garage binary, so that "switching all nodes to the new version at the same time" can be achieved without restarting the garage process, simply by flipping a cluster-wide switch (the value could also be propagated as part of the layout), thus reducing the disruption to at most a few seconds (sketched below)
  • (medium/hard) make it possible to back up a node's state by taking a snapshot while the node is running
  • (probably not possible in current state) replace offline repair procedures by background tasks that run automatically during normal operation
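
As a rough illustration of the "(hard)" item above (hypothetical types and names, not a proposed design for Garage): both data models are compiled into one binary, and a cluster-wide switch, whose value could be propagated as part of the layout, decides when entries start being written in the new schema. Until it flips, everything stays expressible over the old RPC protocol, and the flip itself requires no process restart.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: a cluster-wide switch (here a simple atomic, in practice
// a value propagated between nodes) selects when the new schema starts being used.
static CLUSTER_ON_NEW_VERSION: AtomicBool = AtomicBool::new(false);

struct EntryVPrev { key: String, value: Vec<u8> }
struct EntryVNext { key: String, value: Vec<u8>, tombstone_at: Option<u64> }

enum Entry {
    Prev(EntryVPrev),
    Next(EntryVNext),
}

fn migrate(old: EntryVPrev) -> EntryVNext {
    EntryVNext { key: old.key, value: old.value, tombstone_at: None }
}

// Called whenever an entry is (re)written to the metadata table.
fn store(e: EntryVPrev) -> Entry {
    if CLUSTER_ON_NEW_VERSION.load(Ordering::SeqCst) {
        // The whole cluster has flipped the switch: migrate on write.
        Entry::Next(migrate(e))
    } else {
        // Still in the transition window: keep the old schema so the entry
        // can be sent to peers that have not flipped yet.
        Entry::Prev(e)
    }
}

fn main() {
    let before = store(EntryVPrev { key: "a".into(), value: vec![] });
    assert!(matches!(before, Entry::Prev(_)));
    // The cluster-wide flip: emitted format changes within seconds, no restart.
    CLUSTER_ON_NEW_VERSION.store(true, Ordering::SeqCst);
    let after = store(EntryVPrev { key: "b".into(), value: vec![] });
    assert!(matches!(after, Entry::Next(_)));
}
```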

All upgrade procedures should be tested thoroughly and documented as completely as possible.

lx added the Improvement and Documentation labels 2022-12-02 13:55:12 +00:00

Here is another possibility: wait until all nodes are upgraded before switching to the new version.

  1. Start the new server locally; it will tell the old instance that it is waiting for a handover.
  2. The old instance continues accepting requests, but it is now aware that an upgrade is pending. It periodically checks the other instances to see whether they are ready too.
  3. Once the upgrade is pending on all instances, they collectively decide to migrate.
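
Here is a minimal sketch of steps 2 and 3 (hypothetical code: the peer-polling RPC does not exist in Garage today). Each old instance keeps serving traffic while periodically asking its peers whether they too have a new binary waiting, and the handover is triggered only once every node reports ready.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical sketch: keep serving requests elsewhere; this loop only watches
// peer state and returns once the whole cluster can migrate together.
fn wait_for_cluster_ready<F>(peers: &[&str], mut upgrade_pending_on: F)
where
    F: FnMut(&str) -> bool,
{
    loop {
        if peers.iter().copied().all(|p| upgrade_pending_on(p)) {
            // Every node has a replacement waiting: collectively decide to
            // migrate, for instance by handing over the listening sockets
            // (sketched further below).
            println!("all peers ready, starting handover");
            return;
        }
        sleep(Duration::from_millis(500)); // periodic re-check while still serving
    }
}

fn main() {
    let peers = ["node1", "node2", "node3"];
    let mut checks = 0;
    // Simulated poll: in reality this would be an RPC to each peer.
    wait_for_cluster_ready(&peers, |_peer| {
        checks += 1;
        checks > 3 // peers become ready after the first round of checks
    });
}
```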

This is similar to your (hard) mode, but leveraging both the old and new server binaries.

The specific handover mechanism can be implemented in multiple ways. I suggest investigating handing the listening socket's file descriptor over a unix domain socket, or alternatively using `pidfd_getfd`. Another option is to just stop listening on the old server and start listening on the new instance, though that would incur more downtime.
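
For illustration, here is what the `pidfd_getfd` route could look like (a sketch only, using the libc crate's raw syscall numbers since there is no dedicated wrapper; it assumes Linux 5.6+ and ptrace-level permission over the old process, and it leaves out how the new instance learns the old process's PID and fd number, e.g. over a unix domain socket):

```rust
use std::io;
use std::net::TcpListener;
use std::os::unix::io::{FromRawFd, RawFd};

// Sketch only: duplicate the old garage process's listening socket into the
// new process with pidfd_getfd(2).
fn take_over_listener(old_pid: libc::pid_t, old_listen_fd: RawFd) -> io::Result<TcpListener> {
    // Get a pidfd referring to the old process.
    let pidfd = unsafe { libc::syscall(libc::SYS_pidfd_open, old_pid, 0u32) };
    if pidfd < 0 {
        return Err(io::Error::last_os_error());
    }
    // Copy the old process's listening socket into our own fd table.
    let fd = unsafe { libc::syscall(libc::SYS_pidfd_getfd, pidfd, old_listen_fd, 0u32) };
    unsafe { libc::close(pidfd as RawFd) };
    if fd < 0 {
        return Err(io::Error::last_os_error());
    }
    // From here on, accept() calls on this listener pick up new connections
    // while the old process drains the ones it already accepted.
    Ok(unsafe { TcpListener::from_raw_fd(fd as RawFd) })
}
```

The SCM_RIGHTS variant linked below achieves the same thing cooperatively: the old process explicitly sends the descriptor over a unix domain socket instead of having it pulled out.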

The old instances could also "reverse-proxy" traffic for the new instances while waiting for the transition to be ready, so that the handover happens faster.

New connections would be handled by the new server instance, while the old instance would do its best to finish serving the existing ones.
I don't know much about the S3 protocol, so I am not sure whether connections are long-running, whether they can be interrupted, etc. (what to do if a big file is being transferred? can it be interrupted? marked as "in use" in the new instance? etc.)

While writing the above, I came across a few interesting posts:

  • https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec also mentions the name "Socket Takeover", apparently in use at Facebook
  • https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4
Author
Owner

We have tested and validated an upgrade procedure with minimal downtime, where nodes are restarted simultaneously. This can be very fast if well coordinated. There is no plan to improve further by adding complex logic into Garage itself.

lx closed this issue 2023-10-16 10:00:18 +00:00