Upgrade without downtime #436
Reference: Deuxfleurs/garage#436
Currently, in the documentation, our suggested upgrade procedure is to take all nodes offline for the duration of the upgrade. This is not a viable solution for production clusters that need to be continuously serving traffic. This issue covers the two aspects of solving this problem:
Concerning point 1, here are the current constraints:
Note that there is technically no need for the whole cluster to be taken down at once for the upgrade. Here is what we could easily relax in the current migration procedure:
For the upgrade itself, and without further tools, i.e. with Garage in its current state (v0.8.0), the following two upgrade procedures are the best we can achieve:
- **Using dangerous mode** (this weakens the guarantees on writes during the upgrade, as they need to be validated by only one node, but should provide zero downtime if the switchover is done properly): switch the cluster to replication mode `3-dangerous`, then reboot nodes into the new version one by one or in small groups, ensuring that at least one node remains available for each partition. In a 3-zone layout, this means upgrading all nodes of each zone, one zone after the other. In a 4-zone layout, upgrade the nodes of two selected zones together, then the nodes of the two remaining zones together. Once all nodes are done, put the cluster back in standard replication mode `3`. Note that changing the replication mode of the cluster itself already requires rebooting all nodes one after the other. (This cannot be done in a layout with 5 or more zones.)
- **Without dangerous mode**: just reboot all nodes into the new version at the same time. This is not zero-downtime, but can probably be done with downtime under 1 minute.
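The mode switch in the first procedure is a configuration change applied on every node (followed by a restart). A sketch of the relevant `garage.toml` fragment, assuming the v0.8 configuration syntax and omitting all other settings:

```toml
# garage.toml (excerpt) — step 1: relax replication before the rolling reboot.
# While this is in effect, a write is acknowledged by a single node.
replication_mode = "3-dangerous"

# Step 2, once every node runs the new version: restore the standard mode,
# again via a rolling restart of each node.
# replication_mode = "3"
```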
The following features could be provided by Garage to ease the process:
All upgrade procedures should be tested thoroughly and documented as completely as possible.
Here is another possibility: wait until all nodes have the new binary available before switching over to the new version.
This is similar to your (hard) mode, but leveraging both the old and new server binaries.
The specific handover mechanism can be implemented in multiple ways. I suggest investigating handing over the listening socket's file descriptor over a unix domain socket, or alternatively with `pidfd_getfd`. Another option is to simply stop listening on the old server and start listening on the new instance, though that would incur more downtime. The old instances could also "reverse-proxy" traffic for the new instances while waiting for the transition to be ready, so that the handover happens faster.
New connections would be performed with the new server instance, while the old instance would do its best to serve the existing ones.
I don't know much about the S3 protocol, so I am not sure whether connections are long-running, whether they can be interrupted, etc. (What should happen if a big file is being transferred? Can the transfer be interrupted, or marked as "in use" in the new instance?)
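The file-descriptor handover suggested above can be sketched with SCM_RIGHTS fd passing. This is a minimal illustration, not Garage code: it simulates the "old" and "new" server ends with a `socketpair` in a single process, whereas a real handover would use a named unix socket path agreed on by both binaries. Requires Python 3.9+ on a Unix system for `socket.send_fds`/`socket.recv_fds`.

```python
import socket

# Stand-in for the listening socket that the "old" server instance holds.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]

# Channel between the old and new process (here: a socketpair in one process;
# in reality, a unix domain socket at a well-known path).
old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Old server side: send the listener's fd alongside a one-byte message.
socket.send_fds(old_end, [b"l"], [listener.fileno()])

# New server side: receive the fd and rebuild a socket object around it.
msg, fds, _flags, _addr = socket.recv_fds(new_end, 1, 1)
inherited = socket.socket(fileno=fds[0])

# The inherited socket refers to the same bound, listening TCP socket,
# so the new instance can accept() without the port ever being closed.
assert inherited.getsockname()[1] == port
print("handover ok, port", port)
```

The new instance starts accepting on the inherited socket while the old one drains its existing connections, which is what makes the handover zero-downtime for new connections.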
While writing the above, I came across a few interesting posts:
We have tested and validated an upgrade procedure with minimal downtime, where nodes are restarted simultaneously. This can be very fast if well coordinated. There is no plan to improve further by adding complex logic into Garage itself.