CLI: default rpc_host

Alex 2021-10-26 11:22:28 +02:00
parent 43e13a501d
commit 3e7f766d95
No known key found for this signature in database
GPG Key ID: EDABF9711E244EB1
3 changed files with 90 additions and 29 deletions

@ -1,6 +1,8 @@
# Migrating from 0.3 to 0.4
**Migrating from 0.3 to 0.4 is unsupported. This document is only intended to document the process internally for the Deuxfleurs cluster where we have to do it. Do not try it yourself, you will lose your data and we will not help you.**
**Migrating from 0.3 to 0.4 is unsupported. This document is only intended to
document the process internally for the Deuxfleurs cluster where we have to do
it. Do not try it yourself, you will lose your data and we will not help you.**
**Migrating from 0.2 to 0.4 will break everything for sure. Never try it.**
@ -9,53 +11,95 @@ The Sled database is still the same, and the data directory as well.
The following has changed, all in the meta directory:
- `node_id` in 0.3 contains the identifier of the current node. In 0.4, this file does nothing and should be deleted. It is replaced by `node_key` (the secret key) and `node_key.pub` (the associated public key). A node's identifier on the ring is its public key.
- `node_id` in 0.3 contains the identifier of the current node. In 0.4, this
file does nothing and should be deleted. It is replaced by `node_key` (the
secret key) and `node_key.pub` (the associated public key). A node's
identifier on the ring is its public key.
- `peer_info` in 0.3 contains the list of peers saved automatically by Garage. The format has changed and it is now stored in `peer_list` (`peer_info` should be deleted).
- `peer_info` in 0.3 contains the list of peers saved automatically by Garage.
The format has changed and it is now stored in `peer_list` (`peer_info`
should be deleted).
When migrating, all node identifiers will change. This also means that the assignment of data partitions on the ring will change, and lots of data will have to be rebalanced.
When migrating, all node identifiers will change. This also means that the
assignment of data partitions on the ring will change, and lots of data will
have to be rebalanced.
- If your cluster has only 3 nodes, all nodes store everything, therefore nothing has to be rebalanced.
- If your cluster has only 4 nodes, for any partition there will always be at least 2 nodes that stored data before that still store it after. Therefore the migration should in theory be transparent and Garage should continue to work during the rebalance.
- If your cluster has only 4 nodes, for any partition there will always be at
least 2 nodes that stored data before that still store it after. Therefore
the migration should in theory be transparent and Garage should continue to
work during the rebalance.
- If your cluster has 5 or more nodes, data will disappear during the migration. Do not migrate (fortunately we don't have this scenario at Deuxfleurs), or if you do, make Garage unavailable until things stabilize (disable web and api access).
- If your cluster has 5 or more nodes, data will disappear during the
migration. Do not migrate (fortunately we don't have this scenario at
Deuxfleurs), or if you do, make Garage unavailable until things stabilize
(disable web and api access).
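The node-count cases above follow from a counting argument. As a minimal sketch, assuming each partition is stored on 3 nodes (an assumption, consistent with "all nodes store everything" on a 3-node cluster) and that the old and new assignments are otherwise unrelated, the hypothetical helper `min_overlap` brute-forces the smallest number of nodes that two assignments of the same partition can have in common:

```rust
/// Smallest possible number of nodes shared by two sets of `replicas`
/// nodes chosen among `nodes` nodes, found by brute force over bitmasks.
/// Hypothetical helper for illustration; limited to small clusters.
fn min_overlap(nodes: u32, replicas: u32) -> u32 {
    assert!(replicas <= nodes && nodes <= 12);
    // Every bitmask with exactly `replicas` bits set is one possible
    // assignment of a partition to nodes.
    let sets: Vec<u32> = (0u32..(1u32 << nodes))
        .filter(|m| m.count_ones() == replicas)
        .collect();
    let mut best = replicas;
    for &old in &sets {
        for &new in &sets {
            // Nodes that held the partition before and still hold it after.
            best = best.min((old & new).count_ones());
        }
    }
    best
}
```

Under these assumptions, `min_overlap(3, 3)` is 3, `min_overlap(4, 3)` is 2 and `min_overlap(5, 3)` is 1, which matches the three cases listed above: from 5 nodes on, a partition may keep only a single node in common between the old and new layout.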
The migration steps are as follows:
1. Prepare a new configuration file for 0.4. For each node, point to the same meta and data directories as Garage 0.3. Basically, the things that change are the following:
1. Prepare a new configuration file for 0.4. For each node, point to the same
meta and data directories as Garage 0.3. Basically, the things that change
are the following:
- No more `rpc_tls` section
- You have to generate a shared `rpc_secret` and put it in all config files
- `bootstrap_nodes` has a different syntax as it has to contain node keys. Leave it empty and use `garage node-id` and `garage node connect` instead (new features of 0.4)
- `bootstrap_peers` has a different syntax as it has to contain node keys.
Leave it empty and use `garage node-id` and `garage node connect` instead (new features of 0.4)
- Put the publicly accessible RPC address of your node in `rpc_public_addr` if possible (it's optional but recommended)
- If you are using Consul, change the `consul_service_name` to NOT be the name advertised by Nomad. Now Garage is responsible for advertising its own service itself.
- If you are using Consul, change the `consul_service_name` to NOT be the name advertised by Nomad.
Now Garage is responsible for advertising its own service itself.
2. Disable api and web access for some time, do `garage repair --all --yes tables` and `garage repair --all --yes blocks`, check the logs and check that all data seems to be synced correctly between nodes.
2. Disable api and web access for some time (Garage does not support disabling
these endpoints but you can change the port number or stop your reverse
proxy for instance).
3. Save somewhere the output of `garage status`. We will need this to remember how to reconfigure nodes in 0.4.
3. Do `garage repair -a --yes tables` and `garage repair -a --yes blocks`,
check the logs and check that all data seems to be synced correctly between
nodes.
4. Turn off Garage 0.3
4. Save somewhere the output of `garage status`. We will need this to remember
how to reconfigure nodes in 0.4.
5. Back up metadata folders if you can (i.e. if you have space to do it somewhere). Backing up data folders could also be useful but that's much harder to do. If your filesystem supports snapshots, this could be a good time to use them.
5. Turn off Garage 0.3
6. Turn on Garage 0.4
6. Back up metadata folders if you can (i.e. if you have space to do it
somewhere). Backing up data folders could also be useful but that's much
harder to do. If your filesystem supports snapshots, this could be a good
time to use them.
7. At this point, running `garage status` should indicate that all nodes of the previous cluster are "unavailable". The nodes have new identifiers that should appear in healthy nodes once they can talk to one another (use `garage node connect` if necessary). They should have NO ROLE ASSIGNED at the moment.
7. Turn on Garage 0.4
8. Prepare a script with several `garage node configure` commands that replace each of the v0.3 node ID with the corresponding v0.4 node ID, with the same zone/tag/capacity. For example if your node `drosera` had identifier `c24e` before and now has identifier `789a`, and it was configured with capacity `2` in zone `dc1`, put the following command in your script:
8. At this point, running `garage status` should indicate that all nodes of the
previous cluster are "unavailable". The nodes have new identifiers that
should appear in healthy nodes once they can talk to one another (use
`garage node connect` if necessary). They should have NO ROLE ASSIGNED at
the moment.
9. Prepare a script with several `garage node configure` commands that replace
each of the v0.3 node ID with the corresponding v0.4 node ID, with the same
zone/tag/capacity. For example if your node `drosera` had identifier `c24e`
before and now has identifier `789a`, and it was configured with capacity
`2` in zone `dc1`, put the following command in your script:
```bash
garage node configure 789a -z dc1 -c 2 -t drosera --replace c24e
```
9. Run your reconfiguration script. Check that the new output of `garage status` contains the correct node IDs with the correct values for capacity and zone. Old nodes should no longer be mentioned.
10. Run your reconfiguration script. Check that the new output of `garage
status` contains the correct node IDs with the correct values for capacity
and zone. Old nodes should no longer be mentioned.
10. If your cluster has 4 nodes or fewer, and you are feeling adventurous, you can re-enable Web and API access now. Things will probably work.
11. If your cluster has 4 nodes or fewer, and you are feeling adventurous, you
can re-enable Web and API access now. Things will probably work.
11. Garage might already be resyncing stuff. Issue a `garage repair --all --yes tables` and `garage repair --all --yes blocks` to force it to do so.
12. Garage might already be resyncing stuff. Issue a `garage repair -a --yes
tables` and `garage repair -a --yes blocks` to force it to do so.
12. Wait for resyncing activity to stop in the logs. Do steps 11 and 12 two or three times, until you see that when you issue the repair commands, nothing gets resynced any longer.
13. Wait for resyncing activity to stop in the logs. Do steps 12 and 13 two or
three times, until you see that when you issue the repair commands, nothing
gets resynced any longer.
13. Your upgraded cluster should be in a working state. Re-enable API and Web access and check that everything went well.
14. Your upgraded cluster should be in a working state. Re-enable API and Web
access and check that everything went well.
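The reconfiguration script from the steps above can be generated rather than typed by hand. The following is only a sketch: the `NodeMapping` struct and the `configure_commands` function are hypothetical names, and the only `garage` syntax used is the `garage node configure ... --replace ...` form shown in the example command above. The values would come from the saved 0.3 `garage status` output and the new 0.4 node IDs.

```rust
/// One node of the old cluster paired with its new 0.4 identifier.
/// Hypothetical type for illustration.
struct NodeMapping<'a> {
    old_id: &'a str,
    new_id: &'a str,
    zone: &'a str,
    capacity: u32,
    tag: &'a str,
}

/// Build one `garage node configure` line per node, using the same
/// flags as the example command in the migration steps.
fn configure_commands(nodes: &[NodeMapping<'_>]) -> Vec<String> {
    nodes
        .iter()
        .map(|n| {
            format!(
                "garage node configure {} -z {} -c {} -t {} --replace {}",
                n.new_id, n.zone, n.capacity, n.tag, n.old_id
            )
        })
        .collect()
}
```

For the `drosera` example (old ID `c24e`, new ID `789a`, zone `dc1`, capacity `2`), this produces the same command line as the one shown in the steps.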

@ -1,5 +1,7 @@
use std::path::PathBuf;
use log::warn;
use garage_util::error::*;
pub const READ_KEY_ERROR: &str = "Unable to read node key. It will be generated by your garage node the first time it is launched. Ensure that your garage node is currently running. (The node key is supposed to be stored in your metadata directory.)";
@ -22,11 +24,11 @@ pub fn node_id_command(config_file: PathBuf, quiet: bool) -> Result<(), Error> {
println!("{}", idstr);
if !quiet {
eprintln!("WARNING: I don't know the public address to reach this node.");
eprintln!("In all of the instructions below, replace 127.0.0.1:3901 by the appropriate address and port.");
warn!("WARNING: I don't know the public address to reach this node.");
warn!("In all of the instructions below, replace 127.0.0.1:{} by the appropriate address and port.", config.rpc_bind_addr.port());
}
format!("{}@127.0.0.1:3901", idstr)
format!("{}@127.0.0.1:{}", idstr, config.rpc_bind_addr.port())
};
if !quiet {

@ -9,6 +9,7 @@ mod cli;
mod repair;
mod server;
use std::net::SocketAddr;
use std::path::PathBuf;
use structopt::StructOpt;
@ -46,6 +47,9 @@ struct Opt {
#[tokio::main]
async fn main() {
if std::env::var("RUST_LOG").is_err() {
std::env::set_var("RUST_LOG", "garage=info")
}
pretty_env_logger::init();
sodiumoxide::init().expect("Unable to init sodiumoxide");
@ -99,12 +103,23 @@ async fn cli_command(opt: Opt) -> Result<(), Error> {
let (id, addr) = if let Some(h) = opt.rpc_host {
let (id, addrs) = parse_and_resolve_peer_addr(&h).ok_or_else(|| format!("Invalid RPC remote node identifier: {}. Expected format is <pubkey>@<IP or hostname>:<port>.", h))?;
(id, addrs[0])
} else if let Some(a) = config.as_ref().map(|c| c.rpc_public_addr).flatten() {
let node_id = garage_rpc::system::read_node_id(&config.unwrap().metadata_dir)
.err_context(READ_KEY_ERROR)?;
(node_id, a)
} else {
return Err(Error::Message("No RPC host provided".into()));
let node_id = garage_rpc::system::read_node_id(&config.as_ref().unwrap().metadata_dir)
.err_context(READ_KEY_ERROR)?;
if let Some(a) = config.as_ref().map(|c| c.rpc_public_addr).flatten() {
(node_id, a)
} else {
let default_addr = SocketAddr::new(
"127.0.0.1".parse().unwrap(),
config.as_ref().unwrap().rpc_bind_addr.port(),
);
warn!(
"Trying to contact Garage node at default address {}",
default_addr
);
warn!("If this doesn't work, consider adding rpc_public_addr in your config file or specifying the -h command line parameter.");
(node_id, default_addr)
}
};
// Connect to target host
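The address selection introduced in this hunk can be read as a three-step fallback. The sketch below isolates that order in a standalone function; `pick_rpc_addr` is a hypothetical name, it uses only the standard library, and it leaves out the node ID lookup and the warnings of the real code.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Pick the address the CLI should contact, in the order used by the
/// change above: explicit host first, then `rpc_public_addr`, then
/// 127.0.0.1 on the port taken from `rpc_bind_addr`.
/// Hypothetical helper for illustration only.
fn pick_rpc_addr(
    explicit: Option<SocketAddr>,
    rpc_public_addr: Option<SocketAddr>,
    rpc_bind_addr: SocketAddr,
) -> SocketAddr {
    explicit.or(rpc_public_addr).unwrap_or_else(|| {
        // Default: loopback address, reusing the configured RPC port.
        SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), rpc_bind_addr.port())
    })
}
```

The point of the default is that a CLI run on the same machine as the node no longer needs a host argument: the port is read from the config instead of being hard-coded to 3901.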