replace RPC stack with netapp #123

Merged
lx merged 5 commits from netapp into main 2021-10-25 13:49:35 +00:00
Owner

NOTICE

  • this change removes the migration path from the v0.2 series
  • configuration/setup is vastly changed, please refer to new documentation

TODO

  • Update documentation section on config file reference
  • Persist peer list in local file
  • Netapp: test priority queue (it's probably wrong)
  • Consul advertisement that includes node's public key in NodeMeta
  • Check if we can NAT with a different port thanks to rpc_public_addr
  • Try a migration from 0.3 to 0.4
  • Document (at least roughly) migration path from 0.3 to 0.4
NOTICE - this change removes the migration path from the `v0.2` series - configuration/setup is vastly changed, please refer to new documentation TODO - [x] Update documentation section on config file reference - [x] Persist peer list in local file - [x] Netapp: test priority queue (it's probably wrong) - [x] Consul advertisement that includes node's public key in `NodeMeta` - [ ] Check if we can NAT with a different port thanks to `rpc_public_addr` - [x] Try a migration from 0.3 to 0.4 - [x] Document (at least roughly) migration path from 0.3 to 0.4
lx changed title from WIP replace RPC stack with netapp to WIP: replace RPC stack with netapp 2021-10-18 13:28:01 +00:00
lx force-pushed netapp from 8ae40088d9 to 97c5d5c77f 2021-10-19 12:35:54 +00:00 Compare
lx force-pushed netapp from 4a555c0c7d to d9c52e9a9c 2021-10-19 21:40:42 +00:00 Compare
lx added 1 commit 2021-10-19 21:53:58 +00:00
Fix clippy
Some checks failed
continuous-integration/drone/push Build is failing
continuous-integration/drone/pr Build is failing
f1a68f6b57
lx added 2 commits 2021-10-19 22:00:42 +00:00
run cargo2nix
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
c18920f8d1
quentin reviewed 2021-10-20 07:31:44 +00:00
@ -127,0 +110,4 @@
563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901
Venus$ garage node-id
86f0f26ae4afbd59aaf9cfb059eefac844951efd5b8caeec0d53f4ed6c85f332[fc00:1::2]:3901
Owner

Maybe a missing @ here?

Maybe a missing @ here?
quentin reviewed 2021-10-20 07:31:54 +00:00
@ -127,0 +120,4 @@
```toml
bootstrap_peers = [
"563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901",
"86f0f26ae4afbd59aaf9cfb059eefac844951efd5b8caeec0d53f4ed6c85f332[fc00:1::2]:3901",
Owner

idem here?

idem here?
quentin reviewed 2021-10-20 07:37:58 +00:00
@ -64,1 +65,3 @@
(garage server -c /tmp/config.$count.toml 2>&1|while read r; do echo -en "$LABEL $r\n"; done) &
(garage -c /tmp/config.$count.toml server 2>&1|while read r; do echo -en "$LABEL $r\n"; done) &
done
# >>>>>>>>>>>>>>>> END FOR LOOP ON NODES
Owner

check socat multiple launch. Tracked in #124

check socat multiple launch. Tracked in #124
lx added 1 commit 2021-10-20 14:10:29 +00:00
Update documentation
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
6455347955
lx added 1 commit 2021-10-20 22:30:31 +00:00
Discovery via consul
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
5e986c443f
lx force-pushed netapp from 832602d7f3 to 1dc7cc7936 2021-10-21 10:34:10 +00:00 Compare
lx added 1 commit 2021-10-21 11:46:30 +00:00
Persist peer list to file
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
2fa65554f1
Owner

you need to run cargo2nix -f again after updating netapp :)

you need to run `cargo2nix -f` again after updating netapp :)
lx added 1 commit 2021-10-21 12:10:20 +00:00
update cargo.nix
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
b7eccf5264
quentin approved these changes 2021-10-21 13:21:48 +00:00
quentin left a comment
Owner

I have reviewed your merge request and agree to merge it.

Before merging it, I think our main requirement is to make garage node-id works when the meta directory has not been created yet (please refer to my comment).

It seems to be an already existing bug but we may display an empty failed node list in some cases that is very useless when debugging (please refer to my comment). We must at least track this.

Might be for a future PR but we should improve error messages, especially when:

  • a file or folder does not exist
  • when the rpc_secret does not match
  • when we failed to meet the quorum
  • when the user put a node url with the wrong format

Might also be for a future PR but the secret handshake protocol will be a bit new for our users, we should write some documentation about securing a cluster. We must mention:

  • rpc_secret
  • node_id
  • firewalls
  • (hardened systemd service)

I have not:

  • tested Consul
  • done performance tests

If you are not confident on one of these two points, I can test them before you merge.

I have reviewed your merge request and agree to merge it. Before merging it, I think our main requirement is to make `garage node-id` works when the `meta` directory has not been created yet (please refer to my comment). It seems to be an already existing bug but we may display an empty failed node list in some cases that is very useless when debugging (please refer to my comment). We must at least track this. Might be for a future PR but we should improve error messages, especially when: - a file or folder does not exist - when the `rpc_secret` does not match - when we failed to meet the quorum - when the user put a node url with the wrong format Might also be for a future PR but the secret handshake protocol will be a bit new for our users, we should write some documentation about securing a cluster. We must mention: - rpc_secret - node_id - firewalls - (hardened systemd service) I have not: - tested Consul - done performance tests If you are not confident on one of these two points, I can test them before you merge.
@ -391,3 +388,2 @@
if failure_case_1 || failure_case_2 {
println!("\nFailed nodes:");
for adv in status.iter().filter(|x| !x.is_up) {
println!("\n==== FAILED NODES ====");
Owner

If we are on failure_case_2 we display the failed nodes section but it is empty.

It can be reproduced by spanning 2 garage instances then connecting the first instance to the second one.

eg:

$ ./result/bin/garage -c /tmp/garage.toml status
==== HEALTHY NODES ====
ID                 Hostname     Address                  Tag               Zone  Capacity
2f75a8a915f9cceb…  lheureduthe  127.0.0.1:3911           NO ROLE ASSIGNED
c393f0ab78d8ddcd…  lheureduthe  [::ffff:127.0.0.1]:3901  NO ROLE ASSIGNED

$ ./result/bin/garage status
==== HEALTHY NODES ====
ID                 Hostname     Address         Tag               Zone  Capacity
c393f0ab78d8ddcd…  lheureduthe  127.0.0.1:3901  NO ROLE ASSIGNED

==== FAILED NODES ====
ID  Hostname  Address  Tag  Zone  Capacity  Last seen
If we are on `failure_case_2` we display the failed nodes section but it is empty. It can be reproduced by spanning 2 garage instances then connecting the first instance to the second one. eg: ``` $ ./result/bin/garage -c /tmp/garage.toml status ==== HEALTHY NODES ==== ID Hostname Address Tag Zone Capacity 2f75a8a915f9cceb… lheureduthe 127.0.0.1:3911 NO ROLE ASSIGNED c393f0ab78d8ddcd… lheureduthe [::ffff:127.0.0.1]:3901 NO ROLE ASSIGNED $ ./result/bin/garage status ==== HEALTHY NODES ==== ID Hostname Address Tag Zone Capacity c393f0ab78d8ddcd… lheureduthe 127.0.0.1:3901 NO ROLE ASSIGNED ==== FAILED NODES ==== ID Hostname Address Tag Zone Capacity Last seen ```
@ -80,0 +97,4 @@
// Find and parse the address of the target host
let (id, addr) = if let Some(h) = opt.rpc_host {
let (id, addrs) = parse_and_resolve_peer_addr(&h).expect("Invalid RPC host");
Owner

We should replace this error message by something like that:

Invalid remote peer identifier "127.0.0.1:3911".
A remote peer identifier has the following format: <pubkey>@<address>:<port>.
Example: 9286ef3c3127b199d4a28928a261584f772dd674fc0aab9860d83a1849d3fb45@127.0.0.1:3901

The most important in my opinion is to drop the "rpc host" and "host" terminology in favor of a more abstract terminology such as "remote peer identifier" that can not be confused with the layer 3 or layer 4 terminology/patterns.

We should replace this error message by something like that: ``` Invalid remote peer identifier "127.0.0.1:3911". A remote peer identifier has the following format: <pubkey>@<address>:<port>. Example: 9286ef3c3127b199d4a28928a261584f772dd674fc0aab9860d83a1849d3fb45@127.0.0.1:3901 ``` The most important in my opinion is to drop the "rpc host" and "host" terminology in favor of a more abstract terminology such as "remote peer identifier" that can not be confused with the layer 3 or layer 4 terminology/patterns.
@ -98,0 +144,4 @@
format!("{}@127.0.0.1:3901", idstr)
};
if !quiet {
Owner

I think we can improve the wording of this very helpful and critical message :)

  1. We speak about "nodes" indiferrently, we can qualify them as "remote" or "local", "existing" or "new", etc. to help the user identifiy the new/local node he/she is configuring and the remote/existing nodes he/she has already configured.

  2. We could better separate concerns (on this node // on a cluster node, your cluster is live // you are configuring it) and prioritize our examples.

  3. Does it mean that we can join any cluster? If yes, this is a secrity issue.

I will propose an alternative text after gaining some more knowledge on this process.

I think we can improve the wording of this *very helpful and critical message* :) 1. We speak about "nodes" indiferrently, we can qualify them as "remote" or "local", "existing" or "new", etc. to help the user identifiy the new/local node he/she is configuring and the remote/existing nodes he/she has already configured. 2. We could better separate concerns (on this node // on a cluster node, your cluster is live // you are configuring it) and prioritize our examples. 3. Does it mean that we can join any cluster? If yes, this is a secrity issue. I will propose an alternative text after gaining some more knowledge on this process.
@ -98,0 +158,4 @@
idstr
);
eprintln!(
"where <remote_node> is their own node identifier in the format: <pubkey>@<ip>:<port>"
Owner

Ok, I missed this point the first time I tested, I tried with 127.0.0.1:3911 and got the following error:

thread 'main' panicked at 'Invalid RPC host', main.rs:100:59

I think here we should directly put <pubkey>@<address>:<port> instead of <remote_node>.

Ok, I missed this point the first time I tested, I tried with `127.0.0.1:3911` and got the following error: ``` thread 'main' panicked at 'Invalid RPC host', main.rs:100:59 ``` I think here we should directly put `<pubkey>@<address>:<port>` instead of `<remote_node>`.
@ -0,0 +128,4 @@
} else {
let (_, key) = ed25519::gen_keypair();
let mut f = std::fs::File::create(key_file.as_path())?;
Owner
  1. Create a file /etc/garage.toml with the content given in the Quickstart
  2. Never start the daemon (check that no meta or data folder have been created)
  3. Run garage node-id (this is similar to the steps advertised in "Cookbook > Deploying Garage"
  4. Get an error:
strace -e open ./result/bin/garage node-id
open("/proc/self/cgroup", O_RDONLY|O_CLOEXEC) = 3
open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3
open("/sys/fs/cgroup/cpu,cpuacct/user.slice/user-1000.slice/user@1000.service/cpu.cfs_quota_us", O_RDONLY|O_CLOEXEC) = 3
open("/etc/garage.toml", O_RDONLY|O_CLOEXEC) = 9
open("/tmp/meta/node_key", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = -1 ENOENT (Aucun fichier ou dossier de ce type)
 ERROR garage > Unable to read or generate node key: IO error: No such file or directory (os error 2)
+++ exited with 1 +++

2 points:

  1. We need to recursively create folders before trying to write the key
  2. We might want to improve our error reporting by specifying the failed path.
1. Create a file `/etc/garage.toml` with the content given in the Quickstart 2. Never start the daemon (check that no meta or data folder have been created) 3. Run `garage node-id` (this is similar to the steps advertised in "Cookbook > Deploying Garage" 4. Get an error: ``` strace -e open ./result/bin/garage node-id open("/proc/self/cgroup", O_RDONLY|O_CLOEXEC) = 3 open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3 open("/sys/fs/cgroup/cpu,cpuacct/user.slice/user-1000.slice/user@1000.service/cpu.cfs_quota_us", O_RDONLY|O_CLOEXEC) = 3 open("/etc/garage.toml", O_RDONLY|O_CLOEXEC) = 9 open("/tmp/meta/node_key", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = -1 ENOENT (Aucun fichier ou dossier de ce type) ERROR garage > Unable to read or generate node key: IO error: No such file or directory (os error 2) +++ exited with 1 +++ ``` 2 points: 1. We need to recursively create folders before trying to write the key 2. We might want to improve our error reporting by specifying the failed path.
@ -0,0 +130,4 @@
} else {
let (_, key) = ed25519::gen_keypair();
let mut f = std::fs::File::create(key_file.as_path())?;
Owner
$ RUST_LOG=garage=debug,netapp=debug strace -e open ./result/bin/garage node-id
open("/proc/self/cgroup", O_RDONLY|O_CLOEXEC) = 3
open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3
open("/sys/fs/cgroup/cpu,cpuacct/user.slice/user-1000.slice/user@1000.service/cpu.cfs_quota_us", O_RDONLY|O_CLOEXEC) = 3
open("/etc/garage.toml", O_RDONLY|O_CLOEXEC) = 9
open("/tmp/meta/node_key", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = -1 ENOENT (Aucun fichier ou dossier de ce type)
 ERROR garage > Unable to read or generate node key: IO error: No such file or directory (os error 2)
+++ exited with 1 +++

There are many chance that the garage node-id command will fail as the key will be stored in the meta folder that will very likely not be created yet and it will throw the previous cryptic error I a diagnosed through strace.

``` $ RUST_LOG=garage=debug,netapp=debug strace -e open ./result/bin/garage node-id open("/proc/self/cgroup", O_RDONLY|O_CLOEXEC) = 3 open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3 open("/sys/fs/cgroup/cpu,cpuacct/user.slice/user-1000.slice/user@1000.service/cpu.cfs_quota_us", O_RDONLY|O_CLOEXEC) = 3 open("/etc/garage.toml", O_RDONLY|O_CLOEXEC) = 9 open("/tmp/meta/node_key", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = -1 ENOENT (Aucun fichier ou dossier de ce type) ERROR garage > Unable to read or generate node key: IO error: No such file or directory (os error 2) +++ exited with 1 +++ ``` There are many chance that the `garage node-id` command will fail as the key will be stored in the `meta` folder that will very likely not be created yet and it will throw the previous cryptic error I a diagnosed through strace.
@ -50,3 +26,1 @@
#[error(display = "PKI error: {}", _0)]
Pki(#[error(source)] webpki::Error),
#[error(display = "Netapp error: {}", _0)]
Owner

When the wrong rpc_secret is used and that it leads to a failed handshake, we should add a hint to the user that it may check his/her secret key.

When the wrong `rpc_secret` is used and that it leads to a failed handshake, we should add a hint to the user that it may check his/her secret key.
@ -71,3 +49,2 @@
#[error(display = "Remote error: {} (status code {})", _0, _1)]
RemoteError(String, StatusCode),
#[error(display = "Too many errors: {:?}", _0)]
Owner

We discussed renaming this error "FailedQuorumError" or something similar as this is the only case it is fired.

We discussed renaming this error "FailedQuorumError" or something similar as this is the only case it is fired.
Owner

Fixes #101
Fixes #36

Fixes #101 Fixes #36
lx added 1 commit 2021-10-21 15:29:45 +00:00
Move things around, improvements to CLI
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
bff5333c62
lx requested review from quentin 2021-10-21 15:42:20 +00:00
lx force-pushed netapp from 0a45a2b8ee to dbe457d3fa 2021-10-21 15:43:48 +00:00 Compare
lx changed title from WIP: replace RPC stack with netapp to replace RPC stack with netapp 2021-10-22 13:35:43 +00:00
lx force-pushed netapp from d838d604eb to f4d246ee24 2021-10-22 13:48:07 +00:00 Compare
lx force-pushed netapp from f4d246ee24 to 991fe2032f 2021-10-22 13:50:59 +00:00 Compare
lx force-pushed netapp from 991fe2032f to 2a1dc24710 2021-10-22 13:53:51 +00:00 Compare
lx force-pushed netapp from 2a1dc24710 to db2924ad80 2021-10-22 13:55:51 +00:00 Compare
lx force-pushed netapp from db2924ad80 to 28c3c27c26 2021-10-22 13:59:07 +00:00 Compare
lx force-pushed netapp from 28c3c27c26 to 2eba3d6d62 2021-10-22 14:52:34 +00:00 Compare
lx force-pushed netapp from 2eba3d6d62 to 6d8b74cf8d 2021-10-22 14:55:07 +00:00 Compare
lx force-pushed netapp from 6d8b74cf8d to 88a91fe648 2021-10-22 14:56:05 +00:00 Compare
lx force-pushed netapp from e84432181f to 060ed9dcd0 2021-10-22 19:56:37 +00:00 Compare
lx force-pushed netapp from 060ed9dcd0 to cda4523872 2021-10-22 20:04:19 +00:00 Compare
lx force-pushed netapp from eb8c93d50d to 6ab65b6ef4 2021-10-25 07:03:11 +00:00 Compare
lx force-pushed netapp from 6ab65b6ef4 to d5a8ce7fc7 2021-10-25 12:00:05 +00:00 Compare
lx force-pushed netapp from d5a8ce7fc7 to df8a4068d9 2021-10-25 12:21:56 +00:00 Compare
lx added 1 commit 2021-10-25 13:25:25 +00:00
Disable i686 and armv6l pipelines for now
All checks were successful
continuous-integration/drone/pr Build is passing
continuous-integration/drone/tag Build is passing
continuous-integration/drone Build is passing
continuous-integration/drone/push Build is passing
f6ebcbc7a7
lx merged commit f6ebcbc7a7 into main 2021-10-25 13:49:35 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: Deuxfleurs/garage#123
No description provided.