diff --git a/README.md b/README.md index 9514084..c86a067 100644 --- a/README.md +++ b/README.md @@ -12,54 +12,15 @@ It sets up the following: See the following documentation topics: -- [Quick start for adding new nodes after NixOS install](doc/quick-start.md) +- [Quick start and onboarding for new administrators](doc/onboarding.md) +- [How to add new nodes to a cluster (rapid overview)](doc/adding-nodes.md) - [Architecture of this repo, how the scripts work](doc/architecture.md) - [List of TCP and UDP ports used by services](doc/ports) Additionnal documentation topics: -- [Succint guide for NixOS installation with LUKX full disk encryption](doc/nixos-install.md) (we don't do that in practice on our servers) +- [Succinct guide for NixOS installation with LUKS full disk encryption](doc/nixos-install-luks.md) (we don't do that in practice on our servers) - [Example `hardware-config.nix` for a full disk encryption scenario](doc/example-hardware-configuration.nix) +- [Why not Ansible?](doc/why-not-ansible.md) -## Why not Ansible? - -I often get asked why not use Ansible to deploy to remote machines, as this -would look like a typical use case. There are many reasons, which basically -boil down to "I really don't like Ansible": - -- Ansible tries to do declarative system configuration, but doesn't do it - correctly at all, like Nix does. Example: in NixOS, to undo something you've - done, just comment the corresponding lines and redeploy. - -- Ansible is massive overkill for what we're trying to do here, we're just - copying a few small files and running some basic commands, leaving the rest - to NixOS. - -- YAML is a pain to manipulate as soon as you have more than two or three - indentation levels. Also, why in hell would you want to write loops and - conditions in YAML when you could use a proper expression language? - -- Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of - directories and files which I don't want. 
- -- Ansible is probably not flexible enough to do what we want, at least not - without getting a migraine when trying. For example, it's inventory - management is too simple to account for the heterogeneity of our cluster - nodes while still retaining a level of organization (some configuration - options are defined cluster-wide, some are defined for each site - physical - location - we deploy on, and some are specific to each node). - -- I never remember Ansible's command line flags. - -- My distribution's package for Ansible takes almost 400MB once installed, - WTF??? By not depending on it, we're reducing the set of tools we need to - deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat, - [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries - which are, I'll admit, not small). - - -## More - -Please read README.more.md for more detailed information - diff --git a/doc/quick-start.md b/doc/adding-nodes.md similarity index 57% rename from doc/quick-start.md rename to doc/adding-nodes.md index 1307fde..24b409c 100644 --- a/doc/quick-start.md +++ b/doc/adding-nodes.md @@ -1,17 +1,5 @@ # Quick start -## How to welcome a new administrator - -See: https://guide.deuxfleurs.fr/operations/acces/pass/ - -Basically: - - The new administrator generates a GPG key and publishes it on Gitea - - All existing administrators pull their key and sign it - - An existing administrator reencrypt the keystore with this new key and push it - - The new administrator clone the repo and check that they can decrypt the secrets - - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username - - ## How to create files for a new zone *The documentation is written for the production cluster, the same apply for other clusters.* @@ -40,34 +28,3 @@ Run: - if a user changes their password (using `./passwd`), needs to be redeployed on all nodes to setup the password on all nodes - `./deploy_pki prod 
datura` - to deploy Nomad's and Consul's PKI -## How to operate a node - -Edit your `~/.ssh/config` file: - -``` -Host dahlia - HostName dahlia.machine.deuxfleurs.fr - LocalForward 14646 127.0.0.1:4646 - LocalForward 8501 127.0.0.1:8501 - LocalForward 1389 bottin.service.prod.consul:389 - LocalForward 5432 psql-proxy.service.prod.consul:5432 -``` - -Then run the TLS proxy and leave it running: - -``` -./tlsproxy prod -``` - -SSH to a production machine (e.g. dahlia) and leave it running: - -``` -ssh dahlia -``` - - -Finally you should see be able to access the production Nomad and Consul by browsing: - - - Consul: http://localhost:8500 - - Nomad: http://localhost:4646 - diff --git a/doc/architecture.md b/doc/architecture.md index 8a9579f..ee83dca 100644 --- a/doc/architecture.md +++ b/doc/architecture.md @@ -1,4 +1,4 @@ -# Additional README +# Overall architecture ## Configuring the OS @@ -15,6 +15,7 @@ All deployment scripts can use the following parameters passed as environment va - `SUDO_PASS`: optionnally, the password for `sudo` on cluster nodes. If not set, it will be asked at the begninning. - `SSH_USER`: optionnally, the user to try to login using SSH. If not set, the username from your local machine will be used. + ### Assumptions (how to setup your environment) - you have an SSH access to all of your cluster nodes (listed in `cluster//ssh_config`) @@ -25,6 +26,7 @@ All deployment scripts can use the following parameters passed as environment va - you have a clone of the secrets repository in your `pass` password store, for instance at `~/.password-store/deuxfleurs` (scripts in this repo will read and write all secrets in `pass` under `deuxfleurs/cluster//`) + ### Deploying the NixOS configuration The NixOS configuration makes use of a certain number of files: @@ -48,12 +50,9 @@ or to deploy only on a single node: To upgrade NixOS, use the `./upgrade_nixos` script instead (it has the same syntax). 
-**When adding a node to the cluster:** just do `./deploy_nixos ` ### Generating and deploying a PKI for Consul and Nomad -This is very similar to how we do for Wesher. - First, if the PKI has not yet been created, create it with: ``` @@ -66,7 +65,8 @@ Then, deploy the PKI on all nodes with: ./deploy_pki ``` -**When adding a node to the cluster:** just do `./deploy_pki ` +Note that certificates are valid for not much more than one year: every year in January, `gen_pki` and `deploy_pki` have to be re-run to generate certificates for the new year. + ### Adding administrators and password management @@ -89,6 +89,7 @@ Then, an administrator that already has root access must run the following (afte ./deploy_passwords ``` + ## Deploying stuff on Nomad ### Connecting to Nomad @@ -118,12 +119,12 @@ Stuff should be started in this order: 1. `app/core` 2. `app/frontend` 3. `app/telemetry` -4. `app/garage-staging` +4. `app/garage` 5. `app/directory` -Then, other stuff can be started in any order: +Then, other stuff can be started in any order, e.g.: -- `app/im` (cluster `staging` only) -- `app/cryptpad` (cluster `prod` only) +- `app/im` +- `app/cryptpad` - `app/drone-ci` diff --git a/doc/onboarding.md b/doc/onboarding.md new file mode 100644 index 0000000..b3bd264 --- /dev/null +++ b/doc/onboarding.md @@ -0,0 +1,45 @@ +# Onboarding / quick start for new administrators + +## How to welcome a new administrator + +See: https://guide.deuxfleurs.fr/operations/acces/pass/ + +Basically: + - The new administrator generates a GPG key and publishes it on Gitea + - All existing administrators pull their key and sign it + - An existing administrator re-encrypts the keystore with this new key and pushes it + - The new administrator clones the repo and checks that they can decrypt the secrets + - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username + + +## How to operate a node (connect to Nomad and Consul) + 
+Edit your `~/.ssh/config` file with content such as the following: + +``` +Host dahlia + HostName dahlia.machine.deuxfleurs.fr + LocalForward 14646 127.0.0.1:4646 + LocalForward 8501 127.0.0.1:8501 + LocalForward 1389 bottin.service.prod.consul:389 + LocalForward 5432 psql-proxy.service.prod.consul:5432 +``` + +Then run the TLS proxy and leave it running: + +``` +./tlsproxy prod +``` + +SSH to a production machine (e.g. dahlia) and leave it running: + +``` +ssh dahlia +``` + + +Finally, you should be able to access the production Nomad and Consul by browsing: + + - Consul: http://localhost:8500 + - Nomad: http://localhost:4646 + diff --git a/doc/why-not-ansible.md b/doc/why-not-ansible.md new file mode 100644 index 0000000..6c8be55 --- /dev/null +++ b/doc/why-not-ansible.md @@ -0,0 +1,37 @@ +# Why not Ansible? + +I often get asked why not use Ansible to deploy to remote machines, as this +would look like a typical use case. There are many reasons, which basically +boil down to "I really don't like Ansible": + +- Ansible tries to do declarative system configuration, but doesn't do it + correctly at all, like Nix does. Example: in NixOS, to undo something you've + done, just comment the corresponding lines and redeploy. + +- Ansible is massive overkill for what we're trying to do here, we're just + copying a few small files and running some basic commands, leaving the rest + to NixOS. + +- YAML is a pain to manipulate as soon as you have more than two or three + indentation levels. Also, why in hell would you want to write loops and + conditions in YAML when you could use a proper expression language? + +- Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of + directories and files which I don't want. + +- Ansible is probably not flexible enough to do what we want, at least not + without getting a migraine when trying. 
For example, its inventory + management is too simple to account for the heterogeneity of our cluster + nodes while still retaining a level of organization (some configuration + options are defined cluster-wide, some are defined for each site - physical + location - we deploy on, and some are specific to each node). + +- I never remember Ansible's command line flags. + +- My distribution's package for Ansible takes almost 400MB once installed, + WTF??? By not depending on it, we're reducing the set of tools we need to + deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat, + [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries + which are, I'll admit, not small). + +