More doc reorganization

2022-12-22 23:44:00 +01:00 · 0e1574a82b
commit 0e1574a82b
parent 3e5e2d60cd
5 changed files with 96 additions and 95 deletions
--- a/README.md
+++ b/README.md
@@ -12,54 +12,15 @@ It sets up the following:
 See the following documentation topics:
- [Quick start for adding new nodes after NixOS install](doc/quick-start.md)
+- [Quick start and onboarding for new administrators](doc/onboarding.md)
 - [How to add new nodes to a cluster (rapid overview)](doc/adding-nodes.md)
 - [Architecture of this repo, how the scripts work](doc/architecture.md)
 - [List of TCP and UDP ports used by services](doc/ports)
 Additional documentation topics:
- [Succinct guide for NixOS installation with LUKS full disk encryption](doc/nixos-install.md) (we don't do that in practice on our servers)
+- [Succinct guide for NixOS installation with LUKS full disk encryption](doc/nixos-install-luks.md) (we don't do that in practice on our servers)
 - [Example `hardware-config.nix` for a full disk encryption scenario](doc/example-hardware-configuration.nix)
 - [Why not Ansible?](doc/why-not-ansible.md)
 ## Why not Ansible?
 I often get asked why not use Ansible to deploy to remote machines, as this
 would look like a typical use case.  There are many reasons, which basically
 boil down to "I really don't like Ansible":
 - Ansible tries to do declarative system configuration, but doesn't do it
  correctly at all, unlike Nix.  Example: in NixOS, to undo something you've
  done, just comment out the corresponding lines and redeploy.
 - Ansible is massive overkill for what we're trying to do here: we're just
  copying a few small files and running some basic commands, leaving the rest
  to NixOS.
 - YAML is a pain to manipulate as soon as you have more than two or three
  indentation levels.  Also, why in hell would you want to write loops and
  conditions in YAML when you could use a proper expression language?
 - Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of
  directories and files which I don't want.
 - Ansible is probably not flexible enough to do what we want, at least not
  without getting a migraine when trying. For example, its inventory
  management is too simple to account for the heterogeneity of our cluster
  nodes while still retaining a level of organization (some configuration
  options are defined cluster-wide, some are defined for each site - physical
  location - we deploy on, and some are specific to each node).
 - I never remember Ansible's command line flags.
 - My distribution's package for Ansible takes almost 400MB once installed,
  WTF???  By not depending on it, we're reducing the set of tools we need to
  deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat,
  [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries
  which are, I'll admit, not small).
 ## More
 Please read README.more.md for more detailed information
--- a/doc/adding-nodes.md
+++ b/doc/adding-nodes.md
@@ -1,17 +1,5 @@
 # Quick start
 ## How to welcome a new administrator
 See: https://guide.deuxfleurs.fr/operations/acces/pass/
 Basically:
  - The new administrator generates a GPG key and publishes it on Gitea
  - All existing administrators pull their key and sign it
  - An existing administrator re-encrypts the keystore with this new key and pushes it
  - The new administrator clones the repo and checks that they can decrypt the secrets
  - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username
 ## How to create files for a new zone
 *The documentation is written for the production cluster; the same applies to other clusters.*
@@ -40,34 +28,3 @@ Run:
   - if a user changes their password (using `./passwd`), needs to be redeployed on all nodes to set up the new password
  - `./deploy_pki prod datura` - to deploy Nomad's and Consul's PKI
 ## How to operate a node
 Edit your `~/.ssh/config` file:
 ```
 Host dahlia
  HostName dahlia.machine.deuxfleurs.fr
  LocalForward 14646 127.0.0.1:4646
  LocalForward 8501 127.0.0.1:8501
  LocalForward 1389 bottin.service.prod.consul:389
  LocalForward 5432 psql-proxy.service.prod.consul:5432
 ```
 Then run the TLS proxy and leave it running:
 ```
 ./tlsproxy prod
 ```
 SSH to a production machine (e.g. dahlia) and leave it running:
 ```
 ssh dahlia
 ```
 Finally, you should be able to access the production Nomad and Consul by browsing:
 - Consul: http://localhost:8500
 - Nomad: http://localhost:4646
--- a/doc/architecture.md
+++ b/doc/architecture.md
@@ -1,4 +1,4 @@
-# Additional README
+# Overall architecture
 ## Configuring the OS
@@ -15,6 +15,7 @@ All deployment scripts can use the following parameters passed as environment va
 - `SUDO_PASS`: optionally, the password for `sudo` on cluster nodes. If not set, it will be asked for at the beginning.
 - `SSH_USER`: optionally, the user to log in as over SSH. If not set, the username from your local machine will be used.
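
The `SSH_USER` fallback described above can be sketched in shell as follows. This is only an illustration: `resolve_ssh_user` is a hypothetical helper name, and the actual deployment scripts may implement the default differently.

```shell
# Hypothetical helper (not taken from the repo): prints the SSH user to use.
# Falls back to the local username when SSH_USER is unset or empty.
resolve_ssh_user() {
  printf '%s\n' "${SSH_USER:-$(id -un)}"
}
```

With this sketch, `SSH_USER=alice resolve_ssh_user` prints `alice`, while calling it with `SSH_USER` unset prints your local username.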
 ### Assumptions (how to set up your environment)
 - you have SSH access to all of your cluster nodes (listed in `cluster/<cluster_name>/ssh_config`)
@@ -25,6 +26,7 @@ All deployment scripts can use the following parameters passed as environment va
 - you have a clone of the secrets repository in your `pass` password store, for instance at `~/.password-store/deuxfleurs`
  (scripts in this repo will read and write all secrets in `pass` under `deuxfleurs/cluster/<cluster_name>/`)
 ### Deploying the NixOS configuration
 The NixOS configuration makes use of a certain number of files:
@@ -48,12 +50,9 @@ or to deploy only on a single node:
 To upgrade NixOS, use the `./upgrade_nixos` script instead (it has the same syntax).
 **When adding a node to the cluster:** just do `./deploy_nixos <cluster_name> <name_of_new_node>`
 ### Generating and deploying a PKI for Consul and Nomad
 This is very similar to what we do for Wesher.
 First, if the PKI has not yet been created, create it with:
 ```
@@ -66,7 +65,8 @@ Then, deploy the PKI on all nodes with:
 ./deploy_pki <cluster_name>
 ```
-**When adding a node to the cluster:** just do `./deploy_pki <cluster_name> <name_of_new_node>`
+Note that certificates are valid for not much more than one year: every year in January, `gen_pki` and `deploy_pki` have to be re-run to generate certificates for the new year.
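
To avoid being surprised by the yearly expiry mentioned above, a certificate's remaining validity can be checked with OpenSSL's `-checkend` option. The sketch below is an assumption-laden illustration: `cert_expires_within` is a hypothetical helper, and the certificate path is left to the caller because the location of the generated PKI files is not stated here.

```shell
# Hypothetical helper (not part of the repo's scripts).
# Usage: cert_expires_within <certificate.pem> <days>
# Returns 0 if the certificate expires within <days> days (or has already
# expired), 1 if it is still valid after that window, 2 if the file cannot
# be read. A file that OpenSSL cannot parse is also reported as expiring.
cert_expires_within() {
  cert_file="$1"
  days="$2"
  [ -r "$cert_file" ] || return 2
  # -checkend N exits 0 only if the certificate is still valid N seconds from now.
  if openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

For example, `cert_expires_within path/to/cert.pem 30 && echo "time to re-run gen_pki and deploy_pki"` prints the reminder when less than 30 days of validity remain (`path/to/cert.pem` is a placeholder).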
 ### Adding administrators and password management
@@ -89,6 +89,7 @@ Then, an administrator that already has root access must run the following (afte
 ./deploy_passwords <cluster_name>
 ```
 ## Deploying stuff on Nomad
 ### Connecting to Nomad
@@ -118,12 +119,12 @@ Stuff should be started in this order:
 1. `app/core`
 2. `app/frontend`
 3. `app/telemetry`
-4. `app/garage-staging`
+4. `app/garage`
 5. `app/directory`
-Then, other stuff can be started in any order:
+Then, other stuff can be started in any order, e.g.:
- `app/im` (cluster `staging` only)
+- `app/im`
- `app/cryptpad` (cluster `prod` only)
+- `app/cryptpad`
 - `app/drone-ci`
--- a/doc/onboarding.md
+++ b/doc/onboarding.md
@@ -0,0 +1,45 @@
 # Onboarding / quick start for new administrators
 ## How to welcome a new administrator
 See: https://guide.deuxfleurs.fr/operations/acces/pass/
 Basically:
  - The new administrator generates a GPG key and publishes it on Gitea
  - All existing administrators pull their key and sign it
  - An existing administrator re-encrypts the keystore with this new key and pushes it
  - The new administrator clones the repo and checks that they can decrypt the secrets
  - Finally, the new administrator must choose a password to operate over SSH with `./passwd prod rick` where `rick` is the target username
 ## How to operate a node (connect to Nomad and Consul)
 Edit your `~/.ssh/config` file with content such as the following:
 ```
 Host dahlia
  HostName dahlia.machine.deuxfleurs.fr
  LocalForward 14646 127.0.0.1:4646
  LocalForward 8501 127.0.0.1:8501
  LocalForward 1389 bottin.service.prod.consul:389
  LocalForward 5432 psql-proxy.service.prod.consul:5432
 ```
 Then run the TLS proxy and leave it running:
 ```
 ./tlsproxy prod
 ```
 SSH to a production machine (e.g. dahlia) and leave it running:
 ```
 ssh dahlia
 ```
 Finally, you should be able to access the production Nomad and Consul by browsing:
 - Consul: http://localhost:8500
 - Nomad: http://localhost:4646
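
Once the TLS proxy and the SSH session are both running, the two local URLs above can also be probed from a terminal. The following is a minimal sketch, assuming `curl` is installed. `check_endpoint` is a hypothetical helper, and `/v1/status/leader` is the leader-status route of the Consul and Nomad HTTP APIs, which may answer differently depending on how the cluster is configured.

```shell
# Hypothetical helper (not part of the repo's scripts).
# Usage: check_endpoint <name> <url>
# Prints "<name>: ok" if the URL answers with a successful HTTP status,
# and "<name>: unreachable" (returning 1) otherwise.
check_endpoint() {
  name="$1"
  url="$2"
  if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
    echo "$name: ok"
  else
    echo "$name: unreachable"
    return 1
  fi
}
```

For instance, `check_endpoint Consul http://localhost:8500/v1/status/leader` and `check_endpoint Nomad http://localhost:4646/v1/status/leader` probe the same addresses as the browser links.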
--- a/doc/why-not-ansible.md
+++ b/doc/why-not-ansible.md
@@ -0,0 +1,37 @@
 # Why not Ansible?
 I often get asked why not use Ansible to deploy to remote machines, as this
 would look like a typical use case.  There are many reasons, which basically
 boil down to "I really don't like Ansible":
 - Ansible tries to do declarative system configuration, but doesn't do it
  correctly at all, unlike Nix.  Example: in NixOS, to undo something you've
  done, just comment out the corresponding lines and redeploy.
 - Ansible is massive overkill for what we're trying to do here: we're just
  copying a few small files and running some basic commands, leaving the rest
  to NixOS.
 - YAML is a pain to manipulate as soon as you have more than two or three
  indentation levels.  Also, why in hell would you want to write loops and
  conditions in YAML when you could use a proper expression language?
 - Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of
  directories and files which I don't want.
 - Ansible is probably not flexible enough to do what we want, at least not
  without getting a migraine when trying. For example, its inventory
  management is too simple to account for the heterogeneity of our cluster
  nodes while still retaining a level of organization (some configuration
  options are defined cluster-wide, some are defined for each site - physical
  location - we deploy on, and some are specific to each node).
 - I never remember Ansible's command line flags.
 - My distribution's package for Ansible takes almost 400MB once installed,
  WTF???  By not depending on it, we're reducing the set of tools we need to
  deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat,
  [pass](https://www.passwordstore.org/) (and the Consul and Nomad binaries
  which are, I'll admit, not small).