Nix system configuration for Deuxfleurs clusters

Alex Auvolat 912753c7ad remove useless lines in caribou,origan.nix		2022-12-22 23:16:15 +01:00
cluster	remove useless lines in caribou,origan.nix	2022-12-22 23:16:15 +01:00
doc	remove outdated telemetry doc	2022-12-22 18:01:46 +01:00
experimental	SSB experiment	2022-09-21 19:29:08 +02:00
nix	Replace deploy_wg by a NixOS activation script	2022-12-14 18:02:30 +01:00
secretmgr	Clone core module in staging and prod, move bad stuff to experimental	2022-08-24 15:48:18 +02:00
.gitignore	Modularize and prepare to support multiple clusters	2022-02-09 12:09:49 +01:00
deploy_nixos	Staging: Add CNAME target meta parameter, will be used for diplonat auto dns update	2022-12-07 12:32:21 +01:00
deploy_passwords	Add scripts to manage passwords	2022-04-20 15:41:54 +02:00
deploy_pki	Add origan node in staging cluster (+ refactor system.stateVersion)	2022-12-11 22:37:28 +01:00
gen_pki	Fix access to consul for non-server nodes	2022-08-24 16:58:50 +02:00
passwd	edited passwd command to set bash as interpreter	2022-11-09 19:02:02 +01:00
README.md	write about why not ansible	2022-12-14 17:52:36 +01:00
README.more.md	WIP doc	2022-10-16 11:14:50 +02:00
restic-summary	Move cryptpad backup job to backup-daily.hcl	2022-09-26 13:02:38 +02:00
ssh_known_hosts	Add origan node in staging cluster (+ refactor system.stateVersion)	2022-12-11 22:37:28 +01:00
sshtool	sshtool: quote password to fix shell interpretation	2022-12-06 23:13:32 +01:00
tlsproxy	changed shebang of tlsproxy file to bash, because trap failed with sh (trap is a builtin of bash)	2022-11-09 18:53:21 +01:00
upgrade_nixos	Staging: ability to run Nix jobs using exec2 driver	2022-11-28 22:58:39 +01:00

README.md

Deuxfleurs on NixOS!

This repository contains code to run Deuxfleurs' infrastructure on NixOS.

It sets up the following:

A WireGuard mesh between all nodes
Consul, with TLS
Nomad, with TLS

How to welcome a new administrator

See: https://guide.deuxfleurs.fr/operations/acces/pass/

Basically:

The new administrator generates a GPG key and publishes it on Gitea
All existing administrators pull the new key and sign it
An existing administrator re-encrypts the keystore with this new key and pushes it
The new administrator clones the repo and checks that they can decrypt the secrets
Finally, the new administrator must choose a password to operate over SSH with ./passwd prod rick, where rick is the target username
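
As a rough illustration of the steps above, the following sketch prints one plausible command per step without executing anything. The helper onboarding_plan, the key identifiers, the .asc file name and the use of pass init for the re-encryption are all assumptions made for this example; the linked guide remains the reference.

```shell
# Sketch only: prints a plan, runs nothing.
# Key IDs, file names and the choice of "pass init" are assumptions.
set -eu

# onboarding_plan NEW_KEY USER EXISTING_KEYS...: hypothetical helper.
onboarding_plan() {
  new_key="$1"
  user="$2"
  shift 2
  echo "new admin:  gpg --full-generate-key"
  echo "new admin:  gpg --armor --export $new_key > $user.asc"
  echo "each admin: gpg --import $user.asc && gpg --sign-key $new_key"
  # pass init re-encrypts existing entries when the list of GPG IDs changes.
  echo "one admin:  pass init $* $new_key && pass git push"
  echo "new admin:  pass show <entry>   (after cloning the repo)"
  echo "new admin:  ./passwd prod $user"
}

onboarding_plan rick@example.org rick alice@example.org bob@example.org
```

Note that pass init expects the full list of recipients, so the existing administrators' keys must be passed along with the new one.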

How to create files for a new zone

The documentation is written for the production cluster; the same applies to other clusters.

Basically:

Create your site file in the cluster/prod/site/ folder
Create your node files in the cluster/prod/node/ folder
Add your WireGuard configuration to cluster/prod/cluster.nix
- You will have to edit your NAT config manually
- To get your node's WireGuard public key, you must run ./deploy_prod prod <node>; see the next section for more information
Add your nodes to cluster/prod/ssh_config; it is used by the various SSH scripts.
- If you use ssh directly, use ssh -F ./cluster/prod/ssh_config
- For the first connection, add User root, as your user will not be declared on the system yet
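
For a one-off first connection, an alternative to editing the file is to override the user on the command line, since ssh gives -o options precedence over the configuration file. The helper first_login_cmd below is hypothetical and only prints the command; datura is used as an example node name.

```shell
# Sketch: build the ssh command for a first login as root.
set -eu

# first_login_cmd CLUSTER NODE: hypothetical helper, not part of the repository.
first_login_cmd() {
  echo "ssh -F ./cluster/$1/ssh_config -o User=root $2"
}

first_login_cmd prod datura
# prints: ssh -F ./cluster/prod/ssh_config -o User=root datura
```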

How to deploy a Nix configuration on a fresh node

We assume that the node name is datura. Start by deploying one node at a time; you will have plenty of time in your operator's life to break everything through automation.

Run:

./deploy_wg prod datura - to generate WireGuard's keys
./deploy_nixos prod datura - to deploy the Nix configuration files
- must be redeployed on all nodes, as the new WireGuard configuration is needed everywhere
./deploy_passwords prod datura - to deploy users' passwords
- must be redeployed on all nodes to set up the password on all of them
./deploy_pki prod datura - to deploy Nomad's and Consul's PKI
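
The sequence above can be chained in a small wrapper. This is a minimal sketch, not a script from the repository: run and deploy_node are hypothetical helpers, the password script is spelled deploy_passwords as in the repository file listing, and DRY_RUN defaults to 1 so that the commands are only printed.

```shell
# Sketch only: chains the four per-node steps in order.
# DRY_RUN defaults to 1, so nothing is executed unless you opt in.
set -eu

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy_node CLUSTER NODE: hypothetical helper, not part of the repository.
deploy_node() {
  cluster="$1"
  node="$2"
  run ./deploy_wg "$cluster" "$node"         # WireGuard keys
  run ./deploy_nixos "$cluster" "$node"      # Nix configuration files
  run ./deploy_passwords "$cluster" "$node"  # user passwords
  run ./deploy_pki "$cluster" "$node"        # Nomad and Consul PKI
}

deploy_node prod datura
```

Running it with DRY_RUN=0 from the repository root would execute the steps for real; because of set -e, the sequence stops at the first step that fails. It deliberately handles a single node, in line with the advice to deploy one node at a time.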

How to operate a node

Edit your ~/.ssh/config file:

Host dahlia
  HostName dahlia.machine.deuxfleurs.fr
  LocalForward 14646 127.0.0.1:4646
  LocalForward 8501 127.0.0.1:8501
  LocalForward 1389 bottin.service.prod.consul:389
  LocalForward 5432 psql-proxy.service.prod.consul:5432

Then run the TLS proxy and leave it running:

./tlsproxy prod

SSH to a production machine (e.g. dahlia) and leave it running:

ssh dahlia

Finally, you should be able to access the production Nomad and Consul by browsing:

Consul: http://localhost:8500
Nomad: http://localhost:4646
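
To check the tunnel from a terminal rather than a browser, one option is to query the leader-status endpoint that both the Consul and Nomad HTTP APIs expose. This is a sketch under the assumption that these endpoints are reachable without a token on this cluster; leader_url and check_leader are hypothetical helpers.

```shell
# Sketch: query the leader-status endpoint of each service through the tunnel.
set -eu

# Local URLs as listed above.
CONSUL_URL="http://localhost:8500"
NOMAD_URL="http://localhost:4646"

# leader_url BASE: Consul and Nomad both serve /v1/status/leader.
leader_url() {
  echo "$1/v1/status/leader"
}

# check_leader NAME BASE: print the reported leader, or a hint on failure.
check_leader() {
  if leader="$(curl --silent --fail --max-time 5 "$(leader_url "$2")")"; then
    echo "$1 leader: $leader"
  else
    echo "$1 unreachable: are ./tlsproxy and the ssh session still running?"
    return 1
  fi
}
```

Typical usage is check_leader Consul "$CONSUL_URL" followed by check_leader Nomad "$NOMAD_URL", with ./tlsproxy prod and the ssh session both running.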

Why not Ansible?

I often get asked why not use Ansible to deploy to remote machines, as this would look like a typical use case. There are many reasons, which basically boil down to "I really don't like Ansible":

Ansible tries to do declarative system configuration but, unlike Nix, doesn't do it correctly at all. Example: in NixOS, to undo something you've done, just comment out the corresponding lines and redeploy.
Ansible is massive overkill for what we're trying to do here: we're just copying a few small files and running some basic commands, leaving the rest to NixOS.
YAML is a pain to manipulate as soon as you have more than two or three indentation levels. Also, why in hell would you want to write loops and conditions in YAML when you could use a proper expression language?
Ansible's vocabulary is not ours, and it imposes a rigid hierarchy of directories and files which I don't want.
Ansible is probably not flexible enough to do what we want, at least not without getting a migraine when trying. For example, its inventory management is too simple to account for the heterogeneity of our cluster nodes while still retaining a level of organization (some configuration options are defined cluster-wide, some are defined for each site - physical location - we deploy on, and some are specific to each node).
I never remember Ansible's command line flags.
My distribution's package for Ansible takes almost 400MB once installed, WTF??? By not depending on it, we're reducing the set of tools we need to deploy to a bare minimum: Git, OpenSSH, OpenSSL, socat, pass (and the Consul and Nomad binaries which are, I'll admit, not small).

Please read README.more.md for more detailed information.