infrastructure/ansible/README.md

# ANSIBLE

## How to proceed

For each machine, **one by one**, do the following:
  - Check that the cluster is healthy
    - `sudo gluster peer status`
    - `sudo gluster volume status all` (check the `Online` column: only `Y` must appear)
    - Check that Nomad is healthy
    - Check that Consul is healthy
    - Check that Postgres is healthy
  - Run `ansible-playbook -i production --limit <machine> site.yml`
  - Reboot
  - Check that the cluster is healthy again before moving on to the next machine

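The per-machine procedure above could be scripted roughly as follows. This is only a sketch under several assumptions: the function names `run`, `check_cluster` and `upgrade_machine` are hypothetical, the machine name is assumed to be directly usable as an SSH target, and the `nomad` and `consul` CLIs are assumed to be installed on the node. The Postgres check depends on how Postgres is deployed, so it is left out.

```sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run the health checks on one machine over SSH (hypothetical helper).
# Stops at the first command that fails; the output must still be read.
check_cluster() {
    run ssh "$1" sudo gluster peer status &&
    run ssh "$1" sudo gluster volume status all &&
    run ssh "$1" nomad server members &&
    run ssh "$1" consul members
    # Postgres: add the check that matches your deployment.
}

# Check, apply the playbook, then reboot one machine (hypothetical helper).
upgrade_machine() {
    check_cluster "$1" || return 1
    run ansible-playbook -i production --limit "$1" site.yml || return 1
    # The SSH session usually drops during a reboot, so its status is ignored.
    run ssh "$1" sudo reboot || true
}
```

With `DRY_RUN=1` set, `upgrade_machine <machine>` only prints the commands it would run. Exit statuses alone do not prove the cluster is healthy (for instance, the `Online` column still has to be read), so once the machine is back up, run `check_cluster <machine>` again and inspect the output before moving on.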
## New configuration with Wireguard

This configuration makes all cluster nodes appear in a single virtual private
network, enabling them to communicate on all ports even if they are behind
NATs at different locations. The VPN also provides a layer of security by
encrypting all communications that occur over the internet.

### Prerequisites

Each node must have two publicly accessible ports (possibly forwarded through a NAT):

- A port that maps to the SSH port (port 22) of the machine, allowing TCP connections
- A port that maps to the Wireguard port (port 51820) of the machine, allowing UDP connections


### Configuration

The network role sets up a Wireguard interface, called `wgdeuxfleurs`, and
establishes a full mesh between all cluster machines. The following
configuration variables are necessary in the node list:

- `ansible_host`: hostname to which Ansible connects, usually the same as `public_ip`
- `ansible_user`: username to connect as for Ansible to run commands through SSH
- `ansible_port`: if SSH is not bound publicly on port 22, set the port here
- `public_ip`: the public IP of the machine, or of the NAT router the machine sits behind
- `public_vpn_port`: the public port number on `public_ip` that maps to port 51820 of the machine
- `vpn_ip`: the IP address to assign to the node on the VPN (each node must have a different one)
- `dns_server`: any DNS resolver, typically your ISP's DNS or a public one such as OpenDNS
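Before running the playbook, it can help to confirm that a node defines all of these variables. The sketch below is one possible way to do it. `missing_vars` and `REQUIRED_VARS` are hypothetical names, and the check is a naive substring match on the JSON, so it can be fooled if a variable name appears as a value.

```sh
# Variables from the list above that every node needs.
# ansible_port is left out: it is only needed when SSH is not on port 22.
REQUIRED_VARS="ansible_host ansible_user public_ip public_vpn_port vpn_ip dns_server"

# Read a host's variables as JSON on stdin and print each missing name.
missing_vars() {
    json="$(cat)"
    for var in $REQUIRED_VARS; do
        case "$json" in
            *"\"$var\""*) ;;      # the quoted key appears in the JSON
            *) echo "$var" ;;     # not found: report it
        esac
    done
}
```

It is meant to be fed from `ansible-inventory`, which prints a host's variables as JSON: `ansible-inventory -i production --host <machine> | missing_vars`. An empty output means nothing is missing.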

The new iptables configuration prevents direct communication between cluster
machines, except on port 51820, which carries VPN packets. All intra-cluster
communications must now go through the VPN interface, so machines refer to one
another using their VPN IP addresses and never their public or LAN addresses.
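Once the role has been applied, a quick way to see whether the mesh works is to inspect the interface with `sudo wg show wgdeuxfleurs` and to try reaching the other nodes on their VPN addresses. The helper below is a sketch with a hypothetical name, `check_vpn_peers`. It assumes Linux `ping` options and that ICMP is allowed on the VPN interface, which this document does not state.

```sh
# Ping each VPN IP once and report whether it answered (hypothetical helper).
# -W 2 is a 2-second timeout with the Linux ping.
check_vpn_peers() {
    for ip in "$@"; do
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip UNREACHABLE"
        fi
    done
}
```

For example, `check_vpn_peers 10.68.70.12 10.68.70.13` run from the first node prints one line per peer.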

### Restarting Nomad

When switching to the Wireguard configuration, machines will stop using their
LAN addresses and switch to using their VPN addresses. Consul seems to handle
this correctly, but Nomad does not. For Nomad to restart correctly, its Raft
protocol module must be told the new IP addresses of the cluster members. To
do so, create the file `/var/lib/nomad/server/raft/peers.json` on every node,
containing the list of IP addresses of the cluster. Here is an example of such
a file:

```
["10.68.70.11:4647","10.68.70.12:4647","10.68.70.13:4647"]
```

Once this file has been created and is identical on all nodes, restart Nomad on
all nodes. The cluster should then resume normal operation.
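To avoid typing the file by hand on each node, the list can be generated from the VPN IPs. The function below is a minimal sketch. `make_peers_json` is a hypothetical name, and port 4647 is taken from the example above.

```sh
# Print a peers.json array for the given VPN IPs, using port 4647
# as in the example above (hypothetical helper).
make_peers_json() {
    sep=""
    printf '['
    for ip in "$@"; do
        printf '%s"%s:4647"' "$sep" "$ip"
        sep=","
    done
    printf ']\n'
}
```

Running `make_peers_json 10.68.70.11 10.68.70.12 10.68.70.13 | sudo tee /var/lib/nomad/server/raft/peers.json` on each node writes the same content as the example. Nomad can then be restarted, for instance with `sudo systemctl restart nomad` if it runs as a systemd unit named `nomad` (an assumption, adapt it to your setup).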

The same procedure can also be applied to fix Consul. However, in my tests
Consul did not break when the IP addresses changed (it just took a while to
come back up).