
Alex Auvolat 24118ab426 Make things work on cluster devx.adnab.me		2020-07-15 16:06:08 +02:00
group_vars/all	Update local scripts	2020-06-16 16:28:24 +02:00
roles	Make things work on cluster devx.adnab.me	2020-07-15 16:06:08 +02:00
cluster_nodes.yml	Clean nomad+consul deploy tasks as we do not deploy anymore on ARM so it is untested for real	2020-07-05 20:12:51 +02:00
lxvm	Make things work on cluster devx.adnab.me	2020-07-15 16:06:08 +02:00
production	Add my own modifications	2020-07-05 19:49:32 +02:00
README.md	Document Wireguard config	2020-07-15 16:03:42 +02:00
README.more.md	Add a readme	2020-07-05 19:52:31 +02:00
site.yml	Initial commit	2019-07-11 09:33:07 +02:00

README.md

ANSIBLE

How to proceed

For each machine, one by one do:

- Check that the cluster is healthy:
  - sudo gluster peer status
  - sudo gluster volume status all (check the Online column; only Y must appear)
  - Check that Nomad is healthy
  - Check that Consul is healthy
  - Check that Postgres is healthy
- Run ansible-playbook -i production --limit <machine> site.yml
- Reboot
- Check that the cluster is healthy again
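
The one-by-one procedure above could be scripted along the following lines. This is only a sketch and not part of this repository: run_playbook, reboot and cluster_is_healthy are hypothetical callbacks that the operator would have to provide (for instance by wrapping subprocess calls and the manual checks listed above). Only the ansible-playbook command line is taken from this README.

```python
from typing import Callable, Iterable, List


def playbook_command(machine: str) -> List[str]:
    """Build the ansible-playbook invocation given in this README for one machine."""
    return ["ansible-playbook", "-i", "production", "--limit", machine, "site.yml"]


def rolling_update(
    machines: Iterable[str],
    run_playbook: Callable[[List[str]], None],  # hypothetical: executes the command
    reboot: Callable[[str], None],              # hypothetical: reboots and waits for the node
    cluster_is_healthy: Callable[[], bool],     # hypothetical: gluster/Nomad/Consul/Postgres checks
) -> List[str]:
    """Update machines one by one; stop as soon as a health check fails.

    Returns the list of machines that were actually updated.
    """
    updated: List[str] = []
    for machine in machines:
        # Never touch a node while the cluster is degraded.
        if not cluster_is_healthy():
            break
        run_playbook(playbook_command(machine))
        reboot(machine)
        updated.append(machine)
        # Check again after the reboot before moving on to the next node.
        if not cluster_is_healthy():
            break
    return updated
```

Passing the checks in as callbacks keeps the loop itself free of assumptions about how health is measured, so the manual checks can be replaced step by step.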

New configuration with Wireguard

This configuration makes all of the cluster nodes appear in a single virtual private network, enabling them to communicate on all ports even if they are behind NATs at different locations. The VPN also provides a layer of security by encrypting all communications that occur over the internet.

Prerequisites

Nodes must all have two publicly accessible ports (potentially routed through a NAT):

A port that maps to the SSH port (port 22) of the machine, allowing TCP connections
A port that maps to the Wireguard port (port 51820) of the machine, allowing UDP connections

Configuration

The network role sets up a Wireguard interface, called wgdeuxfleurs, and establishes a full mesh between all cluster machines. The following configuration variables are necessary in the node list:

ansible_host: hostname to which Ansible connects, usually the same as public_ip
ansible_user: username to connect as for Ansible to run commands through SSH
ansible_port: if SSH is not bound publicly on port 22, set the port here
public_ip: the public IP of the machine, or of the NAT router that the machine is behind
public_vpn_port: the public port number on public_ip that maps to port 51820 of the machine
vpn_ip: the IP address to assign to the node on the VPN (each node must have a different one)
dns_server: any DNS resolver, typically your ISP's DNS or a public one such as OpenDNS
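
To make the variables above concrete, here is a minimal sketch that checks a node list for missing variables and duplicate vpn_ip values, and lists the Wireguard endpoints a node would reach in a full mesh. The node names, users, ports and public addresses below are made-up examples (the public IPs come from documentation ranges), and the functions are hypothetical helpers, not code from the network role.

```python
from typing import Dict, List

# Variables listed above as necessary; ansible_port is only needed when SSH
# is not publicly bound on port 22, so it is treated as optional here.
REQUIRED_VARS = (
    "ansible_host", "ansible_user", "public_ip",
    "public_vpn_port", "vpn_ip", "dns_server",
)

# Hypothetical node list, for illustration only.
EXAMPLE_NODES: Dict[str, dict] = {
    "node1": {
        "ansible_host": "203.0.113.10", "ansible_user": "admin",
        "public_ip": "203.0.113.10", "public_vpn_port": 51820,
        "vpn_ip": "10.68.70.11", "dns_server": "208.67.222.222",
    },
    "node2": {
        "ansible_host": "198.51.100.20", "ansible_user": "admin",
        "ansible_port": 2222,  # SSH is reached through a NAT on another port
        "public_ip": "198.51.100.20", "public_vpn_port": 51821,
        "vpn_ip": "10.68.70.12", "dns_server": "208.67.222.222",
    },
}


def validate_nodes(nodes: Dict[str, dict]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    errors: List[str] = []
    seen_vpn_ips: Dict[str, str] = {}
    for name, node_vars in nodes.items():
        for key in REQUIRED_VARS:
            if key not in node_vars:
                errors.append(f"{name}: missing {key}")
        vpn_ip = node_vars.get("vpn_ip")
        if vpn_ip is None:
            continue
        if vpn_ip in seen_vpn_ips:
            errors.append(f"{name}: vpn_ip {vpn_ip} already used by {seen_vpn_ips[vpn_ip]}")
        else:
            seen_vpn_ips[vpn_ip] = name
    return errors


def peer_endpoints(nodes: Dict[str, dict], name: str) -> List[str]:
    """In a full mesh, a node peers with every other node's public endpoint."""
    return [
        f"{v['public_ip']}:{v['public_vpn_port']}"
        for other, v in nodes.items()
        if other != name
    ]
```

For example, peer_endpoints(EXAMPLE_NODES, "node1") returns ["198.51.100.20:51821"], the public address and port behind which node2's Wireguard port is reachable.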

The new iptables configuration now prevents direct communication between cluster machines, except on port 51820 which is used to transmit VPN packets. All intra-cluster communications must now go through the VPN interface (thus machines refer to one another using their VPN IP addresses and never their public or LAN addresses).

Restarting Nomad

When switching to the Wireguard configuration, machines will stop using their LAN addresses and switch to using their VPN addresses. Consul seems to handle this correctly, but Nomad does not. For Nomad to restart correctly, its Raft protocol module must be informed of the new IP addresses of the cluster members. This is done by creating, on all nodes, the file /var/lib/nomad/server/raft/peers.json, which contains the list of addresses (VPN IP and port) of the cluster members. Here is an example of such a file:

["10.68.70.11:4647","10.68.70.12:4647","10.68.70.13:4647"]

Once this file has been created and is identical on all nodes, restart Nomad on all nodes. The cluster should then resume normal operation.
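
Since the file must be identical on every node, it can help to generate it from the list of VPN addresses rather than typing it by hand. The following is a small sketch under that assumption: render_peers_json and write_peers_file are hypothetical helper names, and port 4647 and the file path are simply the ones used in the example above.

```python
import json
from pathlib import Path
from typing import Iterable

# Port and path taken from the example in this README.
NOMAD_PORT = 4647
PEERS_FILE = Path("/var/lib/nomad/server/raft/peers.json")


def render_peers_json(vpn_ips: Iterable[str], port: int = NOMAD_PORT) -> str:
    """Render the peers.json content: a JSON array of "ip:port" strings."""
    peers = [f"{ip}:{port}" for ip in vpn_ips]
    # Compact separators reproduce the formatting of the example above.
    return json.dumps(peers, separators=(",", ":"))


def write_peers_file(vpn_ips: Iterable[str], path: Path = PEERS_FILE) -> None:
    """Write the file; run this on every node (root permissions are likely needed)."""
    path.write_text(render_peers_json(vpn_ips) + "\n")


if __name__ == "__main__":
    # Prints: ["10.68.70.11:4647","10.68.70.12:4647","10.68.70.13:4647"]
    print(render_peers_json(["10.68.70.11", "10.68.70.12", "10.68.70.13"]))
```

Printing the content first and comparing it across nodes before writing anything is a cheap way to make sure all nodes end up with the same file.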

The same procedure can also be applied to fix Consul; however, my tests showed that it did not break when IP addresses changed (it just took a bit longer than usual to come back up).