ANSIBLE

How to proceed

For each machine, one by one, do the following:

  • Check that cluster is healthy
    • sudo gluster peer status
    • sudo gluster volume status all (check the Online column: every entry must show Y)
    • Check that Nomad is healthy
    • Check that Consul is healthy
    • Check that Postgres is healthy
  • Run ansible-playbook -i production --limit <machine> site.yml
  • Reboot
  • Check that cluster is healthy

New configuration with Wireguard

This configuration makes all of the cluster nodes appear in a single virtual private network and enables them to communicate on all ports, even if they are behind NATs at different locations. The VPN also provides a layer of security by encrypting all communications that occur over the internet.

Prerequisites

Nodes must all have two publicly accessible ports (potentially routed through a NAT):

  • A port that maps to the SSH port (port 22) of the machine, allowing TCP connections
  • A port that maps to the Wireguard port (port 51820) of the machine, allowing UDP connections

Configuration

The network role sets up a Wireguard interface, called wgdeuxfleurs, and establishes a full mesh between all cluster machines. The following configuration variables are required in the node list (an example inventory entry is given after the list):

  • ansible_host: the hostname Ansible connects to, usually the same as public_ip
  • ansible_user: the username Ansible uses when connecting over SSH to run commands
  • ansible_port: if SSH is not bound publicly on port 22, set the port here
  • public_ip: the public IP address of the machine, or of the NAT router behind which the machine sits
  • public_vpn_port: the public port number on public_ip that maps to port 51820 of the machine
  • vpn_ip: the IP address to assign to the node on the VPN (each node must have a different one)
  • dns_server: any DNS resolver, typically your ISP's DNS or a public one such as OpenDNS
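
For illustration, here is what a single node's entry could look like in a YAML inventory. The host name and all values below are hypothetical (the vpn_ip simply reuses the 10.68.70.0/24 range from the peers.json example further down), and the actual production inventory may be laid out differently:

  all:
    hosts:
      node1:                          # hypothetical node name
        ansible_host: 203.0.113.10    # example/documentation IP, usually the same as public_ip
        ansible_user: deploy          # assumed remote user
        ansible_port: 2222            # only needed if SSH is not exposed on port 22
        public_ip: 203.0.113.10
        public_vpn_port: 51820        # external port forwarded to UDP 51820 on the machine
        vpn_ip: 10.68.70.11           # must be unique for each node
        dns_server: 208.67.222.222    # OpenDNS, as an example of a public resolver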

The new iptables configuration blocks direct communication between cluster machines on all ports except 51820, which carries the VPN packets. All intra-cluster communication must now go through the VPN interface, so machines refer to one another by their VPN IP addresses and never by their public or LAN addresses.
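
The actual rules live in the network role; as a rough illustration of the policy described above, here is a minimal sketch of what the corresponding Ansible tasks could look like. The rule set and the cluster_nodes group name are assumptions, not the role's real content:

  - name: Accept Wireguard packets (UDP 51820)
    ansible.builtin.iptables:
      chain: INPUT
      protocol: udp
      destination_port: "51820"
      jump: ACCEPT

  - name: Accept everything arriving on the VPN interface
    ansible.builtin.iptables:
      chain: INPUT
      in_interface: wgdeuxfleurs
      jump: ACCEPT

  - name: Drop direct (non-VPN) traffic from the other cluster machines
    ansible.builtin.iptables:
      chain: INPUT
      source: "{{ hostvars[item].public_ip }}"
      jump: DROP
    loop: "{{ groups['cluster_nodes'] | difference([inventory_hostname]) }}"   # group name assumed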

Restarting Nomad

When switching to the Wireguard configuration, machines stop using their LAN addresses and switch to their VPN addresses. Consul seems to handle this correctly; Nomad, however, does not. For Nomad to restart correctly, its Raft protocol module must be informed of the new IP addresses of the cluster members. This is done by creating, on all nodes, the file /var/lib/nomad/server/raft/peers.json containing the list of addresses of the cluster members. Here is an example of such a file:

["10.68.70.11:4647","10.68.70.12:4647","10.68.70.13:4647"]

Once this file is created and is the same on all nodes, restart Nomad on all nodes. The cluster should resume operation normally.
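
Rather than writing the file by hand on every machine, the same content can be generated from the inventory. The sketch below assumes that the nodes are in a group called cluster_nodes and that Nomad runs as a systemd service named nomad; adjust to the real setup:

  - hosts: cluster_nodes                # group name assumed
    become: true
    tasks:
      - name: Write the same peers.json everywhere, built from the inventory's vpn_ip values
        ansible.builtin.copy:
          dest: /var/lib/nomad/server/raft/peers.json
          content: >-
            {{ groups['cluster_nodes']
               | map('extract', hostvars, 'vpn_ip')
               | map('regex_replace', '$', ':4647')
               | list | to_json }}

      - name: Restart Nomad so that it picks up the new Raft peer list
        ansible.builtin.systemd:
          name: nomad                   # service name assumed
          state: restarted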

The same procedure can also be applied to fix Consul; however, in my tests Consul did not break when the IP addresses changed (it just took a little while to come back up).