Remove all files from op_guide, now migrated to guide.deuxfleurs.fr

Alex 2022-12-22 17:46:19 +01:00
parent 015c372532
commit b575b2b486
No known key found for this signature in database
GPG key ID: 09EC5284AA804D3C
18 changed files with 3 additions and 1179 deletions

op_guide/README.md Normal file
@ -0,0 +1,3 @@
All documents from our operations guide have been moved to <https://guide.deuxfleurs.fr/operations/> (or to the Git repository `guide.deuxfleurs.fr`).
Tous les documents de notre guide des opérations ont été déplacés sur <https://guide.deuxfleurs.fr/operations/> (ou dans le dépôt Git `guide.deuxfleurs.fr`).

@ -1 +0,0 @@
Not very generic currently, check the backup.sh script

@ -1,65 +0,0 @@
#!/bin/bash
# Back up every Garage bucket (except backup and cache buckets) into a btrfs
# subvolume, then take a dated snapshot of it.
# Usage: backup.sh <access key id> <secret access key>
cd "$(dirname "$0")" || exit 1
if [ "$(hostname)" != "io" ]; then
    echo "Please run this script on io"
    exit 1
fi
if [ ! -d "buckets" ]; then
    btrfs subvolume create "$(pwd)/buckets"
fi
AK=$1
SK=$2
function gctl {
    docker exec garage /garage "$@"
}
gctl status
BUCKETS=$(gctl bucket list | tail -n +2 | cut -d " " -f 3 | cut -d "," -f 1)
# $BUCKETS is intentionally unquoted: one word per bucket name
for BUCKET in $BUCKETS; do
    case $BUCKET in
        *backup*)
            echo "Skipping $BUCKET (not doing backup of backup)"
            ;;
        *cache*)
            echo "Skipping $BUCKET (not doing backup of cache)"
            ;;
        *)
            echo "Backing up $BUCKET"
            if [ ! -d "$(pwd)/buckets/$BUCKET" ]; then
                mkdir "$(pwd)/buckets/$BUCKET"
            fi
            gctl bucket allow --key "$AK" --read "$BUCKET"
            rclone sync --s3-endpoint http://localhost:3900 \
                --s3-access-key-id "$AK" \
                --s3-secret-access-key "$SK" \
                --s3-region garage \
                --s3-force-path-style \
                --transfers 32 \
                --fast-list \
                --stats-one-line \
                --stats 10s \
                --stats-log-level NOTICE \
                ":s3:$BUCKET" "$(pwd)/buckets/$BUCKET"
            ;;
    esac
done
# Remove duplicates
#duperemove -dAr $(pwd)/buckets
if [ ! -d "$(pwd)/snapshots" ]; then
    mkdir snapshots
fi
SNAPSHOT=$(pwd)/snapshots/buckets-$(date +%F)
echo "Making snapshot: $SNAPSHOT"
btrfs subvolume snapshot "$(pwd)/buckets" "$SNAPSHOT"

@ -1,60 +0,0 @@
# How to set up Nextcloud
## First setup
It's complicated.
First, create a service user `nextcloud` and a database `nextcloud` it owns. Also create a Garage access key and bucket `nextcloud` it is allowed to use.
Fill in the following Consul keys with actual values:
```
secrets/nextcloud/db_user
secrets/nextcloud/db_pass
secrets/nextcloud/garage_access_key
secrets/nextcloud/garage_secret_key
```
Create the following Consul keys with empty values:
```
secrets/nextcloud/instance_id
secrets/nextcloud/password_salt
secrets/nextcloud/secret
```
Start the nextcloud.hcl nomad service. Enter the container and call `occ maintenance:install` with the correct database parameters as user `www-data`.
A possibility: call the admin user `nextcloud` and give it the same password as the `nextcloud` service user.
Cat the newly generated `config.php` file and copy the instance id, password salt, and secret from there to Consul
(they were generated by the install script and we want to keep them).
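The copy step can be scripted. Here is a minimal sketch, assuming `config.php` uses Nextcloud's usual `'key' => 'value',` layout with the keys `instanceid`, `passwordsalt` and `secret`. The helper name `nc_config_value` and the `config/config.php` path are made up for illustration, and the `consul` CLI must be reachable from wherever you run it.

```bash
# Hypothetical helper: print the value of a single-quoted string entry
# ('key' => 'value',) found in a Nextcloud config.php.
# $1 = key name, $2 = path to config.php
nc_config_value() {
    sed -n "s/^[[:space:]]*'$1'[[:space:]]*=>[[:space:]]*'\(.*\)',[[:space:]]*\$/\1/p" "$2"
}

# Usage (the config path is an assumption, adapt it to your container):
# consul kv put secrets/nextcloud/instance_id   "$(nc_config_value instanceid   config/config.php)"
# consul kv put secrets/nextcloud/password_salt "$(nc_config_value passwordsalt config/config.php)"
# consul kv put secrets/nextcloud/secret        "$(nc_config_value secret       config/config.php)"
```

Values that themselves contain a single quote are not handled by this sketch; copy those by hand.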
Restart the Nextcloud Nomad service.
You should now be able to log in to Nextcloud using the admin user (`nextcloud` if you called it that).
Go to the apps settings and enable desired apps.
## Configure LDAP login
LDAP login has to be configured from the admin interface. First, enable the LDAP authentication application.
Go to settings > LDAP/AD integration. Enter the following parameters:
- ldap server: `bottin2.service.2.cluster.deuxfleurs.fr`
- bind user: `cn=nextcloud,ou=services,ou=users,dc=deuxfleurs,dc=fr`
- bind password: password of the nextcloud service user
- base DN for users: `ou=users,dc=deuxfleurs,dc=fr`
- check "manually enter LDAP filters"
- in the users tab, edit LDAP query and set it to `(&(|(objectclass=inetOrgPerson))(|(memberof=cn=nextcloud,ou=groups,dc=deuxfleurs,dc=fr)))`
- in the login attributes tab, edit LDAP query and set it to `(&(&(|(objectclass=inetOrgPerson))(|(memberof=cn=nextcloud,ou=groups,dc=deuxfleurs,dc=fr)))(|(|(mailPrimaryAddress=%uid)(mail=%uid))(|(cn=%uid))))`
- in the groups tab, edit the LDAP query and set it to `(|(objectclass=groupOfNames))`
- in the advanced tab, enter the "directory setting" section and check/modify the following:
  - user display name field: `displayname`
  - base user tree: `ou=users,dc=deuxfleurs,dc=fr`
  - user search attribute: `cn`
  - group display name field: `displayname`
  - **base group tree**: `ou=groups,dc=deuxfleurs,dc=fr`
  - group search attribute: `cn`
That should be it. Go to the login attributes tab and enter a username (which should have been added to the nextcloud group) to check that nextcloud is able to find it and allows it for login.
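The same check can be done from a terminal before touching the Nextcloud interface. This is a sketch only: the helper name `nextcloud_login_filter` is hypothetical, and the `ldap://` scheme on the default port is an assumption about how the directory is exposed. It expands the `%uid` placeholder of the login filter above for one username.

```bash
# Print the login filter from the "login attributes" tab with Nextcloud's
# %uid placeholder replaced by a concrete login name ($1).
# The name is inserted as-is: special LDAP characters are not escaped.
nextcloud_login_filter() {
    printf '%s\n' "(&(&(|(objectclass=inetOrgPerson))(|(memberof=cn=nextcloud,ou=groups,dc=deuxfleurs,dc=fr)))(|(|(mailPrimaryAddress=$1)(mail=$1))(|(cn=$1))))"
}

# Usage (prompts for the bind password; scheme and port are assumptions):
# ldapsearch -x -H ldap://bottin2.service.2.cluster.deuxfleurs.fr \
#     -D 'cn=nextcloud,ou=services,ou=users,dc=deuxfleurs,dc=fr' -W \
#     -b 'ou=users,dc=deuxfleurs,dc=fr' "$(nextcloud_login_filter alice)" cn
```

If the search returns no entry, the user is most likely missing from the `nextcloud` group.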

@ -1,44 +0,0 @@
## Creating a new Plume user
1. Bind nomad on your machine with SSH (check the README file at the root of this repo)
2. Go to http://127.0.0.1:4646
3. Select `plume` -> click `exec` button (top right)
4. Select `plume` on the left panel
5. Press `enter` to get a bash shell
6. Run:
```bash
plm users new \
--username alice \
--display-name Alice \
--bio "Just an internet user" \
--email alice@example.com \
--password s3cr3t
```
That's all folks, now you can use your new account at https://plume.deuxfleurs.fr
## Bug and debug
If you can't follow a new user and have this error:
```
2022-04-23T19:26:12.639285Z WARN plume::routes::errors: Db(DatabaseError(UniqueViolation, "duplicate key value violates unique constraint \"follows_unique_ap_url\""))
```
You might have an empty field in your database:
```
plume=> select * from follows where ap_url='';
  id  | follower_id | following_id | ap_url
------+-------------+--------------+--------
 2118 |          20 |          238 |
(1 row)
```
Simply set the `ap_url` as follows:
```
plume=> update follows set ap_url='https://plume.deuxfleurs.fr/follows/2118' where id=2118;
UPDATE 1
```

@ -1,45 +0,0 @@
On January 20, Free changed my IP address, as it did pretty much everywhere in France.
This affects both the IPv4 address and the IPv6 prefix.
Here is good old Bortzmoinsbien tweeting about it: https://twitter.com/bortzmeyer/status/1351434290916155394
Max updated the IPv4 record right away, but with a 4-hour TTL, propagation takes a long time.
I lowered the TTL of the IP records to 300 seconds (5 minutes), the minimum at Gandi; whether that is a good idea remains to be seen.
The IPv6 records still have to be updated; they are less critical for front-facing services but are used for internal signaling...
## The infamous signaling
This is a big problem for Nomad (and, to a lesser extent, for Consul).
Nomad uses IPv6 to communicate, so the IPs of all the nodes have to be changed.
Problem! The migration cannot be done progressively because, once their IPs change, the nodes can no longer talk to each other.
We do not want to delete the cluster and create a new one, because that would mean redeploying everything, which also takes a long time (all the HCL files for Nomad, the whole KV store for Consul).
Nor can we brute-force it by stopping the whole cluster, changing the IPs, and restarting.
Well, actually we can with Consul, but not with Nomad, which keeps trying to reach the old IPs and never reaches consensus.
While I was at it, I also renamed the nodes, because last time Nomad did NOT AT ALL appreciate a node changing IP while keeping the same name. That said, if we end up using `peers.json` anyway, it may not be a problem. To be tested.
So, after much thought, the silver bullet is Nomad's outage recovery feature (Consul has it too, if needed).
It is documented here: https://learn.hashicorp.com/tutorials/consul/recovery-outage
Basically, you have to stop all the nodes.
Then create a file at this path: `/var/lib/nomad/server/raft/peers.json`
Do not be confused by the `peers.info` file next to it; it must not be touched.
Then the big question is whether the cluster runs Raft v2 or Raft v3.
Well, we were on Raft v2. If you get it wrong, Nomad crashes on restart with a nasty error:
```
nomad: failed to start Raft: error="recovery failed to parse peers.json: json: cannot unmarshal string into Go value of type raft.configEntry"
```
(I got it wrong, of course.)
After that, all that is left is to restart and follow the logs; look carefully for the line saying it found the peers.json.
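For reference, here is a rough sketch of how a Raft v2 style `peers.json` could be generated. According to the HashiCorp outage recovery guides, Raft v2 expects a plain JSON array of `"address:port"` strings, whereas Raft v3 expects an array of objects with `id`, `address` and `non_voter` fields. The helper name and the addresses below are placeholders (documentation-range IPv6 addresses), and 4647 is assumed to be the Nomad server RPC port in use.

```bash
# Hypothetical helper: print a Raft v2 style peers.json, that is a JSON
# array with one "address:port" string per server given as argument.
peers_json_v2() {
    local sep addr
    sep=""
    printf '['
    for addr in "$@"; do
        printf '%s"%s"' "$sep" "$addr"
        sep=","
    done
    printf ']\n'
}

# Usage, with all servers stopped (placeholder addresses):
# peers_json_v2 "[2001:db8::1]:4647" "[2001:db8::2]:4647" "[2001:db8::3]:4647" \
#     > /var/lib/nomad/server/raft/peers.json
```

The same file has to be written on every server before restarting them.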
## Things not to forget
- Reconfigure the Traefik KV backend (consider using DNS names instead)
- Reconfigure the public IPv4 advertised to Jitsi
## What remains to be done
- Update the IPv6 DNS entries, which should create:
  - digitale.machine.deuxfleurs.fr
  - datura.machine.deuxfleurs.fr
  - drosera.machine.deuxfleurs.fr
- Update the Garage instance on io

@ -1,14 +0,0 @@
# The Synapse database is filling up our disks
Todo: finish this blog post and duplicate it here https://quentin.dufour.io/blog/2021-07-12/chroniques-administration-synapse/
The WAL that kept growing without bound was also due to a failing SSD whose writes were abysmally slow.
Actions taken:
- Documented how to add space on a different disk using tablespaces
- Forbade joining rooms with too high a complexity
- Cleaned up the database by hand (empty rooms, unused accounts, etc.)
- Replaced the failing SSD
Actions still to take:
- Use the database maintenance tools distributed by the Matrix project

@ -1,28 +0,0 @@
# GlusterFS corruption
After a server reboot, emails were no longer available.
It turned out that GlusterFS had not been replicating data correctly for some time.
As a result, it returned corrupted Dovecot folders.
Dovecot rebuilt an index without the emails, which desynchronized people's mailboxes.
In the end, some mailboxes lost all their emails.
No backup of the emails was being made.
The problem was introduced this summer when I reinstalled a server.
I installed it on a different Debian version.
The GlusterFS version was pinned in a sources.list pointing to the Gluster project's repository.
But the pinning targeted the previous Debian version.
The sources.list was ignored, and the more recent GlusterFS packaged by Debian was installed instead.
These versions were incompatible, but silently so.
GlusterFS does not proactively report that volumes are out of sync either.
There is no command to check the state of the cluster.
After several days of work, I was unable to recover the emails.
Actions taken:
- Removed GlusterFS
- Daily backups of the emails
- Emails are now stored directly on disk (no high availability)
Action in progress:
- Development of an IMAP server on top of Garage

@ -1,15 +0,0 @@
- **2020** Publii wiped the hard drive of one of our members. He had changed the output folder to /home, which was erased.
- **2021-07-27** Power outage in Rennes: 40,000 people without electricity for a day. Our production servers were in the affected area, so deuxfleurs.fr went dark - https://www.francebleu.fr/infos/faits-divers-justice/rennes-plusieurs-quartiers-prives-d-electricite-1627354121
- **2021-12:** Somewhat hasty attempt to migrate to Tricot to replace Traefik, which was causing problems. Downtime, lack of communication about the causes, general confusion.
*Actions to consider:* plan ahead any intervention likely to affect the quality of service on the Deuxfleurs infrastructure. Test as much as possible beforehand to avoid having to test in production. When testing in production is unavoidable, organize it so as to affect as few people as possible.
- **2022-03-28:** Power cut at the Jupiter site; `io` did not restart on its own. T had to switch it back on manually. `io` was unavailable for a few hours.
*Actions to consider:* reconfigure `io` to power on automatically when power returns.
- **2022-03-28:** Grafana (hosted by M) was unavailable. M is the only one able to intervene.
*Actions to consider:* map out the monitoring infrastructure and make sure several people have access.

@ -1,186 +0,0 @@
Add the admin account as `deuxfleurs` to your `~/.mc/config.json` file.
You need to choose some names/identifiers:
```bash
export ENDPOINT="https://s3.garage.tld"
export SERVICE_NAME="example"
export BUCKET_NAME="backups-${SERVICE_NAME}"
export NEW_ACCESS_KEY_ID="key-${SERVICE_NAME}"
export NEW_SECRET_ACCESS_KEY=$(openssl rand -base64 32)
export POLICY_NAME="policy-$BUCKET_NAME"
```
Create a new bucket:
```bash
mc mb deuxfleurs/$BUCKET_NAME
```
Create a new user:
```bash
mc admin user add deuxfleurs $NEW_ACCESS_KEY_ID $NEW_SECRET_ACCESS_KEY
```
Add this new user to your `~/.mc/config.json`. First run this command to generate the snippet to copy/paste:
```
cat > /dev/stdout <<EOF
"$NEW_ACCESS_KEY_ID": {
"url": "$ENDPOINT",
"accessKey": "$NEW_ACCESS_KEY_ID",
"secretKey": "$NEW_SECRET_ACCESS_KEY",
"api": "S3v4",
"path": "auto"
},
EOF
```
---
Create a policy for this bucket and save it as json:
```bash
cat > /tmp/policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${BUCKET_NAME}"
]
},
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::${BUCKET_NAME}/*"
]
}
]
}
EOF
```
Register it:
```bash
mc admin policy add deuxfleurs $POLICY_NAME /tmp/policy.json
```
Set it to your user:
```bash
mc admin policy set deuxfleurs $POLICY_NAME user=${NEW_ACCESS_KEY_ID}
```
Now it should display *only* your new bucket when running:
```bash
mc ls $NEW_ACCESS_KEY_ID
```
---
Now we need to initialize the repository with restic.
```bash
export AWS_ACCESS_KEY_ID=$NEW_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=$NEW_SECRET_ACCESS_KEY
export RESTIC_REPOSITORY="s3:$ENDPOINT/$BUCKET_NAME"
export RESTIC_PASSWORD=$(openssl rand -base64 32)
```
Save the password:
```bash
echo $RESTIC_PASSWORD
pass insert deuxfleurs/backups/$SERVICE_NAME/restic_password
# ctrl + c -> ctrl + v
cd ~/.password-store/deuxfleurs/
git pull ; git push
cd -
```
Then init the repo for restic from your machine:
```
restic init
```
*I am using restic version `restic 0.12.1 compiled with go1.16.9 on linux/amd64`*
See your snapshots with:
```
restic snapshots
```
Check also these useful commands:
```
restic ls
restic diff
restic help
```
---
Add the secrets to Consul, near your service secrets.
The idea is that the backup service is a component of the global running service.
You must run in `app/<name>/secrets/<subpath>`:
```bash
echo "USER Backup AWS access key ID" > backup_aws_access_key_id
echo "USER Backup AWS secret access key" > backup_aws_secret_access_key
echo "USER Restic repository, eg. s3:https://s3.garage.tld" > backup_restic_repository
echo "USER Restic password to encrypt backups" > backup_restic_password
```
Then run secretmgr:
```bash
# Spawning a nix shell is an easy way to get all the dependencies you need
nix-shell
# Check that secretmgr works for you
python3 secretmgr.py check <name>
# Now interactively feed the secrets
python3 secretmgr.py gen <name>
```
---
Now we need a service that runs:
```
restic backup .
```
Find an existing .hcl declaration that uses restic in this repository or in the Deuxfleurs/nixcfg repository
and use it as an example.
The service must also garbage-collect old snapshots.
I propose:
```
restic forget --prune --keep-within 1m1d --keep-within-weekly 3m --keep-within-monthly 1y
```
Also try to restore a snapshot:
```
restic restore <snapshot id> --target /tmp/$SERVICE_NAME
```
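Put together, the backup service boils down to two restic calls. The wrapper below is only a sketch of that sequence, not the declaration actually deployed: the function name `run_backup` is hypothetical, and it assumes `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` are already exported, as shown earlier.

```bash
# Hypothetical wrapper: back up one directory, then prune old snapshots
# with the retention policy proposed above.
# $1 = directory to back up
# The body runs in a subshell so the caller's working directory is untouched.
run_backup() (
    cd "$1" || exit 1
    restic backup . || exit 1
    restic forget --prune \
        --keep-within 1m1d \
        --keep-within-weekly 3m \
        --keep-within-monthly 1y
)

# Usage:
# run_backup /path/to/service/data
```

Pruning only runs when the backup succeeded, so a failing backup never shrinks the snapshot history.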

@ -1,166 +0,0 @@
## you are new and want to access the secret repository
You need a GPG key to start with.
You can generate one with:
```bash
gpg2 --expert --full-gen-key
# Personally I use `9) ECC and ECC`, `1) Curve 25519`, and `5y`
```
Now export your public key:
```bash
gpg2 --export --armor <your email address>
```
You can upload it to Gitea; it will then be publicly available.
For example, you can access my key at this URL:
```
https://git.deuxfleurs.fr/quentin.gpg
```
You can import it to your keychain as follows:
```bash
gpg2 --import <(curl https://git.deuxfleurs.fr/quentin.gpg)
gpg2 --list-keys
# pub ed25519/0xE9602264D639FF68 2022-04-19 [SC] [expire : 2027-04-18]
# Empreinte de la clef = 8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68
# uid [ ultime ] Quentin Dufour <quentin@deuxfleurs.fr>
# sub cv25519/0xA40574404FF72851 2022-04-19 [E] [expire : 2027-04-18]
```
How to read this snippet:
- the key id: `E9602264D639FF68`
- the key fingerprint: `8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68`
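For the v4 keys shown here, the key id is simply the last 16 hexadecimal digits of the fingerprint, which makes it easy to cross-check the two values. A small sketch (the helper name is made up):

```bash
# Print the long key id of an OpenPGP v4 key, that is the last 16 hex
# digits of its fingerprint. Spaces in the fingerprint are ignored.
key_id_from_fingerprint() {
    printf '%s\n' "$1" | tr -d ' ' | sed 's/^.*\(.\{16\}\)$/\1/'
}

# Usage:
# key_id_from_fingerprint "8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68"
```

With the fingerprint above, this prints `E9602264D639FF68`, the key id listed in the snippet.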
Now, you need to:
1. Inform all other sysadmins that you have published your key
2. Check that the key of other sysadmins is the correct one.
To perform the check, you need another communication channel (ideally in person, otherwise over the phone, or Matrix if you already trust the other person, etc.)
Once you trust someone, sign their key:
```bash
gpg --edit-key quentin@deuxfleurs.fr
# or
gpg --edit-key E9602264D639FF68
# gpg> lsign
# (say yes)
# gpg> save
```
Once you have signed everybody's key, ask a sysadmin to add your key to `<secrets>/.gpg-id` and then run:
```
pass init -p deuxfleurs $(cat ~/.password-store/deuxfleurs/.gpg-id)
cd ~/.password-store
git commit
git push
```
Now you are ready to install `pass`:
```bash
sudo apt-get install pass # Debian + Ubuntu
sudo yum install pass # Fedora + RHEL
sudo zypper in password-store # OpenSUSE
sudo emerge -av pass # Gentoo
sudo pacman -S pass # Arch Linux
brew install pass # macOS
pkg install password-store # FreeBSD
```
*Go to [passwordstore.org](https://www.passwordstore.org/) for more information about pass*.
Download the repository:
```
mkdir -p ~/.password-store
cd ~/.password-store
git clone git@git.deuxfleurs.fr:Deuxfleurs/secrets.git deuxfleurs
```
And then check that everything works:
```bash
pass show deuxfleurs
```
---
---
## init
generate a new password store named deuxfleurs for you:
```
pass init -p deuxfleurs you@example.com
```
add a password in this store, it will be encrypted with your gpg key:
```bash
pass generate deuxfleurs/backup_nextcloud 20
# or
pass insert deuxfleurs/backup_nextcloud
```
## add a teammate
edit `~/.password-store/deuxfleurs/.gpg-id` and add the ids of your teammates:
```
alice@example.com
jane@example.com
bob@example.com
```
make sure that you trust the keys of your teammates:
```
$ gpg --edit-key jane@example.com
gpg> lsign
gpg> y
gpg> save
```
Now re-encrypt the secrets:
```
pass init -p deuxfleurs $(cat ~/.password-store/deuxfleurs/.gpg-id)
```
They will now be able to decrypt the password:
```
pass deuxfleurs/backup_nextcloud
```
## sharing with git
To create the repo:
```bash
cd ~/.password-store/deuxfleurs
git init
git add .
git commit -m "Initial commit"
# Set up remote
git push
```
To set up the repo:
```bash
cd ~/.password-store
git clone https://git.example.com/org/repo.git deuxfleurs
```
## Ref
https://medium.com/@davidpiegza/using-pass-in-a-team-1aa7adf36592

@ -1,3 +0,0 @@
- [Initialize the cluster](install.md)
- [Create a database](create_database.md)
- [Manually backup all the databases](manual_backup.md)

@ -1,26 +0,0 @@
## 1. Create an LDAP user and assign a password for your service
Go to guichet.deuxfleurs.fr
1. Everything takes place in `ou=services,ou=users,dc=deuxfleurs,dc=fr`
2. Create a new user, like `johny`
3. Generate a random password with `openssl rand -base64 32`
4. Hash it with `slappasswd`
5. Add a `userpassword` entry with the hash
This step can also be done using the automated tool `secretmgr.py` in the app folder.
## 2. Connect to postgres with the admin user
```bash
# 1. Launch ssh tunnel given in the README
# 2. Make sure you have the postgresql client installed locally
psql -h localhost -U postgres -W postgres
```
## 3. Create the LDAP-bound user in postgres, and the database
```sql
CREATE USER sogo;
CREATE DATABASE sogodb WITH OWNER sogo ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;
```
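For another service, the two statements only differ by the role and database names. A minimal sketch of a generator follows; the helper name is hypothetical, and it assumes plain lowercase identifiers that need no quoting and a role name equal to the `cn` of the LDAP user created in step 1.

```bash
# Hypothetical helper: print the SQL creating a role and a database it owns.
# $1 = role name (same as the LDAP user's cn), $2 = database name
create_db_sql() {
    printf 'CREATE USER %s;\n' "$1"
    printf "CREATE DATABASE %s WITH OWNER %s ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;\n" "$2" "$1"
}

# Usage (through the SSH tunnel, as the postgres admin user):
# create_db_sql sogo sogodb | psql -h localhost -U postgres -W postgres
```

Printing the SQL first, rather than piping it blindly, leaves a chance to review it before it reaches the production cluster.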

@ -1,87 +0,0 @@
Spawn container:
```bash
docker run \
-ti --rm \
--name stolon-config \
--user root \
-v /var/lib/consul/pki/:/certs \
superboum/amd64_postgres:v11
```
Init with:
```
stolonctl \
--cluster-name chelidoine \
--store-backend=consul \
--store-endpoints https://consul.service.prod.consul:8501 \
--store-ca-file /certs/consul-ca.crt \
--store-cert-file /certs/consul2022-client.crt \
--store-key /certs/consul2022-client.key \
init \
'{ "initMode": "new",
"usePgrewind" : true,
"proxyTimeout" : "120s",
"pgHBA": [
"host all postgres all md5",
"host replication replicator all md5",
"host all all all ldap ldapserver=bottin.service.prod.consul ldapbasedn=\"ou=users,dc=deuxfleurs, dc=fr\" ldapbinddn=\"<bind_dn>\" ldapbindpasswd=\"<bind_pwd>\" ldapsearchattribute=\"cn\""
]
}'
```
Then set appropriate permission on host:
```
mkdir -p /mnt/{ssd,storage}/postgres/
chown -R 999:999 /mnt/{ssd,storage}/postgres/
```
(999 is the id of the postgres user used in Docker)
It might be improved by staying with root, then chmoding in an entrypoint and finally switching to user 999 before executing the user's command.
Moreover, it would enable the use of user namespaces, which shift the UIDs.
## Upgrading the cluster
To retrieve the current stolon config:
```
stolonctl spec --cluster-name chelidoine --store-backend consul --store-ca-file ... --store-cert-file ... --store-endpoints https://consul.service.prod.consul:8501
```
The important part for the LDAP:
```
{
"pgHBA": [
"host all postgres all md5",
"host replication replicator all md5",
"host all all all ldap ldapserver=bottin.service.2.cluster.deuxfleurs.fr ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"cn=admin,dc=deuxfleurs,dc=fr\" ldapbindpasswd=\"<REDACTED>\" ldapsearchattribute=\"cn\""
]
}
```
Once a patch is written:
```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch -f /tmp/patch.json
```
## Log
- 2020-12-18 Activate pg\_rewind in stolon
```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "usePgrewind" : true }'
```
- 2021-03-14 Increase proxy timeout to cope with consul latency spikes
```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "proxyTimeout" : "120s" }'
```

@ -1,305 +0,0 @@
## Disclaimer
Do **NOT** use the following backup methods on the Stolon Cluster:
1. copying the data directory
2. `pg_dump`
3. `pg_dumpall`
The first one will lead to corrupted/inconsistent files.
The second and third ones put too much pressure on the cluster.
Basically, you will destroy it, in the following ways:
- Load will increase, requests will timeout
- RAM usage will increase, the daemon will be OOM (Out Of Memory) killed by Linux
- Potentially, the WAL log will grow a lot
## A binary backup with `pg_basebackup`
The only acceptable solution is `pg_basebackup` with **some throttling configured**.
Later, if you want a SQL dump, you can inject this binary backup on an ephemeral database you spawned solely for this purpose on a non-production machine.
First, fetch the credentials of the replication account from Consul.
Do not use the root account set up in Stolon; it will not work.
Then set up an SSH tunnel on your machine that binds PostgreSQL, e.g.:
```bash
ssh -L 5432:psql-proxy.service.2.cluster.deuxfleurs.fr:5432 ...
```
*Later, we will use `/tmp/sql` as our working directory. Depending on your distribution, this
folder may be a `tmpfs` and thus mounted in RAM. If that is the case, choose another folder that is not a `tmpfs`, otherwise you will fill your RAM
and fail your backup. I am using NixOS and the `/tmp` folder is a regular folder, persisted on disk, which explains why I am using it.*
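A quick way to check this before starting, sketched under the assumption that `stat` is the GNU coreutils one (its `-f -c %T` options print the filesystem type); the helper name is made up:

```bash
# Succeed if the given path lives on a tmpfs.
# Assumes GNU coreutils stat; with any other stat, or a missing path,
# the check simply fails.
is_tmpfs() {
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "tmpfs" ]
}

# Usage:
# if is_tmpfs /tmp; then echo "/tmp is in RAM, pick another folder"; fi
```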
Then export your password in `PGPASSWORD` and launch the backup:
```bash
export PGPASSWORD=xxx
mkdir -p /tmp/sql
cd /tmp/sql
pg_basebackup \
--host=127.0.0.1 \
--username=replicator \
--pgdata=/tmp/sql \
--format=tar \
--wal-method=stream \
--gzip \
--compress=6 \
--progress \
--max-rate=5M
```
*Something you should know: while it seems optional, fetching the WAL is mandatory. At first, I thought it was a way to have a "more recent backup".
But after some reading, it appears that the base backup is corrupted because it is not a snapshot at all, but a copy of the postgres folder with no specific state.
The whole point of the WAL is, in fact, to fix this corrupted archive...*
*Grab a cup of coffee, it will take some time...*
The result I get (the important file is `base.tar.gz`, `41921.tar.gz` will probably be missing as it is a secondary tablespace I will deactivate soon):
```
[nix-shell:/tmp/sql]$ ls
41921.tar.gz backup_manifest base.tar.gz pg_wal.tar.gz
```
From now on, disconnect from production to continue your work.
You don't need it anymore, and it will prevent a disaster if you get a command wrong.
## Importing the backup
> The backup taken with `pg_basebackup` is an exact copy of your data directory so, all you need to do to restore from that backup is to point postgres at that directory and start it up.
```bash
mkdir -p /tmp/sql/pg_data && cd /tmp/sql/pg_data
tar xzfv ../base.tar.gz
```
Now you should have something like this:
```
[nix-shell:/tmp/sql/pg_data]$ ls
backup_label base pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_wal postgresql.conf tablespace_map
backup_label.old global pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION pg_xact stolon-temp-postgresql.conf
```
Now we will extract the WAL:
```bash
mkdir -p /tmp/sql/wal && cd /tmp/sql/wal
tar xzfv ../pg_wal.tar.gz
```
You should have something like this:
```
[nix-shell:/tmp/sql/wal]$ ls
00000003000014AF000000C9 00000003000014AF000000CA 00000003.history archive_status
```
Before restoring our backup, we want to check it:
```bash
cd /tmp/sql/pg_data
cp ../backup_manifest .
# We do not verify the WAL because it does not seem to work very well
# See my capdata.fr reference at the bottom
# pg_verifybackup -w ../wal .
pg_verifybackup -n .
```
Now, we must edit/read some files before launching our ephemeral server:
- Set `listen_addresses = '0.0.0.0'` in `postgresql.conf`
- Add `restore_command = 'cp /mnt/wal/%f %p' ` in `postgresql.conf`
- Check `port` in `postgresql.conf`, in our case it is `5433`.
- Create an empty file named `recovery.signal`
*Do not create a `recovery.conf` file, it might be written on the internet but this is a deprecated method and your postgres daemon will refuse to boot if it finds one.*
*Currently, we use port 5433 in our postgresql configuration despite 5432 being the default port. Indeed, in production, clients access the cluster transparently through the Stolon Proxy, which listens on port 5432 and redirects requests to the correct PostgreSQL instance, secretly listening on port 5433! To export our binary backup as text, we will query our postgres instance directly without going through the proxy, which is why you must note this port.*
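The edits listed above can be applied in one go. This is a sketch, not the procedure that was actually run: the helper name `prepare_recovery` is hypothetical, and it appends the settings instead of editing them in place, relying on PostgreSQL keeping only the last occurrence of a parameter in `postgresql.conf`.

```bash
# Hypothetical helper applying the settings above to an extracted backup.
# $1 = extracted data directory (pg_data)
prepare_recovery() {
    # Appending is enough: when a parameter appears several times in
    # postgresql.conf, PostgreSQL uses the last occurrence.
    {
        printf "listen_addresses = '0.0.0.0'\n"
        printf "restore_command = 'cp /mnt/wal/%%f %%p'\n"
    } >> "$1/postgresql.conf" || return 1
    # An empty recovery.signal file makes PostgreSQL start in recovery mode.
    : > "$1/recovery.signal" || return 1
    # Print the configured port so that it can be noted for later.
    grep -E '^[[:space:]]*port[[:space:]]*=' "$1/postgresql.conf" || true
}

# Usage:
# prepare_recovery /tmp/sql/pg_data
```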
Now we will start our postgres container on our machine.
At the time of writing the live version is `superboum/amd64_postgres:v9`.
We must start by getting the `postgres` user id. Our containers run as this user by default, so you only need to run:
```bash
docker run --rm -it superboum/amd64_postgres:v9 id
```
And we get:
```
uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)
```
Now `chown` your `pg_data`:
```bash
chown 999:999 -R /tmp/sql/{pg_data,wal}
chmod 700 -R /tmp/sql/{pg_data,wal}
```
And finally:
```
docker run \
--rm \
-it \
-p 5433:5433 \
-v /tmp/sql/:/mnt/ \
superboum/amd64_postgres:v9 \
postgres -D /mnt/pg_data
```
I have the following output:
```
2022-01-28 14:46:39.750 GMT [1] LOG: skipping missing configuration file "/mnt/pg_data/postgresql.auto.conf"
2022-01-28 14:46:39.763 UTC [1] LOG: starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-01-28 14:46:39.764 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5433
2022-01-28 14:46:39.767 UTC [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-01-28 14:46:39.773 UTC [7] LOG: database system was interrupted; last known up at 2022-01-28 14:33:13 UTC
cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
2022-01-28 14:46:40.318 UTC [7] LOG: starting archive recovery
2022-01-28 14:46:40.321 UTC [7] LOG: restored log file "00000003.history" from archive
2022-01-28 14:46:40.336 UTC [7] LOG: restored log file "00000003000014AF000000C9" from archive
2022-01-28 14:46:41.426 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:41.445 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:41.457 UTC [7] LOG: redo starts at 14AF/C9000028
2022-01-28 14:46:41.500 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
2022-01-28 14:46:42.461 UTC [7] LOG: consistent recovery state reached at 14AF/CA369AB0
2022-01-28 14:46:42.461 UTC [1] LOG: database system is ready to accept read only connections
cp: cannot stat '/mnt/wal/00000003000014AF000000CB': No such file or directory
2022-01-28 14:46:42.463 UTC [7] LOG: redo done at 14AF/CA369AB0
2022-01-28 14:46:42.463 UTC [7] LOG: last completed transaction was at log time 2022-01-28 14:35:04.698438+00
2022-01-28 14:46:42.480 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:42.493 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
2022-01-28 14:46:43.462 UTC [7] LOG: selected new timeline ID: 4
2022-01-28 14:46:44.441 UTC [7] LOG: archive recovery complete
2022-01-28 14:46:44.444 UTC [7] LOG: restored log file "00000003.history" from archive
2022-01-28 14:46:45.614 UTC [1] LOG: database system is ready to accept connections
```
*Notes: the missing tablespace is a legacy tablespace used in the past to debug Matrix. It will be removed soon, so we can safely ignore it. The other `cp` errors seem to be expected, as postgres might want to know how far it can rewind with the WAL, but I am not 100% sure.*
Your ephemeral instance should work:
```bash
export PGPASSWORD=xxx # your postgres (admin) account password
psql -h 127.0.0.1 -p 5433 -U postgres postgres
```
And your databases should appear:
```
[nix-shell:~/Documents/dev/infrastructure]$ psql -h 127.0.0.1 -p 5433 -U postgres postgres
psql (13.5, server 13.3 (Debian 13.3-1.pgdg100+1))
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(8 rows)
```
## Dump your ephemeral database as SQL
Now we can do a SQL export of our ephemeral database.
We use zstd to automatically compress the output file.
We use multiple parameters:
- `-vv` gives us some idea of the progress
- `-9` is a quite high value and should compress efficiently. Decrease it if your machine is low powered
- `-T0` asks zstd to use all your cores. By default, zstd uses only one core.
```bash
pg_dumpall -h 127.0.0.1 -p 5433 -U postgres \
| zstd -vv -9 -T0 --format=zstd > dump-`date --rfc-3339=seconds | sed 's/ /T/'`.sql.zstd
```
I get the following result:
```
[nix-shell:/tmp/sql]$ ls -lah dump*
-rw-r--r-- 1 quentin users 749M janv. 28 16:07 dump-2022-01-28T16:06:29+01:00.sql.zstd
```
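The timestamp in the dump file name can also be produced without the `sed` step. This is a minimal sketch, assuming GNU `date` (its `-Iseconds` flag prints ISO 8601 with a `T` separator); `dump_name` is a hypothetical helper, not part of the original procedure:

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a dump file name with an ISO 8601 timestamp.
# Assumes GNU date; -Iseconds prints something like 2022-01-28T16:06:29+01:00.
dump_name() {
  printf 'dump-%s.sql.zstd\n' "$(date -Iseconds)"
}

dump_name
```

The output could then replace the backtick expression, for example `... | zstd -vv -9 -T0 > "$(dump_name)"`.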
Now you can stop your ephemeral server.
## Restore your SQL file
First, start a blank server:
```bash
docker run \
--rm -it \
--name postgres \
-p 5433:5432 \
superboum/amd64_postgres:v9 \
bash -c '
set -ex
mkdir /tmp/psql
initdb -D /tmp/psql --no-locale --encoding=UTF8
echo "host all postgres 0.0.0.0/0 md5" >> /tmp/psql/pg_hba.conf
postgres -D /tmp/psql
'
```
Then set the same password as your prod for the `postgres` user (it will be required as part of the restore):
```bash
docker exec -ti postgres bash -c "echo \"ALTER USER postgres WITH PASSWORD '$PGPASSWORD';\" | psql"
echo '\l' | psql -h 127.0.0.1 -p 5433 -U postgres postgres
# the database should have no entry (except `postgres`, `template0` and `template1`) otherwise ABORT EVERYTHING, YOU ARE ON THE WRONG DB
```
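The "blank server" check can also be scripted instead of read by eye. The sketch below is only an illustration: `unexpected_dbs` is a hypothetical helper, and it assumes the database names arrive one per line, as `psql -At -c 'SELECT datname FROM pg_database'` would print them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read database names on stdin (one per line) and print
# every name that is not one of the three expected on a blank server.
unexpected_dbs() {
  grep -E -v '^(postgres|template0|template1)$' || true
}

# Intended use against the blank server (commented out: needs a running server):
# extra="$(psql -h 127.0.0.1 -p 5433 -U postgres -At \
#            -c 'SELECT datname FROM pg_database' postgres | unexpected_dbs)"
# [ -z "$extra" ] || { echo "unexpected databases: $extra" >&2; exit 1; }
```

An empty result means only the default databases exist; anything else is a reason to stop before restoring.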
And finally, restore your SQL backup:
```bash
zstdcat -vv dump-* | \
grep -P -v '^(CREATE|DROP) ROLE postgres;' | \
psql -h 127.0.0.1 -p 5433 -U postgres --set ON_ERROR_STOP=on postgres
```
*Note: we must skip `CREATE ROLE postgres` and `DROP ROLE postgres` during the restore, as the role already exists and the statements would generate an error.
Because we want to be extra careful, we specifically asked psql to stop on every error and do not want to change this behavior.
So, instead, we simply remove any line that matches the regex in the previous command.*
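To see what the filter removes before running it on a real dump, it can be tried on a few lines. This sketch uses `grep -E` rather than the `-P` of the command above (the pattern is the same and needs no Perl-specific syntax); the sample statements, including the `synapse` role, are made up for illustration.

```bash
#!/usr/bin/env bash
# Drop only the CREATE/DROP ROLE statements that target the postgres role.
# Uses grep -E (extended regex) instead of -P; the pattern is unchanged.
filter_postgres_role() {
  grep -E -v '^(CREATE|DROP) ROLE postgres;'
}

# Made-up sample resembling the beginning of a pg_dumpall output.
filter_postgres_role <<'EOF'
DROP ROLE postgres;
CREATE ROLE postgres;
CREATE ROLE synapse;
ALTER ROLE postgres WITH SUPERUSER;
EOF
```

Only the first two sample lines are dropped; statements for other roles and `ALTER ROLE` lines pass through untouched.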
Check that the backup has been correctly restored.
For example:
```bash
docker exec -ti postgres psql
#then type "\l", "\c db-name", "select ..."
```
## Finally, store it safely
```bash
rsync --progress -av /tmp/sql/{*.tar.gz,backup_manifest,dump-*} backup/target
```
## Ref
- https://philipmcclarence.com/backing-up-and-restoring-postgres-using-pg_basebackup/
- https://www.cybertec-postgresql.com/en/pg_basebackup-creating-self-sufficient-backups/
- https://www.postgresql.org/docs/14/continuous-archiving.html
- https://www.postgresql.org/docs/14/backup-dump.html#BACKUP-DUMP-RESTORE
- https://dba.stackexchange.com/questions/75033/how-to-restore-everything-including-postgres-role-from-pg-dumpall-backup
- https://blog.capdata.fr/index.php/postgresql-13-les-nouveautes-interessantes/

View file

@ -1,26 +0,0 @@
Start by following ../restic
## Garbage collect old backups
```
mc ilm import deuxfleurs/${BUCKET_NAME} <<EOF
{
"Rules": [
{
"Expiration": {
"Days": 62
},
"ID": "PurgeOldBackups",
"Status": "Enabled"
}
]
}
EOF
```
Check that it has been activated:
```
mc ilm ls deuxfleurs/${BUCKET_NAME}
```
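If the retention period needs to differ between buckets, the lifecycle document above can be generated rather than typed by hand. A minimal sketch, where `purge_rule` is a hypothetical helper and the rule ID and fields are copied from the example above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the lifecycle document used above with a
# configurable retention period (in days).
purge_rule() {
  local days="$1"
  case "$days" in
    ''|*[!0-9]*) echo "usage: purge_rule DAYS" >&2; return 1 ;;
  esac
  cat <<EOF
{
  "Rules": [
    {
      "Expiration": { "Days": ${days} },
      "ID": "PurgeOldBackups",
      "Status": "Enabled"
    }
  ]
}
EOF
}

purge_rule 62
# To apply it (needs a configured mc alias, as in the command above):
# purge_rule 62 | mc ilm import deuxfleurs/${BUCKET_NAME}
```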

View file

@ -1,15 +0,0 @@
```
curl 'http://127.0.0.1:8500/v1/kv/traefik/acme/account/object?raw' > traefik.gzip
gunzip -c traefik.gzip > traefik.json
cat traefik.json | jq '.DomainsCertificate.Certs[] | .Certificate.Domain, .Domains.Main'
# "alps.deuxfleurs.fr"
# "alps.deuxfleurs.fr"
# "cloud.deuxfleurs.fr"
# "cloud.deuxfleurs.fr"
# each domain name must appear twice in a row; otherwise, fix it as follows
cat traefik.json | jq . > traefik-new.json
vim traefik-new.json
# remove the corrupted certificates; Traefik will renew them automatically on startup
gzip -c traefik-new.json > traefik-new.gzip
curl --request PUT --data-binary @traefik-new.gzip http://127.0.0.1:8500/v1/kv/traefik/acme/account/object
```
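The "each domain must appear twice in a row" check can be automated on the `jq` output. This is a sketch under assumptions: `check_cert_pairs` is a hypothetical helper, and it expects the same two-lines-per-certificate output as the `jq` command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read the jq output on stdin and print the first line of
# every pair whose two lines differ (or whose second line is missing).
# Exits non-zero when at least one such pair is found.
check_cert_pairs() {
  awk '
    NR % 2 == 1 { first = $0; next }       # odd line: remember it
    $0 != first { print first; bad = 1 }   # even line: must equal the previous one
    END {
      if (NR % 2 == 1) { print first; bad = 1 }  # dangling last line
      exit bad + 0
    }
  '
}

# Demo with the sample output shown above.
printf '%s\n' alps.deuxfleurs.fr alps.deuxfleurs.fr \
              cloud.deuxfleurs.fr cloud.deuxfleurs.fr \
  | check_cert_pairs && echo "all pairs match"
```

On the real file this would be fed with `jq '.DomainsCertificate.Certs[] | .Certificate.Domain, .Domains.Main' traefik.json | check_cert_pairs`; any domain it prints is a candidate for removal.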

View file

@ -1,93 +0,0 @@
How to update Matrix?
=====================
## 1. Build the new containers
Often, I update Riot Web and Synapse at the same time.
* Open `app/docker-compose.yml` and locate `riot` (the Element Web service) and `synapse` (the Matrix Synapse server). There are two things you need to do for each service:
* Set the `VERSION` argument to the target service version (e.g. `1.26.0` for Synapse). This argument is then used to template the Dockerfile.
The `VERSION` value should match a GitHub release; the link to the corresponding release page is given as a comment next to the variable in the compose file;
* Tag the image with a new incremented version tag. For example: `superboum/amd64_riotweb:v17` will become `superboum/amd64_riotweb:v18`.
We use the Docker Hub to store our images. So, if you are not `superboum`, you must replace the name with your own handle, e.g. `john/amd64_riotweb:v18`. This requires that you have registered an account (named `john`) on https://hub.docker.com.
So, from now on, we expect that you have:
* changed the `VERSION` value and `image` name/tag of `riot`
* changed the `VERSION` value and `image` name/tag of `synapse`
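Incrementing the image tag is mechanical, so it can be scripted. The following is a sketch, not part of the repository: `bump_tag` is a hypothetical helper that assumes tags of the form `:vN` with a plain decimal `N`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: increment the numeric suffix of a ":vN" image tag.
bump_tag() {
  local image="$1"
  local name="${image%:v*}"      # part before ":vN"
  local version="${image##*:v}"  # the number N
  case "$version" in
    ''|*[!0-9]*) echo "unsupported tag: $image" >&2; return 1 ;;
  esac
  printf '%s:v%d\n' "$name" "$((version + 1))"
}

bump_tag superboum/amd64_riotweb:v17   # prints superboum/amd64_riotweb:v18
```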
From the `/app` folder, you can now build the new images:
```bash
docker-compose build riot synapse
```
And then send them to the Docker Hub:
```
docker-compose push riot synapse
```
Don't forget to commit and push your changes before doing anything else!
## 2. Deploy the new containers
Now, we will edit the deployment file `app/im/deploy/im.hcl`.
Find where the image is defined in the file; for example, Element Web will look like this:
```hcl
group "riotweb" {
count = 1
task "server" {
driver = "docker"
config {
image = "superboum/amd64_riotweb:v17"
port_map {
web_port = 8043
}
```
And replace the `image =` entry with its new version created above.
Do the same thing for the `synapse` service.
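The `image =` line can be edited by hand, or with a small script. A sketch of the latter, assuming the line has the same shape as in the excerpt above; `set_image_tag` is a hypothetical helper, and the image name is used as a regular expression, so it should not contain special characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rewrite the tag of one image in a Nomad job file.
# Usage: set_image_tag FILE IMAGE_NAME NEW_TAG
set_image_tag() {
  local file="$1" name="$2" tag="$3" tmp rc
  tmp="$(mktemp)" || return 1
  # Match lines such as:  image = "superboum/amd64_riotweb:v17"
  sed -E "s|(image[[:space:]]*=[[:space:]]*\"${name}):[^\"]*\"|\1:${tag}\"|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example (commented out: run it from app/im/deploy):
# set_image_tag im.hcl superboum/amd64_riotweb v18
```

Whatever the method, `nomad plan` in the next step shows the resulting diff, so a wrong substitution is visible before anything is deployed.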
Now, you need a way to access the cluster to deploy this file.
To do this, you must bind Nomad on your machine through an SSH tunnel.
Check the end of [the parent `README.md`](../README.md) to do it.
If you have access to the Nomad web UI when entering http://127.0.0.1:4646
you are ready to go.
You must have installed the Nomad command line tool on your machine (also explained in [the parent `README.md`](../README.md)).
Now, on your machine and from the `app/im/deploy` folder, you should be able to run:
```
nomad plan im.hcl
```
Check that the proposed diff corresponds to what you have in mind.
If it seems OK, just copy and paste the `nomad job run ... im.hcl` command proposed as part of the output of the `nomad plan` command.
From there, it will take around 2 minutes to deploy the new images.
You can follow the deployment from the Nomad UI.
Bear in mind that, once the deployment is done on Nomad, you may still need to wait a few minutes for Traefik to refresh its configuration.
If everything worked as intended, you can commit and push your deployment file.
If something went wrong, you must rollback your deployment.
1. First, find a working deployment with [nomad job history](https://www.nomadproject.io/docs/commands/job/history)
2. Revert to this deployment with [nomad job revert](https://www.nomadproject.io/docs/commands/job/revert)
Now, if the deployment failed, you should probably investigate what went wrong offline.
I built a test stack with docker-compose in `app/<service>/integration` that should help you out (for now, test suites are only written for plume and jitsi).