diff --git a/op_guide/README.md b/op_guide/README.md
new file mode 100644
index 0000000..75b033d
--- /dev/null
+++ b/op_guide/README.md
@@ -0,0 +1,3 @@
+All documents from our operations guide have been moved to the website (or the Git repository) `guide.deuxfleurs.fr`.
+
+Tous les documents de notre guide des opérations ont été déplacés sur le site (ou dans le dépôt Git) `guide.deuxfleurs.fr`.
diff --git a/op_guide/garage/README.md b/op_guide/garage/README.md
deleted file mode 100644
index 44fda62..0000000
--- a/op_guide/garage/README.md
+++ /dev/null
@@ -1 +0,0 @@
-Not very generic currently, check the backup.sh script
diff --git a/op_guide/garage/backup.sh b/op_guide/garage/backup.sh
deleted file mode 100644
index 2ff18cd..0000000
--- a/op_guide/garage/backup.sh
+++ /dev/null
@@ -1,65 +0,0 @@
-#!/bin/bash
-
-cd "$(dirname "$0")"
-
-if [ "$(hostname)" != "io" ]; then
-  echo "Please run this script on io"
-  exit 1
-fi
-
-if [ ! -d "buckets" ]; then
-  btrfs subvolume create "$(pwd)/buckets"
-fi
-
-
-AK=$1
-SK=$2
-
-function gctl {
-  docker exec garage /garage "$@"
-}
-
-gctl status
-BUCKETS=$(gctl bucket list | tail -n +2 | cut -d " " -f 3 | cut -d "," -f 1)
-
-for BUCKET in $BUCKETS; do
-  case $BUCKET in
-    *backup*)
-      echo "Skipping $BUCKET (not doing backup of backup)"
-      ;;
-    *cache*)
-      echo "Skipping $BUCKET (not doing backup of cache)"
-      ;;
-    *)
-      echo "Backing up $BUCKET"
-
-      if [ ! -d "$(pwd)/buckets/$BUCKET" ]; then
-        mkdir "$(pwd)/buckets/$BUCKET"
-      fi
-
-      gctl bucket allow --key "$AK" --read "$BUCKET"
-      rclone sync --s3-endpoint http://localhost:3900 \
-        --s3-access-key-id "$AK" \
-        --s3-secret-access-key "$SK" \
-        --s3-region garage \
-        --s3-force-path-style \
-        --transfers 32 \
-        --fast-list \
-        --stats-one-line \
-        --stats 10s \
-        --stats-log-level NOTICE \
-        ":s3:$BUCKET" "$(pwd)/buckets/$BUCKET"
-      ;;
-  esac
-done
-
-# Remove duplicates
-#duperemove -dAr $(pwd)/buckets
-
-if [ ! 
-d "$(pwd)/snapshots" ]; then
-  mkdir snapshots
-fi
-
-SNAPSHOT=$(pwd)/snapshots/buckets-$(date +%F)
-echo "Making snapshot: $SNAPSHOT"
-btrfs subvolume snapshot "$(pwd)/buckets" "$SNAPSHOT"
diff --git a/op_guide/nextcloud/README.md b/op_guide/nextcloud/README.md
deleted file mode 100644
index f68520b..0000000
--- a/op_guide/nextcloud/README.md
+++ /dev/null
@@ -1,60 +0,0 @@
-# How to set up NextCloud
-
-## First setup
-
-It's complicated.
-
-First, create a service user `nextcloud` and a database `nextcloud` it owns. Also create a Garage access key and a bucket `nextcloud` it is allowed to use.
-
-Fill in the following Consul keys with actual values:
-
-```
-secrets/nextcloud/db_user
-secrets/nextcloud/db_pass
-secrets/nextcloud/garage_access_key
-secrets/nextcloud/garage_secret_key
-```
-
-Create the following Consul keys with empty values:
-
-```
-secrets/nextcloud/instance_id
-secrets/nextcloud/password_salt
-secrets/nextcloud/secret
-```
-
-Start the `nextcloud.hcl` Nomad service. Enter the container and call `occ maintenance:install` with the correct database parameters as user `www-data`.
-A possibility: call the admin user `nextcloud` and give it the same password as the `nextcloud` service user.
-
-Cat the newly generated `config.php` file and copy the instance id, password salt, and secret from there to Consul
-(they were generated by the install script and we want to keep them).
-
-Restart the Nextcloud Nomad service.
-
-You should now be able to log in to Nextcloud using the admin user (`nextcloud` if you called it that).
-
-Go to the apps settings and enable the desired apps.
-
-## Configure LDAP login
-
-LDAP login has to be configured from the admin interface. First, enable the LDAP authentication application.
-
-Go to settings > LDAP/AD integration. 
Enter the following parameters:
-
-- ldap server: `bottin2.service.2.cluster.deuxfleurs.fr`
-- bind user: `cn=nextcloud,ou=services,ou=users,dc=deuxfleurs,dc=fr`
-- bind password: password of the nextcloud service user
-- base DN for users: `ou=users,dc=deuxfleurs,dc=fr`
-- check "manually enter LDAP filters"
-- in the users tab, edit the LDAP query and set it to `(&(|(objectclass=inetOrgPerson))(|(memberof=cn=nextcloud,ou=groups,dc=deuxfleurs,dc=fr)))`
-- in the login attributes tab, edit the LDAP query and set it to `(&(&(|(objectclass=inetOrgPerson))(|(memberof=cn=nextcloud,ou=groups,dc=deuxfleurs,dc=fr)))(|(|(mailPrimaryAddress=%uid)(mail=%uid))(|(cn=%uid))))`
-- in the groups tab, edit the LDAP query and set it to `(|(objectclass=groupOfNames))`
-- in the advanced tab, enter the "directory setting" section and check/modify the following:
-    - user display name field: `displayname`
-    - base user tree: `ou=users,dc=deuxfleurs,dc=fr`
-    - user search attribute: `cn`
-    - group display name field: `displayname`
-    - **base group tree**: `ou=groups,dc=deuxfleurs,dc=fr`
-    - group search attribute: `cn`
-
-That should be it. Go to the login attributes tab and enter a username (which should have been added to the nextcloud group) to check that Nextcloud is able to find it and allows it to log in.
diff --git a/op_guide/plume/README.md b/op_guide/plume/README.md
deleted file mode 100644
index 16f7af9..0000000
--- a/op_guide/plume/README.md
+++ /dev/null
@@ -1,44 +0,0 @@
-## Creating a new Plume user
-
- 1. Forward the Nomad port to your machine with SSH (check the README file at the root of this repo)
 2. Go to http://127.0.0.1:4646
 3. Select `plume` -> click `exec` button (top right)
 4. Select `plume` on the left panel
 5. Press `enter` to get a bash shell
 6. 
Run:
-
-```bash
-plm users new \
-    --username alice \
-    --display-name Alice \
-    --bio "Just an internet user" \
-    --email alice@example.com \
-    --password s3cr3t
-```
-
-That's all folks! You can now use your new account at https://plume.deuxfleurs.fr
-
-## Bugs and debugging
-
-If you can't follow a new user and get this error:
-
-```
-2022-04-23T19:26:12.639285Z WARN plume::routes::errors: Db(DatabaseError(UniqueViolation, "duplicate key value violates unique constraint \"follows_unique_ap_url\""))
-```
-
-You might have an empty field in your database:
-
-```
-plume=> select * from follows where ap_url='';
-  id  | follower_id | following_id | ap_url
-------+-------------+--------------+--------
- 2118 |          20 |          238 |
-(1 row)
-```
-
-Simply set the `ap_url` as follows:
-
-```
-plume=> update follows set ap_url='https://plume.deuxfleurs.fr/follows/2118' where id=2118;
-UPDATE 1
-```
diff --git a/op_guide/postmortem/2020-01-20-changement-ip.md b/op_guide/postmortem/2020-01-20-changement-ip.md
deleted file mode 100644
index 21856a9..0000000
--- a/op_guide/postmortem/2020-01-20-changement-ip.md
+++ /dev/null
@@ -1,45 +0,0 @@
-On January 20th, Free changed my IP, as happened more or less everywhere in France.
-This affects both the IPv4 address and the IPv6 prefix.
-Here is good old Bortzmoinsbien tweeting about it: https://twitter.com/bortzmeyer/status/1351434290916155394
-
-Max updated the IPv4 right away, but with a 4-hour TTL the propagation time is long.
-I reduced the TTL of the IP entries to 300 seconds, i.e. 5 minutes, the minimum at Gandi; we will see whether that is a good idea.
-The IPv6 entries remain to be updated; they are less critical for the front-facing services but are used internally for signaling...
-
-## The famous signaling
-This is a big problem with Nomad (and to a lesser extent with Consul).
-Indeed, Nomad uses IPv6 to communicate, so the IPs of all the nodes have to be changed.
-Problem! 
We cannot migrate node by node because, as the IPs change, the nodes would no longer be able to communicate with each other.
-We don't want to delete the cluster and create a new one, since that would mean redeploying everything, which also takes a long time (all the HCL files for Nomad, the whole KV store for Consul).
-Nor can we brute-force it by stopping the whole cluster, changing the IPs, and restarting.
-Well, actually Consul accepts that, but not Nomad, which will keep trying to talk to the old IPs and will never reach a consensus.
-
-While I was at it, I also renamed the nodes, because last time Nomad REALLY did not appreciate a node changing its IP while keeping the same name. That said, if we de facto use `peers.json`, it may not be a problem. To be tested.
-
-So, after much thought, the silver bullet is Nomad's outage recovery procedure (Consul has one too if needed).
-It is documented here: https://learn.hashicorp.com/tutorials/consul/recovery-outage
-Basically, you have to stop all the nodes.
-Then create a file at this path: `/var/lib/nomad/server/raft/peers.json`
-Don't be confused by the `peers.info` file next to it; it must not be touched.
-Then the big question is whether the cluster is running Raft v2 or Raft v3.
-Well, we were on Raft v2. If you get it wrong, Nomad will crash on restart with a nasty error:
-
-```
-nomad: failed to start Raft: error="recovery failed to parse peers.json: json: cannot unmarshal string into Go value of type raft.configEntry"
-```
-
-(I got it wrong, of course.)
-That's it; all that is left is to restart and follow the logs, looking for the line where it says it found the peers.json. 
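
For reference, Raft protocol v2 expects `peers.json` to be a plain JSON array of `address:port` strings, one per server (Raft v3 expects objects with `id`/`address`/`non_voter` fields). A minimal sketch that writes such a file and syntax-checks it before Nomad boots — the addresses are made up, the real path is `/var/lib/nomad/server/raft/peers.json`, and Nomad servers use RPC port 4647:

```shell
#!/bin/sh
set -eu

# Where to write the file; in production this would be
# /var/lib/nomad/server/raft (with Nomad stopped on every server).
DIR=${DIR:-/tmp/nomad-recovery}
mkdir -p "$DIR"

# Raft v2 format: a JSON array of "host:port" strings.
# IPv6 addresses are bracketed; these are example addresses only.
cat > "$DIR/peers.json" <<'EOF'
[
  "[2001:db8::1]:4647",
  "[2001:db8::2]:4647",
  "[2001:db8::3]:4647"
]
EOF

# Fail early on malformed JSON instead of letting Nomad crash at boot.
python3 -m json.tool "$DIR/peers.json" > /dev/null && echo "peers.json OK"
```

The same file (with the same content) must be placed on every server before restarting them.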
-
-## Things not to forget
-
- - Reconfigure traefik's KV backend (consider using DNS names instead)
- - Reconfigure the public IPv4 announced to Jitsi
-
-## What remains to be done
-
- - Update the IPv6 DNS entries, which should create:
-   - digitale.machine.deuxfleurs.fr
-   - datura.machine.deuxfleurs.fr
-   - drosera.machine.deuxfleurs.fr
- - Update the garage instance on io
diff --git a/op_guide/postmortem/2021-07-12-synapse-bdd-rempli-disque.md b/op_guide/postmortem/2021-07-12-synapse-bdd-rempli-disque.md
deleted file mode 100644
index 8514016..0000000
--- a/op_guide/postmortem/2021-07-12-synapse-bdd-rempli-disque.md
+++ /dev/null
@@ -1,14 +0,0 @@
-# The Synapse database fills up our disks
-
-Todo: finish this blog post and duplicate it here: https://quentin.dufour.io/blog/2021-07-12/chroniques-administration-synapse/
-
-The WAL growing without bound was also due to a failing SSD whose writes were abysmally slow.
-
-Actions taken:
- - Documented how to add space on a different disk with tablespaces
- - Forbade joining rooms with too high a complexity
- - Cleaned up the database by hand (empty rooms, unused accounts, etc.)
- - Replaced the failing SSD
-
-Actions still to take:
- - Use the database maintenance tools distributed by the Matrix project
diff --git a/op_guide/postmortem/2022-01-xx-glusterfs-corruption.md b/op_guide/postmortem/2022-01-xx-glusterfs-corruption.md
deleted file mode 100644
index 62694e6..0000000
--- a/op_guide/postmortem/2022-01-xx-glusterfs-corruption.md
+++ /dev/null
@@ -1,28 +0,0 @@
-# GlusterFS corruption
-
-After a server rebooted, the emails were no longer available.
-It turned out that GlusterFS had not been replicating the data correctly for quite some time.
-Because of this, it served corrupted Dovecot directories. 
-Dovecot rebuilt an index without the emails, which desynchronized people's mailboxes.
-In the end, some mailboxes lost all their emails.
-No backup of the emails was being made.
-The problem was created last summer when I reinstalled a server.
-I installed a different Debian version.
-The GlusterFS version was pinned in a sources.list, pointing at the gluster project's repository,
-but the pinning was for the previous Debian version.
-The sources.list was ignored, and the more recent gluster from the Debian project was installed instead.
-These versions were incompatible, but silently so.
-GlusterFS also does not proactively report that volumes are desynchronized.
-There is no command to know the state of the cluster.
-After several days of work, it was impossible for me to recover the emails.
-
-Actions taken:
- - Removed GlusterFS
- - Daily backups of the emails
- - The emails are now directly on disk (no high availability)
-
-Action being implemented:
- - Development of an IMAP server on top of Garage
-
-
diff --git a/op_guide/postmortem/petits-incidents.md b/op_guide/postmortem/petits-incidents.md
deleted file mode 100644
index fec5367..0000000
--- a/op_guide/postmortem/petits-incidents.md
+++ /dev/null
@@ -1,15 +0,0 @@
-- **2020** Publii wiped the hard drive of one of our members. He had changed the output directory to /home, which got erased
-
-- **2021-07-27** Power outage in Rennes - 40,000 people without electricity for a day - our production servers being in the affected area, deuxfleurs.fr went dark - https://www.francebleu.fr/infos/faits-divers-justice/rennes-plusieurs-quartiers-prives-d-electricite-1627354121
-
-- **2021-12:** Somewhat hasty migration attempt to Tricot, to replace Traefik which was causing trouble. Downtime and lack of communication about the causes, widespread confusion. 
-
-  *Actions to consider:* plan ahead any intervention likely to impact quality of service on the Deuxfleurs infra. Test as much as possible beforehand, to avoid having to test in production. When testing in production is unavoidable, organize so as to impact as few people as possible.
-
-- **2022-03-28:** Power outage at the Jupiter site; `io` does not come back up on its own. T has to power it on manually. `io` is unavailable for a few hours.
-
-  *Actions to consider:* reconfigure `io` to power on automatically when the power comes back.
-
-- **2022-03-28:** Grafana (hosted by M) is unavailable. M is the only one who can intervene.
-
-  *Actions to consider:* map the monitoring infrastructure and make sure several people have access to it.
diff --git a/op_guide/restic/README.md b/op_guide/restic/README.md
deleted file mode 100644
index f8fb658..0000000
--- a/op_guide/restic/README.md
+++ /dev/null
@@ -1,186 +0,0 @@
-Add the admin account as `deuxfleurs` to your `~/.mc/config.json` file
-
-You need to choose some names/identifiers:
-
-```bash
-export ENDPOINT="https://s3.garage.tld"
-export SERVICE_NAME="example"
-
-
-export BUCKET_NAME="backups-${SERVICE_NAME}"
-export NEW_ACCESS_KEY_ID="key-${SERVICE_NAME}"
-export NEW_SECRET_ACCESS_KEY=$(openssl rand -base64 32)
-export POLICY_NAME="policy-$BUCKET_NAME"
-```
-
-Create a new bucket:
-
-```bash
-mc mb deuxfleurs/$BUCKET_NAME
-```
-
-Create a new user:
-
-```bash
-mc admin user add deuxfleurs $NEW_ACCESS_KEY_ID $NEW_SECRET_ACCESS_KEY
-```
-
-Add this new user to your `~/.mc/config.json`; run this command first to generate the snippet to copy/paste:
-
-```
-cat > /dev/stdout < /tmp/policy.json < ctrl + v
-cd ~/.password-store/deuxfleurs/
-git pull ; git push
-cd -
-```
-
-Then initialize the restic repository from your machine:
-
-```
-restic init
-```
-
-*I am using restic version `restic 0.12.1 compiled with go1.16.9 on linux/amd64`*
-
-See your snapshots with:
-
-```
-restic 
snapshots
-```
-
-Also check these useful commands:
-
-```
-restic ls
-restic diff
-restic help
-```
-
----
-
-Add the secrets to Consul, near your service secrets.
-The idea is that the backup service is a component of the overall running service.
-You must run in `app/<service>/secrets/`:
-
-```bash
-echo "USER Backup AWS access key ID" > backup_aws_access_key_id
-echo "USER Backup AWS secret access key" > backup_aws_secret_access_key
-echo "USER Restic repository, eg. s3:https://s3.garage.tld" > backup_restic_repository
-echo "USER Restic password to encrypt backups" > backup_restic_password
-```
-
-Then run secretmgr:
-
-```bash
-# Spawning a nix shell is an easy way to get all the dependencies you need
-nix-shell
-
-# Check that secretmgr works for you
-python3 secretmgr.py check
-
-# Now interactively feed the secrets
-python3 secretmgr.py gen
-```
-
----
-
-Now we need a service that runs:
-
-```
-restic backup .
-```
-
-Find an existing .hcl declaration that uses restic in this repository or in the Deuxfleurs/nixcfg repository
-to use it as an example.
-
-We also need something that garbage-collects old snapshots.
-I propose:
-
-```
-restic forget --prune --keep-within 1m1d --keep-within-weekly 3m --keep-within-monthly 1y
-```
-
-Also try to restore a snapshot:
-
-```
-restic restore --target /tmp/$SERVICE_NAME
-```
diff --git a/op_guide/secrets/README.md b/op_guide/secrets/README.md
deleted file mode 100644
index e3687d1..0000000
--- a/op_guide/secrets/README.md
+++ /dev/null
@@ -1,166 +0,0 @@
-## You are new and want to access the secret repository
-
-You need a GPG key to start with.
-You can generate one with:
-
-```bash
-gpg2 --expert --full-gen-key
-# Personally I use `9) ECC and ECC`, `1) Curve 25519`, and `5y`
-```
-
-Now export your public key:
-
-```bash
-gpg2 --export --armor
-```
-
-You can upload it to Gitea; it will then be easily available to everyone. 
-For example, you can access my key at this URL:
-
-```
-https://git.deuxfleurs.fr/quentin.gpg
-```
-
-You can import it to your keychain as follows:
-
-```bash
-gpg2 --import <(curl https://git.deuxfleurs.fr/quentin.gpg)
-gpg2 --list-keys
-# pub   ed25519/0xE9602264D639FF68 2022-04-19 [SC] [expires: 2027-04-18]
-#       Key fingerprint = 8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68
-# uid           [ultimate] Quentin Dufour
-# sub   cv25519/0xA40574404FF72851 2022-04-19 [E] [expires: 2027-04-18]
-```
-
-How to read this snippet:
-  - the key id: `E9602264D639FF68`
-  - the key fingerprint: `8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68`
-
-Now, you need to:
- 1. Inform all other sysadmins that you have published your key
- 2. Check that the keys of the other sysadmins are the correct ones.
-
-To perform the check, you need another communication channel (ideally in person, otherwise over the phone, or Matrix if you already trusted the other person, etc.)
-
-Once you trust someone, sign their key:
-
-```bash
-gpg --edit-key quentin@deuxfleurs.fr
-# or
-gpg --edit-key E9602264D639FF68
-# gpg> lsign
-# (say yes)
-# gpg> save
-```
-
-Once you have signed everybody, ask a sysadmin to add your key to `/.gpg-id` and then run:
-
-```
-pass init -p deuxfleurs $(cat ~/.password-store/deuxfleurs/.gpg-id)
-cd ~/.password-store
-git commit
-git push
-```
-
-Now you are ready to install `pass`:
-
-```bash
-sudo apt-get install pass     # Debian + Ubuntu
-sudo yum install pass         # Fedora + RHEL
-sudo zypper in password-store # OpenSUSE
-sudo emerge -av pass          # Gentoo
-sudo pacman -S pass           # Arch Linux
-brew install pass             # macOS
-pkg install password-store    # FreeBSD
-```
-
-*Go to [passwordstore.org](https://www.passwordstore.org/) for more information about pass*. 
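
The out-of-band fingerprint check described above, done before running `lsign`, can be scripted. A minimal sketch: the first value is the fingerprint as printed locally by `gpg2 --fingerprint`, the second is what your teammate dictated over the separate channel (both values here are examples taken from the snippet above):

```shell
#!/bin/sh
# Compare a locally displayed fingerprint with one obtained out-of-band.
# Normalizing (dropping spaces, uppercasing) lets you paste either the
# spaced output of `gpg2 --fingerprint` or a compact hex string.
normalize() { printf '%s' "$1" | tr -d ' ' | tr 'abcdef' 'ABCDEF'; }

# As printed locally by gpg2:
local_fpr="8023 E27D F1BB D52C 559B 054C E960 2264 D639 FF68"
# As dictated by the teammate over the phone / in person:
dictated_fpr="8023E27DF1BBD52C559B054CE9602264D639FF68"

if [ "$(normalize "$local_fpr")" = "$(normalize "$dictated_fpr")" ]; then
    echo "fingerprints match, safe to lsign"
else
    echo "MISMATCH, do not sign" >&2
    exit 1
fi
```

Only after the script (or your eyes) confirms the match should you proceed with `gpg --edit-key ... lsign`.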
-
-Download the repository:
-
-```
-mkdir -p ~/.password-store
-cd ~/.password-store
-git clone git@git.deuxfleurs.fr:Deuxfleurs/secrets.git deuxfleurs
-```
-
-And then check that everything works:
-
-```bash
-pass show deuxfleurs
-```
-
----
-
----
-
-## init
-
-Generate a new password store named deuxfleurs for you:
-
-```
-pass init -p deuxfleurs you@example.com
-```
-
-Add a password in this store; it will be encrypted with your gpg key:
-
-```bash
-pass generate deuxfleurs/backup_nextcloud 20
-# or
-pass insert deuxfleurs/backup_nextcloud
-```
-
-## add a teammate
-
-Edit `~/.password-store/deuxfleurs/.gpg-id` and add the ids of your friends:
-
-```
-alice@example.com
-jane@example.com
-bob@example.com
-```
-
-Make sure that you trust the keys of your teammates:
-
-```
-$ gpg --edit-key jane@example.com
-gpg> lsign
-gpg> y
-gpg> save
-```
-
-Now re-encrypt the secrets:
-
-```
-pass init -p deuxfleurs $(cat ~/.password-store/deuxfleurs/.gpg-id)
-```
-
-They will now be able to decrypt the passwords:
-
-```
-pass deuxfleurs/backup_nextcloud
-```
-
-## sharing with git
-
-To create the repo:
-
-```bash
-cd ~/.password-store/deuxfleurs
-git init
-git add .
-git commit -m "Initial commit"
-# Set up remote
-git push
-```
-
-To set up the repo:
-
-```bash
-cd ~/.password-store
-git clone https://git.example.com/org/repo.git deuxfleurs
-```
-
-## Ref
-
-https://medium.com/@davidpiegza/using-pass-in-a-team-1aa7adf36592
diff --git a/op_guide/stolon/README.md b/op_guide/stolon/README.md
deleted file mode 100644
index 9e76b0e..0000000
--- a/op_guide/stolon/README.md
+++ /dev/null
@@ -1,3 +0,0 @@
- - [Initialize the cluster](install.md)
- - [Create a database](create_database.md)
- - [Manually backup all the databases](manual_backup.md)
diff --git a/op_guide/stolon/create_database.md b/op_guide/stolon/create_database.md
deleted file mode 100644
index 96999ef..0000000
--- a/op_guide/stolon/create_database.md
+++ /dev/null
@@ -1,26 +0,0 @@
-## 1. 
Create an LDAP user and assign a password for your service
-
-Go to guichet.deuxfleurs.fr
-
- 1. Everything takes place in `ou=services,ou=users,dc=deuxfleurs,dc=fr`
- 2. Create a new user, like `johny`
- 3. Generate a random password with `openssl rand -base64 32`
- 4. Hash it with `slappasswd`
- 5. Add a `userpassword` entry with the hash
-
-This step can also be done with the automated tool `secretmgr.py` in the app folder.
-
-## 2. Connect to postgres with the admin user
-
-```bash
-# 1. Launch the SSH tunnel given in the README
-# 2. Make sure you have the postgresql client installed locally
-psql -h localhost -U postgres -W postgres
-```
-
-## 3. Create the LDAP-bound user and the database in postgres
-
-```sql
-CREATE USER sogo;
-CREATE DATABASE sogodb WITH OWNER sogo ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;
-```
diff --git a/op_guide/stolon/install.md b/op_guide/stolon/install.md
deleted file mode 100644
index e4791ed..0000000
--- a/op_guide/stolon/install.md
+++ /dev/null
@@ -1,87 +0,0 @@
-Spawn the container:
-
-```bash
-docker run \
-  -ti --rm \
-  --name stolon-config \
-  --user root \
-  -v /var/lib/consul/pki/:/certs \
-  superboum/amd64_postgres:v11
-```
-
-
-Init with:
-
-```
-stolonctl \
-  --cluster-name chelidoine \
-  --store-backend=consul \
-  --store-endpoints https://consul.service.prod.consul:8501 \
-  --store-ca-file /certs/consul-ca.crt \
-  --store-cert-file /certs/consul2022-client.crt \
-  --store-key /certs/consul2022-client.key \
-  init \
-  '{ "initMode": "new",
-     "usePgrewind" : true,
-     "proxyTimeout" : "120s",
-     "pgHBA": [
-       "host all postgres all md5",
-       "host replication replicator all md5",
-       "host all all all ldap ldapserver=bottin.service.prod.consul ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"\" ldapbindpasswd=\"\" ldapsearchattribute=\"cn\""
-     ]
-  }'
-
-```
-
-Then set appropriate permissions on the host:
-
-```
-mkdir -p /mnt/{ssd,storage}/postgres/
-chown -R 999:999 /mnt/{ssd,storage}/postgres/
-```
-
-(999 is the id of the postgres user used in the Docker image, hence the `chown` above)
-It might be improved by staying root, then chmoding in an entrypoint, and finally switching to user 999 before executing the user's command.
-Moreover, it would enable the use of user namespaces that shift the UIDs.
-
-
-
-## Upgrading the cluster
-
-To retrieve the current stolon config:
-
-```
-stolonctl spec --cluster-name chelidoine --store-backend consul --store-ca-file ... --store-cert-file ... --store-endpoints https://consul.service.prod.consul:8501
-```
-
-The important part for the LDAP:
-
-```
-{
-  "pgHBA": [
-    "host all postgres all md5",
-    "host replication replicator all md5",
-    "host all all all ldap ldapserver=bottin.service.2.cluster.deuxfleurs.fr ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"cn=admin,dc=deuxfleurs,dc=fr\" ldapbindpasswd=\"\" ldapsearchattribute=\"cn\""
-  ]
-}
-```
-
-Once a patch is written:
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch -f /tmp/patch.json
-```
-
-## Log
-
-- 2020-12-18 Activate pg\_rewind in stolon
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "usePgrewind" : true }'
-```
-
-- 2021-03-14 Increase proxy timeout to cope with consul latency spikes
-
-```
-stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "proxyTimeout" : "120s" }'
-```
diff --git a/op_guide/stolon/manual_backup.md b/op_guide/stolon/manual_backup.md
deleted file mode 100644
index 654d789..0000000
--- a/op_guide/stolon/manual_backup.md
+++ /dev/null
@@ -1,305 +0,0 @@
-## Disclaimer
-
-Do **NOT** use the following backup methods on the Stolon cluster:
- 1. copying the data directory
- 2. `pg_dump`
- 3. `pg_dumpall`
-
-The first one will lead to corrupted/inconsistent files. 
-The second and third ones put too much pressure on the cluster.
-Basically, you will destroy it, in the following ways:
- - Load will increase, requests will time out
- - RAM usage will increase until the daemon is OOM (Out Of Memory) killed by Linux
- - Potentially, the WAL will grow a lot
-
-
-## A binary backup with `pg_basebackup`
-
-The only acceptable solution is `pg_basebackup` with **some throttling configured**.
-Later, if you want a SQL dump, you can inject this binary backup into an ephemeral database spawned solely for this purpose on a non-production machine.
-
-Start by fetching from Consul the identifiers of the replication account.
-Do not use the root account set up in Stolon, it will not work.
-
-First, set up an SSH tunnel on your machine that binds postgresql, e.g.:
-
-```bash
-ssh -L 5432:psql-proxy.service.2.cluster.deuxfleurs.fr:5432 ...
-```
-
-*Later, we will use `/tmp/sql` as our working directory. Depending on your distribution, this
-folder may be a `tmpfs` and thus mounted in RAM. If that is the case, choose another folder that is not a `tmpfs`, otherwise you will fill your RAM
-and fail your backup. I am using NixOS and the `/tmp` folder is a regular folder, persisted on disk, which explains why I am using it.*
-
-Then export your password in `PGPASSWORD` and launch the backup:
-
-```bash
-export PGPASSWORD=xxx
-
-mkdir -p /tmp/sql
-cd /tmp/sql
-
-pg_basebackup \
-  --host=127.0.0.1 \
-  --username=replicator \
-  --pgdata=/tmp/sql \
-  --format=tar \
-  --wal-method=stream \
-  --gzip \
-  --compress=6 \
-  --progress \
-  --max-rate=5M
-```
-
-*Something you should know: while it seems optional, fetching the WAL is mandatory. At first, I thought it was a way to have a "more recent backup".
-But after some reading, it appears that the base backup alone is corrupted, because it is not a snapshot at all, but a copy of the postgres folder with no specific state. 
-The whole point of the WAL is, in fact, to fix this corrupted archive...*
-
-*Take a cup of coffee, it will take some time...*
-
-The result I get (the important file is `base.tar.gz`; `41921.tar.gz` will probably be missing for you, as it is a secondary tablespace that I will deactivate soon):
-
-```
-[nix-shell:/tmp/sql]$ ls
-41921.tar.gz  backup_manifest  base.tar.gz  pg_wal.tar.gz
-```
-
-From now on, disconnect from production to continue your work.
-You don't need it anymore, and it will prevent a disaster should you botch a command.
-
-
-## Importing the backup
-
-> The backup taken with `pg_basebackup` is an exact copy of your data directory so, all you need to do to restore from that backup is to point postgres at that directory and start it up.
-
-```bash
-mkdir -p /tmp/sql/pg_data && cd /tmp/sql/pg_data
-tar xzfv ../base.tar.gz
-```
-
-Now you should have something like this:
-
-```
-[nix-shell:/tmp/sql/pg_data]$ ls
-backup_label      base    pg_commit_ts  pg_hba.conf    pg_logical    pg_notify    pg_serial     pg_stat      pg_subtrans  pg_twophase  pg_wal   postgresql.conf
-backup_label.old  global  pg_dynshmem   pg_ident.conf  pg_multixact  pg_replslot  pg_snapshots  pg_stat_tmp  pg_tblspc    PG_VERSION   pg_xact  stolon-temp-postgresql.conf  tablespace_map
-```
-
-Now we will extract the WAL:
-
-```bash
-mkdir -p /tmp/sql/wal && cd /tmp/sql/wal
-tar xzfv ../pg_wal.tar.gz
-```
-
-You should have something like this:
-
-```
-[nix-shell:/tmp/sql/wal]$ ls
-00000003000014AF000000C9  00000003000014AF000000CA  00000003.history  archive_status
-```
-
-Before restoring our backup, we want to check it:
-
-```bash
-cd /tmp/sql/pg_data
-cp ../backup_manifest .
-# We do not verify the WAL because it does not seem to work well,
-# see the capdata.fr reference at the bottom
-# pg_verifybackup -w ../wal .
-pg_verifybackup -n . 
-
-```
-
-Now, we must edit/read some files before launching our ephemeral server:
- - Set `listen_addresses = '0.0.0.0'` in `postgresql.conf`
- - Add `restore_command = 'cp /mnt/wal/%f %p'` in `postgresql.conf`
- - Check `port` in `postgresql.conf`; in our case it is `5433`.
- - Create an empty file named `recovery.signal`
-
-*Do not create a `recovery.conf` file. It may still be documented on the internet, but it is a deprecated method and your postgres daemon will refuse to boot if it finds one.*
-
-*Currently, we use port 5433 in our postgresql configuration despite 5432 being the default port. Indeed, in production, clients access the cluster transparently through the Stolon proxy, which listens on port 5432 and redirects the requests to the correct PostgreSQL instance, itself listening on port 5433! To export our binary backup as text, we will query our postgres instance directly without passing through the proxy, which is why you must note this port.*
-
-Now we will start our postgres container on our machine.
-
-At the time of writing, the live version is `superboum/amd64_postgres:v9`.
-We must start by getting the `postgres` user id. 
Our containers run with this user by default, so you only need to run:
-
-```bash
-docker run --rm -it superboum/amd64_postgres:v9 id
-```
-
-And we get:
-
-```
-uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)
-```
-
-Now `chown` your `pg_data`:
-
-```bash
-chown 999:999 -R /tmp/sql/{pg_data,wal}
-chmod 700 -R /tmp/sql/{pg_data,wal}
-```
-
-And finally:
-
-```
-docker run \
-  --rm \
-  -it \
-  -p 5433:5433 \
-  -v /tmp/sql/:/mnt/ \
-  superboum/amd64_postgres:v9 \
-  postgres -D /mnt/pg_data
-```
-
-I get the following output:
-
-```
-2022-01-28 14:46:39.750 GMT [1] LOG:  skipping missing configuration file "/mnt/pg_data/postgresql.auto.conf"
-2022-01-28 14:46:39.763 UTC [1] LOG:  starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
-2022-01-28 14:46:39.764 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5433
-2022-01-28 14:46:39.767 UTC [1] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5433"
-2022-01-28 14:46:39.773 UTC [7] LOG:  database system was interrupted; last known up at 2022-01-28 14:33:13 UTC
-cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
-2022-01-28 14:46:40.318 UTC [7] LOG:  starting archive recovery
-2022-01-28 14:46:40.321 UTC [7] LOG:  restored log file "00000003.history" from archive
-2022-01-28 14:46:40.336 UTC [7] LOG:  restored log file "00000003000014AF000000C9" from archive
-2022-01-28 14:46:41.426 UTC [7] LOG:  could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:41.445 UTC [7] LOG:  could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:41.457 UTC [7] LOG:  redo starts at 14AF/C9000028
-2022-01-28 14:46:41.500 UTC [7] LOG:  restored log file "00000003000014AF000000CA" from archive
-2022-01-28 14:46:42.461 UTC [7] LOG:  consistent recovery state reached at 14AF/CA369AB0
-2022-01-28 14:46:42.461 UTC [1] LOG:  database system is ready to 
accept read only connections
-cp: cannot stat '/mnt/wal/00000003000014AF000000CB': No such file or directory
-2022-01-28 14:46:42.463 UTC [7] LOG:  redo done at 14AF/CA369AB0
-2022-01-28 14:46:42.463 UTC [7] LOG:  last completed transaction was at log time 2022-01-28 14:35:04.698438+00
-2022-01-28 14:46:42.480 UTC [7] LOG:  could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
-2022-01-28 14:46:42.493 UTC [7] LOG:  restored log file "00000003000014AF000000CA" from archive
-cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
-2022-01-28 14:46:43.462 UTC [7] LOG:  selected new timeline ID: 4
-2022-01-28 14:46:44.441 UTC [7] LOG:  archive recovery complete
-2022-01-28 14:46:44.444 UTC [7] LOG:  restored log file "00000003.history" from archive
-2022-01-28 14:46:45.614 UTC [1] LOG:  database system is ready to accept connections
-```
-
-*Notes: the missing tablespace is a legacy tablespace used in the past to debug Matrix. It will be removed soon; we can safely ignore it. The other errors on `cp` seem to be expected, as postgres probably wants to know how far it can rewind with the WAL, but I am not 100% sure.*
-
-Your ephemeral instance should work:
-
-```bash
-export PGPASSWORD=xxx # your postgres (admin) account password
-
-psql -h 127.0.0.1 -p 5433 -U postgres postgres
-```
-
-And your databases should appear:
-
-```
-[nix-shell:~/Documents/dev/infrastructure]$ psql -h 127.0.0.1 -p 5433 -U postgres postgres
-psql (13.5, server 13.3 (Debian 13.3-1.pgdg100+1))
-Type "help" for help. 
- -postgres=# \l - List of databases - Name | Owner | Encoding | Collate | Ctype | Access privileges ------------+----------+----------+------------+------------+----------------------- - xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 | - xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 | - xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 | - postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 | - xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 | - xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 | - template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres + - | | | | | postgres=CTc/postgres - template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres + - | | | | | postgres=CTc/postgres -(8 rows) -``` - -## Dump your ephemeral database as SQL - -Now we can do a SQL export of our ephemeral database. -We use zstd to automatically compress the outputed file. -We use multiple parameters: - - `-vv` gives use some idea on the progress - - `-9` is a quite high value and should compress efficiently. Decrease it if your machine is low powered - - `-T0` asks zstd to use all your cores. By default, zstd uses only one core. - -```bash -pg_dumpall -h 127.0.0.1 -p 5433 -U postgres \ - | zstd -vv -9 -T0 --format=zstd > dump-`date --rfc-3339=seconds | sed 's/ /T/'`.sql.zstd -``` - -I get the following result: - -``` -[nix-shell:/tmp/sql]$ ls -lah dump* --rw-r--r-- 1 quentin users 749M janv. 28 16:07 dump-2022-01-28T16:06:29+01:00.sql.zstd -``` - -Now you can stop your ephemeral server. 
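Before restoring anywhere, it is worth a quick sanity check that the compressed dump is intact and actually contains SQL. A minimal sketch, using a tiny stand-in file (`dump-demo.sql`) since the real timestamped dump name varies:

```shell
# Create a tiny stand-in dump so the check can be demonstrated end to end
# (in real use, point these commands at your dump-*.sql.zstd file instead).
printf 'CREATE ROLE example;\nSELECT 1;\n' > dump-demo.sql
zstd -q -9 dump-demo.sql -o dump-demo.sql.zstd

# `zstd -t` decompresses in memory and verifies checksums without writing anything.
zstd -t dump-demo.sql.zstd && echo "archive OK"

# Peek at the first lines to confirm the archive really holds SQL.
zstdcat dump-demo.sql.zstd | head -n 2
```

`zstd -t` exits non-zero on a corrupted archive, so it can also gate the rest of a backup script.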
-
-## Restore your SQL file
-
-First, start a blank server:
-
-```bash
-docker run \
-  --rm -it \
-  --name postgres \
-  -p 5433:5432 \
-  superboum/amd64_postgres:v9 \
-  bash -c '
-    set -ex
-    mkdir /tmp/psql
-    initdb -D /tmp/psql --no-locale --encoding=UTF8
-    echo "host all postgres 0.0.0.0/0 md5" >> /tmp/psql/pg_hba.conf
-    postgres -D /tmp/psql
-  '
-```
-
-Then set the same password as in production for the `postgres` user (it will be required as part of the restore):
-
-```bash
-docker exec -ti postgres bash -c "echo \"ALTER USER postgres WITH PASSWORD '$PGPASSWORD';\" | psql"
-echo '\l' | psql -h 127.0.0.1 -p 5433 -U postgres postgres
-# the listing should contain no databases (except `postgres`, `template0` and `template1`), otherwise ABORT EVERYTHING, YOU ARE ON THE WRONG DB
-```
-
-And finally, restore your SQL backup:
-
-```bash
-zstdcat -vv dump-* | \
-  grep -P -v '^(CREATE|DROP) ROLE postgres;' | \
-  psql -h 127.0.0.1 -p 5433 -U postgres --set ON_ERROR_STOP=on postgres
-```
-
-*Note: we must skip `CREATE ROLE postgres` and `DROP ROLE postgres` during the restore, as the role already exists and would generate an error.
-Because we want to be extra careful, we specifically asked psql to stop on every error, and we do not want to change this behavior.
-So, instead, we simply remove any line matching the regex given in the previous command.*
-
-Check that the backup has been correctly restored.
-For example:
-
-```bash
-docker exec -ti postgres psql
-# then type "\l", "\c db-name", "select ..."
-```
-
-## Finally, store it safely
-
-```bash
-rsync --progress -av /tmp/sql/{*.tar.gz,backup_manifest,dump-*} backup/target
-```
-
-## Ref
-
- - https://philipmcclarence.com/backing-up-and-restoring-postgres-using-pg_basebackup/
- - https://www.cybertec-postgresql.com/en/pg_basebackup-creating-self-sufficient-backups/
- - https://www.postgresql.org/docs/14/continuous-archiving.html
- - https://www.postgresql.org/docs/14/backup-dump.html#BACKUP-DUMP-RESTORE
- - https://dba.stackexchange.com/questions/75033/how-to-restore-everything-including-postgres-role-from-pg-dumpall-backup
- - https://blog.capdata.fr/index.php/postgresql-13-les-nouveautes-interessantes/
diff --git a/op_guide/stolon/nomad_full_backup.md b/op_guide/stolon/nomad_full_backup.md
deleted file mode 100644
index 2fb5822..0000000
--- a/op_guide/stolon/nomad_full_backup.md
+++ /dev/null
@@ -1,26 +0,0 @@
-Start by following ../restic
-
-## Garbage collect old backups
-
-```
-mc ilm import deuxfleurs/${BUCKET_NAME} < traefik.gzip
-gunzip -c traefik.gzip > traefik.json
-cat traefik.json | jq '.DomainsCertificate.Certs[] | .Certificate.Domain, .Domains.Main'
-# "alps.deuxfleurs.fr"
-# "alps.deuxfleurs.fr"
-# "cloud.deuxfleurs.fr"
-# "cloud.deuxfleurs.fr"
-# each domain name must appear twice in a row; otherwise, fix it as follows
-cat traefik.json | jq > traefik-new.json
-vim traefik-new.json
-# remove the corrupted certificates; traefik will renew them automatically at startup
-gzip -c traefik-new.json > traefik-new.gzip
-curl --request PUT --data-binary @traefik-new.gzip http://127.0.0.1:8500/v1/kv/traefik/acme/account/object
-```
diff --git a/op_guide/update_matrix/README.md b/op_guide/update_matrix/README.md
deleted file mode 100644
index e330277..0000000
--- a/op_guide/update_matrix/README.md
+++ /dev/null
@@ -1,93 +0,0 @@
-How to update Matrix?
-=====================
-
-## 1. Build the new containers
-
-Often, I update Riot Web and Synapse at the same time.
-
-
-* Open `app/docker-compose.yml` and locate `riot` (the Element Web service) and `synapse` (the Matrix Synapse server). There are two things you need to do for each service:
-
-  * Set the `VERSION` argument to the target service version (e.g. `1.26.0` for Synapse). This argument is then used to template the Dockerfile.
-
-    The `VERSION` value should match a GitHub release; a link to the corresponding release page is given as a comment next to the variable in the compose file;
-
-  * Tag the image with a new, incremented version tag. For example, `superboum/amd64_riotweb:v17` will become `superboum/amd64_riotweb:v18`.
-
-    We use the Docker Hub to store our images, so if you are not `superboum` you must replace the name with your own handle, e.g. `john/amd64_riotweb:v18`. This requires that you have registered an account (named `john`) on https://hub.docker.com.
-
-
-So, from now on, we expect you to have:
-
-* changed the `VERSION` value and `image` name/tag of `riot`
-* changed the `VERSION` value and `image` name/tag of `synapse`
-
-From the `/app` folder, you can now simply build the new images:
-
-```bash
-docker-compose build riot synapse
-```
-
-And then send them to the Docker Hub:
-
-```
-docker-compose push riot synapse
-```
-
-Don't forget to commit and push your changes before doing anything else!
-
-## 2. Deploy the new containers
-
-Now we will edit the deployment file `app/im/deploy/im.hcl`.
-
-Find where the image is defined in the file; for example, Element Web will look like this:
-
-```hcl
-  group "riotweb" {
-    count = 1
-
-    task "server" {
-      driver = "docker"
-      config {
-        image = "superboum/amd64_riotweb:v17"
-        port_map {
-          web_port = 8043
-        }
-```
-
-Replace the `image =` entry with the new version created above.
-Do the same thing for the `synapse` service.
-
-Now you need a way to access the cluster to deploy this file.
-To do this, you must bind Nomad to your machine through an SSH tunnel.
-Check the end of [the parent `README.md`](../README.md) to see how to do it.
-If you can access the Nomad web UI at http://127.0.0.1:4646,
-you are ready to go.
-
-You must have the Nomad command-line tool installed on your machine (also explained in [the parent `README.md`](../README.md)).
-
-Now, on your machine and from the `app/im/deploy` folder, you should be able to run:
-
-```
-nomad plan im.hcl
-```
-
-Check that the proposed diff corresponds to what you have in mind.
-If it seems OK, just copy-paste the `nomad job run ... im.hcl` command proposed as part of the output of the `nomad plan` command.
-
-From there, it will take around two minutes to deploy the new images.
-You can follow the deployment from the Nomad UI.
-Bear in mind that, once the deployment is done on Nomad, you may still need to wait a few minutes for Traefik to refresh its configuration.
-
-If everything worked as intended, you can commit and push your deployment file.
-
-If something went wrong, you must roll back your deployment:
-
- 1. First, find a working deployment with [nomad job history](https://www.nomadproject.io/docs/commands/job/history)
- 2. Revert to this deployment with [nomad job revert](https://www.nomadproject.io/docs/commands/job/revert)
-
-Now, if the deployment failed, you should probably investigate what went wrong offline.
-I built a test stack with docker-compose in `app//integration` that should help you out (for now, test suites are only written for plume and jitsi).
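As a quick safeguard before running `nomad plan`, you can check which image tag the job file actually references and compare it with the tag you just pushed. A minimal shell sketch, using a hypothetical `im.hcl` fragment (the file contents and the `v18` tag below are stand-ins, not the real deployment file):

```shell
# Hypothetical fragment of the deployment file, for illustration only.
cat > im.hcl <<'EOF'
config {
  image = "superboum/amd64_riotweb:v18"
}
EOF

# Extract the tag currently referenced by the job file so it can be
# compared against the tag that was just pushed to the Docker Hub.
TAG=$(grep -oE 'amd64_riotweb:v[0-9]+' im.hcl | cut -d: -f2)
echo "im.hcl references riotweb tag: $TAG"
```

If the printed tag is still the old one, you forgot to update `im.hcl` before deploying.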