General update #18
16 changed files with 457 additions and 7 deletions
6
content/operations/courantes/_index.md
Normal file
@@ -0,0 +1,6 @@
+++
title = "Opérations courantes"
description = "Opérations courantes"
weight = 15
sort_by = "weight"
+++
52
content/operations/courantes/plume.md
Normal file
@@ -0,0 +1,52 @@
+++
title = "Plume"
description = "Plume"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 11
+++

## Creating a new Plume user

1. Bind Nomad to your machine over SSH (see the README file at the root of this repo)
2. Go to http://127.0.0.1:4646
3. Select `plume` -> click the `exec` button (top right)
4. Select `plume` in the left panel
5. Press `enter` to get a bash shell
6. Run:

```bash
plm users new \
  --username alice \
  --display-name Alice \
  --bio "Just an internet user" \
  --email alice@example.com \
  --password s3cr3t
```

That's all folks! You can now use your new account at https://plume.deuxfleurs.fr

## Bug and debug

If you can't follow a new user and get this error:

```
2022-04-23T19:26:12.639285Z WARN plume::routes::errors: Db(DatabaseError(UniqueViolation, "duplicate key value violates unique constraint \"follows_unique_ap_url\""))
```

you might have a row with an empty `ap_url` field in your database:

```
plume=> select * from follows where ap_url='';
  id  | follower_id | following_id | ap_url
------+-------------+--------------+--------
 2118 |          20 |          238 |
(1 row)
```

Simply set the `ap_url` as follows:

```
plume=> update follows set ap_url='https://plume.deuxfleurs.fr/follows/2118' where id=2118;
UPDATE 1
```
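The repaired `ap_url` above follows a fixed pattern: the instance URL plus `/follows/<row id>`. As a small illustration (this helper is hypothetical, not part of the Plume tooling), the fix-up statement can be generated for any row id and piped into `psql`:

```bash
# Hypothetical helper: print the UPDATE statement that repairs the empty
# ap_url of a `follows` row, given its id. The URL mirrors the row id.
follow_fix_sql() {
  printf "update follows set ap_url='https://plume.deuxfleurs.fr/follows/%s' where id=%s;\n" "$1" "$1"
}

follow_fix_sql 2118
```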
@@ -2,6 +2,7 @@
 title = "Applications"
 description = "Déploiement d'une application"
 sort_by = "weight"
+date = 2022-12-22
 weight = 30
 +++
34
content/operations/deploiement/app/create_database.md
Normal file
@@ -0,0 +1,34 @@
+++
title = "Créer une BDD"
description = "Création d'une base de données pour une nouvelle application"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 11
+++

## 1. Create an LDAP user and assign it a password for your service

Go to guichet.deuxfleurs.fr

1. Everything takes place in `ou=services,ou=users,dc=deuxfleurs,dc=fr`
2. Create a new user, like `johny`
3. Generate a random password with `openssl rand -base64 32`
4. Hash it with `slappasswd`
5. Add a `userpassword` entry with the hash

This step can also be done with the automated tool `secretmgr.py` in the app folder.
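If `slappasswd` is not installed, steps 3 and 4 can be reproduced with openssl alone: the `{SSHA}` scheme slappasswd emits is `base64(SHA1(password || salt) || salt)` with a random salt. A sketch (the temp file paths are arbitrary):

```bash
# Step 3: generate a random password.
PASSWORD=$(openssl rand -base64 32)

# Step 4: hash it in the {SSHA} scheme used by slappasswd:
# base64( SHA1(password || salt) || salt ) with a 4-byte random salt.
openssl rand 4 > /tmp/salt.bin
{ printf '%s' "$PASSWORD"; cat /tmp/salt.bin; } | openssl dgst -sha1 -binary > /tmp/digest.bin
HASH="{SSHA}$(cat /tmp/digest.bin /tmp/salt.bin | base64)"
echo "$HASH"
```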
## 2. Connect to postgres with the admin user

```bash
# 1. Launch the SSH tunnel given in the README
# 2. Make sure you have a PostgreSQL client installed locally
psql -h localhost -U postgres -W postgres
```

## 3. Create the LDAP-bound user and the database in postgres

```sql
CREATE USER sogo;
CREATE DATABASE sogodb WITH OWNER sogo ENCODING 'utf8' LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;
```
@@ -2,6 +2,8 @@
 title = "Grappe"
 description = "Grappe"
 weight = 20
+date = 2022-12-22
+sort_by = "weight"
 +++

 # Installation
95
content/operations/deploiement/grappe/stolon.md
Normal file
@@ -0,0 +1,95 @@
+++
title = "Stolon"
description = "Comment déployer Stolon"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 11
+++

Spawn the container:

```bash
docker run \
  -ti --rm \
  --name stolon-config \
  --user root \
  -v /var/lib/consul/pki/:/certs \
  superboum/amd64_postgres:v11
```

Init with:

```bash
stolonctl \
  --cluster-name chelidoine \
  --store-backend=consul \
  --store-endpoints https://consul.service.prod.consul:8501 \
  --store-ca-file /certs/consul-ca.crt \
  --store-cert-file /certs/consul2022-client.crt \
  --store-key /certs/consul2022-client.key \
  init \
  '{ "initMode": "new",
     "usePgrewind": true,
     "proxyTimeout": "120s",
     "pgHBA": [
       "host all postgres all md5",
       "host replication replicator all md5",
       "host all all all ldap ldapserver=bottin.service.prod.consul ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"<bind_dn>\" ldapbindpasswd=\"<bind_pwd>\" ldapsearchattribute=\"cn\""
     ]
   }'
```
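The inline JSON spec is easy to get wrong inside shell quoting. One option (a sketch; the file path is arbitrary, and the LDAP pgHBA line is omitted here for brevity) is to keep the spec in a file, validate it first, then pass it to `stolonctl ... init` via command substitution:

```bash
# Keep the cluster spec in a file instead of an inline shell string.
cat > /tmp/stolon-spec.json <<'EOF'
{
  "initMode": "new",
  "usePgrewind": true,
  "proxyTimeout": "120s",
  "pgHBA": [
    "host all postgres all md5",
    "host replication replicator all md5"
  ]
}
EOF

# Catch JSON syntax errors before handing the spec to stolonctl.
python3 -m json.tool < /tmp/stolon-spec.json

# Then: stolonctl ... init "$(cat /tmp/stolon-spec.json)"
```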
Then set the appropriate permissions on the host:

```
mkdir -p /mnt/{ssd,storage}/postgres/
chown -R 999:999 /mnt/{ssd,storage}/postgres/
```

(999 is the uid of the postgres user inside the Docker image.)

This could be improved by staying root, chowning in an entrypoint, and only then switching to the postgres user before executing the user's command.
That would also enable the use of user namespaces, which shift the UIDs.

## Upgrading the cluster

To retrieve the current stolon config:

```
stolonctl spec --cluster-name chelidoine --store-backend consul --store-ca-file ... --store-cert-file ... --store-endpoints https://consul.service.prod.consul:8501
```

The important part for the LDAP:

```
{
  "pgHBA": [
    "host all postgres all md5",
    "host replication replicator all md5",
    "host all all all ldap ldapserver=bottin.service.2.cluster.deuxfleurs.fr ldapbasedn=\"ou=users,dc=deuxfleurs,dc=fr\" ldapbinddn=\"cn=admin,dc=deuxfleurs,dc=fr\" ldapbindpasswd=\"<REDACTED>\" ldapsearchattribute=\"cn\""
  ]
}
```

Once a patch is written:

```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch -f /tmp/patch.json
```

## Log

- 2020-12-18 Activate pg\_rewind in stolon

```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "usePgrewind" : true }'
```

- 2021-03-14 Increase proxy timeout to cope with consul latency spikes

```
stolonctl --cluster-name pissenlit --store-backend consul --store-endpoints http://consul.service.2.cluster.deuxfleurs.fr:8500 update --patch '{ "proxyTimeout" : "120s" }'
```
101
content/operations/maintien_en_condition/matrix.md
Normal file
@@ -0,0 +1,101 @@
+++
title = "MàJ Matrix"
description = "Mise à jour de Matrix (Synapse/Element)"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 11
+++

# How to update Matrix?

## 1. Build the new containers

Often, I update Riot Web and Synapse at the same time.

* Open `app/docker-compose.yml` and locate `riot` (the Element Web service) and `synapse` (the Matrix Synapse server). There are two things you need to do for each service:

  * Set the `VERSION` argument to the target service version (e.g. `1.26.0` for Synapse). This argument is then used to template the Dockerfile.

    The `VERSION` value should match a GitHub release; a link to the corresponding release page is kept as a comment next to the variable in the compose file.

  * Tag the image with a new, incremented version tag. For example, `superboum/amd64_riotweb:v17` becomes `superboum/amd64_riotweb:v18`.

    We use the Docker Hub to store our images. So, if you are not `superboum`, you must change the name to your own handle, e.g. `john/amd64_riotweb:v18`. This requires that you registered an account (named `john`) on https://hub.docker.com.

From now on, we expect you have:

* changed the `VERSION` value and `image` name/tag of `riot`
* changed the `VERSION` value and `image` name/tag of `synapse`
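The two edits per service are mechanical, so they can also be scripted. A hypothetical sketch (the sample compose file, the `amd64_synapse:v52` tag, and the path are all illustrative, not taken from the repo):

```bash
# Illustrative compose excerpt standing in for app/docker-compose.yml.
cat > /tmp/docker-compose.sample.yml <<'EOF'
  riot:
    image: "superboum/amd64_riotweb:v17"
  synapse:
    image: "superboum/amd64_synapse:v52"
EOF

# Bump the riot image tag: v17 -> v18 (GNU sed in-place edit).
sed -i 's|amd64_riotweb:v17|amd64_riotweb:v18|' /tmp/docker-compose.sample.yml
grep riotweb /tmp/docker-compose.sample.yml
```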
From the `/app` folder, you can now build the new images:

```bash
docker-compose build riot synapse
```

And then send them to the Docker Hub:

```
docker-compose push riot synapse
```

Don't forget to commit and push your changes before doing anything else!

## 2. Deploy the new containers

Now we will edit the deployment file `app/im/deploy/im.hcl`.

Find where the image is defined in the file; for example, Element Web will look like this:

```hcl
group "riotweb" {
  count = 1

  task "server" {
    driver = "docker"
    config {
      image = "superboum/amd64_riotweb:v17"
      port_map {
        web_port = 8043
      }
```

And replace the `image =` entry with the new version created above.
Do the same thing for the `synapse` service.

Now you need a way to access the cluster to deploy this file.
To do this, you must bind Nomad to your machine through an SSH tunnel.
Check the end of [the parent `README.md`](../README.md) to do it.
If you can access the Nomad web UI at http://127.0.0.1:4646, you are ready to go.

You must also have the Nomad command line tool installed on your machine (also explained in [the parent `README.md`](../README.md)).

Now, on your machine and from the `app/im/deploy` folder, you should be able to run:

```
nomad plan im.hcl
```

Check that the proposed diff corresponds to what you have in mind.
If it seems OK, just copy-paste the `nomad job run ... im.hcl` command proposed as part of the output of the `nomad plan` command.

From there, it will take around 2 minutes to deploy the new images.
You can follow the deployment from the Nomad UI.
Bear in mind that, once the deployment is done on Nomad, you may still need to wait a few minutes for Traefik to refresh its configuration.

If everything worked as intended, you can commit and push your deployment file.

If something went wrong, you must roll back your deployment.

1. First, find a working deployment with [nomad job history](https://www.nomadproject.io/docs/commands/job/history)
2. Revert to this deployment with [nomad job revert](https://www.nomadproject.io/docs/commands/job/revert)

Now, if the deployment failed, you should probably investigate what went wrong offline.
I built a test stack with docker-compose in `app/<service>/integration` that should help you out (for now, test suites are only written for plume and jitsi).
53
content/operations/pannes/2020-01-20-changement-ip.md
Normal file
@@ -0,0 +1,53 @@
+++
title = "Janvier 2020"
description = "Janvier 2020: changement imprévu d'adresses IP"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 10
+++

On January 20th, Free changed my IP, much like everywhere else in France.
This concerns both the IPv4 address and the IPv6 prefix.
Here is good old Bortzmoinsbien tweeting about it: https://twitter.com/bortzmeyer/status/1351434290916155394

Max updated the IPv4 right away, but with a TTL of 4 hours the propagation time is long.
I reduced the IP entries to 300 seconds, i.e. 5 minutes, the minimum at Gandi; we will see whether that is a good idea.
The IPv6 addresses remain to be updated; they are less critical for the front-facing services, but they are used for signaling internally...

## The infamous signaling

This is a big problem with Nomad (and, to a lesser extent, with Consul).
Indeed, Nomad uses IPv6 to communicate, so the IPs of all the nodes have to change.
Problem! We cannot migrate node by node because, having changed IP, the nodes would no longer be able to communicate.
We do not want to delete the cluster and create a new one, because that would mean redeploying everything, which is also long (all the HCL files for Nomad, the whole KV store for Consul).
Nor can we do it the brutal way: stop the whole cluster, change the IPs, then restart.
Well, actually Consul accepts that, but not Nomad, which will keep trying to contact the old IPs and will never reach a consensus.

While I was at it, I also renamed the nodes, because last time Nomad did NOT appreciate AT ALL that a node with the same name changed its IP. That said, if we de facto use `peers.json`, it may not be a problem. To be tested.

So, after much reflection, the silver bullet is Nomad's outage recovery feature (Consul has it too if needed).
It is described here: https://learn.hashicorp.com/tutorials/consul/recovery-outage
In short, you have to stop all the nodes.
Then create a file at this path: `/var/lib/nomad/server/raft/peers.json`
Do not be confused by the `peers.info` file next to it; leave it alone.
Then the big question is whether the cluster runs Raft v2 or Raft v3.
Well, we were on Raft v2. If you get it wrong, Nomad crashes on restart with a nasty error:

```
nomad: failed to start Raft: error="recovery failed to parse peers.json: json: cannot unmarshal string into Go value of type raft.configEntry"
```
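For reference, per the HashiCorp outage-recovery guide, a Raft v2 `peers.json` is a plain JSON array of `address:port` strings (the addresses below are examples):

```json
["10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"]
```

while Raft v3 expects an array of objects, each with `id`, `address`, and `non_voter` fields (the id below is an example):

```json
[
  { "id": "b8a21b53-7e33-4a49-a4f7-0f8e2e1d3c5a", "address": "10.0.0.1:4647", "non_voter": false }
]
```

This mismatch is exactly what the "cannot unmarshal string into Go value of type raft.configEntry" error points at: a v2-style string where a v3 object was expected, or vice versa.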
(I got it wrong, of course.)
After that, all that is left is to restart and follow the logs; look carefully for the line where it says it found the peers.json file.

## Things not to forget

- Reconfigure the Traefik KV backend (consider using DNS names instead, then)
- Reconfigure the public IPv4 announced to Jitsi

## What remains to be done

- Update the IPv6 DNS entries, which should create:
  - digitale.machine.deuxfleurs.fr
  - datura.machine.deuxfleurs.fr
  - drosera.machine.deuxfleurs.fr
- Update the garage instance on io
@@ -0,0 +1,22 @@
+++
title = "Juillet 2021"
description = "Juillet 2021: la BDD Synapse remplit nos disques"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 20
+++

# The Synapse database fills our disks

Todo: finish this blog post and duplicate it here: https://quentin.dufour.io/blog/2021-07-12/chroniques-administration-synapse/

The WAL that kept growing without bound was also due to a failing SSD whose writes were abysmally slow.

Actions put in place:
- Documented how to add space on a different disk with tablespaces
- Forbade joining rooms with too high a complexity
- Manual database cleanup (empty rooms, unused accounts, etc.)
- Replaced the failing SSD

Actions to put in place:
- Use the database maintenance tools distributed by the Matrix project
36
content/operations/pannes/2022-01-xx-glusterfs-corruption.md
Normal file
@@ -0,0 +1,36 @@
+++
title = "Janvier 2022"
description = "Janvier 2022: Corruptions GlusterFS"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 30
+++

# GlusterFS corruption

After a server reboot, the emails were no longer available.
It turned out that GlusterFS had not been replicating the data correctly for quite some time.
As a consequence, it served corrupted Dovecot folders.
Dovecot rebuilt an index without the emails, which desynchronized people's mailboxes.
In the end, some mailboxes lost all their emails.
No backup of the emails was being made.

The problem was introduced last summer when I reinstalled a server.
I installed it on a different Debian version.
The GlusterFS version was pinned in a sources.list entry pointing at the gluster project's repository,
but the pin was for the previous Debian release.
The sources.list entry was ignored, and the more recent gluster packaged by the Debian project was installed instead.
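For reference, one way to make such a pin explicit is an apt preferences entry rather than a release-specific sources.list line; `apt-cache policy glusterfs-server` then shows which origin actually wins. A sketch (the origin host is illustrative):

```
# /etc/apt/preferences.d/gluster (sketch)
Package: glusterfs-*
Pin: origin download.gluster.org
Pin-Priority: 1001
```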
These versions were incompatible, but silently so.
GlusterFS does not proactively report that volumes are out of sync either.
There is no command to know the state of the cluster.
After several days of work, I was unable to recover the emails.

Actions put in place:
- Removal of GlusterFS
- Daily backups of the emails
- The emails now live directly on disk (no high availability)

Action being put in place:
- Development of an IMAP server on top of Garage
23
content/operations/pannes/petits-incidents.md
Normal file
@@ -0,0 +1,23 @@
+++
title = "Petits incidents"
description = "Petits incidents"
date = 2022-12-22
dateCreated = 2022-12-22
weight = 1000
+++

- **2020** Publii wipes the hard drive of one of our members. He had changed the output folder to /home, which got erased.

- **2021-07-27** Power outage in Rennes: 40,000 people without electricity for a day. Our production servers being in the affected area, deuxfleurs.fr goes dark. https://www.francebleu.fr/infos/faits-divers-justice/rennes-plusieurs-quartiers-prives-d-electricite-1627354121

- **2021-12:** Somewhat hasty migration attempt to Tricot, to replace Traefik, which was causing trouble. Downtime and a lack of communication about the causes led to general confusion.

  *Actions to consider:* plan ahead for any intervention likely to impact quality of service on the Deuxfleurs infra. Test as much as possible beforehand, to avoid having to test in production. When testing in production is unavoidable, organize it to impact as few people as possible.

- **2022-03-28:** Power cut at the Jupiter site; `io` does not come back up on its own. T has to power it back on manually. `io` is unavailable for a few hours.

  *Actions to consider:* reconfigure `io` to power on automatically when the power comes back.

- **2022-03-28:** Grafana (hosted by M) is unavailable. M is the only one who can intervene.

  *Actions to consider:* map the monitoring infrastructure and make sure several people have access to it.
@@ -10,7 +10,7 @@ Gandi

 # Pont IPv6

-Route 48
+FDN

 # Paquets

@@ -301,6 +301,31 @@ docker exec -ti postgres psql
 rsync --progress -av /tmp/sql/{*.tar.gz,backup_manifest,dump-*} backup/target
 ```

+## Garbage collect old backups
+
+```
+mc ilm import deuxfleurs/${BUCKET_NAME} <<EOF
+{
+  "Rules": [
+    {
+      "Expiration": {
+        "Days": 62
+      },
+      "ID": "PurgeOldBackups",
+      "Status": "Enabled"
+    }
+  ]
+}
+EOF
+```
+
+Check that it has been activated:
+
+```
+mc ilm ls deuxfleurs/${BUCKET_NAME}
+```
+
 ## Ref

 - https://philipmcclarence.com/backing-up-and-restoring-postgres-using-pg_basebackup/
@@ -169,3 +169,9 @@ I propose:
 ```
 restic forget --prune --keep-within 1m1d --keep-within-weekly 3m --keep-within-monthly 1y
 ```
+
+Also try to restore a snapshot:
+
+```
+restic restore <snapshot id> --target /tmp/$SERVICE_NAME
+```
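Once a test restore has landed under `/tmp/$SERVICE_NAME`, a quick way to gain confidence is to diff it against the live data. A self-contained sketch of that check, with throwaway directories standing in for the restored snapshot and the live service data:

```bash
# Throwaway stand-ins for the live data and the restored snapshot.
mkdir -p /tmp/demo-live /tmp/demo-restore
echo "hello" > /tmp/demo-live/f.txt
cp /tmp/demo-live/f.txt /tmp/demo-restore/f.txt

# diff -r exits 0 only if both trees are identical.
diff -r /tmp/demo-live /tmp/demo-restore && echo "restore matches"
```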
@@ -1,6 +0,0 @@
-+++
-title = "Support"
-description = "Support"
-weight = 50
-sort_by = "weight"
-+++