Add backup documentation

Quentin 2022-05-16 18:37:52 +02:00
parent 478bbc9dad
commit 1a5b62d254
Signed by: quentin
GPG key ID: E9602264D639FF68
9 changed files with 607 additions and 6 deletions


@ -5,3 +5,5 @@ weight = 30
sort_by = "weight"
+++
This manual helps you learn about the topics the association cares about, from the social impact of digital technology to the administration of a Linux machine, with the idea that you can then get more involved in our activities, by running workshops or by helping operate the machines and the software.


@ -4,3 +4,6 @@ description = "Infrastructures"
weight = 90
+++
This manual documents the material side of digital technology at Deuxfleurs. It lists the computers, where they are located, the network connections they need, the energy they consume, the impact of their manufacturing and end of life, etc.


@ -19,7 +19,7 @@ De par leur rôle, ils participent au bon fonctionnement de la production.
They hold no raw personal data, but the collected metrics can reflect some behaviors of our users,
and the backups, although encrypted, still contain personal data.
[Development](./developpement) - The development servers host the tools that let us work on the software,
[Development](./developpement/) - The development servers host the tools that let us work on the software,
the configurations, the tickets, or the builds. They contain no personal data but can be used for
supply chain attacks. Eventually, this role could be merged with production.


@ -5,9 +5,11 @@ weight = 100
sort_by = "weight"
+++
# Definitions
This manual gathers our technical know-how; its purpose is to support our operators in carrying out their tasks.
* _A cluster_ (_grappe_) is **a set of computers, at least three** to get a really solid cluster.
# Our jargon
* _A cluster_ (_grappe_) is **a set of computers** that **cooperate** to provide a **service**.
A cluster is **managed consistently** (with the same software system), **more or less autonomous** (it does not depend on the rest of the world to provide web services), **by a defined entity** (a natural person or a group of people).


@ -4,3 +4,41 @@ description = "Sauvegardes"
weight = 30
sort_by = "weight"
+++
# Backed-up data
[restic](./restic/) - We use restic to back up the software
that relies on the file system (Cryptpad, Dovecot, and Plume), as well as Consul.
Eventually, we would like to be able to store everything directly on Garage
and make this backup method obsolete.
[pg\_basebackup](./pg_basebackup/) - We use this tool to back up all
the tables managed by our SQL database without hurting performance too much.
The whole process is run by a Python script that encrypts with `age` and sends the backup over S3.
Eventually, we would like to use [wal-g](https://github.com/wal-g/wal-g) instead.
[rclone](./rclone/) - Combined with btrfs, we copy the content of our cluster
to a flat file system so that we can recover from corruption.
# Backup locations
[Suresnes](/infrastructures/machines/support/#suresnes-mercure) - In Suresnes, we run a Minio instance
dedicated to data backups. It receives the backups of the file system, Consul, and Postgres.
[Rennes 2](/infrastructures/machines/support/#rennes-2-jupiter) - In Rennes, we have a plain Debian server
with a BTRFS partition. Every night, it backs up the content of our production Garage instance.
Eventually, we may decide to rationalize our backups and choose
to back up S3.
# Retention period and frequency
Backups must be configured with the following options:
**Frequency** - once a day (every night)
**Retention period** - 1 year
**Snapshot retention policy** - 1 snapshot per day for 1 month, 1 snapshot per week for 3 months, 1 snapshot per month for 1 year
**Exceptions**
For now, Postgres backups are made only once a week.
For Garage, 1 snapshot per day is kept for 1 year.


@ -0,0 +1,311 @@
+++
title = "pg_basebackup"
description = "pg_basebackup"
weight=15
+++
## Disclaimer
Do **NOT** use the following backup methods on the Stolon Cluster:
1. copying the data directory
2. `pg_dump`
3. `pg_dumpall`
The first one will lead to corrupted/inconsistent files.
The second and third ones put too much pressure on the cluster.
Basically, you will destroy it, in the following ways:
- Load will increase, requests will time out
- RAM usage will increase, the daemon will be OOM (Out Of Memory) killed by Linux
- Potentially, the WAL will grow a lot
## A binary backup with `pg_basebackup`
The only acceptable solution is `pg_basebackup` with **some throttling configured**.
Later, if you want a SQL dump, you can inject this binary backup on an ephemeral database you spawned solely for this purpose on a non-production machine.
First, start by fetching the credentials of the replication account from Consul.
Do not use the root account set up in Stolon, it will not work.
Then set up an SSH tunnel on your machine that forwards the PostgreSQL port, e.g.:
```bash
ssh -L 5432:psql-proxy.service.2.cluster.deuxfleurs.fr:5432 ...
```
*Later, we will use `/tmp/sql` as our working directory. Depending on your distribution, this
folder may be a `tmpfs` and thus live in RAM. If so, choose another folder that is not a `tmpfs`; otherwise you will fill your RAM
and the backup will fail. I am using NixOS, where `/tmp` is a regular folder persisted on disk, which explains why I am using it.*
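If you are unsure whether your working directory lives in RAM, a quick check like the one below can tell you. This is only a sketch: the function names `is_ram_fs_type` and `fs_type_of` are made up for this example, and it assumes a `stat` that supports `-f -c %T` (as GNU coreutils does).

```bash
# Return 0 when the given filesystem type lives in RAM, 1 otherwise.
is_ram_fs_type() {
  case "$1" in
    tmpfs|ramfs) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the type of the filesystem that holds the given directory.
fs_type_of() {
  stat -f -c %T "$1"
}

# Check the parent of the planned working directory.
dir=/tmp
if is_ram_fs_type "$(fs_type_of "$dir")"; then
  echo "$dir is RAM-backed: choose another working directory"
else
  echo "$dir is not a tmpfs/ramfs ($(fs_type_of "$dir"))"
fi
```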
Then export your password in `PGPASSWORD` and launch the backup:
```bash
export PGPASSWORD=xxx
mkdir -p /tmp/sql
cd /tmp/sql
pg_basebackup \
--host=127.0.0.1 \
--username=replicator \
--pgdata=/tmp/sql \
--format=tar \
--wal-method=stream \
--gzip \
--compress=6 \
--progress \
--max-rate=5M
```
*Something you should know: while it seems optional, fetching the WAL is mandatory. At first, I thought it was a way to have a "more recent backup".
But after some reading, it appears that the base backup alone is corrupted, because it is not a snapshot at all but a copy of the postgres folder taken with no specific state.
The whole point of the WAL is, in fact, to fix this corrupted archive...*
*Grab a cup of coffee, it will take some time...*
The result I get (the important file is `base.tar.gz`, `41921.tar.gz` will probably be missing as it is a secondary tablespace I will deactivate soon):
```
[nix-shell:/tmp/sql]$ ls
41921.tar.gz backup_manifest base.tar.gz pg_wal.tar.gz
```
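Before disconnecting, it can help to confirm that the files the next steps rely on are present and non-empty. The function below is a minimal sketch; its name `missing_backup_files` is invented for this example, and `41921.tar.gz` is deliberately not checked since it is expected to disappear.

```bash
# Print the name of each expected backup file that is missing or empty
# in the given directory. Returns 0 only when all of them are present.
missing_backup_files() {
  local dir="$1" missing=0 f
  for f in base.tar.gz pg_wal.tar.gz backup_manifest; do
    if [ ! -s "$dir/$f" ]; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   missing_backup_files /tmp/sql || echo "incomplete backup, do not continue"
```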
From now on, disconnect from production to continue your work.
You don't need it anymore, and it will prevent a disaster if you mistype a command.
## Importing the backup
> The backup taken with `pg_basebackup` is an exact copy of your data directory so, all you need to do to restore from that backup is to point postgres at that directory and start it up.
```bash
mkdir -p /tmp/sql/pg_data && cd /tmp/sql/pg_data
tar xzfv ../base.tar.gz
```
You should now have something like this:
```
[nix-shell:/tmp/sql/pg_data]$ ls
backup_label base pg_commit_ts pg_hba.conf pg_logical pg_notify pg_serial pg_stat pg_subtrans pg_twophase pg_wal postgresql.conf tablespace_map
backup_label.old global pg_dynshmem pg_ident.conf pg_multixact pg_replslot pg_snapshots pg_stat_tmp pg_tblspc PG_VERSION pg_xact stolon-temp-postgresql.conf
```
Now we will extract the WAL:
```bash
mkdir -p /tmp/sql/wal && cd /tmp/sql/wal
tar xzfv ../pg_wal.tar.gz
```
You should have something like this:
```
[nix-shell:/tmp/sql/wal]$ ls
00000003000014AF000000C9 00000003000014AF000000CA 00000003.history archive_status
```
Before restoring our backup, we want to check it:
```bash
cd /tmp/sql/pg_data
cp ../backup_manifest .
# We do not verify the WAL because it does not seem to work well
# See my capdata.fr reference at the bottom
# pg_verifybackup -w ../wal .
pg_verifybackup -n .
```
Now we must read and edit some files before launching our ephemeral server:
- Set `listen_addresses = '0.0.0.0'` in `postgresql.conf`
- Add `restore_command = 'cp /mnt/wal/%f %p'` in `postgresql.conf`
- Check `port` in `postgresql.conf`; in our case it is `5433`.
- Create an empty file named `recovery.signal`
*Do not create a `recovery.conf` file. You may find it recommended online, but this method is deprecated and your postgres daemon will refuse to boot if it finds one.*
*Currently, we use port 5433 in our postgresql configuration even though 5432 is the default port. Indeed, in production, clients access the cluster transparently through the Stolon Proxy, which listens on port 5432 and redirects requests to the correct PostgreSQL instance, listening secretly on port 5433! To export our binary backup as text, we will query our postgres instance directly without going through the proxy, which is why you must note this port.*
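The edits listed above can be scripted. The sketch below is one way to do it, assuming the data directory was extracted as shown earlier; the function name `prepare_recovery` is made up for this example. It appends the two settings to `postgresql.conf` (when a parameter appears several times in the file, PostgreSQL uses the last occurrence), creates the empty `recovery.signal`, and prints the configured port.

```bash
# Prepare an extracted data directory for archive recovery.
#   $1: data directory on the host
#   $2: directory holding the WAL segments, as seen by the postgres process
prepare_recovery() {
  local pg_data="$1" wal_dir="$2"
  # Append the settings; the last occurrence of a parameter wins.
  {
    echo "listen_addresses = '0.0.0.0'"
    echo "restore_command = 'cp ${wal_dir}/%f %p'"
  } >> "$pg_data/postgresql.conf"
  # An empty recovery.signal asks postgres to start in recovery mode.
  : > "$pg_data/recovery.signal"
  # Print the port line(s) so that you can note the port.
  grep -E '^[[:space:]]*port[[:space:]]*=' "$pg_data/postgresql.conf" \
    || echo "no explicit port found in postgresql.conf"
}

# Usage, with the paths used in this guide:
#   prepare_recovery /tmp/sql/pg_data /mnt/wal
```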
Now we will start our postgres container on our machine.
At the time of writing the live version is `superboum/amd64_postgres:v9`.
We must start by getting the `postgres` user id. Our containers run as this user by default, so you only need to run:
```bash
docker run --rm -it superboum/amd64_postgres:v9 id
```
And we get:
```
uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)
```
Now `chown` your `pg_data`:
```bash
chown 999:999 -R /tmp/sql/{pg_data,wal}
chmod 700 -R /tmp/sql/{pg_data,wal}
```
And finally:
```
docker run \
--rm \
-it \
-p 5433:5433 \
-v /tmp/sql/:/mnt/ \
superboum/amd64_postgres:v9 \
postgres -D /mnt/pg_data
```
I have the following output:
```
2022-01-28 14:46:39.750 GMT [1] LOG: skipping missing configuration file "/mnt/pg_data/postgresql.auto.conf"
2022-01-28 14:46:39.763 UTC [1] LOG: starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-01-28 14:46:39.764 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5433
2022-01-28 14:46:39.767 UTC [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2022-01-28 14:46:39.773 UTC [7] LOG: database system was interrupted; last known up at 2022-01-28 14:33:13 UTC
cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
2022-01-28 14:46:40.318 UTC [7] LOG: starting archive recovery
2022-01-28 14:46:40.321 UTC [7] LOG: restored log file "00000003.history" from archive
2022-01-28 14:46:40.336 UTC [7] LOG: restored log file "00000003000014AF000000C9" from archive
2022-01-28 14:46:41.426 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:41.445 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:41.457 UTC [7] LOG: redo starts at 14AF/C9000028
2022-01-28 14:46:41.500 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
2022-01-28 14:46:42.461 UTC [7] LOG: consistent recovery state reached at 14AF/CA369AB0
2022-01-28 14:46:42.461 UTC [1] LOG: database system is ready to accept read only connections
cp: cannot stat '/mnt/wal/00000003000014AF000000CB': No such file or directory
2022-01-28 14:46:42.463 UTC [7] LOG: redo done at 14AF/CA369AB0
2022-01-28 14:46:42.463 UTC [7] LOG: last completed transaction was at log time 2022-01-28 14:35:04.698438+00
2022-01-28 14:46:42.480 UTC [7] LOG: could not open directory "pg_tblspc/41921/PG_13_202007201": No such file or directory
2022-01-28 14:46:42.493 UTC [7] LOG: restored log file "00000003000014AF000000CA" from archive
cp: cannot stat '/mnt/wal/00000004.history': No such file or directory
2022-01-28 14:46:43.462 UTC [7] LOG: selected new timeline ID: 4
2022-01-28 14:46:44.441 UTC [7] LOG: archive recovery complete
2022-01-28 14:46:44.444 UTC [7] LOG: restored log file "00000003.history" from archive
2022-01-28 14:46:45.614 UTC [1] LOG: database system is ready to accept connections
```
*Notes: the missing tablespace is a legacy tablespace used in the past to debug Matrix. It will be removed soon, so we can safely ignore it. The other `cp` errors seem to be expected, as postgres might want to know how far it can rewind with the WAL, but I am not 100% sure.*
Your ephemeral instance should work:
```bash
export PGPASSWORD=xxx # your postgres (admin) account password
psql -h 127.0.0.1 -p 5433 -U postgres postgres
```
And your databases should appear:
```
[nix-shell:~/Documents/dev/infrastructure]$ psql -h 127.0.0.1 -p 5433 -U postgres postgres
psql (13.5, server 13.3 (Debian 13.3-1.pgdg100+1))
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
xxxx | xxxxx | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(8 rows)
```
## Dump your ephemeral database as SQL
Now we can do a SQL export of our ephemeral database.
We use zstd to automatically compress the output file.
We use multiple parameters:
- `-vv` gives us some idea of the progress
- `-9` is a quite high value and should compress efficiently. Decrease it if your machine is low powered
- `-T0` asks zstd to use all your cores. By default, zstd uses only one core.
```bash
pg_dumpall -h 127.0.0.1 -p 5433 -U postgres \
| zstd -vv -9 -T0 --format=zstd > dump-`date --rfc-3339=seconds | sed 's/ /T/'`.sql.zstd
```
I get the following result:
```
[nix-shell:/tmp/sql]$ ls -lah dump*
-rw-r--r-- 1 quentin users 749M janv. 28 16:07 dump-2022-01-28T16:06:29+01:00.sql.zstd
```
Now you can stop your ephemeral server.
## Restore your SQL file
First, start a blank server:
```bash
docker run \
--rm -it \
--name postgres \
-p 5433:5432 \
superboum/amd64_postgres:v9 \
bash -c '
set -ex
mkdir /tmp/psql
initdb -D /tmp/psql --no-locale --encoding=UTF8
echo "host all postgres 0.0.0.0/0 md5" >> /tmp/psql/pg_hba.conf
postgres -D /tmp/psql
'
```
Then set the same password as your prod for the `postgres` user (it will be required as part of the restore):
```bash
docker exec -ti postgres bash -c "echo \"ALTER USER postgres WITH PASSWORD '$PGPASSWORD';\" | psql"
echo '\l' | psql -h 127.0.0.1 -p 5433 -U postgres postgres
# the database list should have no entry (except `postgres`, `template0` and `template1`); otherwise ABORT EVERYTHING, YOU ARE ON THE WRONG DB
```
And finally, restore your SQL backup:
```bash
zstdcat -vv dump-* | \
grep -P -v '^(CREATE|DROP) ROLE postgres;' | \
psql -h 127.0.0.1 -p 5433 -U postgres --set ON_ERROR_STOP=on postgres
```
*Note: we must skip CREATE/DROP ROLE postgres during the restore, as the role already exists and recreating it would generate an error.
Because we want to be extra careful, we specifically asked psql to stop on the first error, and we do not want to change this behavior.
So, instead, we simply remove every line that matches the regex given in the previous command.*
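To see what the filter does, you can run it on a few sample statements. The lines below are made up for illustration and are not taken from a real dump; the helper name `skip_postgres_role` is also hypothetical. It uses `grep -E` rather than `-P`, which should behave the same for this pattern since it uses no Perl-specific syntax.

```bash
# Drop the statements that create or drop the `postgres` role,
# and pass every other line through unchanged.
skip_postgres_role() {
  grep -E -v '^(CREATE|DROP) ROLE postgres;'
}

# Made-up sample input, one statement per line.
printf '%s\n' \
  'DROP ROLE postgres;' \
  'CREATE ROLE postgres;' \
  'CREATE ROLE example_app;' \
  'ALTER ROLE postgres WITH SUPERUSER;' \
  | skip_postgres_role
```

Only `CREATE ROLE example_app;` and `ALTER ROLE postgres WITH SUPERUSER;` are printed: statements about other roles, and other kinds of statements about `postgres`, are kept.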
Check that the backup has been correctly restored.
For example:
```bash
docker exec -ti postgres psql
#then type "\l", "\c db-name", "select ..."
```
## Finally, store it safely
```bash
rsync --progress -av /tmp/sql/{*.tar.gz,backup_manifest,dump-*} backup/target
```
## Ref
- https://philipmcclarence.com/backing-up-and-restoring-postgres-using-pg_basebackup/
- https://www.cybertec-postgresql.com/en/pg_basebackup-creating-self-sufficient-backups/
- https://www.postgresql.org/docs/14/continuous-archiving.html
- https://www.postgresql.org/docs/14/backup-dump.html#BACKUP-DUMP-RESTORE
- https://dba.stackexchange.com/questions/75033/how-to-restore-everything-including-postgres-role-from-pg-dumpall-backup
- https://blog.capdata.fr/index.php/postgresql-13-les-nouveautes-interessantes/


@ -0,0 +1,76 @@
+++
title = "rclone"
description = "rclone"
weight = 20
sort_by = "weight"
+++
Raw backup script; we plan a more elegant approach in the future:
```
#!/bin/bash
cd "$(dirname "$0")"
if [ "$(hostname)" != "io" ]; then
echo "Please run this script on io"
exit 1
fi
if [ ! -d "buckets" ]; then
btrfs subvolume create $(pwd)/buckets
fi
AK=$1
SK=$2
function gctl {
docker exec garage /garage "$@"
}
gctl status
BUCKETS=$(gctl bucket list | tail -n +2 | cut -d " " -f 3 | cut -d "," -f 1)
for BUCKET in $BUCKETS; do
case $BUCKET in
*backup*)
echo "Skipping $BUCKET (not doing backup of backup)"
;;
*cache*)
echo "Skipping $BUCKET (not doing backup of cache)"
;;
*)
echo "Backing up $BUCKET"
if [ ! -d $(pwd)/buckets/$BUCKET ]; then
mkdir $(pwd)/buckets/$BUCKET
fi
gctl bucket allow --key $AK --read $BUCKET
rclone sync --s3-endpoint http://localhost:3900 \
--s3-access-key-id $AK \
--s3-secret-access-key $SK \
--s3-region garage \
--s3-force-path-style \
--transfers 32 \
--fast-list \
--stats-one-line \
--stats 10s \
--stats-log-level NOTICE \
:s3:$BUCKET $(pwd)/buckets/$BUCKET
;;
esac
done
# Remove duplicates
#duperemove -dAr $(pwd)/buckets
if [ ! -d "$(pwd)/snapshots" ]; then
mkdir snapshots
fi
SNAPSHOT=$(pwd)/snapshots/buckets-$(date +%F)
echo "Making snapshot: $SNAPSHOT"
btrfs subvolume snapshot $(pwd)/buckets $SNAPSHOT
```


@ -0,0 +1,171 @@
+++
title = "restic"
description = "restic"
weight = 10
+++
Add the admin account as `deuxfleurs` to your `~/.mc/config.json` file.
You need to choose some names/identifiers:
```bash
export ENDPOINT="https://s3.garage.tld"
export SERVICE_NAME="example"
export BUCKET_NAME="backups-${SERVICE_NAME}"
export NEW_ACCESS_KEY_ID="key-${SERVICE_NAME}"
export NEW_SECRET_ACCESS_KEY=$(openssl rand -base64 32)
export POLICY_NAME="policy-$BUCKET_NAME"
```
Create a new bucket:
```bash
mc mb deuxfleurs/$BUCKET_NAME
```
Create a new user:
```bash
mc admin user add deuxfleurs $NEW_ACCESS_KEY_ID $NEW_SECRET_ACCESS_KEY
```
Add this new user to your `~/.mc/config.json`. Run this command first to generate the snippet to copy and paste:
```
cat > /dev/stdout <<EOF
"$NEW_ACCESS_KEY_ID": {
"url": "$ENDPOINT",
"accessKey": "$NEW_ACCESS_KEY_ID",
"secretKey": "$NEW_SECRET_ACCESS_KEY",
"api": "S3v4",
"path": "auto"
},
EOF
```
---
Create a policy for this bucket and save it as JSON:
```bash
cat > /tmp/policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${BUCKET_NAME}"
]
},
{
"Effect": "Allow",
"Action": [
"s3:*"
],
"Resource": [
"arn:aws:s3:::${BUCKET_NAME}/*"
]
}
]
}
EOF
```
Register it:
```bash
mc admin policy add deuxfleurs $POLICY_NAME /tmp/policy.json
```
Set it to your user:
```bash
mc admin policy set deuxfleurs $POLICY_NAME user=${NEW_ACCESS_KEY_ID}
```
Now it should display *only* your new bucket when running:
```bash
mc ls $NEW_ACCESS_KEY_ID
```
---
Now we need to initialize the repository with restic.
```bash
export AWS_ACCESS_KEY_ID=$NEW_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=$NEW_SECRET_ACCESS_KEY
export RESTIC_REPOSITORY="s3:$ENDPOINT/$BUCKET_NAME"
export RESTIC_PASSWORD=$(openssl rand -base64 32)
```
Then init the repo for restic from your machine:
```
restic init
```
*I am using restic version `restic 0.12.1 compiled with go1.16.9 on linux/amd64`*
See your snapshots with:
```
restic snapshots
```
Check also these useful commands:
```
restic ls
restic diff
restic help
```
---
Add the secrets to Consul, near your service secrets.
The idea is that the backup service is a component of the global running service.
You must run in `app/<name>/secrets/<subpath>`:
```bash
echo "USER Backup AWS access key ID" > backup_aws_access_key_id
echo "USER Backup AWS secret access key" > backup_aws_secret_access_key
echo "USER Restic repository, eg. s3:https://s3.garage.tld" > backup_restic_repository
echo "USER Restic password to encrypt backups" > backup_restic_password
```
Then run secretmgr:
```bash
# Spawning a nix shell is an easy way to get all the dependencies you need
nix-shell
# Check that secretmgr works for you
python3 secretmgr.py check <name>
# Now interactively feed the secrets
python3 secretmgr.py gen <name>
```
---
Now we need a service that runs:
```
restic backup .
```
It must also garbage-collect old snapshots.
I propose:
```
restic forget --prune --keep-within 1m1d --keep-within-weekly 3m --keep-within-monthly 1y
```


@ -5,6 +5,4 @@ weight = 50
sort_by = "weight"
+++
This is a manual
This manual covers everything related to the association, such as its legal aspects, its deliberations, and how people are organized.