PostgreSQL no longer comes back up on its own #49

Open
opened 2021-09-23 04:54:51 +00:00 by quentin · 0 comments
Owner

Since my operations this summer, PostgreSQL no longer seems to come back up on its own, because it fails in a loop:

pg_basebackup: error: directory "/mnt/slow" exists but is not empty
See the log in context:
WARNING:  skipping special file "./postgresql.auto.conf"
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 145E/8000028 on timeline 2
pg_basebackup: error: directory "/mnt/slow" exists but is not empty
pg_basebackup: removing data directory "/mnt/persist/postgres"
2021-09-23T04:49:45.724Z	ERROR	cmd/keeper.go:1365	failed to resync from followed instance	{"error": "sync error: exit status 1"}
2021-09-23T04:49:50.732Z	ERROR	cmd/keeper.go:1110	db failed to initialize or resync
2021-09-23T04:49:50.741Z	INFO	cmd/keeper.go:1141	current db UID different than cluster data db UID	{"db": "", "cdDB": "b06f3d8b"}
2021-09-23T04:49:50.741Z	INFO	cmd/keeper.go:1296	resyncing the database cluster
2021-09-23T04:49:50.768Z	INFO	cmd/keeper.go:1321	database cluster not initialized
2021-09-23T04:49:50.776Z	INFO	cmd/keeper.go:925	syncing from followed db	{"followedDB": "4d972314", "keeper": "2f396e97"}
2021-09-23T04:49:50.777Z	INFO	postgresql/postgresql.go:964	running pg_basebackup
pg_basebackup: initiating base backup, waiting for checkpoint to complete
WARNING:  skipping special file "./postgresql.auto.conf"
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 145E/90000D8 on timeline 2
pg_basebackup: error: directory "/mnt/slow" exists but is not empty
pg_basebackup: removing data directory "/mnt/persist/postgres"
2021-09-23T04:49:52.426Z	ERROR	cmd/keeper.go:1365	failed to resync from followed instance	{"error": "sync error: exit status 1"}
2021-09-23T04:49:57.432Z	ERROR	cmd/keeper.go:1110	db failed to initialize or resync
2021-09-23T04:49:57.459Z	INFO	cmd/keeper.go:1141	current db UID different than cluster data db UID	{"db": "", "cdDB": "b06f3d8b"}
2021-09-23T04:49:57.459Z	INFO	cmd/keeper.go:1296	resyncing the database cluster
2021-09-23T04:49:57.480Z	INFO	cmd/keeper.go:1321	database cluster not initialized
2021-09-23T04:49:57.488Z	INFO	cmd/keeper.go:925	syncing from followed db	{"followedDB": "4d972314", "keeper": "2f396e97"}
2021-09-23T04:49:57.488Z	INFO	postgresql/postgresql.go:964	running pg_basebackup
pg_basebackup: initiating base backup, waiting for checkpoint to complete
WARNING:  skipping special file "./postgresql.auto.conf"

Emptying the /mnt/slow directory is enough to get it to restart.

I don't know the source of the bug yet, but my intuition is that Stolon only knows how to clean PostgreSQL's main storage, not its secondary storage, when it performs a reset. As a result, the secondary storage is left in place, pg_basebackup refuses to write into a non-empty directory, and the resync is blocked.

Two options I see:

  • Stop using the secondary storage, even though it is quite useful to us for repairing PostgreSQL
  • Read the Stolon code, find the bug, open an issue, and patch it
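
Until one of these is done, the manual workaround described above (emptying /mnt/slow) could be scripted. Below is a minimal sketch, not Stolon code: the helper names `is_empty` and `empty_directory` are hypothetical, and it assumes the secondary storage directory must itself be kept (for example because it is a mount point) while everything inside it is deleted.

```python
import shutil
from pathlib import Path


def is_empty(directory: Path) -> bool:
    """Return True if the directory contains no entries at all."""
    return not any(directory.iterdir())


def empty_directory(directory: Path) -> int:
    """Delete everything inside `directory` but keep the directory itself.

    Hypothetical helper: the directory is preserved because it may be a
    mount point. Returns the number of top-level entries removed.
    """
    removed = 0
    # Snapshot the listing first so we never delete while iterating.
    for entry in list(directory.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real sub-directory: remove recursively
        else:
            entry.unlink()  # regular file or symlink: remove the entry only
        removed += 1
    return removed
```

It would be called as `empty_directory(Path("/mnt/slow"))` on the affected node before the keeper retries its resync. This deletes data irreversibly, so it should only be run on a replica that is about to be rebuilt from the followed instance anyway.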
quentin added the
bug
label 2021-09-23 04:54:51 +00:00
quentin self-assigned this 2021-09-23 04:54:52 +00:00
Reference: Deuxfleurs/infrastructure#49