Cluster/layout migration status is wrong after removing nodes #916

Open
opened 2024-12-29 11:54:05 +00:00 by morsik · 1 comment

Hi! I migrated my Garage cluster from one set of nodes to another using layout assign, layout remove and then layout apply (both the add and remove operations were applied at the same time).

Then I watched status until it went into this:

$  k -n garage exec -it ds/garage -- ./garage status
Defaulted container "garage" out of: garage, garage-init (init)
2024-12-29T11:52:15.428848Z  INFO garage_net::netapp: Connected to 127.0.0.1:3901, negotiating handshake...
2024-12-29T11:52:15.471564Z  INFO garage_net::netapp: Connection established to 8be181efd16f1894
==== HEALTHY NODES ====
ID                Hostname      Address            Tags  Zone    Capacity          DataAvail
8be181efd16f1894  garage-mttpb  10.244.1.40:3901   []    krk1-a  400.0 GB          208.9 GB (95.3%)
5cdd28d293df8664  garage-bsjd9  10.244.11.42:3901  []    krk1-c  400.0 GB          190.7 GB (94.8%)
356a3f2c6ed7b688  garage-5wtkf  10.244.2.2:3901                  NO ROLE ASSIGNED
8296a8b70019368d  garage-q6nq2  10.244.5.34:3901                 NO ROLE ASSIGNED
5d8ff6ec03ea8c7c  garage-k2kzc  10.244.0.112:3901  []    krk1-b  400.0 GB          195.9 GB (95.1%)
09e39ab7fc3ff62f  garage-zxv2l  10.244.4.110:3901                NO ROLE ASSIGNED
a0a47345a521ef61  garage-fx78j  10.244.3.2:3901                  NO ROLE ASSIGNED
899c05d989e7c5fd  garage-62zmv  10.244.6.27:3901                 NO ROLE ASSIGNED
ac4601eb24d93f59  garage-2sqwx  10.244.7.3:3901                  NO ROLE ASSIGNED
$  k -n garage exec -it ds/garage -- ./garage layout show
Defaulted container "garage" out of: garage, garage-init (init)
2024-12-29T11:52:34.064123Z  INFO garage_net::netapp: Connected to 127.0.0.1:3901, negotiating handshake...
2024-12-29T11:52:34.107576Z  INFO garage_net::netapp: Connection established to 8be181efd16f1894
==== CURRENT CLUSTER LAYOUT ====
ID                Tags  Zone    Capacity  Usable capacity
5cdd28d293df8664        krk1-c  400.0 GB  400.0 GB (100.0%)
5d8ff6ec03ea8c7c        krk1-b  400.0 GB  400.0 GB (100.0%)
8be181efd16f1894        krk1-a  400.0 GB  400.0 GB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 2

This seemed to be correct, so I started shutting down the old nodes. To my surprise, the data had not yet been migrated off them, and the new nodes were still referencing blocks stored on the old ones, which made the data essentially unavailable.

I booted the old nodes back up and I can see in the logs that data is still being transferred from the old nodes to the new ones, but there's no information anywhere about that. So I'm waiting now until... I don't know... the logs stop showing that blocks are being synced?

EDIT: it looks like after 2-3 hours all data was finally migrated and I was able to shut down the old nodes.

morsik changed title from Cluster/layout status in wrong after removing nodes to Cluster/layout status is wrong after removing nodes 2024-12-29 11:54:14 +00:00
morsik changed title from Cluster/layout status is wrong after removing nodes to Cluster/layout migration status is wrong after removing nodes 2024-12-29 11:54:21 +00:00
Owner

Garage does not have a central coordinator, and hence doesn't have a sense of "progress" at a cluster level for things such as layout migrations. The best way to monitor a migration is to set up monitoring (https://garagehq.deuxfleurs.fr/documentation/cookbook/monitoring/) and to watch the number of blocks in the resync queue drop on each of the nodes as the data migrates to the new nodes.
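
As a rough illustration of watching the resync queue per node, here is a minimal Python sketch. It assumes each node exposes Prometheus-format metrics on its admin API (assumed here to be at a URL like `http://<node>:3903/metrics`, optionally protected by a bearer token) and that the queue size is published under the metric name `block_resync_queue_length`; check both against the monitoring documentation for your Garage version. The helper names `fetch_metrics`, `resync_queue_length` and `check_nodes` are hypothetical, not part of Garage.

```python
import urllib.request


def fetch_metrics(url, token=None, timeout=10):
    """Download the Prometheus text payload from one node's metrics endpoint."""
    req = urllib.request.Request(url)
    if token:
        # Assumption: the endpoint accepts a bearer token when one is configured.
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def resync_queue_length(metrics_text, metric="block_resync_queue_length"):
    """Sum all samples of `metric`; return None if the metric is absent.

    The metric name is an assumption; adjust it to what your nodes export.
    """
    total = None
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(metric + "{"):
            rest = line[line.rindex("}") + 1:]  # value follows the label set
        elif line.startswith(metric + " "):
            rest = line[len(metric):]
        else:
            continue
        fields = rest.split()
        if fields:
            # fields[0] is the sample value; an optional timestamp may follow
            total = (total or 0.0) + float(fields[0])
    return total


def check_nodes(urls, token=None):
    """Return {url: queue length or None} for every node URL given."""
    return {url: resync_queue_length(fetch_metrics(url, token)) for url in urls}
```

Calling `check_nodes([...])` with the metrics URLs of all nodes (old and new) every minute or so, and only shutting the old nodes down once every value has stayed at 0 for a while, is one way to approximate a "migration finished" signal. It is still a per-node heuristic, not a cluster-wide guarantee.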

maximilien added the scope/documentation and kind/usability labels 2025-01-06 21:44:17 +00:00
Reference: Deuxfleurs/garage#916