How to remove old dead nodes #879

Closed
opened 2024-09-20 00:02:25 +00:00 by fartsy · 3 comments

Lost a data drive on one of the active nodes and replaced it with a new one on the same IP/port with

garage layout assign d05dd11f95558ee5 --replace e03bd85fdf89cf70 -c 3.5TB -z home

but the node still appears in FAILED NODES

# garage status
...
==== FAILED NODES ====
ID                Hostname  Tags  Zone  Capacity              Last seen
d05dd11f95558ee5  ??        []    home  draining metadata...  never seen
e03bd85fdf89cf70  ??        []    home  draining metadata...  never seen
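
For context, a `--replace` assignment is only staged; it does not take effect until the new layout version is applied. A minimal sketch of the full two-step sequence (the `garage` binary is stubbed out with a shell function here so the commands can run standalone; node IDs are the ones from this cluster, and the version number is taken from the history below):

```shell
#!/bin/sh
# Stub standing in for the real garage binary, so this runs without a cluster.
garage() { echo "garage $*"; }

# Step 1: stage the replacement (new node takes over the old node's role).
garage layout assign d05dd11f95558ee5 --replace e03bd85fdf89cf70 -c 3.5TB -z home

# Step 2: the staged change only becomes live once the new version is applied.
garage layout apply --version 6
```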

Is there a way to remove them so that `garage layout` catches up to the latest version? Right now it reports it as

# garage layout history
==== LAYOUT HISTORY ====
Version  Status      Storage nodes  Gateway nodes
#6       current     3              0
#5       draining    2              0
#4       draining    3              0
#3       draining    3              0
#2       historical  2              0
#1       historical  3              0

==== UPDATE TRACKERS ====
Several layout versions are currently live in the cluster, and data is being migrated.
This is the internal data that Garage stores to know which nodes have what data.

Node              Ack  Sync  Sync_ack
00531d207b91afc8  #6   #3    #3
717b01667affa37d  #6   #5    #3
c1ac67a150d3c825  #6   #3    #3
d05dd11f95558ee5  #6   #6    #3
e03bd85fdf89cf70  #6   #6    #3
Owner

Could you paste the full output of `garage status` and `garage layout show`?

Author

status

# garage status
==== HEALTHY NODES ====
ID                Hostname          Address              Tags      Zone      Capacity   DataAvail
00531d207b91afc8  host1     192.168.222.45:3901  [canada]  americas  2.0 TB     976.6 GB (46.3%)
717b01667affa37d  host2     192.168.222.20:3901  []        home      4.0 TB     7.6 TB (86.5%)
c1ac67a150d3c825  host3     192.168.222.10:3901  [usa]     americas  1000.0 GB  50.0 GB (4.6%)

==== FAILED NODES ====
ID                Hostname  Tags  Zone  Capacity              Last seen
d05dd11f95558ee5  ??        []    home  draining metadata...  never seen
e03bd85fdf89cf70  ??        []    home  draining metadata...  never seen

Your cluster is expecting to drain data from nodes that are currently unavailable.
If these nodes are definitely dead, please review the layout history with
`garage layout history` and use `garage layout skip-dead-nodes` to force progress.

layout show

# garage layout show
==== CURRENT CLUSTER LAYOUT ====
ID                Tags    Zone      Capacity   Usable capacity
00531d207b91afc8  canada  americas  2.0 TB     2.0 TB (100.0%)
717b01667affa37d          home      4.0 TB     3.0 TB (74.9%)
c1ac67a150d3c825  usa     americas  1000.0 GB  994.2 GB (99.4%)

Zone redundancy: maximum

Current cluster layout version: 6

layout history

# garage layout history
==== LAYOUT HISTORY ====
Version  Status      Storage nodes  Gateway nodes
#6       current     3              0
#5       draining    2              0
#4       draining    3              0
#3       draining    3              0
#2       historical  2              0
#1       historical  3              0

==== UPDATE TRACKERS ====
Several layout versions are currently live in the cluster, and data is being migrated.
This is the internal data that Garage stores to know which nodes have what data.

Node              Ack  Sync  Sync_ack
00531d207b91afc8  #6   #3    #3
717b01667affa37d  #6   #5    #3
c1ac67a150d3c825  #6   #3    #3
d05dd11f95558ee5  #6   #6    #3
e03bd85fdf89cf70  #6   #6    #3

If some nodes are not catching up to the latest layout version in the update trackers,
it might be because they are offline or unable to complete a sync successfully.
You may force progress using `garage layout skip-dead-nodes --version 6`
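
As a side note, the stuck rows are easy to spot mechanically: any tracker column behind the current version #6 marks a node that has not caught up. A throwaway sketch over the table above (rows pasted verbatim):

```shell
#!/bin/sh
# Update-tracker rows copied from the `garage layout history` output above.
trackers='00531d207b91afc8  #6   #3    #3
717b01667affa37d  #6   #5    #3
c1ac67a150d3c825  #6   #3    #3
d05dd11f95558ee5  #6   #6    #3
e03bd85fdf89cf70  #6   #6    #3'

# Columns: Node, Ack, Sync, Sync_ack. Flag nodes with any tracker behind #6.
echo "$trackers" | awk '$2 != "#6" || $3 != "#6" || $4 != "#6" {print $1}'
```

Here every node gets flagged, since Sync_ack is stuck at #3 across the whole cluster, which is consistent with the two dead nodes blocking global progress.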
lx added the kind/wrong-behavior label 2024-09-22 11:30:30 +00:00
Owner


In theory you should be able to fix your issue using `garage layout skip-dead-nodes --version 6 --allow-missing-data`, but there is a bug in the current logic so that won't work. I am working on a patch.
lx added this to the v1.1 milestone 2024-09-22 11:49:14 +00:00
lx closed this issue 2024-09-22 12:01:50 +00:00
Reference: Deuxfleurs/garage#879