Garage client crashes when the last node of a site is switched to gateway #43

Open
opened 2024-11-30 13:16:58 +00:00 by baptiste · 1 comment
Owner

Problem encountered while switching the Neptune nodes to gateway before shutting them down.

While the layout is being defined, the garage client crashes when handling the new layout (before it has been applied). The layout being staged can still be reverted.

To investigate, then file a Garage ticket.

Owner

Layout before the update:

[root@df-ykl:~]# docker exec -it 0bc65f9b208e /garage layout show
2024-11-28T21:05:45.674060Z  INFO garage_net::netapp: Connected to [2a02:a03f:6510:5102:6e4b:90ff:fe3b:e86c]:3901, negotiating handshake...
2024-11-28T21:05:45.715926Z  INFO garage_net::netapp: Connection established to 17ee03c6b81d9235
==== CURRENT CLUSTER LAYOUT ====
ID                Tags                              Zone     Capacity  Usable capacity
0a03ab7c082ad929  ananas,scorpio,france,adrien      scorpio  2.0 TB    744.2 GB (37.2%)
17ee03c6b81d9235  df-ykl,bespin,belgium,max         bespin   500.0 GB  494.2 GB (98.8%)
2032d0a37f249c4a  abricot,scopio,france,adrien      scorpio  2.0 TB    744.2 GB (37.2%)
5fcb3b6e39db3dcb  concombre,neptune,france,alex     neptune  500.0 GB  244.2 GB (48.8%)
68e74be3672f9cc0  pamplemousse,corrin,france,zorun  corrin   gateway
8cf284e7df17d0fd  celeri,neptune,france,alex        neptune  2.0 TB    1000.0 GB (50.0%)
942dd71ea95f4904  df-ymf,bespin,belgium,max         bespin   500.0 GB  500.0 GB (100.0%)
a717e5b618267806  courgette,neptune,france,alex     neptune  500.0 GB  244.2 GB (48.8%)
fdfaf7832d8359e0  df-ymk,bespin,belgium,max         bespin   500.0 GB  494.2 GB (98.8%)

Zone redundancy: maximum

Current cluster layout version: 25

and status before the update:

[root@df-ykl:~]# docker exec -it 0bc65f9b208e /garage status
2024-11-28T21:07:48.715286Z  INFO garage_net::netapp: Connected to [2a02:a03f:6510:5102:6e4b:90ff:fe3b:e86c]:3901, negotiating handshake...
2024-11-28T21:07:48.757847Z  INFO garage_net::netapp: Connection established to 17ee03c6b81d9235
==== HEALTHY NODES ====
ID                Hostname      Address                                         Tags                                Zone     Capacity          DataAvail
e5b9a31be37fa9b0  pasteque      [2001:912:1ac0:2200::202]:3901                                                               NO ROLE ASSIGNED
a717e5b618267806  courgette     [2a01:e0a:2c:540::32]:3901                      [courgette,neptune,france,alex]     neptune  500.0 GB          397.8 GB (79.6%)
942dd71ea95f4904  df-ymf        [2a02:a03f:6510:5102:6e4b:90ff:fe3a:6174]:3901  [df-ymf,bespin,belgium,max]         bespin   500.0 GB          251.4 GB (50.3%)
17ee03c6b81d9235  df-ykl        [2a02:a03f:6510:5102:6e4b:90ff:fe3b:e86c]:3901  [df-ykl,bespin,belgium,max]         bespin   500.0 GB          274.9 GB (55.0%)
8cf284e7df17d0fd  celeri        [2a01:e0a:2c:540::33]:3901                      [celeri,neptune,france,alex]        neptune  2.0 TB            1.6 TB (79.2%)
0a03ab7c082ad929  ananas        [2a01:e0a:e4:2dd0::42]:3901                     [ananas,scorpio,france,adrien]      scorpio  2.0 TB            1.7 TB (84.3%)
2032d0a37f249c4a  abricot       [2a01:e0a:e4:2dd0::41]:3901                     [abricot,scopio,france,adrien]      scorpio  2.0 TB            1.7 TB (84.3%)
5fcb3b6e39db3dcb  concombre     [2a01:e0a:2c:540::31]:3901                      [concombre,neptune,france,alex]     neptune  500.0 GB          398.0 GB (79.6%)
fdfaf7832d8359e0  df-ymk        [2a02:a03f:6510:5102:6e4b:90ff:fe3b:e939]:3901  [df-ymk,bespin,belgium,max]         bespin   500.0 GB          266.0 GB (53.2%)
68e74be3672f9cc0  pamplemousse  [2001:912:1ac0:2200::201]:3901                  [pamplemousse,corrin,france,zorun]  corrin   gateway           N/A

So we are going to build a new layout to:

  • switch the neptune nodes to gateway
  • switch the corrin nodes to storage

After adding the two corrin nodes and switching 2 of the 3 neptune nodes to gateway:

==== NEW CLUSTER LAYOUT AFTER APPLYING CHANGES ====
ID                Tags                              Zone     Capacity   Usable capacity
0a03ab7c082ad929  ananas,scorpio,france,adrien      scorpio  2.0 TB     1000.0 GB (50.0%)
17ee03c6b81d9235  df-ykl,bespin,belgium,max         bespin   500.0 GB   500.0 GB (100.0%)
2032d0a37f249c4a  abricot,scopio,france,adrien      scorpio  2.0 TB     1000.0 GB (50.0%)
5fcb3b6e39db3dcb  concombre,neptune,france,alex     neptune  500.0 GB   500.0 GB (100.0%)
68e74be3672f9cc0  pamplemousse,corrin,france,zorun  corrin   1000.0 GB  1000.0 GB (100.0%)
8cf284e7df17d0fd  celeri,neptune,france,alex        neptune  gateway
942dd71ea95f4904  df-ymf,bespin,belgium,max         bespin   500.0 GB   500.0 GB (100.0%)
a717e5b618267806  courgette,neptune,france,alex     neptune  gateway
e5b9a31be37fa9b0  pasteque,corrin,france,zorun      corrin   1000.0 GB  1000.0 GB (100.0%)
fdfaf7832d8359e0  df-ymk,bespin,belgium,max         bespin   500.0 GB   500.0 GB (100.0%)

Zone redundancy: maximum

==== COMPUTATION OF A NEW PARTITION ASSIGNATION ====

Partitions are replicated 3 times on at least 3 distinct zones.

Optimal partition size:                     7.8 GB (5.8 GB in previous layout)
Usable capacity / total cluster capacity:   6.0 TB / 8.0 TB (75.0 %)
Effective capacity (replication factor 3):  2.0 TB

If the percentage is too low, it might be that the cluster topology and redundancy constraints are forcing the use of nodes/zones with small storage capacities.
You might want to move storage capacity between zones or relax the redundancy constraint.
See the detailed statistics below and look for saturated nodes/zones.

A total of 278 new copies of partitions need to be transferred.

scorpio             Tags                              Partitions        Capacity   Usable capacity
  0a03ab7c082ad929  ananas,scorpio,france,adrien      128 (0 new)       2.0 TB     1000.0 GB (50.0%)
  2032d0a37f249c4a  abricot,scopio,france,adrien      128 (0 new)       2.0 TB     1000.0 GB (50.0%)
  TOTAL                                               256 (256 unique)  4.0 TB     2.0 TB (50.0%)

bespin              Tags                              Partitions        Capacity   Usable capacity
  17ee03c6b81d9235  df-ykl,bespin,belgium,max         64 (0 new)        500.0 GB   500.0 GB (100.0%)
  942dd71ea95f4904  df-ymf,bespin,belgium,max         64 (0 new)        500.0 GB   500.0 GB (100.0%)
  fdfaf7832d8359e0  df-ymk,bespin,belgium,max         64 (0 new)        500.0 GB   500.0 GB (100.0%)
  TOTAL                                               192 (192 unique)  1.5 TB     1.5 TB (100.0%)

neptune             Tags                              Partitions        Capacity   Usable capacity
  5fcb3b6e39db3dcb  concombre,neptune,france,alex     64 (22 new)       500.0 GB   500.0 GB (100.0%)
  TOTAL                                               64 (64 unique)    500.0 GB   500.0 GB (100.0%)

corrin              Tags                              Partitions        Capacity   Usable capacity
  68e74be3672f9cc0  pamplemousse,corrin,france,zorun  128 (128 new)     1000.0 GB  1000.0 GB (100.0%)
  e5b9a31be37fa9b0  pasteque,corrin,france,zorun      128 (128 new)     1000.0 GB  1000.0 GB (100.0%)
  TOTAL                                               256 (256 unique)  2.0 TB     2.0 TB (100.0%)

Garage client crash

Commands

# add pasteque
garage layout assign e5b9a31be37fa9b0 -z corrin -c 1T -t pasteque,corrin,france,zorun
# switch pamplemousse from gateway to storage node
garage layout assign 68e74be3672f9cc0 -c 1T
# switch the neptune nodes to gateway
garage layout assign 8cf284e7df17d0fd -g
garage layout assign a717e5b618267806 -g
garage layout assign 5fcb3b6e39db3dcb -g

And the client panics!

[root@df-ykl:~]# docker exec -it 0bc65f9b208e /garage layout show
2024-11-28T21:18:34.226090Z  INFO garage_net::netapp: Connected to [2a02:a03f:6510:5102:6e4b:90ff:fe3b:e86c]:3901, negotiating handshake...
2024-11-28T21:18:34.268175Z  INFO garage_net::netapp: Connection established to 17ee03c6b81d9235
==== CURRENT CLUSTER LAYOUT ====
ID                Tags                              Zone     Capacity  Usable capacity
0a03ab7c082ad929  ananas,scorpio,france,adrien      scorpio  2.0 TB    744.2 GB (37.2%)
17ee03c6b81d9235  df-ykl,bespin,belgium,max         bespin   500.0 GB  494.2 GB (98.8%)
2032d0a37f249c4a  abricot,scopio,france,adrien      scorpio  2.0 TB    744.2 GB (37.2%)
5fcb3b6e39db3dcb  concombre,neptune,france,alex     neptune  500.0 GB  244.2 GB (48.8%)
68e74be3672f9cc0  pamplemousse,corrin,france,zorun  corrin   gateway
8cf284e7df17d0fd  celeri,neptune,france,alex        neptune  2.0 TB    1000.0 GB (50.0%)
942dd71ea95f4904  df-ymf,bespin,belgium,max         bespin   500.0 GB  500.0 GB (100.0%)
a717e5b618267806  courgette,neptune,france,alex     neptune  500.0 GB  244.2 GB (48.8%)
fdfaf7832d8359e0  df-ymk,bespin,belgium,max         bespin   500.0 GB  494.2 GB (98.8%)

Zone redundancy: maximum

Current cluster layout version: 25

==== STAGED ROLE CHANGES ====
ID                Tags                              Zone     Capacity
5fcb3b6e39db3dcb  concombre,neptune,france,alex     neptune  gateway
68e74be3672f9cc0  pamplemousse,corrin,france,zorun  corrin   1000.0 GB
8cf284e7df17d0fd  celeri,neptune,france,alex        neptune  gateway
a717e5b618267806  courgette,neptune,france,alex     neptune  gateway
e5b9a31be37fa9b0  pasteque,corrin,france,zorun      corrin   1000.0 GB

======== PANIC (internal Garage error) ========
panicked at layout/version.rs:653:43:
no entry found for key

Panics are internal errors that Garage is unable to handle on its own.
They can be caused by bugs in Garage's code, or by corrupted data in
the node's storage. If you feel that this error is likely to be a bug
in Garage, please report it on our issue tracker a the following address:

        https://git.deuxfleurs.fr/Deuxfleurs/garage/issues

Please include the last log messages and the the full backtrace below in
your bug report, as well as any relevant information on the context in
which Garage was running when this error occurred.

GARAGE VERSION: v1.0.0-rc1-hotfix-red-ftr-wquorum [features: k2v, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]

BACKTRACE:
   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>

The same problem occurs if the layout is committed after adding pasteque.
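For the upstream ticket, it may help to note that "no entry found for key" is the message Rust's standard `HashMap`/`BTreeMap` indexing (`map[&key]`) produces when the key is absent. The sketch below is only a guess at the failure mode, not the code at layout/version.rs:653: it assumes some per-zone table is built from storage nodes only, so a zone left with nothing but gateways (neptune here) has no entry. The names `storage_nodes_per_zone` and `storage_count` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical per-zone bookkeeping: zone name -> number of storage nodes.
/// Each role is (zone, capacity); gateway nodes have `None` as capacity.
fn storage_nodes_per_zone(roles: &[(&str, Option<u64>)]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for (zone, capacity) in roles {
        // Gateways carry no capacity, so they never create an entry for their zone.
        if capacity.is_some() {
            *counts.entry(zone.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Non-panicking lookup: a gateway-only zone yields 0 instead of a panic.
fn storage_count(counts: &HashMap<String, usize>, zone: &str) -> usize {
    counts.get(zone).copied().unwrap_or(0)
}

fn main() {
    // Simplified staged roles: all neptune nodes become gateways.
    let staged: [(&str, Option<u64>); 5] = [
        ("bespin", Some(500)),
        ("corrin", Some(1000)),
        ("neptune", None),
        ("neptune", None),
        ("neptune", None),
    ];
    let counts = storage_nodes_per_zone(&staged);

    // Indexing with `counts["neptune"]` would panic with "no entry found for key",
    // because no storage node ever inserted that zone.
    println!("neptune storage nodes: {}", storage_count(&counts, "neptune"));
}
```

If the real cause is similar, the staged layout shown above would be consistent with it: the panic only appears once the third and last neptune node is marked as gateway.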

Reference: Deuxfleurs/nixcfg#43