Garage cluster encounters resync problems and does not seem to recover #911

Open
opened 2024-12-18 21:09:18 +00:00 by Szetty · 2 comments

We have a Garage cluster deployed in Kubernetes, consisting of 8 nodes with a replication factor of 2. We had 6 nodes before, and when we added the two new ones we also had to do some shifting, so one node is now in a different zone.

Since we switched to 8 nodes there have been a lot of items in the resync queue and they do not seem to self-heal. There are also many resync errors. We are not sure how to proceed or what other information might be needed.

Does anyone have an idea what the problem could be?

Owner

What version of Garage are you using? Can you share what your garage status and layout look like? Can you also share the state of the workers on a couple of nodes? Did you update the tranquility setting?

Items can appear multiple times in the resync queue, so the absolute number of items in the resync queue does not necessarily indicate that nothing is happening.
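
For reference, the worker state and the tranquility setting can be inspected with the admin CLI on each node. A minimal sketch, assuming the v1.0 garage worker subcommands (double-check against garage worker --help):

  # list the background workers on this node with their queue and error counts
  garage worker list
  # show the current values of the worker variables, including resync-tranquility
  garage worker get
  # resync can be sped up by lowering tranquility and/or raising the worker count
  garage worker set resync-tranquility 0
  garage worker set resync-worker-count 4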

Author

garage stats:

Garage version: v1.0.0 [features: k2v, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
Rust compiler version: 1.73.0

Database engine: LMDB (using Heed crate)

Table stats:
  Table      Items     MklItems  MklTodo  GcTodo
  bucket_v2  4         5         0        0
  key        4         5         0        0
  object     19710544  25328775  0        0
  version    6235495   7979732   0        0
  block_ref  11549864  15041405  0        0

Block manager stats:
  number of RC entries (~= number of blocks): 11486142
  resync queue length: 2197
  blocks with resync errors: 1321

Storage nodes:
  ID                Hostname  Zone  Capacity  Part.  DataAvail               MetaAvail
  6491f54b516fe740  garage-3  hel1  6.0 TB    64     168.4 GB/7.6 TB (2.2%)  168.4 GB/7.6 TB (2.2%)
  b38df7f1cd6e64cc  garage-5  fsn1  6.0 TB    64     177.5 GB/7.6 TB (2.3%)  177.5 GB/7.6 TB (2.3%)
  c5a65adbdbfcabd8  garage-0  fsn1  6.0 TB    64     108.4 GB/7.6 TB (1.4%)  108.4 GB/7.6 TB (1.4%)
  da04daee442cc468  garage-1  hel1  6.0 TB    64     304.9 GB/7.6 TB (4.0%)  304.9 GB/7.6 TB (4.0%)
  fecb40744431de50  garage-2  fsn1  6.0 TB    64     172.3 GB/7.6 TB (2.3%)  172.3 GB/7.6 TB (2.3%)
  dd93d29d89331ec7  garage-7  hel1  6.0 TB    64     7.2 TB/7.6 TB (93.9%)   7.2 TB/7.6 TB (93.9%)
  a657965c7aaf3281  garage-4  fsn1  6.0 TB    64     119.6 GB/7.6 TB (1.6%)  119.6 GB/7.6 TB (1.6%)
  8b4e87778d470a0b  garage-6  hel1  6.0 TB    64     7.2 TB/7.6 TB (94.6%)   7.2 TB/7.6 TB (94.6%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 433.5 GB
  metadata: 433.5 GB

garage status:

==== HEALTHY NODES ====
ID                Hostname  Address              Tags        Zone  Capacity  DataAvail
fecb40744431de50  garage-2  10.233.122.237:3901  [garage-2]  fsn1  6.0 TB    172.6 GB (2.3%)
dd93d29d89331ec7  garage-7  10.233.66.60:3901    [garage-7]  hel1  6.0 TB    7.2 TB (93.9%)
c5a65adbdbfcabd8  garage-0  10.233.67.134:3901   [garage-0]  fsn1  6.0 TB    108.4 GB (1.4%)
6491f54b516fe740  garage-3  10.233.100.211:3901  [garage-3]  hel1  6.0 TB    168.6 GB (2.2%)
b38df7f1cd6e64cc  garage-5  10.233.64.58:3901    [garage-5]  fsn1  6.0 TB    177.8 GB (2.3%)
da04daee442cc468  garage-1  10.233.103.22:3901   [garage-1]  hel1  6.0 TB    303.9 GB (4.0%)
8b4e87778d470a0b  garage-6  10.233.72.241:3901   [garage-6]  hel1  6.0 TB    7.2 TB (94.6%)
a657965c7aaf3281  garage-4  10.233.115.160:3901  [garage-4]  fsn1  6.0 TB    119.6 GB (1.6%)

garage layout show:

==== CURRENT CLUSTER LAYOUT ====
ID                Tags      Zone  Capacity  Usable capacity
6491f54b516fe740  garage-3  hel1  6.0 TB    6.0 TB (100.0%)
8b4e87778d470a0b  garage-6  hel1  6.0 TB    6.0 TB (100.0%)
a657965c7aaf3281  garage-4  fsn1  6.0 TB    6.0 TB (100.0%)
b38df7f1cd6e64cc  garage-5  fsn1  6.0 TB    6.0 TB (100.0%)
c5a65adbdbfcabd8  garage-0  fsn1  6.0 TB    6.0 TB (100.0%)
da04daee442cc468  garage-1  hel1  6.0 TB    6.0 TB (100.0%)
dd93d29d89331ec7  garage-7  hel1  6.0 TB    6.0 TB (100.0%)
fecb40744431de50  garage-2  fsn1  6.0 TB    6.0 TB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 4

Can you also share the state of the workers on a couple of nodes?

Not sure exactly what to share; we have logs and metrics in Grafana if that can help.

Did you update the tranquility setting?

As I understand it, this would be a type of repair (related to scrubs); the only repair I tried was garage repair --all-nodes --yes tables, run multiple times.
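
For reference, here is a minimal sketch of how the blocks with resync errors could be inspected on a node, assuming the v1.0 garage block subcommands (the block hash below is a placeholder):

  # list the blocks currently in the resync error state, with their retry counters
  garage block list-errors
  # show which objects and versions reference a given block hash
  garage block info <block_hash>
  # put all errored blocks back at the head of the resync queue for an immediate retry
  garage block retry-now --all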

Reference: Deuxfleurs/garage#911