Garage resync queue length just grows #934

Open
opened 2025-01-26 19:44:57 +00:00 by Szetty · 1 comment

Hello again,

I am back with a similar problem to the one I reported before, but with slightly different behaviour. While the last problem was mostly caused by new nodes and resync not keeping up, that is probably not the issue this time: we did not change our Garage cluster at all, and I kept the resync settings that were used to fix the previous problem (shown below).

First of all, an overview of all metrics: ![Screenshot 2025-01-26 at 21.32.43.png](/attachments/9a9e102e-19ad-44df-b54c-8afa378afa14)

We can observe that our previous problem mostly finished on the 14th of January, and it seems the data was synchronized correctly, because all nodes now have similar free disk space (the new ones started at ~7 TB, while the old ones were close to exhaustion, barely surviving at ~100 GB and often being taken out of commission by k8s due to low disk space).

Not long after, a new problem started, though I am not sure why: ![Screenshot 2025-01-26 at 21.40.46.png](/attachments/a5256963-586d-4f37-aed4-dade1b1c18b8)

Worker settings on the nodes:

```
kubectl exec --stdin --tty garage-5 -- /garage worker get
lifecycle-last-completed    2025-01-26
resync-tranquility          1
resync-worker-count         2
scrub-corruptions_detected  0
scrub-last-completed        2025-01-15T23:43:42.623Z
scrub-next-run              2025-02-17T12:13:42.623Z
scrub-tranquility           4
```

```
kubectl exec --stdin --tty garage-7 -- /garage worker get
lifecycle-last-completed    2025-01-26
resync-tranquility          1
resync-worker-count         2
scrub-corruptions_detected  0
scrub-last-completed        2025-01-13T07:46:31.272Z
scrub-next-run              2025-02-11T21:32:53.272Z
scrub-tranquility           4
```
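
For completeness: these are the values we kept from last time; they are the kind of settings that get adjusted at runtime with `garage worker set`. A rough sketch of what that looks like through kubectl, as an illustration rather than a transcript of the exact commands we ran back then:

```
# Illustrative only: how resync settings like the above are usually applied
# at runtime with `garage worker set` (adjust pod names to your setup).
kubectl exec --stdin --tty garage-5 -- /garage worker set resync-worker-count 2
kubectl exec --stdin --tty garage-5 -- /garage worker set resync-tranquility 1
```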

Stats as seen from garage-5:

```
Garage version: v1.0.1 [features: k2v, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
Rust compiler version: 1.77.0

Database engine: LMDB (using Heed crate)

Table stats:
  Table      Items    MklItems  MklTodo  GcTodo
  bucket_v2  4        5         0        0
  key        4        5         0        0
  object     0        0         0        0
  version    4685153  6000678   0        0
  block_ref  8679784  11300284  0        1

Block manager stats:
  number of RC entries (~= number of blocks): 9397559
  resync queue length: 22932860
  blocks with resync errors: 132067

Storage nodes:
  ID                Hostname  Zone  Capacity  Part.  DataAvail              MetaAvail
  dd93d29d89331ec7  garage-7  hel1  6.0 TB    64     2.0 TB/7.6 TB (25.7%)  2.0 TB/7.6 TB (25.7%)
  a657965c7aaf3281  garage-4  fsn1  6.0 TB    64     1.9 TB/7.6 TB (25.1%)  1.9 TB/7.6 TB (25.1%)
  da04daee442cc468  garage-1  hel1  6.0 TB    64     2.0 TB/7.6 TB (26.4%)  2.0 TB/7.6 TB (26.4%)
  8b4e87778d470a0b  garage-6  hel1  6.0 TB    64     1.9 TB/7.6 TB (25.3%)  1.9 TB/7.6 TB (25.3%)
  6491f54b516fe740  garage-3  hel1  6.0 TB    64     2.0 TB/7.6 TB (25.7%)  2.0 TB/7.6 TB (25.7%)
  fecb40744431de50  garage-2  fsn1  6.0 TB    64     1.9 TB/7.6 TB (24.6%)  1.9 TB/7.6 TB (24.6%)
  b38df7f1cd6e64cc  garage-5  fsn1  6.0 TB    64     1.9 TB/7.6 TB (25.6%)  1.9 TB/7.6 TB (25.6%)
  c5a65adbdbfcabd8  garage-0  fsn1  6.0 TB    64     1.9 TB/7.6 TB (24.9%)  1.9 TB/7.6 TB (24.9%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 7.5 TB
  metadata: 7.5 TB
```
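
The graphs in the screenshots above come from the same resync counters; they can also be pulled directly from the metrics endpoint. A sketch, assuming the default admin/metrics port 3903 and the standard `block_resync_*` metric names:

```
# Sketch: read the raw resync counters behind the dashboards above.
# Assumes the admin API listens on the default port 3903; if a
# metrics_token is configured, add an "Authorization: Bearer <token>" header.
kubectl port-forward garage-5 3903:3903 &
curl -s http://localhost:3903/metrics | grep -E 'block_resync_(queue_length|errored_blocks)'
```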

Any ideas what could be going on? Do you need any more information?

Thank you.

With respect,
Arnold

Owner

Maybe taking a look at the logs on those nodes would be interesting. I suspect they may have some kind of connectivity issue that prevents Garage from syncing blocks from other nodes, causing the resync queue to blow up.
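
A rough way to do that, plus checking which blocks are erroring and whether all nodes currently see each other as up (illustrative commands; adjust pod names and the time window as needed):

```
# Illustrative: scan recent logs for resync/connectivity errors,
# list blocks that failed to resync, and check cluster connectivity.
kubectl logs garage-5 --since=1h | grep -iE 'resync|error|timeout'
kubectl exec --stdin --tty garage-5 -- /garage block list-errors | head -n 20
kubectl exec --stdin --tty garage-5 -- /garage status
```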

maximilien added the action more-info-needed label 2025-01-27 10:36:06 +00:00