Garage cluster encounters resync problems and does not seem to recover #911

Closed
opened 2024-12-18 21:09:18 +00:00 by Szetty · 11 comments

We have a Garage cluster deployed in Kubernetes, consisting of 8 nodes with a replication factor of 2. We had 6 nodes before, and as we added two more, we also needed to do some shifting, so one node is now in a different zone.

Since we switched to 8 nodes there are a lot of items in the resync queue and they do not seem to self-heal. There are also many resync errors. I am not sure how to proceed or what other information might be necessary.

Does anyone have an idea what the problem could be?

Owner

Which version of Garage are you using? Can you share what your garage status and layout look like? Can you also share the state of the workers on a couple of nodes? Did you update the tranquility setting?

Items can appear multiple times in the resync queue, so the absolute number of items in the resync queue does not necessarily indicate that nothing is happening.
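For reference, this information can be gathered with the standard CLI, run from inside one of the Garage containers; a minimal sketch (the exact binary path may differ in your deployment):

```
# Cluster membership and current layout
/garage status
/garage layout show

# Per-node table and block manager statistics (includes the resync queue length)
/garage stats

# Per-node worker state: running background workers and their tunable variables
# (resync tranquility, resync worker count, scrub schedule)
/garage worker list
/garage worker get
```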

Author

`garage stats`:

Garage version: v1.0.0 [features: k2v, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
Rust compiler version: 1.73.0

Database engine: LMDB (using Heed crate)

Table stats:
  Table      Items     MklItems  MklTodo  GcTodo
  bucket_v2  4         5         0        0
  key        4         5         0        0
  object     19710544  25328775  0        0
  version    6235495   7979732   0        0
  block_ref  11549864  15041405  0        0

Block manager stats:
  number of RC entries (~= number of blocks): 11486142
  resync queue length: 2197
  blocks with resync errors: 1321

Storage nodes:
  ID                Hostname  Zone  Capacity  Part.  DataAvail               MetaAvail
  6491f54b516fe740  garage-3  hel1  6.0 TB    64     168.4 GB/7.6 TB (2.2%)  168.4 GB/7.6 TB (2.2%)
  b38df7f1cd6e64cc  garage-5  fsn1  6.0 TB    64     177.5 GB/7.6 TB (2.3%)  177.5 GB/7.6 TB (2.3%)
  c5a65adbdbfcabd8  garage-0  fsn1  6.0 TB    64     108.4 GB/7.6 TB (1.4%)  108.4 GB/7.6 TB (1.4%)
  da04daee442cc468  garage-1  hel1  6.0 TB    64     304.9 GB/7.6 TB (4.0%)  304.9 GB/7.6 TB (4.0%)
  fecb40744431de50  garage-2  fsn1  6.0 TB    64     172.3 GB/7.6 TB (2.3%)  172.3 GB/7.6 TB (2.3%)
  dd93d29d89331ec7  garage-7  hel1  6.0 TB    64     7.2 TB/7.6 TB (93.9%)   7.2 TB/7.6 TB (93.9%)
  a657965c7aaf3281  garage-4  fsn1  6.0 TB    64     119.6 GB/7.6 TB (1.6%)  119.6 GB/7.6 TB (1.6%)
  8b4e87778d470a0b  garage-6  hel1  6.0 TB    64     7.2 TB/7.6 TB (94.6%)   7.2 TB/7.6 TB (94.6%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 433.5 GB
  metadata: 433.5 GB

`garage status`:

==== HEALTHY NODES ====
ID                Hostname  Address              Tags        Zone  Capacity  DataAvail
fecb40744431de50  garage-2  10.233.122.237:3901  [garage-2]  fsn1  6.0 TB    172.6 GB (2.3%)
dd93d29d89331ec7  garage-7  10.233.66.60:3901    [garage-7]  hel1  6.0 TB    7.2 TB (93.9%)
c5a65adbdbfcabd8  garage-0  10.233.67.134:3901   [garage-0]  fsn1  6.0 TB    108.4 GB (1.4%)
6491f54b516fe740  garage-3  10.233.100.211:3901  [garage-3]  hel1  6.0 TB    168.6 GB (2.2%)
b38df7f1cd6e64cc  garage-5  10.233.64.58:3901    [garage-5]  fsn1  6.0 TB    177.8 GB (2.3%)
da04daee442cc468  garage-1  10.233.103.22:3901   [garage-1]  hel1  6.0 TB    303.9 GB (4.0%)
8b4e87778d470a0b  garage-6  10.233.72.241:3901   [garage-6]  hel1  6.0 TB    7.2 TB (94.6%)
a657965c7aaf3281  garage-4  10.233.115.160:3901  [garage-4]  fsn1  6.0 TB    119.6 GB (1.6%)

`garage layout show`:

==== CURRENT CLUSTER LAYOUT ====
ID                Tags      Zone  Capacity  Usable capacity
6491f54b516fe740  garage-3  hel1  6.0 TB    6.0 TB (100.0%)
8b4e87778d470a0b  garage-6  hel1  6.0 TB    6.0 TB (100.0%)
a657965c7aaf3281  garage-4  fsn1  6.0 TB    6.0 TB (100.0%)
b38df7f1cd6e64cc  garage-5  fsn1  6.0 TB    6.0 TB (100.0%)
c5a65adbdbfcabd8  garage-0  fsn1  6.0 TB    6.0 TB (100.0%)
da04daee442cc468  garage-1  hel1  6.0 TB    6.0 TB (100.0%)
dd93d29d89331ec7  garage-7  hel1  6.0 TB    6.0 TB (100.0%)
fecb40744431de50  garage-2  fsn1  6.0 TB    6.0 TB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 4

> Can you also share the state of the workers on a couple of nodes?

I am not sure what to share exactly; we have logs and metrics in Grafana if that helps.

> Did you update the tranquility setting?

As far as I can see this would be a type of repair (related to scrubs); the only repair I tried was `garage repair --all-nodes --yes tables`, run multiple times.

Author

@maximilien or anyone else: this is still an issue for us, the new nodes are still not serving traffic. Any ideas on what to do?

Owner

Sorry, I somehow missed your last message. Can you share a screenshot of the Garage Grafana dashboard if you have it set up? I'm especially interested in the evolution of the "Resync queue length" and "Table GC queue length" graphs, as well as the "resync errored blocks".

How much bandwidth do you have between the sites?
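One way to measure this (a rough sketch, assuming you can exec into the Garage pods or run debug pods next to them, and that iperf3 is available there):

```
# On a pod in one zone (e.g. fsn1), start an iperf3 server:
iperf3 -s

# From a pod in the other zone (hel1), point at that pod's IP, e.g. garage-0's:
iperf3 -c 10.233.67.134 -t 30
```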

maximilien self-assigned this 2025-01-08 18:39:08 +00:00
maximilien added the kind/performance label 2025-01-08 18:39:24 +00:00
Author

> Sorry, I somehow missed your last message. Can you share a screenshot of the Garage Grafana dashboard if you have it set up? I'm especially interested in the evolution of the "Resync queue length" and "Table GC queue length" graphs, as well as the "resync errored blocks".

So I have prepared the evolution of the 3 requested metrics (the period is from 2024-12-12 23:33:57 UTC+2 to 2025-01-08 23:05:08 UTC+2):

  1. Resync queue length (named `block_resync_queue_length`) - we have multiple samples here to avoid the outliers
  2. Table GC queue length (named `table_gc_todo_queue_length`) - as I see there are multiple tables here, I have summed over them for now (we can be more granular if needed)
  3. Resync errored blocks (named `block_resync_errored_blocks`) - we have multiple samples here to avoid the outliers

> How much bandwidth do you have between the sites?

I have measured some samples and found the following:

garage-1 -> garage-0 = 19.3 MB/s
garage-4 -> garage-0 = 18.1 MB/s
garage-3 -> garage-0 = 16.2 MB/s
garage-6 -> garage-0 = 17.5 MB/s
garage-7 -> garage-0 = 18.0 MB/s
garage-2 -> garage-1 = 21.2 MB/s
garage-5 -> garage-1 = 16.8 MB/s
Author

I have made another screenshot for the metric `block_resync_queue_length` where I changed the period a bit to avoid the big outliers.

Owner

@Szetty can you give me the resync queue on garage nodes 6 and 7 between Dec 18th and today, with the Y axis scaled to the graph (not set at zero; I want to see the variations)? But my initial estimate is simply that the cluster is syncing, very slowly, due to the amount of data you have, which (assuming you have only Garage data on the disks) seems to be on the order of 10 TB or more per zone, and relatively slow networking, not even gigabit.

So the answer is to wait. Given the number of blocks that need to be moved and the bandwidth you have, it could take several weeks for Garage to sync. I also see that you are running Garage 1.0.0; I suggest you consider upgrading to 1.0.1 later.
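As a rough back-of-the-envelope estimate (assuming on the order of 10 TB has to land on each new node, the ~18 MB/s per-link throughput you measured above, and ignoring tranquility pauses and regular request traffic):

```
# 10 TB ≈ 10,000,000 MB; at 18 MB/s that is ~555,000 s of pure transfer time
echo $(( 10 * 1000 * 1000 / 18 / 86400 ))   # ≈ 6 days per node, before any overhead
```

With tranquility pauses, a single resync worker per node, and ongoing request traffic, several weeks is plausible.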

What we can do to increase the resync speed (at the cost of performance for serving actual requests) is to tweak the tranquility. Can you exec into node 6 or 7 and do a `/garage worker get`? (There is no shell in the container, so you have to exec the command directly.)

You should get an output like this one

17ee03c6b81d9235  lifecycle-last-completed    2025-01-09
17ee03c6b81d9235  resync-tranquility          1
17ee03c6b81d9235  resync-worker-count         2
17ee03c6b81d9235  scrub-corruptions_detected  0
17ee03c6b81d9235  scrub-last-completed        2025-01-02T19:41:56.038Z
17ee03c6b81d9235  scrub-next-run              2025-01-28T01:12:19.038Z
17ee03c6b81d9235  scrub-tranquility           4

Depending on your values, you might then want to lower the `resync-tranquility` setting; 1 would be a good value, or you can try going to zero, while watching your application and system metrics to see what the impact is. Depending on your CPU, memory and system resources you can also adjust `resync-worker-count`, although we usually advise setting it to the number of CPUs available.

To set the values you can use `/garage worker set resync-tranquility <0-n>` and `/garage worker set resync-worker-count <0-n>`. Those settings are local to the node, so you'll need to run those commands on nodes 6 and 7 separately, for example as sketched below.
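From a machine with cluster access, something along these lines should work (the pod names here are assumptions; adjust them to your actual pod naming):

```
# Lower the tranquility and raise the worker count on the two new nodes
kubectl exec garage-6 -- /garage worker set resync-tranquility 1
kubectl exec garage-6 -- /garage worker set resync-worker-count 2
kubectl exec garage-7 -- /garage worker set resync-tranquility 1
kubectl exec garage-7 -- /garage worker set resync-worker-count 2
```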

maximilien added the scope/documentation label 2025-01-09 23:09:00 +00:00
Author

I have attached the resync queue for garage-6 and garage-7 over the requested time range: Resync queue (garage-6 and garage-7).png

I have run `/garage worker get` on garage-6:

lifecycle-last-completed    2025-01-10
resync-tranquility          2
resync-worker-count         1
scrub-corruptions_detected  0
scrub-last-completed        1970-01-01T00:00:00.000Z
scrub-next-run              2025-01-18T13:10:08.208Z
scrub-tranquility           4

and garage-7:

lifecycle-last-completed    2025-01-10
resync-tranquility          2
resync-worker-count         1
scrub-corruptions_detected  0
scrub-last-completed        1970-01-01T00:00:00.000Z
scrub-next-run              2025-01-12T00:53:54.339Z
scrub-tranquility           4

Couple of questions about the improvements:

  1. Would upgrading to 1.0.1 help here? If yes, how?
  2. What does `resync-tranquility` do?
  3. We have only one CPU set per Garage instance (it seemed there was no need for more); should we allocate more and adjust `resync-worker-count` too?
  4. We also have the possibility to start from scratch; do you think it would be faster to rewrite all the data?

I have been looking through the logs a bit and found these when starting garage-6: logs when starting garage-6.png

The warnings about the ring are still present on both garage-6 and garage-7; is this anything to worry about?

Owner

Garage 1.0.1 will definitely help with the layout issue you quoted above, and maybe with metadata performance as well, depending on how much memory you have and how big your LMDB database is. So please do upgrade first, especially if you already have a backup of the data in this cluster.

The tranquility factor causes Garage to "sleep" for a certain duration between tasks. From my understanding (and @trinity-1686a can maybe correct me here), if resyncing a block takes 1s on average and you have a tranquility factor of 2, then Garage will sleep 2s between each sync.

I suggest you start by lowering the tranquility to 1 and setting the worker count to 2 on both nodes, and take a look at how the resync queue evolves. You can likely keep your current CPU allocation; just keep an eye on how the CPU usage evolves.
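Under that model, a quick sketch of what to expect from this change (assuming the network and disks are not already the bottleneck):

```
# Resync throughput ∝ workers / (1 + tranquility), if each block takes time T to resync
# current:   1 worker,  tranquility 2  ->  1 / (3T)
# suggested: 2 workers, tranquility 1  ->  2 / (2T) = 1 / T    (roughly 3x faster)
```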

I don't believe re-ingesting everything will necessarily be faster, as you'd also have to rewrite the blocks to the other nodes, which seem to be fine for now.

Author

Okay, it seems like it is coming around now; upgrading the version and changing the tranquility and worker count seem to have helped. Thank you for your patience and for the help :)

Owner

You're welcome, glad we got you out of this!
