garage block list-errors shows errors after cluster layout change and node reboot #810

Open
opened 2024-04-18 06:58:20 +00:00 by flokli · 10 comments
Contributor

I had a 3 node garage cluster, using the following settings:

compression_level = 3
data_dir = "/tank/garage/data"
db_engine = "lmdb"
metadata_dir = "/tank/garage/meta"
replication_mode = "3"

garage-0.9.4, on aarch64-linux (NixOS). Underlying filesystem is btrfs.


I added a fourth node to the cluster:

garage layout assign -z kadriorg -c 1.9TB bd94fdc709ea3f87

They're all in the same zone.

I applied the layout change, and things started to rebalance.[1]
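
For reference, a minimal sketch of the apply step (the version number here is illustrative, not necessarily the exact one used):

garage layout show               # review the staged role changes
garage layout apply --version 2  # commit the new layout; rebalancing starts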

While the cluster was rebalancing (which took a few hours), I rebooted the three old nodes (cf8ae7c345e3efbb, afd7e76fa5bb4e20, bd94fdc709ea3f87) sequentially (to apply a kernel change). I assumed this wouldn't be risky, since multiple replicas were still around and the other three nodes were up at any given time.

During the rebalancing, I saw a bunch of connection errors w.r.t. quorum, though:

logs_n4.zstd:Apr 17 10:19:58 n4-rk1 garage[1149]: 2024-04-17T10:19:58.736000Z ERROR garage_util::background::worker: Error in worker object sync (TID 23): Could not reach quorum of 3. 2 of 3 request succeeded, others returned errors: ["Network error: Not connected: cf8ae7c345e3efbb"]
logs_n4.zstd:Apr 17 10:20:09 n4-rk1 garage[1149]: 2024-04-17T10:20:09.741707Z ERROR garage_block::resync: Error when resyncing ec19d5bf93416547: NeedBlockQuery RPC

After all rebalancing operations finished (judging from the network interface traffic), I saw 4 blocks with resync errors:

[root@n1-rk1:~]# garage status
==== HEALTHY NODES ====
ID                Hostname  Address                                     Tags  Zone      Capacity  DataAvail
bd94fdc709ea3f87  n3-rk1    [fd42:172:2042:0:e89a:61ff:fe07:8440]:3901  []    kadriorg  3.5 TB    1.8 TB (44.3%)
7febdd18a6e81380  n1-rk1    [fd42:172:2042:0:e0fb:93ff:feb8:8d74]:3901  []    kadriorg  3.5 TB    2.4 TB (59.9%)
cf8ae7c345e3efbb  n2-rk1    [fd42:172:2042:0:dcc6:b6ff:fe53:502b]:3901  []    kadriorg  3.5 TB    1.8 TB (44.2%)
afd7e76fa5bb4e20  n4-rk1    [fd42:172:2042:0:438:80ff:fef9:6720]:3901   []    kadriorg  3.5 TB    1096.9 GB (27.4%)

[root@n1-rk1:~]# garage stats

Garage version: cargo:0.9.4 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
Rust compiler version: 1.76.0

Database engine: LMDB (using Heed crate)

Table stats:
  Table      Items    MklItems  MklTodo  GcTodo
  bucket_v2  4        5         0        0
  key        1        1         0        0
  object     173025   220355    0        0
  version    107582   139924    0        1967
  block_ref  2234373  2460476   0        11872

Block manager stats:
  number of RC entries (~= number of blocks): 2178118
  resync queue length: 2128862
  blocks with resync errors: 4

If values are missing above (marked as NC), consider adding the --detailed flag (this will be slow).

Storage nodes:
  ID                Hostname  Zone      Capacity  Part.  DataAvail                 MetaAvail
  afd7e76fa5bb4e20  n4-rk1    kadriorg  3.5 TB    192    1096.9 GB/4.0 TB (27.4%)  1096.9 GB/4.0 TB (27.4%)
  bd94fdc709ea3f87  n3-rk1    kadriorg  3.5 TB    192    1.8 TB/4.0 TB (44.3%)     1.8 TB/4.0 TB (44.3%)
  cf8ae7c345e3efbb  n2-rk1    kadriorg  3.5 TB    192    1.8 TB/4.0 TB (44.2%)     1.8 TB/4.0 TB (44.2%)
  7febdd18a6e81380  n1-rk1    kadriorg  3.5 TB    192    2.4 TB/4.0 TB (59.9%)     2.4 TB/4.0 TB (59.9%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 1.5 TB
  metadata: 1.5 TB

[root@n1-rk1:~]# garage block list-errors
Hash                                                              RC  Errors  Last error   Next try
ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd  1   1       9 hours ago  in asap
bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c  1   1       9 hours ago  in asap
baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809  1   1       9 hours ago  in asap
bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44  1   1       9 hours ago  in asap

[root@n1-rk1:~]# garage block info ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd
Block hash: ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd
Refcount: 1

Version           Bucket            Key                                                                              MPU  Deleted
25017aad3217158a  f3083a7badc99932  restic/data/11/11e3ca1fed9e691c8cadbc8b0312aa143e663bcab708f01e396193258e9b78a8       false

[root@n1-rk1:~]# garage block info bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
Block hash: bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
Refcount: 1

Version           Bucket            Key                                                                              MPU  Deleted
05c7045a3957d1a6  f3083a7badc99932  restic/data/d6/d61be1eb551056054ed50469314acfe5e3de4e30b41d5c996f7c5ab0b2643063       false

[root@n1-rk1:~]# garage block info baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809
Block hash: baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809
Refcount: 1

Version           Bucket            Key                MPU  Deleted
2e9450a950372b15  a5f5dbe525c11841  [snip]       false

[root@n1-rk1:~]# garage block info bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44
Block hash: bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44
Refcount: 1

Version           Bucket            Key                                                                                      MPU  Deleted
6c3805bb8641c79a  a5f5dbe525c11841  [snip]       false

[root@n1-rk1:~]# garage block list-errors
Hash                                                              RC  Errors  Last error    Next try
ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd  1   1       1 minute ago  in asap
bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c  1   1       9 hours ago   in asap
baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809  1   1       9 hours ago   in asap
bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44  1   1       9 hours ago   in asap

[root@n1-rk1:~]# garage block list-errors
Hash                                                              RC  Errors  Last error      Next try
ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd  1   1       11 minutes ago  in asap
bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c  1   1       10 hours ago    in asap
baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809  1   1       10 hours ago    in asap
bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44  1   1       10 hours ago    in asap

[root@n1-rk1:~]# garage block retry-now --all
4 blocks returned in queue for a retry now (check logs to see results)

[root@n1-rk1:~]# garage block list-errors
Hash                                                              RC  Errors  Last error    Next try
ba917c82b0d6b2758ef415ee4294030836ee4b0ad80eabb079e8db92870c5afd  1   1       1 minute ago  in asap
bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c  1   1       1 minute ago  in asap
baf0193a401c0756cee135e25a7df34d1dcc521e57d802a6ada5917a8bea5809  1   1       1 minute ago  in asap
bafc0e8c8d9d1272cada292b4064e679fa0afca485b379773af1b5aec161bb44  1   1       1 minute ago  in asap

As can be seen in the output above, they're not part of an ongoing multipart upload and not deleted, and a garage block retry-now --all did not get them unstuck.

I manually looked in the filesystem on all 4 nodes and also couldn't find these blocks in /tank/data, so they seemed to be indeed gone.
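
A check along these lines (assuming Garage's two-level hex prefix layout under the configured data_dir) came up empty on every node:

# per affected hash, on each node
find /tank/garage/data -type f -name 'ba917c82*'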

What's interesting is that they all have the same block hash prefix.


I only have logs for these machines for their current boot, so only after each node got rebooted, and I'm not 100% certain about the timestamps on these logs. Grepping for the block hashes did yield this, though:

logs_n3.zstd:Apr 17 16:15:29 n3-rk1 garage[1141]: 2024-04-17T16:13:15.757413Z  INFO garage_block::resync: Resync block ba917c82b0d6b275: offloading and deleting
logs_n3.zstd:Apr 17 16:15:29 n3-rk1 garage[1141]: 2024-04-17T16:13:15.892371Z  INFO garage_block::resync: Deleting unneeded block ba917c82b0d6b275, offload finished (1 / 3)

… suggesting garage was confident it had actually persisted the block elsewhere, and afterwards deleted it from node3?


I kicked off a restic check --read-data overnight (it's still ongoing). Curiously, this morning the list of errors is empty:

[root@n3-rk1:~]# garage block list-errors
Hash  RC  Errors  Last error  Next try

And a garage block info on one of them showed the block as being present:

[root@n1-rk1:~]# garage block info bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
Block hash: bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
Refcount: 1

Version           Bucket            Key                                                                              MPU  Deleted
05c7045a3957d1a6  f3083a7badc99932  restic/data/d6/d61be1eb551056054ed50469314acfe5e3de4e30b41d5c996f7c5ab0b2643063       false

I did not peek into the filesystem.

Now, an hour later, all info about this block is gone:

garage block info bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
Error: No matching block found

This is quite confusing; I'm not entirely sure what's going on. Maybe I'm misinterpreting some of the output?


[1] In fact, I first applied a layout change giving all nodes less capacity than before, as I misremembered the per-node capacity to be ~2TB, while it was actually ~4TB. Then I applied a second layout change, assigning them all ~3.5TB.

Owner

Did you apply the second layout change while the first change was still in progress?

Author
Contributor

Yes, the second layout change was minutes after the first, the rebalancing took some hours to complete.

Owner

It is very possible that your issue was caused by the two subsequent layout changes, as I'm not sure if garage keeps more than two layouts in the running set. This could have caused garage to consider the first rebalancing "done", although I'm still puzzled why it would not keep a copy of the block on at least 2 of the old nodes...

Author
Contributor

The restic check --read-data run finished and it didn't find any discrepancies. So it seems no data was lost?

I noticed that for all four blocks, garage block info $block_id still reports "Error: No matching block found" only if executed on node3. Is this expected to be a node-local command?

Any operations you'd advise to run to further inspect this?

The `restic check --read-data` run finished and it didn't find any discrepancies. So it seems no data was lost? I noticed for all four blocks, `garage block info $block_id` still reports "Error: No matching block found" only if executed on node3. Is this expected to be a node-local command? Any operations you'd advise to run to further inspect this?
Owner

I think this is mostly normal behavior of Garage, especially given that:

  • no data that you stored in the S3 API was lost
  • the situation normalized itself after some time

The block errors you saw were probably on a node where the metadata was not fully up-to-date yet, so it thought it still needed the block (the refcount was 1), while in fact the object had been deleted on the cluster already (and other nodes were already aware of it and had already deleted the block). This situation is generally fixed automatically pretty fast, but it might have taken a bit longer given that you were changing the cluster layout around that time. To fix the situation faster, you could have done a garage repair -a tables before the garage block retry-now, and if that was not enough you could also have tried garage repair -a versions and garage repair -a block-refs.
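
Concretely, a sketch of that recovery sequence, run on the node showing the errors:

garage repair -a tables         # re-sync the metadata tables first
garage block retry-now --all    # then retry the blocks in error
# if that is not enough:
garage repair -a versions
garage repair -a block-refs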

Doing two subsequent layout changes in a small interval is supported by Garage. It could accentuate small internal discrepancies such as this one but the global consistency from the point of view of the S3 API should be preserved.

lx closed this issue 2024-04-21 09:03:25 +00:00
Owner

I noticed for all four blocks, garage block info $block_id still reports "Error: No matching block found" only if executed on node3. Is this expected to be a node-local command?

Yes, it's a node-local command. I think in your case the block_ref entries or block reference counter on node3 are not garbage collected because they have been marked deleted more recently, whereas on the other nodes they were marked deleted long ago and were garbage collected already.
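
So to compare the node-local view for a given hash, the same command can simply be run on each node, e.g. over ssh (hostnames as in the status output above):

for h in n1-rk1 n2-rk1 n3-rk1 n4-rk1; do
  ssh root@$h garage block info bab15d11b576aa60077d64f2317e966ce352b44b679962f6ce8be88aa2551f5c
done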

flokli reopened this issue 2024-04-22 07:08:27 +00:00
Author
Contributor

The scrubs finished, and one node (node4) recorded "5798 persistent errors".

[root@n4-rk1:~]# garage worker get
afd7e76fa5bb4e20  lifecycle-last-completed    2024-04-21
afd7e76fa5bb4e20  resync-tranquility          2
afd7e76fa5bb4e20  resync-worker-count         1
afd7e76fa5bb4e20  scrub-corruptions_detected  5798
afd7e76fa5bb4e20  scrub-last-completed        2024-04-21T02:18:48.374Z
afd7e76fa5bb4e20  scrub-next-run              2024-05-23T00:02:41.374Z
afd7e76fa5bb4e20  scrub-tranquility           0

There's a bunch of .corrupted files in the data directory:

./37/05/370514b13d74ce07f243ce027934208aa69cf6c3fc4d0ed9470d194e8ec08714.zst.corrupted
./37/69/37697247aa4caf22ca928a9c3e1e4f303ba86f25bef1e46b993a60e2275b194b.zst.corrupted
./37/0c/370cb220c21973488db0954200be24c730e07efd2b0694d786fa8e711f75f702.zst.corrupted
./37/f2/37f25d1863ec23ca04f7d3553b48181376fecc56c6a1e8f5b32057b76a647ffa.zst.corrupted
./37/f2/37f210c76e907541402354d330fc819c7f072f4fabefda9a902ec33976336efa.zst.corrupted
./37/fb/37fbb8380c250103428ad3010266ee77d9009d6c4ce4eb2d4621788cdfd2959c.zst.corrupted
./2a/a6/2aa6e1b0a656c710d873fd691905d7274dcce8a568dc519b97f7af7a90654c92.zst.corrupted
./2a/d4/2ad4571d6262bcf0b058812a72b553eb9d09e5882959bd416fee2b173623e996.zst.corrupted
./2a/3a/2a3a2e8b8532465e1b9e272dffeda6bec69c8a8a4b7454d827388a05edd2b73b.zst.corrupted
./2a/8e/2a8e0a56adacd5f9b62f7e5a46f8f32a593d99d7f7c46e3a8f304c863d8ff7d5.zst.corrupted
./2a/45/2a451f64604efea46ff5809f319c02b3bb41aabf46aa899eae8cf83a8d040063.zst.corrupted
./2a/2f/2a2fc8ebfe26e345455ed978a4714f1b49e3583fd5b260291ef2ee0fd471348e.zst.corrupted
./2a/a3/2aa3ee73eff3bb5b770fad87e81562d5a792343e095db0d98b112b61dc09377a.zst.corrupted
[…]

From a quick glance, they all seem to be 0-byte files, and there seems to be a non-corrupted file alongside each of them, so I think I could just delete the .corrupted files and move on?
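
If so, a way to check and clean up, assuming the data_dir from the config above (and after double-checking that a healthy .zst sits next to each one):

# list zero-byte .corrupted files
find /tank/garage/data -type f -name '*.corrupted' -size 0
# once satisfied, remove them
find /tank/garage/data -type f -name '*.corrupted' -size 0 -delete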

Author
Contributor

As suggested in the Matrix channel, I checked the filesystem.

There's nothing in dmesg suggesting any problems with the filesystem, and a btrfs scrub on the filesystem hosting this data also came up clean.
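
For reference, roughly the commands involved (the mountpoint is an assumption based on the data_dir; the device name is from the output below):

btrfs scrub start -B /tank       # foreground scrub of the filesystem hosting the data
btrfs check --force /dev/dm-0    # --force because the filesystem was still mounted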

However, btrfs check indeed found something:

Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/dm-0
UUID: f88b711d-62ea-424b-bb0e-e7537bd7c7fb
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 484587257856 wanted 63559 found 63579
parent transid verify failed on 484587257856 wanted 63559 found 63579
parent transid verify failed on 484587257856 wanted 63559 found 63579
Ignoring transid failure
ERROR: invalid generation for extent 484587257856, have 63579 expect (0, 63560]
metadata level mismatch on [484587257856, 16384]
owner ref check failed [484587257856 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
parent transid verify failed on 484587257856 wanted 63559 found 63579
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=484586962944 item=131 parent level=1 child bytenr=484587257856 child level=1
could not load free space tree: Input/output error
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
ERROR: transid errors in file system
found 2212787752960 bytes used, error(s) found
total csum bytes: 2154973844
total tree bytes: 6094520320
total fs tree bytes: 2291744768
total extent tree bytes: 421806080
btree space waste bytes: 2104804141
file data blocks allocated: 2244605816832
 referenced 2204922077184

I think the best next steps would be to remove the node from the cluster, wait for everything to be drained, then recreate the filesystem and add it back?
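
Roughly, the drain and re-add would look like this (node ID and capacity as in the status output; layout version numbers are placeholders):

# drain: drop the node's role and apply the new layout
garage layout remove afd7e76fa5bb4e20
garage layout apply --version <n>
# ... wait for rebalancing/resync, recreate the filesystem, then re-add:
garage layout assign -z kadriorg -c 3.5TB afd7e76fa5bb4e20
garage layout apply --version <n+1>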

Author
Contributor

I removed the node from the cluster and will wait for the resync to finish.

Author
Contributor

Done, adding the node back in (after recreating the filesystem).
