Occasional reference counter mismatches #644

Closed
opened 2023-10-06 13:56:47 +00:00 by raucao · 3 comments

We had some issues with a slow node, which resulted in a lot of unfinished uploads. Now that we have finally removed the node from the cluster and shifted its data to a new machine, I wanted to check and resolve the remaining block errors ("error when resyncing, no node returned a valid block").

I followed the [instructions in the docs](https://garagehq.deuxfleurs.fr/documentation/operations/durability-repairs/) and tried to `purge` the blocks in question. Unfortunately, even though Garage responds with "1 blocks were purged: 0 object deletion markers added, 0 versions marked deleted", the blocks still appear in `list-errors` and Garage is still retrying the sync every hour.

The output from `block info` looks like this for all of our erroring ones:

```
~ garage block info a9eda7daecd811c104a5f388c1b70be7866aa5046cb71f206a0572035404c879
Block hash: a9eda7daecd811c104a5f388c1b70be7866aa5046cb71f206a0572035404c879
Refcount: 1

Version           Bucket  Key  Deleted
b22bb8f6b8deb0a6               yes

Warning: refcount does not match number of non-deleted versions
```
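For reference, the procedure from the linked docs amounts to roughly the following (the hash is taken from the output above; the `--yes` confirmation flag is what the docs show, so double-check `garage block purge --help` on your version):

```
# List blocks whose resync keeps failing
garage block list-errors

# Inspect one of the failing blocks
garage block info a9eda7daecd811c104a5f388c1b70be7866aa5046cb71f206a0572035404c879

# Drop all objects/versions referencing the block so Garage stops trying to fetch it
garage block purge --yes a9eda7daecd811c104a5f388c1b70be7866aa5046cb71f206a0572035404c879
```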
Owner

`Warning: refcount does not match number of non-deleted versions`: this indicates you've hit a bug in Garage that has been around for some time and that I haven't yet tracked down. Basically, Garage keeps a reference counter that is separate from the actual list of objects referencing a given block, and in some rare cases the two become inconsistent. In your case there is no issue with your cluster: since the errored block is not referenced by any object, not having it in the cluster is not a problem. However, there is currently no way to make the error go away.
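A quick way to check whether every errored block is in this harmless state (positive refcount, but only deleted versions) is a small loop like the one below. This is just a sketch: it assumes the block hash is the first whitespace-separated column of `garage block list-errors` output and that the first line is a header, so adjust the field extraction for your version.

```
#!/bin/sh
# For every block in the resync error queue, report whether `block info`
# prints the refcount-mismatch warning (i.e. no live version references it).
for hash in $(garage block list-errors | tail -n +2 | awk '{print $1}'); do
  if garage block info "$hash" | grep -q 'refcount does not match'; then
    echo "$hash: refcount mismatch, block is not actually needed"
  else
    echo "$hash: still referenced by a non-deleted version, investigate"
  fi
done
```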
lx added the Bug label 2023-10-10 13:18:53 +00:00
lx changed title from `block purge` doesn't purge block to Occasional reference counter mismatches 2023-10-10 13:19:10 +00:00
Contributor

We have probably run into this issue today, with some ~30 blocks listed with errors. When checked with `block info`, all of them looked like this:

```
$ garage block info ff9151bf46d864a37bf83981aecfc31dc1ff2dbf0d763d03085d92552e2bb8c3
Block hash: ff9151bf46d864a37bf83981aecfc31dc1ff2dbf0d763d03085d92552e2bb8c3
Refcount: 1

Version           Bucket  Key  Deleted
bbba42eb9a6b74fd               yes

Warning: refcount does not match number of non-deleted versions
```

Running `garage block purge` did not seem to do anything. But we also started a `scrub` on one of our nodes and observed that the number of resync errors shown in the Prometheus metrics and in `garage block list-errors` went down.

@lx You mentioned there is no way to make these errors go away, and I don't know anything about Garage internals. But is it plausible that running `scrub` could clear those errors?

We have also run `repair versions` and `repair tables` on all nodes during the scrub. They did not affect anything immediately, but maybe contributed to clearing these errors?
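For reference, the operations mentioned above correspond to roughly the following commands (syntax as shown in the durability-repairs documentation; subcommand names and flags may differ between Garage versions, so check `garage repair --help`):

```
# Start a scrub of the stored data blocks on the local node
garage repair scrub start

# Re-check version objects and launch a full sync of metadata tables, on all nodes
garage repair -a --yes versions
garage repair -a --yes tables
```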

Owner

> is it plausible that running `scrub` could clear those errors?

In theory, scrub is completely unrelated. If the number of errors went down, I think it's more because Garage finally realized that those blocks were not needed and that it was OK not to retry fetching them, so the errors were removed from the backlog.
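If you don't want to wait for the hourly retry, the docs also describe forcing the resync queue to reprocess errored blocks right away, roughly:

```
# Retry all blocks currently in the resync error queue immediately,
# instead of waiting for their retry backoff to expire
garage block retry-now --all
```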

> We have also run `repair versions` and `repair tables` on all nodes during the scrub. They did not affect anything immediately, but maybe contributed to clearing these errors?

Yes, this is more likely.

Hopefully with Garage 1.0, when we remove Sled, we will have the opportunity to add a repair procedure that removes these errors. Limitations of the Sled API were the main reason we couldn't do this earlier.

lx added this to the v1.0 milestone 2024-03-12 10:33:30 +00:00
lx closed this issue 2024-03-19 15:59:20 +00:00