Occasional reference counter mismatches #644
Reference: Deuxfleurs/garage#644
We had some issues with a slow node, which resulted in a lot of unfinished uploads. Now that we have finally removed the node from the cluster and shifted its data to a new machine, I wanted to check and resolve the remaining block errors ("error when resyncing, no node returned a valid block").

I followed the instructions in the docs and tried to `purge` the blocks in question. Unfortunately, even though garage responds with "1 blocks were purged: 0 object deletion markers added, 0 versions marked deleted", the blocks still appear in `list-errors` and garage is still retrying the sync every hour. The output from `block info` looks like this for all of our erroring blocks:

Warning: refcount does not match number of non-deleted versions
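The purge workflow described above could be scripted roughly as follows. This is a sketch: it assumes the `garage block list-errors`, `garage block info` and `garage block purge --yes` subcommands from the Garage admin documentation, and that the block hash is the first column of the `list-errors` output after a header line; verify both against your Garage version before running.

```shell
#!/bin/sh
# Sketch: inspect and purge every block currently listed in error.
# Assumption: hash is the first column of `list-errors`, after one header line.
set -eu

if ! command -v garage >/dev/null 2>&1; then
    echo "garage CLI not found; nothing to do"
    STATUS=skipped
else
    garage block list-errors | awk 'NR > 1 { print $1 }' | while read -r hash; do
        echo "Inspecting block $hash"
        garage block info "$hash"
        # purge deletes the objects/versions still referencing the block
        garage block purge --yes "$hash"
    done
    STATUS=done
fi
echo "$STATUS"
```

Note that, as this thread shows, `purge` does not remove blocks whose refcount is inconsistent, so errored blocks may remain listed afterwards.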
The warning `refcount does not match number of non-deleted versions` indicates you've hit a bug in garage, one I've been seeing for some time and haven't yet figured out. Basically, garage keeps a reference counter that is separate from the actual list of objects referencing a given block, and in some rare cases the two become inconsistent. In your case there is no problem with your cluster: since the errored block is not referenced by any object, not having it in the cluster is not an issue. However, there is currently no way to make the error go away.

(Issue title changed from "`block purge` doesn't purge block" to "Occasional reference counter mismatches".)

We have probably run into this issue today, with some ~30 blocks listed with errors. When checking their info, all of them looked like this:
Running `garage block purge` did not seem to do anything. But we have also started a `scrub` on one of our nodes and observed that the number of resync errors, shown both in the Prometheus metrics and in `garage block list-errors`, went down.

@lx You mentioned there is no way to make these errors go away, and I don't know anything about garage internals, but is it plausible that running `scrub` could clear those errors? We have also run `repair versions` and `repair tables` on all nodes during the scrub. They did not affect anything immediately, but maybe they contributed to clearing these errors?

In theory, scrub is completely unrelated. If the number of errors went down, I think it's more because Garage finally realized that those blocks were not needed and it was OK not to retry fetching them, so the errors were removed from the backlog.
Yes, this is more likely.
Hopefully with Garage 1.0, when we remove Sled, we will have the opportunity to add a repair procedure that removes these errors. Limitations in the Sled API were the main reason we couldn't do this earlier.
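For reference, the repair and scrub operations mentioned in this thread would be launched roughly like this. This is a sketch: the subcommand names follow the Garage admin documentation as I understand it (`repair` requires `--yes`, scrub is started via `garage repair scrub start`), so double-check against `garage repair --help` for your version.

```shell
#!/bin/sh
# Sketch of the maintenance commands discussed in this thread.
# Run the repair commands on every node; verify subcommand names
# with `garage repair --help` on your Garage version first.
set -eu

if ! command -v garage >/dev/null 2>&1; then
    echo "garage CLI not found; nothing to do"
    STATUS=skipped
else
    # Recheck version and table metadata on this node
    garage repair --yes versions
    garage repair --yes tables

    # Start a full scrub of the data directory in the background
    garage repair scrub start
    STATUS=done
fi
echo "$STATUS"
```

As noted above, scrub itself is in theory unrelated to the refcount mismatch; any drop in reported errors is more likely the resync backlog being cleared.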