Occasional reference counter mismatches #644
We had some issues with a slow node, which resulted in a lot of unfinished uploads. Now that we have finally removed the node from the cluster and shifted its data to a new machine, I wanted to check and resolve the remaining block errors ("error when resyncing, no node returned a valid block").

I followed the instructions in the docs and tried to `purge` the blocks in question. Unfortunately, even though garage responds with "1 blocks were purged: 0 object deletion markers added, 0 versions marked deleted", the blocks still appear in `list-errors` and garage is still retrying the sync every hour. The output from `block info` looks like this for all of our erroring blocks:

`Warning: refcount does not match number of non-deleted versions`
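For reference, the workflow described above corresponds roughly to the following commands; the block hash is a placeholder and exact flags may differ slightly between Garage versions:

```sh
# List the blocks that garage repeatedly fails to resync
garage block list-errors

# Attempt to purge one of them, i.e. drop the objects/versions that
# still reference it (the hash below is a placeholder, not a real block)
garage block purge --yes 5a1b2c3d4e5f
```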
`Warning: refcount does not match number of non-deleted versions`: this indicates you've hit a bug in garage, which has been present for some time and which I haven't yet been able to track down. Basically, garage keeps a reference counter for each block that is separate from the actual list of objects referencing that block, and in some rare cases the two become inconsistent. In your case there is no issue with your cluster: since the errored block is not referenced by any object, not having it in the cluster is not a problem. However, there is currently no way to make the error go away.
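To see the mismatch described here on a concrete block, the per-block view can be inspected; the hash below is again a placeholder, and the exact output format varies by Garage version:

```sh
# Show the stored reference count and the versions/objects that point at
# this block; a non-zero refcount while all listed versions are deleted
# is the inconsistency discussed in this issue
garage block info 5a1b2c3d4e5f
```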
(Issue title changed from "block purge doesn't purge block" to "Occasional reference counter mismatches".)

We have probably run into this issue today, with some ~30 blocks listed with errors. When we checked `block info`, all of them looked like this (the same refcount warning as above).

Running `garage block purge` did not seem to do anything. But we have also started a `scrub` on one of our nodes and observed that the number of resync errors shown in the Prometheus metrics and in `garage block list-errors` went down.

@lx You mentioned there is no way to make these errors go away, and I don't know anything about garage internals, but is it plausible that running `scrub` could clear those errors?

We have also run `repair versions` and `repair tables` on all nodes during the scrub. They did not affect anything immediately, but maybe they contributed to clearing these errors?
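For context, the scrub and repair operations mentioned in this comment are started roughly as follows; subcommand spellings and flags are from memory and may differ between Garage versions, and the metrics check assumes the admin API is enabled on its default port:

```sh
# Start a full scrub of the data directory on this node
garage repair --yes scrub start

# Re-check versions and tables (as noted above, run on every node)
garage repair --yes versions
garage repair --yes tables

# Watch whether the list of erroring blocks shrinks afterwards
garage block list-errors

# Resync-related gauges are also exposed on the admin API metrics endpoint
# (metric names and port may vary depending on configuration and version)
curl -s http://localhost:3903/metrics | grep resync
```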
In theory, scrub is completely unrelated. If the number of errors went down, I think it's more likely that Garage finally realized those blocks were not needed and that it was OK to stop retrying to fetch them, so the errors were removed from the backlog.
Yes, this is more likely.
Hopefully with Garage 1.0, when we remove Sled, we will have the opportunity to add a repair procedure that removes these errors. Limits in the Sled API were the main reason we couldn't do this earlier.