garage block list-errors shows errors after cluster layout change and node reboot #810
I had a 3-node Garage cluster, using the following settings: `garage-0.9.4`, on `aarch64-linux` (NixOS). The underlying filesystem is btrfs.

I added a fourth node to the cluster; they're all in the same zone.
I applied the layout change, and things started to rebalance.¹
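For context, the layout change was along these lines; the node ID, zone name, and capacity below are placeholders, not the exact values from my cluster:

```sh
# connect the new node to the cluster (ID and address are placeholders)
garage node connect 1a2b3c4d5e6f7a8b@10.0.0.4:3901

# assign it a role in the same zone as the existing nodes (capacity illustrative)
garage layout assign -z dc1 -c 3.5T 1a2b3c4d5e6f7a8b

# review the staged change, then apply it
garage layout show
garage layout apply --version 2
```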
While the cluster was rebalancing (it took a few hours), I rebooted the three old nodes (`cf8ae7c345e3efbb`, `afd7e76fa5bb4e20`, `bd94fdc709ea3f87`) sequentially (to apply a kernel change). I assumed this shouldn't be a risky process, due to multiple replicas still being around and the three other nodes being up.

During the rebalancing, I saw a bunch of connection errors w.r.t. quorum though:
After all rebalancing operations finished (judging from the network interface traffic), I saw 4 blocks with resync errors:

As can be seen in the logs, they're not part of an ongoing multipart upload, nor deleted, and a `garage block retry-now --all` did not get them unstuck. I manually looked in the filesystem on all 4 nodes and couldn't find these blocks in `/tank/data` either, so they did indeed seem to be gone.

What's interesting is that they all have the same block hash prefix.
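For completeness, this is roughly how I inspected them; the block hash below is a placeholder:

```sh
# list blocks that currently have resync errors
garage block list-errors

# force an immediate resync attempt on all errored blocks
garage block retry-now --all

# inspect one of the failing blocks (hash is a placeholder)
garage block info 5a3bdeadbeef
```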
I only have logs for these machines from their current boot, so only from after each node got rebooted, and I'm not 100% certain about the timestamps on these logs. Grepping for the block hashes did yield this, though:

… suggesting Garage was confident it had actually persisted the block elsewhere, and afterwards deleted it from node3?
I kicked off a `restic check --read-data` overnight (it's still ongoing). Curiously, this morning the list of errors is empty:

And a `garage block info` on one of them showed the block as being present:

I did not peek into the filesystem. Now, an hour later, all info about this block is gone:

This is quite confusing; I'm not entirely sure what's going on. Maybe I'm misinterpreting some of the output?
¹ In fact, I first applied a layout change giving all nodes less capacity than before, as I misremembered the per-node capacity to be ~2 TB, while it was 4 TB. Then I applied a second layout change, assigning them all ~3.5 TB. ↩︎
Did you apply the second layout change while the first change was still in progress?
Yes, the second layout change was minutes after the first; the rebalancing took some hours to complete.
It is very possible that your issue was caused by the two subsequent layout changes, as I'm not sure if Garage keeps more than two layouts in the running set. This could have caused Garage to consider the first rebalancing "done", although I'm still puzzled why it would not keep a copy of the block on at least 2 of the old nodes...
The `restic check --read-data` run finished and it didn't find any discrepancies. So it seems no data was lost?

I noticed that for all four blocks, `garage block info $block_id` still reports "Error: No matching block found" only if executed on node3. Is this expected to be a node-local command?

Any operations you'd advise running to further inspect this?
I think this is mostly normal behavior of Garage, especially given that:
- The block errors you saw were probably on a node where the metadata was not fully up-to-date yet, so it thought it still needed the block (the refcount was 1), while in fact the object had already been deleted on the cluster (and other nodes were already aware of this and had already deleted the block). This situation is generally fixed automatically pretty fast, but it might have taken a bit longer given that you were changing the cluster layout around that time. To fix the situation faster, you could have done a `garage repair -a tables` before the `garage block retry-now`, and if that was not enough you could also have tried `garage repair -a versions` and `garage repair -a block-refs` (sketched below).
- Doing two subsequent layout changes in a small interval is supported by Garage. It could accentuate small internal discrepancies such as this one, but global consistency from the point of view of the S3 API should be preserved.
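Concretely, the suggested repair sequence would look something like this (a sketch of the order of operations, not a verbatim transcript):

```sh
# resynchronize table metadata between nodes first
garage repair -a tables

# then retry the blocks that are still in error
garage block retry-now --all

# if errors persist, recompute version and block-reference metadata
garage repair -a versions
garage repair -a block-refs
```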
Yes, it's a node-local command. I think in your case the block_ref entries or block reference counter on node3 are not garbage collected because they have been marked deleted more recently, whereas on the other nodes they were marked deleted long ago and were garbage collected already.
`garage block` is node-local #813

The scrubs finished, and one node (node4) recorded "5798 persistent errors".
There's a bunch of `.corrupted` files in the data directory:

From a quick glance, they all seem to be 0-byte files, and there seems to be a non-corrupted file alongside each of them, so I think I could just delete the `.corrupted` files and move on?
As suggested in the Matrix channel, I checked the filesystem. There's nothing in dmesg suggesting any problems with the filesystem, and a `btrfs scrub` on the filesystem hosting this data also came up clean. However, `btrfs check` did indeed find something:

I think the best next step would be to remove the node from the cluster, wait for everything to be drained, then recreate the filesystem and add it back?
I removed the node from the cluster and will wait for the resync to finish.
Done, adding the node back in (after recreating the filesystem)