garage repair rebalance stops when encountering corrupted block #845
Reference: Deuxfleurs/garage#845
Hi everyone!
I noticed that I had set up my Garage data partition incorrectly, so I needed to move the data to another disk. At first I tried doing this offline, but then I noticed that Garage has an online feature for this, so I stopped that process halfway, set the old disk as read-only, and continued the transfer using `garage repair rebalance`.

After a while, the rebalance stopped, and it did not restart when I ran that command again. Using `strace` on the server process, I figured out that it was encountering a chunk file that I had already moved to the other disk, but that had been truncated to 0 bytes on the original. That makes it a corrupted block.

Even after I deleted the chunk file (as it was already present on the other drive), running `garage repair rebalance` again wouldn't start moving files. I had to stop the main `garage server` process and launch it again for it to work.

What could be the issue here?
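For anyone checking whether they are in the same situation, the zero-byte chunk files described above can be located without `strace`. The following is only a minimal sketch: it walks a directory tree and lists every empty regular file. It makes no assumption about how Garage lays out or names block files, and the function name `find_empty_files` and the file names in the demo are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_empty_files(data_dir):
    """Return a sorted list of all zero-byte regular files under data_dir."""
    empty = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = Path(root) / name
            # Skip symlinks so that only real on-disk files are reported.
            if not path.is_symlink() and path.is_file() and path.stat().st_size == 0:
                empty.append(path)
    return sorted(empty)


if __name__ == "__main__":
    # Demo on a throwaway directory; the file names are hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "ab").mkdir()
        (tmp_path / "ab" / "truncated-block").write_bytes(b"")
        (tmp_path / "ab" / "healthy-block").write_bytes(b"some data")
        for path in find_empty_files(tmp_path):
            print(path)
```

Pointing `find_empty_files` at the old (read-only) data directory would list candidates to inspect by hand. Whether a given empty file is safe to delete still has to be verified against the copy on the new disk.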
Hi @danya02. In theory, Garage is able to fix corrupted blocks, so I have no idea what is going on here.
If we have step-by-step instructions to reproduce the issue, it might help us investigate.
Hi! Thanks for replying, sorry for not getting back to you sooner.
I've written a small Python script that should set up the error condition. I haven't had time to tidy it up, so if there's anything unclear about what it's doing, please let me know and I'll try explaining it.
The script is attached to this comment as a `.txt` file; rename it to `.py` to run it.

The script sets up an ephemeral Garage server, uploads a file to it, then reconfigures the layout the same way I did on my live installation, and tries to perform a rebalance task. The final output on my machine is:
The script requires that a Garage executable is available at `~/.cargo/bin/garage` (which is where Cargo would install it), and also that `rclone` is available in `PATH`.

The script stores all its ephemeral files in the `garage-rebalance-toast` directory, and the first thing it does is delete that directory to clean up after any previous runs, so make sure to run it somewhere where that directory contains nothing important.

Hi! I just wanted to report that running my script on `garage cargo:1.0.1 [features: k2v, lmdb, sqlite, metrics, bundled-libs]` causes the same issue as above.

Note that I was using Garage in a single-node configuration (as I don't have many other computers to use for this), so the standard repair processes you mention might not all be working. In particular, I expect that this problem wouldn't happen if there were other nodes, as Garage would just fetch the block it thinks is corrupted from a different node.
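
The prerequisites of the reproduction script described above (a Garage executable at `~/.cargo/bin/garage`, `rclone` in `PATH`, and a `garage-rebalance-toast` directory that is safe to delete) can be checked before running it. This is a hedged sketch and not part of the attached script: the function name `check_prerequisites` and its parameters are hypothetical, and only standard-library calls are used.

```python
import os
import shutil
from pathlib import Path


def check_prerequisites(garage_path="~/.cargo/bin/garage",
                        tools=("rclone",),
                        scratch_dir="garage-rebalance-toast"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # The script expects the Garage binary where Cargo installs it.
    garage = Path(garage_path).expanduser()
    if not (garage.is_file() and os.access(garage, os.X_OK)):
        problems.append(f"no executable found at {garage}")

    # Other tools only need to be reachable through PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found in PATH")

    # The script deletes this directory on startup, so warn if it already exists.
    if Path(scratch_dir).exists():
        problems.append(f"{scratch_dir} already exists and would be deleted")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("warning:", problem)
```

Running it from the directory where the script will be launched prints one warning per unmet prerequisite and nothing if all checks pass.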