garage repair rebalance stops when encountering corrupted block #845

Open
opened 2024-08-01 06:56:08 +00:00 by danya02 · 3 comments

Hi everyone!

```
# garage -V
garage v1.0.0 [features: k2v, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
# uname -a
Linux garage 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux
```

I noticed that I had set up my Garage data partition incorrectly, so I needed to move the data to another disk. At first I tried doing this offline, but then I noticed that Garage has an online feature for this. I stopped the offline copy halfway, marked the old disk as read-only, and continued the transfer using `garage repair rebalance`.

After a while, the rebalance stopped, and it did not restart when I ran the command again. Using `strace` on the server process, I figured out that it was encountering a chunk file that I had already moved to the other disk, but which had been truncated to 0 bytes on the original. That makes it a corrupted block.
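For anyone checking for the same condition, here is a minimal sketch that lists zero-byte files under a data directory. Nothing in it is Garage-specific: the function name `find_empty_files` and the path in the usage comment are hypothetical, and it assumes only that a truncated block shows up as a regular file of size 0.

```python
import os


def find_empty_files(data_dir):
    """Return the sorted paths of regular files under data_dir that are 0 bytes long."""
    empty = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            path = os.path.join(root, name)
            # Ignore symlinks so only real on-disk files are reported.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


# Example usage (hypothetical path, adjust to your own data directory):
# for path in find_empty_files("/mnt/old-disk/garage/data"):
#     print(path)
```

This only reports candidates. Whether a given empty file is safe to delete depends on whether a good copy of the block exists elsewhere.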

Even after I deleted the chunk file (as it was already present on the other drive), running `garage repair rebalance` again did not start moving files. I had to stop the main `garage server` process and launch it again for the rebalance to work.

What could be the issue here?

Owner

Hi @danya02. In theory, Garage is able to repair corrupted blocks, so I have no idea what is going on here.
Step-by-step instructions to reproduce the issue would help us investigate.

quentin added the action/more-info-needed label 2024-08-07 09:21:00 +00:00
quentin added the kind/wrong-behavior label 2024-08-07 09:33:09 +00:00
quentin added the scope/background-healing label 2024-08-07 09:35:52 +00:00
Author

Hi! Thanks for replying, sorry for not getting back to you sooner.

I've written a small Python script that should set up the error condition. I haven't had time to tidy it up, so if there's anything unclear about what it's doing, please let me know and I'll try explaining it.

The script is attached to this comment as a `.txt` file; rename it to `.py` to run it.

The script sets up an ephemeral Garage server, uploads a file to it, then reconfigures the layout in the way I did with my live installation, and tries to perform a rebalance task. The final output on my machine is:

```
Old files: 51
New1 files: 2
New2 files: 1
We expected that old contains zero files and the new directories contain all of them.
```
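The check behind that output can be sketched roughly as follows. This is not the attached script, only a minimal illustration of the same idea: count the files left in each data directory after the rebalance. The function names and the directory paths in the usage comment are hypothetical.

```python
import os


def count_files(directory):
    """Count all files under directory, recursively (0 if it does not exist)."""
    return sum(len(files) for _root, _dirs, files in os.walk(directory))


def rebalance_report(dirs):
    """Build one 'Label files: N' line per entry of a label -> path mapping."""
    return [f"{label} files: {count_files(path)}" for label, path in dirs.items()]


# Example usage (hypothetical paths inside the script's working directory):
# for line in rebalance_report({
#     "Old": "garage-rebalance-toast/old",
#     "New1": "garage-rebalance-toast/new1",
#     "New2": "garage-rebalance-toast/new2",
# }):
#     print(line)
```

A successful rebalance should leave the old directory at zero files, with every block accounted for across the new directories.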

The script requires that a Garage executable is available at `~/.cargo/bin/garage` (which is where it would be installed by Cargo), and also that `rclone` is available in PATH.

The script stores all its ephemeral files in the `garage-rebalance-toast` directory, and the first thing it does is delete that directory to clean up after any previous runs, so make sure to run it in a place where there is nothing important in that directory.

Author

Hi! I just wanted to report that running my script against `garage cargo:1.0.1 [features: k2v, lmdb, sqlite, metrics, bundled-libs]` reproduces the same issue as above.

Note that I was using Garage in a single-node configuration (as I don't have many other computers to use for this), so the standard repair processes you mention might not all be working. In particular, I expect that this problem wouldn't happen if there were other nodes, as Garage would just fetch the block it believes is corrupted from a different node.

Reference: Deuxfleurs/garage#845