panicked at 'Could not count block_local_resync_queue' #541
I'm running garage 0.8.1 in a cluster of 3 physical servers. Each physical server runs 2 garage processes (as HashiCorp Nomad jobs).
I recently updated and rebooted all servers. The jobs were stopped by a Nomad node drain and automatically restarted after the reboot.
After that, one server started logging a panic on process startup, for both the gw and the data job (see log below).
Unfortunately, I do not have logs from the first failure, as I discovered it too late and the logs got garbage-collected by Nomad.
Hi @zdenek.crha
Thanks for reporting this bug.
About your error
According to your stack trace, it seems to be located in this function: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/block/resync.rs#L89-L108
And according to your title, it seems to be due to this line: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/block/resync.rs#L93
Unfortunately, Garage does not log the underlying error, so we can't know what went wrong inside CountedTree :-(
Digging in the code
CountedTree::new is defined here:
https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/counted_tree_hack.rs#L25
And it calls your underlying metadata engine driver (either LMDB, sled, or sqlite).
For sled, it runs this logic: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/sled_adapter.rs#L99-L102
For lmdb: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/lmdb_adapter.rs#L118-L122
For sqlite: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/sqlite_adapter.rs#L133-L145
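As a hedged sketch of the error-reporting improvement discussed here (this is not Garage's actual code; DbError and count_entries are stand-ins for the metadata engine's error type and for CountedTree's counting call), the underlying error could be included in the panic message instead of being swallowed by a bare expect:

```rust
use std::fmt;

// Hypothetical error type standing in for the metadata engine's error.
#[derive(Debug)]
struct DbError(String);

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "db error: {}", self.0)
    }
}

// Stand-in for the counting operation (CountedTree::new), which can fail
// in the underlying engine. `simulate_failure` lets us exercise both paths.
fn count_entries(simulate_failure: bool) -> Result<usize, DbError> {
    if simulate_failure {
        // Illustrative message only; real LMDB failures would surface here.
        Err(DbError("simulated corruption while counting entries".into()))
    } else {
        Ok(42)
    }
}

fn main() {
    // Instead of `.expect("Could not count block_local_resync_queue")`,
    // format the underlying error into the panic message so the logs
    // show what the metadata engine actually reported.
    let n = count_entries(false).unwrap_or_else(|e| {
        panic!("Could not count block_local_resync_queue: {}", e)
    });
    println!("queue length: {}", n);
}
```

With a change along these lines, the stack trace in this report would have carried the engine's own error text, which is exactly the missing piece below.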
Based on your logs, it seems you use lmdb. So basically, I think we need to find where the error is located in that logic.
Debugging your issue
It seems Garage does not like your /storage/meta/db.lmdb metadata file, but we can't know why. Improving Garage's error reporting could help debug your issue: if you want, you can contribute a patch to improve it, and we will be able to ship you a dedicated build. Otherwise, hopefully, someone else will find the time to improve this part of the code soon :-)
If you are 100% sure that your cluster contains no personal/confidential/sensitive data, you can email your /storage/meta/db.lmdb file to garagehq {at} deuxfleurs.fr; it could help us reproduce your issue.
Finally, if you need to debug your cluster, please follow the Recovering from failures recipe from Garage's documentation, especially the "Replacement scenario 2: metadata (and possibly data) is lost" section.
Thanks for the detailed reply. Unfortunately, I won't be able to dig into the issue as I'm swamped with other work at $job :-(. Also, I can't reproduce the issue anymore (see below).
But I do have a few more pieces of information that may help whoever picks this up:
First, the affected node runs two garage processes:
- a gw process using a partition on ssd
- a data process using a partition on hdd
Both processes started crashing at roughly the same time. I don't have the logs anymore and therefore I'm not 100% sure, but I think it was the same panic message in both cases.
Second, the panic went away after I rebooted the node itself. New gw and data processes started with the original data/meta directories and came up successfully.
The two points above make me think it is not an issue with the db.lmdb file, or at least not just an issue with that file. Maybe some lower-level IO error when accessing the file? Something that goes away with a reboot? The chance that two metadata files get corrupted at the same time, in the same way, is slim, especially on two different physical drives.
Stuff that is common to both processes is:
I went through the node OS logs again, but I can't find anything out of the ordinary. Unlike the logs from the Nomad allocation (i.e. the garage process itself), I still have the OS logs and can dig through them. If you need me to check something, let me know.
I feel we can safely say that this was caused by a transient condition on your system that made LMDB behave strangely, and not necessarily by an issue in Garage itself. I will close this issue for now, but feel free to re-open if you'd like to add something.