panicked at 'Could not count block_local_resync_queue' #541

Closed
opened 2023-04-18 09:02:13 +00:00 by zdenek.crha · 3 comments
Contributor

I'm running garage 0.8.1 in a cluster of 3 physical servers. Each physical server runs 2 garage processes (as HashiCorp Nomad jobs):

  • gw (just gateway)
  • data (data)

I have recently updated and rebooted all servers. The jobs were stopped by a Nomad node drain and automatically restarted after the reboot.

After that, one server started to log a panic on process startup for both the gw and data jobs (see log below).

Unfortunately, I do not have logs from the first failure, as I discovered it too late and the logs were garbage collected by Nomad.

  • kernel, dmesg logs do not show any errors at the time of failure
  • filesystem(s) holding garage data
    • are mounted in rw mode
    • have enough space and inodes

Restarting:

  • the process did not help, the same error appears
  • the server did help, the panic disappeared
2023-04-18T06:44:57.548842Z  INFO garage::server: Loading configuration...
2023-04-18T06:44:57.549253Z  INFO garage::server: Initializing background runner...
2023-04-18T06:44:57.549307Z  INFO garage::server: Initializing Garage main data store...
2023-04-18T06:44:57.549362Z  INFO garage_model::garage: Opening database...
2023-04-18T06:44:57.549377Z  INFO garage_model::garage: Opening LMDB database at: /storage/meta/db.lmdb
2023-04-18T06:44:57.549597Z  INFO garage_model::garage: Initialize membership management system...
2023-04-18T06:44:57.549653Z  INFO garage_rpc::system: Node ID of this node: 73c69f2fd97641de
2023-04-18T06:44:57.549839Z  INFO garage_model::garage: Initialize block manager...
======== PANIC (internal Garage error) ========

GARAGE VERSION: v0.8.1 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]

BACKTRACE:
   0: garage::main::{{closure}}::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic_handler::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: garage_block::resync::BlockResyncManager::new
   8: garage_block::manager::BlockManager::new
   9: garage_model::garage::Garage::new
  10: garage::server::run_server::{{closure}}
  11: garage::main::{{closure}}
  12: std::thread::local::LocalKey<T>::with
  13: tokio::park::thread::CachedParkThread::block_on
  14: tokio::runtime::thread_pool::ThreadPool::block_on
  15: garage::main
  16: std::sys_common::backtrace::__rust_begin_short_backtrace
  17: std::rt::lang_start::{{closure}}
  18: std::rt::lang_start_internal
  19: main

2023-04-18T06:45:13.194945Z  INFO garage::server: Loading configuration...
2023-04-18T06:45:13.195238Z  INFO garage::server: Initializing background runner...
2023-04-18T06:45:13.195288Z  INFO garage::server: Initializing Garage main data store...
2023-04-18T06:45:13.195341Z  INFO garage_model::garage: Opening database...
2023-04-18T06:45:13.195357Z  INFO garage_model::garage: Opening LMDB database at: /storage/meta/db.lmdb
2023-04-18T06:45:13.195594Z  INFO garage_model::garage: Initialize membership management system...
2023-04-18T06:45:13.195645Z  INFO garage_rpc::system: Node ID of this node: 73c69f2fd97641de
2023-04-18T06:45:13.195836Z  INFO garage_model::garage: Initialize block manager...
Owner

Hi @zdenek.crha
Thanks for reporting this bug.

#### About your error

According to your stack trace, it seems to be located in this function: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/block/resync.rs#L89-L108

And according to your title, it seems to be due to this line: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/block/resync.rs#L93

```rust
let queue = CountedTree::new(queue).expect("Could not count block_local_resync_queue");
```

Unfortunately Garage does not log the underlying error, so we can't know what went wrong in CountedTree :-(
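
For what it's worth, here is a minimal sketch of how that call site could additionally emit the underlying error through the regular log stream before aborting (hypothetical code, not an actual patch; it assumes `tracing::error!` is usable in that module, and `{:?}` is available because the error type must already implement `Debug` for `.expect()` to compile):

```rust
// Hypothetical variant of the call site quoted above (illustration only).
let queue = match CountedTree::new(queue) {
    Ok(q) => q,
    Err(e) => {
        // Surface the underlying db error in the normal log output...
        tracing::error!("Could not count block_local_resync_queue: {:?}", e);
        // ...then abort as before, keeping the original message.
        panic!("Could not count block_local_resync_queue: {:?}", e);
    }
};
```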

#### Digging in the code

CountedTree::new is defined here:
https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/counted_tree_hack.rs#L25

And it calls your underlying metadata engine driver (either LMDB, sled or sqlite).

For sled, it runs this logic: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/sled_adapter.rs#L99-L102

For lmdb: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/lmdb_adapter.rs#L118-L122

For sqlite: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/sqlite_adapter.rs#L133-L145
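
Conceptually (a hedged sketch with hypothetical types, not the actual counted_tree_hack implementation), such a wrapper has to ask the driver for an initial entry count at construction time, and that is the call failing here:

```rust
// Rough conceptual sketch of a counted-tree wrapper (not Garage's real code):
// at construction it needs an initial entry count from the underlying driver,
// and that driver call is what can fail.
use std::sync::atomic::AtomicUsize;

struct CountedTreeSketch {
    // Kept up to date on later inserts/removals.
    len: AtomicUsize,
}

impl CountedTreeSketch {
    fn new(initial_len: Result<usize, String>) -> Result<Self, String> {
        // The step that surfaces as "Could not count block_local_resync_queue":
        let n = initial_len?;
        Ok(CountedTreeSketch {
            len: AtomicUsize::new(n),
        })
    }
}

fn main() {
    assert!(CountedTreeSketch::new(Ok(0)).is_ok());
    assert!(CountedTreeSketch::new(Err("MDB_BAD_TXN".into())).is_err());
}
```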

Based on your logs, it seems you use lmdb.
So basically I think we need to find where the error is located in the following logic:

```rust
fn len(&self, tree: usize) -> Result<usize> {
	let tree = self.get_tree(tree)?;
	let tx = self.db.read_txn()?;
	Ok(tree.len(&tx)?.try_into().unwrap())
}
```
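
As a self-contained illustration of what "finding where the error is located" could look like (plain Rust with stand-in types, not Garage's actual adapter code), the sketch below tags each of the three fallible steps so a failure reports which one broke:

```rust
// Standalone sketch (hypothetical, not Garage code): mirror the three
// fallible steps of `len()` above and label each one, so the eventual
// error says whether get_tree, read_txn or the length conversion failed.
#[derive(Debug)]
enum LenError {
    GetTree(String),
    ReadTxn(String),
    Conversion(String),
}

fn queue_len(tree: Result<u64, String>, txn_ok: bool) -> Result<usize, LenError> {
    // Step 1: resolve the tree handle (stands in for `self.get_tree(tree)?`).
    let raw_len = tree.map_err(LenError::GetTree)?;
    // Step 2: open a read transaction (stands in for `self.db.read_txn()?`).
    if !txn_ok {
        return Err(LenError::ReadTxn("MDB_BAD_TXN".into()));
    }
    // Step 3: convert the entry count (stands in for `try_into().unwrap()`).
    usize::try_from(raw_len).map_err(|e| LenError::Conversion(e.to_string()))
}

fn main() {
    match queue_len(Ok(42), true) {
        Ok(n) => println!("block_local_resync_queue length: {}", n),
        Err(e) => eprintln!("len() failed: {:?}", e),
    }
}
```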

#### Debugging your issue

It seems Garage does not like your `/storage/meta/db.lmdb` metadata file, but we can't know why. Improving Garage's error reporting would help debug your issue: if you want, you can contribute a patch to improve it, and we will be able to ship you a dedicated build. Otherwise, hopefully, someone else will find the time to improve this part of the code soon :-)

If you are 100% sure that your cluster contains no personal/confidential/sensitive data, you can email your `/storage/meta/db.lmdb` file to garagehq {at} deuxfleurs.fr; it could help us reproduce your issue.

Finally, if you need to debug your cluster, please follow the [Recovering from failures](https://garagehq.deuxfleurs.fr/documentation/cookbook/recovering/) recipe from Garage's documentation, especially the "Replacement scenario 2: metadata (and possibly data) is lost" section.

quentin added the Bug label 2023-04-21 10:20:58 +00:00
Author
Contributor

Thanks for the detailed reply. Unfortunately, I won't be able to dig into the issue as I'm swamped with other work at $job :-(. Also, I can't reproduce the issue anymore (see below).

But I do have a few more pieces of information that may help whoever picks this up.

First, the affected node runs two garage processes:

  • gw process using partition on ssd
  • data process using partition on hdd

Both processes started crashing at roughly the same time. I don't have the logs anymore, so I'm not 100% sure, but I think it was the same panic message in both cases.

Second, the panic did go away after I rebooted the node itself. New gw and data processes started successfully with the original data/meta directories.

The two points above make me think it is not an issue with the db.lmdb file, or at least not just an issue with that file. Maybe some lower-level I/O error when accessing the file? Something that goes away with a reboot?

The chance that two metadata files get corrupted at the same time, in the same way, is slim, especially with two different physical drives.

Stuff that is common to both processes:

  • LVM managing data partitions
  • Nomad running both processes (as two groups in single job)
  • rest of the OS (of course :-))

I went through the node OS logs again, but I can't find anything out of the ordinary. Unlike the logs from the Nomad allocation (i.e. the garage process itself), I still have the OS logs and can dig through them. If you need me to check something, let me know.

Owner

I feel that we can safely say that this is an issue caused by a transient condition in your system which made LMDB behave strangely, and not necessarily an issue in Garage itself. I will close this issue for now but feel free to re-open if you'd like to add something.

lx closed this issue 2023-10-16 09:53:05 +00:00