merkle tree panic #797

Closed
opened 2024-04-08 07:18:06 +00:00 by tezlm · 11 comments

Garage started giving this error with seemingly no warning or reason, and there's no way to start the server/access any data.

2024-04-08T07:10:58.228275Z  INFO garage_block::resync: Resync block 9e96992fb1d03465: fetching absent but needed block (refcount > 0)
2024-04-08T07:10:58.239011Z  INFO garage_table::gc: (bucket_object_counter) GC: 6 items successfully pushed, will try to delete.
======== PANIC (internal Garage error) ========
panicked at /build/source/src/table/merkle.rs:207:25:
assertion `left == right` failed
  left: [185]
 right: [102]

Panics are internal errors that Garage is unable to handle on its own.
They can be caused by bugs in Garage's code, or by corrupted data in
the node's storage. If you feel that this error is likely to be a bug
in Garage, please report it on our issue tracker at the following address:

        https://git.deuxfleurs.fr/Deuxfleurs/garage/issues

Please include the last log messages and the full backtrace below in
your bug report, as well as any relevant information on the context in
which Garage was running when this error occurred.

GARAGE VERSION: cargo:0.9.3 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]

BACKTRACE:
   0: garage::main::{{closure}}::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic_handler::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::panicking::assert_failed_inner
   7: core::panicking::assert_failed
   8: garage_table::merkle::MerkleUpdater<F,R>::update_item_rec
   9: garage_table::merkle::MerkleUpdater<F,R>::update_item_rec
  10: <garage_db::TxFn<F,R,E> as garage_db::ITxFn>::try_on
  11: <garage_db::sqlite_adapter::SqliteDb as garage_db::IDb>::transaction
  12: tokio::runtime::task::raw::poll
  13: tokio::runtime::task::UnownedTask<S>::run
  14: std::sys_common::backtrace::__rust_begin_short_backtrace
  15: core::ops::function::FnOnce::call_once{{vtable.shim}}
  16: std::sys::unix::thread::Thread::new::thread_start
  17: start_thread
  18: __clone3
Owner

@tezlm can you specify the hardware you use? I see in the stack trace that you are using the SQLite backend for metadata. Is the storage an SD card? Was there any interruption in power? What's the underlying filesystem for metadata?

Author

I'm using an old laptop, the data (both metadata/sqlite and blobs) is on a nvme ssd with ext4. As far as I know, the ssd is healthy. I do remember rebooting before the database failure (it failed on startup), but that seems like it would've been a clean shutdown.

Owner

Are there any other nodes in that cluster, or are you running a single node?

Author

This is with a single node.

Owner

Unfortunately, there is not much to be done at this point; this does look like corruption in your database file that Garage is not able to recover from. If the corruption is limited to the Merkle tree, we could probably just rebuild that tree from scratch, since the tree is basically just an index structure, but there is currently no code in Garage to do this.

Author

Is there any way to manually recover the data?

Owner

@tezlm can you be more specific on the kind of data that was stored on this node? Do you have any backups of the metadata? @lx could the sled migration code be used as a base to rebuild the index tree?

Owner

> could the sled migration code be used as a base to rebuild the index tree?

No, unfortunately, it does not rebuild the Merkle tree, it copies it as-is.

> Is there any way to manually recover the data?

You could try clearing the Merkle tree tables in the sqlite db file. This will **not** put your Garage instance in a proper state for continued use, but it should at least allow you to start Garage and copy your data out of it.

First, make a backup of your db.sqlite file.

Then, launch the sqlite command line with `sqlite3 db.sqlite`.

Then, execute the following commands in your sqlite session:

DELETE FROM tree_bucket_v2_COLON_merkle_tree ;
DELETE FROM tree_bucket_alias_COLON_merkle_tree ;
DELETE FROM tree_key_COLON_merkle_tree ;
DELETE FROM tree_block_ref_COLON_merkle_tree ;
DELETE FROM tree_version_COLON_merkle_tree ;
DELETE FROM tree_bucket_mpu_counter_COLON_merkle_tree ;
DELETE FROM tree_multipart_upload_COLON_merkle_tree ;
DELETE FROM tree_bucket_object_counter_COLON_merkle_tree ;
DELETE FROM tree_object_COLON_merkle_tree ;
DELETE FROM tree_k2v_index_counter_v2_COLON_merkle_tree ;
DELETE FROM tree_k2v_item_COLON_merkle_tree ;

Then, try starting Garage.
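The same clean-up can also be sketched in Python with the standard-library `sqlite3` module. This is a minimal sketch, not part of the advice above: the `db.sqlite` path is an assumption, the table names are taken from the session above, and tables that don't exist in a given deployment (e.g. the k2v ones) are simply skipped.

```python
import sqlite3

# Merkle-tree table names as used by Garage's sqlite backend
# (taken from the DELETE statements above).
MERKLE_TABLES = [
    "tree_bucket_v2_COLON_merkle_tree",
    "tree_bucket_alias_COLON_merkle_tree",
    "tree_key_COLON_merkle_tree",
    "tree_block_ref_COLON_merkle_tree",
    "tree_version_COLON_merkle_tree",
    "tree_bucket_mpu_counter_COLON_merkle_tree",
    "tree_multipart_upload_COLON_merkle_tree",
    "tree_bucket_object_counter_COLON_merkle_tree",
    "tree_object_COLON_merkle_tree",
    "tree_k2v_index_counter_v2_COLON_merkle_tree",
    "tree_k2v_item_COLON_merkle_tree",
]

def clear_merkle_tables(db_path: str) -> None:
    """Delete all rows from the Merkle-tree tables in one transaction,
    skipping any table that doesn't exist in this particular db file."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on an uncaught error
            for table in MERKLE_TABLES:
                try:
                    con.execute(f"DELETE FROM {table}")
                except sqlite3.OperationalError:
                    pass  # table absent in this deployment (e.g. no k2v)
    finally:
        con.close()

# Usage, after stopping Garage and backing up the file:
# clear_merkle_tables("db.sqlite")
```

As with the raw SQL session, stop Garage and back up `db.sqlite` before running this.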

Author

> can you be more specific on the kind of data that was stored on this node? Do you have any backups of the metadata?

@maximilien Media and tarballs, most of which are 1-100 MiB. I don't have any backup of the metadata, but I definitely will set one up now!

@lx That works, and gave these logs; not sure if they're useful or not:

2024-04-09T16:51:36.234120Z  INFO garage_block::resync: Resync block b1cd5105419791c0: fetching absent but needed block (refcount > 0)
2024-04-09T16:51:36.234413Z ERROR garage_block::resync: Error when resyncing b1cd5105419791c0: Get block b1cd5105419791c0: no node returned a valid block
2024-04-09T16:51:36.683276Z  INFO garage_block::resync: Resync block 5674d240da40b609: fetching absent but needed block (refcount > 0)
2024-04-09T16:51:36.684451Z ERROR garage_block::resync: Error when resyncing 5674d240da40b609: Get block 5674d240da40b609: no node returned a valid block
2024-04-09T16:51:37.005386Z  INFO garage_block::resync: Resync block ae41f2595e3d6d05: fetching absent but needed block (refcount > 0)
2024-04-09T16:51:37.005824Z ERROR garage_block::resync: Error when resyncing ae41f2595e3d6d05: Get block ae41f2595e3d6d05: no node returned a valid block
2024-04-09T16:51:37.269270Z  INFO garage_block::resync: Resync block 19f92f50226fec93: fetching absent but needed block (refcount > 0)
2024-04-09T16:51:37.269715Z ERROR garage_block::resync: Error when resyncing 19f92f50226fec93: Get block 19f92f50226fec93: no node returned a valid block
2024-04-09T16:51:37.494254Z  INFO garage_block::resync: Resync block 9afa2eeddead71eb: fetching absent but needed block (refcount > 0)
2024-04-09T16:51:37.494673Z ERROR garage_block::resync: Error when resyncing 9afa2eeddead71eb: Get block 9afa2eeddead71eb: no node returned a valid block
2024-04-09T16:51:37.694563Z  INFO garage_block::resync: Resync block c6ab59de694949fb: fetching absent but needed block (refcount > 0)
(repeated ~20-30 times)
Owner

@tezlm These errors might indicate more corruption in your metadata file, but the most important thing is that you're able to retrieve all your files from Garage. At this point, you should copy everything out of Garage onto your local disk and rebuild your Garage cluster from scratch.

Author

I'm closing this for now, as I think it's an error on my end rather than with garage.

tezlm closed this issue 2024-04-09 17:32:41 +00:00
Reference: Deuxfleurs/garage#797