WIP: Fjall DB engine #906

Draft
withings wants to merge 5 commits from withings/garage:feat/fjall-db-engine into main
Contributor

This is a draft implementation for a new meta backend based on LSM trees using fjall. A couple of things to note so far:

  • ITx::clear could not be implemented (without iterating over and deleting all keys), but I believe this method is never used. Since this isn't the first time I've run into that issue, perhaps we could just remove the method?
  • New configuration option fjall_block_cache_size to set the block cache size.
  • I could not for the life of me find a way around the 'r lifetimes in ITx::range and ITx::range_rev, so I ended up cloning the bounds to avoid conflicts with '_ (a sketch of that workaround follows this list).

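For illustration, here is a minimal, self-contained sketch of the bound-cloning workaround mentioned in the last bullet. The helper name to_owned_bound is hypothetical and not part of this PR; it only shows how borrowed Bound<&[u8]> values can be turned into owned bounds that no longer depend on the 'r lifetime.

```
use std::ops::Bound;

// Hypothetical helper, not code from this PR: copy borrowed bounds into
// owned Vec<u8> bounds so the resulting range no longer borrows from 'r.
fn to_owned_bound(b: Bound<&[u8]>) -> Bound<Vec<u8>> {
    match b {
        Bound::Included(k) => Bound::Included(k.to_vec()),
        Bound::Excluded(k) => Bound::Excluded(k.to_vec()),
        Bound::Unbounded => Bound::Unbounded,
    }
}

fn main() {
    let lo: Bound<&[u8]> = Bound::Included(b"a".as_slice());
    let hi: Bound<&[u8]> = Bound::Excluded(b"z".as_slice());
    // The owned pair can be moved into a range iterator without tying its
    // lifetime to the borrowed keys.
    let range: (Bound<Vec<u8>>, Bound<Vec<u8>>) = (to_owned_bound(lo), to_owned_bound(hi));
    println!("{:?}", range);
}
```
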
Write performance has been pretty low so far; there's definitely room for improvement in this PR. Using the dashboard from #851:

![image](/attachments/b192d0a3-e9f6-419b-b550-c254de51ea7c)

![image](/attachments/a53d07af-c5c3-492d-b98b-b85b9a3b6cf1)

![image](/attachments/ad5dc456-3eac-47ce-a6ec-bca042343751)

In my test setup there were also some nasty crashes, which I have yet to explain: somehow the backend messes with the integrity of the Merkle trees.

======== PANIC (internal Garage error) ========
panicked at /home/build/garage/src/table/merkle.rs:219:10:
called `Option::unwrap()` on a `None` value
Panics are internal errors that Garage is unable to handle on its own.
They can be caused by bugs in Garage's code, or by corrupted data in
the node's storage. If you feel that this error is likely to be a bug
in Garage, please report it on our issue tracker a the following address:
     https://git.deuxfleurs.fr/Deuxfleurs/garage/issues
Please include the last log messages and the the full backtrace below in
your bug report, as well as any relevant information on the context in
which Garage was running when this error occurred.
GARAGE VERSION: git:v1.0.1-13-gc50cab80-modified [features: k2v, lmdb, sqlite, fjall, metrics, bundled-libs]
BACKTRACE:
    0: garage::main::{{closure}}::{{closure}}
              at home/build/garage/src/garage/main.rs:135:21
    1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2245:9
       std::panicking::rust_panic_with_hook
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:805:13
    2: std::panicking::begin_panic_handler::{{closure}}
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:664:13
    3: std::sys::backtrace::__rust_end_short_backtrace
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/backtrace.rs:170:18
    4: rust_begin_unwind
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:662:5
    5: core::panicking::panic_fmt
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panicking.rs:74:14
    6: core::panicking::panic
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panicking.rs:148:5
    7: core::option::unwrap_failed
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/option.rs:2015:5
    8: core::option::Option<T>::unwrap
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/option.rs:965:21
       garage_table::merkle::MerkleUpdater<F,R>::update_item_rec
              at home/build/garage/src/table/merkle.rs:211:28
    9: garage_table::merkle::MerkleUpdater<F,R>::update_item_rec
              at home/build/garage/src/table/merkle.rs:149:28
   10: garage_table::merkle::MerkleUpdater<F,R>::update_item::{{closure}}
              at home/build/garage/src/table/merkle.rs:111:22
       <garage_db::TxFn<F,R,E> as garage_db::ITxFn>::try_on
              at home/build/garage/src/db/lib.rs:417:13
   11: <garage_db::metric_proxy::MetricITxFnProxy as garage_db::ITxFn>::try_on
              at home/build/garage/src/db/metric_proxy.rs:149:3
   12: <garage_db::fjall_adapter::FjallDb as garage_db::IDb>::transaction
              at home/build/garage/src/db/fjall_adapter.rs:228:13
   13: <garage_db::metric_proxy::MetricDbProxy as garage_db::IDb>::transaction::{{closure}}
              at home/build/garage/src/db/metric_proxy.rs:135:7
       garage_db::metric_proxy::MetricDbProxy::instrument
              at home/build/garage/src/db/metric_proxy.rs:50:13
       <garage_db::metric_proxy::MetricDbProxy as garage_db::IDb>::transaction
              at home/build/garage/src/db/metric_proxy.rs:134:3
   14: garage_db::Db::transaction
              at home/build/garage/src/db/lib.rs:110:16
       garage_table::merkle::MerkleUpdater<F,R>::update_item
              at home/build/garage/src/table/merkle.rs:108:3
       garage_table::merkle::MerkleUpdater<F,R>::updater_loop_iter
              at home/build/garage/src/table/merkle.rs:85:4
       <garage_table::merkle::MerkleWorker<F,R> as garage_util::background::worker::Worker>::work::{{closure}}::{{closure}}
              at home/build/garage/src/table/merkle.rs:318:13
       <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/blocking/task.rs:42:21
       tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328:17
       tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/loom/std/unsafe_cell.rs:16:9
       tokio::runtime::task::core::Core<T,S>::poll
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:317:13
       tokio::runtime::task::harness::poll_future::{{closure}}
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:485:19
       <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panic/unwind_safe.rs:272:9
       std::panicking::try::do_call
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:554:40
       std::panicking::try
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:518:19
       std::panic::catch_unwind
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panic.rs:345:14
       tokio::runtime::task::harness::poll_future
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:473:18
       tokio::runtime::task::harness::Harness<T,S>::poll_inner
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:208:27
       tokio::runtime::task::harness::Harness<T,S>::poll
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:153:15
       tokio::runtime::task::raw::poll
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/raw.rs:271:5
   15: tokio::runtime::task::raw::RawTask::poll
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/raw.rs:201:18
       tokio::runtime::task::UnownedTask<S>::run
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/mod.rs:453:9
   16: tokio::runtime::blocking::pool::Task::run
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/blocking/pool.rs:159:9
       tokio::runtime::blocking::pool::Inner::run
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/blocking/pool.rs:513:17
       tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
              at home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/blocking/pool.rs:471:13
       std::sys::backtrace::__rust_begin_short_backtrace
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/backtrace.rs:154:18
   17: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/thread/mod.rs:522:17
       <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/panic/unwind_safe.rs:272:9
       std::panicking::try::do_call
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:554:40
       std::panicking::try
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panicking.rs:518:19
       std::panic::catch_unwind
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/panic.rs:345:14
       std::thread::Builder::spawn_unchecked_::{{closure}}
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/thread/mod.rs:521:30
       core::ops::function::FnOnce::call_once{{vtable.shim}}
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/ops/function.rs:250:5
   18: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2231:9
       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/alloc/src/boxed.rs:2231:9
       std::sys::pal::unix::thread::Thread::new::thread_start
              at rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/std/src/sys/pal/unix/thread.rs:105:17
withings added 2 commits 2024-11-21 17:37:42 +00:00
Contributor

I think it would be interesting to see metrics for individual KV operations. I don't see why raw write speed should be slow or get slower, so I'm wondering whether the PutObject call is (heavily) read-I/O limited, and if so, why.

Also, roughly how much data was written, with how much cache and memory, and how large were the resulting trees?


One obvious thing I'm noticing across the code base is that IDs are fully random; something like UUIDv7, ULID, etc. is much better for every KV store, as it implicitly orders new keys closer together, which helps with locality. If it is possible for Garage to use a time-ordered ID format, I would strongly recommend it.

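For illustration only, a minimal sketch of what a time-ordered ID looks like in Rust, using the uuid crate with its "v7" feature; this is not something Garage or this PR does.

```
// Sketch only: illustrates the time-ordered IDs suggested above.
// Assumes the `uuid` crate with the "v7" feature enabled.
use uuid::Uuid;

fn main() {
    // UUIDv7 stores a millisecond timestamp in its high bits, so IDs
    // generated close together in time also sort close together in a
    // key-value store, improving write locality.
    let earlier = Uuid::now_v7();
    let later = Uuid::now_v7();
    println!("{earlier}\n{later}");
}
```
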
Author
Contributor

First, to sort of eliminate the idea that this is caused by a combination of "this backend" on "this hardware", I re-ran your write-bench tool with --fjall and graphed the insert times again. I used filled curves because the behaviour is much more erratic this time, which lets us see a sort of moving-average line. By the end, the worst-case writes are about 1/3 faster than LMDB for this workload.

Average insert time (ms) up to 100M entries

![image](/attachments/e8c4dd76-d2c4-428f-8629-5490db9e555b)

Now about the individual KV operations, you make a very good point, here goes:

![image](/attachments/ae650a48-5d21-4514-a938-04b314986909)

We can see that insert is being very closely tailed by get, suggesting that the writes may indeed be hindered by the reads. In Garage the S3 PutObject operation does require a couple of reads.

One other thing I notice is that the monitoring appears to be causing a bunch of slow len operations. This isn't a problem with heed/LMDB, since the length comes from a stats aggregator, but, as I've just noticed, in the case of fjall this is an O(n) operation. I'll rerun the test without it to remove that noise.

![image](/attachments/7e3bec3d-5352-43cd-879b-c61699282037)

Regarding size: we were running on 8GB RAM with a 4GB block cache. We went up to 50k S3 objects, with about 3 trees at ~90k entries as seen below (the others were almost empty, not drawn):

![image](/attachments/07a095f9-a132-4351-bb2c-2f814fc7cd09)

Contributor

I'm guessing len() is only really used for metrics; in that case it's better to use approximate_len(), which is O(1) but may be a bit off if there are updates or deletes (it tends to converge to the real value over time). For metrics that is generally acceptable, because it shouldn't matter whether you have 928525 block refs or 928825 (for example).

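A hedged sketch of what that swap could look like; the helper below is hypothetical (not code from this PR) and assumes fjall exposes approximate_len() on partition handles as described above.

```
use fjall::PartitionHandle;

// Hypothetical helper: report a tree length for metrics using fjall's
// approximate item count instead of a full scan.
fn tree_len_for_metrics(partition: &PartitionHandle) -> usize {
    // len() would walk every key (O(n)); approximate_len() reads a cached
    // counter (O(1)) that may drift slightly after updates and deletes.
    partition.approximate_len()
}
```
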
50k-100k is really not a lot of items. Even for large values, it's not unfeasible to write that amount in ~1 sec or so. Do you know how much raw data that actually entails and how much ended up on disk? That would give the average value size, which may be important to know.

I don't fully understand the KV graph. I wouldn't expect the inserts to degrade or even be similar to get(); are you measuring those at the KV layer? The raw insert does not have a get() in it (https://git.deuxfleurs.fr/withings/garage/src/branch/feat/fjall-db-engine/src/db/fjall_adapter.rs#L155-L160). At this point, I think it makes sense to see how many KV operations actually happen as more S3 Puts happen, because if both backends degrade by more than 50%, it feels to me like increasingly more effort needs to be spent to put another object as the database gets larger. What I am saying is that the KV store may not be getting much slower, but rather that it is called more often, which might give the impression of it slowing down more than expected. But my idea is partially motivated by me not fully knowing how the Garage tables are structured and what PutObject is actually doing.

Another thing I'm realizing is that Garage is not using an alternative memory allocator. I would strongly recommend using something like jemalloc; it tends to give you more performance and lower memory usage pretty much for free. LMDB doesn't really do many heap allocations because it runs through the kernel page cache, but for anything else (SQLite included), a better memory allocator should definitely be considered.

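For context, switching the global allocator in Rust is a few-line, one-time change. Here is a minimal sketch using the tikv-jemallocator crate; the crate choice is only an example and is not part of this PR.

```
// Sketch: opt into jemalloc as the global allocator.
// Assumes the `tikv-jemallocator` crate is added to Cargo.toml.
use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // Every heap allocation in the program now goes through jemalloc
    // instead of the system allocator.
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} elements", v.len());
}
```
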
lx reviewed 2024-12-05 18:34:53 +00:00
lx left a comment
Owner

Concerning performance, my suspicion is that fjall does an fsync to the journal after each transaction. With other db engines we never do fsync if metadata_fsync = false in the config. To obtain the corresponding level of performance with fjall, it would need to be configured accordingly.

I did not see an error in the fjall adapter that could explain the Merkle tree inconsistencies reported in your first post.

@ -0,0 +114,4 @@
let mut path = to.clone();
path.push("data.fjall");
let source_keyspace = fjall::Config::new(&self.path).open()?;
Owner

Why create a new source_keyspace and not use self.keyspace? Does this not cause consistency issues?

@ -0,0 +120,4 @@
for partition_name in source_keyspace.list_partitions() {
let source_partition = source_keyspace
.open_partition(&partition_name, PartitionCreateOptions::default())?;
let snapshot = source_partition.snapshot();
Owner

We need the snapshots we take to be consistent across partitions, i.e. we need to snapshot all partitions at once and then copy them to the target keyspace. Here we are snapshotting each partition one after the other, so there will be changes in between that will cause inconsistencies when trying to restore from the snapshot.

Suggestion: use a read transaction that runs during the entire duration of FjallDb::snapshot, and iterate over the data of all partitions using this transaction to get a consistent read view of the db. Unless there are particular restrictions on having relatively long-lived read-only transactions, this should be fine. I would also suggest writing to the target db within a single big write transaction.

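A rough sketch of that suggestion, assuming fjall's transactional API (read_tx, write_tx, ReadTransaction::iter, WriteTransaction::insert, list_partitions); the method names and signatures are assumptions to verify against the fjall version in use, and this is not the code in the PR.

```
use fjall::{PartitionCreateOptions, TxKeyspace};

// Sketch of the suggestion above: a single read transaction provides a
// consistent view across all partitions, and the copy lands in the target
// keyspace through a single write transaction. API details are assumed.
fn copy_consistent_snapshot(source: &TxKeyspace, target: &TxKeyspace) -> Result<(), fjall::Error> {
    let read_tx = source.read_tx(); // stays open for the whole copy
    let mut write_tx = target.write_tx();
    for name in source.list_partitions() {
        let src = source.open_partition(&name, PartitionCreateOptions::default())?;
        let dst = target.open_partition(&name, PartitionCreateOptions::default())?;
        // Read through the transaction so every partition is seen at the
        // same point in time.
        for kv in read_tx.iter(&src) {
            let (key, value) = kv?;
            write_tx.insert(&dst, key, value);
        }
    }
    write_tx.commit()?;
    Ok(())
}
```
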
src/db/open.rs Outdated
@ -117,0 +125,4 @@
#[cfg(feature = "fjall")]
Engine::Fjall => {
info!("Opening Fjall database at: {}", path.display());
let fsync_ms = opt.fsync.then(|| 1000 as u16);
Owner

I think the correct implementation of opt.fsync == false would be to disable all fsync operations in fjall, in particular setting manual_journal_persist to true so that transactions would not do an fsync call. This is the meaning of that option for other db engines. Even with opt.fsync == false we can set fsync_ms to something reasonable like 1000, because if I understand correctly, the fsyncs will then be done by background threads at a regular interval and will not interfere with interactive operations. @marvinj97 please correct me if I'm wrong.

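A hedged sketch of how that could be wired when opening the keyspace; the builder calls (manual_journal_persist, fsync_ms, open_transactional) are assumptions based on the option names mentioned in this thread and should be checked against fjall's actual API.

```
use std::path::Path;

use fjall::Config;

// Sketch only, not code from this PR: map Garage's metadata_fsync option
// onto fjall's journal persistence settings (names assumed from this thread).
fn open_keyspace(path: &Path, metadata_fsync: bool) -> Result<fjall::TxKeyspace, fjall::Error> {
    let mut config = Config::new(path);
    if !metadata_fsync {
        // No fsync on commit; a background thread syncs the journal
        // roughly every second instead.
        config = config.manual_journal_persist(true).fsync_ms(Some(1000));
    }
    config.open_transactional()
}
```
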
Contributor

By default, every write operation (such as a WriteTx.commit) flushes to OS buffers, but not to disk. This is the same behaviour as RocksDB, and gives you crash safety, but not power-loss/kernel-panic safety.
manual_journal_persist skips flushing to OS buffers, so all the data is kept in the user-space BufWriter (unless it is full, you call Keyspace::persist, or you set a WriteTransaction::durability level), which means you can lose data if the application cannot unwind properly (e.g. it is killed).

Owner

I'm not sure exactly how the following interact together:

  • manual_journal_persist at the level of the TxKeyspace, which is set in the Config
  • manual_journal_persist at the level of the TxPartitionHandle, which is set in the PartitionCreateOptions passed as an argument to TxKeyspace::open_partition
  • the durability of a WriteTransaction which can be set using WriteTransaction::durability

I think for metadata_fsync=false we want the Buffer durability level for all write transactions, and we don't necessarily need manual_journal_persist if just setting WriteTransaction::durability(Buffer) is enough to avoid all calls to fsync when we commit the transaction.

Conversely, for metadata_fsync=true we want the SyncAll durability level for all write transactions. SyncData is probably OK as well, at least on Linux.

Owner

One obvious thing I'm noticing across the code base is that IDs are fully random; something like UUIDv7, ULID, etc. is much better for every KV store, as it implicitly orders new keys closer together, which helps with locality. If it is possible for Garage to use a time-ordered ID format, I would strongly recommend it.

In many cases in Garage, random UUIDs or hashes are necessary to ensure good balancing of data between cluster nodes. This is the case in particular for UUIDs of object versions and hashes of data blocks.

Owner

@withings Could you give me write access to your fork of the Garage repo? This would allow me to upload an updated version of Cargo.nix so that we can run the CI. Thanks :)

lx added 1 commit 2025-01-04 16:09:27 +00:00
update cargo.nix
Some checks failed
ci/woodpecker/pr/debug Pipeline failed
5acdcd9595
lx added 1 commit 2025-01-04 16:28:32 +00:00
cargo fmt
Some checks failed
ci/woodpecker/pr/debug Pipeline failed
0f722f0117
lx force-pushed feat/fjall-db-engine from 0f722f0117 to 04d3847200 2025-01-04 16:32:58 +00:00 Compare
lx added 1 commit 2025-01-04 16:56:19 +00:00
nix: build with fjall feature
All checks were successful
ci/woodpecker/pr/debug Pipeline was successful
c6b07d6fe8
Owner

Ok, I've fixed the CI; basic tests seem to work correctly using the fjall db engine. We are not running the integration tests with fjall; maybe we could add that as well at some point.

I think it would be nice to have this as an experimental option in Garage releases, so that with more users we could further debug it and potentially improve its performance.

Reference: Deuxfleurs/garage#906