garage/src/block/metrics.rs
Alex b44d3fc796
Abstract database behind generic interface and implement alternative drivers (#322)
- [x] Design interface
- [x] Implement Sled backend
  - [x] Re-implement the SledCountedTree hack ~~on Sled backend~~ on all backends (i.e. over the abstraction)
- [x] Convert Garage code to use generic interface
- [x] Proof-read converted Garage code
- [ ] Test everything well
- [x] Implement sqlite backend
- [x] Implement LMDB backend
- [ ] (Implement Persy backend?)
- [ ] (Implement other backends? (like RocksDB, ...))
- [x] Implement backend choice in config file and garage server module
- [x] Add CLI for converting between DB formats
- Exploit the new interface to put more things in transactions
  - [x] `.updated()` trigger on Garage tables

Fix #284

**Bugs**

- [x] When exporting sqlite, trees iterate empty??
- [x] LMDB doesn't work

**Known issues for various back-ends**

- Sled:
  - Eats all my RAM and also all my disk space
  - `.len()` has to traverse the whole table
  - Is actually quite slow on some operations
  - And is actually pretty bad code...
- Sqlite:
  - Requires a lock to be taken on all operations. The lock is also taken when iterating on a table with `.iter()`, and it isn't released until the iterator is dropped. This means that we must be VERY careful not to do anything else inside a `.iter()` loop, or else we will have a deadlock! Most such cases have been eliminated from the Garage codebase, but some might still remain. If your Garage-over-Sqlite seems to hang/freeze, this is the reason.
  - (adapter uses a bunch of unsafe code)
- Heed (LMDB):
  - Not suited for 32-bit machines as it has to map the whole DB in memory.
  - (adapter uses a tiny bit of unsafe code)
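
The Sqlite locking issue above boils down to one rule: finish iterating (and drop the iterator) before touching the table again. Here is a minimal sketch of that collect-then-mutate pattern. `LockedTable` and `purge_zeroes` are hypothetical stand-ins built on `std::sync::Mutex`, not the actual Sqlite adapter:

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Toy stand-in for a table whose every operation takes one lock
/// (hypothetical, for illustration only).
struct LockedTable {
    inner: Mutex<BTreeMap<String, u64>>,
}

impl LockedTable {
    fn new() -> Self {
        Self { inner: Mutex::new(BTreeMap::new()) }
    }

    fn insert(&self, key: &str, value: u64) {
        self.inner.lock().unwrap().insert(key.to_string(), value);
    }

    fn remove(&self, key: &str) {
        self.inner.lock().unwrap().remove(key);
    }

    fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }

    /// Iterates under the lock, copies out the matching keys,
    /// and releases the lock before returning.
    fn keys_where(&self, pred: impl Fn(u64) -> bool) -> Vec<String> {
        let guard = self.inner.lock().unwrap();
        let keys: Vec<String> = guard
            .iter()
            .filter(|(_, v)| pred(**v))
            .map(|(k, _)| k.clone())
            .collect();
        keys
    }
}

/// Removes every entry whose value is zero, never mutating while iterating.
fn purge_zeroes(table: &LockedTable) -> usize {
    // Phase 1: iterate and collect. The lock is released when this returns.
    let doomed = table.keys_where(|v| v == 0);
    // Phase 2: mutate. Each call takes the lock afresh, so nothing deadlocks.
    for key in &doomed {
        table.remove(key);
    }
    doomed.len()
}

fn main() {
    let table = LockedTable::new();
    table.insert("a", 0);
    table.insert("b", 7);
    table.insert("c", 0);
    let removed = purge_zeroes(&table);
    println!("removed {removed}, {} left", table.len()); // removed 2, 1 left
}
```

Calling `table.remove()` from inside the iteration in `keys_where` would try to take the same lock a second time, which is the kind of hang described above. The price of the pattern is buffering the collected keys in memory.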

**My recommendation:** avoid 32-bit machines and use LMDB as much as possible.

**Converting databases** is actually quite easy. For example from Sled to LMDB:

```bash
cd src/db
cargo run --features cli --bin convert -- -i path/to/garage/meta/db -a sled -o path/to/garage/meta/db.lmdb -b lmdb
```

Then, just add this to your `config.toml`:

```toml
db_engine = "lmdb"
```
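
To give an idea of why conversion is easy once everything sits behind a generic interface, here is a rough sketch: list the trees of the source, then copy every key/value pair into the destination. The `KvBackend` trait, `MemBackend` and `convert` below are hypothetical names for illustration and are not the actual `garage_db` API or the real `convert` binary:

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal backend interface (an assumption, not `garage_db`).
trait KvBackend {
    fn tree_names(&self) -> Vec<String>;
    fn entries(&self, tree: &str) -> Vec<(Vec<u8>, Vec<u8>)>;
    fn put(&mut self, tree: &str, key: &[u8], value: &[u8]);
}

/// In-memory stand-in playing the role of both engines in this sketch.
#[derive(Default)]
struct MemBackend {
    trees: BTreeMap<String, BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl KvBackend for MemBackend {
    fn tree_names(&self) -> Vec<String> {
        self.trees.keys().cloned().collect()
    }

    fn entries(&self, tree: &str) -> Vec<(Vec<u8>, Vec<u8>)> {
        self.trees
            .get(tree)
            .map(|t| t.iter().map(|(k, v)| (k.clone(), v.clone())).collect())
            .unwrap_or_default()
    }

    fn put(&mut self, tree: &str, key: &[u8], value: &[u8]) {
        self.trees
            .entry(tree.to_string())
            .or_default()
            .insert(key.to_vec(), value.to_vec());
    }
}

/// Copies every entry of every tree from `src` into `dst`.
/// Returns the number of entries copied. Empty trees are not recreated.
fn convert(src: &dyn KvBackend, dst: &mut dyn KvBackend) -> usize {
    let mut copied = 0;
    for tree in src.tree_names() {
        for (key, value) in src.entries(&tree) {
            dst.put(&tree, &key, &value);
            copied += 1;
        }
    }
    copied
}

fn main() {
    let mut src = MemBackend::default();
    src.put("objects", b"k1", b"v1");
    src.put("blocks", b"h", b"data");
    let mut dst = MemBackend::default();
    println!("copied {} entries", convert(&src, &mut dst));
}
```

Since the copy only relies on the trait, any pair of backends implementing it can be converted in either direction.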

Co-authored-by: Alex Auvolat <alex@adnab.me>
Reviewed-on: #322
Co-authored-by: Alex <alex@adnab.me>
Co-committed-by: Alex <alex@adnab.me>
2022-06-08 10:01:44 +02:00


use opentelemetry::{global, metrics::*};

use garage_db::counted_tree_hack::CountedTree;

/// BlockManagerMetrics references all counters used for block manager metrics
pub struct BlockManagerMetrics {
    pub(crate) _resync_queue_len: ValueObserver<u64>,
    pub(crate) _resync_errored_blocks: ValueObserver<u64>,

    pub(crate) resync_counter: BoundCounter<u64>,
    pub(crate) resync_error_counter: BoundCounter<u64>,
    pub(crate) resync_duration: BoundValueRecorder<f64>,
    pub(crate) resync_send_counter: Counter<u64>,
    pub(crate) resync_recv_counter: BoundCounter<u64>,

    pub(crate) bytes_read: BoundCounter<u64>,
    pub(crate) block_read_duration: BoundValueRecorder<f64>,
    pub(crate) bytes_written: BoundCounter<u64>,
    pub(crate) block_write_duration: BoundValueRecorder<f64>,
    pub(crate) delete_counter: BoundCounter<u64>,

    pub(crate) corruption_counter: BoundCounter<u64>,
}

impl BlockManagerMetrics {
    pub fn new(resync_queue: CountedTree, resync_errors: CountedTree) -> Self {
        let meter = global::meter("garage_model/block");
        Self {
            _resync_queue_len: meter
                .u64_value_observer("block.resync_queue_length", move |observer| {
                    observer.observe(resync_queue.len() as u64, &[])
                })
                .with_description(
                    "Number of block hashes queued for local check and possible resync",
                )
                .init(),
            _resync_errored_blocks: meter
                .u64_value_observer("block.resync_errored_blocks", move |observer| {
                    observer.observe(resync_errors.len() as u64, &[])
                })
                .with_description("Number of block hashes whose last resync resulted in an error")
                .init(),

            resync_counter: meter
                .u64_counter("block.resync_counter")
                .with_description("Number of calls to resync_block")
                .init()
                .bind(&[]),
            resync_error_counter: meter
                .u64_counter("block.resync_error_counter")
                .with_description("Number of calls to resync_block that returned an error")
                .init()
                .bind(&[]),
            resync_duration: meter
                .f64_value_recorder("block.resync_duration")
                .with_description("Duration of resync_block operations")
                .init()
                .bind(&[]),
            resync_send_counter: meter
                .u64_counter("block.resync_send_counter")
                .with_description("Number of blocks sent to another node in resync operations")
                .init(),
            resync_recv_counter: meter
                .u64_counter("block.resync_recv_counter")
                .with_description("Number of blocks received from other nodes in resync operations")
                .init()
                .bind(&[]),

            bytes_read: meter
                .u64_counter("block.bytes_read")
                .with_description("Number of bytes read from disk")
                .init()
                .bind(&[]),
            block_read_duration: meter
                .f64_value_recorder("block.read_duration")
                .with_description("Duration of block read operations")
                .init()
                .bind(&[]),
            bytes_written: meter
                .u64_counter("block.bytes_written")
                .with_description("Number of bytes written to disk")
                .init()
                .bind(&[]),
            block_write_duration: meter
                .f64_value_recorder("block.write_duration")
                .with_description("Duration of block write operations")
                .init()
                .bind(&[]),
            delete_counter: meter
                .u64_counter("block.delete_counter")
                .with_description("Number of blocks deleted")
                .init()
                .bind(&[]),

            corruption_counter: meter
                .u64_counter("block.corruption_counter")
                .with_description("Data corruptions detected on block reads")
                .init()
                .bind(&[]),
        }
    }
}
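
The two observers above call `.len()` on a `CountedTree` each time metrics are collected, so `.len()` needs to be cheap (on Sled, `.len()` has to traverse the whole table). A minimal sketch of the counting idea follows, using a hypothetical `CountedMap` over an in-memory `BTreeMap`. The real `CountedTree` in `garage_db` may differ in its details:

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical counted wrapper: the data plus a cached entry count.
struct CountedMap {
    tree: Mutex<BTreeMap<Vec<u8>, Vec<u8>>>,
    len: AtomicUsize,
}

impl CountedMap {
    /// Counts the entries once at construction. This stands in for the
    /// single full traversal a backend without a cheap `len` would need.
    fn new(tree: BTreeMap<Vec<u8>, Vec<u8>>) -> Self {
        let len = tree.iter().count();
        Self {
            tree: Mutex::new(tree),
            len: AtomicUsize::new(len),
        }
    }

    /// Inserts a value; the counter only grows if the key was absent.
    fn insert(&self, key: &[u8], value: &[u8]) {
        let old = self.tree.lock().unwrap().insert(key.to_vec(), value.to_vec());
        if old.is_none() {
            self.len.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Removes a key; the counter only shrinks if the key was present.
    fn remove(&self, key: &[u8]) {
        let old = self.tree.lock().unwrap().remove(key);
        if old.is_some() {
            self.len.fetch_sub(1, Ordering::Relaxed);
        }
    }

    /// Constant-time length: just reads the cached counter.
    fn len(&self) -> usize {
        self.len.load(Ordering::Relaxed)
    }
}

fn main() {
    let queue = CountedMap::new(BTreeMap::new());
    queue.insert(b"hash1", b"");
    queue.insert(b"hash1", b""); // overwrite, count unchanged
    queue.insert(b"hash2", b"");
    queue.remove(b"hash1");
    println!("queue length: {}", queue.len()); // queue length: 1
}
```

In this sketch the counter can briefly lag behind the map between the locked update and the atomic update, which is acceptable for a metric but would not be for logic that depends on an exact count.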