WIP: Fjall DB engine #906
This is a draft implementation for a new meta backend based on LSM trees using fjall. A couple of things to note so far:
- `ITx::clear` could not be implemented (without iterating over and deleting all keys), but I believe this method is never used. Since it's not the first time I've run into that issue, perhaps we could just remove that method?
- Added `fjall_block_cache_size` to set the block cache size.
- The `'r` lifetimes in `ITx::range` and `ITx::range_rev` were problematic; I ended up cloning the bounds to avoid conflicts with `'_` (see the sketch at the end of this comment).

Performance so far has been pretty low on writes; there's definitely some room for improvement in this PR. Using the dashboard from #851:
In my test setup there were also some nasty crashes, which have yet to be explained... Somehow the backend messes with the integrity of the Merkle trees...
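On the `'r` lifetime point above, here is a minimal sketch of the bounds-cloning workaround (illustration only, not the actual adapter code; the helper names are made up): converting the borrowed `Bound<&[u8]>` endpoints into owned `Bound<Vec<u8>>` lets the returned iterator live independently of the caller's `'r` borrow.

```rust
use std::ops::Bound;

// Hypothetical helper: turn a borrowed range bound into an owned one so the
// iterator no longer borrows from the caller ('r).
fn to_owned_bound(b: Bound<&[u8]>) -> Bound<Vec<u8>> {
    match b {
        Bound::Included(k) => Bound::Included(k.to_vec()),
        Bound::Excluded(k) => Bound::Excluded(k.to_vec()),
        Bound::Unbounded => Bound::Unbounded,
    }
}

// A range implementation can then hand the owned bounds to the underlying
// store and return an iterator whose lifetime is independent of 'r.
fn owned_range<'r>(
    low: Bound<&'r [u8]>,
    high: Bound<&'r [u8]>,
) -> (Bound<Vec<u8>>, Bound<Vec<u8>>) {
    (to_owned_bound(low), to_owned_bound(high))
}
```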
I think it would be interesting to see metrics for individual KV operations. I don't see why raw write speed should be slow or get slower, so I'm wondering if the PutObject call is (heavily) read-I/O limited, and if so, why.
Also, roughly how much data was written, with how much cache and memory, and how large were the resulting trees?
One obvious thing I'm noticing across the code base is that IDs are fully random; something like UUIDv7s, ULIDs, etc. is much better for every KV store, as this will implicitly order new keys closer together, which helps with locality. If it is possible for Garage to use a time-ordered ID format, I would strongly recommend it.
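A hedged illustration of that point (assuming the `uuid` crate with its `v4`/`v7` features; this is not Garage code): v7 IDs start with a timestamp, so keys generated around the same time tend to land next to each other in an ordered key space, while v4 IDs scatter uniformly.

```rust
use uuid::Uuid; // assumes uuid = { version = "1", features = ["v4", "v7"] }

fn main() {
    // UUIDv7: 48-bit millisecond timestamp prefix, then random bits.
    // IDs generated back to back share a prefix, so they sort close
    // together in a tree-structured KV store.
    let a = Uuid::now_v7();
    let b = Uuid::now_v7();

    // UUIDv4: fully random, so consecutive inserts touch unrelated
    // regions of the key space, which hurts locality.
    let c = Uuid::new_v4();
    let d = Uuid::new_v4();

    println!("v7: {a} {b}\nv4: {c} {d}");
}
```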
First, to sort of eliminate the idea that this is caused by a combination of "this backend" on "this hardware", I re-ran your write-bench tool with `--fjall` and graphed the insert times again. I used filled curves because the behaviour is much more erratic this time, and this allows us to see some sort of moving-average line. At the end, the worst-case writes are about 1/3 faster than LMDB for this workload.

[Graph: average insert time (ms) up to 100M entries]
Now about the individual KV operations, you make a very good point, here goes:
We can see that insert is being very closely tailed by get, suggesting that the writes may indeed be hindered by the reads. In Garage the S3 PutObject operation does require a couple of reads.
One other thing I notice is that the monitoring appears to be causing a bunch of slow `len` operations. This isn't a problem with heed/LMDB, since the length comes from a stats aggregator, but, as I've just noticed, in the case of fjall this is an O(n) operation. I'll rerun the test without it to remove that noise.

Regarding size: we were running on 8 GB of RAM with a 4 GB block cache. We went up to 50k S3 objects, with about 3 trees at ~90k entries as seen below (the others were almost empty, not drawn):
I'm guessing the `len()` is only really used for metrics; in that case it's better to use `approximate_len()`, which is O(1) but may be a bit off if there are updates or deletes (it tends to converge to the real value over time). For metrics that is generally acceptable, because it shouldn't matter whether you have 928525 block refs or 928825 (for example).
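For example, the metrics path could look roughly like this (a sketch only; `table_len_for_metrics` is a hypothetical helper, and I'm assuming a plain `fjall::PartitionHandle` here):

```rust
// Hypothetical metrics helper: report an O(1) estimate instead of scanning.
fn table_len_for_metrics(partition: &fjall::PartitionHandle) -> u64 {
    // `approximate_len()` is cheap but can drift a little after updates or
    // deletes; that is fine for a gauge, unlike the O(n) `len()` scan.
    partition.approximate_len() as u64
}
```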
50k-100k is really not a lot of items. Even for large values, it's not unfeasible to write that amount in ~1 second or so. Do you know how much raw data that actually entails and how much ended up on disk? That would give the average value size, which may be important to know.
I don't fully understand the KV graph; I wouldn't expect the inserts to degrade or even be similar to `get()`. Are you measuring those at the KV layer? Because the raw insert does not have a `get()` in it (https://git.deuxfleurs.fr/withings/garage/src/branch/feat/fjall-db-engine/src/db/fjall_adapter.rs#L155-L160). At this point, I think it makes sense to see how many KV operations actually happen as more S3 Puts happen, because if both backends degrade by more than 50%, it feels to me like the database getting larger means increasingly more effort is needed to put another object. What I am saying is that the KV store may not get much slower; instead it may be called more often, which might give the impression of it slowing down more than expected. But my idea is partially motivated by me not fully knowing how the Garage tables are structured and what PutObject actually does.
Another thing I'm realizing is that Garage is not using an alternative memory allocator. I would strongly recommend using something like jemalloc; that tends to give you more performance and lower memory usage pretty much for free. LMDB doesn't really do too many heap allocations because it runs through the kernel page cache, but for anything else (also SQLite, etc.), a better memory allocator should definitely be considered.
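For reference, wiring in jemalloc is a one-liner with the `tikv-jemallocator` crate (sketch only, not part of this PR; the crate would need to be added to Garage's Cargo.toml):

```rust
// In the binary crate's main.rs: replace the system allocator with jemalloc.
// jemalloc is not available on MSVC targets, hence the cfg guard.
#[cfg(not(target_env = "msvc"))]
use tikv_jemallocator::Jemalloc;

#[cfg(not(target_env = "msvc"))]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```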
Concerning performance, my suspicion is that fjall does an fsync to the journal after each transaction. With other db engines we never do fsync if `metadata_fsync = false` in the config. To obtain the corresponding level of performance with fjall, it would need to be configured accordingly.

I did not see an error in the fjall adapter that could explain the inconsistencies of the Merkle tree reported in your first post.
@ -0,0 +114,4 @@
let mut path = to.clone();
path.push("data.fjall");
let source_keyspace = fjall::Config::new(&self.path).open()?;
Why create a new `source_keyspace` and not use `self.keyspace`? Does this not cause consistency issues?

@ -0,0 +120,4 @@
for partition_name in source_keyspace.list_partitions() {
    let source_partition = source_keyspace
        .open_partition(&partition_name, PartitionCreateOptions::default())?;
    let snapshot = source_partition.snapshot();
We need the snapshots we take to be consistent across partitions, i.e. we need to snapshot all partitions at once and then copy them to the target keyspace. Here we are snapshotting each partition one after the other, so there will be changes in between that will cause inconsistencies when trying to restore from the snapshot.
Suggestion: use a read transaction that runs for the entire duration of `FjallDb::snapshot`, and iterate over the data of all partitions using this transaction to get a consistent read view of the db. Unless there are particular restrictions on having relatively long-lived read-only transactions, this should be fine. I would also suggest writing to the target db within a single big write transaction.
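A rough sketch of that suggestion, with heavy caveats: I'm assuming fjall's transactional API (`TxKeyspace::read_tx` / `write_tx`, iteration through the read transaction), and the exact method names and signatures should be checked against the version used in this PR.

```rust
// Sketch only: copy all partitions from source to target under a single
// read transaction, so every partition is seen at the same point in time.
// `pairs` holds (source, destination) partition handles, opened beforehand.
fn snapshot_consistent(
    source: &fjall::TxKeyspace,
    target: &fjall::TxKeyspace,
    pairs: &[(fjall::TxPartitionHandle, fjall::TxPartitionHandle)],
) -> fjall::Result<()> {
    let read_tx = source.read_tx();       // one consistent read view
    let mut write_tx = target.write_tx(); // one big write transaction

    for (src, dst) in pairs {
        for item in read_tx.iter(src) {
            let (key, value) = item?;
            write_tx.insert(dst, key, value);
        }
    }
    write_tx.commit()?;
    Ok(())
}
```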
@ -117,0 +125,4 @@
#[cfg(feature = "fjall")]
Engine::Fjall => {
    info!("Opening Fjall database at: {}", path.display());
    let fsync_ms = opt.fsync.then(|| 1000 as u16);
I think the correct implementation of `opt.fsync == false` would be to disable all fsync operations in fjall, in particular setting `manual_journal_persist` to `true` so that transactions would not do an fsync call. This is the meaning of that option for other db engines. Even with `opt.fsync == false` we can set `fsync_ms` to something reasonable like 1000, because if I understand correctly, the fsyncs will now be done by background threads at a regular interval and will not interfere with interactive operations. @marvinj97 please correct me if I'm wrong.

By default, every write operation (such as a `WriteTx.commit`) flushes to OS buffers, but not to disk. This is the same behaviour as RocksDB, and gives you crash safety, but not power-loss/kernel-panic safety. `manual_journal_persist` skips flushing to OS buffers, so all the data is kept in the user-space BufWriter (unless it is full, or you call `Keyspace::persist`, or set a `WriteTransaction::durability` level), so you can lose data if the application cannot unwind properly (e.g. it is killed).

I'm not sure exactly how the following interact together:
- `manual_journal_persist` at the level of the `TxKeyspace`, which is set in the `Config`
- `manual_journal_persist` at the level of the `TxPartitionHandle`, which is set in the `PartitionCreateOptions` passed as an argument to `TxKeyspace::open_partition`
- the durability of the `WriteTransaction`, which can be set using `WriteTransaction::durability`
I think for `metadata_fsync=false` we want the `Buffer` durability level for all write transactions, and we don't necessarily need `manual_journal_persist` if just setting `WriteTransaction::durability(Buffer)` is enough to avoid all calls to fsync when we commit the transaction.

Conversely, for `metadata_fsync=true` we want the `SyncAll` durability level for all write transactions. `SyncData` is probably ok as well, at least on Linux.
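A minimal sketch of that mapping, assuming the durability levels discussed above are exposed through fjall's `PersistMode` enum (names to be double-checked against the fjall version in use):

```rust
use fjall::PersistMode;

// Hypothetical mapping (not from the PR): choose the per-transaction
// durability level from Garage's metadata_fsync setting.
fn tx_durability(metadata_fsync: bool) -> PersistMode {
    if metadata_fsync {
        // Full fsync on commit; SyncData would likely also be acceptable.
        PersistMode::SyncAll
    } else {
        // Leave commits in OS buffers: no fsync on the commit path.
        PersistMode::Buffer
    }
}
```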
In many cases in Garage, random UUIDs or hashes are necessary to ensure good balancing of data between cluster nodes. This is the case in particular for the UUIDs of object versions and the hashes of data blocks.
@withings Could you give me write access to your fork of the Garage repo? This would allow me to upload an updated version of Cargo.nix so that we can run the CI. Thanks :)
Ok, I've fixed the CI; basic tests seem to work correctly using the fjall db engine. We are not running the integration tests with fjall; maybe we could add that at some point as well.
I think it would be nice to have this as an experimental option in Garage releases, so that with more users we could further debug it and potentially improve its performance.