WIP: replace some todo trees with disk-backed queues #903
Reference: Deuxfleurs/garage#903
During investigations for #851, it was suggested that using LMDB as a storage backend for the merkle_todo work might not be the best approach (10387, 10770). This could be because the read/write pattern of this workload is at odds with LMDB's strengths, and might even impact other B+-tree use cases within Garage (by causing inconvenient page swaps?).
As a workaround, it was suggested to use a disk-backed queue instead, which fits the `merkle_todo` (and `gc_todo`) use cases a lot better. yaque was suggested as an "adapter" for this (borrowing from the terminology used in `garage_db`).

This PR is a first attempt at implementing this. It brings in a new `garage_todo` crate, similar to `garage_db`, charged with managing todo queues. The `merkle_todo` and `gc_todo` trees have been removed and replaced with queues under that new crate. Tree iterators have been rewritten as FIFO pops, and inserts as FIFO pushes.

Although the initial results were promising, I remain somewhat unsure about this, in particular the changes brought to the GC. There is also a small admin drawback: it is no longer possible to probe queue lengths (not available in yaque).
Whenever you have time I would love to get your feedback on this. Even if it doesn't get merged it might be a good opportunity to discuss the general idea of moving those trees to more queue-like structures.
Thanks!
ab3a409d04 to 3b3086d513
This is interesting work for sure, but I think there are many things to be careful of:
First, I think we would want to have an adapter that keeps the current behavior of using the DB engine for queues.
We need to make sure that we are not breaking any behaviour that depends on queues being updated atomically with other information in the metadata DB (this is done in a transaction). For example, the Merkle todo queue is updated atomically in `TableData::update_entry_with` and `TableData::delete_if_equal` / `TableData::delete_if_equal_hash`, such that the following invariant holds: the entry in the Merkle todo tree, if it exists, always has a value that is the hash of the current version in the table data. I think most inserts and removals from todo queues have some atomic behaviour with other things going on in the transaction that must be respected.

Most queues are not just FIFO queues; they use the ordered key/value structure of DB trees for some purpose. The Merkle todo queue uses the key to make sure there is only one queue entry per table entry, which reduces the load on the Merkle updater worker when the same table entry is updated many times in a short time span (this can be the case for the counter tables, for instance). Arguably this is a small optimization. More importantly, the table GC and resync queues use the key to store the timestamp at which the entry should be processed, and to keep all queue entries sorted by timestamp.
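The two key-structure tricks just described can be made concrete with a small sketch. The names and key layout here are hypothetical illustrations, not Garage's actual code: an ordered map with a big-endian timestamp prefix keeps entries sorted by processing deadline, and keying by the table entry itself collapses repeated updates into one pending item — a plain FIFO provides neither property for free.

```rust
use std::collections::BTreeMap;

/// Hypothetical GC/resync-style key: timestamp prefix, then entry hash.
fn timestamped_key(process_at: u64, entry_hash: &[u8]) -> Vec<u8> {
    // Big-endian so lexicographic key order matches numeric timestamp order.
    let mut key = process_at.to_be_bytes().to_vec();
    key.extend_from_slice(entry_hash);
    key
}

fn main() {
    // GC/resync-style queue: the first key iterated is the earliest deadline.
    let mut gc_todo: BTreeMap<Vec<u8>, ()> = BTreeMap::new();
    gc_todo.insert(timestamped_key(2000, b"blob-b"), ());
    gc_todo.insert(timestamped_key(1000, b"blob-a"), ());
    let first = gc_todo.keys().next().unwrap();
    assert_eq!(first[..8].to_vec(), 1000u64.to_be_bytes().to_vec());

    // Merkle-todo-style dedup: many updates to one table entry leave one item.
    let mut merkle_todo: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    merkle_todo.insert(b"partition/key".to_vec(), b"hash-v1".to_vec());
    merkle_todo.insert(b"partition/key".to_vec(), b"hash-v2".to_vec()); // overwrites
    assert_eq!(merkle_todo.len(), 1);
}
```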
To overcome the last two points, I think we need to consider this work not just as "replacing the db engine by something specialized for todo queues", but as "rethinking the whole logic behind the queuing mechanisms in Garage". I'm personally not sure it's worth it, although it could arguably lead to a simplification of the mechanisms in use and make the code base more understandable.
Before we do this, I think we want to get one of the DB engines based on LSM-trees (fjall, rocksdb, or something else) to work correctly to see if it improves performance on high write loads. I saw your comment on #851 but I was not able to view your implementation of fjall and rocksdb because your repository is private. Would you be willing to make PRs to include fjall and rocksdb support in Garage? I think even if the implementation is not tuned and performs very poorly, we would be interested in merging those in Garage for future work.
Another point I forgot: