WIP: replace some todo trees with disk-backed queues #903

Draft
withings wants to merge 1 commit from withings/garage:feat/todo-queues into main
Contributor

During investigations for #851, it was suggested that using LMDB as a storage backend for the merkle_todo work might not be the best approach ([10387](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851#issuecomment-10387), [10770](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851#issuecomment-10770)). This could be because the read/write pattern for this workload disagrees with LMDB's strengths, possibly impacting other B+-tree use cases within Garage (causing inconvenient page swaps?).

As a workaround, it was suggested to use a disk-backed queue instead, which fits the merkle_todo (and gc_todo) use cases a lot better. [yaque](https://docs.rs/yaque) was suggested as an "adapter" for this (borrowing from the terminology used in `garage_db`).

This PR is a first attempt at implementing this. It brings in a new `garage_todo` crate, similar to `garage_db`, charged with managing todo queues. The `merkle_todo` and `gc_todo` trees have also been removed and replaced with queues under that new crate. Tree iterators have been rewritten as FIFO pops, and inserts as FIFO pushes.
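As a rough illustration, the queue interface could look like the sketch below. The names and signatures are hypothetical and do not necessarily match the actual code in this PR; it is only meant to show the push/pop shape that replaces tree inserts and iterators.

```rust
// Hypothetical sketch of a queue adapter trait, in the spirit of garage_db's
// adapters; names and signatures are illustrative, not the PR's actual API.
use std::io;

/// A persistent FIFO queue of opaque byte payloads, one per todo item.
pub trait TodoQueue: Send + Sync {
    /// Append an item at the back of the queue (replaces a tree insert).
    fn push(&self, item: &[u8]) -> io::Result<()>;

    /// Remove and return the item at the front of the queue, if any
    /// (replaces iterating the tree from its first key).
    fn pop(&self) -> io::Result<Option<Vec<u8>>>;
}
```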

Although the initial results were promising, I remain somewhat unsure about this, in particular the changes brought to the GC. There is also a small admin drawback: it is no longer possible to probe queue lengths (not available in yaque).

Whenever you have time, I would love to get your feedback on this. Even if it doesn't get merged, it might be a good opportunity to discuss the general idea of moving those trees to more queue-like structures.

Thanks!

withings added 1 commit 2024-11-18 15:22:52 +00:00
Implement a queue crate to replace the todo trees
Some checks failed
ci/woodpecker/pr/debug Pipeline failed
ab3a409d04
withings force-pushed feat/todo-queues from ab3a409d04 to 3b3086d513 2024-11-18 15:36:42 +00:00
Owner

This is interesting work for sure, but I think there are many things to be careful of:

  • First, I think we would want to have an adapter that keeps the current behavior of using the DB engine for queues.

  • We need to make sure that we are not breaking any behaviour that depends on queues being updated atomically with other information in the metadata DB (this is done in a transaction). For example, the Merkle todo queue is updated atomically in `TableData::update_entry_with` and `TableData::delete_if_equal`/`TableData::delete_if_equal_hash`, such that the following invariant holds: the entry in the Merkle todo tree, if it exists, always has a value that is the hash of the current version in the table data. I think most inserts into and removals from the todo queues happen atomically with other operations in the same transaction, and that atomicity must be respected.

  • Most queues are not just FIFO queues: they use the ordered key/value structure of DB trees for some purpose. The Merkle todo queue uses the key to make sure there is only one queue entry per table entry, which reduces the load on the Merkle updater worker when the same table entry is updated many times in a short time span (this can be the case for the counter tables, for instance). Arguably this is a small optimization. More importantly, the table GC and resync queues use the key to store the timestamp at which the entry should be processed, and to keep all queue entries sorted by timestamp (see the sketch after this list).
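To make that last point concrete, here is a small self-contained illustration, with an in-memory `BTreeMap` standing in for a DB tree and a simplified key layout that is not Garage's exact encoding:

```rust
use std::collections::BTreeMap;

fn main() {
    // Stand-in for merkle_todo: keyed by the table entry's key, so repeated
    // updates to the same entry collapse into a single pending item.
    let mut merkle_todo: BTreeMap<Vec<u8>, [u8; 32]> = BTreeMap::new();
    merkle_todo.insert(b"entry-1".to_vec(), [0u8; 32]);
    merkle_todo.insert(b"entry-1".to_vec(), [1u8; 32]); // overwrites, no duplicate work
    assert_eq!(merkle_todo.len(), 1);

    // Stand-in for gc_todo / resync: the key starts with a big-endian timestamp,
    // so iterating the tree yields items in the order they should be processed.
    let mut gc_todo: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    let mut key = 1_700_000_500_u64.to_be_bytes().to_vec();
    key.extend_from_slice(b"entry-2");
    gc_todo.insert(key, b"gc info".to_vec());
    let mut key = 1_700_000_000_u64.to_be_bytes().to_vec();
    key.extend_from_slice(b"entry-1");
    gc_todo.insert(key, b"gc info".to_vec());

    // The first item is the one with the earliest timestamp, regardless of insert order.
    let (first_key, _) = gc_todo.iter().next().unwrap();
    assert!(first_key.ends_with(b"entry-1"));
}
```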

To overcome the last two points, I think we need to consider this work not just as "replacing the db engine by something specialized for todo queues", but as "rethinking the whole logic behind the queuing mechanisms in Garage". I'm personally not sure it's worth it, although it could arguably lead to a simplification of the mechanisms in use and make the code base more understandable.

Before we do this, I think we want to get one of the DB engines based on LSM-trees (fjall, rocksdb, or something else) to work correctly to see if it improves performance on high write loads. I saw your comment on #851 but I was not able to view your implementation of fjall and rocksdb because your repository is private. Would you be willing to make PRs to include fjall and rocksdb support in Garage? I think even if the implementation is not tuned and performs very poorly, we would be interested in merging those in Garage for future work.

Owner

Another point I forgot:

  • We need to think of the behaviour if Garage crashes after popping an element from a queue but before completing the corresponding action. With the current implementation, we take care to only peek from the queue to take jobs in the workers, and we delete from the queue only after the job has been processed successfully. When we have several parallel workers (such as resync workers), they will peek several items from the front of the queue, and they may not finish in order, so we need to be able to remove the second/third/... item from the queue. With the key/value-backed queue it's easy to find which item to remove since the key does not change, but with yaque, which is more like a VecDeque, this becomes trickier as the index may change (see the sketch below).
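As a rough sketch of the difference, again with an in-memory `BTreeMap` standing in for the key/value-backed queue (not Garage's actual code):

```rust
use std::collections::BTreeMap;

fn main() {
    // Key/value-backed queue: items keep a stable key while being processed.
    let mut queue: BTreeMap<u64, &str> = BTreeMap::new();
    queue.insert(1, "job-a");
    queue.insert(2, "job-b");
    queue.insert(3, "job-c");

    // Two workers peek the first two items without removing them; if Garage
    // crashes here, both jobs are still in the queue and will be retried.
    let peeked: Vec<(u64, &str)> = queue.iter().take(2).map(|(k, v)| (*k, *v)).collect();

    // Suppose the second job finishes first: it can be removed by its own key,
    // independently of the item that is still ahead of it in the queue.
    let (done_key, _job) = peeked[1];
    queue.remove(&done_key);
    assert!(queue.contains_key(&1)); // job-a is still pending

    // With a pop-based queue (yaque-style), the item is taken off the front
    // before processing, so crash recovery and out-of-order completion need
    // a different mechanism (e.g. commit-on-success semantics).
}
```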
This pull request is marked as a work in progress.
This branch is out-of-date with the base branch