Improvements to garbage collector (fix #39) #135

Merged
lx merged 4 commits from cleanups into main 11 months ago
lx commented 11 months ago
Owner

Rationale: we want to ensure Garage's safety by making sure things don't get deleted from disk if they are still needed. Two aspects are involved in this.

1. Garbage collection of table entries (in meta/ directory)

STATUS: MOSTLY FIXED BY THIS PR

The Entry trait used for table entries (defined in tables/schema.rs) defines a function is_tombstone() that returns true if that entry represents an entry that is deleted in the table. CRDT semantics by default keep all tombstones, because they are necessary for reconciliation: if node A has a tombstone that supersedes a value x, and node B has value x, A has to keep the tombstone in memory so that the value x can be properly deleted at node B. Otherwise, due to the CRDT reconciliation rule, the value x from B would flow back to A and a deleted item would reappear in the system.

Here, we have some control on the nodes involved in storing Garage data. Therefore we have a garbage collector that is able to delete tombstones UNDER CERTAIN CONDITIONS. This garbage collector is implemented in table/gc.rs. To delete a tombstone, the following condition has to be met:

  • All nodes responsible for storing this entry are aware of the existence of the tombstone, i.e. they cannot hold another version of the entry that is superseeded by the tombstone. This ensures that deleting the tombstone is safe and that no deleted value will come back in the system.

Garage makes use of Sled's atomic operations (such as compare-and-swap and transactions) to ensure that only tombstones that have been correctly propagated to other nodes are ever deleted from the local entry tree.

This GC is safe in the following sense: no non-tombstone data is ever deleted from Garage tables.

However, there is an issue with the way this interacts with data rebalancing in the case when a partition is moving between nodes. If a node has some data of a partition for which it is not responsible, it has to offload it. However that offload process takes some time. In that interval, the GC does not check with that node if it has the tombstone before deleting the tombstone, so perhaps it doesn't have it and when the offload finally happens, old data comes back in the system.

This PR mostly fixes this by implementing a 24-hour delay before anything is garbage collected in a table. This works under the assumption that rebalances that follow data shuffling terminate in less than 24 hours.

2. Garbage collection of data blocks (in data/ directory)

STATUS: SOLVED BY THIS PR

Blocks in the data directory are reference-counted. In previous Garage iterations, blocks could get deleted from local disk as soon as their reference counter reached zero. We had a mechanism to not trigger this immediately at the rc-reaches-zero event, but the cleanup could be triggered by other means (for example by a block repair operation...). This PR implements a safety measure so that blocks never get deleted in a 10 minute interval following the time when the RC reaches zero. This is a measure to make impossible race conditions such as #39. We would have liked to use a larger delay (e.g. 24 hours), but in the case of a rebalance of data, this would have led to the disk utilization to explode during the rebalancing, only to shrink again after 24 hours. The 10-minute delay is a compromise that gives good security while not having this problem of disk space explosion on rebalance.

Rationale: we want to ensure Garage's safety by making sure things don't get deleted from disk if they are still needed. Two aspects are involved in this. ### 1. Garbage collection of table entries (in meta/ directory) STATUS: MOSTLY FIXED BY THIS PR The `Entry` trait used for table entries (defined in `tables/schema.rs`) defines a function `is_tombstone()` that returns `true` if that entry represents an entry that is deleted in the table. CRDT semantics by default keep all tombstones, because they are necessary for reconciliation: if node A has a tombstone that supersedes a value `x`, and node B has value `x`, A has to keep the tombstone in memory so that the value `x` can be properly deleted at node `B`. Otherwise, due to the CRDT reconciliation rule, the value `x` from B would flow back to A and a deleted item would reappear in the system. Here, we have some control on the nodes involved in storing Garage data. Therefore we have a garbage collector that is able to delete tombstones UNDER CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To delete a tombstone, the following condition has to be met: - All nodes responsible for storing this entry are aware of the existence of the tombstone, i.e. they cannot hold another version of the entry that is superseeded by the tombstone. This ensures that deleting the tombstone is safe and that no deleted value will come back in the system. Garage makes use of Sled's atomic operations (such as compare-and-swap and transactions) to ensure that only tombstones that have been correctly propagated to other nodes are ever deleted from the local entry tree. This GC is safe in the following sense: no non-tombstone data is ever deleted from Garage tables. **However**, there is an issue with the way this interacts with data rebalancing in the case when a partition is moving between nodes. If a node has some data of a partition for which it is not responsible, it has to offload it. However that offload process takes some time. In that interval, the GC does not check with that node if it has the tombstone before deleting the tombstone, so perhaps it doesn't have it and when the offload finally happens, old data comes back in the system. **This PR mostly fixes this** by implementing a 24-hour delay before anything is garbage collected in a table. This works under the assumption that rebalances that follow data shuffling terminate in less than 24 hours. ### 2. Garbage collection of data blocks (in data/ directory) STATUS: SOLVED BY THIS PR Blocks in the data directory are reference-counted. In previous Garage iterations, blocks could get deleted from local disk as soon as their reference counter reached zero. We had a mechanism to not trigger this immediately at the rc-reaches-zero event, but the cleanup could be triggered by other means (for example by a block repair operation...). This PR implements a safety measure so that blocks never get deleted in a 10 minute interval following the time when the RC reaches zero. This is a measure to make impossible race conditions such as #39. We would have liked to use a larger delay (e.g. 24 hours), but in the case of a rebalance of data, this would have led to the disk utilization to explode during the rebalancing, only to shrink again after 24 hours. The 10-minute delay is a compromise that gives good security while not having this problem of disk space explosion on rebalance.
lx force-pushed cleanups from 2eac1d2109 to 8282f71d91 11 months ago
lx force-pushed cleanups from 8282f71d91 to 7fbe59e3a8 11 months ago
lx changed title from WIP: Refactor and comment table GC logic to WIP: Improvements to garbage collector 11 months ago
lx force-pushed cleanups from c3300f4513 to 14fe22fc02 11 months ago
lx force-pushed cleanups from 705632dbec to 4527a5ca4b 11 months ago
Owner

I think we could add your explanations to the design section of Garage, maybe in a GC entry? Or at least in the working document section? My point is that it is interesting as it answers some questions on how data is handled :)

also for #39, and more generally, can we not avoid this situation by incrementing the counter before decrementing it? After all both the create and destroy are handled on the same nodes so the order is deterministic?

I can review your code also if you want :)

I think we could add your explanations to the design section of Garage, maybe in a GC entry? Or at least in the working document section? My point is that it is interesting as it answers some questions on how data is handled :) also for #39, and more generally, can we not avoid this situation by incrementing the counter before decrementing it? After all both the `create` and `destroy` are handled on the same nodes so the order is deterministic? I can review your code also if you want :)
lx force-pushed cleanups from 4527a5ca4b to 6af0fb34f7 11 months ago
lx changed title from WIP: Improvements to garbage collector to WIP: Improvements to garbage collector (fix #39) 11 months ago
lx force-pushed cleanups from 302d0a69bd to 5eda70d2f2 11 months ago
lx force-pushed cleanups from 5eda70d2f2 to 9f45650fda 11 months ago
lx force-pushed cleanups from fe1eb032d0 to bd7a8da792 11 months ago
lx force-pushed cleanups from bd7a8da792 to 6d3c4862dc 11 months ago
lx force-pushed cleanups from 6d3c4862dc to 5b365fe273 11 months ago
lx force-pushed cleanups from 5b365fe273 to 790c349f00 11 months ago
lx force-pushed cleanups from 790c349f00 to 68d007fcab 11 months ago
lx force-pushed cleanups from 68d007fcab to 1f8ddc088c 11 months ago
lx force-pushed cleanups from 1f8ddc088c to ebd67c6ab0 11 months ago
lx force-pushed cleanups from ebd67c6ab0 to 0b339919aa 11 months ago
lx force-pushed cleanups from 0b339919aa to bee91d0a25 11 months ago
lx requested review from quentin 11 months ago
lx changed title from WIP: Improvements to garbage collector (fix #39) to Improvements to garbage collector (fix #39) 11 months ago
Poster
Owner

also for #39, and more generally, can we not avoid this situation by incrementing the counter before decrementing it? After all both the create and destroy are handled on the same nodes so the order is deterministic?

We actually did this to fix #39, but 1/ we don't know for sure if the new code is perfectly safe (I don't have a full theoretical analysis), and 2/ we might introduce this kind of mistake again in the future when coding new API endpoints. Adding a delay before block deletion is an additionnal security to reduce the consequences of this kind of bug.

> also for #39, and more generally, can we not avoid this situation by incrementing the counter before decrementing it? After all both the `create` and `destroy` are handled on the same nodes so the order is deterministic? We actually did this to fix #39, but 1/ we don't know for sure if the new code is perfectly safe (I don't have a full theoretical analysis), and 2/ we might introduce this kind of mistake again in the future when coding new API endpoints. Adding a delay before block deletion is an additionnal security to reduce the consequences of this kind of bug.
lx force-pushed cleanups from bee91d0a25 to ad7ab31411 11 months ago
lx added this to the v0.5.0 milestone 11 months ago
quentin approved these changes 11 months ago
quentin left a comment

The only concern I have is linked with backward compatibility/serialization where we could loose some guarantee provided by Rust. I don't see this issue as blocking, I only want your opinion on it.

I had no idea on how to test your code so I only proof-read the code without executing it.
It might still be possible to write some tests by simply instantiating manually some structures and passing them to your functions, it could be a first step before integrating jepsen.

Thanks for all the comments in the source code, they are very precious to understand your intents.
You changes seem to be rather minimal which is a good thing.

- [Design](./design/index.md)
- [Related Work](./design/related_work.md)
- [Internals](./design/internals.md)
- [Design draft](./design/design_draft.md)
Poster
Owner

It seems you renamed Internals as Design Draft but we still have a file named internal.md. Do you confirm that you want to keep it while hiding it from the documentation?

It seems you renamed Internals as Design Draft but we still have a file named `internal.md`. Do you confirm that you want to keep it while hiding it from the documentation?
lx commented 11 months ago
Poster
Owner

No actually I wanted to keep both in the menu ! Will fix.

No actually I wanted to keep both in the menu ! Will fix.
lx marked this conversation as resolved
// so in all cases we add the block here to the todo list
// to check later that it arrived correctly, and if not
// we will fecth it from someone.
self.put_to_resync(hash, 2 * BLOCK_RW_TIMEOUT)?;
Poster
Owner

why are we using 2 times the timeout?

why are we using 2 times the timeout?
lx commented 11 months ago
Poster
Owner

Because 1/ it's not a big issue if we do it too late, 2/ it makes sure that if there was a block put operation, it had enough time to complete/fail even if it took slightly longer than the timeout delay for some reason (clocks are imprecise)

Because 1/ it's not a big issue if we do it too late, 2/ it makes sure that if there was a block put operation, it had enough time to complete/fail even if it took slightly longer than the timeout delay for some reason (clocks are imprecise)
.update_and_fetch(&hash, |old| RcEntry::parse_opt(old).decrement().serialize())?;
let new_rc = RcEntry::parse_opt(new_rc);
if let RcEntry::Deletable { .. } = new_rc {
self.put_to_resync(hash, BLOCK_GC_DELAY + Duration::from_secs(10))?;
Poster
Owner

Why are we adding an hard coded 10 seconds?

Why are we adding an hard coded 10 seconds?
lx commented 11 months ago
Poster
Owner

Same reason, because clocks are imprecise and it's not a big issue if we do it too late

Same reason, because clocks are imprecise and it's not a big issue if we do it too late
Err(e) => {
// Not found but maybe we should have had it ??
self.put_to_resync(hash, Duration::from_millis(0))?;
self.put_to_resync(hash, 2 * BLOCK_RW_TIMEOUT)?;
Poster
Owner

2 times here also.

2 times here also.
u64::from_be_bytes(x8)
/// Describes the state of the reference counter for a block
#[derive(Clone, Copy, Debug)]
enum RcEntry {
Poster
Owner

Just me being nitpicking but we created this enum because we want a bit more than a Reference Counter, something like a Reference Counter with Delay. We could rename it to RcDelayEntry or RcDelay. But feel free to keep this name too if you are more comfortable with the current name :)

Just me being nitpicking but we created this enum because we want a bit more than a Reference Counter, something like a Reference Counter with Delay. We could rename it to `RcDelayEntry` or `RcDelay`. But feel free to keep this name too if you are more comfortable with the current name :)
/// This is stored as [0u8; 8] followed by u64::to_be_bytes(at_time),
/// (this allows for the data format to be backwards compatible with
/// previous Garage versions that didn't have this intermediate state)
Deletable { at_time: u64 },
Poster
Owner

What is our policy for backward compatibility: is there a risk that we make our code worse over time to ease some backward compatibility?
In this specific case, it does not seem very critical to me but I am curious about our code strategy / technical debt strategy :)

What is our policy for backward compatibility: is there a risk that we make our code worse over time to ease some backward compatibility? In this specific case, it does not seem very critical to me but I am curious about our code strategy / technical debt strategy :)
lx commented 11 months ago
Poster
Owner

I don't see this format as technical debt. It's a good format with a simple custom serialization that happens to be compatible with the old format.

It's nice here that we don't have to break compatibility, but if we had to do it we would have done it. (we did break compatibility for the table GC)

I think until release 1.0 we don't need to think too much about backward compatiblity. v0.4 removes at least one code path for data format migration, and we generally don't want to have those for now.

I don't see this format as technical debt. It's a good format with a simple custom serialization that happens to be compatible with the old format. It's nice here that we don't have to break compatibility, but if we had to do it we would have done it. (we did break compatibility for the table GC) I think until release 1.0 we don't need to think too much about backward compatiblity. v0.4 removes at least one code path for data format migration, and we generally don't want to have those for now.
let gc_todo = db
.open_tree(&format!("{}:gc_todo", name))
.open_tree(&format!("{}:gc_todo_v2", name))
Poster
Owner

Here, do we create a new db/table/key format because the previous one had a different format that is now incompatible? Does it mean that we will have a "dead" gc_todo table on already deployed nodes?
I do not see that as a blocking problem either but it is good to know.

Here, do we create a new db/table/key format because the previous one had a different format that is now incompatible? Does it mean that we will have a "dead" gc_todo table on already deployed nodes? I do not see that as a blocking problem either but it is good to know.
lx commented 11 months ago
Poster
Owner

Yes it's a dead table. If the migration is executed properly, i.e. Garage is read-only for enough time for the GC in the old version to process everything, gc_todo should be empty.

Yes it's a dead table. If the migration is executed properly, i.e. Garage is read-only for enough time for the GC in the old version to process everything, `gc_todo` should be empty.
let vhash = Hash::try_from(&vhash[..]).unwrap();
let v_opt = self
// Check if the tombstone is still the current value of the entry.
Poster
Owner

Can you confirm that one reason to invalid a tombstone is a case where I upload a file, detele it, and then reupload it?

Can you confirm that one reason to invalid a tombstone is a case where I upload a file, detele it, and then reupload it?
lx commented 11 months ago
Poster
Owner

Yes! For the object tables, an entry can take the following sequence of values:

  1. an old version (not a tombstone)
  2. a deletion marker (a tombstone which we want to GC if possible)
  3. a new version (not a tombstone)
Yes! For the object tables, an entry can take the following sequence of values: 1. an old version (not a tombstone) 2. a deletion marker (a tombstone which we want to GC if possible) 3. a new version (not a tombstone)
) -> Result<(), Error> {
let n_items = items.len();
// Strategy: we first send all of the values to the remote nodes,
Poster
Owner

Does it mean that deleting a tombstone is a particularly expensive operation? Could we create a deny of service simply by deleting lot of data in a row? or deleting lot of data in a row while a node is down?

But it might be totally ok as it seems we batch our requests, can you confirm?

Edit: I just understood that nodes does not map to all nodes of the cluster but to all nodes of the same "partition"

Does it mean that deleting a tombstone is a particularly expensive operation? Could we create a deny of service simply by deleting lot of data in a row? or deleting lot of data in a row while a node is down? But it might be totally ok as it seems we batch our requests, can you confirm? Edit: I just understood that `nodes` does not map to all nodes of the cluster but to all nodes of the same "partition"
lx commented 11 months ago
Poster
Owner

It's not that expensive (tombstones are small and can be batched), and it's the cost of a correct algorithm :)

At least we are not doing a full Raft or Paxos

It's not that expensive (tombstones are small and can be batched), and it's the cost of a correct algorithm :) At least we are not doing a full Raft or Paxos
/// such entry in Sled
///
/// Format of an entry:
/// - key = 8 bytes: timestamp of tombstone
Poster
Owner

Here also we have some byte manipulations too, this is even the purpose of this object. Are we doing them because of a Sled limitation (it needs a primitive type as a key) or for backward compatibility reasons? (we were storing a byte array before so we keep it)? Are you not affraid that it could introduce bugs with time as we are loosing some checks done by the compiler?

Here also we have some byte manipulations too, this is even the purpose of this object. Are we doing them because of a Sled limitation (it needs a primitive type as a key) or for backward compatibility reasons? (we were storing a byte array before so we keep it)? Are you not affraid that it could introduce bugs with time as we are loosing some checks done by the compiler?
lx commented 11 months ago
Poster
Owner

We are limited by Sled that basically acts as a BTreeMap<Vec<u8>, Vec<u8>>: we have to manage stuff as byte arrays for both keys and values. It's true we have to be extra carefull with this: that's the reason why I extracted the serialization and parsing logic in a separate struct. This would typically be a good place for unit testing if complexity gets out of hand (for now I don't think it has).

We are limited by Sled that basically acts as a `BTreeMap<Vec<u8>, Vec<u8>>`: we have to manage stuff as byte arrays for both keys and values. It's true we have to be extra carefull with this: that's the reason why I extracted the serialization and parsing logic in a separate struct. This would typically be a good place for unit testing if complexity gets out of hand (for now I don't think it has).
lx force-pushed cleanups from 144829a1c8 to 08b1e8a7ea 11 months ago
lx merged commit 08b1e8a7ea into main 11 months ago

Reviewers

quentin approved these changes 11 months ago
continuous-integration/drone/pr Build is passing
continuous-integration/drone/push Build is passing
continuous-integration/drone/tag Build is passing
continuous-integration/drone Build is passing
The pull request has been merged as 08b1e8a7ea.
Sign in to join this conversation.
Loading…
There is no content yet.