Improvements to garbage collector (fix #39) #135
Rationale: we want to ensure Garage's safety by making sure things don't get deleted from disk if they are still needed. Two aspects are involved in this.
1. Garbage collection of table entries (in meta/ directory)
STATUS: MOSTLY FIXED BY THIS PR
The `Entry` trait used for table entries (defined in `tables/schema.rs`) defines a function `is_tombstone()` that returns `true` if the entry represents an item that has been deleted from the table. CRDT semantics by default keep all tombstones, because they are necessary for reconciliation: if node A has a tombstone that supersedes a value `x`, and node B has value `x`, A has to keep the tombstone in memory so that the value `x` can be properly deleted at node B. Otherwise, due to the CRDT reconciliation rule, the value `x` from B would flow back to A and a deleted item would reappear in the system.

Here, we have some control over the nodes involved in storing Garage data. Therefore we have a garbage collector that is able to delete tombstones UNDER CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To delete a tombstone, it must first have been correctly propagated to the other nodes responsible for the entry. Garage makes use of Sled's atomic operations (such as compare-and-swap and transactions) to ensure that only tombstones that have been correctly propagated to other nodes are ever deleted from the local entry tree.
This GC is safe in the following sense: no non-tombstone data is ever deleted from Garage tables.
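To make the reasoning above concrete, here is a minimal sketch (not Garage's actual code; the trait is simplified and the counting of acknowledgements is invented for the example) of an entry type that knows whether it is a tombstone, and of the condition under which a tombstone may be collected:

```rust
/// Hypothetical, simplified stand-in for Garage's `Entry` trait.
trait Entry {
    /// Returns true if this entry is a deletion marker (tombstone).
    fn is_tombstone(&self) -> bool;
}

/// Toy table entry: either a live value or a tombstone that supersedes it.
#[derive(Clone, Debug)]
enum ObjectVersion {
    Present { data: Vec<u8> },
    Deleted,
}

impl Entry for ObjectVersion {
    fn is_tombstone(&self) -> bool {
        matches!(self, ObjectVersion::Deleted)
    }
}

/// A tombstone may only be garbage collected once every node responsible for
/// the entry is known to hold it; otherwise the old value could flow back
/// during reconciliation and the deleted item would reappear.
fn can_collect(entry: &impl Entry, nodes_holding_tombstone: usize, responsible_nodes: usize) -> bool {
    entry.is_tombstone() && nodes_holding_tombstone == responsible_nodes
}

fn main() {
    let tomb = ObjectVersion::Deleted;
    assert!(!can_collect(&tomb, 2, 3)); // one replica has not received it yet
    assert!(can_collect(&tomb, 3, 3)); // all replicas hold the tombstone
    let live = ObjectVersion::Present { data: b"hello".to_vec() };
    assert!(!can_collect(&live, 3, 3)); // non-tombstone data is never collected
}
```

The real `table/gc.rs` does this bookkeeping with the Sled atomic operations mentioned above rather than a counter passed around like this.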
However, there is an issue with the way this interacts with data rebalancing, when a partition is moving between nodes. If a node has data of a partition for which it is not responsible, it has to offload it, and that offload process takes some time. In that interval, the GC does not check with that node whether it has the tombstone before deleting it; if it doesn't have it yet, old data can come back into the system when the offload finally happens.

This PR mostly fixes this by implementing a 24-hour delay before anything is garbage collected in a table. This works under the assumption that rebalances that follow data shuffling terminate in less than 24 hours.
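As a rough illustration of the 24-hour rule (the constant name and the function are made up for this example; the real logic lives in `table/gc.rs`), the added check amounts to refusing to collect anything younger than the delay:

```rust
use std::time::{Duration, SystemTime};

// Illustrative constant mirroring the 24-hour grace period described above.
const TABLE_GC_DELAY: Duration = Duration::from_secs(24 * 3600);

/// Only consider an entry for garbage collection once it has been waiting
/// for at least TABLE_GC_DELAY, so that any in-flight partition offload has
/// had time to complete.
fn ready_for_gc(queued_at: SystemTime, now: SystemTime) -> bool {
    now.duration_since(queued_at)
        .map(|age| age >= TABLE_GC_DELAY)
        .unwrap_or(false)
}

fn main() {
    let now = SystemTime::now();
    assert!(!ready_for_gc(now - Duration::from_secs(3600), now)); // 1 hour: too soon
    assert!(ready_for_gc(now - Duration::from_secs(2 * 24 * 3600), now)); // 2 days: ok
}
```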
2. Garbage collection of data blocks (in data/ directory)
STATUS: SOLVED BY THIS PR
Blocks in the data directory are reference-counted. In previous Garage iterations, blocks could get deleted from local disk as soon as their reference counter reached zero. We had a mechanism to not trigger deletion immediately when the counter reached zero, but the cleanup could still be triggered by other means (for example by a block repair operation...). This PR implements a safety measure so that blocks never get deleted in a 10-minute interval following the moment the reference count reaches zero. This makes race conditions such as #39 impossible. We would have liked to use a larger delay (e.g. 24 hours), but in the case of a data rebalance, this would have caused disk utilization to explode during the rebalancing, only to shrink again after 24 hours. The 10-minute delay is a compromise that gives good protection while avoiding this problem of disk space explosion on rebalance.
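The following is a simplified model of that behaviour, not the `RcEntry` code from this PR: a reference counter that, instead of allowing deletion as soon as it reaches zero, records when that happened and only permits physical deletion once the delay has elapsed. The constant name `BLOCK_GC_DELAY` is borrowed from the diff excerpts below; everything else is illustrative.

```rust
use std::time::{Duration, SystemTime};

// 10-minute grace period between "refcount reached zero" and actual deletion.
const BLOCK_GC_DELAY: Duration = Duration::from_secs(10 * 60);

/// Simplified model of a block's reference-counter state.
enum BlockRc {
    /// The block is still referenced by at least one object version.
    Present { count: u64 },
    /// The count reached zero at `at_time`; the block's data may only be
    /// removed from disk once BLOCK_GC_DELAY has elapsed since then.
    Deletable { at_time: SystemTime },
}

impl BlockRc {
    /// Decrement the counter; when it hits zero, remember when that happened
    /// instead of making the block immediately deletable.
    fn decrement(self) -> BlockRc {
        match self {
            BlockRc::Present { count } if count > 1 => BlockRc::Present { count: count - 1 },
            _ => BlockRc::Deletable { at_time: SystemTime::now() },
        }
    }

    /// True only if the count is zero AND the grace period is over.
    fn can_delete_now(&self) -> bool {
        match self {
            BlockRc::Present { .. } => false,
            BlockRc::Deletable { at_time } => at_time
                .elapsed()
                .map(|age| age >= BLOCK_GC_DELAY)
                .unwrap_or(false),
        }
    }
}

fn main() {
    let rc = BlockRc::Present { count: 1 }.decrement();
    // Right after the counter hits zero, deletion is still forbidden:
    assert!(!rc.can_delete_now());
}
```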
Changed title from "WIP: Refactor and comment table GC logic" to "WIP: Improvements to garbage collector".
I think we could add your explanations to the design section of Garage, maybe in a GC entry? Or at least in the working document section? My point is that it is interesting as it answers some questions on how data is handled :)
Also for #39, and more generally: can we not avoid this situation by incrementing the counter before decrementing it? After all, both the `create` and `destroy` are handled on the same nodes, so the order is deterministic?

I can review your code also if you want :)
Changed title from "WIP: Improvements to garbage collector" to "WIP: Improvements to garbage collector (fix #39)".
Changed title from "WIP: Improvements to garbage collector (fix #39)" to "Improvements to garbage collector (fix #39)".

We actually did this to fix #39, but 1/ we don't know for sure if the new code is perfectly safe (I don't have a full theoretical analysis), and 2/ we might introduce this kind of mistake again in the future when coding new API endpoints. Adding a delay before block deletion is an additional security measure to reduce the consequences of this kind of bug.
The only concern I have is linked to backward compatibility/serialization, where we could lose some guarantees provided by Rust. I don't see this issue as blocking, I only want your opinion on it.
I had no idea how to test your code, so I only proof-read it without executing it.
It might still be possible to write some tests by simply instantiating some structures manually and passing them to your functions; it could be a first step before integrating Jepsen.
Thanks for all the comments in the source code, they are very precious for understanding your intent.
Your changes seem rather minimal, which is a good thing.
@ -29,3 +29,3 @@
- [Design](./design/index.md)
- [Related Work](./design/related_work.md)
- [Internals](./design/internals.md)
- [Design draft](./design/design_draft.md)
It seems you renamed Internals as Design Draft but we still have a file named `internal.md`. Do you confirm that you want to keep it while hiding it from the documentation?

No actually I wanted to keep both in the menu! Will fix.
@ -255,0 +267,4 @@
// so in all cases we add the block here to the todo list
// to check later that it arrived correctly, and if not
// we will fetch it from someone.
self.put_to_resync(hash, 2 * BLOCK_RW_TIMEOUT)?;
why are we using 2 times the timeout?
Because 1/ it's not a big issue if we do it too late, 2/ it makes sure that if there was a block put operation, it had enough time to complete/fail even if it took slightly longer than the timeout delay for some reason (clocks are imprecise)
@ -271,0 +279,4 @@
.update_and_fetch(&hash, |old| RcEntry::parse_opt(old).decrement().serialize())?;
let new_rc = RcEntry::parse_opt(new_rc);
if let RcEntry::Deletable { .. } = new_rc {
self.put_to_resync(hash, BLOCK_GC_DELAY + Duration::from_secs(10))?;
Why are we adding a hard-coded 10 seconds?
Same reason, because clocks are imprecise and it's not a big issue if we do it too late
@ -301,3 +323,3 @@
Err(e) => {
// Not found but maybe we should have had it ??
self.put_to_resync(hash, Duration::from_millis(0))?;
self.put_to_resync(hash, 2 * BLOCK_RW_TIMEOUT)?;
2 times here also.
@ -681,1 +709,3 @@
u64::from_be_bytes(x8)
/// Describes the state of the reference counter for a block
#[derive(Clone, Copy, Debug)]
enum RcEntry {
Just me nitpicking, but we created this enum because we want a bit more than a Reference Counter, something like a Reference Counter with Delay. We could rename it to `RcDelayEntry` or `RcDelay`. But feel free to keep this name too if you are more comfortable with the current name :)

@ -682,0 +721,4 @@
/// This is stored as [0u8; 8] followed by u64::to_be_bytes(at_time),
/// (this allows for the data format to be backwards compatible with
/// previous Garage versions that didn't have this intermediate state)
Deletable { at_time: u64 },
What is our policy for backward compatibility: is there a risk that we make our code worse over time for the sake of backward compatibility?
In this specific case, it does not seem very critical to me but I am curious about our code strategy / technical debt strategy :)
I don't see this format as technical debt. It's a good format with a simple custom serialization that happens to be compatible with the old format.
It's nice here that we don't have to break compatibility, but if we had to do it we would have done it. (we did break compatibility for the table GC)
I think until release 1.0 we don't need to think too much about backward compatibility. v0.4 removes at least one code path for data format migration, and we generally don't want to have those for now.
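For illustration, here is a hedged sketch of how such a two-state value can be serialized so that it stays readable next to the old bare-counter format; it follows the `[0u8; 8]` + `u64::to_be_bytes(at_time)` layout quoted above, but it is not a copy of Garage's implementation and the error handling is simplified:

```rust
use std::convert::TryInto;

/// Illustrative two-state reference counter (see the hedging note above).
#[derive(Clone, Copy, Debug, PartialEq)]
enum RcEntry {
    /// The block is referenced `count` times (old format: just the counter).
    Present { count: u64 },
    /// The counter reached zero at `at_time`.
    Deletable { at_time: u64 },
}

impl RcEntry {
    fn serialize(self) -> Vec<u8> {
        match self {
            // 8 bytes: the big-endian counter, exactly like the old format.
            RcEntry::Present { count } => count.to_be_bytes().to_vec(),
            // 16 bytes: 8 zero bytes (which a reader of the old format would
            // presumably see as a zero counter) followed by the timestamp.
            RcEntry::Deletable { at_time } => {
                let mut bytes = vec![0u8; 8];
                bytes.extend_from_slice(&at_time.to_be_bytes());
                bytes
            }
        }
    }

    fn parse(bytes: &[u8]) -> Option<RcEntry> {
        match bytes.len() {
            8 => Some(RcEntry::Present {
                count: u64::from_be_bytes(bytes.try_into().ok()?),
            }),
            16 if bytes[..8].iter().all(|&b| b == 0) => Some(RcEntry::Deletable {
                at_time: u64::from_be_bytes(bytes[8..].try_into().ok()?),
            }),
            _ => None,
        }
    }
}

fn main() {
    let e = RcEntry::Deletable { at_time: 1_650_000_000_000 };
    assert_eq!(RcEntry::parse(&e.serialize()), Some(e));
}
```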
@ -55,3 +56,3 @@
let gc_todo = db
.open_tree(&format!("{}:gc_todo", name))
.open_tree(&format!("{}:gc_todo_v2", name))
Here, do we create a new db/table/key format because the previous one had a different format that is now incompatible? Does it mean that we will have a "dead" gc_todo table on already deployed nodes?
I do not see that as a blocking problem either but it is good to know.
Yes it's a dead table. If the migration is executed properly, i.e. Garage is read-only for enough time for the GC in the old version to process everything, `gc_todo` should be empty.

@ -100,3 +125,3 @@
let vhash = Hash::try_from(&vhash[..]).unwrap();
let v_opt = self
// Check if the tombstone is still the current value of the entry.
Can you confirm that one reason to invalidate a tombstone is the case where I upload a file, delete it, and then reupload it?
Yes! For the object tables, an entry can take the following sequence of values:
@ -172,3 +222,4 @@
) -> Result<(), Error> {
let n_items = items.len();
// Strategy: we first send all of the values to the remote nodes,
Does it mean that deleting a tombstone is a particularly expensive operation? Could we create a denial of service simply by deleting a lot of data in a row? Or by deleting a lot of data in a row while a node is down?
But it might be totally OK as it seems we batch our requests, can you confirm?
Edit: I just understood that `nodes` does not map to all nodes of the cluster but to all nodes of the same "partition".

It's not that expensive (tombstones are small and can be batched), and it's the cost of a correct algorithm :)
At least we are not doing a full Raft or Paxos
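To make the strategy being discussed here a bit more tangible, here is a hedged sketch of the two-phase idea (the `GcRpc` trait, its method names and the batching are invented for the example; the actual `table/gc.rs` differs in detail): tombstones are first pushed to every node responsible for the partition, and only then is everyone asked to delete them, conditional on the stored value still being that exact tombstone.

```rust
/// One tombstone to garbage-collect: its key and its current (tombstone) value.
#[derive(Clone)]
struct TombstoneItem {
    key: Vec<u8>,
    value: Vec<u8>,
}

/// Stand-in for the GC RPC endpoints (hypothetical names).
trait GcRpc {
    /// Push a batch of tombstones to a remote node (idempotent).
    fn send_items(&self, node: &str, items: &[TombstoneItem]) -> Result<(), String>;
    /// Ask a remote node to delete entries, but only where its current value
    /// is still exactly the tombstone we sent (a compare-and-delete).
    fn delete_if_equal(&self, node: &str, items: &[TombstoneItem]) -> Result<(), String>;
}

/// Garbage-collect a batch of tombstones for one partition.
fn gc_batch(rpc: &impl GcRpc, nodes: &[String], items: &[TombstoneItem]) -> Result<(), String> {
    // Phase 1: every node responsible for the partition must hold the tombstones.
    for node in nodes {
        rpc.send_items(node, items)?;
    }
    // Phase 2: only once phase 1 fully succeeded can everyone drop them, so no
    // node is left holding an older value that could flow back afterwards.
    for node in nodes {
        rpc.delete_if_equal(node, items)?;
    }
    Ok(())
}

/// Mock transport that always succeeds, just to show the call order.
struct MockRpc;

impl GcRpc for MockRpc {
    fn send_items(&self, node: &str, items: &[TombstoneItem]) -> Result<(), String> {
        println!("send {} item(s) to {}", items.len(), node);
        Ok(())
    }
    fn delete_if_equal(&self, node: &str, items: &[TombstoneItem]) -> Result<(), String> {
        println!("delete-if-equal {} item(s) on {}", items.len(), node);
        Ok(())
    }
}

fn main() {
    let nodes = vec!["node-a".to_string(), "node-b".to_string()];
    let items = vec![TombstoneItem {
        key: b"object1".to_vec(),
        value: b"tombstone".to_vec(),
    }];
    gc_batch(&MockRpc, &nodes, &items).unwrap();
}
```

The batching is also why the cost stays reasonable: a partition's tombstones travel together rather than one RPC per deleted entry.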
@ -251,0 +324,4 @@
/// such entry in Sled
///
/// Format of an entry:
/// - key = 8 bytes: timestamp of tombstone
Here also we have some byte manipulations, which is even the purpose of this object. Are we doing them because of a Sled limitation (it needs a primitive type as a key) or for backward compatibility reasons (we were storing a byte array before, so we keep it)? Aren't you afraid that this could introduce bugs over time as we are losing some checks done by the compiler?
We are limited by Sled, which basically acts as a `BTreeMap<Vec<u8>, Vec<u8>>`: we have to manage everything as byte arrays for both keys and values. It's true we have to be extra careful with this: that's the reason why I extracted the serialization and parsing logic into a separate struct. This would typically be a good place for unit testing if complexity gets out of hand (for now I don't think it has).
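Picking up on the unit-testing suggestion: against the illustrative `RcEntry` sketched earlier in this discussion (again, not Garage's actual type), a first round-trip test could look like this:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rc_entry_round_trip() {
        // Every state should survive a serialize/parse round trip unchanged.
        for entry in [
            RcEntry::Present { count: 3 },
            RcEntry::Deletable { at_time: 1_650_000_000_000 },
        ] {
            assert_eq!(RcEntry::parse(&entry.serialize()), Some(entry));
        }
    }

    #[test]
    fn deletable_starts_with_zero_counter() {
        // The first 8 bytes of a Deletable entry are zero, which a reader of
        // the old bare-counter format would presumably interpret as count = 0.
        let bytes = RcEntry::Deletable { at_time: 42 }.serialize();
        assert_eq!(&bytes[..8], &[0u8; 8]);
    }
}
```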