garage/doc/book/design/internals.md

145 lines
7.1 KiB
Markdown

+++
title = "Internals"
weight = 20
+++
## Overview
TODO: write this section
- The Dynamo ring (see [this paper](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) and [that paper](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud))
- CRDTs (see [this paper](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29))
- Consistency model of Garage tables
In the meantime, you can find some information at the following links:
- [this presentation (in French)](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2020-12-02_wide-team/talk.pdf)
- [an old design draft](@/documentation/working-documents/design-draft.md)
## Request routing logic
Data retrieval requests to Garage endpoints (S3 API and websites) are resolved
to an individual object in a bucket. Since objects are replicated to multiple nodes
Garage must ensure consistency before answering the request.
### Using quorum to ensure consistency
Garage ensures consistency by attempting to establish a quorum with the
data nodes responsible for the object. When a majority of the data nodes
have provided metadata on a object Garage can then answer the request.
When a request arrives Garage will, assuming the recommended 3 replicas, perform the following actions:
- Make a request to the two preferred nodes for object metadata
- Try the third node if one of the two initial requests fail
- Check that the metadata from at least 2 nodes match
- Check that the object hasn't been marked deleted
- Answer the request with inline data from metadata if object is small enough
- Or get data blocks from the preferred nodes and answer using the assembled object
Garage dynamically determines which nodes to query based on health, preference, and
which nodes actually host a given data. Garage has no concept of "primary" so any
healthy node with the data can be used as long as a quorum is reached for the metadata.
### Node health
Garage keeps a TCP session open to each node in the cluster and periodically pings them. If a connection
cannot be established, or a node fails to answer a number of pings, the target node is marked as failed.
Failed nodes are not used for quorum or other internal requests.
### Node preference
Garage prioritizes which nodes to query according to a few criteria:
- A node always prefers itself if it can answer the request
- Then the node prioritizes nodes in the same zone
- Finally the nodes with the lowest latency are prioritized
For further reading on the cluster structure look at the [gateway](@/documentation/cookbook/gateways.md)
and [cluster layout management](@/documentation/operations/layout.md) pages.
## Garbage collection
A faulty garbage collection procedure has been the cause of
[critical bug #39](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/39).
This precise bug was fixed in the code, however there are potentially more
general issues with the garbage collector being too eager and deleting things
too early. This has been the subject of
[PR #135](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/135).
This section summarizes the discussions on this topic.
Rationale: we want to ensure Garage's safety by making sure things don't get
deleted from disk if they are still needed. Two aspects are involved in this.
### 1. Garbage collection of table entries (in `meta/` directory)
The `Entry` trait used for table entries (defined in `tables/schema.rs`)
defines a function `is_tombstone()` that returns `true` if that entry
represents an entry that is deleted in the table. CRDT semantics by default
keep all tombstones, because they are necessary for reconciliation: if node A
has a tombstone that supersedes a value `x`, and node B has value `x`, A has to
keep the tombstone in memory so that the value `x` can be properly deleted at
node `B`. Otherwise, due to the CRDT reconciliation rule, the value `x` from B
would flow back to A and a deleted item would reappear in the system.
Here, we have some control on the nodes involved in storing Garage data.
Therefore we have a garbage collector that is able to delete tombstones UNDER
CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To
delete a tombstone, the following condition has to be met:
- All nodes responsible for storing this entry are aware of the existence of
the tombstone, i.e. they cannot hold another version of the entry that is
superseeded by the tombstone. This ensures that deleting the tombstone is
safe and that no deleted value will come back in the system.
Garage makes use of Sled's atomic operations (such as compare-and-swap and
transactions) to ensure that only tombstones that have been correctly
propagated to other nodes are ever deleted from the local entry tree.
This GC is safe in the following sense: no non-tombstone data is ever deleted
from Garage tables.
**However**, there is an issue with the way this interacts with data
rebalancing in the case when a partition is moving between nodes. If a node has
some data of a partition for which it is not responsible, it has to offload it.
However that offload process takes some time. In that interval, the GC does not
check with that node if it has the tombstone before deleting the tombstone, so
perhaps it doesn't have it and when the offload finally happens, old data comes
back in the system.
**PR 135 mostly fixes this** by implementing a 24-hour delay before anything is
garbage collected in a table. This works under the assumption that rebalances
that follow data shuffling terminate in less than 24 hours.
**However**, in distributed systems, it is generally considered a bad practice
to make assumptions that information propagates in a certain time interval:
this consists in making a synchrony assumption, meaning that we are basically
assuming a computing model that has much stronger properties than otherwise. To
maximize the applicability of Garage, we would like to remove this assumption,
and implement a system where time does not play a role. To do this, we would
need to find a way to safely disable the GC when data is being shuffled around,
and safely detect that the shuffling has terminated and thus the GC can be
resumed. This introduces some complexity to the protocol and hasn't been
tackled yet.
### 2. Garbage collection of data blocks (in `data/` directory)
Blocks in the data directory are reference-counted. In Garage versions before
PR #135, blocks could get deleted from local disk as soon as their reference
counter reached zero. We had a mechanism to not trigger this immediately at the
rc-reaches-zero event, but the cleanup could be triggered by other means (for
example by a block repair operation...). PR #135 added a safety measure so that
blocks never get deleted in a 10 minute interval following the time when the RC
reaches zero. This is a measure to make impossible race conditions such as #39.
We would have liked to use a larger delay (e.g. 24 hours), but in the case of a
rebalance of data, this would have led to the disk utilization to explode
during the rebalancing, only to shrink again after 24 hours. The 10-minute
delay is a compromise that gives good security while not having this problem of
disk space explosion on rebalance.