+++
title = "Internals"
weight = 20
+++

## Overview

TODO: write this section

- The Dynamo ring (see [this paper](https://dl.acm.org/doi/abs/10.1145/1323293.1294281) and [that paper](https://www.usenix.org/conference/nsdi16/technical-sessions/presentation/eisenbud))

- CRDTs (see [this paper](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29))

- Consistency model of Garage tables

In the meantime, you can find some information at the following links:

- [this presentation (in French)](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2020-12-02_wide-team/talk.pdf)

- [an old design draft](@/documentation/working-documents/design-draft.md)


## Request routing logic

Data retrieval requests to Garage endpoints (S3 API and websites) are resolved
to an individual object in a bucket. Since objects are replicated to multiple nodes,
Garage must ensure consistency before answering the request.

### Using quorum to ensure consistency

Garage ensures consistency by attempting to establish a quorum with the
data nodes responsible for the object. Once a majority of the data nodes
have provided metadata on an object, Garage can answer the request.

When a request arrives, Garage performs the following actions (assuming the recommended 3 replicas):

- Make a request to the two preferred nodes for object metadata
- Try the third node if one of the two initial requests fails
- Check that the metadata from at least 2 nodes match
- Check that the object hasn't been marked deleted
- Answer the request with inline data from metadata if object is small enough
- Or get data blocks from the preferred nodes and answer using the assembled object

Garage dynamically determines which nodes to query based on health, preference, and
which nodes actually host the requested data. Garage has no concept of a "primary" node, so any
healthy node with the data can be used as long as a quorum is reached for the metadata.
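
The read steps above can be sketched as follows. This is a minimal illustration, not Garage's actual code: the `Meta` and `ReadOutcome` types and the `quorum_read` function are hypothetical, replies are passed in as plain values instead of being fetched over the network, and the sketch simply reports a missing quorum when the two collected answers disagree.

```rust
/// Simplified metadata answer from one replica (hypothetical type).
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Meta {
    /// A live object version, identified here by a plain number.
    Live(u64),
    /// The object has been marked as deleted.
    Deleted,
}

/// Result of the read as seen by the API layer (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ReadOutcome {
    Found(u64),
    NotFound,
    NoQuorum,
}

/// `replies` holds the answers of the 3 replicas in preference order.
/// `None` means that the node failed or did not answer.
pub fn quorum_read(replies: &[Option<Meta>; 3]) -> ReadOutcome {
    // Ask the two preferred nodes first.
    let mut answers: Vec<&Meta> = replies[..2].iter().flatten().collect();

    // Try the third node only if one of the first two requests failed.
    if answers.len() < 2 {
        if let Some(meta) = &replies[2] {
            answers.push(meta);
        }
    }

    // A quorum needs matching metadata from at least 2 of the 3 replicas.
    if answers.len() < 2 || answers[0] != answers[1] {
        return ReadOutcome::NoQuorum;
    }

    // A deleted object is reported as absent.
    match answers[0] {
        Meta::Live(version) => ReadOutcome::Found(*version),
        Meta::Deleted => ReadOutcome::NotFound,
    }
}
```

For example, `quorum_read(&[Some(Meta::Live(7)), None, Some(Meta::Live(7))])` falls back to the third node and returns `ReadOutcome::Found(7)`, while a single surviving replica yields `ReadOutcome::NoQuorum`.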

### Node health

Garage keeps a TCP session open to each node in the cluster and periodically pings them. If a connection
cannot be established, or a node fails to answer a number of pings, the target node is marked as failed.
Failed nodes are not used for quorum or other internal requests.
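
As a rough sketch of this bookkeeping, the state kept per peer could look like the code below. The `PeerHealth` type, its method names and the `MAX_FAILED_PINGS` threshold are assumptions made for illustration (the text only says "a number of pings"), and the sketch assumes that a successful ping resets the failure counter.

```rust
/// Hypothetical threshold: the number of missed pings after which a
/// peer is considered failed. The real value is not stated in the text.
pub const MAX_FAILED_PINGS: u32 = 3;

/// Health state tracked for one peer (hypothetical type).
#[derive(Debug, Default)]
pub struct PeerHealth {
    connected: bool,
    failed_pings: u32,
}

impl PeerHealth {
    /// The TCP session to the peer was established.
    pub fn on_connect(&mut self) {
        self.connected = true;
        self.failed_pings = 0;
    }

    /// The TCP session was lost or could not be established.
    pub fn on_disconnect(&mut self) {
        self.connected = false;
    }

    /// Record the result of one periodic ping.
    pub fn on_ping(&mut self, answered: bool) {
        if answered {
            // Assumption: one successful ping clears earlier failures.
            self.failed_pings = 0;
        } else {
            self.failed_pings += 1;
        }
    }

    /// Only peers that are up are used for quorums and internal requests.
    pub fn is_up(&self) -> bool {
        self.connected && self.failed_pings < MAX_FAILED_PINGS
    }
}
```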

### Node preference

Garage prioritizes which nodes to query according to a few criteria:

- A node always prefers itself if it can answer the request
- Then the node prioritizes nodes in the same zone
- Finally the nodes with the lowest latency are prioritized 
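
The three criteria can be read as a lexicographic sort key. The sketch below illustrates that reading. The `Candidate` type and the `order_by_preference` function are hypothetical, and Garage's real implementation may combine the criteria differently.

```rust
use std::time::Duration;

/// A node that could answer the request (hypothetical type).
#[derive(Debug, Clone)]
pub struct Candidate {
    pub id: u32,
    pub zone: String,
    pub latency: Duration,
}

/// Return node ids ordered from most preferred to least preferred.
pub fn order_by_preference(
    mut nodes: Vec<Candidate>,
    self_id: u32,
    self_zone: &str,
) -> Vec<u32> {
    // `false` sorts before `true`, so the local node comes first,
    // then nodes in the same zone, then the lowest latencies.
    nodes.sort_by_key(|n| (n.id != self_id, n.zone != self_zone, n.latency));
    nodes.into_iter().map(|n| n.id).collect()
}
```

With a local node `1` in zone `paris`, a same-zone node is tried before a node in another zone even if the remote node has a lower measured latency.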


For further reading on the cluster structure, see the [gateway](@/documentation/cookbook/gateways.md)
and [cluster layout management](@/documentation/operations/layout.md) pages.

## Garbage collection

A faulty garbage collection procedure was the cause of
[critical bug #39](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/39).
That specific bug has been fixed in the code, but there are potentially more
general issues with the garbage collector being too eager and deleting things
too early. This has been the subject of
[PR #135](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/135).
This section summarizes the discussions on this topic.

Rationale: we want to ensure Garage's safety by making sure things don't get
deleted from disk if they are still needed. Two aspects are involved in this.

### 1. Garbage collection of table entries (in `meta/` directory)

The `Entry` trait used for table entries (defined in `table/schema.rs`)
defines a function `is_tombstone()` that returns `true` if that entry
represents an entry that is deleted in the table. CRDT semantics by default
keep all tombstones, because they are necessary for reconciliation: if node A
has a tombstone that supersedes a value `x`, and node B has value `x`, A has to
keep the tombstone in memory so that the value `x` can be properly deleted at
node B. Otherwise, due to the CRDT reconciliation rule, the value `x` from B
would flow back to A and a deleted item would reappear in the system.
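
The following sketch illustrates why a tombstone must be kept until every replica has seen it. It uses a simplified last-writer-wins entry rather than Garage's real table types: the `Versioned` struct and the `merge` function are hypothetical, and only the `is_tombstone()` name comes from the text above. Timestamps are assumed to be unique.

```rust
/// A simplified last-writer-wins table entry (hypothetical type).
/// `value == None` means the entry is a tombstone.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Versioned {
    pub timestamp: u64,
    pub value: Option<String>,
}

impl Versioned {
    pub fn is_tombstone(&self) -> bool {
        self.value.is_none()
    }
}

/// Reconcile what two nodes store for the same key.
/// `None` means that the node stores nothing at all for this key.
pub fn merge(a: Option<Versioned>, b: Option<Versioned>) -> Option<Versioned> {
    match (a, b) {
        // Both nodes have an entry: the most recent one wins.
        (Some(x), Some(y)) => Some(if y.timestamp > x.timestamp { y } else { x }),
        // Only one node has an entry: it is copied to the other node.
        (x, None) => x,
        (None, y) => y,
    }
}
```

If node A still holds the tombstone, merging it with the older value `x` from node B yields the tombstone. If A has already dropped the tombstone, the merge is `merge(None, Some(x))`, which returns `x`: the deleted item reappears.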

Here, we have some control over the nodes involved in storing Garage data.
Therefore we have a garbage collector that is able to delete tombstones UNDER
CERTAIN CONDITIONS. This garbage collector is implemented in `table/gc.rs`. To
delete a tombstone, the following condition has to be met:

- All nodes responsible for storing this entry are aware of the existence of
  the tombstone, i.e. they cannot hold another version of the entry that is
  superseded by the tombstone. This ensures that deleting the tombstone is
  safe and that no deleted value will come back into the system.

Garage makes use of Sled's atomic operations (such as compare-and-swap and
transactions) to ensure that only tombstones that have been correctly
propagated to other nodes are ever deleted from the local entry tree.
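
The two checks described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, node identifiers are plain integers, and a `BTreeMap` stands in for the local entry tree instead of Sled's real API.

```rust
use std::collections::{BTreeMap, HashSet};

/// The tombstone may be collected only once every node responsible
/// for the entry has acknowledged it (hypothetical helper).
pub fn all_replicas_acked(responsible: &HashSet<u32>, acked: &HashSet<u32>) -> bool {
    responsible.is_subset(acked)
}

/// Compare-and-swap style deletion: remove `key` only if the local
/// tree still holds exactly the bytes of the tombstone that was
/// propagated. Returns `true` if the entry was removed.
pub fn delete_if_unchanged(
    tree: &mut BTreeMap<Vec<u8>, Vec<u8>>,
    key: &[u8],
    expected: &[u8],
) -> bool {
    if tree.get(key).map(|v| v.as_slice()) == Some(expected) {
        tree.remove(key);
        true
    } else {
        // The entry changed in the meantime (for example it was
        // overwritten by a newer value), so it must not be deleted.
        false
    }
}
```

The second function shows why the GC cannot remove live data in this model: if a newer value was written after the tombstone was propagated, the comparison fails and nothing is deleted.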

This GC is safe in the following sense: no non-tombstone data is ever deleted
from Garage tables.

**However**, there is an issue with the way this interacts with data
rebalancing when a partition is moving between nodes. If a node has
some data of a partition for which it is not responsible, it has to offload it,
and that offload process takes some time. In that interval, the GC does not
check whether that node has the tombstone before deleting the tombstone. If the
node does not have it, then when the offload finally happens, old data comes
back into the system.

**PR #135 mostly fixes this** by implementing a 24-hour delay before anything is
garbage collected in a table. This works under the assumption that rebalances
that follow data shuffling terminate in less than 24 hours.

**However**, in distributed systems, it is generally considered bad practice
to assume that information propagates within a certain time interval:
this amounts to making a synchrony assumption, meaning that we are
assuming a computing model with much stronger properties than an asynchronous one. To
maximize the applicability of Garage, we would like to remove this assumption
and implement a system where time does not play a role. To do this, we would
need to find a way to safely disable the GC while data is being shuffled around,
and to safely detect that the shuffling has terminated so that the GC can be
resumed. This introduces some complexity to the protocol and hasn't been
tackled yet.

### 2. Garbage collection of data blocks (in `data/` directory)

Blocks in the data directory are reference-counted. In Garage versions before
PR #135, blocks could get deleted from local disk as soon as their reference
counter reached zero. We had a mechanism to avoid triggering this immediately at the
rc-reaches-zero event, but the cleanup could be triggered by other means (for
example by a block repair operation). PR #135 added a safety measure so that
blocks never get deleted in the 10-minute interval following the time when the RC
reaches zero. This measure makes race conditions such as #39 impossible.
We would have liked to use a larger delay (e.g. 24 hours), but in the case of a
rebalance of data, this would have caused disk utilization to explode
during the rebalancing, only to shrink again after 24 hours. The 10-minute
delay is a compromise that gives good safety while avoiding this problem of
disk space explosion on rebalance.
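
A delayed deletion of this kind can be modelled as a reference counter that remembers when it reached zero. The sketch below is a simplified illustration, not the code from PR #135: the `BlockRc` type, its methods and the `BLOCK_GC_DELAY` constant are hypothetical names, and only the 10-minute value comes from the text.

```rust
use std::time::{Duration, Instant};

/// Delay between the reference count reaching zero and deletion.
pub const BLOCK_GC_DELAY: Duration = Duration::from_secs(10 * 60);

/// Reference-count state of one data block (hypothetical type).
#[derive(Debug)]
pub enum BlockRc {
    /// The block is referenced by `count` object versions.
    Present { count: u64 },
    /// The count reached zero; the block may be deleted from `at` on.
    Deletable { at: Instant },
}

impl BlockRc {
    /// A new reference cancels any pending deletion.
    pub fn incr(self) -> BlockRc {
        match self {
            BlockRc::Present { count } => BlockRc::Present { count: count + 1 },
            BlockRc::Deletable { .. } => BlockRc::Present { count: 1 },
        }
    }

    /// Dropping the last reference starts the deletion delay.
    pub fn decr(self, now: Instant) -> BlockRc {
        match self {
            BlockRc::Present { count } if count > 1 => BlockRc::Present { count: count - 1 },
            BlockRc::Present { .. } => BlockRc::Deletable { at: now + BLOCK_GC_DELAY },
            // Already at zero: keep the existing deadline.
            deletable @ BlockRc::Deletable { .. } => deletable,
        }
    }

    /// Every cleanup path (including repairs) would consult this check.
    pub fn may_delete(&self, now: Instant) -> bool {
        matches!(self, BlockRc::Deletable { at } if now >= *at)
    }
}
```

Because the deadline is stored with the counter, a block that is referenced again within the 10 minutes simply returns to the `Present` state and is never removed.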