Doc: be slightly more critical of LMDB #773
3 changed files with 53 additions and 31 deletions
@@ -27,7 +27,7 @@ To run a real-world deployment, make sure the following conditions are met:
 [Yggdrasil](https://yggdrasil-network.github.io/) are approaches to consider
 in addition to building out your own VPN tunneling.
 
 - This guide will assume you are using Docker containers to deploy Garage on each node.
 Garage can also be run independently, for instance as a [Systemd service](@/documentation/cookbook/systemd.md).
 You can also use an orchestrator such as Nomad or Kubernetes to automatically manage
 Docker containers on a fleet of nodes.
@@ -53,9 +53,9 @@ to store 2 TB of data in total.
 
 ### Best practices
 
-- If you have fast dedicated networking between all your nodes, and are planing to store
-very large files, bump the `block_size` configuration parameter to 10 MB
-(`block_size = 10485760`).
+- If you have reasonably fast networking between all your nodes, and are planing to store
+mostly large files, bump the `block_size` configuration parameter to 10 MB
+(`block_size = "10M"`).
 
 - Garage stores its files in two locations: it uses a metadata directory to store frequently-accessed
 small metadata items, and a data directory to store data blocks of uploaded objects.
@@ -68,20 +68,29 @@ to store 2 TB of data in total.
 EXT4 is not recommended as it has more strict limitations on the number of inodes,
 which might cause issues with Garage when large numbers of objects are stored.
 
-- If you only have an HDD and no SSD, it's fine to put your metadata alongside the data
-on the same drive. Having lots of RAM for your kernel to cache the metadata will
-help a lot with performance. Make sure to use the LMDB database engine,
-instead of Sled, which suffers from quite bad performance degradation on HDDs.
-Sled is still the default for legacy reasons, but is not recommended anymore.
-
-- For the metadata storage, Garage does not do checksumming and integrity
-verification on its own. If you are afraid of bitrot/data corruption,
-put your metadata directory on a ZFS or BTRFS partition. Otherwise, just use regular
-EXT4 or XFS.
-
 - Servers with multiple HDDs are supported natively by Garage without resorting
 to RAID, see [our dedicated documentation page](@/documentation/operations/multi-hdd.md).
 
+- For the metadata storage, Garage does not do checksumming and integrity
+verification on its own. Users have reported that when using the LMDB
+database engine (the default), database files have a tendency of becoming
+corrupted after an unclean shutdown (e.g. a power outage), so you should use
+a robust filesystem such as BTRFS or ZFS for the metadata partition, and take
+regular snapshots so that you can restore to a recent known-good state in
+case of an incident. If you cannot do so, you might want to switch to Sqlite
+which is more robust.
+
+- LMDB is the fastest and most tested database engine, but it has the following
+weaknesses: 1/ data files are not architecture-independent, you cannot simply
+move a Garage metadata directory between nodes running different architectures,
+and 2/ LMDB is not suited for 32-bit platforms. Sqlite is a viable alternative
+if any of these are of concern.
+
+- If you only have an HDD and no SSD, it's fine to put your metadata alongside
+the data on the same drive, but then consider your filesystem choice wisely
+(see above). Having lots of RAM for your kernel to cache the metadata will
+help a lot with performance.
+
 ## Get a Docker image
 
 Our docker image is currently named `dxflrs/garage` and is stored on the [Docker Hub](https://hub.docker.com/r/dxflrs/garage/tags?page=1&ordering=last_updated).
@@ -187,7 +196,7 @@ upgrades. With the containerized setup proposed here, the upgrade process
 will require stopping and removing the existing container, and re-creating it
 with the upgraded version.
 
-## Controling the daemon
+## Controlling the daemon
 
 The `garage` binary has two purposes:
 - it acts as a daemon when launched with `garage server`
@@ -245,7 +254,7 @@ You can then instruct nodes to connect to one another as follows:
 Venus$ garage node connect 563e1ac825ee3323aa441e72c26d1030d6d4414aeb3dd25287c531e7fc2bc95d@[fc00:1::1]:3901
 ```
 
-You don't nead to instruct all node to connect to all other nodes:
+You don't need to instruct all node to connect to all other nodes:
 nodes will discover one another transitively.
 
 Now if your run `garage status` on any node, you should have an output that looks as follows:
@@ -328,8 +337,8 @@ Given the information above, we will configure our cluster as follow:
 ```bash
 garage layout assign 563e -z par1 -c 1T -t mercury
 garage layout assign 86f0 -z par1 -c 2T -t venus
 garage layout assign 6814 -z lon1 -c 2T -t earth
 garage layout assign 212f -z bru1 -c 1.5T -t mars
 ```
 
 At this point, the changes in the cluster layout have not yet been applied.
@@ -57,7 +57,7 @@ to generate unique and private secrets for security reasons:
 cat > garage.toml <<EOF
 metadata_dir = "/tmp/meta"
 data_dir = "/tmp/data"
-db_engine = "lmdb"
+db_engine = "sqlite"
 
 replication_mode = "none"
 
@@ -264,18 +264,31 @@ Performance characteristics of the different DB engines are as follows:
 - Sled: tends to produce large data files and also has performance issues,
 especially when the metadata folder is on a traditional HDD and not on SSD.
 
-- LMDB: the recommended database engine on 64-bit systems, much more
-space-efficient and slightly faster. Note that the data format of LMDB is not
-portable between architectures, so for instance the Garage database of an
-x86-64 node cannot be moved to an ARM64 node. Also note that, while LMDB can
-technically be used on 32-bit systems, this will limit your node to very
-small database sizes due to how LMDB works; it is therefore not recommended.
+- LMDB: the recommended database engine for high-performance distributed
+clusters, much more space-efficient and significantly faster. LMDB works very
+well, but is known to have the following limitations:
+
+- The data format of LMDB is not portable between architectures, so for
+instance the Garage database of an x86-64 node cannot be moved to an ARM64
+node.
+
+- While LMDB can technically be used on 32-bit systems, this will limit your
+node to very small database sizes due to how LMDB works; it is therefore
+not recommended.
+
+- Several users have reported corrupted LMDB database files after an unclean
+shutdown (e.g. a power outage). This situation can generally be recovered
+from if your cluster is geo-replicated (by rebuilding your metadata db from
+other nodes), or if you have saved regular snapshots at the filesystem
+level.
 
 - Sqlite: Garage supports Sqlite as an alternative storage backend for
-metadata, and although it has not been tested as much, it is expected to work
-satisfactorily. Since Garage v0.9.0, performance issues have largely been
-fixed by allowing for a no-fsync mode (see `metadata_fsync`). Sqlite does not
-have the database size limitation of LMDB on 32-bit systems.
+metadata, which does not have the issues listed above for LMDB.
+On versions 0.8.x and earlier, Sqlite should be avoided due to abysmal
+performance, which was fixed with the addition of `metadata_fsync`.
+Sqlite is still probably slower than LMDB due to the way we use it,
+so it is not the best choice for high-performance storage clusters,
+but it should work fine in many cases.
 
 It is possible to convert Garage's metadata directory from one format to another
 using the `garage convert-db` command, which should be used as follows:
@@ -302,7 +315,7 @@ Using this option reduces the risk of simultaneous metadata corruption on severa
 cluster nodes, which could lead to data loss.
 
 If multi-site replication is used, this option is most likely not necessary, as
 it is extremely unlikely that two nodes in different locations will have a
 power failure at the exact same time.
 
 (Metadata corruption on a single node is not an issue, the corrupted data file
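
For anyone skimming the diff, here is a rough sketch of how the recommendations in this change might look in a single `garage.toml`. It is illustrative only and not part of the patch: the two directory paths are hypothetical placeholders, `metadata_fsync` is assumed here to be a boolean switch, and the other values are copied from the changed lines.

```toml
# Sketch only, not part of the patch. Paths are hypothetical placeholders.
# The docs suggest a robust filesystem (BTRFS or ZFS) with regular
# snapshots for the metadata partition, especially when using LMDB.
metadata_dir = "/srv/garage/meta"
data_dir = "/srv/garage/data"

# Sqlite is described as free of the LMDB issues listed in the diff, though
# probably slower; "lmdb" remains the recommendation for
# high-performance distributed clusters.
db_engine = "sqlite"

# Assumption: boolean option. The old text credits the no-fsync mode with
# fixing Sqlite performance; enabling fsync presumably trades speed for
# durability.
metadata_fsync = false

# Suggested by the docs for reasonably fast networking and mostly large files.
block_size = "10M"
```

Switching `db_engine` on an existing node would presumably also need the `garage convert-db` step that the configuration reference mentions.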