Move out garage's doc
parent
538a72d198
commit
79e604c87e
4 changed files with 4 additions and 338 deletions
|
@ -1,156 +0,0 @@
|
||||||
#### Modules
|
|
||||||
|
|
||||||
- `membership/`: configuration, membership management (gossip of node's presence and status), ring generation --> what about Serf (used by Consul/Nomad) : https://www.serf.io/? Seems a huge library with many features so maybe overkill/hard to integrate
|
|
||||||
- `metadata/`: metadata management
|
|
||||||
- `blocks/`: block management, writing, GC and rebalancing
|
|
||||||
- `internal/`: server to server communication (HTTP server and client that reuses connections, TLS if we want, etc)
|
|
||||||
- `api/`: S3 API
|
|
||||||
- `web/`: web management interface
|
|
||||||
|
|
||||||
#### Metadata tables
|
|
||||||
|
|
||||||
**Objects:**
|
|
||||||
|
|
||||||
- *Hash key:* Bucket name (string)
|
|
||||||
- *Sort key:* Object key (string)
|
|
||||||
- *Sort key:* Version timestamp (int)
|
|
||||||
- *Sort key:* Version UUID (string)
|
|
||||||
- Complete: bool
|
|
||||||
- Inline: bool, true for objects < threshold (say 1024)
|
|
||||||
- Object size (int)
|
|
||||||
- Mime type (string)
|
|
||||||
- Data for inlined objects (blob)
|
|
||||||
- Hash of first block otherwise (string)
|
|
||||||
|
|
||||||
*Having only a hash key on the bucket name means that all entries of this table for a given bucket are stored on a single node. At the same time, it is the only way I see to list all the entries of a bucket quickly...*
|
|
||||||
|
|
||||||
**Blocks:**
|
|
||||||
|
|
||||||
- *Hash key:* Version UUID (string)
|
|
||||||
- *Sort key:* Offset of block in total file (int)
|
|
||||||
- Hash of data block (string)
|
|
||||||
|
|
||||||
A version is defined by the existence of at least one entry in the blocks table for a certain version UUID.
|
|
||||||
We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table.
|
|
||||||
We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, thus we may have several concurrent versions of an object.
|
|
||||||
Important: before deleting an older version from the objects table, we must make sure that the blocks of that version were successfully deleted from the blocks table.
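
As an illustration of the two tables and of the invariant above, here is a minimal in-memory sketch. The class and function names (`ObjectVersionKey`, `ObjectVersion`, `invariant_holds`) are hypothetical and only mirror the columns listed above; the real tables are meant to live in the metadata store, not in Python dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ObjectVersionKey:
    """Index columns of the objects table (hash key + sort keys)."""
    bucket: str      # hash key
    key: str         # sort key
    timestamp: int   # sort key
    uuid: str        # sort key


@dataclass
class ObjectVersion:
    """Non-index columns of the objects table."""
    complete: bool
    inline: bool
    size: int
    mime_type: str
    inline_data: Optional[bytes] = None      # only for inlined objects
    first_block_hash: Optional[str] = None   # only for non-inlined objects


# Blocks table: (version UUID, offset in file) -> hash of data block
BlocksTable = Dict[Tuple[str, int], str]


def invariant_holds(objects: Dict[ObjectVersionKey, ObjectVersion],
                    blocks: BlocksTable) -> bool:
    """Every version present in the blocks table must be referenced
    by some entry of the objects table."""
    known_versions = {k.uuid for k in objects}
    return all(version in known_versions for (version, _offset) in blocks)
```

Deleting block entries before the object entry, as required above, is what keeps `invariant_holds` true at every intermediate step.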
|
|
||||||
|
|
||||||
Thus, the workflow for reading an object is as follows:
|
|
||||||
|
|
||||||
1. Check permissions (LDAP)
|
|
||||||
2. Read entry in object table. If data is inline, we have its data, stop here.
|
|
||||||
-> if several versions, take newest one and launch deletion of old ones in background
|
|
||||||
3. Read first block from cluster. If size <= 1 block, stop here.
|
|
||||||
4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks
|
|
||||||
5. Read subsequent blocks from cluster
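
The read steps above (minus the permission check) can be sketched as follows. This is only an illustration over in-memory structures: the dictionary layout, the `fetch_block` callback, and the choice to ignore incomplete versions are assumptions, and the sketch fetches blocks sequentially rather than overlapping steps 3 and 4.

```python
def read_object(versions, blocks_table, fetch_block):
    """Return (data, stale_version_uuids) for one object.

    versions:     list of dicts with keys timestamp, uuid, complete,
                  inline, data, first_block (hypothetical layout)
    blocks_table: dict mapping version uuid -> list of (offset, block_hash)
    fetch_block:  callable taking a block hash and returning its bytes
    """
    # Assumption: versions still being uploaded (complete = false) are skipped.
    complete = [v for v in versions if v["complete"]]
    if not complete:
        raise KeyError("no complete version for this object")

    # Step 2: take the newest version; older ones are reported so that the
    # caller can launch their deletion in the background.
    newest = max(complete, key=lambda v: (v["timestamp"], v["uuid"]))
    stale = [v["uuid"] for v in complete if v is not newest]
    if newest["inline"]:
        return newest["data"], stale

    # Step 3: the first block's hash is stored in the object entry itself.
    data = fetch_block(newest["first_block"])

    # Steps 4-5: look up and read the remaining blocks in offset order.
    remaining = sorted(blocks_table[newest["uuid"]])[1:]
    for _offset, block_hash in remaining:
        data += fetch_block(block_hash)
    return data, stale
```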
|
|
||||||
|
|
||||||
Workflow for PUT:
|
|
||||||
|
|
||||||
1. Check write permission (LDAP)
|
|
||||||
2. Select a new version UUID
|
|
||||||
3. Write a preliminary entry for the new version in the objects table with complete = false
|
|
||||||
4. Send blocks to cluster and write entries in the blocks table
|
|
||||||
5. Update the version with complete = true and all of the accurate information (size, etc)
|
|
||||||
6. Return success to the user
|
|
||||||
7. Launch a background job to check and delete older versions
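
A rough sketch of steps 2 to 6 of the PUT workflow, using plain dictionaries in place of the metadata tables and of the cluster. The function name, the dictionary layouts, and the use of SHA-256 as block hash are assumptions made for the example; the permission check and the background cleanup job are left out.

```python
import hashlib
import uuid

BLOCK_SIZE = 1024 * 1024  # "around 1MB", see the Constants section


def put_object(objects, blocks, store, bucket, key, data, timestamp,
               block_size=BLOCK_SIZE):
    """Store `data` as a new version and return its version UUID."""
    # Step 2: select a new version UUID.
    version = str(uuid.uuid4())

    # Step 3: preliminary entry with complete = false.
    entry = {"complete": False, "size": 0}
    objects[(bucket, key, timestamp, version)] = entry

    # Step 4: send blocks to the cluster and fill the blocks table.
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_hash = hashlib.sha256(chunk).hexdigest()  # assumed hash function
        store[block_hash] = chunk
        blocks[(version, offset)] = block_hash

    # Step 5: mark the version complete with accurate information.
    entry.update(complete=True, size=len(data))

    # Step 6: report success to the caller.
    return version
```

If the upload fails between steps 3 and 5, the entry stays at complete = false, which is what lets a reader or a cleanup job recognise it as unfinished.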
|
|
||||||
|
|
||||||
Workflow for DELETE:
|
|
||||||
|
|
||||||
1. Check write permission (LDAP)
|
|
||||||
2. Get current version (or versions) in object table
|
|
||||||
3. Do the deletion of those versions NOT IN A BACKGROUND JOB THIS TIME
|
|
||||||
4. Return success to the user if we were able to delete blocks from the blocks table and entries from the object table
|
|
||||||
|
|
||||||
To delete a version:
|
|
||||||
|
|
||||||
1. List the blocks from Cassandra
|
|
||||||
2. For each block, delete it from cluster. Don't care if some deletions fail, we can do GC.
|
|
||||||
3. Delete all of the blocks from the blocks table
|
|
||||||
4. Finally, delete the version from the objects table
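
The ordering of these four steps is what preserves the invariant between the two tables. A minimal sketch, again over dictionaries; `delete_block` stands for a hypothetical call to the cluster, and treating `OSError` as the failure signal is an assumption of the example.

```python
def delete_version(objects, blocks, version_key, delete_block):
    """Delete one version.

    objects:      dict (bucket, key, timestamp, uuid) -> entry
    blocks:       dict (uuid, offset) -> block hash
    delete_block: callable (block_hash, version_uuid), may raise OSError
    """
    version = version_key[3]

    # Step 1: list the blocks of this version.
    entries = [(k, h) for k, h in blocks.items() if k[0] == version]

    # Step 2: best-effort deletion from the cluster; failures are left to GC.
    for _key, block_hash in entries:
        try:
            delete_block(block_hash, version)
        except OSError:
            pass

    # Step 3: remove all entries of this version from the blocks table.
    for key, _hash in entries:
        del blocks[key]

    # Step 4: only now remove the version from the objects table.
    del objects[version_key]
```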
|
|
||||||
|
|
||||||
Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is ok to leave it like this, we just cut the connection if data disappears during a read.
|
|
||||||
|
|
||||||
("Let P be a problem; 'we don't care' is a solution to that problem")
|
|
||||||
|
|
||||||
#### Block storage on disk
|
|
||||||
|
|
||||||
**Blocks themselves:**
|
|
||||||
|
|
||||||
- file path = /blobs/(first 3 hex digits of hash)/(rest of hash)
|
|
||||||
|
|
||||||
**Reverse index for GC & other block-level metadata:**
|
|
||||||
|
|
||||||
- file path = /meta/(first 3 hex digits of hash)/(rest of hash)
|
|
||||||
- map block hash -> set of version UUIDs where it is referenced
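
For illustration, the two paths for a given block hash could be computed as below. The function name and the `root` parameter are hypothetical; only the `blobs/` and `meta/` layout comes from the description above.

```python
from pathlib import PurePosixPath


def block_paths(block_hash: str, root: str = "/"):
    """Return (data path, metadata path) for a hex-encoded block hash."""
    prefix, rest = block_hash[:3], block_hash[3:]
    base = PurePosixPath(root)
    return base / "blobs" / prefix / rest, base / "meta" / prefix / rest
```

With three hex digits, each of the two trees is split into at most 4096 subdirectories.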
|
|
||||||
|
|
||||||
Useful metadata:
|
|
||||||
|
|
||||||
- list of versions that reference this block in the Cassandra table, so that we can do GC by checking in Cassandra that the lines still exist
|
|
||||||
- list of other nodes that we know have acknowledged a write of this block, useful in the rebalancing algorithm
|
|
||||||
|
|
||||||
Write strategy: have a single thread that does all write IO so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close, then rename so that a concurrent read gets a consistent result (either not found or found with whole content).
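
The "temporary file, close, then rename" part of the write strategy can be sketched like this. The function name is hypothetical, and the `fsync` call is an extra durability precaution that the text above does not ask for.

```python
import os
import tempfile


def write_blob_atomically(path: str, data: bytes) -> None:
    """Write `data` to `path` so that readers see either nothing or all of it."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)

    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # assumption: we also want durability
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```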
|
|
||||||
|
|
||||||
Read strategy: the only read operation is get(hash) that returns either the data or not found (can do a corruption check as well and return corrupted state if it is the case). Can be done concurrently with writes.
|
|
||||||
|
|
||||||
**Internal API:**
|
|
||||||
|
|
||||||
- get(block hash) -> ok+data/not found/corrupted
|
|
||||||
- put(block hash & data, version uuid + offset) -> ok/error
|
|
||||||
- put with no data(block hash, version uuid + offset) -> ok/not found plz send data/error
|
|
||||||
- delete(block hash, version uuid + offset) -> ok/error
|
|
||||||
|
|
||||||
GC: when last ref is deleted, delete block.
|
|
||||||
Long GC procedure: check in Cassandra that the version UUIDs still exist and still reference this block.
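
To make the internal API and the reference-counting GC concrete, here is a small in-memory sketch. The class name, the string return values, and the use of SHA-256 for the corruption check are assumptions; a real node would keep data and references on disk as described above.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for one node's block storage."""

    def __init__(self):
        self.data = {}  # block hash -> bytes
        self.refs = {}  # block hash -> set of (version uuid, offset)

    def get(self, block_hash):
        if block_hash not in self.data:
            return ("not found", None)
        blob = self.data[block_hash]
        # Corruption check: the content must still match its hash.
        if hashlib.sha256(blob).hexdigest() != block_hash:
            return ("corrupted", None)
        return ("ok", blob)

    def put(self, block_hash, ref, blob=None):
        """With blob=None this is the "put with no data" call."""
        if blob is None and block_hash not in self.data:
            return "not found, please send data"
        if blob is not None:
            self.data.setdefault(block_hash, blob)
        self.refs.setdefault(block_hash, set()).add(ref)
        return "ok"

    def delete(self, block_hash, ref):
        refs = self.refs.get(block_hash, set())
        refs.discard(ref)
        if not refs:
            # GC: the last reference is gone, so the block goes too.
            self.refs.pop(block_hash, None)
            self.data.pop(block_hash, None)
        return "ok"
```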
|
|
||||||
|
|
||||||
Rebalancing: takes as argument the list of newly added nodes.
|
|
||||||
|
|
||||||
- List all blocks that we have. For each block:
|
|
||||||
- If it hits a newly introduced node, send it to them.
|
|
||||||
Use put with no data first to check if it has to be sent to them already or not.
|
|
||||||
Use a random listing order to avoid race conditions (they do no harm but we might have two nodes sending the same thing at the same time thus wasting time).
|
|
||||||
- If it doesn't hit us anymore, delete it and its reference list.
|
|
||||||
|
|
||||||
Only one rebalancing can run at a time. It can be restarted from the beginning with new parameters.
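
The rebalancing pass can be sketched as a single loop. All parameters are hypothetical callbacks: `owners_of` returns the nodes that the ring assigns to a block, and `send` is expected to try "put with no data" before transferring the content.

```python
import random


def rebalance(my_id, my_blocks, new_nodes, owners_of, send, delete_local,
              rng=random):
    """Run one rebalancing pass over the blocks held by this node."""
    hashes = list(my_blocks)
    # Random order, so that two nodes are unlikely to push the same block
    # to the same destination at the same moment.
    rng.shuffle(hashes)

    for block_hash in hashes:
        owners = owners_of(block_hash)
        for node in owners:
            if node in new_nodes:
                send(node, block_hash)
        if my_id not in owners:
            # The ring no longer assigns this block to us: drop the block
            # and its reference list.
            delete_local(block_hash)
```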
|
|
||||||
|
|
||||||
#### Membership management
|
|
||||||
|
|
||||||
Two sets of nodes:
|
|
||||||
|
|
||||||
- set of nodes from which a ping was recently received, with status: number of stored blocks, request counters, error counters, GC%, rebalancing%
|
|
||||||
(eviction from this set after say 30 seconds without ping)
|
|
||||||
- set of nodes that are part of the system, explicitly modified by the operator using the web UI (persisted to disk),
|
|
||||||
is a CRDT using a version number for the value of the whole set
|
|
||||||
|
|
||||||
Thus, three states for nodes:
|
|
||||||
|
|
||||||
- healthy: in both sets
|
|
||||||
- missing: not pingable but part of desired cluster
|
|
||||||
- unused/draining: currently present but not part of the desired cluster, empty = if contains nothing, draining = if still contains some blocks
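
The state of a node follows directly from the two sets. A minimal sketch, in which the function name and the extra `"unknown"` result for a node that is in neither set are assumptions:

```python
def node_state(node_id, pingable, desired, stored_blocks=0):
    """Classify a node from the pingable set and the operator's desired set."""
    if node_id in desired:
        return "healthy" if node_id in pingable else "missing"
    if node_id in pingable:
        # Present but not wanted: draining while it still holds blocks.
        return "draining" if stored_blocks > 0 else "unused"
    return "unknown"
```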
|
|
||||||
|
|
||||||
Membership messages between nodes:
|
|
||||||
|
|
||||||
- ping with current state + hash of current membership info -> reply with same info
|
|
||||||
- send&get back membership info (the ids of nodes that are in the two sets): used when no local membership change in a long time and membership info hash discrepancy detected with first message (passive membership fixing with full CRDT gossip)
|
|
||||||
- inform of newly pingable node(s) -> no result, when receive new info repeat to all (reliable broadcast)
|
|
||||||
- inform of operator membership change -> no result, when receive new info repeat to all (reliable broadcast)
|
|
||||||
|
|
||||||
Ring: generated from the desired set of nodes, however when doing read/writes on the ring, skip nodes that are known to be not pingable.
|
|
||||||
The tokens are generated in a deterministic fashion from node IDs (hash of node id + token number from 1 to K).
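
A possible sketch of the deterministic token generation and of a lookup that skips non-pingable nodes. The hash function (SHA-256 truncated to 64 bits), the `node_id:token_number` encoding, and the function names are assumptions chosen for the example.

```python
import bisect
import hashlib


def tokens_for(node_id: str, k: int):
    """K tokens for a node, derived from hash(node id + token number 1..K)."""
    return [
        int.from_bytes(hashlib.sha256(f"{node_id}:{i}".encode()).digest()[:8], "big")
        for i in range(1, k + 1)
    ]


def build_ring(tokens_per_node: dict):
    """Sorted list of (token, node id) built from the desired set of nodes."""
    return sorted((t, n) for n, k in tokens_per_node.items() for t in tokens_for(n, k))


def lookup(ring, key_hash: int, pingable: set, replicas: int = 1):
    """Walk the ring clockwise from key_hash, skipping non-pingable nodes."""
    start = bisect.bisect_left(ring, (key_hash, ""))
    found = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node in pingable and node not in found:
            found.append(node)
            if len(found) == replicas:
                break
    return found
```

Because the tokens depend only on node IDs and K, every node that holds the same operator-defined set computes the same ring without any extra coordination.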
|
|
||||||
Number K of tokens per node: decided by the operator & stored in the operator's list of nodes CRDT. Default value proposal: along with node status information, also broadcast the disk's total size and free space, and propose a default number of tokens equal to (80% of free space) / 10 GB. (This is all user interface.)
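
As a worked example of the proposed default, assuming decimal gigabytes and a floor of one token (both assumptions, as is the function name):

```python
GB = 10**9  # assumption: decimal gigabytes


def default_token_count(free_bytes: int) -> int:
    """80% of the free space, divided by 10 GB, with at least one token."""
    usable = free_bytes * 8 // 10
    return max(1, usable // (10 * GB))
```

A node reporting 1 TB of free space would thus be offered 80 tokens by default, which the operator can still override.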
|
|
||||||
|
|
||||||
|
|
||||||
#### Constants
|
|
||||||
|
|
||||||
- Block size: around 1 MB? --> Exoscale uses 16 MB chunks
|
|
||||||
- Number of tokens in the hash ring: one per 10 GB of allocated storage
|
|
||||||
- Threshold for storing data directly in the Cassandra objects table: 1 KB (maybe up to 4 KB?)
|
|
||||||
- Ping timeout (time after which a node is registered as unresponsive/missing): 30 seconds
|
|
||||||
- Ping interval: 10 seconds
|
|
||||||
- ??
|
|
||||||
|
|
||||||
#### Links
|
|
||||||
|
|
||||||
- CDC: <https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf>
|
|
||||||
- Erasure coding: <http://web.eecs.utk.edu/~jplank/plank/papers/CS-08-627.html>
|
|
||||||
- [Openstack Storage Concepts](https://docs.openstack.org/arch-design/design-storage/design-storage-concepts.html)
|
|
||||||
- [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf)
|
|
|
@ -1,140 +0,0 @@
|
||||||
# Quickstart on an existing deployment
|
|
||||||
|
|
||||||
First, chances are that your garage deployment is secured by TLS.
|
|
||||||
If so, every command must be given the corresponding certificates.
|
|
||||||
I will define an alias once and for all to ease future commands.
|
|
||||||
Please adapt the path of the binary and certificates to your installation!
|
|
||||||
|
|
||||||
```
|
|
||||||
alias grg="/garage/garage --ca-cert /secrets/garage-ca.crt --client-cert /secrets/garage.crt --client-key /secrets/garage.key"
|
|
||||||
```
|
|
||||||
|
|
||||||
Now we can check that everything is going well by checking our cluster status:
|
|
||||||
|
|
||||||
```
|
|
||||||
grg status
|
|
||||||
```
|
|
||||||
|
|
||||||
Don't forget that the `help` command and the `--help` flag of each subcommand can help you anywhere: the CLI tool is self-documented! Two examples:
|
|
||||||
|
|
||||||
```
|
|
||||||
grg help
|
|
||||||
grg bucket allow --help
|
|
||||||
```
|
|
||||||
|
|
||||||
Fine, now let's create a bucket (let's imagine that you want to deploy Nextcloud):
|
|
||||||
|
|
||||||
```
|
|
||||||
grg bucket create nextcloud-bucket
|
|
||||||
```
|
|
||||||
|
|
||||||
Check that everything went well:
|
|
||||||
|
|
||||||
```
|
|
||||||
grg bucket list
|
|
||||||
grg bucket info nextcloud-bucket
|
|
||||||
```
|
|
||||||
|
|
||||||
Now we will generate an API key to access this bucket.
|
|
||||||
Note that API keys are independent of buckets: one key can access multiple buckets, multiple keys can access one bucket.
|
|
||||||
|
|
||||||
Now, let's start by creating a key only for our PHP application:
|
|
||||||
|
|
||||||
```
|
|
||||||
grg key new --name nextcloud-app-key
|
|
||||||
```
|
|
||||||
|
|
||||||
You will have the following output (this one is fake, `key_id` and `secret_key` were generated with the openssl CLI tool):
|
|
||||||
|
|
||||||
```
|
|
||||||
Key { key_id: "GK3515373e4c851ebaad366558", secret_key: "7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34", name: "nextcloud-app-key", name_timestamp: 1603280506694, deleted: false, authorized_buckets: [] }
|
|
||||||
```
|
|
||||||
|
|
||||||
Check that everything works as intended (be careful, info works only with your key identifier and not with its friendly name!):
|
|
||||||
|
|
||||||
```
|
|
||||||
grg key list
|
|
||||||
grg key info GK3515373e4c851ebaad366558
|
|
||||||
```
|
|
||||||
|
|
||||||
Now that we have a bucket and a key, we need to give permissions to the key on the bucket!
|
|
||||||
|
|
||||||
```
|
|
||||||
grg bucket allow --read --write nextcloud-bucket --key GK3515373e4c851ebaad366558
|
|
||||||
```
|
|
||||||
|
|
||||||
You can check at any time which keys are allowed on your bucket with:
|
|
||||||
|
|
||||||
```
|
|
||||||
grg bucket info nextcloud-bucket
|
|
||||||
```
|
|
||||||
|
|
||||||
Now, let's move to the S3 API!
|
|
||||||
We will use the `s3cmd` CLI tool.
|
|
||||||
You can install it via your favorite package manager.
|
|
||||||
Otherwise, check [their website](https://s3tools.org/s3cmd).
|
|
||||||
|
|
||||||
We will configure `s3cmd` with its interactive configuration tool. Be careful: not all endpoints are implemented!
|
|
||||||
In particular, the test run at the end does not work (yet).
|
|
||||||
|
|
||||||
```
|
|
||||||
$ s3cmd --configure
|
|
||||||
|
|
||||||
Enter new values or accept defaults in brackets with Enter.
|
|
||||||
Refer to user manual for detailed description of all options.
|
|
||||||
|
|
||||||
Access key and Secret key are your identifiers for Amazon S3. Leave them empty for using the env variables.
|
|
||||||
Access Key: GK3515373e4c851ebaad366558
|
|
||||||
Secret Key: 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34
|
|
||||||
Default Region [US]: garage
|
|
||||||
|
|
||||||
Use "s3.amazonaws.com" for S3 Endpoint and not modify it to the target Amazon S3.
|
|
||||||
S3 Endpoint [s3.amazonaws.com]: garage.deuxfleurs.fr
|
|
||||||
|
|
||||||
Use "%(bucket)s.s3.amazonaws.com" to the target Amazon S3. "%(bucket)s" and "%(location)s" vars can be used
|
|
||||||
if the target S3 system supports dns based buckets.
|
|
||||||
DNS-style bucket+hostname:port template for accessing a bucket [%(bucket)s.s3.amazonaws.com]: garage.deuxfleurs.fr
|
|
||||||
|
|
||||||
Encryption password is used to protect your files from reading
|
|
||||||
by unauthorized persons while in transfer to S3
|
|
||||||
Encryption password:
|
|
||||||
Path to GPG program [/usr/bin/gpg]:
|
|
||||||
|
|
||||||
When using secure HTTPS protocol all communication with Amazon S3
|
|
||||||
servers is protected from 3rd party eavesdropping. This method is
|
|
||||||
slower than plain HTTP, and can only be proxied with Python 2.7 or newer
|
|
||||||
Use HTTPS protocol [Yes]:
|
|
||||||
|
|
||||||
On some networks all internet access must go through a HTTP proxy.
|
|
||||||
Try setting it here if you can't connect to S3 directly
|
|
||||||
HTTP Proxy server name:
|
|
||||||
|
|
||||||
New settings:
|
|
||||||
Access Key: GK3515373e4c851ebaad366558
|
|
||||||
Secret Key: 7d37d093435a41f2aab8f13c19ba067d9776c90215f56614adad6ece597dbb34
|
|
||||||
Default Region: garage
|
|
||||||
S3 Endpoint: garage.deuxfleurs.fr
|
|
||||||
DNS-style bucket+hostname:port template for accessing a bucket: garage.deuxfleurs.fr
|
|
||||||
Encryption password:
|
|
||||||
Path to GPG program: /usr/bin/gpg
|
|
||||||
Use HTTPS protocol: True
|
|
||||||
HTTP Proxy server name:
|
|
||||||
HTTP Proxy server port: 0
|
|
||||||
|
|
||||||
Test access with supplied credentials? [Y/n] n
|
|
||||||
|
|
||||||
Save settings? [y/N] y
|
|
||||||
Configuration saved to '/home/quentin/.s3cfg'
|
|
||||||
```
|
|
||||||
|
|
||||||
Now, if everything works, the following commands should work:
|
|
||||||
|
|
||||||
```
|
|
||||||
echo hello world > hello.txt
|
|
||||||
s3cmd put hello.txt s3://nextcloud-bucket
|
|
||||||
s3cmd ls s3://nextcloud-bucket
|
|
||||||
s3cmd rm s3://nextcloud-bucket/hello.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
That's all for now!
|
|
||||||
|
|
|
@ -1,38 +0,0 @@
|
||||||
## Context
|
|
||||||
|
|
||||||
Data storage is critical: it can lead to data loss if done badly and/or on hardware failure.
|
|
||||||
Filesystems + RAID can help on a single machine but a machine failure can put the whole storage offline.
|
|
||||||
Moreover, it puts a hard limit on scalability. Often this limit can be pushed far back by buying expensive machines.
|
|
||||||
But here we consider non-specialized, off-the-shelf machines that can be as low-powered and failure-prone as a Raspberry Pi.
|
|
||||||
|
|
||||||
Distributed storage may help to solve both availability and scalability problems on these machines.
|
|
||||||
Many solutions have been proposed; they can be categorized as block storage, file storage and object storage depending on the abstraction they provide.
|
|
||||||
|
|
||||||
## Related work
|
|
||||||
|
|
||||||
Block storage is the lowest-level one: it's like exposing your raw hard drive over the network.
|
|
||||||
It requires very low latency and a stable network, which is often a dedicated one.
|
|
||||||
However, it provides disk devices that the operating system can manipulate with the fewest constraints: they can be partitioned and formatted with any filesystem, meaning that even the most exotic features are supported.
|
|
||||||
We can cite [iSCSI](https://en.wikipedia.org/wiki/ISCSI) or [Fibre Channel](https://en.wikipedia.org/wiki/Fibre_Channel).
|
|
||||||
OpenStack Cinder proxies the previous solutions to provide a uniform API.
|
|
||||||
|
|
||||||
File storage provides a higher abstraction: each solution is one filesystem among others, which means it doesn't necessarily have all the exotic features of every filesystem.
|
|
||||||
Often, these systems relax some POSIX constraints, yet many applications remain compatible without any modification.
|
|
||||||
As an example, we are able to run MariaDB (very slowly) over GlusterFS...
|
|
||||||
We can also mention CephFS (read [RADOS](https://ceph.com/wp-content/uploads/2016/08/weil-rados-pdsw07.pdf) whitepaper), Lustre, LizardFS, MooseFS, etc.
|
|
||||||
OpenStack Manila proxies the previous solutions to provide a uniform API.
|
|
||||||
|
|
||||||
Finally, object stores provide the highest-level abstraction.
|
|
||||||
They are evidence that the POSIX filesystem API is not well suited to distributed filesystems.
|
|
||||||
In particular, strong consistency has been dropped in favor of eventual consistency, which is far more convenient and powerful in the presence of high latency and unreliability.
|
|
||||||
S3, which pioneered the concept, is often described as a filesystem for the WAN.
|
|
||||||
Applications must be adapted to work for the desired object storage service.
|
|
||||||
Today, the S3 HTTP REST API acts as a standard in the industry.
|
|
||||||
However, the Amazon S3 source code is not open, although alternatives have been proposed.
|
|
||||||
We identified Minio, Pithos, Swift and Ceph.
|
|
||||||
Minio and Ceph enforce a total order, which gives them properties similar to those of a (relaxed) filesystem.
|
|
||||||
Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring.
|
|
||||||
However, Pithos is not maintained anymore. More precisely, the company that published Pithos version 1 has developed a version 2 but has not open-sourced it.
|
|
||||||
Some tests conducted by the [ACIDES project](https://acides.org/) have shown that OpenStack Swift consumes far more resources (CPU+RAM) than we can afford. Furthermore, the people developing Swift have not designed their software for geo-distribution.
|
|
||||||
|
|
||||||
There were many attempts in research too. I am thinking in particular of [LBFS](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf), which was used as a basis for Seafile. But none of them have been effectively implemented yet.
|
|
|
@ -16,10 +16,10 @@ Non-goals include:
|
||||||
|
|
||||||
Currently, Garage is deployed on our cluster (this very website is hosted on garage!) but must be considered as a technical preview.
|
Currently, Garage is deployed on our cluster (this very website is hosted on garage!) but must be considered as a technical preview.
|
||||||
|
|
||||||
If you want to learn more about Garage, you can check our documentation:
|
If you want to learn more about Garage, you can check our [documentation](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc):
|
||||||
- [Quickstart](/Technique/Développement/Garage/Quickstart.html), learn how to quickly interact with garage.
|
- [Quickstart](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc/Internals.md), learn how to quickly interact with garage.
|
||||||
- [Related Work](/Technique/Développement/Garage/Related%20Work.html), understand why we decided to build a new software instead of using existing ones.
|
- [Related Work](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc/Related%20Work.md), understand why we decided to build a new software instead of using existing ones.
|
||||||
- [Internals](/Technique/Développement/Garage/Internals.html), contains a quick description of the data models that are used in Garage.
|
- [Internals](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/master/doc/Internals.md), contains a quick description of the data models that are used in Garage.
|
||||||
|
|
||||||
External links:
|
External links:
|
||||||
- [Repository](https://git.deuxfleurs.fr/Deuxfleurs/garage/), Garage is a free software, developed on our own Gitea instance
|
- [Repository](https://git.deuxfleurs.fr/Deuxfleurs/garage/), Garage is a free software, developed on our own Gitea instance
|
||||||
|
|