K2V #293
1 changed files with 467 additions and 0 deletions
467
doc/drafts/k2v-spec.md
@ -0,0 +1,467 @@
# Specification of the Garage K2V API (K2V = Key/Key/Value)
|
||||
|
||||
- We are storing triplets of the form `(partition key, sort key, value)`. There
are no user-defined fields: the client is responsible for writing whatever it
wants in the value (typically an encrypted blob). Values are binary blobs,
which are always represented as their base64 encoding in the JSON API.
Partition keys and sort keys are UTF-8 strings.
|
||||
|
||||
- Triplets are stored in buckets; each bucket stores a separate set of triplets
|
||||
|
||||
- Bucket names and access keys are the same as for accessing the S3 API
|
||||
|
||||
- K2V triplets exist separately from S3 objects. K2V triplets are not visible
through the S3 API, and S3 objects are not visible through the K2V API.
|
||||
|
||||
- Values stored for triplets have associated causality information that enables
Garage to detect concurrent writes. In case of concurrent writes, Garage
keeps the concurrent values until a further write supersedes them. This is
the same method as the one implemented by Riak KV. It is based on DVVS
(dotted version vector sets), described in the paper "Scalable and Accurate
Causality Tracking for Eventually Consistent Data Stores", as well as
[here](https://github.com/ricardobcl/Dotted-Version-Vectors).
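
To make the "keep concurrent values until a write supersedes them" behaviour concrete, here is a deliberately simplified toy model in Python. It is not DVVS and not Garage's implementation: the class `ToyItem` and the idea of a token being the set of version ids seen by a read are assumptions made purely for illustration.

```python
import itertools


class ToyItem:
    """Toy model of a single K2V item; NOT Garage's DVVS implementation."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.versions = {}  # version id -> value (None stands for a tombstone)

    def read(self):
        """Return (token, values); the token records which versions were seen."""
        token = frozenset(self.versions)
        return token, list(self.versions.values())

    def write(self, value, token=frozenset()):
        """Store value, discarding only the versions named in the token."""
        for seen in token:
            self.versions.pop(seen, None)
        self.versions[next(self._ids)] = value


item = ToyItem()
item.write(b"a")                 # first write, no token needed
token, _ = item.read()
item.write(b"b", token)          # supersedes b"a"
item.write(b"c", token)          # same token again: concurrent with b"b"
_, concurrent = item.read()      # both b"b" and b"c" are kept
token2, _ = item.read()
item.write(b"d", token2)         # this write has seen both, so it supersedes both
_, resolved = item.read()
```

In this sketch a write that reuses an old token cannot remove values it never saw, which is the property the causality token is meant to provide.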
|
||||
|
||||
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### Operations on single items
|
||||
|
||||
**ReadItem: `GET /<bucket>/<partition key>?sort_key=<sort key>`**
|
||||
|
||||
|
||||
trinity-1686a commented:
ReadItem reads a single triplet, so I don't think it's affected? Also, are other \*Batch affected? I assume no, but this should probably be made explicit.
|
||||
Query parameters:
|
||||
|
||||
| name | default value | meaning |
|
||||
| - | - | - |
|
||||
| `sort_key` | **mandatory** | The sort key of the item to read |
|
||||
|
||||
Returns the item with specified partition key and sort key. Values can be
|
||||
returned in either of two ways:
|
||||
|
||||
1. a JSON array of base64-encoded values, or `null`s for tombstones, with
|
||||
header `Content-Type: application/json`
|
||||
|
||||
2. in the case where there are no concurrent values, the single present value
|
||||
can be returned directly as the response body (or an HTTP 204 NO CONTENT for
|
||||
a tombstone), with header `Content-Type: application/octet-stream`
|
||||
|
||||
The choice between return formats 1 and 2 is directed by the `Accept` HTTP header:
|
||||
|
||||
- if the `Accept` header is not present, format 1 is always used
|
||||
|
||||
- if `Accept` contains `application/json` but not `application/octet-stream`,
|
||||
format 1 is always used
|
||||
|
||||
- if `Accept` contains `application/octet-stream` but not `application/json`,
|
||||
format 2 is used when there is a single value, and an HTTP error 409 (HTTP
|
||||
409 CONFLICT) is returned in the case of multiple concurrent values
|
||||
(including concurrent tombstones)
|
||||
|
||||
- if `Accept` contains both, format 2 is used when there is a single value, and
|
||||
format 1 is used as a fallback in case of concurrent values
|
||||
|
||||
- if `Accept` contains neither, HTTP 406 NOT ACCEPTABLE is returned
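
The rules above can be sketched as a small decision function. This is only an illustration: the name `choose_read_item_response` is hypothetical, the `Accept` parsing is simplified (parameters such as `;q=0.8` are dropped and wildcards are not handled), and it assumes the item exists with at least one value or tombstone.

```python
def choose_read_item_response(accept, values):
    """Return (status, content_type) for a ReadItem reply.

    `accept` is the raw Accept header or None if absent; `values` is the
    list of concurrent values, with None standing for a tombstone.
    """
    if accept is None:
        return 200, "application/json"
    # Simplification: strip parameters such as ";q=0.8", ignore wildcards.
    types = {part.split(";")[0].strip().lower() for part in accept.split(",")}
    wants_json = "application/json" in types
    wants_binary = "application/octet-stream" in types
    if not wants_json and not wants_binary:
        return 406, None
    if wants_binary:
        if len(values) == 1:
            # Format 2: raw body, or 204 when the single value is a tombstone.
            status = 204 if values[0] is None else 200
            return status, "application/octet-stream"
        if not wants_json:
            return 409, "application/octet-stream"
    # Format 1: JSON array, also the fallback for concurrent values.
    return 200, "application/json"
```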
|
||||
|
||||
Example query:
|
||||
|
||||
```
|
||||
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```json
|
||||
HTTP/1.1 200 OK
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
Content-Type: application/json
|
||||
|
||||
[
|
||||
"b64cryptoblob123",
|
||||
"b64cryptoblob'123"
|
||||
]
|
||||
```
|
||||
|
||||
Example response in case the item is a tombstone:
|
||||
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
X-Garage-Causality-Token: opaquetoken999
|
||||
Content-Type: application/json
|
||||
|
||||
[
|
||||
null
|
||||
]
|
||||
```
|
||||
|
||||
Example query 2:
|
||||
|
||||
```
|
||||
GET /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
|
||||
Accept: application/octet-stream
|
||||
```
|
||||
|
||||
Example response if multiple concurrent versions exist:
|
||||
|
||||
```
|
||||
HTTP/1.1 409 CONFLICT
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
Content-Type: application/octet-stream
|
||||
```
|
||||
|
||||
Example response in case of single value:
|
||||
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
Content-Type: application/octet-stream
|
||||
|
||||
cryptoblob123
|
||||
```
|
||||
|
||||
Example response in case of a single value that is a tombstone:
|
||||
|
||||
```
|
||||
HTTP/1.1 204 NO CONTENT
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
Content-Type: application/octet-stream
|
||||
```
|
||||
|
||||
**InsertItem: `PUT /<bucket>/<partition key>?sort_key=<sort_key>`**
|
||||
|
||||
Inserts a single item. This request does not use JSON: the body is sent directly as a binary blob.
|
||||
|
||||
To supersede previous values, the HTTP header `X-Garage-Causality-Token` should
be set to the causality token returned by a previous read on this key. This
header can be omitted for the first write to the key.
|
||||
|
||||
Example query:
|
||||
|
||||
```
|
||||
PUT /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
|
||||
myblobblahblahblah
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
```
|
||||
|
||||
**DeleteItem: `DELETE /<bucket>/<partition key>?sort_key=<sort_key>`**
|
||||
|
||||
Deletes a single item. The HTTP header `X-Garage-Causality-Token` must be set
to the causality token returned by a previous read on this key, to indicate
which versions of the value should be deleted. The request is not processed if
`X-Garage-Causality-Token` is not set.
|
||||
|
||||
Example query:
|
||||
|
||||
```
|
||||
DELETE /my_bucket/mailboxes?sort_key=INBOX HTTP/1.1
|
||||
X-Garage-Causality-Token: opaquetoken123
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```
|
||||
HTTP/1.1 204 NO CONTENT
|
||||
```
|
||||
|
||||
### Operations on index
|
||||
|
||||
trinity-1686a commented:
which **will**
|
||||
**ReadIndex: `GET /<bucket>?start=<start>&end=<end>&limit=<limit>`**
|
||||
|
||||
Lists all partition keys in the bucket for which some triplets exist, and gives
for each the number of triplets (or an approximation thereof: this value is
updated asynchronously and is thus eventually consistent).
|
||||
|
||||
Query parameters:
|
||||
|
||||
| name | default value | meaning |
|
||||
| - | - | - |
|
||||
| `start` | `null` | First partition key to list, in lexicographical order |
|
||||
| `end` | `null` | Last partition key to list (excluded) |
|
||||
| `limit` | `null` | Maximum number of partition keys to list |
|
||||
|
||||
The response consists of a JSON object that repeats the parameters of the query and gives the result (see below).
|
||||
|
||||
The listing starts at partition key `start`, or if not specified at the
|
||||
smallest partition key that exists. It returns partition keys in increasing
|
||||
order and stops when either of the following conditions is met:
|
||||
|
||||
1. if `end` is specified, the partition key `end` is reached or surpassed (if it
|
||||
is reached exactly, it is not included in the result)
|
||||
|
||||
2. if `limit` is specified, `limit` partition keys have been listed
|
||||
|
||||
3. no more partition keys are available to list
|
||||
|
||||
In case 2, if there are more partition keys to list before condition 1
triggers, then in the result `more` is set to `true` and `nextStart` is set to
the first partition key that couldn't be listed due to the limit. In case 1
(the listing stopped because of the `end` parameter), `more` is not set and the
`nextStart` key is not specified.
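
The stopping conditions can be modelled on an in-memory sorted list. The sketch below is an assumption-laden illustration, not Garage code: `list_range` is a hypothetical helper, and it relies on Python's string comparison (by code point), which agrees with the byte order of UTF-8 encoded keys.

```python
def list_range(sorted_keys, start=None, end=None, limit=None):
    """Return (listed, more, next_start) for an already sorted list of keys."""
    listed = []
    for key in sorted_keys:
        if start is not None and key < start:
            continue  # before the beginning of the requested range
        if end is not None and key >= end:
            break  # condition 1: `end` reached or surpassed (excluded)
        if limit is not None and len(listed) == limit:
            # condition 2, with at least one more key left in the range
            return listed, True, key
        listed.append(key)
    return listed, False, None  # stopped by condition 1 or 3


keys = ["keys", "mailbox:INBOX", "mailbox:Junk", "mailbox:Trash", "mailboxes"]
first_page = list_range(keys, limit=2)
second_page = list_range(keys, start=first_page[2], limit=2)
```

A client would paginate by passing the returned `next_start` as the `start` of the following request, as `second_page` does here.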
|
||||
|
||||
Example query:
|
||||
|
||||
```
|
||||
GET /my_bucket HTTP/1.1
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```json
|
||||
HTTP/1.1 200 OK
|
||||
|
||||
{
  "start": null,
  "end": null,
  "limit": null,
  "partition_keys": [
    [ "keys", 3043 ],
    [ "mailbox:INBOX", 42 ],
    [ "mailbox:Junk", 2991 ],
    [ "mailbox:Trash", 10 ],
    [ "mailboxes", 3 ]
  ],
  "more": false,
  "nextStart": null
}
|
||||
```
|
||||
|
||||
|
||||
### Operations on batches of items
|
||||
|
||||
**InsertBatch: `POST /<bucket>`**
|
||||
|
||||
Simple insertion and deletion of triplets. The body is a JSON list of items to
insert, each in the following format:
`[ "<partition key>", "<sort key>", "<causality token>"|null, "<value>"|null ]`.
|
||||
|
||||
The causality token should be the one returned in a previous read request (e.g.
by ReadItem or ReadBatch), to indicate that this write takes into account the
values that were returned from these reads, and supersedes them causally. If
the triplet is inserted for the first time, the causality token should be set to
`null`.
|
||||
|
||||
The value is expected to be a base64-encoded binary blob. The value `null` can
also be used to delete the triplet while preserving causality information: this
makes it possible to know whether a delete has happened concurrently with an
insert, in which case both are preserved and returned on reads (see below).
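
As an illustration of the item format, the following Python sketch builds an InsertBatch body from raw values. The helper name `insert_batch_body` is hypothetical; only the four-element item layout and the base64 encoding come from this specification.

```python
import base64
import json


def insert_batch_body(items):
    """Build an InsertBatch JSON body from (pk, sk, token, value) tuples.

    `token` is None for a first write; `value` is bytes, or None to delete.
    """
    rows = []
    for partition_key, sort_key, token, value in items:
        encoded = None if value is None else base64.b64encode(value).decode("ascii")
        rows.append([partition_key, sort_key, token, encoded])
    return json.dumps(rows)


body = insert_batch_body([
    ("mailbox:INBOX", "001892912", None, b"new message"),      # first insert
    ("mailbox:INBOX", "001892932", "opaquetoken654", None),    # deletion
])
```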
|
||||
|
||||
Partition keys and sort keys are utf8 strings which are stored sorted by
|
||||
lexicographical ordering of their binary representation.
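
One practical consequence of byte-wise lexicographical ordering is that numeric sort keys only sort numerically when they have a fixed width. The short Python sketch below illustrates this; `k2v_order` is a hypothetical helper that mimics the ordering described above, not a Garage function.

```python
def k2v_order(keys):
    """Sort keys by the bytes of their UTF-8 encoding."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))


plain = k2v_order(["9", "10", "100"])        # not in numeric order
padded = k2v_order(["009", "010", "100"])    # zero-padding restores numeric order
```

The fixed-width sort keys used in the examples of this document, such as `001892831`, are consistent with this.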
|
||||
|
||||
Example query:
|
||||
|
||||
```json
|
||||
POST /my_bucket HTTP/1.1
|
||||
|
||||
[
|
||||
[ "mailbox:INBOX", "001892831", "opaquetoken321", "b64cryptoblob321updated" ],
|
||||
[ "mailbox:INBOX", "001892912", null, "b64cryptoblob444" ],
|
||||
[ "mailbox:INBOX", "001892932", "opaquetoken654", null ]
|
||||
]
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
```
|
||||
|
||||
|
||||
**ReadBatch: `POST /<bucket>?search`**, or alternatively<br/>
|
||||
**ReadBatch: `SEARCH /<bucket>`**
|
||||
|
||||
Batch read of triplets in a bucket.
|
||||
|
||||
The request body is a JSON list of searches, each of which specifies a range of
items to get (to get a single item, set `single_item` to `true`). A search is a
JSON object with the following fields:
|
||||
|
||||
| name | default value | meaning |
|
||||
| - | - | - |
|
||||
| `partition_key` | **mandatory** | The partition key in which to search |
|
||||
| `start` | `null` | The sort key of the first item to read |
|
||||
| `end` | `null` | The sort key of the last item to read (excluded) |
|
||||
| `limit` | `null` | The maximum number of items to return |
|
||||
| `single_item` | `false` | Whether to return only the item with sort key `start` |
|
||||
| `conflicts_only` | `false` | Whether to return only items that have several concurrent values |
|
||||
| `tombstones` | `false` | Whether or not to return tombstone lines to indicate the presence of old deleted items |
|
||||
|
||||
|
||||
For each of the searches, triplets are listed and returned separately. The
semantics of `start`, `end` and `limit` are the same as for ReadIndex. The
additional parameter `single_item` makes it possible to get a single item, whose
sort key is the one given in `start`. Parameters `conflicts_only` and
`tombstones` control additional filters on the items that are returned.
|
||||
|
||||
The result is a list whose length is the number of searches. For each search,
it contains a JSON object specified similarly to the result of ReadIndex, but
listing the triplets within a partition key.
|
||||
|
||||
The format of returned tuples is as follows: `[ "<sort key>", "<causality
|
||||
token>", "<value1>", ...]`, with the following fields:
|
||||
|
||||
- sort key: any unicode string used as a sort key
|
||||
|
||||
- causality token: an opaque token served by the server (generally
|
||||
base64-encoded) to be used in subsequent writes to this key
|
||||
|
||||
- value: binary blob, always base64-encoded
|
||||
|
||||
- if several concurrent values exist, they are appended at the end
|
||||
|
||||
- in case of concurrent update and deletion, a `null` is added to the list of concurrent values
|
||||
|
||||
- if the `tombstones` field of the search is set to `true`, tombstones are
returned for items that have been deleted (this can be useful for inserting
after an item that has been deleted, so that the insert is not considered
concurrent with the delete). Tombstones are returned as tuples in the
same format with only `null` values
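
A client could unpack these tuples along the following lines. This is a sketch only: `parse_item`, `is_conflict` and `is_tombstone` are hypothetical helper names, and the tuple layout is the one described in the list above.

```python
import base64


def parse_item(row):
    """Split a ReadBatch item tuple into (sort_key, token, values)."""
    sort_key, token, *encoded = row
    # Values are base64 strings; a JSON null (None) stands for a deletion.
    values = [None if v is None else base64.b64decode(v) for v in encoded]
    return sort_key, token, values


def is_conflict(values):
    """More than one entry means that concurrent versions exist."""
    return len(values) > 1


def is_tombstone(values):
    """A non-empty list holding only None means the item was deleted."""
    return bool(values) and all(v is None for v in values)


blob = base64.b64encode(b"hello").decode("ascii")
sort_key, token, values = parse_item(["INBOX", "opaquetoken123", blob, None])
```

The returned token is what a subsequent write would send back in order to supersede every value listed in the tuple.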
|
||||
|
||||
Example query:
|
||||
|
||||
```json
|
||||
POST /my_bucket?search HTTP/1.1
|
||||
|
||||
[
  {
    "partition_key": "mailboxes"
  },
  {
    "partition_key": "mailbox:INBOX",
    "start": "001892831",
    "limit": 3
  },
  {
    "partition_key": "keys",
    "start": "0",
    "single_item": true
  }
]
|
||||
```
|
||||
|
||||
Example associated response body:
|
||||
|
||||
```json
|
||||
HTTP/1.1 200 OK
|
||||
|
||||
[
  {
    "partition_key": "mailboxes",
    "start": null,
    "end": null,
    "limit": null,
    "conflicts_only": false,
    "tombstones": false,
    "single_item": false,
    "items": [
      [ "INBOX", "opaquetoken123", "b64cryptoblob123", "b64cryptoblob'123" ],
      [ "Junk", "opaquetoken789", "b64cryptoblob789" ],
      [ "Trash", "opaquetoken456", "b64cryptoblob456" ]
    ],
    "more": false,
    "nextStart": null
  },
  {
    "partition_key": "mailbox:INBOX",
    "start": "001892831",
    "end": null,
    "limit": 3,
    "conflicts_only": false,
    "tombstones": false,
    "single_item": false,
    "items": [
      [ "001892831", "opaquetoken321", "b64cryptoblob321" ],
      [ "001892832", "opaquetoken654", "b64cryptoblob654" ],
      [ "001892874", "opaquetoken987", "b64cryptoblob987" ]
    ],
    "more": true,
    "nextStart": "001892898"
  },
  {
    "partition_key": "keys",
    "start": "0",
    "end": null,
    "conflicts_only": false,
    "tombstones": false,
    "limit": null,
    "single_item": true,
    "items": [
      [ "0", "opaquetoken999", "b64binarystuff999" ]
    ],
    "more": false,
    "nextStart": null
  }
]
|
||||
```
|
||||
|
||||
|
||||
|
||||
**DeleteBatch: `POST /<bucket>?delete`**
|
||||
|
||||
Batch deletion of triplets. The request format is the same as for `POST
/<bucket>?search`, to indicate items or ranges of items, except that here they
are deleted instead of returned, and only the fields `partition_key`, `start`,
`end`, and `single_item` are supported. Causality information is not given by
the user: this request internally lists all matching triplets and writes
deletion markers that supersede all of the versions that have been read.
|
||||
|
||||
This request returns, for each series of items to be deleted, the number of
matching items that have been found and deleted.
|
||||
|
||||
Example query:
|
||||
|
||||
```json
|
||||
POST /my_bucket?delete HTTP/1.1
|
||||
|
||||
[
  {
    "partition_key": "mailbox:OldMailbox"
  },
  {
    "partition_key": "mailbox:INBOX",
    "start": "0018928321",
    "single_item": true
  }
]
|
||||
```
|
||||
|
||||
Example response:
|
||||
|
||||
```
|
||||
HTTP/1.1 200 OK
|
||||
|
||||
[
  {
    "partition_key": "mailbox:OldMailbox",
    "start": null,
    "end": null,
    "single_item": false,
    "deleted_items": 35
  },
  {
    "partition_key": "mailbox:INBOX",
    "start": "0018928321",
    "end": null,
    "single_item": true,
    "deleted_items": 1
  }
]
|
||||
```
|
||||
|
||||
|
||||
## Internals: causality tokens
|
||||
|
||||
The method used is based on DVVS (dotted version vector sets). See:
|
||||
|
||||
- the paper "Scalable and Accurate Causality Tracking for Eventually Consistent Data Stores"
|
||||
- <https://github.com/ricardobcl/Dotted-Version-Vectors>
|
||||
|
||||
For DVVS to work, write operations (at each node) must take a lock on the data table.
|
are triples and triplets the same thing? If they are different we should clarify how, and if they are the same, we should use only one word to name them.