From ae3d6c9e8407db04f9c9a87f6364d601c8c50437 Mon Sep 17 00:00:00 2001 From: Alex Auvolat Date: Mon, 11 Apr 2022 14:36:28 +0200 Subject: [PATCH] Specify stuff about causality tokens (aka contexts) --- doc/drafts/k2v-spec.md | 116 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 115 insertions(+), 1 deletion(-) diff --git a/doc/drafts/k2v-spec.md b/doc/drafts/k2v-spec.md index 7209ea68..0f7aea0c 100644 --- a/doc/drafts/k2v-spec.md +++ b/doc/drafts/k2v-spec.md @@ -1,4 +1,3 @@ -² # Specification of the Garage K2V API (K2V = Key/Key/Value) - We are storing triplets of the form `(partition key, sort key, value)` -> no @@ -23,6 +22,121 @@ well as [here](https://github.com/ricardobcl/Dotted-Version-Vectors) +## Data format + +### Triple format + +Triples in K2V are constituted of three fields: + +- a partition key (`pk`), an utf8 string that defines in what partition the triple is + stored; triples in different partitions cannot be listed together, they must + be the object of different ReadItem or ReadBatch queries + +- a sort key (`sk`), an utf8 string that defines the index of the triple inside its + partition; triples are uniquely idendified by their partition key + sort key + +- a value (`v`), an opaque binary blob associated to the partition key + sort key; + they are transmitted as binary when possible but in most case in the JSON API + they will be represented as strings using base64 encoding; a value can also + be `null` to indicate a deleted triple (a `null` value is called a tombstone) + +### Causality information + +K2V supports storing several concurrent values associated to a pk+sk, in the +case where insertion or deletion operations are detected to be concurrent (i.e. +there is not one that was aware of the other, they are not causally dependant +one on the other). In practice, it even looks more like the opposite: to +overwrite a previously existing value, the client must give a "causality token" +that "proves" (not in a cryptographic sense) that it had seen a previous value. +Otherwise, the value written will not overwrite an existing value, it will just +create a new concurrent value. + +The causality token is a binary/b64-encoded representation of a context, +specified below. + +A set of concurrent values looks like this: + +``` +(node1, tdiscard1, (v1, t1), (v2, t2)) ; tdiscard1 < t1 < t2 +(node2, tdiscard2, (v3, t3) ; tdiscard2 < t3 +``` + +`tdiscard` for a node `i` means that all values inserted by node `i` with times +`<= tdiscard` are obsoleted, i.e. have been read by a client that overwrote it +afterwards. + +The associated context would be the following: `[(node1, t2), (node2, t3)]`, +i.e. if a node reads this set of values and inserts a new values, we will now +have `tdiscard1 = t2` and `tdiscard2 = t3`, to indicate that values v1, v2 and v3 +are obsoleted by the new write. + +**Basic insertion.** To insert a new value `v4` with context `[(node1, t2), (node2, t3)]`, in a +simple case where there was no insertion in-between reading the value +mentionned above and writing `v4`, and supposing that node2 receives the +InsertItem query: + +- `node2` generates a timestamp `t4` such that `t4 > t3`. +- the new state is as follows: + +``` +(node1, tdiscard1', ()) ; tdiscard1' = t2 +(node2, tdiscard2', (v4, t4)) ; tdiscard2' = t3 +``` + +**A more complex insertion example.** In the general case, other intermediate values could have +been written before `v4` with context `[(node1, t2), (node2, t3)]` is sent to the system. +For instance, here is a possible sequence of events: + +1. First we have the set of values v1, v2 and v3 described above. + A node reads it, it obtains values v1, v2 and v3 with context `[(node1, t2), (node2, t3)]`. + +2. A node writes a value `v5` with context `[(node1, t1)]`, i.e. `v5` is only a successor of v1 but not of v2 or v3. Suppose node1 receives the write, it will generate a new timestamp `t5` larger than all of the timestamps it knows of, i.e. `t5 > t2`. We will now have: + +``` +(node1, tdiscard1'', (v2, t2), (v5, t5)) ; tdiscard1'' = t1 < t2 < t5 +(node2, tdiscard2, (v3, t3) ; tdiscard2 < t3 +``` + +3. Now `v4` is written with context `[(node1, t2), (node2, t3)]`, and node2 processes the query. It will generate `t4 > t3` and the state will become: + +``` +(node1, tdiscard1', (v5, t5)) ; tdiscard1' = t2 < t5 +(node2, tdiscard2', (v4, t4)) ; tdiscard2' = t3 +``` + +**Generic algorithm for handling insertions:** A certain node i handles the InsertItem and is responsible for the correctness of this procedure. + +1. Lock the key (or the whole table?) at this node to prevent concurrent updates of the value that would mess things up +2. Read current set of values +3. Generate a new timestamp that is larger than the largest timestamp for node i +4. Add the inserted value in the list of values of node i +5. Update the discard times to be the times set in the context, and accordingly discard overwritten values +6. Release lock +7. Propagate updated value to other nodes +8. Return to user when propagation achieved the write quorum (propagation to other nodes continues asynchronously) + +**Encoding of contexts:** + +Contexts consist in a list of (node id, timestamp) pairs. +They are encoded in binary as follows: + +``` +checksum: u64, [ node: u64, timestamp: u64 ]* +``` + +The checksum is just the XOR of all of the node IDs and timestamps. + +Once encoded in binary, contexts are written and transmitted in base64. + + +### Indexing + +K2V keeps an index, a secondary data structure that is updated asynchronously, +that keeps tracks of the number of triples stored for each partition key. +This allows easy listing of all of the partition keys for which triples exist +in a bucket, as the partition key becomes the sort key in the index. + +TODO: writeup asynchronous counting strategy ## API Endpoints