2022-05-10 11:16:58 +00:00
1 changed files with 25 additions and 4 deletions
--- a/doc/drafts/k2v-spec.md
+++ b/doc/drafts/k2v-spec.md
@ -104,12 +104,12 @@ For instance, here is a possible sequence of events:
 (node2, tdiscard2', (v4, t4))       ; tdiscard2' = t3
 ```

-**Generic algorithm for handling insertions:** A certain node i handles the InsertItem and is responsible for the correctness of this procedure.
+**Generic algorithm for handling insertions:** A certain node n handles the InsertItem and is responsible for the correctness of this procedure.

 1. Lock the key (or the whole table?) at this node to prevent concurrent updates of the value that would mess things up
 2. Read current set of values
-3. Generate a new timestamp that is larger than the largest timestamp for node i
-4. Add the inserted value in the list of values of node i
+3. Generate a new timestamp that is larger than the largest timestamp for node n
+4. Add the inserted value in the list of values of node n
 5. Update the discard times to be the times set in the context, and accordingly discard overwritten values
 6. Release lock
 7. Propagate updated value to other nodes
@ -136,7 +136,28 @@ that keeps tracks of the number of triples stored for each partition key.
 This allows easy listing of all of the partition keys for which triples exist
 in a bucket, as the partition key becomes the sort key in the index.

-TODO: writeup asynchronous counting strategy
+How indexing works:
+
+- Each node keeps a local count of how many items it stores for each partition,
+  in a local Sled tree that is updated atomically when an item is modified.
+- These local counters are asynchronously stored in the index table which is
+  a regular Garage table spread in the network. Counters are stored as LWW values,
+  so basically the final table will have the following structure:
+
+```
+- pk: bucket
+- sk: partition key for which we are counting
+- v:  lwwmap (node id -> number of items)
+```
+
+The final number of items present in the partition can be estimated by taking
+the maximum of the values (i.e. the value for the node that announces having
+the most items for that partition). In most cases the values for different node
+IDs should all be the same; more precisely, three node IDs should map to the
+same non-zero value, and all other node IDs that are present are tombstones
+that map to zeroes.  Note that we need to filter out values from nodes that are
+no longer part of the cluster layout, as when nodes are removed they won't
+necessarily have had the time to set their counters to zero.

 ## API Endpoints