Jepsen testing (NLnet task 3 subtask 1) #544

Merged
lx merged 41 commits from jepsen into main 2024-01-11 10:52:13 +00:00
5 changed files with 85 additions and 36 deletions
Showing only changes of commit d2c365767b

View file

@@ -35,55 +35,74 @@ lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrenc
 ### Register linear, without timestamp patch
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 20 --workload reg1 --ops-per-key 100`
-Results: fails with a simple clock-scramble nemesis.
-Explanation: without the timestamp patch, nodes will create objects using their
-local clock only as a timestamp, so the ordering will be all over the place if
-clocks are scrambled.
-### Register linear, with timestamp patch
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 20 --concurrency 20 --workload reg1 --ops-per-key 100 --patch tsfix1`
-Results:
+Results without timestamp patch:
+- Fails with a simple clock-scramble nemesis (`--scenario c`).
+Explanation: without the timestamp patch, nodes will create objects using their
+local clock only as a timestamp, so the ordering will be all over the place if
+clocks are scrambled.
+Results with timestamp patch (`--patch tsfix2`):
 - No failure with clock-scramble nemesis
-- Fails with clock-scramble nemesis + partition nemesis
-Explanation: S3 objects are not meant to behave like linearizable registers. TODO explain using a counter-example
-### Read-after-write CRDT register model, without timestamp patch
+- Fails with clock-scramble nemesis + partition nemesis (`--scenario cp`).
+**This test is expected to fail.**
+Indeed, S3 objects are not meant to behave like linearizable registers.
+TODO explain using a counter-example
+### Read-after-write CRDT register model
 Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100`
-Results: fails with a simple clock-scramble nemesis.
-Explanation: old values are not overwritten correctly when their timestamps are in the future.
-### Read-after-write CRDT register model, with timestamp patch (v2 with DeleteObject fix as well)
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload reg2 --ops-per-key 100 --patch tsfix2`
-Results:
-- No failures with clock-scramble nemesis + partition nemesis (TODO: test more and investigate).
-- Fails with layout reconfiguration nemesis (TODO: test more and investigate)
+Results without timestamp patch:
+- Fails with a simple clock-scramble nemesis (`--scenario c`).
+Explanation: old values are not overwritten correctly when their timestamps are in the future.
+Results with timestamp patch (`--patch tsfix2`):
+- No failures with clock-scramble nemesis + partition nemesis (`--scenario cp`).
+This proves that `tsfix2` (PR#543) does improve consistency.
+- **Fails with layout reconfiguration nemesis** (`--scenario r`)
+(TODO: note down the run id of a failed run)
+This is the failure mode we are looking for and trying to fix for NLnet task 3.
 ### Set, basic test (write some items, then read)
-Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100`
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set1 --ops-per-key 100 --patch tsfix2`
 Results:
-- For now, no failures with clock-scramble nemesis + partition nemesis
-- TODO: layout reconfiguration nemesis (does not fail yet! but it should)
+- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+- **Fails with partition + layout reconfiguration nemesis** (`--scenario pr`)
+(TODO: note down the run id of a failed run)
+(TODO: test more and investigate).
+This is the failure mode we are looking for and trying to fix for NLnet task 3.
 ### Set, continuous test (interspersed reads and writes)
-TODO
+Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100 --patch tsfix2`
+Results:
+- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
+- Failures were not yet achieved with only the layout reconfiguration nemesis, although they should be.
+- TODO: failures should be achieved with `--scenario pr`? Even with 4 or 5 consecutive test runs, no failures were achieved, why?
+(TODO: note down the run id of a failed run)
 ## Investigating (and fixing) errors
@@ -112,7 +131,7 @@ and passing all values that were previously in the context (creds and prefix) as
 The reg2 test is our custom checker for CRDT read-after-write on individual object keys, acting as registers which can be updated.
 The test fails without the timestamp fix, which is expected as the clock scrambler will prevent nodes from having a correct ordering of objects.
-With the timestamp fix, the happenned-before relationship should at least be respected, meaning that when a PutObject call starts
+With the timestamp fix (`--patch tsfix1`), the happenned-before relationship should at least be respected, meaning that when a PutObject call starts
 after another PutObject call has ended, the second call should overwrite the value of the first call, and that value should not be
 readable by future GetObject calls.
 However, we observed inconsistencies even with the timestamp fix.
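The ordering problem described in this README hunk can be illustrated with a toy last-writer-wins register. This is only a sketch under stated assumptions: the names `lww-merge`, `local-ts`, `causal-ts`, `write` and `run-writes` are hypothetical, and the rule "take the maximum of the local clock and the previous timestamp plus one" is an assumption about what a timestamp fix could look like, not a description of what `tsfix1` or `tsfix2` actually do.

```clojure
;; Toy last-writer-wins (LWW) register. All names are hypothetical and are
;; not taken from Garage or from the Jepsen test suite.

(defn lww-merge
  "Keep whichever version carries the larger timestamp."
  [a b]
  (if (> (:ts b) (:ts a)) b a))

(defn local-ts
  "Timestamp a write with the node's local clock only."
  [_current clock]
  clock)

(defn causal-ts
  "ASSUMPTION: timestamp a write with max(local clock, current ts + 1),
  so a write that follows another one always gets a larger timestamp."
  [current clock]
  (max clock (inc (:ts current))))

(defn write
  "Apply a write of `value` coming from a node whose clock reads `clock`."
  [ts-fn current value clock]
  (lww-merge current {:value value, :ts (ts-fn current clock)}))

(defn run-writes
  "Write :a from a node whose clock was scrambled into the future (1000),
  then, strictly afterwards, write :b from a node with a sane clock (10)."
  [ts-fn]
  (reduce (fn [current [value clock]] (write ts-fn current value clock))
          {:value nil, :ts 0}
          [[:a 1000] [:b 10]]))

(println (run-writes local-ts))   ; {:value :a, :ts 1000}  the later write is lost
(println (run-writes causal-ts))  ; {:value :b, :ts 1001}  the later write wins
```

With local clocks only, the second write is silently discarded because its timestamp is smaller; with the assumed monotonic rule, the write that happened later always overwrites the earlier one.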
@@ -121,7 +140,7 @@ The inconsistencies seemed to always happenned after writing a nil value, which
 instead of a PutObject. By removing the possibility of writing nil values, therefore only doing
 PutObject calls, the issue disappears. There is therefore an issue to fix in DeleteObject.
-The issue in DeleteObject seems to have been fixed by commit `c82d91c6bccf307186332b6c5c6fc0b128b1b2b1`
+The issue in DeleteObject seems to have been fixed by commit `c82d91c6bccf307186332b6c5c6fc0b128b1b2b1`, which can be used using `--patch tsfix2`.
 ## License
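The read-after-write property discussed in the README hunks above (a value overwritten by a completed later write must no longer be readable) can be sketched as a small check over a history of operations. This is not the `reg2` checker from the test suite: the history format, `overwritten-before?` and `stale-reads` are hypothetical, a delete is modelled as a write of `nil`, reads return a single value, and every write is assumed to carry a distinct value.

```clojure
;; Hypothetical history format: each op is a map with
;;   :f      :write or :read
;;   :value  the value written or read (nil stands for DeleteObject)
;;   :start  time at which the call started
;;   :end    time at which the call returned
;; ASSUMPTION: all written values are distinct.

(defn overwritten-before?
  "True if some write started after `w` ended and itself ended before time `t`."
  [writes w t]
  (some (fn [w2] (and (< (:end w) (:start w2))
                      (< (:end w2) t)))
        writes))

(defn stale-reads
  "Return the reads that observed a value which had already been overwritten
  by a completed write before the read started."
  [history]
  (let [writes (filter #(= :write (:f %)) history)
        reads  (filter #(= :read (:f %)) history)]
    (for [r reads
          w writes
          :when (and (= (:value r) (:value w))
                     (overwritten-before? writes w (:start r)))]
      {:read r, :stale-write w})))

(def bad-history
  [{:f :write, :value 1,   :start 0, :end 1}
   {:f :write, :value nil, :start 2, :end 3}   ; delete, after the first write ended
   {:f :read,  :value 1,   :start 4, :end 5}]) ; still sees the deleted value

(def good-history
  (assoc-in bad-history [2 :value] nil))       ; the read sees the delete

(println (count (stale-reads bad-history)))    ; 1
(println (count (stale-reads good-history)))   ; 0
```

The first history mirrors the situation reported above, where inconsistencies appeared after writing a nil value.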

View file

@@ -23,8 +23,10 @@
 (def scenari
   "A map of scenari to the associated nemesis"
-  {"cp" grgNemesis/scenario-cp
-   "r" grgNemesis/scenario-r})
+  {"c" grgNemesis/scenario-c
+   "cp" grgNemesis/scenario-cp
+   "r" grgNemesis/scenario-r
+   "pr" grgNemesis/scenario-pr})
 (def patches
   "A map of patch names to Garage builds"

View file

@@ -7,6 +7,8 @@
             [jepsen.garage.daemon :as grg]
             [jepsen.control.util :as cu]))
+; ---- reconfiguration nemesis ----
 (defn configure-present!
   "Configure node to be active in new cluster layout"
   [test node]
@@ -61,8 +63,18 @@
     (teardown! [this test] this)))
+; ---- nemesis scenari ----
+(defn scenario-c
+  "Clock scramble scenario"
+  [opts]
+  {:generator (cycle [(gen/sleep 5)
+                      {:type :info, :f :clock-scramble}])
+   :nemesis (nemesis/compose
+              {{:clock-scramble :scramble} (nemesis/clock-scrambler 20.0)})})
 (defn scenario-cp
-  "Clock scramble + parittion scenario"
+  "Clock scramble + partition scenario"
   [opts]
   {:generator (cycle [(gen/sleep 5)
                       {:type :info, :f :partition-start}
@@ -91,3 +103,23 @@
    :nemesis (nemesis/compose
               {{:reconfigure-start :start
                 :reconfigure-stop :stop} (reconfigure-subset 3)})})
+(defn scenario-pr
+  "Partition + cluster reconfiguration scenario"
+  [opts]
+  {:generator (cycle [(gen/sleep 3)
+                      {:type :info, :f :reconfigure-start}
+                      (gen/sleep 3)
+                      {:type :info, :f :partition-start}
+                      (gen/sleep 3)
+                      {:type :info, :f :reconfigure-start}
+                      (gen/sleep 3)
+                      {:type :info, :f :partition-stop}
+                      (gen/sleep 3)
+                      {:type :info, :f :reconfigure-stop}])
+   :final-generator (gen/once {:type :info, :f :partition-stop})
+   :nemesis (nemesis/compose
+              {{:partition-start :start
+                :partition-stop :stop} (nemesis/partition-random-halves)
+               {:reconfigure-start :start
+                :reconfigure-stop :stop} (reconfigure-subset 3)})})
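To make the schedule produced by `scenario-pr` easier to read, the following sketch expands one pass of its generator cycle into a list of time offsets. It uses plain Clojure data rather than the Jepsen generator API: `pr-cycle` is a hand transcription of the vector in the diff above, `timeline` is a hypothetical helper, and the time taken by the nemesis operations themselves is ignored (a simplifying assumption).

```clojure
;; One pass through the scenario-pr cycle, transcribed by hand:
;; numbers are sleeps in seconds, keywords are the :f of each nemesis op.
(def pr-cycle
  [3 :reconfigure-start
   3 :partition-start
   3 :reconfigure-start
   3 :partition-stop
   3 :reconfigure-stop])

(defn timeline
  "Return [[t f] ...], the offset in seconds at which each op is emitted.
  ASSUMPTION: ops take no time; only the sleeps advance the clock."
  [steps]
  (second
    (reduce (fn [[t events] step]
              (if (number? step)
                [(+ t step) events]             ; a sleep advances time
                [t (conj events [t step])]))    ; an op is recorded at time t
            [0 []]
            steps)))

(doseq [[t f] (timeline pr-cycle)]
  (println (str t "s") f))
;; 3s :reconfigure-start
;; 6s :partition-start
;; 9s :reconfigure-start
;; 12s :partition-stop
;; 15s :reconfigure-stop
```

Under these assumptions one cycle lasts 15 seconds and issues `:reconfigure-start` twice, once before and once during the partition, so a run with `--time-limit 60` would go through roughly four cycles.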

View file

@@ -39,12 +39,10 @@
         new-object-summaries (:object-summaries list-result)
         new-objects (map (fn [d] (:key d)) new-object-summaries)
         objects (concat new-objects accum)]
-    (info (:endpoint creds) "ListObjectsV2 prefix(" prefix "), ct(" ct "): " new-objects)
     (if (:truncated? list-result)
       (list-inner creds prefix (:next-continuation-token list-result) objects)
       objects)))
 (defn list
   "Helper for ListObjects -- just lists everything in the bucket"
   [creds prefix]
-  (info "in s3/list creds:" creds ", prefix:" prefix)
   (list-inner creds prefix nil []))

View file

@@ -45,9 +45,7 @@
   10000
   (assoc op :type :fail, :error ::timeout)
   (do
-    (info "call s3/list creds: " (:creds this) ", prefix:" prefix)
     (let [items (s3/list (:creds this) prefix)]
-      (info "list results for prefix" prefix ":" items " (node:" (:endpoint (:creds this)) ")")
       (let [items-stripped (map (fn [o]
                                   (assert (str/starts-with? o prefix))
                                   (str/replace-first o prefix "")) items)
@@ -115,8 +113,8 @@
   {:client (SetClient. nil)
    :checker (independent/checker
               (checker/compose
-                {:set-full (checker/set-full {:linearizable? false})
-                 :set-read-after-write (set-read-after-write)
+                {:set-read-after-write (set-read-after-write)
+                 ; :set-full (checker/set-full {:linearizable? false})
                  :timeline (timeline/html)}))
    :generator (independent/concurrent-generator
                 10