Jepsen testing (NLnet task 3 subtask 1) #544

Merged
lx merged 41 commits from jepsen into main 2024-01-11 10:52:13 +00:00
12 changed files with 96 additions and 31 deletions
Showing only changes of commit fa9247f11b

@@ -13,4 +13,5 @@ pom.xml.asc
.hg/ .hg/
.direnv .direnv
/store /store
/store.*
.vagrant .vagrant

@@ -7,29 +7,19 @@ Jepsen checking of Garage consistency properties.
Requirements: Requirements:
- vagrant - vagrant
- VirtualBox, configured so that nodes can take an IP in a private network `192.168.56.0/24` - VirtualBox, configured so that nodes can take an IP in a private network `192.168.56.0/24` (it's the default)
- a user that can create VirtualBox VMs - a user that can create VirtualBox VMs
- leiningen - leiningen
- gnuplot - gnuplot
Set up VMs: Set up VMs before running tests:
``` ```
vagrant up vagrant up
``` ```
Run tests (this one should fail): Run tests: see commands below.
```
lein run test --nodes-file nodes.vagrant --time-limit 64 --concurrency 50 --rate 50 --workload reg
```
These ones are working:
```
lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set1
lein run test --nodes-file nodes.vagrant --time-limit 64 --rate 50 --concurrency 50 --workload set2
```
## Results ## Results
@@ -73,16 +63,19 @@ Results with timestamp patch (`--patch tsfix2`):
Example of a failed run: `garage reg2/20231024T120806.899+0200`. Example of a failed run: `garage reg2/20231024T120806.899+0200`.
This is the failure mode we are looking for and trying to fix for NLnet task 3. This is the failure mode we are looking for and trying to fix for NLnet task 3.
- Changes brought by NLnet task 3 code (commit 707442f5de): Results with NLnet task 3 code (commit 707442f5de, `--patch task3a`):
no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
- No failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
`--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs). `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
- Same with `--patch task3c` (commit `0041b013`, the final version).
### Set, basic test (write some items, then read) ### Set, basic test (write some items, then read)
Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 200 --concurrency 200 --workload set1 --ops-per-key 100 --patch tsfix2` Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 200 --concurrency 200 --workload set1 --ops-per-key 100`
Results: Results without NLnet task3 code (`--patch tsfix2`):
- For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run - For now, no failures with clock-scramble nemesis + partition nemesis -> TODO long test run
@@ -90,15 +83,22 @@ Results:
- **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`). - **Fails with the partition + layout reconfiguration nemesis** (`--scenario pr`).
Example of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs). Example of a failed run: `garage set1/20231024T172214.488+0200` (1 failure in 4 runs).
TODO: investigate.
This is the failure mode we are looking for and trying to fix for NLnet task 3. This is the failure mode we are looking for and trying to fix for NLnet task 3.
Results with NLnet task 3 code (commit 707442f5de, `--patch task3a`):
- The tests are buggy and often result in an "unknown" validity status, which
is caused by some requests not returning results during network partitions or
other nemesis-induced broken cluster states. However, when the tests were
able to finish, there were no failures with scenarios `r`, `pr`, `cpr`,
`dpr`.
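
Since some runs end with an "unknown" validity status rather than a clear pass or fail, it can help to tally the verdicts of all stored runs. The following is a minimal sketch, not part of this PR. It assumes the usual Jepsen store layout (`store/<test name>/<timestamp>/results.edn`) and that the last `:valid?` entry in each file is the overall verdict; `tally_validity` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count how many stored runs ended as true / false / :unknown.
# Assumptions (not confirmed by this PR):
#   - each run writes <store>/<test name>/<timestamp>/results.edn
#   - the last ":valid?" entry in that file is the overall verdict
tally_validity() {
  store_dir="$1"
  find "$store_dir" -name results.edn | while IFS= read -r f; do
    # Keep only the value of the last ":valid?" key found in the file.
    grep -o ':valid? [^ ,}]*' "$f" | tail -n 1 | sed 's/^:valid? //'
  done | sort | uniq -c
}
```

Calling `tally_validity store` from the test directory would print one count per verdict. Runs that never produced a `results.edn` are not counted at all, so the total should be compared against the number of runs launched.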
### Set, continuous test (interspersed reads and writes) ### Set, continuous test (interspersed reads and writes)
Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100 --patch tsfix2` Command: `lein run test --nodes-file nodes.vagrant --time-limit 60 --rate 100 --concurrency 100 --workload set2 --ops-per-key 100`
Results: Results without NLnet task3 code (`--patch tsfix2`):
- No failures with clock-scramble nemesis + db nemesis + partition nemesis (`--scenario cdp`) (0 failures in 10 runs). - No failures with clock-scramble nemesis + db nemesis + partition nemesis (`--scenario cdp`) (0 failures in 10 runs).
@@ -106,17 +106,26 @@ Results:
Example of a failed run: `garage set2/20231025T141940.198+0200` (10 failures in 10 runs). Example of a failed run: `garage set2/20231025T141940.198+0200` (10 failures in 10 runs).
This is the failure mode we are looking for and trying to fix for NLnet task 3. This is the failure mode we are looking for and trying to fix for NLnet task 3.
- Changes brought by NLnet task 3 code (commit 707442f5de): Results with NLnet task3 code (commit 707442f5de, `--patch task3a`):
no failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs).
- No failures with `--scenario r` (0 of 10 runs), `--scenario pr` (0 of 10 runs),
`--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs). `--scenario cpr` (0 of 10 runs) and `--scenario dpr` (0 of 10 runs).
- Same with `--patch task3c` (commit `0041b013`, the final version).
## NLnet task 3 final results
- With code from task3 (`--patch task3c`): [reg2 and set2](results/Results-2023-12-13-task3c.png), [set1](results/Results-2023-12-14-task3-set1.png).
- Without (`--patch tsfix2`): [reg2 and set2](results/Results-2023-12-13-tsfix2.png), set1 TBD.
## Investigating (and fixing) errors ## Investigating (and fixing) errors
### Segfaults ### Segfaults
They are due to the download being interrupted in the middle (^C during first launch on clean VMs), the `garage` binary is truncated. They are due to the download being interrupted in the middle (^C during first launch on clean VMs), the `garage` binary is truncated.
Add `:force?` to the `cached-wget!` call in `daemon.clj` to re-download the binary. Add `:force?` to the `cached-wget!` call in `daemon.clj` to re-download the binary,
or restart the VMs to clear temporary files.
### In `jepsen.garage`: prefix weirdness ### In `jepsen.garage`: prefix weirdness

@@ -29,4 +29,12 @@ Vagrant.configure("2") do |config|
config.vm.define "n5" do |config| vm(config, "n5", "192.168.56.25") end config.vm.define "n5" do |config| vm(config, "n5", "192.168.56.25") end
config.vm.define "n6" do |config| vm(config, "n6", "192.168.56.26") end config.vm.define "n6" do |config| vm(config, "n6", "192.168.56.26") end
config.vm.define "n7" do |config| vm(config, "n7", "192.168.56.27") end config.vm.define "n7" do |config| vm(config, "n7", "192.168.56.27") end
config.vm.define "n8" do |config| vm(config, "n8", "192.168.56.28") end
config.vm.define "n9" do |config| vm(config, "n9", "192.168.56.29") end
config.vm.define "n10" do |config| vm(config, "n10", "192.168.56.30") end
config.vm.define "n11" do |config| vm(config, "n11", "192.168.56.31") end
config.vm.define "n12" do |config| vm(config, "n12", "192.168.56.32") end
config.vm.define "n13" do |config| vm(config, "n13", "192.168.56.33") end
config.vm.define "n14" do |config| vm(config, "n14", "192.168.56.34") end
end end

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
set -x
#for ppatch in task3c task3a tsfix2; do
for ppatch in tsfix2; do
#for psc in c cp cdp r pr cpr dpr; do
for psc in cdp r pr cpr dpr; do
#for ptsk in reg2 set1 set2; do
for ptsk in set1; do
for irun in $(seq 10); do
lein run test --nodes-file nodes.vagrant \
--time-limit 60 --rate 100 --concurrency 100 --ops-per-key 100 \
--workload $ptsk --patch $ppatch --scenario $psc
done
done
done
done
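
Before launching a long batch, it may be useful to see which commands the nested loops expand to. The sketch below is a dry-run variant of the script above and is not part of this PR: it prints each `lein` invocation instead of running it. The names `PATCHES`, `SCENARIOS`, `WORKLOADS`, `RUNS` and `print_matrix` are illustrative.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print every lein invocation of the test matrix
# instead of executing it. All variable and function names are illustrative.
PATCHES="tsfix2"
SCENARIOS="cdp r pr cpr dpr"
WORKLOADS="set1"
RUNS=10

print_matrix() {
  nodes_file="$1"
  for ppatch in $PATCHES; do
    for psc in $SCENARIOS; do
      for ptsk in $WORKLOADS; do
        for irun in $(seq "$RUNS"); do
          echo "lein run test --nodes-file $nodes_file" \
               "--time-limit 60 --rate 100 --concurrency 100 --ops-per-key 100" \
               "--workload $ptsk --patch $ppatch --scenario $psc"
        done
      done
    done
  done
}

# Count the commands that would be run against the first set of VMs.
print_matrix nodes.vagrant | wc -l
```

With the values shown, the count is 50 (1 patch × 5 scenarios × 1 workload × 10 runs). At a 60 second time limit per run, the batch takes at least 50 minutes before setup and recovery time.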

@@ -0,0 +1,16 @@
#!/usr/bin/env bash
set -x
#for ppatch in task3c tsfix2; do
for ppatch in tsfix2; do
for psc in cdp r pr cpr dpr; do
for ptsk in set1; do
for irun in $(seq 10); do
lein run test --nodes-file nodes2.vagrant \
--time-limit 60 --rate 100 --concurrency 100 --ops-per-key 100 \
--workload $ptsk --patch $ppatch --scenario $psc
done
done
done
done

@@ -0,0 +1,7 @@
192.168.56.28
192.168.56.29
192.168.56.30
192.168.56.31
192.168.56.32
192.168.56.33
192.168.56.34
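
The seven addresses above match the VMs `n8` to `n14` added to the Vagrantfile, where node `nK` gets the address `192.168.56.(20+K)`. As a minimal sketch, not part of this PR, such a nodes file could be generated rather than written by hand. `node_ips` is a hypothetical helper, and the output file name is presumably the `nodes2.vagrant` referenced by the second run script.

```shell
#!/usr/bin/env bash
# Sketch: print one IP per line for VMs n<first>..n<last>, assuming the
# Vagrantfile convention that node nK has address 192.168.56.(20+K).
node_ips() {
  first="$1"
  last="$2"
  for k in $(seq "$first" "$last"); do
    echo "192.168.56.$((20 + k))"
  done
}

# Addresses for n8..n14, i.e. 192.168.56.28 to 192.168.56.34.
node_ips 8 14
```

Redirecting the output (`node_ips 8 14 > nodes2.vagrant`) would reproduce the file shown in this diff.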

Binary files not shown: 4 images added (sizes: 1.4 MiB, 1 MiB, 1.1 MiB, 1 MiB).

@@ -36,7 +36,9 @@
{"default" "v0.9.0" {"default" "v0.9.0"
"tsfix1" "d146cdd5b66ca1d3ed65ce93ca42c6db22defc09" "tsfix1" "d146cdd5b66ca1d3ed65ce93ca42c6db22defc09"
"tsfix2" "c82d91c6bccf307186332b6c5c6fc0b128b1b2b1" "tsfix2" "c82d91c6bccf307186332b6c5c6fc0b128b1b2b1"
"task3a" "707442f5de416fdbed4681a33b739f0a787b7834"}) "task3a" "707442f5de416fdbed4681a33b739f0a787b7834"
"task3b" "431b28e0cfdc9cac6c649193cf602108a8b02997"
"task3c" "0041b013a473e3ae72f50209d8f79db75a72848b"})
(def cli-opts (def cli-opts
"Additional command line options." "Additional command line options."
@@ -69,7 +71,7 @@
(merge tests/noop-test (merge tests/noop-test
opts opts
{:pure-generators true {:pure-generators true
:name (str "garage " (name (:workload opts)) " " (name (:scenario opts)) " " (name (:patch opts))) :name (str "garage-" (name (:patch opts)) " " (name (:workload opts)) " " (name (:scenario opts)))
:os debian/os :os debian/os
:db db :db db
:client (:client workload) :client (:client workload)
@@ -83,7 +85,9 @@
(gen/nemesis (:final-generator scenario)) (gen/nemesis (:final-generator scenario))
(gen/log "Waiting for recovery") (gen/log "Waiting for recovery")
(gen/sleep 10) (gen/sleep 10)
(gen/clients (:final-generator workload))) (gen/log "Running final generator")
(gen/clients (:final-generator workload))
(gen/log "Generators all done"))
:nemesis (:nemesis scenario) :nemesis (:nemesis scenario)
:checker (checker/compose :checker (checker/compose
{:perf (checker/perf (:perf scenario)) {:perf (checker/perf (:perf scenario))

@@ -108,11 +108,13 @@
(->> (range) (->> (range)
(map (fn [x] {:type :invoke, :f :add, :value x})) (map (fn [x] {:type :invoke, :f :add, :value x}))
(gen/limit (:ops-per-key opts))))) (gen/limit (:ops-per-key opts)))))
:final-generator (gen/phases :final-generator (independent/concurrent-generator
(independent/sequential-generator 10
(range 100) (range 100)
(fn [k] (gen/once op-read))) (fn [k]
(gen/sleep 5))}) (gen/phases
(gen/once op-read)
(gen/sleep 5))))})
(defn workload2 (defn workload2
"Tests insertions and deletions" "Tests insertions and deletions"