Performance collapses with 10 million pictures in a bucket #851

Open
opened 2024-08-12 10:18:48 +00:00 by quentin · 16 comments
Owner

Issue

Once ~4 million objects are uploaded, PUT requests take more than 10 seconds even when the cluster is idle, even on empty buckets, even for 2KB files. GET requests do not seem to be impacted (they all complete in less than a second).

Expected

PUT requests should take less than a second to execute, and probably less than 200ms considering the cluster spec. In [our performance benchmark](https://garagehq.deuxfleurs.fr/blog/2022-perf/), we were able to do ~100 PUT/second without any issue.

Cluster spec

ZFS is used to store both metadata and data.
LMDB is used for the metadata engine.
Garage runs in a VM on a powerful server; it has 4 dedicated physical CPUs (not vCPUs) assigned and 8GB of RAM.
The configuration has been reviewed and is standard (fsync disabled, 1MB chunks).
There are no repair tasks running; everything is correctly synchronized.
The team running the cluster is knowledgeable and currently uses Riak KV to store these 10M objects without any issue.

While debugging, we noted that the metadata database reached 50GB (for 80GB of data chunks), i.e. around 7KB per entry; we are not sure whether that is intended.

Workload spec

~11 million pictures of ~1MB each.
The migration is done by sending batches of 4 pictures in parallel.
At the beginning, a batch finishes in a few seconds.
After a while, it takes more than 60 seconds and hits an internal timeout.
We then ran manual tests on the idle cluster, sending various small files that all took ~10 seconds to upload.

Investigations so far

We extracted a trace for a `PUT` and a `GET` request. For the `PUT` request, the performance hit is due to a slow `try_write_many_sets` that spends a long time waiting for the other nodes of the cluster. The corresponding call for `GET` is fast, however. We can't investigate further currently, as we are "crossing a network boundary" and we have not implemented "distributed tracing" (mainly sending the trace/span id over the network).
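
For context, "sending the trace/span id over the network" essentially means attaching the caller's trace context to each RPC and restoring it on the receiving node. A purely illustrative sketch with hypothetical types (this is not Garage's actual RPC/netapp API):

// Hypothetical RPC envelope: the only addition needed for distributed tracing
// is to carry the caller's trace/span identifiers next to the payload.
#[allow(dead_code)]
struct RpcEnvelope {
    trace_id: [u8; 16], // trace id propagated across nodes
    span_id: [u8; 8],   // parent span on the calling node
    payload: Vec<u8>,   // the serialized RPC message itself
}

fn main() {
    // The caller copies its current trace context into the envelope...
    let req = RpcEnvelope { trace_id: [0; 16], span_id: [0; 8], payload: Vec::new() };
    // ...and the receiver starts a child span from (trace_id, span_id), so both
    // halves of a slow call like try_write_many_sets show up in a single trace.
    let _ = req;
}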

Based on this observation, we suspect that the metadata engine is slow. Indeed, the network and the cluster are idle, so there is no bottleneck or bufferbloat/queuing issue here, and GET requests have no issue. Furthermore, the delays are observed for metadata writes (writing an entry in the bucket list, a new object version, etc.) and are thus not limited to the block manager.

Today, we have no OpenTelemetry metrics to measure the responsiveness of the metadata engine, the length of our network queues, the amount of data written, sent and received, etc.

We were also not able to explore the LMDB database to see if we have some "stale" entries or things like that.

Investigations to conduct

  • Collect new OpenTelemetry metrics
    • On the metadata engine, especially
      • Number of read ops
      • Number of write ops
      • Distribution of time taken by read ops
      • Distribution of time taken by write ops
    • Also on the netdata side, especially
      • Number of packets sent/received
      • Size of packets sent/received
      • Queue length
      • Some info about our custom QoS logic
    • Ideally on the block manager side
      • Number of block reads
      • Number of blocks written (or deleted)
      • Distribution of time taken to read a block
      • Distribution of time taken to write a block
  • Allow the operator to explore the metadata engine of the current node
    • through the current `garage` command
    • read-only, to avoid cluster corruption
    • for debug purposes
  • Use Brendan Gregg's tools (strace/ptrace/whatever) to observe reads/writes on the metadata file (see https://www.brendangregg.com/linuxperf.html)
  • Use my [s3billion script](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) to try to reproduce (or not) this issue, with the following configuration (a sketch follows this list):
    • `OBJ_SIZE=1048576` - 1MB objects
    • `BATCH_SIZE=16` - BATCH_SIZE is multiplied by THREAD; each thread sends BATCH_SIZE PUT requests before reporting/synchronizing
    • `THREAD=64` - should really be called "goroutines": the number of parallel requests
    • `BATCH_COUNT=10000` - we report every ~1k objects; to send 10M objects, we need 10k iterations
    • `SSL`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `SSL_INSECURE`, `ENDPOINT` must/might be configured
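
As a concrete starting point, here is a hedged sketch of such a run from a shell. The variable names are the ones listed above; the endpoint, credentials and the way the script is launched are placeholders, not the script's documented interface:

# Hypothetical environment for an s3billion run; adjust endpoint/credentials.
export OBJ_SIZE=1048576        # 1MB objects
export BATCH_SIZE=16           # PUT requests per thread between two reports
export THREAD=64               # number of parallel goroutines
export BATCH_COUNT=10000       # 64*16*10000 ≈ 10M objects in total
export ENDPOINT=s3.example.com:3900
export AWS_REGION=garage
export AWS_ACCESS_KEY_ID=GK_placeholder
export AWS_SECRET_ACCESS_KEY=placeholder
export SSL=false               # plus SSL_INSECURE=true for self-signed certs
./s3billion
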
quentin added the scope/s3-api, kind/performance, action/more-info-needed labels 2024-08-12 10:18:48 +00:00
quentin self-assigned this 2024-08-12 10:18:48 +00:00
Author
Owner

Summary of my tests on Grid 5k

On August 15th, I ran some tests on the [grappe cluster](https://www.grid5000.fr/w/Nancy:Hardware#grappe) of the [Grid5000](https://www.grid5000.fr/w/Grid5000:Home) project.

Test setup

warp put \
  --host=grappe-4-ipv6.nancy.grid5000.fr:3900,grappe-14-ipv6.nancy.grid5000.fr:3900,grappe-16-ipv6.nancy.grid5000.fr:3900 \
  --host-select=roundrobin \
  --obj.size=1KB \
  --duration=1h \
  --concurrent=256 \
  --bucket=xxx \
  --noclear

Relevant observations

(screenshots attached: full dashboard, over a different time span)



Docker compose + prometheus conf to explore data yourself

compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - 9090:9090
    volumes:
      - ./prometheus.yml:/prometheus.yml
      - ./prom-snap:/prometheus
    command:
      # extracted from dockerfile
      - "--config.file=/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      # disable time-based data deletion
      - "--storage.tsdb.retention.time=20y"
      - "--storage.tsdb.retention.size=20GB"
  grafana:
    image: grafana/grafana:latest
    ports:
      - 3000:3000
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 1y
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 15s
runtime:
  gogc: 75
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    http_headers: null
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets: []
scrape_configs: []

Analysis

Garage ran in 2 different regimes:

  • First a stable regime, where performance is steady and all nodes behave the same;
  • Then an oscillating regime, where one of the three nodes performs slower than the others and many components show oscillating performance.

The oscillations could be due to how the benchmark tool (`warp`) is implemented.

Author
Owner

Summary of my tests on a Scaleway DEV1-S cluster

Servers have 2 vCPUs, 2GB of RAM and 50GB of SAN SSD storage. They run Ubuntu 24.04 with Garage on top.
Scaleway has been unreliable, so my benchmarking node was shut down numerous times (but not my Garage cluster): that is why the data has holes.

The test command is:

warp put \
  --host=g1:3900,g2:3900,g3:3900 \
  --host-select=roundrobin \
  --obj.size=1KB \
  --duration=4h \
  --concurrent=256 \
  --bucket=xxx \
  --noclear

Reproducing the performance collapse

I analyzed a run going from 0 to 1.5 million objects:

Here too, performance collapses as soon as the LMDB database is larger than the RAM.

Let's zoom in on the part where we go from 1.5M objects to 1.6M objects:

We go from ~1k req/sec to 40 req/sec (25 times slower). However, we don't see oscillations like on Grid5000.

Disabling read-ahead improves performance by at least 10x

It appears that read-ahead is enabled by default on Linux: when fetching a page from disk, Linux also fetches the next 15. This is meant to improve sequential reads on hard drives, but we have neither a hard drive nor sequential reads here, so it mostly hurts our performance.

LMDB has an option to disable read-ahead on Linux.
It is named `MdbNoRdAhead` in heed, the Rust wrapper we use for LMDB; I made [a simple commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/1ebaf7aa17672bdc6e83e6c04c4c13e142f57629) to enable it.
Incidentally, enabling this option when your database is larger than your memory is recommended [by the LMDB author](https://github.com/bmatsuo/lmdb-go/issues/118#issuecomment-325449496).
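
For illustration, here is a minimal sketch of what enabling the flag can look like with heed. The exact builder API (method name, whether it is `unsafe`, the flag enum path) depends on the heed version, so treat this as an assumption rather than the exact code of the commit:

use heed::{flags::Flags, EnvOpenOptions};

fn open_meta_env(path: &std::path::Path) -> Result<heed::Env, heed::Error> {
    let mut builder = EnvOpenOptions::new();
    builder.max_dbs(100).map_size(1024 * 1024 * 1024 * 1024); // large virtual map
    // MdbNoRdAhead corresponds to LMDB's MDB_NORDAHEAD: do not ask the kernel
    // to read ahead when pages of the memory map are faulted in.
    unsafe {
        builder.flag(Flags::MdbNoRdAhead);
    }
    builder.open(path)
}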

So I did the same test on the same hardware but with this flag activated:

(screenshot: 20240817_20h33m06s_grim.png)

It started at the same speed, but once we reached the threshold of the RAM size, performance did not collapse the way it did in the previous run. I was able to push Garage to 10M objects without any major issue. At 10M objects, the LMDB database takes ~33GB on the filesystem.

Let's zoom in around 1.5 million objects to see how it behaves:

(screenshot: 20240817_20h37m35s_grim.png)

We are putting objects at ~700 req/sec. Compared to the previous run at 40 req/sec, that is ~17x faster. And if we look around 10M objects, we are still at ~400 req/sec, in other words ~10x faster.

(screenshot: 20240817_20h33m42s_grim.png)

Discussion

About mmap. There are ongoing discussions about whether mmap is a good idea for a database. There is the old debate: [1975 programming](http://varnish-cache.org/docs/trunk/phk/notes.html) vs [2006 programming](http://oldblog.antirez.com/post/what-is-wrong-with-2006-programming.html). More recently, scholars have published [a paper](https://db.cs.cmu.edu/mmap-cidr2022/) in which they heavily discourage DBMS implementers from using mmap. With these benchmarks, we have learnt that LMDB has really "2 regimes": one when the whole database fits in RAM and one when it does not fit anymore; in the latter case, it is very sensitive to its environment. [Like Riak](https://riak.com/assets/bitcask-intro.pdf), Garage abstracts the metadata engine behind an interface: it is not hard to implement other engines to see how they perform. Trying an LSM-tree based engine, such as [RocksDB](https://rocksdb.org/) or the Rust-native [fjall](https://github.com/fjall-rs/fjall), could be interesting. See also [a more exhaustive list of possible metadata engines](https://github.com/marvin-j97/rust-storage-bench).

About ZFS. This whole discussion started with ZFS. While our benchmarks showed suboptimal performance, none of them were as abysmal as what the people reporting the initial issue observed. Digging around on the Internet, it appears that ZFS uses a 128KiB record size (i.e. its page size?) by default, and that it is known to perform poorly with LMDB, as reported by a netizen [trying to use Monero over ZFS](https://github.com/openzfs/zfs/issues/7543) (Monero stores its blockchain in LMDB). Checking the ZFS record size and setting it to 4096 (4K, to match LMDB's page size) apparently increases performance drastically.
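
For anyone who wants to check this on their own cluster, a hedged example with a placeholder dataset name (note that recordsize only applies to files written after the change, so an existing LMDB file would have to be copied or recreated to benefit):

# "tank/garage-meta" is a placeholder dataset holding the metadata directory.
zfs get recordsize tank/garage-meta      # defaults to 128K
zfs set recordsize=4K tank/garage-meta   # match LMDB's 4KiB page size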

About Merkle Todo. It appears that on some nodes, the "Merkle todo" queue grows without bound. Garage has a worker that asynchronously builds a Merkle tree, which is then used for healing. New items to insert in this Merkle tree are put in a "todo queue" that is also persisted in the metadata engine. In the end, it's not a huge issue that this queue grows: it only means that the healing mechanism will not be aware of the most recent items for a while (slightly delaying the repair logic). However, since the queue is also stored in LMDB, we may in some way "double" the size of the LMDB database (one entry in the object table plus one entry in the Merkle queue). Also, having a queue in LMDB requires many writes & removes, which are intensive. Maybe LMDB is not the best tool to manage this queue? We might also want to slow down RPCs if we are not able to process this queue fast enough, to provide some backpressure in Garage (and similarly, if too many RPCs accumulate on a node, slow down the API responses and/or return an "overloaded" error).

Explore the data

All the collected data is made available as a Prometheus snapshot (see `prom.tgz` below).

Contributor

Retest on a cluster w/ data blocks (objects beyond 3KiB)

Very nice catch on the read-ahead flag, thank you for your commit! To further investigate the issue, we re-ran a benchmark of our own, using larger and random object sizes to ensure block refs were created and blocks were stored.

  • To match the Scaleway test and leave out any filesystem-related side effects, we're running on ext4 for both meta and data
  • The servers are VMs running on various CPUs from 2.6 to 2.9GHz, with 8GB RAM and 500GB NVMe per node
  • Multiple factors related to how our drives are set up are expected to slow down writes, but this should remain acceptable (and should be more or less constant)
  • Our Garage build is based off the main branch after !855, into which we then merged !853 to enable LMDB metrics; the work suggested in !860 was not included and the `send_all_at_once` read strategy was not used
  • Object sizes randomly go up to 35KiB (the median size in our use case, beyond Garage's inlining threshold)

warp command line

warp put \
    --host=node1:3900,node2:3900,node3:3900 \
    --host-select=roundrobin \
    --obj.size=35KB \
    --obj.randsize \
    --duration=4h \
    --concurrent=256 \
    --bucket=perftest \
    --noclear

garage.toml

metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
db_engine = "lmdb-with-metrics"

replication_factor = 3
consistency_mode = "consistent"
compression_level = "none"

rpc_bind_addr = "[::]:3901"
rpc_public_addr = "<node IP>:3901"
rpc_secret = "<secret>"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"
root_domain = ".s3.garage.localhost"

[k2v_api]
api_bind_addr = "[::]:3904"

[admin]
api_bind_addr = "[::]:3903"
admin_token = "<secret>"

Cluster observability

That initial 4-hour-long benchmark yielded mixed results, as shown in the PutObject rate graphs below:

(screenshots attached)

There appear to be roughly 3 phases:

  • An initial burst at 800#/s, followed by a steady slowdown to 120#/s
  • An acceleration up to ~375#/s
  • A steady slowdown which suggests an eventual stabilisation between 75 and 100#/s

It should also be noted that the database ended up at ~6.3GiB (below RAM size) so the impact of read-ahead must have been minimal. The RAM is also dedicated to the VMs so nothing external is forcing pages out of memory.

The speed pattern matches what is observed on RPCs and table operations:

(screenshots attached)

The LMDB metrics also follow the pattern:

(screenshot attached; more LMDB heat maps in the attachments)

Looking at cluster synchronisation, we observe that the block sync buffers (`block_ram_buffer_free_kb`) do not get overwhelmed:

(screenshot attached)

Similarly, no node is lagging behind on Merkle tree updates, although node 3 (2nd in 1st AZ) exhibits higher values:

(screenshot attached)

However, the block resync queue (`block_resync_queue_length`) goes absolutely crazy on that 3rd node:

(screenshot attached)

System observability

We are also able to provide some system-level stats. First looking at load, we can definitely see that node 3 is struggling more than the rest, although all of them remain within their CPU capacity:

(screenshot attached)

Looking at cache usage, a strange behaviour comes up: while nodes 1 and 2 keep a steady cache usage, pages on node 3 are regularly being released:

(screenshot attached)

That memory does not appear to be claimed for use though, meaning those drops are unlikely to be forced evictions:

(screenshot attached)

Finally, network-wise, all nodes exhibit similar behaviours, communicating far below link capacity (I may have got the units wrong, but in any case, there's plenty of bandwidth left):

(screenshot attached)

Discussion

While the read-ahead flag definitely delivers improvements when the database size exceeds the RAM, there appear to be other factors which, in our setup, hinder Garage's performance. Where the Scaleway cluster stabilised at around 400#/s, we were only able to reach (a much less stable) 75#/s.

Regarding the impact of the object count, we just restarted another benchmark today, starting off where the first one had left off. The idea was to figure out whether the slowdown was related to cluster size (in which case the new benchmark should start at around 75#/s) or to load/work backlog (in which case the new benchmark should start over from about 800#/s now that the cluster has settled). The first results suggest the former.

(screenshot attached)

➡️ Cluster size appears to still play a part in the performance drop, even below RAM capacity.

Regarding node 3, those results also confirm what we initially saw on our test cluster: the 2nd node in the 1st availability zone appears to do a lot more work than the rest of the nodes, causing it to lag behind at times. This has a cluster-wide impact since this node is still expected to participate in quorums.

➡️ Quick tests were made using `send_all_at_once`, but this only improves reads, which overall play a minor role in a PutObject call - node 3 may still be delaying write operations.

Regarding RAM cache, those releases on node 3 are definitely odd. They suggest that the node may be behaving more erratically than the rest and moving more things in and out of memory.

Similarly, regarding the block resync queue, we are still unable to explain why node 3 accumulates so much work (although this may just be another symptom).

➡️ Overall, it seems important to understand what makes node 3 so special, given that, system-wise, it does not appear to be particularly different from the rest of the cluster.


We will continue our investigations, but would definitely appreciate any feedback you may have on those results! If something worth testing comes to mind, also feel free to suggest it so we may run it on this cluster. While we can't share access to those servers, the resources will remain available for testing for some time.

Next steps for us:

  • Try again with objects at 1KB to remove the data block logic from the results
  • Try again with ZFS for data blocks to eliminate any ext4 performance issues (many small files)

JKR

Author
Owner

Thanks a lot for your extensive feedback!


Regarding the fact that performance diminishes with the number of objects stored - and the size of LMDB - I think it can be expected as long as it's `O(log(n))`. Quickly:

B-Trees maintains balance by ensuring that each node has a minimum number of keys, so the tree is always balanced. This balance guarantees that the time complexity for operations such as insertion, deletion, and searching is always O(log n), regardless of the initial shape of the tree.

https://www.geeksforgeeks.org/introduction-of-b-tree-2/
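
As a rough back-of-the-envelope check (assuming a branching factor of around 100, which is plausible for 4KiB LMDB pages but not measured here), pure tree depth cannot explain the collapse:

% Height of a B-tree with branching factor b over n keys: h(n) \approx \log_b n
\log_{100} 10^6 = 3 \qquad \log_{100} 10^7 = 3.5
% Going from 1M to 10M entries adds about half a level (~17% more work per
% operation), nowhere near the 10-25x slowdowns observed above, which therefore
% have to come from I/O behaviour rather than algorithmic complexity.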


Regarding your node 3 performance, it seems that you are now less concerned with raw performance and more with predictability / explainability / homogeneity. Am I right?


It might be an obsession, but I don't like the following graph, as I don't understand what creates this pattern. I would really like to understand what's going on and be able to say whether it's expected or an issue.


I will try to take a closer look at all of your data later. We might find a thing or two!


I don't know exactly when I will be able to run additional benchmarks on my side, but here is what I have on my roadmap:

  • a benchmark with 5KiB files, so that we exercise the full logic (i.e. write more tables + an on-disk block);
  • try to replace the Merkle todo queue (and maybe the block resync queue) with a bounded on-disk queue outside of the metadata engine (like [yaque](https://docs.rs/yaque/latest/yaque/));
  • try to implement a RocksDB backend (we know that RocksDB's LSM-tree is more efficient on writes than LMDB's B-tree but less efficient on reads, so we will have to benchmark reads too, to better understand the trade-off...).
Contributor

@quentin Hi, author of [Fjall](https://github.com/fjall-rs/fjall) here.

With these benchmarks, we have learnt that LMDB has really "2 regimes": one when the whole database fits in RAM and one when it does not fit anymore, and when it does not fit anymore

This is observable for any database. While it's true that LMDB's read path is probably the quickest around, as your data surpasses RAM size any database will take quite a hit as I/O takes over. It can be really hard to benchmark LMDB because the OS just allows it to take all the memory, while with RocksDB etc. you have to explicitly set the block cache size, and you don't want it to start swapping memory. On the other hand, blocks in LSM-trees are tightly packed, while pages in a B-tree tend to be 1/2 or 1/3 empty (if values are small), so you may have to cache more pages for a B-tree to keep the same number of items in memory...

we know that RocksDB's LSM-tree is more efficient on writes than LMDB's B-tree but less efficient on reads

There are definitely some considerations here. If you write large values, as you seem to do, you probably want to use RocksDB's BlobDB mode, which will heavily reduce the compaction overhead caused by large values (the equivalent in Fjall being `use_kv_separation(true)` in the upcoming V2). And there are probably a myriad of other configuration options, which is one pain point of RocksDB really, next to its absurdly high compilation times.

A big problem with B-trees is that they need to read in order to write: ideally a lot of your pages are cached, but you may incur random read I/O to actually write data, which then possibly becomes a random write (LSM-trees always write sequentially, [which SSDs like](https://web.archive.org/web/20170606145305/https://www.seagate.com/ca/en/tech-insights/lies-damn-lies-and-ssd-benchmark-master-ti/)). Why the above benchmark degrades so much while all the data is still in memory, I'm not sure. It is an unsolvable downside of the LMDB write path - you can at least alleviate some of the LSM-tree's shortcomings by throwing more memory at it. And an LSM-tree just appends to the WAL, which has O(1) complexity, plus the write path is shorter overall.

There are also circumstances where a tuned RocksDB's read performance is very competitive with LMDB, sometimes beating it in multithreaded workloads: http://www.lmdb.tech/bench/ondisk/HW96rscale.png. It's not black & white, but LMDB has the better average read performance, of course.

Sync

I'm not sure which sync level was used in the above benchmarks, but note that LMDB's NOSYNC level is possibly unsafe, unless you can ensure your file system works with it:

This optimization means a system crash can corrupt the database or lose the last transactions if buffers are not yet flushed to disk. The risk is governed by how often the system flushes dirty buffers to disk and how often Environment::sync() is called. http://www.lmdb.tech/doc/group__mdb.html#ga32a193c6bf4d7d5c5d579e71f22e9340

LSM-trees, because they use a WAL, can adjust the durability level per write during runtime, not by setting a global environment flag. And unlike LMDB's NoSync flag, writing to a WAL cannot corrupt the database. For example, you could write blocks without sync, and then sync after writing the last block, without worrying that your file system may corrupt your database as described in the LMDB docs.

While I know some stuff about S3's multipart uploads, I'm not really familiar with Garage and what durability levels it needs or how many KV operations it actually performs per PUT, and what kind of value sizes the metadata is.

Possible issues in code

https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/lmdb_adapter.rs#L135-L142 - `Db::insert` and `::remove` return the previous value. This is expensive: for every insert, you essentially perform an O(log n) tree search (+ a heap allocation with memcpy because of `Vec::from`), followed by another O(log n) tree search (as explained above), followed by write I/O during commit, as LMDB needs to CoW the affected pages. If your data set is greater than RAM, your inserts now (1) may have to wait for read I/O and (2) can cause pages to be kicked out of the read cache. Using RocksDB wouldn't necessarily help here: the advantage of LSM-trees is to not read while inserting, but the code does exactly that.

Looking at the code base, the return value is (almost?) never used, so I would suggest changing `insert` and `remove` to be:

fn insert(key, value) -> Result<()>;
fn remove(key) -> Result<()>;

And if the return value is actually needed, add new operations, maybe something like:

fn upsert(key, value) -> Result<Option<PreviousValue>>;
fn take(key) -> Result<Option<PreviousValue>>;

Queue in key-value store

Use better datastructures - We store some queues in LMDB, a sorted collection. Sorted collections are expensive datastructure and queues don't need to be sorted according to a key

Also, having a queue in LMDB requires many writes & removes, which are intensive. Maybe LMDB is not the best tool to manage this queue?

Creating a sorted collection from random writes is expensive. If the collection is sorted to begin with, it is not expensive.

If I am reading the code correctly, the keys for the MerkleTree are not monotonic, because it's a compound key (partition, prefix). This will cause writes to be distributed across the B-tree (same for an LSM-tree), which is undesirable. If you want a queue, you need monotonic keys, e.g. ULID, CUID, or maybe Scru128, and you should store the partition + prefix in the value instead of the key. I still don't think LMDB is a great fit for this use case because it can (1) never shrink and (2) still needs random reads & writes as explained above. Plus, if your values are small (I can't tell from the code), a B-tree has very high write amplification.
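
To make the suggestion concrete, here is a small self-contained sketch (using a BTreeMap as a stand-in for the metadata store, not Garage's actual tables): the queue key is a monotonically increasing counter encoded big-endian, so lexicographic order matches insertion order and new entries always land at the tail of the keyspace, while the (partition, prefix) pair that is currently part of the key moves into the value.

use std::collections::BTreeMap;

// Big-endian encoding keeps lexicographic byte order equal to numeric order,
// which is what makes the key monotonic from the tree's point of view.
fn queue_key(seq: u64) -> [u8; 8] {
    seq.to_be_bytes()
}

fn main() {
    // Stand-in for the persisted todo table.
    let mut todo: BTreeMap<[u8; 8], Vec<u8>> = BTreeMap::new();
    let mut seq: u64 = 0;

    // The (partition, prefix) pair is serialized into the value, not the key.
    for (partition, prefix) in [(&b"p1"[..], &b"aa"[..]), (&b"p0"[..], &b"zz"[..])] {
        let mut value = Vec::with_capacity(partition.len() + prefix.len());
        value.extend_from_slice(partition);
        value.extend_from_slice(prefix);
        todo.insert(queue_key(seq), value);
        seq += 1;
    }

    // Draining from the front yields items in insertion order, like a FIFO.
    while let Some((key, value)) = todo.iter().next().map(|(k, v)| (*k, v.clone())) {
        todo.remove(&key);
        println!("seq {} -> {:?}", u64::from_be_bytes(key), value);
    }
}

With an LSM-tree, this layout turns the queue into pure sequential appends; with a B-tree like LMDB, it at least keeps inserts localized at the rightmost leaf instead of spread across the whole tree.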

Contributor

Quick follow-up on the previous benchmark, discussion below - thanks again everyone for your input!


Inlined objects, no data logic

Try again with objects at 1KB to remove the data block logic from the results

This is more of a retest following the read-ahead improvement, where the benchmark stores very small (inlined) values. We can confirm that we were able to reproduce the results obtained by @quentin on the Scaleway cluster:

(screenshots attached)

Full dashboard

(screenshot attached)

(forgive the gap, my benchmarking node likes to fail during lunch)

In those results, we can definitely see Garage performing steadily at about 800#/s. LMDB alone seems to perform well under stress and, with the read-ahead fix, barely suffers once the database has outgrown the RAM. The last graph confirms that the data partition remains unused (all inlined).


35KB objects, ZFS data partition, record size 128K

Try again with ZFS for data blocks to eliminate any ext4 performance issues (many small files)

The previous results suggest that the performance drop may be related to the data block logic, since LMDB alone performs well. To eliminate possible ext4 performance issues when handling many small files, we reran that test using ext4 for LMDB and ZFS for the data blocks. We're keeping the default record size for now (128K) since LMDB isn't on ZFS.

(screenshots attached)

Full dashboard

(screenshot attached)

(no idea where the data partition usage data went, but FYI it kept growing steadily up to 30GB)

As before, in this scenario, we are definitely seeing a performance drop: as the number of data blocks grows, the insert operations begin to slow down. This is also visible in the LMDB heatmaps (insert op). This suggests that writing block ref objects is much slower than just writing objects/versions. Perhaps this causes more page swaps? There's also the (probably) non-negligible cost of writing to the filesystem.

There was also a very odd behaviour towards the end of that benchmark... Disk usage for LMDB went straight from 10GB to 40GB in a matter of seconds, reaching 100% capacity. At this point, all operations also took a serious performance hit.


Discussion

Regarding the fact that performance diminishes with the number of objects stored - and the size of LMDB - I think it can be expected as long as it's O(log(n)).

I fear that when data blocks are involved, we may be leaning towards the "worst-case" edge in terms of complexity. The performance graph is definitely giving logarithmic vibes, but it's going down pretty fast still. That sudden change soon after we exceeded RAM size in the second benchmark also leaves me a little perplexed.

Regarding your node 3 performance, it seems that you are now less concerned with raw performance and more with predictability / explainability / homogeneity. Am I right?

Yes, the idea was to understand it in case it got involved in quorums and slowed down the entire cluster. But that seems to go against what we are seeing in the block resync queue graph. From what I understand, our setup implies that all nodes hold the same data and effectively act as 3 mirrors. In this case, am I right to assume that the preference list is the same for all keys, say [node1, node2, node3]? Then all quorums are reached using nodes 1 and 2, while node 3 accumulates an async backlog. In this scenario, node 3 never gets a chance to participate in quorums and therefore can't slow them down. Let me know if I got that wrong.

It might be an obsession, but I don't like the following graph, as I don't understand what creates this pattern.

I'm afraid these last 2 benchmarks don't really shed much light on this particular point... We are still seeing the pattern even with inlined objects.


I am not nearly familiar enough with LSM-trees (and/or Garage) to provide much feedback on that subject, I'm afraid. One thing I did notice when benchmarking LMDB earlier was that the Garage DB did in fact prefix its keys (at least for objects). I wonder if non-monotonic keys may be playing a part in the performance drop once block refs are involved. The values also aren't particularly large, so there is that last concern about write amplification.

JKR

Contributor

> Disk usage for LMDB went straight from 10GB to 40GB in a matter of seconds

I find this behaviour really curious. Is this maybe caused by the Todo queue's rapid growth? It would be nice to see the size of the different tables specifically; I can't imagine metadata taking 40GB. If so, one problem is that LMDB files never shrink: LMDB will try to reuse pages if possible, but even if the queue shrinks, you won't get those 30GB back. The only solution is to rewrite the entire database file (https://www.openldap.com/lists/openldap-devel/201505/msg00028.html).
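For reference, that rewrite is a copy-with-compaction into a fresh file, which is what `mdb_copy -c old-dir new-dir` does with the stock LMDB tools. From Rust, something along these lines should work; this is a rough sketch assuming heed exposes LMDB's `mdb_env_copy2` as `Env::copy_to_file` (the exact method name and the `unsafe` requirement on `open` vary between heed versions), with a made-up database path:

```rust
use heed::{CompactionOption, EnvOpenOptions};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the existing (bloated) environment. The map_size is a generous
    // upper bound; LMDB keeps the larger of this and the on-disk size.
    // Recent heed versions mark `open` as unsafe because of mmap aliasing.
    let env = unsafe {
        EnvOpenOptions::new()
            .map_size(64 * 1024 * 1024 * 1024)
            .open("/var/lib/garage/meta/db.lmdb")?
    };

    // Copy with compaction: free pages are dropped, so the new file only
    // contains live data. The old file is untouched and must be swapped in
    // manually afterwards, while Garage is stopped.
    env.copy_to_file(
        "/var/lib/garage/meta/db.lmdb.compacted",
        CompactionOption::Enabled,
    )?;
    Ok(())
}
```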

Contributor

I have tried to recreate the block ref schema in Garage in a CLI tool thingy: https://github.com/marvin-j97/write-bench

@withings Can you try running this on hardware that is comparable (or the same) as above? With:

```bash
cargo run -r -- --heed
```

I have found LMDB to really struggle on a memory-limited (1 GB) EC2 VM (at least 50x slower than LSM). On my personal machine (32 GB RAM) it's "only" 3x slower.
If the machine handles that workload fine, can you increase the number of items until you hit the breaking point?

Contributor

Hello! Sorry for the delay, there was some OoO on my end.

Thanks for `write-bench`, I ran it on one of our nodes with Garage down: we observed roughly the same performance trend as the above graphs suggest, without any burst even beyond 10M objects.

It turns out the explanation was actually a lot less exotic: we had `metadata_auto_snapshot_interval` on, which creates a snapshot/copy of the DB under the same mount point after some time. Those snapshots were filling up the disk and hogging CPU time. There appears to be a timezone issue with that feature (it triggered after 4 hours instead of 6 in UTC+2 - I'll see about opening a separate issue).

With that feature off, Garage behaves as expected:

![image](/attachments/b7c281ca-d9c1-41d7-875f-f0e98e7d5dc8)

Performance went down to ~110#/s at 5M objects:

![image](/attachments/4ed7e4ee-ff35-4efd-aa32-52a36460feed)
![image](/attachments/81b68f6a-4886-492d-8ad5-cc4611794c87)

We've added some extra metrics to be able to split LMDB usage per tree and got the following graph:

![image](/attachments/d3a71c8e-4519-43fe-9d37-eb0e9c302f78)

As we may have expected, `object:merkle_tree` and `version:merkle_tree` are the busiest trees, although cumulatively the `todo` trees still account for a non-negligible portion of LMDB usage.
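For anyone wanting to reproduce that per-tree breakdown, the idea is simply to tag the DB op counters with the tree name. A rough sketch of the idea, not the actual patch, assuming a recent `opentelemetry` crate (the instrument builder finaliser has been renamed between releases, so adjust `init`/`build` to your version):

```rust
use opentelemetry::{global, KeyValue};

/// Hypothetical helper: record one DB operation against a named tree,
/// e.g. record_db_op("object:merkle_tree", "insert").
/// A real implementation would create the counter once and reuse it.
fn record_db_op(tree: &str, op: &str) {
    let meter = global::meter("garage_db");
    let counter = meter.u64_counter("db.ops").init();
    counter.add(
        1,
        &[
            KeyValue::new("tree", tree.to_string()),
            KeyValue::new("op", op.to_string()),
        ],
    );
}

fn main() {
    record_db_op("object:merkle_tree", "insert");
}
```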

Most likely next steps for us:

  • Redeploy to production and try again with our use case, see how Garage performs in practice now
  • Try to replace the todo trees with queues as suggested earlier
  • Try to implement fjall as a backend - we've got an initial implementation running but the results aren't quite there yet
  • Try to implement RocksDB as a backend

Again, not sure how far we'll be able to push it, just a rough roadmap for the time being.

JKR

Contributor

@withings What NVMe drives are you running? The drive can affect LMDB's write speed quite a lot, even when using NO_SYNC (+ NO_READAHEAD, since that makes reads faster, as we established earlier).

Here's what I got (80M x 100 byte, random key):

## Samsung 990 EVO:

![hmbxqipu.png](/attachments/918ec918-2ccd-4c56-8e93-177d94895e41)
![hyv34p87.png](/attachments/4f7040e1-9d07-444e-a77b-ae5f61cd3c35)

## Samsung PM9A3:

![d29xwno1.png](/attachments/3688ed01-d830-44fb-b14e-1369242c7767)
![0dt2dste.png](/attachments/5b3feecf-db7e-4145-a896-880be21d0654)

---

![asdb.png](/attachments/93bb2f4f-6558-483a-b103-eed07163dd25)

For both runs, you can see that it runs quickly until about 280s; then, once it stops getting more memory from the OS, it starts running into fsyncs. For the EVO this becomes really expensive, reducing the write rate to around 10'000 inserts per second, versus 24'000 for the PM9A3.

Contributor

Our Garage instances run in virtual machines with a number of abstractions between them and the drives (RAID, encryption, ...) so it's a bit difficult to pin it down to the hardware itself. That being said, I can still run write-bench and graph its performance on top of it all.

**Average insert time (ms) up to 100M entries**

![image](/attachments/32f8d0b9-af50-4d63-85f9-8450733373ed)

We can see some serious variations between nodes but a common breaking point just before reaching 50M inserts. It aligns nearly perfectly with the moment the DB size exceeded the RAM, so that's to be expected. All in all, I'm not going to complain about a 0.3ms insert time at 100M objects, that's more than fine.

### Retry in production

That went well: our client is slower than Garage, so we can't challenge Garage's performance in that context anymore. We are still seeing some sporadic slow calls of up to 7s, which are somewhat annoying. Those will still need to be investigated; however, last time I tried, I couldn't find that "lost" time in the traces.

### Unifying key prefixes

Something which was mentioned before is that Garage's LMDB entries have prefixed keys. Objects, for instance, are recorded using the bucket ID as a prefix. I was curious to check performance with a more uniform key distribution, so I did a run where keys were transparently hashed (XXH3) by the adapter. Those changes were pushed to [withings/garage/feat/lmdb-hashed-keys](https://git.deuxfleurs.fr/withings/garage/src/branch/feat/lmdb-hashed-keys), but the performance difference is too insignificant to warrant a PR IMO.

![image](/attachments/f86b3c7a-c270-4bdb-9811-e05ea6b4c844)

![image](/attachments/41bfc7b4-cf5a-4ae1-9b4e-d3462cd7b868)
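For context, here is roughly what that kind of adapter-level hashing looks like; a minimal sketch assuming the `xxhash-rust` crate, not the actual branch code:

```rust
use xxhash_rust::xxh3::xxh3_128;

/// Hypothetical helper: turn a prefixed Garage key (e.g. bucket_id ++ object_key)
/// into a uniformly distributed fixed-size key before it reaches LMDB.
/// Note: hashing is one-way and breaks ordered prefix scans, so a real adapter
/// has to handle range queries separately; this sketch ignores that.
fn hashed_key(raw_key: &[u8]) -> [u8; 16] {
    xxh3_128(raw_key).to_be_bytes()
}

fn main() {
    let raw = [b"bucket-id-1234".as_slice(), b"/photos/cat.jpg".as_slice()].concat();
    println!("{:02x?}", hashed_key(&raw));
}
```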

### Using queues for the todo trees

We have implemented that over at [withings/garage/feat/todo-queues](https://git.deuxfleurs.fr/withings/garage/src/branch/feat/todo-queues), using [yaque](https://docs.rs/yaque/) as the first adapter. First results seem to suggest that the approach does indeed improve performance, and helps LMDB a lot.

Note that this approach does come with some caveats. First, it is no longer possible to probe for queue length, meaning you can't "see" how many items are pending through Garage (a yaque limitation). Second, I'm very uncertain about the approach I've taken when refactoring the GC. I think I will open a draft PR for this; I'd love your feedback.
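For readers who haven't looked at the branch, the gist is to take the `merkle_todo` / `gc_todo` items out of LMDB and push them through an append-only on-disk queue instead. A deliberately simplified, hypothetical sketch of the shape of that abstraction (the branch keeps the queue on disk via yaque; the names and the in-memory stand-in below are made up so the sketch stays self-contained):

```rust
use std::collections::VecDeque;

/// Hypothetical minimal interface for a todo queue.
trait TodoQueue {
    fn push(&mut self, item: Vec<u8>);
    fn pop(&mut self) -> Option<Vec<u8>>;
}

/// In-memory stand-in; the real adapter would be backed by an on-disk queue.
struct InMemoryQueue(VecDeque<Vec<u8>>);

impl TodoQueue for InMemoryQueue {
    fn push(&mut self, item: Vec<u8>) {
        self.0.push_back(item);
    }
    fn pop(&mut self) -> Option<Vec<u8>> {
        self.0.pop_front()
    }
}

fn main() {
    // Instead of writing a merkle_todo entry into LMDB alongside the object
    // write, the write path only enqueues a marker for the item to re-hash...
    let mut merkle_todo: Box<dyn TodoQueue> = Box::new(InMemoryQueue(VecDeque::new()));
    merkle_todo.push(b"partition:object-key".to_vec());

    // ...and the merkle updater worker drains the queue, touching LMDB only
    // for the merkle tree nodes themselves. Downside: there is no cheap way
    // to report the number of pending items anymore.
    while let Some(item) = merkle_todo.pop() {
        println!("update merkle tree for {}", String::from_utf8_lossy(&item));
    }
}
```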

![image](/attachments/5d3af963-98d8-4ab4-9e27-b93574b7d00a)
<sub>Performance on the S3 endpoint appears to be more stable, although those "dips" remain unexplained</sub>

![image](/attachments/54a5e7b2-363c-4cde-ab47-ccc2ffd35345)
<sub>LMDB performance is a lot more stable once the `merkle_todo` and `gc_todo` workloads have been moved out</sub>

![image](/attachments/4026fb4d-23f5-4705-baf4-12855e1a3495)
<sub>The same stability is also visible in the LMDB op delay</sub>

### Using queues for block resync

In its current state the resync worker actually needs a tree, as it relies on the order of its entries to work. Using queues there would require a much bigger rework IMO.

That being said, we did try (out of curiosity) to remove the resync logic from Garage (never resync). I don't have the graphs with me, but the results seemed to suggest that the resync worker claims a lot of CPU time, to the point where it can overpower the client-serving logic and ultimately render Garage unusable. There might be some room for fiddling in the tranquilizer logic which regulates that worker.

### Fjall backend

We also tried [fjall](https://docs.rs/fjall) as a new database backend; the work for this is over at [withings/garage/feat/fjall-db-engine](https://git.deuxfleurs.fr/withings/garage/src/branch/feat/fjall-db-engine). There's got to be something fundamentally wrong with my implementation though, because I could not get it beyond 50 req/s at the peak, and 10 req/s after a few hundred thousand objects. I tried to fiddle with it a bit, but no dice.

### RocksDB backend

Similar story here; you'll find that work at [withings/garage/feat/rocksdb-engine](https://git.deuxfleurs.fr/withings/garage/src/branch/feat/rocksdb-engine). The results were a bit more catastrophic there, with the engine straight up crashing after a couple hundred thousand objects. I'm 99% sure there's something wrong with my implementation there too, but I couldn't figure it out.

Owner

Regarding the merkle-todo worker stressing LMDB: I was wondering if applying todo items in batches to the merkle tree could help.
The idea would be to pop several items from the todo queue and update the tree at once for this batch of items, resulting in less writes to LMDB overall compared to applying items one by one. (For instance, applying 100 items individually means updating the root node 100 times, whereas applying a batch of 100 items would only need to update the root node of the tree once.)

I think this could be implemented in two ways: (1) by directly implementing `update_items`, a multi-item version of `update_item` (logically equivalent to iterating `update_item`, but more efficient in terms of DB writes), or (2) by generalizing `update_item` to an abstract type of database transaction, and implementing `update_items` on top of it by recording the underlying DB operations in a log, compressing the log (discarding writes overwritten by later writes) and applying it to the actual DB.

I've spent a bit of time to implement option (1) and I found it quite tricky; I think (2) is easier to implement conceptually, though probably more invasive.
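To make the compression step of option (2) concrete, here is a rough sketch (types and names are made up, not the actual garage_db transaction API): buffer the writes a batch of todo items would produce, keep only the last write per key, then flush once.

```rust
use std::collections::HashMap;

/// Hypothetical write-log entry: either put a value at a key or delete the key.
enum Op {
    Put(Vec<u8>),
    Delete,
}

/// Compress a log of writes: for each key only the last operation survives,
/// so updating the root node 100 times collapses into a single DB write.
fn compress_log(log: Vec<(Vec<u8>, Op)>) -> HashMap<Vec<u8>, Op> {
    let mut compacted = HashMap::new();
    for (key, op) in log {
        // Later writes overwrite earlier ones for the same key.
        compacted.insert(key, op);
    }
    compacted
}

fn main() {
    // Simulate applying 3 todo items that all touch the same tree node "root".
    let log = vec![
        (b"root".to_vec(), Op::Put(b"hash-1".to_vec())),
        (b"leaf-a".to_vec(), Op::Put(b"hash-a".to_vec())),
        (b"root".to_vec(), Op::Put(b"hash-2".to_vec())),
        (b"leaf-b".to_vec(), Op::Delete),
        (b"root".to_vec(), Op::Put(b"hash-3".to_vec())),
    ];
    let compacted = compress_log(log);
    // Only 3 writes reach the DB (root, leaf-a, leaf-b) instead of 5.
    println!("{} writes instead of 5", compacted.len());
}
```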

Contributor

@withings Your profile/repository are set to private I think, otherwise I would take a look

Owner

It would be nice if you could open PRs to include fjall and rocksdb support in Garage. Even if they are janky we might want to include them so that users could try them out and try to tune them to work better.

Contributor

@marvinj97 @lx whoops sorry, force of habit when I created my profile probably. I've set my profile to public. I'll go through your reviews ASAP!

Contributor

@withings So the most obvious thing I'm seeing without reading too much into the code is that:

  • Fjall is still at 1.x - there have been a good bunch of performance improvements since then
  • most importantly, you are not setting the cache capacity of either Fjall or Rocks. Compared to LMDB, those do not use memory mapping; instead there's a user-space cache (BlockCache). With your RAM I would suggest something around 1-4 GB for a start; as you decrease cache capacity, read performance will degrade gracefully, but write performance should be unaffected by cache capacity (see the sketch after this comment)

Also, you should see both LSM-trees using less disk space than LMDB. For anything else I would need to understand more about how Garage stores its KVs and what kinds of values (and their sizes) are stored.
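For Rocks, setting the block cache looks roughly like this; a rough sketch assuming a recent `rust-rocksdb` (older releases return a `Result` from the cache constructor) and a made-up path - fjall exposes an equivalent cache setting on its `Config`:

```rust
use rocksdb::{BlockBasedOptions, Cache, Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    // User-space block cache shared by the table readers; 2 GB as a starting
    // point for an 8 GB node (tune to taste).
    let cache = Cache::new_lru_cache(2 * 1024 * 1024 * 1024);

    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_block_cache(&cache);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);

    // Hypothetical path; in a Garage adapter this would come from the config.
    let _db = DB::open(&opts, "/var/lib/garage/meta/db.rocksdb")?;
    Ok(())
}
```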
