Performances collapse with 10 millions pictures in a bucket

quentin commented

2024-08-12 10:18:48 +00:00

Owner

Issue

Once ~4 millions objects are uploaded, PUT requests take more than 10 seconds even when the cluster is idle, even on empty buckets, even for 2KB files. GET requests do not seem to be impacted (they are all done in less than a second).

Expected

PUT requests take less than a second to execute, and probably less than 200ms considering the cluster spec. In our performance benchmark, we were able to do ~100 PUT/second without any issue.

Cluster spec

ZFS is used to store both metadata and data.
LMDB is used for the metadata engine.
Garage runs in a VM on a powerful server, it has 4 real CPU (not a vCPU) assigned + 8GB of RAM.
Configuration has been reviewed, it is standard (fsync disabled, 1MB chunks).
There is no repair tasks running, everything is correctly synchronized.
The team running the cluster is knowledgeable and currently uses Riak KV to store these 10M objects without any issue.

During the debug, we noted that the metadata database reached 50GB (for 80GB of data chunks). So around 7KB per entry, not sure if it's intended or not.

Workload spec

~11 millions pictures of ~1MB.
Migration is done by sending batches of 4 pictures in parallel.
At the beginning, the batch is finished in less than a few secondes.
After a while, it takes more than 60 seconds and hit an internal timeout.
We have done manual test then on the idle cluster, where we sent various small files that all took ~10 seconds to upload.

Investigations so far

We extracted a trace for a PUT and a GET request. For the PUT request, the performance hit is due to a slow try_write_many_sets, waiting a long time for the other nodes of the network. The corresponding call for GET is fast however. We can't investigate further currently, as we are "crossing a network boundary" and we have not implemented "distributed tracing" (mainly sending the trace/span id over the network).

Based on this observation, we suppose that the metadata engine is slow. Indeed, the network and the cluster are idle, so we have no bottleneck or buffer bloat/queuing issue here, and additionally, GET requests have no issue. Furthermore, the slow delays are observed for metadata writes (writing an entry in the bucket list, a new object version, etc.) and thus are not limited to the block manager.

Today, we have no opentelemetry metrics to measure the responsiveness of the metadata engine, the length of our network queues, the amount of bytes written, of data sent and receive, etc.

We were also not able to explore the LMDB database to see if we have some "stale" entries or things like that.

Investigations to conduct

Collect new opentelemetry metrics
- On the metadata engine, especially
  - Number of read ops
  - Number of write ops
  - Distribution of time taken by read ops
  - Distribution of time taken by write ops
- Also on the netdata side, especially
  - Number of packets sent/received
  - Size of packets sent/received
  - Queue length
  - Some info about our custom QoS logic
- Ideally on the block manager side
  - Number of block reads
  - Number of blocks written (or deleted)
  - Distribution of time taken to read a block
  - Distribution of time take to write block
Allow the operator to explore the metadata engine of the current node
- through the current garage command
- read only only to avoid cluster corruptions
- for debug purposes
Use Brendan Gregg's tools (strace/ptrace/whatever) to observe read/writes on the metadata file
- See https://www.brendangregg.com/linuxperf.html
Use my s3billion script to try to reproduce (or not reproduce) this issue. With the following configuration:
- OBJ_SIZE=1048576 - 1MB objects
- BATCH_SIZE=16 - batch_size is multiplied by THREAD, each thread will send BATCH_SIZE PUT req before reporting/synchronizing
- THREAD=64 - should be called "goroutine", the number of parallel request
- BATCH_COUNT=10000 - we report every ~1k objects, to send 10M objects, we need 10k iterations.
- SSL,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_REGION,SSL_INSECURE,ENDPOINT must/might be configured

## Issue Once ~4 millions objects are uploaded, PUT requests take more than 10 seconds even when the cluster is idle, even on empty buckets, even for 2KB files. GET requests do not seem to be impacted (they are all done in less than a second). ## Expected PUT requests take less than a second to execute, and probably less than 200ms considering the cluster spec. In [our performance benchmark](https://garagehq.deuxfleurs.fr/blog/2022-perf/), we were able to do ~100 PUT/second without any issue. ## Cluster spec ZFS is used to store both metadata and data. LMDB is used for the metadata engine. Garage runs in a VM on a powerful server, it has 4 real CPU (not a vCPU) assigned + 8GB of RAM. Configuration has been reviewed, it is standard (fsync disabled, 1MB chunks). There is no repair tasks running, everything is correctly synchronized. The team running the cluster is knowledgeable and currently uses Riak KV to store these 10M objects without any issue. During the debug, we noted that the metadata database reached 50GB (for 80GB of data chunks). So around 7KB per entry, not sure if it's intended or not. ## Workload spec ~11 millions pictures of ~1MB. Migration is done by sending batches of 4 pictures in parallel. At the beginning, the batch is finished in less than a few secondes. After a while, it takes more than 60 seconds and hit an internal timeout. We have done manual test then on the idle cluster, where we sent various small files that all took ~10 seconds to upload. ## Investigations so far We extracted a trace for a `PUT` and a `GET` request. For the `PUT` request, the performance hit is due to a slow `try_write_many_sets`, waiting a long time for the other nodes of the network. The corresponding call for `GET` is fast however. We can't investigate further currently, as we are "crossing a network boundary" and we have not implemented "distributed tracing" (mainly sending the trace/span id over the network). Based on this observation, we suppose that the metadata engine is slow. Indeed, the network and the cluster are idle, so we have no bottleneck or buffer bloat/queuing issue here, and additionally, GET requests have no issue. Furthermore, the slow delays are observed for metadata writes (writing an entry in the bucket list, a new object version, etc.) and thus are not limited to the block manager. Today, we have no opentelemetry metrics to measure the responsiveness of the metadata engine, the length of our network queues, the amount of bytes written, of data sent and receive, etc. We were also not able to explore the LMDB database to see if we have some "stale" entries or things like that. ## Investigations to conduct - Collect new opentelemetry metrics - On the metadata engine, especially - Number of read ops - Number of write ops - Distribution of time taken by read ops - Distribution of time taken by write ops - Also on the netdata side, especially - Number of packets sent/received - Size of packets sent/received - Queue length - Some info about our custom QoS logic - Ideally on the block manager side - Number of block reads - Number of blocks written (or deleted) - Distribution of time taken to read a block - Distribution of time take to write block - Allow the operator to explore the metadata engine of the current node - through the current `garage` command - read only only to avoid cluster corruptions - for debug purposes - Use Brendan Gregg's tools (strace/ptrace/whatever) to observe read/writes on the metadata file - See https://www.brendangregg.com/linuxperf.html - Use my [s3billion script](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion) to try to reproduce (or not reproduce) this issue. With the following configuration: - `OBJ_SIZE=1048576` - 1MB objects - `BATCH_SIZE=16` - batch_size is multiplied by THREAD, each thread will send BATCH_SIZE PUT req before reporting/synchronizing - `THREAD=64` - should be called "goroutine", the number of parallel request - `BATCH_COUNT=10000` - we report every ~1k objects, to send 10M objects, we need 10k iterations. - `SSL,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,AWS_REGION,SSL_INSECURE,ENDPOINT` must/might be configured

👀 1

quentin added the

labels 2024-08-12 10:18:48 +00:00

quentin self-assigned this 2024-08-12 10:18:48 +00:00

quentin referenced this issue

2024-08-14 20:37:22 +00:00

WIP: add metrics to the metadata engine #853

quentin commented

2024-08-15 19:29:54 +00:00

Author

Owner

Summary of my tests on Grid 5k

On august 15th, I ran some tests on the grappe cluster of the Grid5k project.

Test setup

warp put \
  --host=grappe-4-ipv6.nancy.grid5000.fr:3900,grappe-14-ipv6.nancy.grid5000.fr:3900,grappe-16-ipv6.nancy.grid5000.fr:3900 \
  --host-select=roundrobin \
  --obj.size=1KB \
  --duration=1h \
  --concurrent=256 \
  --bucket=xxx \
  --noclear

Relevant observations

full dashboard (different time span)

Docker compose + prometheus conf to explore data yourself

compose.yml:

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - 9090:9090
    volumes:
      - ./prometheus.yml:/prometheus.yml
      - ./prom-snap:/prometheus
    command:
      # extracted from dockerfile
      - "--config.file=/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      # disable time-based data deletion
      - "--storage.tsdb.retention.time=20y"
      - "--storage.tsdb.retention.size=20GB"
  grafana:
    image: grafana/grafana:latest
    ports:
      - 3000:3000
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 1y
  scrape_timeout: 10s
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 15s
runtime:
  gogc: 75
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    http_headers: null
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets: []
scrape_configs: []

Analysis

Garage ran in 2 different regimes:

First a stable one where performances are stable, nodes feature the same behavior,
Then an oscillating one where one of the three nodes performs slower than the others, and many components have oscillating performances.

Oscillations could be due to how the benchmark took (warp) is implemented.

## Summary of my tests on Grid 5k On august 15th, I ran some tests on the [grappe cluster](https://www.grid5000.fr/w/Nancy:Hardware#grappe) of the [Grid5k](https://www.grid5000.fr/w/Grid5000:Home) project. ### Test setup ```bash warp put \ --host=grappe-4-ipv6.nancy.grid5000.fr:3900,grappe-14-ipv6.nancy.grid5000.fr:3900,grappe-16-ipv6.nancy.grid5000.fr:3900 \ --host-select=roundrobin \ --obj.size=1KB \ --duration=1h \ --concurrent=256 \ --bucket=xxx \ --noclear ``` ### Relevant observations <details><summary>full dashboard (different time span)</summary> ![](https://git.deuxfleurs.fr/attachments/598ccd4b-560a-4679-8b51-f69fef37cd0d) ![](https://git.deuxfleurs.fr/attachments/2738634c-b2e9-49f1-a584-d35a02daacea) ![](https://git.deuxfleurs.fr/attachments/c30aa5fa-a14d-4e58-978a-f15fff553abd) </details> ![](https://git.deuxfleurs.fr/attachments/e8055704-6aa4-4c36-8400-6b123268d41b) <details><summary>Docker compose + prometheus conf to explore data yourself</summary> compose.yml: ```yml services: prometheus: image: prom/prometheus:latest ports: - 9090:9090 volumes: - ./prometheus.yml:/prometheus.yml - ./prom-snap:/prometheus command: # extracted from dockerfile - "--config.file=/prometheus.yml" - "--storage.tsdb.path=/prometheus" - "--web.console.libraries=/usr/share/prometheus/console_libraries" - "--web.console.templates=/usr/share/prometheus/consoles" # disable time-based data deletion - "--storage.tsdb.retention.time=20y" - "--storage.tsdb.retention.size=20GB" grafana: image: grafana/grafana:latest ports: - 3000:3000 volumes: - grafana-data:/var/lib/grafana volumes: grafana-data: ``` prometheus.yml ```yml global: scrape_interval: 1y scrape_timeout: 10s scrape_protocols: - OpenMetricsText1.0.0 - OpenMetricsText0.0.1 - PrometheusText0.0.4 evaluation_interval: 15s runtime: gogc: 75 alerting: alertmanagers: - follow_redirects: true enable_http2: true http_headers: null scheme: http timeout: 10s api_version: v2 static_configs: - targets: [] scrape_configs: [] ``` </details> ### Analysis Garage ran in 2 different regimes: - First a stable one where performances are stable, nodes feature the same behavior, - Then an oscillating one where one of the three nodes performs slower than the others, and many components have oscillating performances. Oscillations could be due to how the benchmark took (`warp`) is implemented.

prom-final.tgz

24 MiB

20240815_19h13m08s_grim.png

252 KiB

20240815_19h13m34s_grim.png

186 KiB

dashboard.json

40 KiB

20240815_19h13m46s_grim.png

86 KiB

20240816_17h20m24s_grim.png

179 KiB

👀 1

quentin commented

2024-08-17 19:10:54 +00:00

Author

Owner

Summary of my tests on a Scaleway DEV1-S cluster

Servers have 2vCPU, 2GB of RAM, 50GB of SAN SSD storage. They run Ubuntu 24.04 with Garage on it.
Scaleway is broken so my benchmarking node has been shutdown numerous time (but not my garage cluster):it's why data have holes.

The test command is:

warp put \
  --host=g1:3900,g2:3900,g3:3900 \
  --host-select=roundrobin \
  --obj.size=1KB \
  --duration=4h \
  --concurrent=256 \
  --bucket=xxx \
  --noclear

Reproducing performance collapse.

I analyzed a run going from 0 to 1.5 million objects:

Here also performances collapse as soon as LMDB database is larger than RAM.

Let zoom on the part where we go from 1.5M objects to 1.6M objects

We go from ~1k req/sec to 40 req/sec (25 times slower). However we don't have oscillations like in Grid5k.

Disabling read ahead improves performances by at least 10x

It appears that by default read ahead is activated on Linux. When fetching a page on disk, Linux also fetches the 15 next ones. It's meant to improve sequential reads on hard drives, but we don't have a hard drive nor sequential reads here. So basically, it hurts our performances.

LMDB has an option to deactivate read ahead on Linux.
It's named MdbNoRdAhead on heed, the Rust wrapper we use for LMDB, I did a simple commit to activate it.
Incidentally, activating this option when your database is larger than your memory is recommended by LMDB author.

So I did the same test on the same hardware but with this flag activated:

It started at the same speed, but once we reached the threshold of the RAM size, performances did not collapse similarly as the previous run. I was able to push Garage to 10M objects without any major issue. At 10M objects, the LMDB database takes ~33GB on the filesystem.

Let zoom around 1.5 million objects to see how it behaves:

We are putting objects at ~700 req/sec. Compared to previous run at 40 req/sec, we are ~17x faster. And if look around 10M objects, we are still at ~400 req/sec, in other words ~10x faster.

Discussion

About mmap. There are discussions to know whether or not mmap is a good idea. There is this old debate: 1975 programming vs 2006 programming. More recently, scholars have published a paper where they heavily discourage DBMS implementers from using mmap. With these benchmarks, we have learnt that LMDB has really "2 regimes": one when the whole database fits in RAM and one when it does not fit anymore, and when it does not fit anymore, it is very sensitive to its environment. Like Riak, the metadata engine is abstracted behind an interface in Garage: it's not hard to implement other engines to see how they perform. Trying LSM-Tree based engine, like RocksDB or the Rust native fjall could be interesting. A more exhaustive list of possible metadata engine.

About ZFS. This whole discussion started mentioning ZFS. While our benchmarks showed suboptimal performances, none of them where as abysmal as observed by people reporting the initial issue. By digging on the Internet, it appears that ZFS has a 128kb record size (ie. page size?) by default. And it is known to perform poorly with LMDB as reported by a netizen trying to use Monero over ZFS (that records its blockchain in LMDB). Checking ZFS record size and setting it to 4096 (4k to match LMDB page size) apparently drastically increase performances.

About Merkle Todo. It appears on some nodes the "Merkle Todo" grows without any bound. Garage has a worker that builds asynchronously a Merkle Tree that will then be used for healing. New items to put in this merkle tree are put in a "todo queue" that is also persisted in the metadata engine. In the end, it's not a huge issue that this queue grows: it only means that your healing mechanism will not be aware of the most recent items before some times (delaying a little bit the repair logic). However, by having the queue also managed by the LMDB database, in some way, we may "double" the size of the LMDB database (by having one entry in the object table and one entry in the merkle queue). Also having a queue in LMDB requires to do many writes & removes that are intensive. Maybe LMDB is not the best tool to manage this queue? We might also want to slow down the RPC if we are not able to process this queue fast enough to allow some backpressure in Garage (and in the same way if too much RPC accumulate on a node, slow down the responses to the API and/or return an "overloaded" error).

Explore the data

All the collected data are made available as a prometheus snapshot (see prom.tgz below).

## Summary of my tests on a Scaleway DEV1-S cluster Servers have 2vCPU, 2GB of RAM, 50GB of SAN SSD storage. They run Ubuntu 24.04 with Garage on it. Scaleway is broken so my benchmarking node has been shutdown numerous time (but not my garage cluster):it's why data have holes. The test command is: ``` warp put \ --host=g1:3900,g2:3900,g3:3900 \ --host-select=roundrobin \ --obj.size=1KB \ --duration=4h \ --concurrent=256 \ --bucket=xxx \ --noclear ``` ### Reproducing performance collapse. I analyzed a run going from 0 to 1.5 million objects: ![](https://git.deuxfleurs.fr/attachments/47796f65-e2e2-489a-8dc0-303c99116244) Here also performances collapse as soon as LMDB database is larger than RAM. Let zoom on the part where we go from 1.5M objects to 1.6M objects ![](https://git.deuxfleurs.fr/attachments/b4f721a1-86fb-43d4-a65c-c496adbb8463) We go from ~1k req/sec to 40 req/sec (25 times slower). However we don't have oscillations like in Grid5k. ### Disabling read ahead improves performances by at least 10x It appears that by default read ahead is activated on Linux. When fetching a page on disk, Linux also fetches the 15 next ones. It's meant to improve sequential reads on hard drives, but we don't have a hard drive nor sequential reads here. So basically, it hurts our performances. LMDB has an option to deactivate read ahead on Linux. It's named `MdbNoRdAhead` on heed, the Rust wrapper we use for LMDB, I did [a simple commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/1ebaf7aa17672bdc6e83e6c04c4c13e142f57629) to activate it. Incidentally, activating this option when your database is larger than your memory is recommended [by LMDB author](https://github.com/bmatsuo/lmdb-go/issues/118#issuecomment-325449496). So I did the same test on the same hardware but with this flag activated: ![20240817_20h33m06s_grim.png](/attachments/c6393ed2-ab9f-4ad3-972b-8ac16039e1e3) It started at the same speed, but once we reached the threshold of the RAM size, performances did not collapse similarly as the previous run. **I was able to push Garage to 10M objects without any major issue**. At 10M objects, the LMDB database takes ~33GB on the filesystem. Let zoom around 1.5 million objects to see how it behaves: ![20240817_20h37m35s_grim.png](/attachments/dbb87c88-ba8d-4185-90ed-f0096b0f414b) We are putting objects at ~700 req/sec. Compared to previous run at 40 req/sec, we are ~17x faster. And if look around 10M objects, we are still at ~400 req/sec, in other words ~10x faster. ![20240817_20h33m42s_grim.png](/attachments/08f2ea73-653d-4a92-961f-f82878bcf080) ### Discussion **About mmap**. There are discussions to know whether or not mmap is a good idea. There is this old debate: [1975 programming](http://varnish-cache.org/docs/trunk/phk/notes.html) vs [2006 programming](http://oldblog.antirez.com/post/what-is-wrong-with-2006-programming.html). More recently, scholars have published [a paper](https://db.cs.cmu.edu/mmap-cidr2022/) where they heavily discourage DBMS implementers from using mmap. With these benchmarks, we have learnt that LMDB has really "2 regimes": one when the whole database fits in RAM and one when it does not fit anymore, and when it does not fit anymore, it is very sensitive to its environment. [Like Riak](https://riak.com/assets/bitcask-intro.pdf), the metadata engine is abstracted behind an interface in Garage: it's not hard to implement other engines to see how they perform. Trying LSM-Tree based engine, like [RocksDB](https://rocksdb.org/) or the Rust native [fjall](https://github.com/fjall-rs/fjall) could be interesting. [A more exhaustive list of possible metadata engine](https://github.com/marvin-j97/rust-storage-bench). **About ZFS**. This whole discussion started mentioning ZFS. While our benchmarks showed suboptimal performances, none of them where as abysmal as observed by people reporting the initial issue. By digging on the Internet, it appears that ZFS has a 128kb record size (ie. page size?) by default. And it is known to perform poorly with LMDB as reported by a netizen [trying to use Monero over ZFS](https://github.com/openzfs/zfs/issues/7543) (that records its blockchain in LMDB). Checking ZFS record size and setting it to 4096 (4k to match LMDB page size) apparently drastically increase performances. **About Merkle Todo.** It appears on some nodes the "Merkle Todo" grows without any bound. Garage has a worker that builds asynchronously a Merkle Tree that will then be used for healing. New items to put in this merkle tree are put in a "todo queue" that is also persisted in the metadata engine. In the end, it's not a huge issue that this queue grows: it only means that your healing mechanism will not be aware of the most recent items before some times (delaying a little bit the repair logic). However, by having the queue also managed by the LMDB database, in some way, we may "double" the size of the LMDB database (by having one entry in the object table and one entry in the merkle queue). Also having a queue in LMDB requires to do many writes & removes that are intensive. Maybe LMDB is not the best tool to manage this queue? We might also want to slow down the RPC if we are not able to process this queue fast enough to allow some backpressure in Garage (and in the same way if too much RPC accumulate on a node, slow down the responses to the API and/or return an "overloaded" error). ### Explore the data All the collected data are made available as a prometheus snapshot (see `prom.tgz` below).

20240817_20h33m06s_grim.png

72 KiB

20240817_20h37m35s_grim.png

51 KiB

20240817_20h33m42s_grim.png

74 KiB

prom.tgz

38 MiB

20240817_20h35m51s_grim.png

66 KiB

20240817_20h35m27s_grim.png

61 KiB

quentin referenced this issue

2024-08-17 19:19:22 +00:00

Set "no read ahead" on LMDB to improve performances when "LMDB size" > "RAM size" #855

maximilien referenced this issue

2024-08-18 18:05:03 +00:00

Implement underlying information about LMDB database #856

withings referenced this issue

2024-08-23 12:46:21 +00:00

WIP: Implement preemptive sends to alleviate slow RPC propagation #860

withings referenced this issue

2024-08-26 13:34:28 +00:00

WIP: Implement preemptive sends to alleviate slow RPC propagation #860

withings commented

2024-08-29 14:55:08 +00:00

Contributor

Retest on a cluster w/ data blocks (objects beyond 3KiB)

Very nice catch for the read ahead flag, thank you for your commit! To further investigate the issue, we re-ran a benchmark of our own, using larger and random object sizes to ensure block refs were created and blocks were stored.

To match the Scaleway test and leave out any filesystem-related side effects, we're running on ext4 for both meta and data
The servers are VMs running on various CPUs from 2.6 to 2.9GHz, with 8GB RAM and 500GB NVMe per node
Multiple factors related to how our drives are set up are expected to slow down writes, but this should remain acceptable (and should be more or less constant)
Our Garage build is based off the main branch after !855, into which we then merged !853 to enable LMDB metrics ; the work suggested in !860 was not included and the send_all_at_once read strategy was not used
Object sizes randomly go up to 35KiB (the median size in our use case, beyond Garage's inlining threshold)

warp command line

warp put \
    --host=node1:3900,node2:3900,node3:3900 \
    --host-select=roundrobin \
    --obj.size=35KB \
    --obj.randsize \
    --duration=4h \
    --concurrent=256 \
    --bucket=perftest \
    --noclear

garage.toml

metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
db_engine = "lmdb-with-metrics"

replication_factor = 3
consistency_mode = "consistent"
compression_level = "none"

rpc_bind_addr = "[::]:3901"
rpc_public_addr = "<node IP>:3901"
rpc_secret = "<secret>"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"
root_domain = ".s3.garage.localhost"

[k2v_api]
api_bind_addr = "[::]:3904"

[admin]
api_bind_addr = "[::]:3903"
admin_token = "<secret>"

Cluster observability

That initial 4-hour-long benchmark yielded mixed results, as shown in the PutObject rate graphs below:

There appears to be roughly 3 phases:

An initial burst at 800#/s, followed by a steady slowdown to 120#/s
An acceleration up to ~375#/s
A steady slowdown which suggests an eventual stabilisation between 75 and 100#/s

It should also be noted that the database ended up at ~6.3GiB (below RAM size) so the impact of read-ahead must have been minimal. The RAM is also dedicated to the VMs so nothing external is forcing pages out of memory.

The speed pattern matches what is observed on RPCs and table operations:

The LMDB metrics also follow the pattern:

More: LMDB heat maps

Looking at cluster synchronisation, we observe that the block sync buffers (block_ram_buffer_free_kb) do not get overwhelmed:

Similarly, no node is lagging behind on Merkle tree updates, although node 3 (2nd in 1st AZ) exhibits higher values:

However the block resync queue (block_resync_queue_length) goes absolutely crazy on that 3rd node:

System observability

We are also able to provide some system-level stats. First looking at load, we can definitely see that node 3 is struggling more than the rest, although all of them remain within their CPU capacity:

Looking at cache usage, a strange behaviour comes up : while nodes 1 and 2 keep a steady cache usage, pages on node 3 are regularly being released:

That memory does not appear to be claimed for use though, meaning those drops are unlikely to be forced evictions:

Finally, network-wise, all nodes exhibit similar behaviours, communicating far below link capacity (I may have got those units wrong, but in any case, there's plenty of bandwidth left) :

Discussion

While the read-ahead flag definitely delivers improvements when the database size exceeds the RAM, there appears to be other factors which, in our setup, hinder Garage's capabilities in terms of performance. If the Scaleway cluster stabilised at around 400#/s, we were only able to reach (a much more unstable) 75#/s.

Regarding the impact of the object count, we just restarted another benchmark today, starting off where the first one had left off. The idea was to figure out whether the slowdown was related to cluster size (in which case the new benchmark should start at around 75#/s) or to load/work backlog (in which case the new benchmark should start over from about 800#/s now that the cluster has settled). The first results suggest the former.

➡️ Cluster size appears to still play a part in the performance drop, even below RAM capacity.

Regarding node 3, those results also confirm what we initially saw on our test cluster: the 2nd node in the 1st availability zone appears to do a lot more work than the rest of the nodes, causing it to lag behind at times. This has a cluster-wide impact since this node is still expected to participate in quorums.

➡️ Quick tests were made using send_all_at_once but this only improves reads, which overall play a minor role in a PutObject call - node 3 may still be delaying write operations.

Regarding RAM cache, those releases on node 3 are definitely odd. They suggest that the node may be behaving more erratically than the rest and moving more things in and out of memory.

Similarly, regarding the block resync queue, we are still unable to explain why node 3 accumulates so much work (although this may just be another symptom).

➡️ Overall, it seems important to understand what makes node 3 so special, given that, system-wise, it does not appear to be particularly different from the rest of the cluster.

We will continue our investigations, but would definitely appreciate any feedback you may have on those results! If something worth testing comes to mind, also feel free to suggest it so we may run it on this cluster. While we can't share access to those servers, the resources will remain available for testing for some time.

Next steps for us:

Try again with objects at 1KB to remove the data block logic from the results
Try again with ZFS for data blocks to eliminate any ext4 performance issues (many small files)

JKR

## Retest on a cluster w/ data blocks (objects beyond 3KiB) Very nice catch for the read ahead flag, thank you for your commit! To further investigate the issue, we re-ran a benchmark of our own, using larger and random object sizes to ensure block refs were created and blocks were stored. - To match the Scaleway test and leave out any filesystem-related side effects, we're running on ext4 for both meta and data - The servers are VMs running on various CPUs from 2.6 to 2.9GHz, with 8GB RAM and 500GB NVMe per node - Multiple factors related to how our drives are set up are expected to slow down writes, but this should remain acceptable (and should be more or less constant) - Our Garage build is based off the main branch after !855, into which we then merged !853 to enable LMDB metrics ; the work suggested in !860 was **not** included and the `send_all_at_once` read strategy was **not** used - Object sizes randomly go up to 35KiB (the median size in our use case, beyond Garage's inlining threshold) **warp command line** ``` warp put \ --host=node1:3900,node2:3900,node3:3900 \ --host-select=roundrobin \ --obj.size=35KB \ --obj.randsize \ --duration=4h \ --concurrent=256 \ --bucket=perftest \ --noclear ``` **garage.toml** ``` metadata_dir = "/var/lib/garage/meta" data_dir = "/var/lib/garage/data" db_engine = "lmdb-with-metrics" replication_factor = 3 consistency_mode = "consistent" compression_level = "none" rpc_bind_addr = "[::]:3901" rpc_public_addr = "<node IP>:3901" rpc_secret = "<secret>" [s3_api] s3_region = "garage" api_bind_addr = "[::]:3900" root_domain = ".s3.garage.localhost" [k2v_api] api_bind_addr = "[::]:3904" [admin] api_bind_addr = "[::]:3903" admin_token = "<secret>" ``` ### Cluster observability That initial 4-hour-long benchmark yielded mixed results, as shown in the PutObject rate graphs below: <img width="472" alt="image" src="/attachments/7c678e20-9cba-4152-823b-cda8aefc03bc"> <img width="473" alt="image" src="/attachments/66ad55bb-cc5d-487f-b703-387a4ff7322c"> <img width="474" alt="image" src="/attachments/eea9b98f-6a06-494c-a3b8-9e69527b57f8"> There appears to be roughly 3 phases: - An initial burst at 800#/s, followed by a steady slowdown to 120#/s - An acceleration up to ~375#/s - A steady slowdown which suggests an eventual stabilisation between 75 and 100#/s It should also be noted that the database ended up at ~6.3GiB (below RAM size) so the impact of read-ahead must have been minimal. The RAM is also dedicated to the VMs so nothing external is forcing pages out of memory. The speed pattern matches what is observed on RPCs and table operations: <img width="473" alt="image" src="/attachments/3a786b77-0b08-4bd4-b160-79e6a7689ce5"> <img width="474" alt="image" src="/attachments/50ab95d6-fe1c-4af0-9ae9-64dc4a59db15"> <img width="473" alt="image" src="/attachments/bec8cf69-d961-431e-a8dd-2edacc8a6eb0"> The LMDB metrics also follow the pattern: <img width="836" alt="image" src="/attachments/002aa8dd-d4a6-438b-bba3-61e1153f1431"> <details> <summary>More: LMDB heat maps</summary> <img width="833" alt="image" src="/attachments/158fd5e5-0072-4787-bffe-a6090e30ac18"> </details> Looking at cluster synchronisation, we observe that the block sync buffers (`block_ram_buffer_free_kb`) do not get overwhelmed: <img width="473" alt="image" src="/attachments/7d780418-07c8-4c57-a2f8-b05830777ae1"> Similarly, no node is lagging behind on Merkle tree updates, although node 3 (2nd in 1st AZ) exhibits higher values: <img width="473" alt="image" src="/attachments/346dfb62-4809-43f0-9486-a2651743e559"> However the block resync queue (`block_resync_queue_length`) goes absolutely crazy on that 3rd node: <img width="473" alt="image" src="/attachments/b9f788d0-b88e-4873-9a4e-0435ab7866bc"> ### System observability We are also able to provide some system-level stats. First looking at load, we can definitely see that node 3 is struggling more than the rest, although all of them remain within their CPU capacity: <img width="474" alt="image" src="/attachments/6b64ef19-66c2-48f6-9371-3a3c7a788906"> Looking at cache usage, a strange behaviour comes up : while nodes 1 and 2 keep a steady cache usage, pages on node 3 are regularly being released: <img width="476" alt="image" src="/attachments/1ca369f5-6fbd-483e-a41c-aea21a1b0e3c"> That memory does not appear to be claimed for use though, meaning those drops are unlikely to be forced evictions: <img width="473" alt="image" src="/attachments/38bdb421-b5d9-4ce7-9390-d4b29b1bc37c"> Finally, network-wise, all nodes exhibit similar behaviours, communicating far below link capacity (I may have got those units wrong, but in any case, there's plenty of bandwidth left) : <img width="474" alt="image" src="/attachments/f39dca35-b081-47b2-b849-54bef9090029"> ### Discussion While the read-ahead flag definitely delivers improvements when the database size exceeds the RAM, there appears to be other factors which, in our setup, hinder Garage's capabilities in terms of performance. If the Scaleway cluster stabilised at around 400#/s, we were only able to reach (a much more unstable) 75#/s. **Regarding the impact of the object count,** we just restarted another benchmark today, starting off where the first one had left off. The idea was to figure out whether the slowdown was related to cluster size (in which case the new benchmark should start at around 75#/s) or to load/work backlog (in which case the new benchmark should start over from about 800#/s now that the cluster has settled). The first results suggest the former. <img width="470" alt="image" src="/attachments/39cf527f-ddd5-43c4-bb6f-4cd8da356bc5"> ➡️ Cluster size appears to still play a part in the performance drop, even below RAM capacity. **Regarding node 3,** those results also confirm what we initially saw on our test cluster: the 2nd node in the 1st availability zone appears to do a lot more work than the rest of the nodes, causing it to lag behind at times. This has a cluster-wide impact since this node is still expected to participate in quorums. ➡️ Quick tests were made using `send_all_at_once` but this only improves reads, which overall play a minor role in a PutObject call - node 3 may still be delaying write operations. **Regarding RAM cache,** those releases on node 3 are definitely odd. They suggest that the node may be behaving more erratically than the rest and moving more things in and out of memory. **Similarly, regarding the block resync queue,** we are still unable to explain why node 3 accumulates so much work (although this may just be another symptom). ➡️ Overall, it seems important to understand what makes node 3 so special, given that, system-wise, it does not appear to be particularly different from the rest of the cluster. --- We will continue our investigations, but would definitely appreciate any feedback you may have on those results! If something worth testing comes to mind, also feel free to suggest it so we may run it on this cluster. While we can't share access to those servers, the resources will remain available for testing for some time. Next steps for us: - Try again with objects at 1KB to remove the data block logic from the results - Try again with ZFS for data blocks to eliminate any ext4 performance issues (many small files) JKR

Rows
Columns

Performances collapse with 10 millions pictures in a bucket #851

Issue

Expected

Cluster spec

Workload spec

Investigations so far

Investigations to conduct

Summary of my tests on Grid 5k

Test setup

Relevant observations

Analysis

Summary of my tests on a Scaleway DEV1-S cluster

Reproducing performance collapse.

Disabling read ahead improves performances by at least 10x

Discussion

Explore the data

Retest on a cluster w/ data blocks (objects beyond 3KiB)

Cluster observability

System observability

Discussion

Sync

Possible issues in code

Queue in key-value store

Inlined objects, no data logic

35KB objects, ZFS data partition, record size 128K

Discussion

Samsung 990 EVO:

Samsung PM9A3:

Retry in production

Unifying key prefixes

Using queues for the todo trees

Using queues for block resync

Fjall backend

RocksDB backend