Performance not increasing with more nodes #998

Open
opened 2025-03-30 23:18:51 +00:00 by protocols · 17 comments

I added 3 more nodes to an existing 3-node cluster, but I do not see any performance increase when transferring files. It seems the load is merely balanced over all 6 nodes instead of throughput increasing by 2x.

Is there a delay before new nodes are fully active? E.g. is rebalancing happening? And how can I verify this?

Is there a global bottleneck because Garage has to check in with all nodes? Or should adding nodes be expected to increase write throughput?

Owner

It really depends on what your bottleneck is.

If your bottleneck is latency between nodes or CPU usage on your entry node (i.e. the node(s) receiving API calls), then you will not see much change.

If your bottleneck is disk I/O, then adding more nodes should help performance by spreading the load.

To check whether the new nodes are already being used, look at the layout history and/or the [metrics related to the various queues](https://garagehq.deuxfleurs.fr/documentation/reference-manual/monitoring/#metrics-of-the-data-block-manager).
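For instance, a rough way to check both from a node (assuming a recent Garage version with the layout history feature, and the admin API with metrics enabled on its default port 3903; `$METRICS_TOKEN` is a placeholder for your configured `metrics_token`, only needed if one is set):

```bash
# Has every node caught up with the latest layout version?
garage layout history

# Block-manager / queue metrics (exact metric names may differ by version)
curl -s -H "Authorization: Bearer $METRICS_TOKEN" http://localhost:3903/metrics \
  | grep -Ei 'resync|queue'
```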

Author

The nodes are fairly underutilized:

  • Dual CPU with 48 threads in total (CPU usage is low)
  • 256 GB RAM
  • Separate partitions for OS (SSD), Metadata (NVMe) and Data (HDDs)

![Screenshot 2025-04-01 at 02.55.11.png](/attachments/f973cb77-88b4-4a82-a18c-f891e3a92522)

Owner

Can you detail how you access the nodes? Is there a load balancer in front? Are you hitting a single node? What's the replication and the layout like?

Author

@maximilien

I have `replication_factor = 1` because we have local hardware RAID6 (with write cache) and because we want to maximize available storage, understanding the risks of a node failure.

```
$ sudo -Hu garage garage layout show
==== CURRENT CLUSTER LAYOUT ====
ID                Tags  Zone  Capacity  Usable capacity
6e848e18decd141c        rs1   105.0 TB  105.0 TB (100.0%)
c22894e004e03bef        rs1   105.0 TB  105.0 TB (100.0%)
c60958a9b2cf8ef7        rs1   105.0 TB  105.0 TB (100.0%)
c9baf0d8b006b231        rs1   105.0 TB  105.0 TB (100.0%)
d762c0f5469e0d37        rs1   105.0 TB  96.5 TB (91.9%)
ec4c749e91a42e46        rs1   105.0 TB  105.0 TB (100.0%)
f0c5ed3b36424349        rs1   105.0 TB  105.0 TB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 3
```

Each node has a 25 Gbit NIC; all nodes are within the same DC, and network performance has been verified via iperf3.

I'm using HAProxy on 6 of the 7 nodes, each with all nodes as backends, plus round-robin DNS across the 6 HAProxy instances. Meaning: traffic hits any of the 6 HAProxy nodes, gets distributed by HAProxy, and then Garage itself, I suppose, distributes it further.

I previously had Garage behind just a simple nginx per node with round-robin DNS, and performance was unstable because round-robin DNS is not very random, so it mostly hit the same node during the rclone lifecycle (currently running with 25 threads).

Owner

You might be able to squeeze a bit more perf by tuning the following two configuration params:

  • `block_size`: at least 10 MB, but there is nothing preventing you from trying even bigger values like 100 MB
  • `block_ram_buffer_max`: you can try setting this to several GB (I'd be curious to see if it creates any instability though). Since you have `replication_factor = 1`, this might not do much for you. The default of 256 MB is definitely too small if you increase the block size significantly
Author

@lx

both have already been increased from the start:

block_size = "30MiB"
block_ram_buffer_max = "8GiB"

From the bottom right of the previous screenshot you can see that buffer RAM usage is nowhere close to 8 GiB.

Owner

Depending on what the underlying hardware is, you would likely get better perf (but lower available storage) by ditching RAID6 and leveraging [multi-hdd](https://garagehq.deuxfleurs.fr/documentation/operations/multi-hdd/) support with `replication = 3`.

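As a rough sketch, such a setup could look like this in the config (paths and capacities are illustrative; see the multi-hdd docs linked above for the exact semantics):

```toml
replication_factor = 3

# One entry per physical drive instead of a single RAID6 volume; Garage
# spreads data blocks over them according to the declared capacities.
data_dir = [
    { path = "/mnt/hdd01", capacity = "12T" },
    { path = "/mnt/hdd02", capacity = "12T" },
    # ...one entry per disk
]
```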
Author

Unfortunately that is not an option because we would lose 1/3 of capacity.

But I also don't think that would help, because the problem would still exist: why does doubling the number of nodes not give any performance increase (not even 1%)?

Is there anything in Garage that is limiting? Is it because, with more nodes, every node needs to be contacted first?

Owner

In your case, if you upload files one by one, you'll always be limited by the I/O of a single node. Are you uploading multiple files in parallel? What does the iowait look like on the nodes?

Author

@maximilien As mentioned here: [#998 (comment)](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/998#issuecomment-12417), I am uploading using rclone with 25 threads. Increasing the number of threads did not increase throughput. File sizes are around 300–1000 MB.

Not sure where the assumption is coming from that I only upload 1 file ^^"

iowait is very low. Those are 12x 12 TB SAS drives per node.

I think the key point is being forgotten: I don't mind if Garage is slower than native file operations; my concern is that going from 3 to 7 nodes (same hardware) did not increase performance even a little bit.

Owner

Sorry I did not read that comment until the very end :)
Did you notice less performance with fewer clients? I.e. does it plateau after 20 clients?

Owner

Interested as well by what kind of workload clients are sending.

Author

> Did you notice less performance with fewer clients? I.e. does it plateau after 20 clients?

Going lower decreases throughput; going higher does not increase it.

> Interested as well by what kind of workload clients are sending.

Files in the range of 300 MB – 2 GB.

Owner

So throughput increases from 0 to 250 simultaneous uploads (250 = 20 clients * 25 threads), and then doesn't increase any more?

There is a 256 concurrency limit in the block manager of Garage (not a global limit, it's local to each node). This limit is currently hard-coded, but I'm interested to see whether increasing it in your cluster helps with throughput, or if the bottleneck is somewhere else.

If you'd like to try changing it, modify the value on [this line](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/block/manager.rs#L114) to a bigger value like 4096, recompile Garage and switch your cluster to the patched version. If increasing it does indeed help throughput on your cluster, I will consider making this a configurable value in the configuration file.
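For context, a per-node cap of this kind is essentially a counting semaphore around block operations. A minimal sketch of the pattern (names and structure are illustrative only, not Garage's actual code in `src/block/manager.rs`):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative constant; the real hard-coded value lives in src/block/manager.rs.
const BLOCK_OP_CONCURRENCY: usize = 256;

async fn handle_block_op(sem: Arc<Semaphore>, block_id: u64) {
    // Once 256 operations are in flight on this node, further calls wait
    // here until a permit is released.
    let _permit = sem.acquire_owned().await.expect("semaphore closed");
    // ... perform the actual block read/write ...
    println!("processed block {block_id}");
}

#[tokio::main]
async fn main() {
    let sem = Arc::new(Semaphore::new(BLOCK_OP_CONCURRENCY));
    let handles: Vec<_> = (0..1_000u64)
        .map(|id| tokio::spawn(handle_block_op(sem.clone(), id)))
        .collect();
    for h in handles {
        h.await.unwrap();
    }
}
```

With a cap like this, raising the constant only matters if requests actually queue on the semaphore; otherwise the bottleneck is elsewhere.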

I'm still a bit puzzled, however, because I also don't understand why adding more nodes doesn't help with throughput. There is nothing in Garage that requires global communication before data can be written. It would be interesting to check metrics on all of your nodes individually and see whether there isn't a single node with a bottleneck on CPU or networking, for instance.

lx added the kind/performance and action/more-info-needed labels 2025-04-02 18:02:00 +00:00
Author

Thanks for having a look <3

Sadly it did not make any difference.

Interestingly I can observe two things:

  1. there are not a lot of concurrent connections on HAProxy: about 30 in total spread across 7 nodes, so not a lot per node.

![Screenshot 2025-04-03 at 00.30.02.png](/attachments/63c55749-7ae2-41fb-a801-6c29dd719dcf)

  2. using the following rclone copy flags I can squeeze out a bit more performance, but not at a rate of 2x for the added nodes (also while the nodes are not under load):

```
--size-only --fast-list --s3-no-head --retries "3" --retries-sleep 3s --transfers=20 --checkers=20 --s3-upload-concurrency 32 --s3-chunk-size 24Mi --s3-disable-http2 --progress --stats-one-line
```

(HTTP/2 was a big issue with HAProxy on default settings)


Questions:

a) Am I assuming correctly that every node holds a copy of the other nodes' information? Or how does Garage know, if I upload a file to node1, whether it should store it on node1 and not node2?
b) Is there perhaps a limitation on single-core performance? Those are fairly old CPUs from 2016-ish. When pushing at 300 MB/s and a file is only 1 GB, that's about 3 s per file, hence also why the HAProxy graphs are very up and down. At that network speed it's almost like uploading many small files. So maybe something is not scaling vertically?

Owner

> a) Am I assuming correctly that every node holds a copy of the other nodes' information? Or how does Garage know, if I upload a file to node1, whether it should store it on node1 and not node2?

Garage basically works as a Distributed Hash Table: for each metadata entry (for instance, an object version) or data block, Garage uses a hash or a random UUID to determine which node is responsible for storing it. The chance of a hash being assigned to a node is proportional to the capacity you declared in the layout. Check these slides starting at page 10 for a visual explanation: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2023-09-20-ocp/talk.pdf
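As a toy sketch of that idea (this is not Garage's actual partition-based layout computation, just the principle): each node gets a share of the hash space proportional to its declared capacity, and the block's hash alone picks the node, with no global coordination on the write path.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy capacity-weighted placement: each node gets slots proportional to its
/// declared capacity, and the block id's hash selects a slot, hence a node.
fn node_for_block(block_id: &str, nodes: &[(&str, u64)]) -> String {
    let slots: Vec<&str> = nodes
        .iter()
        .flat_map(|(name, capacity_tb)| std::iter::repeat(*name).take(*capacity_tb as usize))
        .collect();
    let mut hasher = DefaultHasher::new();
    block_id.hash(&mut hasher);
    slots[(hasher.finish() as usize) % slots.len()].to_string()
}

fn main() {
    // Equal capacities (in TB), so each node is equally likely to receive any block.
    let nodes: [(&str, u64); 4] = [("node1", 105), ("node2", 105), ("node3", 105), ("node4", 105)];
    for block in ["blk-a", "blk-b", "blk-c"] {
        println!("{block} -> {}", node_for_block(block, &nodes));
    }
}
```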

> b) Is there perhaps a limitation on single-core performance? Those are fairly old CPUs from 2016-ish. When pushing at 300 MB/s and a file is only 1 GB, that's about 3 s per file, hence also why the HAProxy graphs are very up and down. At that network speed it's almost like uploading many small files. So maybe something is not scaling vertically?

Single core performance will limit the upload speed for a single S3 request, but for multiple requests the load should be spread over all available CPUs. To confirm that there is no issue with this, can you check how many threads a Garage process is running? It should be at least the number of CPU threads or maybe twice that number (we are using Tokio which is responsible for creating these threads and spreading the load).

My turn to ask questions:

  1. Does `--retries "3" --retries-sleep 3s` change something? If so, why is your client retrying requests? Do some requests fail? With what error code?
  2. Is HAProxy rejecting connections, putting them on hold, or otherwise limiting the number of concurrent connections?
  3. What happens if you remove HAProxy and configure your rclone commands to talk directly to the Garage daemon? The best possible scenario would be to configure rclone to talk to `http://localhost:3900`. If you are using rclone from a non-Garage node, you can try a/ selecting one Garage node as an entrypoint to the cluster and directing all requests to that node, or b/ running a Garage gateway node on the client.
  4. Did you try other reverse proxies/load balancers? For instance, I think nginx can be configured to redirect requests to multiple backend servers.
Author

> can you check how many threads a Garage process is running

Definitely plenty (so that at least does not seem to be the issue).

  1. This is just a precaution; I see it in the logs when it happens, and it's not happening generally.

  2. No, the number of allowed connections has been increased significantly.

  3. Tried it out; no direct improvement. I'd say even slightly slower?

  4. Yes, I tried nginx first, and with HTTP/2 on its default config it was faster, but I was able to match that with HAProxy and HTTP/1.1.

Disk performance per node:

```
sudo dd if=/dev/zero of=/data/testfile bs=1G count=5 oflag=dsync
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 6.84355 s, 784 MB/s
```

I know this is not super representative because it's an ideal-scenario copy operation, but for now it does not seem like there is a hardware limitation, unless of course single-thread performance is an issue.

Unless for some reason the block size of 30 MiB is the limiting factor? But that should not be a global limit?
