Performance not increasing with more nodes #998
I added 3 more nodes to an existing 3-node cluster, but I do not see any performance increase when transferring files. It seems rather that the load is balanced over all 6 nodes instead of throughput increasing by 2x.
Is there a delay before new nodes are fully active? E.g. is rebalancing still happening? And how can I verify this?
Is there a global bottleneck because Garage has to check in with all nodes? Or should adding nodes be expected to increase write throughput?
It really depends on what your bottleneck is.
If your bottleneck is latency between nodes or CPU usage on your entry node (i.e. the node(s) receiving API calls), then you will not see much change.
If your bottleneck is disk I/O, then adding more nodes should help performance by spreading the load.
To check whether the new nodes are already being used, look at the layout history and/or the metrics related to the various queues.
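For example (exact subcommands depend on your Garage version, so take this as a sketch):

```
# is the new layout applied and are all nodes up?
garage status
garage layout show

# on recent versions (>= 1.0) this shows whether all nodes
# have caught up with the latest layout version
garage layout history

# queue lengths (resync, sync, ...); run it against each node,
# or with the all-nodes flag if your version has one
garage stats
```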
The nodes are fairly underutilized:
Can you detail how you access the nodes? Is there a load balancer in front? Are you hitting a single node? What's the replication and the layout like?
@maximilien I have `replication_factor = 1` because we have local hardware RAID6 (with write cache) and because we want to maximize available storage, understanding the risks of a node failure. Each node has 25 Gbit NICs, all within the same DC, and I also verified network performance via iperf3.
We use haproxy on 6/7 nodes, each with all nodes as backends, plus round-robin DNS across all 6 haproxies. Meaning: traffic hits any of the 6/7 nodes, gets distributed by haproxy, and then Garage itself, I suppose, distributes it further.
I previously had Garage behind just a simple nginx on each node with round-robin DNS, and performance was unstable because round-robin DNS is not very random, so it mostly hit the same node during an rclone run (currently running with 25 threads).
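Roughly like this on each haproxy (a simplified sketch, not my actual config; names, ports and addresses are made up):

```
frontend s3_in
    bind *:3900
    mode http
    default_backend garage_s3

backend garage_s3
    mode http
    balance roundrobin
    server garage1 10.0.0.1:3900 check
    server garage2 10.0.0.2:3900 check
    server garage3 10.0.0.3:3900 check
    # ...one line per Garage node
```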
You might be able to squeeze a bit more perf by tuning the following two configuration params:
- `block_size`: at least 10 MB, but there is nothing preventing you from trying even bigger values like 100 MB
- `block_ram_buffer_max`: you can try setting this to several GB (I'd be curious to see if it creates any instability, though). Since you have `replication_factor = 1`, this might not do much for you. The default of 256 MB is definitely too small if you increase the block size significantly.
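For example, something along these lines in garage.toml (values are only an illustration; recent versions accept human-readable sizes, older ones want plain bytes):

```
# garage.toml (excerpt)
block_size = "10M"            # try larger, e.g. "100M"
block_ram_buffer_max = "2G"   # default 256MB is too small for large blocks
```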
@lx Both have already been increased from the start:
From the previous screenshot (bottom right) you can see that buffer RAM usage is nowhere near 8 GiB.
Depending on what the underlying hardware is, you would likely get better perf (but lower available storage) by ditching RAID6 and leveraging multi-HDD support with `replication = 3`.
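For reference, multi-HDD support (Garage >= 0.9) is configured roughly like this, with one entry per drive instead of one RAID volume (paths and capacities made up):

```
data_dir = [
    { path = "/mnt/hdd01", capacity = "12T" },
    { path = "/mnt/hdd02", capacity = "12T" },
    # ...one entry per disk
]
```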
Unfortunately that is not an option, because we would lose 1/3 of capacity.
But I also don't think that would help, because the problem would still exist: why does doubling the number of nodes not give any performance increase (not even 1%)?
Is anything in Garage limiting this? Is it because, with more nodes, every node needs to be contacted first?
In your case, if you upload files one by one, you'll always be limited by the I/O of one node. Are you uploading multiple files in parallel? What does the iowait look like on the nodes?
@maximilien As mentioned here: #998 (comment), I am uploading using rclone with 25 threads. Increasing the thread count did not increase throughput. File sizes are around 300-1000 MB.
Not sure where the assumption comes from that I only upload 1 file ^^"
iowait is very low. Those are 12x 12 TB SAS drives per node.
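The uploads are driven by something of this shape (illustrative; the remote name and the tuning flags here are placeholders, only the retry flags are the ones I actually use):

```
rclone copy /data/outgoing garage:my-bucket \
    --transfers 25 \
    --retries "3" --retries-sleep 3s \
    --s3-upload-concurrency 4 \
    --s3-chunk-size 64M
```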
I think the key point is being forgotten: I don't mind if Garage is slower than native file operations; my concern is that going from 3 to 7 nodes (same hardware) did not increase performance even a little bit.
Sorry, I did not read that comment all the way to the end :)
Did you notice less performance with fewer clients? I.e., does it plateau after 20 clients?
I'm also interested in what kind of workload the clients are sending.
Going lower decreases throughput; increasing does not increase it.
Files are in the range of 300 MB - 2 GB.
So throughput increases from 0 to 250 simultaneous uploads (250 = 20 clients * 25 threads), and then doesn't increase any more?
There is a 256 concurrency limit in the block manager of Garage (not a global limit, it's local on each node). This limit is currently hard-coded but I'm interested to see whether increasing it in your cluster helps with throughput, or if the bottleneck is somewhere else.
If you'd like to try changing it, modify the value on this line to a bigger value like 4096, recompile Garage, and switch your cluster to the patched version. If increasing it does indeed help throughput on your cluster, I will consider making this a configurable value in the configuration file.
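Roughly, assuming a standard Rust toolchain (adapt to however you normally build and deploy Garage):

```
git clone https://git.deuxfleurs.fr/Deuxfleurs/garage.git
cd garage
# edit the hard-coded 256 limit mentioned above, e.g. bump it to 4096
cargo build --release
# then roll out ./target/release/garage to the nodes and restart them
```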
I'm still a bit puzzled however because I also don't understand why adding more nodes doesn't help with throughput. There is nothing in Garage that requires global communication before data can be written. It would be interesting to check metrics on all of your nodes individually and see if there is not a single node that has a bottleneck on CPU or networking for instance.
Thanks for having a look <3
Sadly it did not make any difference.
Interestingly I can observe two things:
(http2 was a big issue with haproxy on default settings)
Questions:
a) Am I assuming correctly that every node holds a copy of the other nodes' information? Or how does Garage know, if I upload a file to node1, whether it should store it on node1 and not node2?
b) Is there perhaps a limitation on single-core performance? Those are fairly old CPUs from around 2016. When pushing at 300 MB/s and a file is only 1 GB, that's about 3 s per file, which is also why the haproxy graphs are very up and down. At that network speed it's almost like uploading many small files. So maybe something is not scaling vertically?
Garage basically works as a Distributed Hash Table: for each metadata entry (for instance, an object version) or data block, Garage uses a hash or a random UUID to determine which node is responsible for storing it. The chance of a hash being assigned to a node is proportional to the capacity you declared in the layout. Check these slides starting at page 10 for a visual explanation: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/doc/talks/2023-09-20-ocp/talk.pdf
Single-core performance will limit the upload speed of a single S3 request, but for multiple requests the load should be spread over all available CPUs. To confirm there is no issue with this, can you check how many threads the Garage process is running? It should be at least the number of CPU threads, or maybe twice that number (we are using Tokio, which is responsible for creating these threads and spreading the load).
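For example, on Linux (standard tools, assuming the process is simply called garage):

```
# number of OS threads of the garage process
ps -o nlwp= -p "$(pidof garage)"
# or
grep Threads /proc/"$(pidof garage)"/status
```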
My turn to ask questions:
- Does `--retries "3" --retries-sleep 3s` change something? If so, why is your client retrying requests? Do some requests fail? With what error code?
- Is rclone hitting `http://localhost:3900` on the nodes themselves? If you are using rclone from a non-garage node, you can try a/ selecting one garage node as an entrypoint in the cluster and directing all requests to that node, or b/ running a Garage gateway node on the client.

Definitely plenty (so that at least does not seem to be the issue).
This is just a precaution - I see it in the logs when it happens, and it's not happening generally.
No, the number of connections has been increased significantly.
Tried it out, no direct improvement. I'd say even slightly slower?
Yes, I tried nginx first, and with http2 on default config it was faster, but I was able to match it with haproxy and http1.1.
Disk performance per node:
I know this is not super representative because it's an ideal-scenario copy operation. But for now it does not seem like there is a hardware limitation - unless, of course, single-thread performance is an issue.
Unless for some reason the block size of 30 MiB is the limiting factor? But that should not be a global limit?
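For a less ideal-case disk number, a random-write test on the data directory would probably be more telling; something like this (parameters and path are illustrative):

```
fio --name=randwrite --directory=/mnt/garage-data \
    --rw=randwrite --bs=1M --size=4G \
    --numjobs=8 --iodepth=16 --ioengine=libaio \
    --direct=1 --group_reporting
```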