Poor I/O performance when writing to HDD #288
I followed the quickstart, with Garage 0.7.0 on an amd64 server. Overall, it works well :)

However, when writing big files to a local bucket with `mc`, I/O performance on the HDD is not very good: it immediately jumps to 100% I/O usage with around 200 IOPS and 20 MB/s of writes. `htop` shows that some Garage threads are blocked on I/O. This disk model should reach 200 MB/s in sequential writes, so we are far from that. On the other hand, a maximum of 200 IOPS is consistent with random writes on this hardware.

So, I expected writes to be sequential, but it seems that Garage is actually doing many small random writes (or maybe calling `sync` after each small sequential write?). The size of the writes done by Garage seems to vary between 64 KB and 128 KB; it is not fixed (20 MB/s at 200 IOPS works out to roughly 100 KB per write, which matches).

I replicated these results both in a single-node setup and in a 3-node setup: the behaviour is the same. In the 3-node setup, all nodes exhibit the same kind of I/O pattern on their HDD (I am not limited by the network).

I also tried on a machine with an SSD, and it goes up to 300 IOPS and 60 MB/s, with around 30% I/O usage. It seems that Garage is CPU-bound in that case. Interestingly, the size of the writes done by Garage is somewhat bigger in this setup.
Ok, I found the `block_size` option. I did some quick tests with varying `block_size`; there is some variance, but the trend is clear: with fast hardware and large files, a larger block size is necessary for good performance. 20 MB is enough in my case.

Still, it seems that Garage writes data in small chunks even if the block size is large. This may be something to optimize.
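For intuition about why a larger block size helps with large files, here is a back-of-the-envelope sketch (not Garage code; it simply assumes an object is split into ceil(size / block_size) blocks, as described in this thread and in #139, so every extra block means another file to create, track, and sync):

```rust
/// Number of blocks produced when an object is split into chunks of
/// at most `block_size` bytes (ceiling division).
fn blocks_for(object_size: u64, block_size: u64) -> u64 {
    (object_size + block_size - 1) / block_size
}

fn main() {
    let object = 10 * 1024 * 1024 * 1024u64; // a 10 GiB upload
    for block_size in [1u64 << 20, 10 << 20, 20 << 20, 100 << 20] {
        println!(
            "block_size = {:>3} MiB -> {:>5} blocks to write and track",
            block_size >> 20,
            blocks_for(object, block_size)
        );
    }
}
```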
Can you give a bit more detail on the configuration of your underlying storage (specifically the filesystem, the block size, and the kernel version)?

Garage also has metadata to write; was that stored on a different disk/partition?
Be careful with `block_size`: it also has an impact on the network side. Blocks are used to multiplex reads and writes over a single TCP socket between clients. If you have a 1 GB block, it means that while it is being transferred, no other block will be served from the same server. Additionally, if you encounter a network error, all blocks that were only partially transferred will be transferred again from scratch.

You can learn more here: #139
We have not optimized Garage for performance yet, but we plan to do it this summer. So currently, you may be able to find some low-hanging fruit in terms of performance optimization if you are ready to propose patches. But keep in mind that some `sync` calls are required for correctness and data durability.

As @maximilien said, it is important to differentiate metadata and data. In production, we store metadata on SSD and data on HDD. You should re-run your tests to see whether the high number of IOPS and the small random writes are due to metadata or to data.

If it is due to metadata, the problem is linked to sled, the embedded database we use. We plan to abstract over the embedded database and try some alternatives (see #284 to learn why). If it is due to data, i.e. simply writing chunks of 1 MB (with the default configuration, on a large-file workload) to disk, this is our own code. We use tokio.rs to perform our I/O. Feel free to profile the application to give us more insight into where in the code the sync is triggered. I would be surprised if we called any sync logic before writing a full block.
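As a rough illustration of the write path being discussed, here is a minimal sketch using tokio's file API (write_block is a hypothetical helper, not Garage's actual code): the block is streamed to disk in small buffered writes, but the durability barrier is paid only once per block. Calling sync_all after every chunk instead would produce exactly the many-small-synchronous-writes pattern reported above.

```rust
use tokio::fs::File;
use tokio::io::AsyncWriteExt;

/// Write one data block to disk and fsync once at the end.
async fn write_block(path: &std::path::Path, block: &[u8]) -> std::io::Result<()> {
    let mut file = File::create(path).await?;
    for chunk in block.chunks(64 * 1024) {
        // Buffered sequential writes; no fsync here. Syncing after every
        // chunk would turn the workload into many small synchronous writes.
        file.write_all(chunk).await?;
    }
    file.sync_all().await?; // one durability barrier per block
    Ok(())
}
```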
Sorry for the delay; for some reason I didn't receive any email from Gitea about these tickets.

This was on a really standard Debian 11 system with a 5.10 kernel, an ext4 filesystem, and a 4096-byte filesystem block size.

I did read the warning about metadata performance, so I even put the metadata in /dev/shm/ to be certain it is fast. So I'm pretty sure the I/O on the disk is 100% related to the data.
Thanks for the link to the other issue, I had missed it. I don't think I will have time to dive into the code or do proper profiling, sorry, but I agree this would be the way to go.
It appears that one likely cause of these performance issues is that we write blocks sequentially: we wait for one block to be fully written before writing the next one. We plan to work on this in the coming months; if you have more input on performance issues, please report it here.
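To make the difference concrete, here is a minimal sketch (hypothetical helpers, not Garage's code and not necessarily what the planned change does) contrasting strictly sequential block writes with keeping a few writes in flight at once, using the futures crate:

```rust
use futures::stream::{self, StreamExt, TryStreamExt};

// Hypothetical stand-in for "write one block to disk / send it to a node".
async fn write_block(_idx: usize, _data: Vec<u8>) -> std::io::Result<()> {
    Ok(())
}

// Current behaviour as described above: the next block only starts
// once the previous one has been fully written.
async fn upload_sequential(blocks: Vec<Vec<u8>>) -> std::io::Result<()> {
    for (i, b) in blocks.into_iter().enumerate() {
        write_block(i, b).await?;
    }
    Ok(())
}

// Alternative: keep a bounded number of block writes in flight so the
// disk and the network stay busy.
async fn upload_concurrent(blocks: Vec<Vec<u8>>) -> std::io::Result<()> {
    let _done: Vec<()> = stream::iter(blocks.into_iter().enumerate())
        .map(|(i, b)| write_block(i, b))
        .buffer_unordered(4) // at most 4 blocks in flight
        .try_collect()
        .await?;
    Ok(())
}
```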
#342 should improve this significantly, and #343 might help as well. We should benchmark this as part of the results we will submit to NGI for milestone 4.