Poor I/O performance when writing to HDD #288

Open
opened 8 months ago by baptiste · 6 comments
Owner

I followed the quickstart, with garage 0.7.0 on an amd64 server. Overall, it works well :)

However, when writing big files to a local bucket with `mc`, I/O performance on the HDD is not very good: it immediately jumps to 100% I/O usage with around 200 IOPS and 20 MB/s of writes. `htop` shows that some garage threads are blocked on I/O. This disk model should go up to 200 MB/s with sequential writes, so we are far from it. On the other hand, a maximum of 200 IOPS is consistent with random writes on this hardware.

So, I expected writes to be sequential, but it seems that garage is actually doing many small random writes (or maybe calling `sync` after each small sequential write?). The size of the writes done by garage seems to vary between 64 KB and 128 KB; it is not fixed.

I replicated these results both in a single-node setup and in a 3-node setup, and the behavior is the same. In the 3-node setup, all nodes exhibit the same kind of I/O pattern on their HDD (I'm not limited by the network).

I also tried on a machine with an SSD, and it goes up to 300 IOPS and 60 MB/s, with around 30% I/O usage. It seems that garage is CPU-bound in that case. Interestingly, the size of the writes done by garage is somewhat bigger in this setup.
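
As a quick sanity check on the figures above, the average write size follows from dividing throughput by IOPS. The sketch below uses only the numbers reported in this comment and assumes decimal units (1 MB = 1000 KB); the function name is made up for illustration.

```python
def average_write_size_kb(throughput_mb_s: float, iops: float) -> float:
    """Average size of one write in KB, assuming 1 MB = 1000 KB."""
    return throughput_mb_s * 1000 / iops


# Figures reported above for the HDD and the SSD.
hdd_kb = average_write_size_kb(throughput_mb_s=20, iops=200)
ssd_kb = average_write_size_kb(throughput_mb_s=60, iops=300)

print(f"HDD: {hdd_kb:.0f} KB per write")
print(f"SSD: {ssd_kb:.0f} KB per write")
```

The HDD result falls inside the observed 64 KB to 128 KB range, and the SSD result is about twice as large, which matches the observation that writes are bigger on the SSD.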

Poster
Owner

OK, I found the `block_size` option. I did some quick tests with varying `block_size`; there is some variance, but the trend is clear:

# block_size = 102400 (100 KB)
140 IOPS, 5 MB/s, 100% I/O usage. Average write size: 35 KB

# block_size = 1048576 (1 MB, default)
160 IOPS, 20 MB/s, 80% I/O usage. Average write size: 125 KB

# block_size = 10485760 (10 MB)
280 IOPS, 60 MB/s, 60% I/O usage. Average write size: 220 KB

# block_size = 20971520 (20 MB)
415 IOPS, 94 MB/s, 77% I/O usage. Average write size: 230 KB

# block_size = 41943040 (40 MB)
420 IOPS, 95 MB/s, 78% I/O usage. Average write size: 230 KB

# block_size = 104857600 (100 MB)
440 IOPS, 100 MB/s, 80% I/O usage. Average write size: 230 KB

# block_size = 1073741824 (1 GB)
440 IOPS, 100 MB/s, 80% I/O usage. Average write size: 230 KB

So, with fast hardware and large files, a larger block size is necessary for good performance. 20 MB is enough in my case.

Still, it seems that garage writes data in small chunks even if the block size is large. This may be something to optimize.
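
The two conclusions above can be read directly off the measurements. The sketch below copies the reported numbers into a table and looks for the smallest `block_size` that gets close to peak throughput; the 90% threshold and the helper name are arbitrary choices for illustration, not anything defined by Garage.

```python
# (block_size in bytes, IOPS, throughput in MB/s, average write size in KB),
# copied from the measurements above.
RESULTS = [
    (102_400, 140, 5, 35),
    (1_048_576, 160, 20, 125),
    (10_485_760, 280, 60, 220),
    (20_971_520, 415, 94, 230),
    (41_943_040, 420, 95, 230),
    (104_857_600, 440, 100, 230),
    (1_073_741_824, 440, 100, 230),
]


def smallest_block_size_reaching(fraction: float) -> int:
    """Smallest block_size whose throughput is at least `fraction` of the peak."""
    peak = max(mb_s for _, _, mb_s, _ in RESULTS)
    return min(bs for bs, _, mb_s, _ in RESULTS if mb_s >= fraction * peak)


knee = smallest_block_size_reaching(0.9)  # 0.9 is an arbitrary threshold
plateau_kb = max(write_kb for _, _, _, write_kb in RESULTS)

print(f"block_size reaching 90% of peak throughput: {knee} bytes")
print(f"largest average write size observed: {plateau_kb} KB")
```

With these numbers the knee is at the 20 MB setting, and the average write size stops growing at 230 KB no matter how far `block_size` is raised beyond that.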


Can you give a bit more detail on the configuration of your underlying storage (specifically the filesystem, its block size, and the kernel version)?

Garage also has metadata to write; was that set on a different disk/partition?

Owner

Be careful with `block_size`: it also has some impact on the network side. Blocks are used to multiplex reads and writes over a single TCP socket between clients. If you have a 1 GB block, it means that while you transfer it, no other block will be served from the same server. Additionally, if you encounter a network error, all blocks that were only partially transferred will be transferred again from scratch.
You can learn more in [#139](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/139).
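
To get a feel for how long a single block keeps the connection busy, the sketch below estimates the transfer time per block. The 1 Gbit/s link speed is an assumption made for illustration (no link speed is given in this thread), and protocol overhead is ignored.

```python
ASSUMED_LINK_BITS_PER_S = 1_000_000_000  # assumption: 1 Gbit/s link, no overhead


def block_transfer_seconds(block_size_bytes: int,
                           link_bits_per_s: int = ASSUMED_LINK_BITS_PER_S) -> float:
    """Time one block keeps the connection busy, ignoring protocol overhead."""
    return block_size_bytes * 8 / link_bits_per_s


for block_size in (1_048_576, 10_485_760, 1_073_741_824):
    seconds = block_transfer_seconds(block_size)
    print(f"block_size = {block_size}: {seconds:.3f} s per block")
```

Under this assumption a 1 MB block occupies the link for a few milliseconds, while a 1 GB block occupies it for several seconds, and a network error near the end of that transfer means sending the whole block again.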

We have not optimized Garage for performance yet, but we plan to do it this summer. So currently, you may be able to find low-hanging fruit in terms of performance optimization if you are ready to propose patches. But keep in mind that some `sync` calls are required for correctness and data durability.

As @maximilien said, it is important to differentiate metadata and data. In production, we store metadata on SSD and data on HDD. You should re-run your tests to see whether the high number of IOPS and sequential reads are due to metadata or data.
If it's due to metadata, the problem is linked with sled, the embedded database we use. We plan to abstract the embedded database and try some alternatives (see [#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284) to learn why). If it's due to data, i.e. simply writing chunks of 1 MB (considering the default configuration, on a large-file workload) to disk, this is our own code. We use [tokio.rs](https://tokio.rs/) to perform our I/O. Feel free to profile the application to give us more insight into where the code triggers the sync. I would be surprised if we called any sync logic before writing a full block.
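
The two patterns discussed in this thread (syncing after every small write versus syncing once per full block) differ only in how many sync calls are issued. The sketch below is a standalone Python illustration using `os.write` and `os.fsync`; it is not Garage's code, and the 64 KB chunk size, the file names, and the `write_block` helper are assumptions chosen to mirror the numbers mentioned earlier.

```python
import os
import tempfile

CHUNK = 64 * 1024    # assumed write size, within the range reported in this thread
BLOCK = 1024 * 1024  # default block_size mentioned in this thread


def write_block(path: str, data: bytes, sync_every_chunk: bool) -> int:
    """Write `data` in CHUNK-sized pieces and return the number of fsync calls."""
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for offset in range(0, len(data), CHUNK):
            view = data[offset:offset + CHUNK]
            written = 0
            while written < len(view):  # os.write may write fewer bytes than asked
                written += os.write(fd, view[written:])
            if sync_every_chunk:
                os.fsync(fd)
                syncs += 1
        if not sync_every_chunk:
            os.fsync(fd)  # a single sync after the whole block has been written
            syncs += 1
    finally:
        os.close(fd)
    return syncs


with tempfile.TemporaryDirectory() as tmp:
    block = os.urandom(BLOCK)
    per_chunk = write_block(os.path.join(tmp, "a"), block, sync_every_chunk=True)
    per_block = write_block(os.path.join(tmp, "b"), block, sync_every_chunk=False)
    size_a = os.path.getsize(os.path.join(tmp, "a"))
    size_b = os.path.getsize(os.path.join(tmp, "b"))

print(f"sync per chunk: {per_chunk} fsync calls; sync per block: {per_block}")
```

Both variants produce the same file, but the per-chunk variant asks the disk to commit data many more times per block, which on a rotating disk would be expected to show up as an IOPS limit rather than a bandwidth limit. Profiling the running process would tell which of the two patterns actually applies.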

quentin added the Performance label 8 months ago
Poster
Owner

Sorry for the delay; for some reason I didn't receive any email from Gitea about these tickets.

This was on a really standard Debian 11 system with a 5.10 kernel, an ext4 filesystem, and a 4096-byte filesystem block size.

I did read the warning about metadata performance, so I even put metadata in `/dev/shm/` to be certain it's fast. So I'm pretty sure the I/O on the disk is 100% related to the data.

Thanks for the link to the other issue, I had missed it. I don't think I will have time to dive into the code or do proper profiling, sorry, but I agree this would be the way to go.

Owner

It appears that one possible cause of the performance issues is that we write blocks sequentially (we wait until one block has been written before writing the next one). We plan to work on this in a few months; if you have more input on performance issues, please report it here.
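
The difference between waiting for each block and keeping several blocks in flight can be sketched as follows. This is a generic Python illustration with a thread pool, not how Garage implements or plans to implement it; the helper names, the file naming, and the `max_in_flight` value of 4 are all assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 1024 * 1024  # default block_size mentioned in this thread


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut an object into consecutive blocks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def write_one(directory, index, block):
    """Write one block to its own file and sync it before returning."""
    path = os.path.join(directory, f"block-{index:04d}")
    with open(path, "wb") as f:
        f.write(block)
        f.flush()
        os.fsync(f.fileno())
    return path


def write_sequential(directory, blocks):
    # Each block is fully written and synced before the next one starts.
    return [write_one(directory, i, b) for i, b in enumerate(blocks)]


def write_concurrent(directory, blocks, max_in_flight=4):
    # Up to `max_in_flight` blocks are written at the same time.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(write_one, directory, i, b)
                   for i, b in enumerate(blocks)]
        return [f.result() for f in futures]


def read_back(paths):
    chunks = []
    for p in paths:
        with open(p, "rb") as f:
            chunks.append(f.read())
    return b"".join(chunks)


data = os.urandom(4 * BLOCK_SIZE + 123)
blocks = split_blocks(data)
with tempfile.TemporaryDirectory() as seq_dir, \
        tempfile.TemporaryDirectory() as con_dir:
    seq_data = read_back(write_sequential(seq_dir, blocks))
    con_data = read_back(write_concurrent(con_dir, blocks))

print(len(blocks), seq_data == data, con_data == data)
```

Both strategies store the same bytes; the concurrent one simply gives the disk and the kernel more outstanding work to schedule, which is where a throughput gain would be expected to come from.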

Owner

#342 should improve this significantly, and #343 might help as well. We should benchmark this as part of the results we will submit to NGI for milestone 4.

lx added this to the v0.8 milestone 3 months ago
lx modified the milestone from v0.8 to v0.9 3 months ago
Reference: Deuxfleurs/garage#288