Exceedingly slow performance for s3fs and garage 0.8.2 #668

Closed
opened 2023-11-16 01:59:04 +00:00 by mitchty · 1 comment

Ola, so for my home NAS I've been testing out garage as my S3 backend, mostly to back Plex as well as to serve as a restic target. Apologies for the mini thesis, but I figured I'd write up as much information as I could. If it helps I can link to my NixOS configuration flake, since I set everything up that way, but that might be too much information.

All timings here are for 0.8.2; I can upgrade to 0.9 shortly, once I back everything up to my older NAS running seaweedfs, but I figure the overall experience of debugging the slowness might help.

The issue I'm hitting is that s3fs performance is... well, abysmal: roughly the best I can get out of local s3fs filesystem copies is a sustained 5MiB/s, sometimes less, when copying a multi-gibibyte file.

In case it helps, the hardware I'm running on is these systems:
https://www.terra-master.com/us/f4-4736.html

I've configured garage to use the secondary 2.5G NICs for all of its RPC traffic, to remove any possibility of NIC contention on the front end. (I think? Maybe I set that up wrong.)

Running NixOS 23.05 (for now).

The kernel is this (somewhat constrained by me using zfs for / for now, but reasonably modern):

[root@cl1:~]# uname -a
Linux cl1 6.1.52 #1-NixOS SMP PREEMPT_DYNAMIC Wed Sep  6 20:27:03 UTC 2023 x86_64 GNU/Linux

The layout of block devices is as follows. (Note zfs for the nvme drives, outside of the /boot partitions and swap, which aren't pertinent. The main /data/disk/0 partition is plain journaled ext4, which seems fine for the use case, but I can swap it out for xfs or whatever if it's an issue.)

[root@cl1:~]# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0  18.2T  0 disk  /data/disk/0
nvme0n1     259:0    0 931.5G  0 disk  
├─nvme0n1p1 259:2    0   512M  0 part  /boot
├─nvme0n1p2 259:3    0    16G  0 part  
│ └─md127     9:127  0    16G  0 raid1 [SWAP]
├─nvme0n1p3 259:4    0   768G  0 part  
└─nvme0n1p4 259:5    0   147G  0 part  
nvme1n1     259:1    0 931.5G  0 disk  
├─nvme1n1p1 259:6    0   512M  0 part  /boot1
├─nvme1n1p2 259:7    0    16G  0 part  
│ └─md127     9:127  0    16G  0 raid1 [SWAP]
├─nvme1n1p3 259:8    0   768G  0 part  
└─nvme1n1p4 259:9    0   147G  0 part  

[root@cl1:~]# lsblk -d -o +MODEL
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS  MODEL
sda       8:0    0  18.2T  0 disk /data/disk/0 WDC WUH722020ALE6L4
nvme0n1 259:0    0 931.5G  0 disk              Samsung SSD 970 EVO Plus 1TB
nvme1n1 259:1    0 931.5G  0 disk              Samsung SSD 970 EVO Plus 1TB

i.e. the lmdb db is on those two nvme*n1 devices, which are a simple zfs mirror, and all the bulk data is on a 20TB WDC drive that should be able to achieve almost 300MiB/s for sequential operations.

My configuration toml: I am using a replication mode of 2, as my main goal is mostly to not have my singular NAS die on me. I've tried tweaking block_size up from 1MiB to 10MiB (shown after the config below) but that didn't do a ton. Keep in mind that most of what I will store here is going to be bulk data, and the writes should amount to straight-up copies over s3fs:

[root@cl1:~]# cat /etc/garage.toml 
block_size = 1048576
bootstrap_peers = omitted for brevity
data_dir = "/data/disk/0"
db_engine = "lmdb"
metadata_dir = "/data/garage"
replication_mode = "2"
rpc_bind_addr = "[::]:3901"
rpc_public_addr = "192.168.254.1:3901"
rpc_secret = "omitted"

[admin]
api_bind_addr = "0.0.0.0:3903"

[s3_api]
api_bind_addr = "0.0.0.0:3900"
root_domain = "cluster.home.arpa"
s3_region = "garage"

[s3_web]
bind_addr = "0.0.0.0:3902"
index = "index.html"
root_domain = "cluster.home.arpa"
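For reference, the 10MiB variant I tried amounted to changing just this one line and leaving everything else alone (block_size is in bytes, so 10MiB is 10485760):

block_size = 10485760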

For s3fs my mount line is as follows (ignore the /nix stuff; it's a full path for reasons not worth getting into here, relating to fuse mounts and NixOS):

/nix/store/43mvc136l9x6l67zshhli81vmnynbns5-s3fs-fuse-1.91/bin/s3fs#media-s3fs:/tv /s3/s3fs fuse _netdev,curldbg,use_cache=/tmp/s3/s3fs,passwd_file=/run/agenix/s3/bucket-media,allow_other,mp_umask=022,uid=3000,gid=3000,url=http://cluster.home.arpa:3900,use_path_request_style,list_object_max_keys=10000,parallel_count=5,multipart_size=1024,multipart_copy_size=1024,max_dirty_data=1024 0 0

I've tried tweaking s3fs's parallel count, multipart size, copy size, dirty data, etc. all over the map, but part of what seems to be happening looks related to a new 0.9.0 feature; from the changelog:
Other changes since v0.8.4:

Optimal layout assignation algorithm (#296)
Full multipart upload semantics (#204, #553)

I've seen multipart uploads fail on specific parts, with an error (I can't find it in my logs right now) about needing to upload parts in sequence. I presume the full multipart upload semantics in 0.9 would help with that?
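For concreteness, a representative mount from that sweep looks something like this (the specific values are just examples of what I've been trying, not recommendations):

s3fs media-s3fs:/tv /s3/s3fs \
  -o url=http://cluster.home.arpa:3900 \
  -o use_path_request_style \
  -o passwd_file=/run/agenix/s3/bucket-media \
  -o parallel_count=10 \
  -o multipart_size=64 \
  -o max_dirty_data=512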

Additionally, I've set up the prometheus exporter and done a scrub to get a better idea of how much data can be read, and it seems to roughly cap out at 50MiB/s even when tranquility is set to 0 via:
garage repair --yes scrub set-tranquility 0
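(In case it's useful, the raw numbers can also be eyeballed by scraping the admin endpoint directly; this assumes metrics are served at /metrics on the admin port and that, with no metrics_token set in my config, no auth is needed:)

[root@cl1:~]# curl -s http://localhost:3903/metrics | grep -i block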

These WDC drives can definitely do better than that, however; they are allegedly CMR drives, not SMR, but I do see some... fun behavior depending on how many threads do writes at once, which I think may explain some of the observed behavior.

[nix-shell:/data/disk/0]# for x in 1 5 10 20 50; do
> fio --name=fio-test --ioengine=posixaio --rw=write --bs=4k --numjobs=${x} --size=1g --iodepth=1 --runtime=60 --time_based --end_fsync=1 | tail
> done
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2168120,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=138MiB/s (145MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s), io=8469MiB (8881MB), run=61451-61451msec

Disk stats (read/write):
  sda: ios=0/18230, merge=0/5146, ticks=0/315843, in_queue=319219, util=73.96%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,718047,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=169MiB/s (178MB/s), 29.9MiB/s-41.8MiB/s (31.4MB/s-43.8MB/s), io=11.8GiB (12.6GB), run=61494-71064msec

Disk stats (read/write):
  sda: ios=0/19088, merge=0/1802, ticks=0/483895, in_queue=489424, util=98.25%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,279478,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=176MiB/s (184MB/s), 16.7MiB/s-25.1MiB/s (17.5MB/s-26.4MB/s), io=12.5GiB (13.5GB), run=62268-73140msec

Disk stats (read/write):
  sda: ios=0/17283, merge=0/209, ticks=0/509498, in_queue=514865, util=97.77%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,167634,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=178MiB/s (186MB/s), 8865KiB/s-10.2MiB/s (9078kB/s-10.7MB/s), io=13.0GiB (13.9GB), run=64274-74761msec

Disk stats (read/write):
  sda: ios=0/17579, merge=0/406, ticks=0/582329, in_queue=589092, util=98.79%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,62603,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=168MiB/s (176MB/s), 3482KiB/s-4392KiB/s (3566kB/s-4497kB/s), io=12.6GiB (13.6GB), run=63144-77123msec

Disk stats (read/write):
  sda: ios=0/16553, merge=0/506, ticks=0/627853, in_queue=637234, util=97.66%

As you can see, when you increase the number of threads doing sequential writes, the overall throughput hits a bit of a sweet spot before falling off a bit of a cliff. I've tested the local links, both public facing and private, and can saturate the network with iperf, so I'm thinking the issue I'm hitting here relates to how many threads are trying to write at once, but I'm not sure how I can or should tweak that.
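Something I still want to try is a fio run closer to what I assume garage's write pattern looks like (1MiB blocks, synced as they land; that's a guess based on block_size, not something I've verified), e.g.:

[nix-shell:/data/disk/0]# fio --name=garage-ish --ioengine=posixaio --rw=write --bs=1M --fsync=1 --numjobs=4 --size=1g --runtime=60 --time_based --end_fsync=1 | tail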

My question here is mostly: how do I track down and get a better idea of where my performance issues lie, and/or where would I get at that data? I have garage running in debug mode, which hopefully isn't an issue, but I've dug through the docs and haven't found much that would help with performance optimization or debugging.

I'll keep plugging away at it; I've gotten s3fs writes to not be too bad by adding a local cache and so on, but the overall write performance is atrocious.

Using rclone to/from the node seems much better, but that doesn't help when I need a local filesystem to copy bits to and from.
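For reference, the rclone runs that behave much better are plain copies along these lines (remote name, paths, and flag values are illustrative, just whatever I happened to be testing):

rclone copy --progress --transfers 4 --s3-chunk-size 64M /data/media/tv/some-show garage:media/tv/some-show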

Would a local garage server help with s3fs here? I'm sure it would, but given that all s3fs is doing is chunking the files and uploading them all at once, it seems like that wouldn't be too big of a gain.

Debugging performance isn't too problematic for me, but black-box debugging can be time intensive, and I figured it would be better to ask whether knobs already exist before getting too deep into the code.

Thanks!

Owner

Hi @mitchty, thank you so much for the detailed feedback!

Since you are seeing much better performance using rclone to copy files, I'd assume the issue lies mostly in the way s3fs works. You should try to use RUST_LOG=garage_api=debug to view all the requests s3fs makes to your garage server. Figuring out the mapping between these requests and actual operations made on the filesystem will help you understand the request patterns s3fs uses and why they might be sub-optimal. But keep in mind that S3 and regular filesystems are extremely different and mapping from one to another necessarily incurs a very big penalty. I think s3fs in particular is not very well designed in this regard, and you might be interested in trying out alternatives such as goofys or rclone mount.
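For example, with a systemd-managed garage (which should be the case on NixOS; adjust the unit name if yours differs), something along these lines should work:

systemctl edit garage.service      # add an override containing the two lines below
# [Service]
# Environment="RUST_LOG=garage_api=debug"
systemctl restart garage.service
journalctl -u garage.service -f    # watch the per-request logs while copying a file over s3fs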

Other remarks, in no particular order:

  • You have two machines, so your write speeds will be limited by the slowest machine, since Garage writes synchronously to both. For further troubleshooting of the distributed aspect, please post the output of garage status, garage stats -a and lsblk+filesystem info on the second node (is the second system exactly the same?). You can try setting up your system using replication_mode="2-dangerous" (see the snippet after this list) and see if it improves performance; that would be an interesting result to have.

  • If your s3fs is mounted on a different machine than your Garage nodes, you can try setting up a Garage gateway node on that machine and point s3fs to localhost to see if it improves perf. However since network is not your bottleneck it will probably not change a lot.

  • Have you confirmed that your bottleneck is IOPS on the HDD? At least, have you excluded any form of bottleneck on your NVMe drives?

  • Garage does not limit parallel IO operations based on a number of threads, as it uses an async runtime / lightweight threads to run all tasks in parallel (one per incoming request, plus background tasks). This should rather be tuned client-side, by changing the number of parallel S3 API requests made to Garage.

  • For multi-gb files, using a big block size (at least 10MB) is necessary for several reasons:

    • to reduce the load on metadata storage and provide faster access times to metadata (since there is less metadata per object to load)

    • on ext4, your inode limit will eventually be reached and everything will blow up (see the quick df -i check after this list)

    • I don't know how much RAM your NAS has, but if it is not very big, caching of filesystem metadata (inodes & directory lists) will be limited, slowing down the system. Also, ZFS is probably already reserving half your RAM for its own cache, further exacerbating the issue.

  • xfs doesn't have the inode limit issue of ext4, and generally gives better performance than ext4, so try it out.

  • If you have multipart upload failures then yes, please switch to 0.9 asap. With 0.8 a whole multi-gb file has to be retried from the start when a single part fails, whereas with 0.9 only the failed parts need to be resent.

  • Garage does an fsync on each block write, which has been removed by default in 0.9. This can be a significant source of slowness during writes.

  • The upgrade to 0.9 is not particularly risky; I'd say there is no need to back up all of your data, simply make a backup of your metadata folders (a tar snapshot taken while Garage is not running; sketched after this list) and make sure to update both nodes simultaneously.

  • For the scrub we are doing everything sequentially, listing possibly very big directories (which can be very slow on HDD) and reading many individual files, so I'm not surprised the bandwidth is limited to 50MB/s. At least, since it's not saturating IOPS it will let you do other things at the same time. With a bigger block size the scrub should be faster too.
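Concretely, the relaxed-consistency experiment from the first point above is just this one line changed in /etc/garage.toml on both nodes (and reverted afterwards, since it weakens consistency guarantees):

replication_mode = "2-dangerous"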
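To illustrate the ext4 inode point: inode usage is easy to keep an eye on, and once IUse% approaches 100% writes start failing even though df -h still shows free space:

df -i /data/disk/0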
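And a minimal sketch of the metadata snapshot mentioned above, assuming a systemd unit named garage and the metadata_dir from your config (/data/garage); repeat on both nodes:

systemctl stop garage.service
tar -C /data -czf /root/garage-metadata-$(date +%F).tar.gz garage
systemctl start garage.service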

lx closed this issue 2024-02-16 10:12:53 +00:00
Reference: Deuxfleurs/garage#668