Exceedingly slow performance for s3fs and garage 0.8.2 #668
Reference: Deuxfleurs/garage#668
Hello! For my home NAS I've been testing out Garage as my S3 backend, mostly to back Plex and to serve as a restic target. Apologies for the mini thesis, but I figured I'd write up as much information as I could. If it helps I can link to my NixOS configuration flake, since I set everything up that way, but that might be too much information.
All timings here are for 0.8.2. I can upgrade to 0.9 shortly, once I back everything up to my older NAS running seaweedfs, but I figure the overall experience of debugging the slowness might help.
The issue I'm hitting is that s3fs performance is, well, abysmal: the best sustained rate I can get out of local s3fs filesystem copies is roughly 5 MiB/s, sometimes less, when copying a multi-gibibyte file.
In case it helps, the hardware I'm running on is these systems:
https://www.terra-master.com/us/f4-4736.html
I've configured Garage to use the secondary 2.5G NICs for all of its RPC traffic, to remove any possibility of NIC contention on the front end. (I think? Maybe I set that up wrong.)
Running NixOS 23.05 (for now).
The kernel is as follows (somewhat constrained by my using ZFS for / for now, but reasonably modern).
The layout for block devices is as follows. Note that ZFS is used on the NVMe drives, apart from the /boot partitions and swap, which aren't pertinent. The main /data/disk/0 partition is simply journaled ext4, which seems fine for the use case, but I can swap it out for XFS or whatever if that's an issue.
That is, the LMDB database is on those two nvme*n1 devices, which are a simple ZFS mirror, and all the bulk data is on a 20 TiB WDC drive which, for serial operation, should be able to achieve almost 300 MiB/s.
My configuration TOML: I am using a replication mode of 2, as my main goal is mostly to not have my singular NAS die on me. I've tried tweaking block_size up from 1 MiB to 10 MiB, but that didn't do a ton. Keep in mind that most of what I will store here is bulk data, with writes that should amount to straight-up copies over s3fs:
For s3fs, my mount line is as follows (ignore the /nix stuff; it's a full path for reasons not worth getting into here, relating to fuse mounts and NixOS):
I've tried tweaking s3fs's parallel count, multipart size, copy size, dirty data, etc. all over the map, but what seems to be happening looks related to a new 0.9.0 feature:
Other changes since v0.8.4:
Optimal layout assignation algorithm (#296)
Full multipart upload semantics (#204, #553)
I've seen multipart uploads fail on specific parts with an error (which I can't find in my logs right now) about needing to upload parts in sequence. I presume the full multipart semantics in 0.9 would help with that?
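For reference, the knobs I've been tweaking correspond to s3fs options like the following. This is purely illustrative (bucket, mountpoint, endpoint, and all values are placeholders, not my actual mount line), and it's printed rather than executed so it can be reviewed first:

```shell
# Illustrative only: "mybucket", "/mnt/garage", the endpoint URL, and all
# option values are placeholders, not the settings from my real setup.
S3FS_CMD='s3fs mybucket /mnt/garage \
  -o url=http://localhost:3900 \
  -o use_path_request_style \
  -o parallel_count=8 \
  -o multipart_size=64 \
  -o max_dirty_data=1024'
echo "$S3FS_CMD"
```

parallel_count controls concurrent part uploads, multipart_size is the part size in MiB, and max_dirty_data bounds how much dirty cache s3fs accumulates before flushing.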
Additionally, I've set up the Prometheus exporter and run a scrub to get a better idea of how much data is being read; it seems to cap out at roughly 50 MiB/s even when tranquility is set to 0 via:
garage repair --yes scrub set-tranquility 0
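I'm reading those throughput numbers off the metrics endpoint; a sketch of the scrape, where the port 3903 and the /metrics path are assumptions based on my [admin] api_bind_addr setting (printed rather than executed):

```shell
# Assumed: admin API bound on localhost:3903, metrics served at /metrics.
# Adjust to whatever your garage.toml [admin] section actually says.
METRICS_CMD='curl -s http://localhost:3903/metrics | grep -i block'
echo "$METRICS_CMD"
```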
These WDC drives can definitely do much better IOPS than that, however. They are allegedly CMR drives, not SMR, but I do see some... fun behavior depending on how many threads do writes at once, which I think may explain some of what I'm observing.
As you can see, when you increase the number of threads doing sequential writes, the overall throughput hits a bit of a sweet spot before falling off a bit of a cliff. I've tested the local links, both public-facing and private, and can saturate the network with iperf, so I'm thinking the issue I'm hitting here relates to how many threads are trying to write at once, but I'm not sure how I can or should tweak that.
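A rough sketch of the kind of test behind that observation: sequential writes at increasing concurrency. TARGET and SIZE_MB are placeholders; point TARGET at the data disk (not the OS drive) and raise SIZE_MB well past the page cache size for meaningful numbers:

```shell
# Crude concurrency sweep with dd; conv=fdatasync forces each writer to
# flush before exiting so the timing reflects the disk, not the cache.
TARGET=${TARGET:-/tmp/garage-write-test}   # placeholder path
SIZE_MB=${SIZE_MB:-32}                     # bump this on real hardware
mkdir -p "$TARGET"
for threads in 1 2 4 8; do
  start=$(date +%s)
  for i in $(seq "$threads"); do
    dd if=/dev/zero of="$TARGET/w$i" bs=1M count="$SIZE_MB" conv=fdatasync 2>/dev/null &
  done
  wait
  elapsed=$(( $(date +%s) - start ))
  echo "threads=$threads wrote=$(( threads * SIZE_MB ))MiB in ${elapsed}s"
  rm -f "$TARGET"/w*
done
```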
My question here is mostly: how do I track down and get a better idea of where my performance issues lie, and where would I get at that data? I have Garage running in debug mode, which hopefully isn't an issue, but I've dug through the docs and haven't found much that would help with performance optimization or debugging.
I'll keep plugging away at it. I've gotten s3fs to not be too bad with writes by adding a local cache, etc., but the overall write performance is atrocious.
Using rclone to/from the node seems much better, but that doesn't help when I just need a local filesystem to copy bits to and from.
Would a local Garage server help with s3fs here? I'm sure it would, but given that all s3fs is doing is chunking the files and uploading them all at once, it seems like it wouldn't be too big of a gain.
Debugging performance isn't too problematic for me, but black-box debugging can be time-intensive, so I figured it would be better to ask whether knobs already exist before getting too deep into the code.
Thanks!
Hi @mitchty, thank you so much for the detailed feedback!
Since you are seeing much better performance using rclone to copy files, I'd assume the issue lies mostly in the way s3fs works. You should try using
RUST_LOG=garage_api=debug
to view all the requests s3fs makes to your Garage server. Figuring out the mapping between these requests and the actual operations made on the filesystem will help you understand the request patterns s3fs uses and why they might be sub-optimal. But keep in mind that S3 and regular filesystems are extremely different, and mapping from one to the other necessarily incurs a very big penalty. I think s3fs in particular is not very well designed in this regard, and you might be interested in trying out alternatives such as goofys or rclone mount.
Other remarks, in no particular order:
You have two machines, so your write speeds will be limited by the slowest machine, since Garage writes synchronously to both. For further troubleshooting of the distributed aspect, please post the output of garage status and garage stats -a, plus lsblk and filesystem info on the second node (is the second system the exact same?). You can also try setting replication_mode="2-dangerous" and see if it improves performance; that would be an interesting result to have.
If your s3fs is mounted on a different machine than your Garage nodes, you can try setting up a Garage gateway node on that machine and pointing s3fs to localhost, to see if it improves perf. However, since the network is not your bottleneck, it will probably not change a lot.
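For reference, the relaxed quorum mode mentioned above is a single setting in garage.toml (sketch; all other settings elided):

```toml
# Sketch: acknowledge a write after a single node has it; data is still
# replicated to both nodes, just not synchronously on the write path.
replication_mode = "2-dangerous"
```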
Have you confirmed that your bottleneck is IOPS on the HDD? At least, have you excluded any form of bottleneck on your NVMe drives?
Garage does not limit parallel I/O operations based on a number of threads, as it uses an async runtime / lightweight tasks to run everything in parallel (one per incoming request, plus background tasks). This should rather be tuned client-side, by changing the number of parallel S3 API requests made to Garage.
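As an example of tuning that client-side parallelism, here is what it looks like with rclone (the remote name "garage:" and the bucket are placeholders; printed rather than executed so you can adapt it):

```shell
# Placeholders throughout: "garage:" must be a configured rclone remote.
# --transfers = parallel files, --s3-upload-concurrency = parallel parts
# per file, --s3-chunk-size = multipart part size.
RCLONE_CMD='rclone copy ./bigfile garage:mybucket \
  --transfers 8 --s3-upload-concurrency 8 --s3-chunk-size 64M'
echo "$RCLONE_CMD"
```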
For multi-GB files, using a big block size (at least 10 MB) is necessary for several reasons:
to reduce the load on metadata storage and provide faster access times to metadata (since there is less metadata per object to load)
on ext4, the inode limit will eventually be reached and everything will blow up
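Concretely, that is a single setting in garage.toml (in 0.8 the value is in bytes; 10 MiB shown as an example):

```toml
# Sketch: raise the block size from the 1 MiB default to 10 MiB.
block_size = 10485760
```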
I don't know how much RAM your NAS has, but if it is not very big, caching of filesystem metadata (inodes and directory lists) will be limited, slowing down the system. Also, ZFS is probably already reserving half your RAM for its own cache, further exacerbating the issue.
xfs doesn't have the inode limit issue of ext4, and generally gives better performance than ext4, so try it out.
If you have multipart upload failures then yes, please switch to 0.9 ASAP. With 0.8, a whole multi-GB upload has to be retried from the start when a single part fails, whereas with 0.9 only the failed parts need to be resent.
Garage does an fsync on each block write; this has been disabled by default in 0.9. It can be a significant source of slowness during writes.
The upgrade to 0.9 is not particularly risky. I'd say there is no need to back up all of your data; simply make a backup of your metadata folders (take a tar snapshot while Garage is not running) and make sure to update both nodes simultaneously.
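A minimal sketch of that metadata backup, where META_DIR is a placeholder for the metadata_dir value from your garage.toml:

```shell
# Stop Garage first (e.g. `systemctl stop garage`) so the metadata DB files
# are quiescent before snapshotting them.
META_DIR=${META_DIR:-/var/lib/garage/meta}   # placeholder: use your metadata_dir
OUT="garage-meta-$(date +%F).tar.gz"
if [ -d "$META_DIR" ]; then
  tar -czf "$OUT" -C "$(dirname "$META_DIR")" "$(basename "$META_DIR")"
  echo "wrote $OUT"
else
  echo "no metadata dir at $META_DIR, set META_DIR first" >&2
fi
# Restart Garage afterwards (e.g. `systemctl start garage`).
```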
For the scrub, we do everything sequentially, listing possibly very big directories (which can be very slow on an HDD) and reading many individual files, so I'm not surprised the bandwidth is limited to 50 MB/s. At least, since it's not saturating IOPS, it will let you do other things at the same time. With a bigger block size, the scrub should be faster too.