Capacity vs DataAvail and going overboard #837
### Cluster layout

I have created a three-node Garage cluster and applied a layout with a capacity of 8G. The data and metadata are on `/mnt/DATA/{data,meta}` respectively, and `/mnt/DATA` is the mountpoint of a 10 GB XFS-formatted partition.
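For context, a layout like the one described would typically be set up with the `garage` CLI along these lines. This is only a sketch, assuming Garage v0.9+ syntax; the node IDs and zone name are placeholders, not values from the issue:

```sh
# Assign an 8G capacity to each of the three nodes (IDs and zone are placeholders)
garage layout assign -z dc1 -c 8G <node1_id>
garage layout assign -z dc1 -c 8G <node2_id>
garage layout assign -z dc1 -c 8G <node3_id>

# Review the staged changes, then apply them
garage layout show
garage layout apply --version 1
```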
### Going overboard

On this cluster I've created a bucket. From a client machine I have done the following using `aws s3 cp`, which resulted in no errors:

- Copied numerous files filled with random data (from `/dev/urandom`) totalling more than 8 GB (the cluster capacity) and less than 10 GB (the size of the XFS partition where data and metadata are stored). What is the effect of specifying a capacity?
- Kept copying such files, going above the capacity of the XFS partition. At some point the `aws s3 cp` operation hung and I had to Ctrl-C it. `aws s3 ls` on the bucket even shows the last file that caused the storage space to be exceeded, but not the one where I had to press Ctrl-C to cancel the operation.

If I don't Ctrl-C it, I get this:
### Key questions

- Is the idea that we need to monitor metrics and add more storage space proactively?
- Are there any safeguards that I should perhaps be activating to prevent going overboard?
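One concrete form such proactive monitoring could take is a periodic check of free space on the partition holding Garage's data (`garage status` also reports a per-node DataAvail figure, as referenced in the issue title). A minimal sketch; the path and the 80% threshold below are illustrative, not values from the issue:

```sh
#!/bin/sh
# Warn when the partition holding Garage data crosses a usage threshold.
USAGE=$(df --output=pcent /mnt/DATA | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge 80 ]; then
    echo "WARNING: /mnt/DATA is ${USAGE}% full; add storage or adjust the layout" >&2
fi
```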
I suggest that the S3 API service should respond with an HTTP 507 (Insufficient Storage) once free disk space falls below a certain threshold. "No space left on device" should be something that never happens.
When one Garage node runs out of space, uploads seem to grind to a halt while Garage attempts to write files to that disk, even when there are enough other nodes available that could easily take the data. In my case, upload speeds drop from a consistent 40 MiB/s (saturating the network) to 100 KiB/s-5 MiB/s. It also happens when the node still has free space overall, but one of its disks is entirely full.
Garage partition allocation is static, so it cannot "reroute" files to other nodes dynamically; that would require a layout change.
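For completeness, the kind of layout change mentioned above goes through the same CLI, for example to bring an additional node into the cluster. A sketch only; the node ID, address, zone, and capacity are placeholders:

```sh
# Connect a new node (default RPC port 3901), give it capacity,
# then apply the new layout version. All identifiers are placeholders.
garage node connect <new_node_id>@<new_node_ip>:3901
garage layout assign -z dc1 -c 10G <new_node_id>
garage layout apply --version 2
```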