Capacity vs DataAvail and going overboard #837

Open
opened 2024-07-07 08:47:01 +00:00 by overbring · 3 comments

Key questions

  1. Is the provision of a Capacity argument when defining a cluster layout merely informational?
  2. Is it an issue with the S3 API spec or with Garage that there is no error when uploading data to the bucket that pushes the total space required above the available space on the disk?

Cluster layout

I have created a three-node Garage cluster and applied a layout with a capacity of 8G. The data and metadata are on /mnt/DATA/{data,meta} respectively, and /mnt/DATA is the mountpoint of a 10 GB XFS-formatted partition.
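For reference, the layout was created roughly like this (a sketch; the zone name and node ID placeholders stand in for my actual setup):

```
# node IDs come from `garage status`
garage layout assign -z dc1 -c 8G <node_id_1>
garage layout assign -z dc1 -c 8G <node_id_2>
garage layout assign -z dc1 -c 8G <node_id_3>
garage layout show                # review the staged layout
garage layout apply --version 1   # version 1, since this is the first layout
```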

Going overboard

On this cluster I've created a bucket. From a client machine I have done the following using aws s3 cp, which resulted in no errors:

  • Copied numerous files filled with random data (from /dev/urandom) totalling more than 8 GB (the cluster capacity) and less than 10 GB (the size of the XFS partition where data and metadata are stored). What is the effect of specifying a capacity?

  • Kept copying such files, going above the capacity of the XFS partition (the copy loop is sketched below). At some point the aws s3 cp operation hung and I had to Ctrl-C it. aws s3 ls on the bucket even lists the file whose upload pushed usage past the available storage, but not the one whose upload I had to cancel with Ctrl-C.
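Roughly what I was running from the client (a sketch; the exact file sizes and counts varied, and the key names are just epoch timestamps):

```
# generate ~512 MiB files of random data and upload them until the disk fills up
while true; do
  dd if=/dev/urandom of=random.data bs=1M count=512
  aws s3 cp random.data "s3://experiment/$(date +%s).data"
done
```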

If I don't Ctrl-C it, I get this:

upload failed: ./random.data to s3://experiment/1720341923.data An error occurred (ServiceUnavailable) when calling the UploadPart operation (reached max retries: 2): Internal error: Could not reach quorum of 2. 0 of 3 request succeeded, others returned errors: ["IO error: No space left on device (os error 28)", "Remote error: IO error: No space left on device (os error 28)", "Remote error: IO error: No space left on device (os error 28)"]

Is the idea that we need to monitor metrics and add more storage space proactively?

Are there any safeguards that I should perhaps be activating to prevent going overboard?
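In the meantime, the only safeguard I can think of is watching free space on the data mountpoint myself, something like the following (the threshold is arbitrary):

```
# warn when less than ~2 GiB remains on the partition holding data and metadata
avail_kb=$(df -Pk /mnt/DATA | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt 2097152 ]; then
  echo "WARNING: /mnt/DATA is nearly full (${avail_kb} KiB left)" >&2
fi
```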

quentin added the kind/usability, action/discussion-needed, scope/layout labels 2024-08-07 09:54:22 +00:00

I suggest that the S3 API service should respond with an HTTP 507 (Insufficient Storage) once free disk space falls below a certain threshold.

"No space left on device" should be something that never happens.


When one Garage node runs out of space, uploads seem to grind to a halt while Garage keeps attempting to write files to that disk, even when there are enough other nodes available that could easily take the data. In my case, upload speeds drop from a consistent 40 MiB/s (saturating the network) to 100 KiB/s-5 MiB/s. It also happens when the node as a whole has free space but one of its disks is entirely full.

Owner

> Even when there are enough other nodes available that can easily take the data

Garage partition allocation is static, so it cannot "reroute" files to other nodes dynamically; that would require a layout change.
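In practice that means staging and applying a new layout, roughly like this (the node ID, capacity, and version number below are placeholders):

```
garage layout assign -c 16G <node_id>   # raise the capacity of a node (or assign a new node)
garage layout show                      # review the staged changes
garage layout apply --version 2         # the version must be one above the current layout
```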
