Capacity vs DataAvail and going overboard #837

Open
opened 2024-07-07 08:47:01 +00:00 by overbring · 1 comment

Key questions

  1. Is the provision of a Capacity argument when defining a cluster layout merely informational?
  2. Is it an issue with the S3 API spec, or with Garage, that there is no error when uploads to the bucket push the total space required above the available space on the disk?

Cluster layout

I have created a three-node Garage cluster and applied a layout with a capacity of 8G. The data and metadata are on /mnt/DATA/{data,meta} respectively, and /mnt/DATA is the mountpoint of a 10 GB XFS-formatted partition.
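For reference, the layout was applied with roughly the following commands (a sketch: the node IDs and zone name below are placeholders, not the actual values from my cluster):

```
# Stage an 8G capacity for each of the three nodes (node IDs and zone are placeholders)
garage layout assign -z dc1 -c 8G <node1_id>
garage layout assign -z dc1 -c 8G <node2_id>
garage layout assign -z dc1 -c 8G <node3_id>

# Review and apply the staged layout
garage layout show
garage layout apply --version 1
```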

Going overboard

On this cluster I've created a bucket. From a client machine I have done the following using aws s3 cp, which resulted in no errors:

  • Copied numerous files filled with random data (from /dev/urandom) totalling more than 8 GB (the cluster capacity) and less than 10 GB (the size of the XFS partition where data and metadata are stored). What is the effect of specifying a capacity?

  • Kept copying such files, going above the capacity of the XFS partition. At some point the aws s3 cp operation hung and I had to Ctrl-C it. aws s3 ls on the bucket even lists the file whose upload pushed usage past the available storage, but not the final one, whose upload I had to cancel with Ctrl-C.

If I don't Ctrl-C it, I get this:

```
upload failed: ./random.data to s3://experiment/1720341923.data An error occurred (ServiceUnavailable) when calling the UploadPart operation (reached max retries: 2): Internal error: Could not reach quorum of 2. 0 of 3 request succeeded, others returned errors: ["IO error: No space left on device (os error 28)", "Remote error: IO error: No space left on device (os error 28)", "Remote error: IO error: No space left on device (os error 28)"]
```
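For completeness, the uploads were driven by a loop along these lines (a sketch: the endpoint URL, file size and file count are illustrative rather than the exact values I used):

```
# Upload ~1 GB files of random data until the cluster runs out of space
for i in $(seq 1 12); do
    dd if=/dev/urandom of=random.data bs=1M count=1024
    aws --endpoint-url http://garage.example.net:3900 \
        s3 cp random.data "s3://experiment/$(date +%s).data"
done
```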

Is the idea that we need to monitor metrics and add more storage space proactively?

Are there any safeguards I should be enabling to prevent going overboard like this?
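For example, would per-bucket quotas along these lines be the intended safeguard? (The command and flags below are my reading of the documentation and may not match the version I am running; also, a quota caps the logical size of the bucket rather than raw disk usage, so it seems at best an indirect guard.)

```
# Hypothetical safeguard: cap the bucket at 8GiB so uploads are rejected with a
# quota error before the underlying partition fills up
garage bucket set-quotas experiment --max-size 8GiB
```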

quentin added the kind/usability, action/discussion-needed, scope/layout labels 2024-08-07 09:54:22 +00:00

I suggest that the S3 API service should respond with an HTTP 507 (Insufficient Storage) once free disk space drops below a certain threshold.

"No space left on device" should be something that never happens.
