Handle FD starvation correctly #595
It appears that if another process starves a node of all file descriptors, garage reports that a block was saved successfully. However, it will be impossible to actually read the block because it was never written.
If Garage fails to open files due to FD starvation, it should crash and fail to start -- perhaps even sending the block currently in memory to another node before crashing.
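Below is a minimal sketch of the fail-fast behaviour being requested here (this is not something Garage currently does): on startup, try to create and remove a probe file in the data directory and refuse to start if that fails, so FD exhaustion or an unwritable disk is caught before the node accepts writes. The function name and path are hypothetical.

```rust
use std::fs::{self, File};
use std::process;

// Hypothetical startup self-check: try to create and remove a probe file
// in the data directory, and refuse to start if that fails, so that FD
// exhaustion or an unwritable disk is caught before any writes are accepted.
fn check_data_dir_writable(data_dir: &str) -> std::io::Result<()> {
    let probe = format!("{data_dir}/.write-probe");
    File::create(&probe)?; // fails with EMFILE when out of file descriptors
    fs::remove_file(&probe)?;
    Ok(())
}

fn main() {
    // "/var/lib/garage/data" is an assumed path, for illustration only.
    if let Err(e) = check_data_dir_writable("/var/lib/garage/data") {
        eprintln!("data directory is not writable, refusing to start: {e}");
        process::exit(1);
    }
    println!("startup check passed");
}
```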
I don't understand how this is possible. If there is no available file descriptor, the `open` syscall should return an error, and the code always propagates such errors to the caller. Garage should not report that the block was successfully saved in this situation.

What might happen is that the block could not be saved to a single node, but it could be saved on all of the other nodes, and therefore a quorum was achieved and the write succeeded globally (this is an expected scenario).

Are you sure that this is really what is happening? How do you know? Please provide proof of what you are saying: how did you come to this conclusion, how can I reproduce the issue, etc.
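To illustrate the point about error propagation (a minimal sketch, not Garage's actual block-writing code): an `open`/`create` call made while the process is out of file descriptors fails with EMFILE ("Too many open files"), and the `?` operator passes that error up to the caller rather than reporting a successful write. The `write_block` function and path here are made up for the example.

```rust
use std::fs::File;
use std::io::{self, Write};

// Sketch of error propagation: File::create fails with EMFILE when the
// process has exhausted its file descriptors, and `?` returns that error
// to the caller instead of pretending the write succeeded.
fn write_block(path: &str, data: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?; // returns Err under FD starvation
    file.write_all(data)?;
    file.sync_all()?;
    Ok(())
}

fn main() {
    match write_block("/tmp/example-block", b"hello") {
        Ok(()) => println!("block written"),
        // EMFILE is errno 24 on Linux
        Err(e) if e.raw_os_error() == Some(24) => {
            eprintln!("out of file descriptors: {e}")
        }
        Err(e) => eprintln!("write failed: {e}"),
    }
}
```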
These blocks are gone completely, as far as I can tell. I had a node that was running netdata (https://github.com/netdata/helmchart/issues/372) which somehow consumed all file descriptors on the node. This is about half of the errors; I've been manually doing `garage block purge` one-by-one, since `garage block list errors | awk '{print $1;}' | xargs garage block purge --yes` doesn't seem to work.

This is what the garage logs look like during this time:

And how I guessed that it was causing corruption:
It's worth pointing out that it took days for any serious issues to start happening (we had zero monitoring for this failure case), and in this case, it was containers not coming up that had just been pushed successfully (Harbor is using garage as a backend). So, garage was actually quite resilient in the face of catastrophic error conditions. Perhaps too resilient?

Happy to provide the entire logs for your perusal.
Could you provide us with your `garage.toml` (minus the rpc secret) as well as logs? The logs you provided don't show any block getting written, "just" the inability to save the peer list, which isn't too bad on its own, and the inability to read blocks, which sounds like it's a consequence, not a cause.

As lx said, the scenario you describe shouldn't happen; it's an IO error like any other, and should be reported as is. Do you have traces that the uploads went correctly, and did not receive 4xx/5xx error codes in response? Some metadata is saved in parallel to the data, so if saving some data fails, there can still be some metadata saying the block should exist (which would probably generate this kind of message), but the client would still have been informed of an error.
I'll get those to you.
I think the logs were over the upload limit and they disappeared into the ether.
Can you upload them somewhere else? Or compress them, maybe?
The same errors...

Single node, good new disks (data on 2x16 TB in RAID1, metadata on 2 NVMe SSDs in RAID1), with no inserts for over a week...
```
Garage version: v0.8.2 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]
Rust compiler version: 1.63.0

Database engine: Sled

Table stats:
  Table      Items  MklItems  MklTodo  GcTodo
  bucket_v2  NC     NC        0        0
  key        NC     NC        0        0
  object     NC     NC        0        0
  version    NC     NC        0        0
  block_ref  NC     NC        0        0

Block manager stats:
  number of RC entries (~= number of blocks): NC
  resync queue length: 50
  blocks with resync errors: 48

If values are missing above (marked as NC), consider adding the --detailed flag (this will be slow).

Storage nodes:
  ID                Hostname  Zone  Capacity  Part.  DataAvail                MetaAvail
  ae06d710efxxxxx5  xxx       xx    1         256    8.1 TB/15.9 TB (50.8%)   91.0 GB/1081.1 GB (8.4%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 8.1 TB
  metadata: 91.0 GB
```
After --detailed:
After looking at this again, it looks like the issues reported by @withinboredom and @Mako are not the same.

@withinboredom: your `garage block list-errors` shows many blocks in an errored state but with zero references; Garage should normally not try to store those blocks as they are not needed, and the errors should disappear. However, the "next retry time" is a lot of years in the future, so that retry will never happen and the errors will not be cleared. This looks a lot like an integer overflow/underflow error.

@Mako: your `garage block list-errors` shows many blocks with non-zero reference counters. This might either be #644 (non-zero reference counter but blocks are not truly referenced) appearing on a scale unseen before (it usually happens for one block at a time), or an error where actually needed blocks did truly disappear. To know which case we are in, I need the output of `garage block info` for the block hashes in question.
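As an aside, here is a guess at how a "next retry time" can end up years in the future (a sketch of the general failure mode, not Garage's actual resync code): if the retry delay is computed as an unchecked exponential backoff, the arithmetic can overflow or grow without bound after enough consecutive errors, producing an absurd timestamp; saturating arithmetic plus an explicit cap avoids that. All names below are hypothetical.

```rust
use std::time::Duration;

const BASE_DELAY_SECS: u64 = 60; // first retry after 1 minute
const MAX_DELAY: Duration = Duration::from_secs(24 * 3600); // never wait more than a day

// Hypothetical backoff computation: double the delay per consecutive error,
// but with saturating arithmetic and an upper bound so the next retry time
// can never run away to "in a lot of years".
fn retry_delay(consecutive_errors: u32) -> Duration {
    let factor = 2u64.saturating_pow(consecutive_errors);
    let secs = BASE_DELAY_SECS.saturating_mul(factor);
    Duration::from_secs(secs).min(MAX_DELAY)
}

fn main() {
    // With unchecked arithmetic, 60 * 2^60 seconds overflows u64 (or maps to
    // billions of years); the capped version stays at one day.
    for errors in [0, 5, 20, 60] {
        println!("{errors} consecutive errors -> retry in {:?}", retry_delay(errors));
    }
}
```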
Hi, it seems like I've encountered the same error, but a little bit different, considering I have two nodes and the replication factor is set to 2.

which is not correct, as the amount of consumed FDs per node is ~2k and the system max of FDs is in the billions..
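One thing worth noting here (general Linux behaviour, not anything specific to Garage): "Too many open files" (EMFILE) is raised against the per-process limit, RLIMIT_NOFILE, which often defaults to 1024 or 65536, not against the system-wide fs.file-max. A minimal Linux-only sketch for comparing the descriptors a process holds against its own soft limit:

```rust
use std::fs;

// Compare the descriptors this process currently holds against its own
// per-process soft limit (RLIMIT_NOFILE), which is what EMFILE is checked
// against -- not the system-wide fs.file-max.
fn main() -> std::io::Result<()> {
    // Count the descriptors currently open in this process.
    let open_fds = fs::read_dir("/proc/self/fd")?.count();

    // Extract the "Max open files" soft limit from /proc/self/limits.
    let limits = fs::read_to_string("/proc/self/limits")?;
    let soft_limit = limits
        .lines()
        .find(|l| l.starts_with("Max open files"))
        .and_then(|l| l.split_whitespace().nth(3).map(str::to_string))
        .unwrap_or_else(|| "unknown".to_string());

    println!("open fds: {open_fds}, per-process soft limit: {soft_limit}");
    Ok(())
}
```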
@kot-o-pes I don't think Garage is inventing this error; could you try to strace your process to see where it is coming from?
I think this issue has failed to conclusively pinpoint a specific issue in Garage, so I will close it here for inactivity and lack of focus. For debugging running clusters, we are available to answer questions on the Matrix channel. If an actual issue with the handling of file descriptors can be demonstrated using appropriate tools such as strace, feel free to open a new issue.