Not all workers quit on time — how can I troubleshoot? #676
Reference: Deuxfleurs/garage#676
I'm running garage on ZFS in NixOS via systemd and noticing several things:
I'm not sure when it started, but I first noticed this on v0.8.2, which had been running for months without any problems. I thought an upgrade might fix it, so I upgraded to v0.9.0, but the problem persists.
Attached at the end of this post is the complete log, but I thought the interesting bit is here:
I'm not sure how to troubleshoot this since I can't tell which worker didn't manage to exit. How do I proceed on chasing this bug? Any pointers would be appreciated!
System information
Garage configuration
Garage logs
Possibly more hints! After the systemd service was terminated with `failed` status, I can see that the process is defunct. However, `lsof` still shows the files as open.
To know which tasks did not complete successfully, we would need to look at the output of `garage worker list` and find the ones that were not mentioned in the logs when exiting. It's a bit tedious but can be done.
Does the 100% CPU start when Garage is started or only when you initiate the shutdown?
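The comparison of `garage worker list` against the exit logs could be done with a few lines of Python instead of by eye. This is only a rough sketch: it assumes you have already copied the worker names out of the `garage worker list` output yourself (its exact column layout is not shown in this thread, so nothing is parsed here), and that a worker counts as "mentioned" if its name appears anywhere in the log text. The function name and the sample names are placeholders, not real Garage identifiers.

```python
def workers_missing_from_log(worker_names, log_text):
    """Return the worker names that never appear in the shutdown log.

    worker_names: names copied by hand from `garage worker list`
                  (assumption: you extracted them yourself).
    log_text:     the Garage log captured during shutdown, as one string.
    """
    # Plain substring match; order of worker_names is preserved.
    return [name for name in worker_names if name not in log_text]


if __name__ == "__main__":
    # Placeholder data, not real Garage output.
    names = ["worker-a", "worker-b", "worker-c"]
    log = "worker-a exited\nworker-c exited\n"
    print(workers_missing_from_log(names, log))
```

Any name it returns is a candidate for the worker that did not exit in time. A substring match can give false positives if one worker name is a prefix of another, so treat the result as a shortlist.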
I think it would be nice to be able to debug the process using gdb while it is at 100% CPU, so that we can obtain a backtrace of the thread that keeps doing things. It could be that some logic in Garage is broken and it is just running in a loop, or it could be an issue related to a dependency such as LMDB. I can't really tell for now.
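One way to grab such a backtrace without an interactive session is to run gdb in batch mode against the running process. The sketch below only builds and runs the command; it assumes gdb is installed, that you have permission to attach to the process (typically root), and that you look up the Garage PID yourself. The helper names are made up for illustration. Without debug symbols the backtrace may be mostly raw addresses.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command that attaches to `pid`, prints a backtrace
    of every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid):
    """Run gdb against the process and return its combined output as text.

    Assumption: gdb is on PATH and attaching to the process is permitted.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # gdb may exit non-zero if it cannot attach
    )
    return result.stdout
```

Calling `capture_backtrace(pid)` a few times, some seconds apart, and comparing the outputs should show whether the same thread is stuck in the same place.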
Will try it next time I get the chance, thanks!
On start. Essentially, it was impossible for me to shut down the computer after starting Garage without using the physical power button. I tried running the server with and without systemd; both end up at the same 100% CPU.
Makes sense. I had uninstalled garage for the time being but will try to come back with more details next time. Thanks!