Garage resync queue length just grows #934
Reference: Deuxfleurs/garage#934
Hello again,
I am back with a similar problem as before, but with slightly different behaviour. The previous problem was mostly about new nodes and resync not being fast enough; that is probably not the case now, since we did not change our Garage cluster at all, and I kept the resync settings that fixed the previous problem (shown below).
First of all, an overview of all metrics:
We can see that the previous problem was mostly resolved by January 14th, and the data appears to have synchronized correctly: all nodes now have similar free disk space, whereas the new ones started at ~7 TB and the old ones were close to exhaustion, barely surviving at 100 GB and often being taken out of commission by k8s due to low disk space.
Not long after, a new problem started, though I am not sure why:
Worker settings on the nodes:
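(The actual settings were attached as a screenshot and are not reproduced here. For reference, a hedged sketch of how such resync settings are inspected and tuned via the Garage CLI; the values below are illustrative, not our real ones:)

    # Show the current worker variables on this node
    garage worker get

    # Illustrative values: more resync workers, no artificial throttling
    garage worker set resync-worker-count 4
    garage worker set resync-tranquility 0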
Stats as seen from garage-5:
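(The stats output itself is likewise a screenshot. For reference, a sketch of how it was collected; the block resync queue length reported here is the metric that keeps growing:)

    # Run on garage-5; reports table queue lengths and block manager stats,
    # including the resync queue length and blocks with resync errors
    garage stats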
Any ideas what could be going on? Do you need any more information?
Thank you.
With respect,
Arnold
Maybe taking a look at the logs on those nodes would be interesting. I suspect they may have some kind of connectivity issue that prevents Garage from syncing blocks from other nodes, causing the resync queue to blow up.
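(Not prescriptive, but a rough sketch of checks along those lines using the standard Garage CLI; command names as in recent Garage releases, adjust to your deployment:)

    # Verify every node is connected and none is marked as failed/unreachable
    garage status

    # List blocks whose resync attempts have errored, with retry counters
    garage block list-errors

    # Restart the daemon with more verbose logging to surface RPC errors
    RUST_LOG=garage=debug garage server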