Garage RPC hangs after a certain amount of time #99

Closed
opened 2021-09-01 11:05:46 +00:00 by quentin · 3 comments
Owner

Months ago, we had a problem where Garage instances crashed during the night.
We traced the problem back and found that it occurred during backups, which put a heavy load on the cluster. We think it is due to some resource exhaustion linked to Hyper.rs that leads to HTTP timeouts, including on our health check, which was triggering a reboot. We put a workaround in place in Nomad, asking it to restart the service indefinitely whenever it crashes, but as far as I know the root problem is not yet solved.

quentin added the Bug label 2021-09-01 11:05:47 +00:00
Author
Owner
Might be related: [hyper #2419 - Http2: Hyper client gets stuck if too many requests are spawned](https://github.com/hyperium/hyper/issues/2419).
Owner

Does this still happen with Netapp?

(we will know once Deuxfleurs is migrated to 0.4)

Owner

Closing this for now. We will reopen if issues arise again with Netapp.

lx closed this issue 2021-11-08 15:34:33 +00:00
Reference: Deuxfleurs/garage#99