Garage RPC hangs after a certain amount of time #99

Closed

opened 2021-09-01 11:05:46 +00:00 by quentin · 3 comments

quentin commented

2021-09-01 11:05:46 +00:00

Owner

Months ago, we had a problem where Garage instances crashed during the night.
We traced the problem back and found that it occurred during backups, which put a heavy load on the cluster. We think it is due to some resource exhaustion linked to Hyper.rs that leads to HTTP timeouts, including on our health check, whose failure triggered a restart. We put a workaround in place in Nomad, asking it to restart the service indefinitely whenever it crashes, but as far as I know, the root problem is not yet solved.

quentin added the

Bug

label 2021-09-01 11:05:47 +00:00

quentin commented

2021-09-24 18:08:26 +00:00

Author

Owner

Might be related: [hyper #2419 - Http2: Hyper client gets stuck if too many requests are spawned](https://github.com/hyperium/hyper/issues/2419).