Extremely unstable on ARM RPi 4 #909

Closed
opened 2024-12-06 15:48:00 +00:00 by Swedish_Hermit · 8 comments

As the title says, I am having a lot of issues running this software on my RPi 4s in a "cluster". Usually one of the nodes just ends up in the "failed" state; restarting mostly fixes it, but the problem comes back after some time, and I cannot see any error that sticks out.

EDIT:
It seems that when the SQLite snapshot happens, the node sometimes does not respond to the pings from Garage and is marked "failed". Note that this does not always happen, but usually the RPis do not respond to the pings in time, hit the ping timeout, and then go into the failed state.
I am on the SQLite database engine with a 3.5-inch desktop USB HDD in an adapter casing (tried both with UASP enabled and with the regular usb-storage driver). Both the database and the data files are on the same disk.
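
In case it helps with reproducing: the snapshots in question are the periodic metadata snapshots. As far as I understand from the docs, they are driven by the auto-snapshot option in garage.toml (the interval below is just what I happen to use), and `garage meta snapshot` should trigger one by hand:

```toml
# garage.toml excerpt -- enables periodic metadata snapshots
# (the interval value is an example from my own config)
metadata_auto_snapshot_interval = "6h"
```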

Owner

How big is your metadata database?

Author

3.3 GiB at the troublesome node.

Author

It seems that writing to the troublesome node makes it go into the failed state as well.

Owner

Based on the information above, I would say that Garage is stalling because the disk is simply too slow to handle your cluster size. Could you get some telemetry from the OS to confirm that? Something like the load numbers, or better, the IOWAIT or storage PSI. You should be able to get those from htop, for example.
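
A quick sketch of how to collect those numbers with standard Linux tools (the PSI file requires a kernel built with PSI support):

```sh
# Storage pressure stall information: how long tasks spend blocked on I/O
cat /proc/pressure/io

# Extended per-device statistics (utilization, queue size), refreshed every second
iostat -x 1

# CPU-level iowait percentage over time (the "wa" column)
vmstat 1
```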

maximilien added the kind/performance label 2024-12-06 18:39:33 +00:00
Author

Looking at iostat, I'm getting at most about 50% iowait, but it fluctuates a lot below that.

Author

Doing an fio test with these parameters causes spikes upwards of 72% iowait. I do wonder if it is the HDD not keeping up; if so, it might be on the way out and I need to replace it.
`fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75`
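
For what it's worth, this workload (4k random reads/writes at high queue depth) is close to a worst case for a spinning disk; a healthy 3.5-inch HDD typically manages only on the order of 100–200 IOPS here, so high iowait under this test does not by itself prove the drive is failing.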

Owner

At this cluster size I would strongly encourage you to put the metadata on an SSD (or both metadata and data), especially if you don't have enough RAM to keep the metadata database cached. I don't see any pointer to an issue with Garage itself here, so unless you have further concerns, would you be OK with closing this ticket?
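
Concretely, that means pointing `metadata_dir` at an SSD-backed path while the object data stays on the HDD, roughly like this (paths are just examples):

```toml
# garage.toml excerpt -- split metadata and data across disks
db_engine = "sqlite"

# Metadata (the SQLite database) on fast SSD storage
metadata_dir = "/mnt/ssd/garage/meta"

# Bulk object data can stay on the slower HDD
data_dir = "/mnt/hdd/garage/data"
```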

Author

Yes, we can close this ticket for now. I will try to figure out a way to move the metadata to a different disk and see if the issue persists.
Have a nice weekend!

Reference: Deuxfleurs/garage#909