Extremely unstable on ARM RPi 4 #909

Closed
opened 2024-12-06 15:48:00 +00:00 by Swedish_Hermit · 8 comments

As the title says, I am having a lot of issues running this software on my RPi 4s in a "cluster". Usually one of the nodes just ends up in the "failed" state; restarting mostly fixes it, but the problem comes back after some time, and I cannot see any error that sticks out.

EDIT:
It seems that when the SQLite snapshot happens, the node sometimes does not respond to the pings from Garage and is marked "failed". Note that this does not always happen, but usually the RPis do not respond to the pings in time, hit the ping timeout, and then go into the failed state.
I am on the SQLite database engine with a 3.5-inch desktop USB HDD in an adapter casing (tried both with UASP enabled and with the regular usb-storage driver). Both the database and the data files are on the same disk.
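
In case it helps with reproducing: the snapshots in question are the periodic metadata snapshots. As far as I understand from the docs, they are driven by the auto-snapshot option in garage.toml (the interval below is just what I happen to use), and `garage meta snapshot` should trigger one by hand:

```toml
# garage.toml excerpt -- enables periodic metadata snapshots
# (the interval value is an example from my own config)
metadata_auto_snapshot_interval = "6h"
```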

Owner

How big is your metadata database?

Author

3.3 GiB at the troublesome node.

Author

It seems that writing to the troublesome node makes it go into the failed state as well.

Owner

Based on the information above, I would say that Garage is stalling because the disk is simply too slow to handle your cluster size. Could you get some telemetry from the OS to confirm that? Something like the load numbers, or better, the IOWAIT or storage PSI. You should be able to get those from htop, for example.
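
A quick sketch of how to collect those numbers with standard Linux tools (the PSI file requires a kernel built with PSI support):

```sh
# Storage pressure stall information: how long tasks spend blocked on I/O
cat /proc/pressure/io

# Extended per-device statistics (utilization, queue size), refreshed every second
iostat -x 1

# CPU-level iowait percentage over time (the "wa" column)
vmstat 1
```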

maximilien added the kind/performance label 2024-12-06 18:39:33 +00:00
Author

Looking at iostat, I'm getting at most about 50% iowait, but it fluctuates a lot below that.

Author

Doing an fio test with these parameters causes spikes upwards of 72% iowait. I do wonder if it is the HDD not keeping up; if so, it might be on the way out and I need to replace it.
`fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75`
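
For what it's worth, this workload (4k random reads/writes at high queue depth) is close to a worst case for a spinning disk; a healthy 3.5-inch HDD typically manages only on the order of 100–200 IOPS here, so high iowait under this test does not by itself prove the drive is failing.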

Owner

At this cluster size I would strongly encourage you to put the metadata on an SSD (or both metadata and data), especially if you don't have enough RAM to keep the metadata database cached. I don't see any pointer to an issue with Garage itself here, so unless you have further concerns, would you be OK with closing this ticket?
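
Concretely, that means pointing `metadata_dir` at an SSD-backed path while the object data stays on the HDD, roughly like this (paths are just examples):

```toml
# garage.toml excerpt -- split metadata and data across disks
db_engine = "sqlite"

# Metadata (the SQLite database) on fast SSD storage
metadata_dir = "/mnt/ssd/garage/meta"

# Bulk object data can stay on the slower HDD
data_dir = "/mnt/hdd/garage/data"
```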

Author

Yes, we can close this ticket for now. I will try to figure out a way to move the metadata to a different disk and see if the issue persists.
Have a nice weekend!

Reference: Deuxfleurs/garage#909