extremely unstable on arm rpi4 #909
As the title says, I am having a lot of issues running this software on my RPi 4s in a "cluster". Usually one of the nodes just goes into the "failed" state; restarting it mostly works, but the problem comes back after some time, and I cannot see any error that stands out in the logs.
EDIT:
It seems that when the SQLite snapshot happens, the node sometimes does not respond to Garage's pings and gets marked "failed". Note that this does not always happen, but usually the RPis do not answer the pings in time, hit the ping timeout, and end up in the failed state.
I am using the SQLite database engine, with a 3.5-inch desktop HDD in a USB adapter casing (tried both with UASP enabled and with the regular usb-storage driver). Both the metadata database and the data files are on the same disk.
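For reference, a generic Linux check (not Garage-specific) to confirm which kernel driver the USB-SATA bridge is actually bound to:

lsusb -t    # the adapter shows up with Driver=uas (UASP) or Driver=usb-storage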
How big is your metadata database?
3.3 GiB on the troublesome node.
It seems that writing to the troublesome node makes it go into the failed state as well.
Based on the information above, I would say that Garage is stalling because the disk is simply too slow to handle your cluster size. Could you get some telemetry from the OS to confirm that? Something like the load averages, or better, the iowait or storage PSI numbers. You should be able to get those from htop, for example.
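For example, something like the following should show those numbers (assuming the sysstat package is installed for iostat, and a kernel with pressure stall information enabled for the PSI file):

uptime                   # load averages
iostat -x 5              # per-device utilization and iowait, refreshed every 5 seconds
cat /proc/pressure/io    # storage PSI (kernel 4.20+ with PSI enabled)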
Looking at iostat, I am getting at most about 50% iowait, but it fluctuates a lot below that.
Running a fio test with these parameters causes spikes upwards of 72% iowait. I do wonder if it is the HDD not keeping up; if so, it might be on the way out and I need to replace it:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=1G --readwrite=randrw --rwmixread=75
At this cluster size I would strongly encourage you to keep the metadata on an SSD (or both metadata and data), especially if you don't have enough RAM to keep the metadata database cached. I don't see any pointer to an issue with Garage itself here, so unless you have further concerns, would you be OK with closing this ticket?
Yes, we can close this ticket for now. I will try to figure out a way to move the metadata to a different disk and see if the issue persists.
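A rough sketch of such a move, assuming a systemd unit named garage.service, the metadata currently under /var/lib/garage/meta, and an SSD mounted at /mnt/ssd (all of these paths and names are assumptions, adjust them to your setup):

sudo systemctl stop garage                                   # stop the node so the metadata db is quiescent
sudo rsync -a /var/lib/garage/meta/ /mnt/ssd/garage-meta/    # copy the existing metadata directory
# then point metadata_dir in the Garage config file at /mnt/ssd/garage-meta
sudo systemctl start garage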
Have a nice weekend!