Durability Concerns Regarding Disk Failure in Multi-Datacenter Deployments #890
I am trying to better understand the durability design of GarageHQ.
According to the documentation, it is recommended to run on raw HDDs without RAID, using XFS. However, I have concerns about the following scenario: as the number of disks per datacenter grows, it seems increasingly likely that disks fail at the same time in several datacenters, and since there is no redundancy within a datacenter, such overlapping failures could lead to data loss.
Is this assumption correct? If so, would it be more advisable to recommend a deployment on ZFS RAID-Z2 for added resilience, rather than relying on raw HDDs?
Thank you for your insights!
I think the issue might be with the replication factor. As the amount of data grows, it’s not enough to simply add more disks to resolve your concerns. You also need to scale up the number of servers in the cluster, otherwise the risk of failure remains.
I feel like increasing the replication factor, to 5 or even 7, would be a better way to improve durability as your dataset grows. If you have the option to add more servers, this could work well. But if you're limited in how many servers you can run and can only add more disks, then yeah, it might make sense to go against the devs' recommendation and still consider RAID to protect against multiple disk failures.
It all depends on your durability target. For a simple 3-site deployment with the replication factor set to 3, we feel it's interesting to use the disks as is, so you can lose any disks in one zone without actual data loss (akin to raid5). If your number of disks or zones increases, then you might have to increase the number of replicas and/or use some kind of local redundancy, in the same way that no one would recommend a raid5 striped over 12 disks.
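To illustrate that point, here's a quick back-of-envelope sketch (not Garage code; placement is simplified to one uniformly chosen disk per zone, and all the numbers are made up):

```python
import random

# Simplified model: replication factor 3 across 3 zones, each block keeps one
# copy per zone on a uniformly chosen disk of that zone.
ZONES = {"zone-a": 4, "zone-b": 4, "zone-c": 4}   # disks per zone (illustrative)
NUM_BLOCKS = 100_000

random.seed(0)
placement = [
    {zone: random.randrange(n_disks) for zone, n_disks in ZONES.items()}
    for _ in range(NUM_BLOCKS)
]

def lost_blocks(failed_disks_per_zone):
    """Count blocks whose replica disk failed in *every* zone."""
    return sum(
        all(block[zone] in failed_disks_per_zone.get(zone, set()) for zone in ZONES)
        for block in placement
    )

# Losing every disk in one zone loses nothing: the other two zones still hold a copy.
print(lost_blocks({"zone-a": set(range(ZONES["zone-a"]))}))  # -> 0
# One concurrent disk failure in each of the three zones does lose a small slice.
print(lost_blocks({"zone-a": {0}, "zone-b": {0}, "zone-c": {0}}))  # ~ NUM_BLOCKS / 4**3
```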
To be clear, we don't discourage you from running Garage on a RAID data pool (be it mdadm, Btrfs, ZFS or hardware RAID); we simply feel you should give a thought to the resiliency afforded by the geo-distributed replicas first, and leverage local resiliency if needed. I would also say that local resiliency is suggested for metadata folders, and that it can be somewhat easier to deal with hardware failures on a RAID volume than to deal with the consequences of a dead disk with an actively used filesystem on it (you'll very likely have to restart the server to make the kernel happy).
Thank you for clarifying the durability model. It seems that GarageHQ’s durability approach resembles a RAID-10 configuration, with each datacenter functioning like a RAID-0 set while the replication across datacenters acts as a “mirrored” layer.
Without internal redundancy (RAID-1 or RAID-Z2) within each datacenter, simultaneous disk failures across datacenters could result in data loss, especially as the number of disks grows.
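A rough way to quantify that concern, assuming independent disk failures and a fixed repair window (both the failure probability and the disk counts below are made-up illustrative numbers):

```python
# Probability that, within one repair window, at least one disk fails in *each*
# of the 3 datacenters, as a function of disks per datacenter.
p = 0.01  # assumed per-disk failure probability over the window (illustrative)

for disks_per_dc in (4, 12, 24, 48):
    p_any_failure_in_dc = 1 - (1 - p) ** disks_per_dc
    p_concurrent_in_all_dcs = p_any_failure_in_dc ** 3
    print(disks_per_dc, f"{p_concurrent_in_all_dcs:.2e}")
```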
If erasure coding is not planned, increasing the replication factor for large datasets could be a solution, but this does have substantial resource costs. In this case, it might be beneficial to provide guidance or documentation noting that each datacenter essentially behaves like a RAID-0, and that the durability implications worsen as the disk count scales.
I think one approach that could help would be to implement parameters like rack or max-strip-disks that define a maximum stripe width (e.g., limit to 3 or 6 disks) within each datacenter. This would enable configurations where data is only spread across a limited number of disks per bucket, reducing exposure to failures. This would help maintain durability even as the disk count grows, allowing users to manage risk more effectively.
Yes and no: because the data is chunked and not striped, losing a disk on a site doesn't incur the same penalty as a raid0 would: the disk can be replaced and the lost blocks can be recovered from the other sites relatively quickly, because only that disk needs to be rebuilt.
The node is also able to keep serving blocks from its other disks. The loss of a disk is hence significantly less impactful than losing a disk in a large striped array, so the number of disks on a node matters less than it would there.
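To put a number on it, under a simplified uniform-placement assumption (n disks per zone, one replica per zone; not Garage's exact placement logic):

```python
# Given one concurrent failed disk per zone, a block is lost only if its replica
# landed on the failed disk in all three zones.
n = 12                                  # disks per zone (illustrative)
expected_fraction_lost = (1 / n) ** 3   # ~0.06% of blocks for n = 12
print(f"fraction lost: {expected_fraction_lost:.4%}")

# Rebuild scope after a single disk failure: only that disk's share of the data
# (roughly 1/n of the zone's blocks), re-fetched from the other zones, rather
# than a full-array rebuild as with a striped RAID.
print(f"fraction to re-fetch: {1 / n:.1%}")
```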
Garage was also built for heavily geo-distributed systems, think small nodes with a handful of disks each, not to compete with existing software (MinIO, Ceph...) in datacenter-rack deployments where you have nodes with 6+ disks and are looking at efficiency (i.e. getting the most storage for your money).
Closing for now as the questions have been answered.