Durability Concerns Regarding Disk Failure in Multi-Datacenter Deployments #890

Closed
opened 2024-10-20 09:55:50 +00:00 by hooloovoo · 6 comments

I am trying to better understand the durability design of GarageHQ.

According to the documentation, it is recommended to run on raw HDDs without RAID, using XFS. However, I have concerns about the following scenario:

  1. Assume we have 3 datacenters, each with 20 disks, and GarageHQ does its best to distribute data evenly across the disks.
  2. The replication factor is set to 3, meaning each datacenter holds a copy of the data.
  3. If one disk fails simultaneously in each datacenter (1 disk out of 20 in each), this could lead to some permanent data loss.
  4. As the number of disks increases, the likelihood of simultaneous failures across datacenters also grows, increasing the risk of data loss.

Is this assumption correct? If so, would it be more advisable to recommend deploying on ZFS RAID-Z2 for added resilience, rather than relying on raw HDDs?
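
For a rough sense of scale, here is the back-of-the-envelope model I have in mind. The annual failure rate, the rebuild window, and the assumption that each block has exactly one replica per datacenter are my own simplifications, not Garage's actual placement logic:

```python
# Toy durability model -- every number here is an assumption, not Garage behaviour.
DISKS_PER_DC = 20
DATACENTERS = 3        # replication factor 3, one replica per datacenter
AFR = 0.02             # assumed annual failure rate per disk
REBUILD_DAYS = 7       # assumed window to detect and resync a dead disk

# Probability that one *specific* disk dies inside the rebuild window.
p_disk = AFR * REBUILD_DAYS / 365

# A given block is lost only if the specific disk holding its replica
# dies in *every* datacenter within the same window.
p_block_lost = p_disk ** DATACENTERS

# Probability that at least one disk dies in each datacenter during the
# window; with many blocks spread evenly, this approximates "some data lost".
p_some_loss = (1 - (1 - p_disk) ** DISKS_PER_DC) ** DATACENTERS

print(f"p(specific disk dies in window)     ~ {p_disk:.1e}")
print(f"p(a given block loses all replicas) ~ {p_block_lost:.1e}")
print(f"p(some block lost, 20 disks per DC) ~ {p_some_loss:.1e}")
```

In this toy model the "some data lost" probability grows roughly with the cube of the disk count per datacenter, which is what worries me as the cluster scales.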

Thank you for your insights!

I think the issue might be with the replication factor. As the amount of data grows, it’s not enough to simply add more disks to resolve your concerns. You also need to scale up the number of servers in the cluster, otherwise the risk of failure remains.

I feel like increasing the replication factor (to 5 or even 7) would be a better way to improve durability as your dataset grows. If you have the option to add more servers, this could work well. But if you're limited in how many servers you can run and can only add more disks, then it might make sense to go against the devs' recommendation and still use RAID to protect against multiple disk failures.
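
As a rough illustration (reusing the toy model from the first post, with the same assumed probability that a specific disk dies within the rebuild window), the chance that a given block loses every replica shrinks very quickly as the replication factor goes up:

```python
# Same assumption as the toy model above: probability that one specific
# disk dies within the rebuild window.
p_disk = 0.02 * 7 / 365

for rf in (3, 5, 7):
    print(f"replication factor {rf}: p(block loses all replicas) ~ {p_disk ** rf:.1e}")
```

The trade-off, of course, is the raw storage cost of keeping 5 or 7 full copies.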

Owner

It all depends on your durability target. For a simple 3-site deployment with the replication factor set to 3, we feel it makes sense to use the disks as-is: you can lose any disks in one zone without actual data loss (akin to RAID5). If your number of disks or zones increases, then you might have to increase the number of replicas and/or use some kind of local redundancy, in the same way that no one would recommend doing a RAID5 striped over 12 disks.
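
To make that scaling intuition concrete with the same toy model used earlier in the thread (assumed probability that a specific disk dies within a rebuild window, one replica per zone), the chance of losing at least some data grows quickly with the number of disks per zone:

```python
# Rough scaling of "some data lost" with disk count per zone, reusing the
# assumed probability that one specific disk dies within the rebuild window.
p_disk = 0.02 * 7 / 365
ZONES = 3

for disks_per_zone in (3, 6, 12, 20, 40):
    p_some_loss = (1 - (1 - p_disk) ** disks_per_zone) ** ZONES
    print(f"{disks_per_zone:>2} disks per zone: p(some data lost) ~ {p_some_loss:.1e}")
```

That is the same intuition as not striping a RAID5 too wide.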

Owner

To be clear, we don't discourage you from running Garage on a RAID data pool (be it mdadm, Btrfs, ZFS or hardware RAID); we simply feel you should give a thought to the resiliency afforded by the geo-distributed replicas first, and leverage local resiliency if needed. I would also say that local resiliency _is_ suggested for the metadata folder, and that it can be somewhat easier to deal with hardware failures on a RAID volume than to deal with the consequences of a dead disk with an actively used filesystem on it (you'll very likely have to restart the server to make the kernel happy).

Author

Thank you for clarifying the durability model. It seems that GarageHQ’s durability approach resembles a RAID-10 configuration, with each datacenter functioning like a RAID-0 set while the replication across datacenters acts as a “mirrored” layer.

Without internal redundancy (RAID-1 or RAID-Z2) within each datacenter, simultaneous disk failures across datacenters could result in data loss, especially as the number of disks grows.

If erasure coding is not planned, increasing the replication factor for large datasets could be a solution, but this does have substantial resource costs. In this case, it might be beneficial to provide guidance or documentation noting that **each datacenter essentially behaves like a RAID-0**, and that durability decreases as the disk count scales.

I think one approach that could help would be to implement parameters like `rack` or `max-strip-disks` that define a maximum stripe size (e.g., maybe limit to 3 or 6 disks) within each datacenter. This would enable configurations where data is only spread across a limited number of disks per bucket, reducing exposure to failures. It would help maintain durability even as the disk count grows, allowing users to manage risk more effectively.
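
To illustrate the intuition behind this with a toy comparison (this is not an existing Garage feature, just the idea): assume one disk has already failed in each of the 3 datacenters, and compare fully spread placement against placement restricted to fixed disk groups.

```python
# Toy comparison of placement strategies, given that one disk has already
# failed in each of the 3 datacenters. Not an existing Garage feature.
disks_per_dc = 20

# Fully spread placement: with many blocks, virtually every combination of
# (disk in DC1, disk in DC2, disk in DC3) shares some blocks, so any such
# triple of failures loses data.
p_loss_spread = 1.0

# Grouped placement: disk i in DC1 only ever pairs with disk i in DC2 and
# DC3, so the three failed disks must have the same index to lose data.
p_loss_grouped = 1 / disks_per_dc ** 2

print(f"spread placement : p(data lost | 1 failure per DC) ~ {p_loss_spread:.2f}")
print(f"grouped placement: p(data lost | 1 failure per DC) ~ {p_loss_grouped:.4f}")
```

The flip side is that when a grouped triple does fail together, a much larger share of the data is lost at once, so this is a trade-off rather than a free win.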

Owner

> Thank you for clarifying the durability model. It seems that GarageHQ’s durability approach resembles a RAID-10 configuration, with each datacenter functioning like a RAID-0 set while the replication across datacenters acts as a “mirrored” layer.

Yes and no: because the data is chunked and not striped, losing a disk on a site doesn't incur the same penalty as a RAID0 would: the disk can be replaced and the lost blocks can be recovered from the other sites relatively quickly, because _only that disk needs to be rebuilt_.
The node is also able to keep serving blocks from its other disks. The loss of a disk is hence significantly less impactful than losing a disk in a large striped array, so the number of disks on a node doesn't matter that much.
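
For a rough sense of the rebuild window involved (the disk size and resync throughput below are assumptions, not measured Garage figures):

```python
# Back-of-the-envelope rebuild time for a single failed disk, assuming the
# lost blocks are re-fetched from the other sites.
disk_capacity_tb = 16        # assumed size of the dead disk
resync_mb_per_s = 100        # assumed sustained re-replication throughput

rebuild_seconds = disk_capacity_tb * 1e12 / (resync_mb_per_s * 1e6)
print(f"~{rebuild_seconds / 86400:.1f} days to re-replicate one disk")
```

Only the blocks that lived on the dead disk need to travel, so the exposure window scales with the size of that one disk, not with the size of the whole array.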

Garage was also built for heavily geo-distributed setups (think small nodes with a handful of disks each), not to compete with existing software (MinIO, Ceph, ...) in datacenter-rack deployments where you have nodes with 6+ disks and are looking at efficiency (i.e. getting the most storage for your money).

Owner

Closing for now as the questions have been answered.

maximilien added the scope/documentation label 2024-11-19 22:36:25 +00:00
Reference: Deuxfleurs/garage#890