Support multiple hard drives per server #218

Closed
opened 2022-02-04 07:46:54 +00:00 by quentin · 12 comments
Owner

We have a very specific use case, with only one hard drive per server.
But people often have numerous HDDs, up to ~64, and they do not know how to deploy Garage with such a setup.
That's the question we had in one of our past exchanges.

We could ask them to just create a JBOD (if one disk goes offline, we must rebuild the whole server) or a RAID device (which adds another level of redundancy).

But I think it would be simpler, easier to recover, and more performant to allow people to set multiple paths, thus using multiple disks, in Garage's configuration.

Example of a new configuration file:

```toml
# data_dir = "/var/lib/garage/data"
data_dir = [
  "/mnt/hdd1/garage",
  "/mnt/hdd2/garage",
  "/mnt/hdd3/garage",
  "/mnt/hdd4/garage",
]
```

It raises some questions:

  • If a disk is offline, should Garage stop, ignore the offline data, or try to recover the chunks on other disks? And what are the risks of filling up the other disks and making the whole node unusable?
  • If we add or remove a disk, should the data be rebalanced? What is the rule for placing chunks on disks?

Another solution could be to run one Garage instance per disk, which would however make it harder to deploy and maintain.

quentin added the Ideas label 2022-02-04 07:46:54 +00:00

I'd strongly recommend allowing the OS to handle this, either via block-level (MD, LVM, DRBD, iSCSI, etc.) or file-system level (ZFS, BTRFS, etc.), since those tools have spent years perfecting redundancy and recovery.

That way you simply point Garage at any (block device) mount-point.
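For illustration, a minimal file-system-level sketch (untested; the pool name and device names are assumptions) would be a redundant ZFS pool spanning the drives, mounted where Garage's data directory points:

```sh
# Assumed devices /dev/sdb../dev/sde; "garage" is an arbitrary pool name.
# raidz1 tolerates the loss of one disk; the pool is mounted at /mnt/garage-data,
# which garage.toml's data_dir can then point to.
zpool create -m /mnt/garage-data garage raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```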

Owner

I had the same question and did not find the answer in the docs.

The thing is, it's wasteful to add redundancy. If you distribute data below Garage without redundancy (with LVM, RAID-0 mdadm, ZFS vdev striping, or even hardware RAID-0), as soon as you lose a single disk, you lose everything. This is not good: you would have to rebalance/rebuild your whole dataset. Also, performance-wise, Garage would have no control over the placement of data, so it would likely under-utilize the hardware.

Running one Garage instance per disk looks like a good idea: it clearly separates failure domains and provides good parallelism. However, it might challenge the P2P part of Garage if too many instances are running.

Owner

It might be useful to look at Ceph: if I'm not mistaken, Ceph runs one OSD daemon for each physical disk, so you would have as many OSD daemons running in parallel as your number of disks.

Author
Owner

For now, I think we can recommend running one Garage instance per disk. This is not perfect, as some (but not all) metadata will be duplicated between instances. On the network side, it should not be blocking, but it will involve a small overhead too.

Another hacky solution that I may test soon is to manually mount parts of the data directory.
In this directory, Garage stores chunks according to their hash, in a 2-level tree based on the first and second bytes of the hash. So the chunk with hash `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855` will be stored in the folder `./e3/b0/`. It means that in the root folder, you will have 256 folders from `00` to `ff`. If you have 2 disks of the same size, you can mount `00` to `7f` on disk 1 and `80` to `ff` on disk 2.

In the future, I would like to integrate this solution directly into Garage, with the configuration I presented at the top.
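As a rough illustration of that manual split, here is a minimal, untested sketch (the data directory and disk mount points are assumptions) that bind-mounts prefixes `00`–`7f` onto the first disk and `80`–`ff` onto the second:

```sh
#!/bin/sh
# Sketch: spread Garage's 256 hash-prefix directories over two disks with bind mounts.
# DATA, DISK1 and DISK2 are assumed paths; adjust to your layout.
DATA=/var/lib/garage/data
DISK1=/mnt/hdd1/garage
DISK2=/mnt/hdd2/garage

for i in $(seq 0 255); do
  p=$(printf '%02x' "$i")
  # prefixes 00..7f go to disk 1, 80..ff go to disk 2
  if [ "$i" -lt 128 ]; then dst=$DISK1; else dst=$DISK2; fi
  mkdir -p "$dst/$p" "$DATA/$p"
  mount --bind "$dst/$p" "$DATA/$p"
done
```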

Owner

Other option: make an XFS partition on each drive, and put them together with [`mergerfs`](https://github.com/trapexit/mergerfs). This looks like it would be a near-optimal setup. TODO: write an example mergerfs mount command with the correct parameters.

lx added Improvement and removed Ideas labels 2022-11-16 16:18:36 +00:00
lx added this to the v0.9 milestone 2022-11-16 16:18:38 +00:00

Hi,

While searching for distributed storage for my home server, I was checking whether MinIO works over a 1 Gbps network and found a link to Garage.

I read the blog article, and read and listened further to learn more.
Garage matches my usage perfectly: storage for Nextcloud, with the goal of offering my own cloud services to my family (Nextcloud, Jitsi, and my own services for my /e/OS phone).
In the docs I found an example with one path; in my case I have 6 nodes, each with 4 disks of 1 TB.
I searched the issues for a similar case and found this one :)

@lx can you please explain the mergerfs / XFS setup to me?

Great work guys. Big thanks.
I bookmarked the Deuxfleurs website, and will follow.

Owner

> @lx can you please explain the mergerfs / XFS setup to me?

Basically, format each of your drives as XFS and mount them for instance at `/mnt/hdd1`, `/mnt/hdd2`, ..., `/mnt/hddN`.

Then mount everything together at `/mnt/garage-data`, for instance with the following line in `/etc/fstab`:

```
/mnt/hdd*  /mnt/garage-data  fuse.mergerfs  allow_other,use_ino,cache.files=off,dropcacheonclose=true,category.create=mfs  0  0
```

Then, point your Garage data directory to `/mnt/garage-data`.

I have never tested this setup; it would be interesting to compare it with the one-daemon-per-HDD setup. In all cases, you can start by formatting all your HDDs with XFS. I think one-daemon-per-HDD is probably more performant, but harder to manage and also harder to scale, as Garage shouldn't be used to run more than about 100 daemons without special tweaking.
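For concreteness, a minimal, untested command-line sketch of the same setup (the device names `/dev/sdb`/`/dev/sdc` and the mount points are assumptions):

```sh
# Format each drive as XFS and mount it
mkfs.xfs /dev/sdb && mkdir -p /mnt/hdd1 && mount /dev/sdb /mnt/hdd1
mkfs.xfs /dev/sdc && mkdir -p /mnt/hdd2 && mount /dev/sdc /mnt/hdd2

# Pool all /mnt/hdd* mounts into a single directory with mergerfs,
# using the same options as the fstab line above
mkdir -p /mnt/garage-data
mergerfs -o allow_other,use_ino,cache.files=off,dropcacheonclose=true,category.create=mfs '/mnt/hdd*' /mnt/garage-data
```

and then set `data_dir = "/mnt/garage-data"` in `garage.toml`.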


Thanks for the answer.

Does one daemon per HDD mean that a single daemon can't manage all the disks as one volume?
In my case I would like to "merge" 24 disks into a data pool with replication 2, around ~10 TB usable if I follow the Ceph calculator.

I need to prepare some things to test the setup; today I had an issue with a blk_update_request I/O error, but after running smartctl and re-seating the cable on the node's motherboard (and the same for the disk at the front), the error disappeared :)

Owner

> In my case I would like to "merge" 24 disks into a data pool with replication 2, around ~10 TB usable if I follow the Ceph calculator.

For that there is only one choice: run one Garage daemon per disk.

Owner

> I'd strongly recommend allowing the OS to handle this, either via block-level (MD, LVM, DRBD, iSCSI, etc.) or file-system level (ZFS, BTRFS, etc.), since those tools have spent years perfecting redundancy and recovery.
>
> That way you simply point Garage at any (block device) mount-point.

I would strongly advise this for now, or running one Garage instance per disk.

The failure of a non-redundant block device can cause a lot of strange behaviors in the kernel, and it is *very* likely that you'll have to restart the kernel to recover the application.

What would happen, for example, if one of the disks switched to read-only? This is something that can happen with some SSDs when they reach their endurance limit. That would mean that a random subset of the Garage node's blocks would fail DELETE and WRITE requests...


@lx,

Do you have an example of how to run one Garage daemon per disk, please?
I see `garage.toml` indicates `rpc_public_addr`; in the case of multiple daemons, do I need to define different ports?

Thanks

Owner

> I see `garage.toml` indicates `rpc_public_addr`; in the case of multiple daemons, do I need to define different ports?

Yes, that's pretty much it: allocate one port per daemon for the RPC port (`rpc_bind_addr` and `rpc_public_port`). If you are interested in monitoring, you should also allocate one port for the admin API on each node (`api_bind_addr` in the `[admin_api]` section). For the S3 API, you can remove the `[s3_api]` section on most nodes and use only one of them to access it (or use a separate gateway node).
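As a minimal sketch (an excerpt only, untested; the paths, ports, and addresses below are placeholders), a second daemon's `garage.toml` on the same host could look like:

```toml
# Hypothetical config excerpt for the second daemon on the same host.
# Paths, ports and the public address are assumptions; pick unique values per daemon.
# Other settings (e.g. replication mode) stay the same as on the first daemon.
metadata_dir = "/var/lib/garage-hdd2/meta"
data_dir     = "/mnt/hdd2/garage"

rpc_bind_addr   = "[::]:3902"           # a unique RPC port for this daemon
rpc_public_addr = "192.0.2.10:3902"
rpc_secret      = "<same secret on every daemon of the cluster>"

# No [s3_api] section here: expose S3 from a single daemon or a gateway node.
```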

lx closed this issue 2023-09-11 10:52:02 +00:00