Support multiple hard drives per server #218

Closed
opened 2022-02-04 07:46:54 +00:00 by quentin · 12 comments
Owner

We have a very specific use case, with only one hard drive per server.
But people often have numerous HDDs, up to ~64, and they do not know how to deploy Garage with such a setup.
That's the question we had in one of our past exchanges.

We could ask them to just create a JBOD (if one disk goes offline, we must rebuild the whole server) or a RAID device (which adds another level of redundancy).

But I think it would be simpler, easier to recover, and more performant to allow people to set multiple paths, thus using multiple disks, in Garage's configuration.

Example of a new configuration file:

```toml
# data_dir = "/var/lib/garage/data"
data_dir = [
  "/mnt/hdd1/garage",
  "/mnt/hdd2/garage",
  "/mnt/hdd3/garage",
  "/mnt/hdd4/garage",
]
```

It raises some questions:

  • If a disk is offline, should Garage stop, ignore the offline data, or try to recover the chunks on other disks? And what are the risks of filling up the other disks and making the whole node unusable?
  • If we add or remove a disk, should the data be rebalanced? What is the rule for placing chunks on disks?

Another solution could be to run one Garage instance per disk, which would however make it harder to deploy and maintain.

quentin added the Ideas label 2022-02-04 07:46:54 +00:00

I'd strongly recommend allowing the OS to handle this, either via block-level (MD, LVM, DRBD, iSCSI, etc.) or file-system level (ZFS, BTRFS, etc.), since those tools have spent years perfecting redundancy and recovery.

That way you simply point Garage at any (block device) mount-point.
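For illustration, a minimal file-system-level sketch (untested; the pool name and device names are assumptions) would be a redundant ZFS pool spanning the drives, mounted where Garage's data directory points:

```sh
# Assumed devices /dev/sdb../dev/sde; "garage" is an arbitrary pool name.
# raidz1 tolerates the loss of one disk; the pool is mounted at /mnt/garage-data,
# which garage.toml's data_dir can then point to.
zpool create -m /mnt/garage-data garage raidz1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
```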

Owner

I had the same question and did not find the answer in the docs.

The thing is, it's wasteful to add redundancy. If you distribute data below Garage without redundancy (with LVM, RAID-0 mdadm, ZFS vdev striping, or even hardware RAID-0), as soon as you lose a single disk, you lose everything. This is not good: you would have to rebalance/rebuild your whole dataset. Also, performance-wise, Garage would have no control over the placement of data, so it would likely under-utilize the hardware.

Running one Garage instance per disk looks like a good idea: it clearly separates failure domains and provides good parallelism. However, it might challenge the P2P part of Garage if too many instances are running.

Owner

It might be useful to look at Ceph: if I'm not mistaken, Ceph runs one OSD daemon for each physical disk, so you would have as many OSD daemons running in parallel as your number of disks.

Author
Owner

For now, I think we can recommend running one Garage instance per disk. This is not perfect, as some (but not all) metadata will be duplicated between instances. On the network side, it should not be blocking, but it will involve a small overhead too.

Another hacky solution that I may test soon is to manually mount parts of the data directory.
In this directory, Garage stores chunks according to their hash, in a 2-level tree based on the first and second bytes of the hash. So the chunk with hash `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855` will be stored in the folder `./e3/b0/`. It means that in the root folder, you will have 256 folders from `00` to `ff`. If you have 2 disks of the same size, you can mount `00` to `7f` on disk 1 and `80` to `ff` on disk 2.

In the future, I would like to integrate this solution directly into Garage, with the configuration I presented at the top.
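As a rough illustration of that manual split, here is a minimal, untested sketch (the data directory and disk mount points are assumptions) that bind-mounts prefixes `00`–`7f` onto the first disk and `80`–`ff` onto the second:

```sh
#!/bin/sh
# Sketch: spread Garage's 256 hash-prefix directories over two disks with bind mounts.
# DATA, DISK1 and DISK2 are assumed paths; adjust to your layout.
DATA=/var/lib/garage/data
DISK1=/mnt/hdd1/garage
DISK2=/mnt/hdd2/garage

for i in $(seq 0 255); do
  p=$(printf '%02x' "$i")
  # prefixes 00..7f go to disk 1, 80..ff go to disk 2
  if [ "$i" -lt 128 ]; then dst=$DISK1; else dst=$DISK2; fi
  mkdir -p "$dst/$p" "$DATA/$p"
  mount --bind "$dst/$p" "$DATA/$p"
done
```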

Owner

Other option: make an XFS partition on each drive, and put them together with [`mergerfs`](https://github.com/trapexit/mergerfs). This looks like it would be a near-optimal setup. TODO: write an example mergerfs mount command with the correct parameters.

lx added Improvement and removed Ideas labels 2022-11-16 16:18:36 +00:00
lx added this to the v0.9 milestone 2022-11-16 16:18:38 +00:00

Hi,

While searching for distributed storage for my home server, I was checking whether MinIO works over a 1 Gbps network and found a link to Garage.

I read the blog article, and read and listened further to learn more.
Garage matches my usage perfectly: storage for Nextcloud, with the goal of offering my own cloud services to my family (Nextcloud, Jitsi, and my own services for my /e/OS phone).
In the docs I found an example with one path; in my case I have 6 nodes, each with 4 disks of 1 TB.
I searched the issues for a similar case and found this one :)

@lx can you please explain the mergerfs / XFS setup to me?

Great work guys. Big thanks.
I bookmarked the Deuxfleurs website, and will follow.

Owner

> @lx can you please explain the mergerfs / XFS setup to me?

Basically, format each of your drives as XFS and mount them for instance at `/mnt/hdd1`, `/mnt/hdd2`, ..., `/mnt/hddN`.

Then mount everything together at `/mnt/garage-data`, for instance with the following line in `/etc/fstab`:

```
/mnt/hdd*  /mnt/garage-data  fuse.mergerfs  allow_other,use_ino,cache.files=off,dropcacheonclose=true,category.create=mfs  0  0
```

Then, point your Garage data directory to `/mnt/garage-data`.

I have never tested this setup; it would be interesting to compare it with the one-daemon-per-HDD setup. In all cases, you can start by formatting all your HDDs with XFS. I think one-daemon-per-HDD is probably more performant, but harder to manage and also harder to scale, as Garage shouldn't be used to run more than about 100 daemons without special tweaking.
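For concreteness, a minimal, untested command-line sketch of the same setup (the device names `/dev/sdb`/`/dev/sdc` and the mount points are assumptions):

```sh
# Format each drive as XFS and mount it
mkfs.xfs /dev/sdb && mkdir -p /mnt/hdd1 && mount /dev/sdb /mnt/hdd1
mkfs.xfs /dev/sdc && mkdir -p /mnt/hdd2 && mount /dev/sdc /mnt/hdd2

# Pool all /mnt/hdd* mounts into a single directory with mergerfs,
# using the same options as the fstab line above
mkdir -p /mnt/garage-data
mergerfs -o allow_other,use_ino,cache.files=off,dropcacheonclose=true,category.create=mfs '/mnt/hdd*' /mnt/garage-data
```

and then set `data_dir = "/mnt/garage-data"` in `garage.toml`.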


Thanks for the answer.

Does one daemon per HDD mean that a single daemon can't manage all the disks as one volume?
In my case I would like to "merge" 24 disks into a data pool with replication 2, around ~10 TB usable if I follow the Ceph calculator.

I need to prepare some things to test the setup; today I had an issue with a blk_update_request I/O error, but after running smartctl and re-seating the cable on the node's motherboard (and the same for the disk at the front), the error disappeared :)

Owner

> In my case I would like to "merge" 24 disks into a data pool with replication 2, around ~10 TB usable if I follow the Ceph calculator.

For that there is only one choice: run one Garage daemon per disk.

Owner

> I'd strongly recommend allowing the OS to handle this, either via block-level (MD, LVM, DRBD, iSCSI, etc.) or file-system level (ZFS, BTRFS, etc.), since those tools have spent years perfecting redundancy and recovery.
>
> That way you simply point Garage at any (block device) mount-point.

I would strongly advise this for now, or running one Garage instance per disk.

The failure of a non-redundant block device can cause a lot of strange behaviors in the kernel, and it is *very* likely that you'll have to restart the kernel to recover the application.

What would happen, for example, if one of the disks switched to read-only? This is something that can happen with some SSDs when they reach their endurance limit. That would mean that a random subset of the Garage node's blocks would fail DELETE and WRITE requests...


@lx,

Do you have an example of how to run one Garage daemon per disk, please?
I see `garage.toml` indicates `rpc_public_addr`; in the case of multiple daemons, do I need to define different ports?

Thanks

Owner

> I see `garage.toml` indicates `rpc_public_addr`; in the case of multiple daemons, do I need to define different ports?

Yes, that's pretty much it: allocate one port per daemon for the RPC port (`rpc_bind_addr` and `rpc_public_port`). If you are interested in monitoring, you should also allocate one port for the admin API on each node (`api_bind_addr` in the `[admin_api]` section). For the S3 API, you can remove the `[s3_api]` section on most nodes and use only one of them to access it (or use a separate gateway node).
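As a minimal sketch (an excerpt only, untested; the paths, ports, and addresses below are placeholders), a second daemon's `garage.toml` on the same host could look like:

```toml
# Hypothetical config excerpt for the second daemon on the same host.
# Paths, ports and the public address are assumptions; pick unique values per daemon.
# Other settings (e.g. replication mode) stay the same as on the first daemon.
metadata_dir = "/var/lib/garage-hdd2/meta"
data_dir     = "/mnt/hdd2/garage"

rpc_bind_addr   = "[::]:3902"           # a unique RPC port for this daemon
rpc_public_addr = "192.0.2.10:3902"
rpc_secret      = "<same secret on every daemon of the cluster>"

# No [s3_api] section here: expose S3 from a single daemon or a gateway node.
```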

lx closed this issue 2023-09-11 10:52:02 +00:00