"Estimated available storage space cluster-wide" went down after adding capacity #907

Closed
opened 2024-11-22 08:46:49 +00:00 by vk · 10 comments
Contributor

Hello,

I have made a small change: adding 1 TB of capacity to one node of a 5-node cluster.
The "Estimated available storage space cluster-wide" reported by garage stats dropped significantly, instead of increasing slightly.

Replication has finished, as indicated by disk activity and the resync queue length returning to their previous levels.

Layout & reported capacity before layout change:

Storage nodes:
  ID                Hostname                        Zone     Capacity
  aa**************  **************************      B**      2.0 TB
  77**************  *****************************   M******  8.0 TB
  e9**************  ******************************  P**      6.0 TB
  73**************  ******************************  A**      2.0 TB
  20**************  *************************       F**      8.0 TB

Estimated available storage space cluster-wide (might be lower in practice):
  data: 8.1 TB

Layout & reported capacity after layout change:

Storage nodes:
  ID                Hostname                        Zone     Capacity  Part.  DataAvail               MetaAvail
  aa**************  **************************      B**      3.0 TB    84     1.8 TB/1.9 TB (93.3%)   65.8 GB/66.0 GB (99.6%)
  77**************  *****************************   M******  8.0 TB    228    8.4 TB/9.0 TB (92.6%)   377.4 GB/378.1 GB (99.8%)
  e9**************  ******************************  P**      6.0 TB    171    7.3 TB/7.8 TB (93.6%)   851.4 GB/852.0 GB (99.9%)
  73**************  ******************************  A**      2.0 TB    57     3.3 TB/3.4 TB (95.7%)   3.3 TB/3.3 TB (100.0%)
  20**************  *************************       F**      8.0 TB    228    9.5 TB/10.2 TB (93.5%)  192.4 GB/193.2 GB (99.6%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 5.5 TB

Also, I'm unsure what DataAvail is meant to represent, but the aa************** node has considerably more capacity (free disk space) than indicated, spread across its two data directories.

Garage version:

Garage version: cargo:1.0.1 [features: k2v, lmdb, sqlite, metrics, bundled-libs]
Rust compiler version: 1.81.0

The estimated capacity is coherent with DataAvail, the configured Capacity per node, and a replication factor of 3. The limiting node in that case would be aa**************, as it claims a capacity of 3 TB while only reporting 1.8 TB of available space.
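
A rough back-of-the-envelope check (assuming the 256 partitions × 3 replicas implied by the Part. column above): the aa node holds 84 of the 768 partition replicas but reports only 1.8 TB available, which caps the cluster-wide estimate at roughly the reported 5.5 TB:

```
fn main() {
    // Figures taken from the table above: 256 partitions x 3 replicas,
    // 84 of which land on the aa node, which reports 1.8 TB available.
    let total_replicas = 256.0 * 3.0; // 768 partition replicas cluster-wide
    let aa_replicas = 84.0;
    let aa_avail_tb = 1.8;

    // If aa fills up first, the total replicated data is capped proportionally;
    // dividing by the replication factor gives the usable estimate.
    let raw_cap_tb = aa_avail_tb * total_replicas / aa_replicas; // ~16.5 TB
    let usable_tb = raw_cap_tb / 3.0; // ~5.5 TB

    println!("estimated usable: {usable_tb:.1} TB");
}
```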

DataAvail is an estimate of available space based on what statvfs(2) returns. It should be equivalent to running df -h /path/to/datadir and summing the result over all data directories (ignoring data directories that have no capacity configured and read_only set to true). Does df report roughly the same available disk space as Garage, or do they disagree?
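
For illustration, here is a minimal sketch (not Garage's actual code; it follows the description above and the arithmetic used in the patch later in this thread, via the nix crate) of how the available and total space of a single data directory can be derived from statvfs:

```
use std::path::Path;

// Sketch of the per-directory figures behind DataAvail:
// avail = f_bavail * f_frsize, total = f_blocks * f_frsize.
fn dir_space(path: &Path) -> Option<(u64, u64)> {
    use nix::sys::statvfs::statvfs;
    match statvfs(path) {
        Ok(x) => {
            let avail = x.blocks_available() as u64 * x.fragment_size() as u64;
            let total = x.blocks() as u64 * x.fragment_size() as u64;
            Some((avail, total))
        }
        Err(_) => None,
    }
}
```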

Author
Contributor

Thanks for the quick reply.

Here's the df -h report on aa**************:

$ df -h /garage/data-1
Filesystem       Size    Used   Avail Capacity  Mounted on
garage-data-1    1.5T    106G    1.4T     7%    /garage/data-1

$ df -h /garage/data-2
Filesystem       Size    Used   Avail Capacity  Mounted on
garage-data-2    1.8T    121G    1.6T     7%    /garage/data-2

and excerpt from garage.toml:

data_dir = [
        # garage-data-1
        { path = "/garage/data-1", capacity = "1.4T" },
        # garage-data-2
        { path = "/garage/data-2", capacity = "1.6T" }
]

So it does seem that DataAvail is somehow wrong. This is a FreeBSD system using ZFS. The configured capacity is deliberately set slightly lower than the actual disk capacity (even though those disks are dedicated to Garage), since ZFS pool performance drops as the pool approaches full utilization.

Author
Contributor

Also, it seems the value wasn't wrong before the layout change. The Garage node hasn't been restarted in weeks, and I have made several layout changes, progressively increasing that node's capacity.

This is the garage layout show output, showing a "Usable capacity" of 2.9 TB for that node:

$ garage layout show
==== CURRENT CLUSTER LAYOUT ====
ID                Tags        Zone     Capacity  Usable capacity
20**************  *****       F**      8.0 TB    8.0 TB (100.0%)
73**************  **********  A**      2.0 TB    2.0 TB (100.0%)
77**************  *********   M******  8.0 TB    8.0 TB (100.0%)
aa**************  ******      B**      3.0 TB    2.9 TB (98.2%)
e9**************  **********  P**      6.0 TB    6.0 TB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 19
Owner

A small question, if you don't mind: there's no particular reason it should matter, but does that value change if you restart the node?

Author
Contributor

Hi, I restarted the aa node, and the value didn't change.

Author
Contributor

Here are the values reported by the statvfs() call:

data-1:
        f_frsize: 512
        f_blocks: 3284118120
        f_bavail: 3044746360
        f_bfree: 3044746360
data-2:
        f_frsize: 512
        f_blocks: 3771702384
        f_bavail: 3497497224
        f_bfree: 3497497224

Which gives 512 × (3284118120 + 3771702384) ≈ 3.6 TB, a bit over 3 TB in total. So something does seem to be wrong in Garage's computation.

Values reported by zfs:

$ zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
garage               295M  61.2G   124K  /garage
garage-data-1        114G  1.42T   114G  /garage/data-1
garage-data-2        131G  1.63T   131G  /garage/data-2
garage/meta          257M  61.2G   257M  /garage/meta
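
As a cross-check (a small standalone conversion, using only the figures above): the statvfs totals, expressed in binary units, match what zfs list and df report, so the raw per-filesystem figures agree with each other:

```
fn main() {
    // f_blocks figures from the statvfs output above (fragment size 512 bytes).
    let frsize = 512u64;
    let totals = [("data-1", 3_284_118_120u64), ("data-2", 3_771_702_384u64)];

    for (name, f_blocks) in totals {
        let bytes = frsize * f_blocks;
        // zfs list and df use binary prefixes, so compare in TiB:
        // data-1 ~= 1.53 TiB (zfs: 114G used + 1.42T avail),
        // data-2 ~= 1.76 TiB (zfs: 131G used + 1.63T avail).
        println!("{name}: {:.2} TiB total", bytes as f64 / (1u64 << 40) as f64);
    }
}
```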

This is the C program that produced the above output:

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    struct statvfs data1, data2;
    
    if(statvfs("/garage/data-1", &data1))
    {
        printf("statvfs(data-1) failed\n");
        return 1;
    }

    if(statvfs("/garage/data-2", &data2))
    {
        printf("statvfs(data-2) failed\n");
        return 1;
    }
    
    printf("data-1:\n");
    
    printf("\tf_frsize: %lu\n", data1.f_frsize);
    printf("\tf_blocks: %lu\n", data1.f_blocks);
    printf("\tf_bavail: %lu\n", data1.f_bavail);
    printf("\tf_bfree: %lu\n", data1.f_bfree);

    printf("data-2:\n");
    
    printf("\tf_frsize: %lu\n", data2.f_frsize);
    printf("\tf_blocks: %lu\n", data2.f_blocks);
    printf("\tf_bavail: %lu\n", data2.f_bavail);
    printf("\tf_bfree: %lu\n", data2.f_bfree);
    
    return 0;
}
Author
Contributor

If my understanding of the

fn update_disk_usage(&mut self, meta_dir: &Path, data_dir: &DataDirEnum) {

implementation (https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/906c8708fd53880d998c595ccd39ab9f08866457/src/rpc/system.rs#L808) is correct, a HashMap is built at one point, using statvfs::filesystem_id as key. statvfs::filesystem_id is a wrapper around the f_fsid member of the C struct statvfs, which, according to man statvfs on FreeBSD:

f_fsid     Not meaningful in this implementation.

Indeed, its value is 0 for both the garage-data-1 and garage-data-2 datasets. Therefore, my guess is that those two datasets are coalesced into one entry, and Garage retains the capacity of only one of them. However, this contradicts the previous observation that the total capacity was over 8 TB before the last cluster layout change (I didn't keep track of individual nodes at that point in time).
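
As a standalone illustration of the suspected effect (a sketch using approximate figures from the df output above, not Garage code): if both directories yield the same filesystem id, the second HashMap insert overwrites the first, and only one dataset's capacity survives:

```
use std::collections::HashMap;

fn main() {
    // Approximate (avail, total) figures in bytes for the two datasets,
    // taken from the df output above.
    let data1 = (1_400_000_000_000u64, 1_500_000_000_000u64);
    let data2 = (1_600_000_000_000u64, 1_800_000_000_000u64);

    let mut mounts: HashMap<u64, (u64, u64)> = HashMap::new();
    // On FreeBSD, statvfs reports f_fsid = 0 for both ZFS datasets,
    // so they collide on the same key and the second insert wins.
    mounts.insert(0, data1);
    mounts.insert(0, data2);

    let (avail, total) = mounts
        .values()
        .fold((0u64, 0u64), |(a, t), (da, dt)| (a + da, t + dt));
    // Prints the figures of a single dataset instead of the sum of both.
    println!("avail = {avail} B, total = {total} B");
}
```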

Author
Contributor

I've created and tested (on FreeBSD) a naive fix for this (I'm not a Rust dev). I was unable to push it to open a PR:

$ (fix_907) git push --set-upstream origin fix_907
Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 24 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 776 bytes | 776.00 KiB/s, done.
Total 5 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
remote:
remote: Forgejo: User permission denied for writing.
To git.deuxfleurs.fr:Deuxfleurs/garage.git
 ! [remote rejected]   fix_907 -> fix_907 (pre-receive hook declined)
error: failed to push some refs to 'git.deuxfleurs.fr:Deuxfleurs/garage.git'

Here's the patch:

diff --git a/src/rpc/system.rs b/src/rpc/system.rs
index d94d4eec..1a5677df 100644
--- a/src/rpc/system.rs
+++ b/src/rpc/system.rs
@@ -807,6 +807,16 @@ impl NodeStatus {

        fn update_disk_usage(&mut self, meta_dir: &Path, data_dir: &DataDirEnum) {
                use nix::sys::statvfs::statvfs;
+
+               // The HashMap used below requires a filesystem identifier from statfs (instead of statvfs) on FreeBSD, as
+               // FreeBSD's statvfs filesystem identifier is "not meaningful in this implementation" (man 3 statvfs).
+
+               #[cfg(target_os = "freebsd")]
+               let get_filesystem_id = |path: &Path| match nix::sys::statfs::statfs(path) {
+                       Ok(fs) => Some(fs.filesystem_id()),
+                       Err(_) => None,
+               };
+
                let mount_avail = |path: &Path| match statvfs(path) {
                        Ok(x) => {
                                let avail = x.blocks_available() as u64 * x.fragment_size() as u64;
@@ -817,6 +827,7 @@ impl NodeStatus {
                };

                self.meta_disk_avail = mount_avail(meta_dir).map(|(_, a, t)| (a, t));
+
                self.data_disk_avail = match data_dir {
                        DataDirEnum::Single(dir) => mount_avail(dir).map(|(_, a, t)| (a, t)),
                        DataDirEnum::Multiple(dirs) => (|| {
@@ -827,12 +838,25 @@ impl NodeStatus {
                                        if dir.capacity.is_none() {
                                                continue;
                                        }
+
+                                       #[cfg(not(target_os = "freebsd"))]
                                        match mount_avail(&dir.path) {
                                                Some((fsid, avail, total)) => {
                                                        mounts.insert(fsid, (avail, total));
                                                }
                                                None => return None,
                                        }
+
+                                       #[cfg(target_os = "freebsd")]
+                                       match get_filesystem_id(&dir.path) {
+                                               Some(fsid) => match mount_avail(&dir.path) {
+                                                       Some((_, avail, total)) => {
+                                                               mounts.insert(fsid, (avail, total));
+                                                       }
+                                                       None => return None,
+                                               }
+                                               None => return None,
+                                       }
                                }
                                Some(
                                        mounts

Feel free to use and refactor it to suit the project's coding conventions.

Owner

@vk Same as on GitHub: you need to "fork" the repo under your own user and push your code to a branch of your fork. You'll then be able to open a PR.

Author
Contributor

Missed a step indeed...

lx closed this issue 2025-01-04 16:07:42 +00:00
lx referenced this issue from a commit 2025-01-04 16:07:42 +00:00