"Estimated available storage space cluster-wide" went down after adding capacity #907

Open
opened 2024-11-22 08:46:49 +00:00 by vk · 7 comments

Hello,

I have made a small change: adding 1 TB of capacity to one node of a 5-node cluster.
The "Estimated available storage space cluster-wide" provided by garage stats went down significantly, instead of increasing marginally.

The replication has terminated, as indicated by disk activity and the resync queue length going back down to their previous levels.

Layout & reported capacity before layout change:

Storage nodes:
  ID                Hostname                        Zone     Capacity
  aa**************  **************************      B**      2.0 TB
  77**************  *****************************   M******  8.0 TB
  e9**************  ******************************  P**      6.0 TB
  73**************  ******************************  A**      2.0 TB
  20**************  *************************       F**      8.0 TB

Estimated available storage space cluster-wide (might be lower in practice):
  data: 8.1 TB

Layout & reported capacity after layout change:

Storage nodes:
  ID                Hostname                        Zone     Capacity  Part.  DataAvail               MetaAvail
  aa**************  **************************      B**      3.0 TB    84     1.8 TB/1.9 TB (93.3%)   65.8 GB/66.0 GB (99.6%)
  77**************  *****************************   M******  8.0 TB    228    8.4 TB/9.0 TB (92.6%)   377.4 GB/378.1 GB (99.8%)
  e9**************  ******************************  P**      6.0 TB    171    7.3 TB/7.8 TB (93.6%)   851.4 GB/852.0 GB (99.9%)
  73**************  ******************************  A**      2.0 TB    57     3.3 TB/3.4 TB (95.7%)   3.3 TB/3.3 TB (100.0%)
  20**************  *************************       F**      8.0 TB    228    9.5 TB/10.2 TB (93.5%)  192.4 GB/193.2 GB (99.6%)

Estimated available storage space cluster-wide (might be lower in practice):
  data: 5.5 TB

Also, I'm unsure about what DataAvail is meant to represent, but the aa************** node has far more capacity (free disk space) than indicated, spread across its two data directories.

Garage version:

Garage version: cargo:1.0.1 [features: k2v, lmdb, sqlite, metrics, bundled-libs]
Rust compiler version: 1.81.0

The estimated capacity is coherent with DataAvail, the configured Capacity per node, and a replication factor of 3. The limiting node in that case would be aa************** as it claims a capacity of 3 TB while only having a reported available space of 1.8 TB.

DataAvail is an estimation of available space based on what statvfs(2) returns. It should be equivalent to running df -h /path/to/datadir and summing the result for all data directories (ignoring data directories with no capacity configured and read_only set to true). Does df return roughly the same available disk space as Garage, or do they disagree?
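As a rough illustration, the estimation boils down to something like this (a sketch only, not the actual Garage code; it assumes the nix crate for the statvfs binding and ignores the capacity/read_only filtering mentioned above):

use nix::sys::statvfs::statvfs;

// Rough sketch, not Garage's implementation: approximate DataAvail by summing
// the statvfs(2) "available" space over the data directories, i.e. the same
// quantity df shows in its Avail column.
fn estimated_data_avail(data_dirs: &[&str]) -> u64 {
    data_dirs
        .iter()
        .filter_map(|dir| statvfs(*dir).ok())
        .map(|st| st.fragment_size() as u64 * st.blocks_available() as u64)
        .sum()
}

fn main() {
    // Example paths taken from this issue; adjust to your own layout.
    let dirs = ["/garage/data-1", "/garage/data-2"];
    println!("estimated available: {} bytes", estimated_data_avail(&dirs));
}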

Author

Thanks for the quick reply.

Here's the df -h report on aa**************:

$ df -h /garage/data-1
Filesystem       Size    Used   Avail Capacity  Mounted on
garage-data-1    1.5T    106G    1.4T     7%    /garage/data-1

$ df -h /garage/data-2
Filesystem       Size    Used   Avail Capacity  Mounted on
garage-data-2    1.8T    121G    1.6T     7%    /garage/data-2

and an excerpt from garage.toml:

data_dir = [
        # garage-data-1
        { path = "/garage/data-1", capacity = "1.4T" },
        # garage-data-2
        { path = "/garage/data-2", capacity = "1.6T" }
]

So it does indeed seem that DataAvail is somehow wrong. This is a FreeBSD system using ZFS. The configured capacity is purposely slightly lower than the actual disk capacity (even though those disks are dedicated to Garage), since ZFS pool performance drops when a pool approaches full usage.

Author

Also, it seems it wasn't wrong before the layout change. The Garage node hasn't been restarted for weeks, and I have made several layout changes, progressively increasing that node's capacity.

This is the garage layout show output, showing a "Usable capacity" of 2.9 TB for that node:

$ garage layout show
==== CURRENT CLUSTER LAYOUT ====
ID                Tags        Zone     Capacity  Usable capacity
20**************  *****       F**      8.0 TB    8.0 TB (100.0%)
73**************  **********  A**      2.0 TB    2.0 TB (100.0%)
77**************  *********   M******  8.0 TB    8.0 TB (100.0%)
aa**************  ******      B**      3.0 TB    2.9 TB (98.2%)
e9**************  **********  P**      6.0 TB    6.0 TB (100.0%)

Zone redundancy: maximum

Current cluster layout version: 19
Owner

A small question, if possible: there is no particular reason why it should matter, but does that value change if you restart the node?

Author

Hi, I restarted the aa node, and it didn't change.

Author

Here are the values reported by the statvfs() calls:

data-1:
        f_frsize: 512
        f_blocks: 3284118120
        f_bavail: 3044746360
        f_bfree: 3044746360
data-2:
        f_frsize: 512
        f_blocks: 3771702384
        f_bavail: 3497497224
        f_bfree: 3497497224

Which gives 512 × (3284118120 + 3771702384) bytes, a bit over 3.6 TB. So it does seem that something is wrong in Garage's computation.
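To make that arithmetic explicit, here is a small standalone check using only the numbers pasted above (nothing from Garage itself):

fn main() {
    // statvfs numbers pasted above; f_frsize = 512 bytes for both datasets
    let frsize: u64 = 512;
    let (d1_blocks, d1_avail): (u64, u64) = (3_284_118_120, 3_044_746_360);
    let (d2_blocks, d2_avail): (u64, u64) = (3_771_702_384, 3_497_497_224);

    // Total size across both data directories: ~3.6 TB.
    println!("total size : {} bytes", frsize * (d1_blocks + d2_blocks));
    // Available space across both data directories: ~3.3 TB, far more than
    // the 1.8 TB DataAvail reported by garage stats.
    println!("total avail: {} bytes", frsize * (d1_avail + d2_avail));
}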

Values reported by zfs:

$ zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
garage               295M  61.2G   124K  /garage
garage-data-1        114G  1.42T   114G  /garage/data-1
garage-data-2        131G  1.63T   131G  /garage/data-2
garage/meta          257M  61.2G   257M  /garage/meta

This is the C program that produced the above output:

#include <stdio.h>
#include <sys/statvfs.h>
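
/* Print the raw statvfs(2) fields for the two Garage data directories. */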

int main(int argc, char *argv[])
{
    struct statvfs data1, data2;
    
    if(statvfs("/garage/data-1", &data1))
    {
        printf("statvfs(data-1) failed\n");
        return 1;
    }

    if(statvfs("/garage/data-2", &data2))
    {
        printf("statvfs(data-2) failed\n");
        return 1;
    }
    
    printf("data-1:\n");
    
    printf("\tf_frsize: %lu\n", data1.f_frsize);
    printf("\tf_blocks: %lu\n", data1.f_blocks);
    printf("\tf_bavail: %lu\n", data1.f_bavail);
    printf("\tf_bfree: %lu\n", data1.f_bfree);

    printf("data-2:\n");
    
    printf("\tf_frsize: %lu\n", data2.f_frsize);
    printf("\tf_blocks: %lu\n", data2.f_blocks);
    printf("\tf_bavail: %lu\n", data2.f_bavail);
    printf("\tf_bfree: %lu\n", data2.f_bfree);
    
    return 0;
}
Author

If my understanding of the

fn update_disk_usage(&mut self, meta_dir: &Path, data_dir: &DataDirEnum) {

implementation (https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/906c8708fd53880d998c595ccd39ab9f08866457/src/rpc/system.rs#L808) is correct, a hashmap is built at one point, using statvfs::filesystem_id as key. statvfs::filesystem_id is a wrapper around the f_fsid member of the C struct statvfs, which, according to man statvfs on FreeBSD:

f_fsid     Not meaningful in this implementation.

Indeed, its value is 0 for both the garage-data-1 and garage-data-2 datasets. Therefore, my guess is that those two datasets are coalesced into one, and Garage retains the capacity of only one of them. However, this contradicts the previous observation that the total capacity was over 8 TB before the last cluster layout change (I didn't keep track of individual nodes at that point in time).
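As a toy illustration of that hypothesis (not Garage's actual code; the available sizes are taken from the statvfs output above), keying a map on a non-unique filesystem id makes the second entry overwrite the first:

use std::collections::HashMap;

// Sketch of the suspected coalescing, not Garage's code: two distinct ZFS
// datasets that both report f_fsid = 0 collapse into a single map entry, so
// only one directory's available space gets counted.
fn main() {
    // (path, f_fsid, available bytes from the statvfs output above)
    let mounts = [
        ("/garage/data-1", 0u64, 512 * 3_044_746_360u64), // ~1.56 TB
        ("/garage/data-2", 0u64, 512 * 3_497_497_224u64), // ~1.79 TB
    ];

    let mut avail_by_fsid: HashMap<u64, u64> = HashMap::new();
    for (_path, fsid, avail) in mounts {
        // Same key both times: the second insert overwrites the first.
        avail_by_fsid.insert(fsid, avail);
    }

    let counted: u64 = avail_by_fsid.values().sum();
    // Prints ~1.8 TB instead of the ~3.3 TB actually available.
    println!("counted available: {} bytes", counted);
}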
