"Estimated available storage space cluster-wide" went down after adding capacity #907
Reference: Deuxfleurs/garage#907
Hello,
I have made a small change: adding 1 TB of capacity to one node of a 5-node cluster.
The "Estimated available storage space cluster-wide" reported by `garage stats` went down significantly instead of increasing marginally.
Replication has finished, judging by disk activity and the resync queue length, which are both back to their previous levels.
Layout & reported capacity before layout change:
Layout & reported capacity after layout change:
Also, I'm unsure what `DataAvail` is meant to represent, but the aa************** node has far more capacity (free disk space) than indicated, spread over its two data directories.
Garage version:
The estimated capacity is consistent with `DataAvail`, the configured capacity per node, and a replication factor of 3. The limiting node in this case would be aa**************, as it claims a capacity of 3 TB while only 1.8 TB is reported as available.
`DataAvail` is an estimate of available space based on what `statvfs(2)` returns. It should be equivalent to running `df -h /path/to/datadir` and summing the results for all data directories (ignoring data directories with no capacity configured and `read_only` set to true). Does `df` report roughly the same available disk space as Garage, or do they disagree?
Thanks for the quick reply.
Here's the `df -h` report on aa**************:
and an excerpt from `garage.toml`:
So it does seem that `DataAvail` is somehow wrong. This is a FreeBSD system using ZFS. The configured capacity is purposely slightly lower than the actual disk capacity (although those disks are dedicated to Garage), as ZFS pool performance drops when approaching full usage.
Also, it seems it wasn't wrong before the layout change. The Garage node hasn't been restarted for weeks, and I have made several layout changes, progressively increasing that node's capacity.
This is the `garage layout show` output, showing a "Usable capacity" of 2.9 TB for that node:
Small question, if possible: I have no particular reason to expect it, but does that value change if you restart the node?
Hi, I restarted the aa node, and it didn't change.
Here are the values reported by the `statvfs()` call:
Which gives 512 x (3284118120 + 3771702384) = 3612580098048 bytes, a bit over 3 TB. So something does seem to be wrong in Garage's computations.
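The arithmetic above can be checked with a minimal Rust sketch. It assumes, as the calculation in this comment does, a block size of 512 bytes and that the two numbers are the available block counts of the two data directories; the function name `total_avail_bytes` is hypothetical and not taken from Garage.

```rust
/// Sum the available bytes over several filesystems that share one block size.
/// `block_size` and `blocks_avail` stand for the values statvfs reports;
/// which exact statvfs fields they map to is an assumption here.
fn total_avail_bytes(block_size: u64, blocks_avail: &[u64]) -> u64 {
    blocks_avail.iter().map(|&b| block_size * b).sum()
}

fn main() {
    // Available block counts quoted in the comment above.
    let blocks = [3_284_118_120u64, 3_771_702_384u64];
    let total = total_avail_bytes(512, &blocks);
    // Prints: 3612580098048 bytes = 3.61 TB
    println!("{} bytes = {:.2} TB", total, total as f64 / 1e12);
}
```

That is about 3.6 TB in decimal units, well above the 1.8 TB mentioned earlier in the thread as the space reported for this node.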
Values reported by zfs:
This is the C code that produced the `statvfs()` output above:
If my understanding of the `fn update_disk_usage(&mut self, meta_dir: &Path, data_dir: &DataDirEnum)` implementation is correct, a hashmap is built at one point, using `statvfs::filesystem_id` as key.
`statvfs::filesystem_id` is a wrapper around the `f_fsid` member of the C `struct statvfs`, which, according to `man statvfs` on FreeBSD:
Indeed, its value is 0 for both the `garage-data-1` and `garage-data-2` datasets. Therefore, my guess is that those two datasets are coalesced into one, and `garage` retains the capacity of only one of them. However, this contradicts the previous observation, where the total capacity was over 8 TB before the last cluster layout change (I didn't keep track of individual nodes at that point in time).
I've created and tested (on FreeBSD) a fix for this (naive, as I'm not a Rust dev). I was unable to push it for a PR:
Here's the patch:
Feel free to use & refactor to suit the project coding rules.
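To illustrate the suspected failure mode (not the patch itself, and not Garage's actual code), here is a minimal sketch of how keying a hashmap by filesystem id collapses two datasets that both report an id of 0. The `FsStat` type and both function names are hypothetical; the byte values reuse the `statvfs()` figures from this thread.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-directory values obtained from statvfs.
#[derive(Clone, Copy)]
struct FsStat {
    fsid: u64,  // filesystem id (reported as 0 for both datasets here)
    avail: u64, // available bytes
}

/// Deduplicate by filesystem id, then sum. Entries that share an id
/// overwrite each other, so only the last one inserted is counted.
fn sum_by_fsid(stats: &[FsStat]) -> u64 {
    let mut by_id: HashMap<u64, u64> = HashMap::new();
    for s in stats {
        by_id.insert(s.fsid, s.avail);
    }
    by_id.values().sum()
}

/// Sum every directory independently, ignoring the filesystem id.
fn sum_by_dir(stats: &[FsStat]) -> u64 {
    stats.iter().map(|s| s.avail).sum()
}

fn main() {
    let data1 = FsStat { fsid: 0, avail: 512 * 3_284_118_120 };
    let data2 = FsStat { fsid: 0, avail: 512 * 3_771_702_384 };
    let stats = [data1, data2];
    println!("keyed by fsid: {} bytes", sum_by_fsid(&stats));
    println!("per directory: {} bytes", sum_by_dir(&stats));
}
```

Summing per directory avoids the collapse, but it would double-count two data directories that really do live on the same filesystem, which is presumably why a deduplication key is used in the first place. A real fix needs a key that stays unique per filesystem on platforms where `f_fsid` is not meaningful.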
@vk, same as on GitHub: you need to "fork" the repo under your own user and push your code to a branch of your fork. You'll then be able to open a PR.
Missed a step indeed...