High metadata disk usage #580
Context: I have a 3-node setup of Garage in an on-premise Kubernetes cluster.
I have been successfully using Garage so far for smaller OSS applications (namely, Netbox and Harbor). This instance of Garage was used for Quickwit (an open-source log indexer, an alternative to Elasticsearch).
I am using the official Helm chart for deployment (0.4.0). Garage version is v0.8.2:
After running it successfully for 1.5 months, I got the following error from Garage (reported by Quickwit during some queries):
Trying to list the objects (via Minio's `mcli`) results in the same error. Looking at Garage's logs, I got a bit more context:
Also noteworthy (though it could be a direct consequence):
Here are some Quickwit index statistics:
Garage bucket info:
I mounted the problematic volume that was full (`meta-quickwit-garage-<i>`) to inspect it, and the `db` file was the main contributor to the storage usage. I tried inspecting it with lmdb/sqlite/sled tools, but could not figure out the actual format.
All metadata volumes were close to being full:
Here is the metadata configuration (if it matters):
The number of objects is low, while Quickwit reports having inserted a lot of objects.
It seems Quickwit has a high object creation/deletion rate, since it constantly writes data (logs) and merges it regularly (a bit like an LSM tree).
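To see why high churn can inflate metadata far beyond the live object count, here is a toy model. This is purely illustrative (not Garage's actual data structures): it assumes each deletion leaves a tombstone entry in the metadata store that persists until a periodic garbage-collection pass runs.

```python
# Toy model of metadata growth under high object churn.
# NOT Garage's actual implementation: we simply assume deletes leave
# tombstone entries that accumulate until a GC pass clears them.

def simulate(puts_per_day, delete_ratio, days, gc_every_days):
    """Return (live_objects, metadata_entries) after `days` days."""
    live = 0
    tombstones = 0
    for day in range(1, days + 1):
        deletes = int(puts_per_day * delete_ratio)
        live += puts_per_day - deletes
        tombstones += deletes          # each delete leaves a tombstone
        if day % gc_every_days == 0:
            tombstones = 0             # GC drops all tombstones (simplified)
    return live, live + tombstones

# Quickwit-like churn: many PUTs per day, most objects later deleted.
# GC interval longer than the observation window, so tombstones pile up.
live, meta = simulate(puts_per_day=10_000, delete_ratio=0.95,
                      days=45, gc_every_days=60)
print(live, meta)  # → 22500 450000
```

With these (made-up) numbers the metadata store holds about 20x as many entries as there are live objects, which matches the observed pattern of a full `meta` volume despite a low object count.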
The error is straightforward: the database used for metadata grew large enough to exceed its dedicated volume.
This does look like it's linked to the problem described here:
https://garagehq.deuxfleurs.fr/documentation/design/internals/#1-garbage-collection-of-table-entries-in-meta-directory
Is this behavior expected/intended in this situation, or is there something wrong with my setup?
Is there a workaround to limit this database growth, like manually triggering a job to delete the paths of deleted documents?
Additionally, I collected metrics over the past month from the Prometheus interface, so feel free to ask for graphs if it helps.
Here is the graph for:
It looks like the number of PUT was predominant.
Hi @rudexi, it seems you are using `sled`, as your DB file has no extension. Here is the code in Garage that computes the DB path:
https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/model/garage.rs#L88
sled is known for using a lot of memory and taking a lot of disk space, and we also know it has garbage-collection issues. We plan to switch to `lmdb` by default, and maybe even to drop `sled` completely in the future.
So a first debugging step would be to convert your metadata from sled to lmdb. The steps are documented here: https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#db-engine-since-v0-8-0 - Do not forget to back up your important data first.
Could you try switching to LMDB and come back here to tell us if it solved your problem? :)
Closing for inactivity.