High memory usage #681

Closed
opened 2023-12-28 22:11:51 +00:00 by kfirfer · 12 comments

Hello

I deployed garage in a K8s cluster, and for a reason I don't understand it consumes relatively high memory:

```
$ kubectl top pods
NAME       CPU(cores)   MEMORY(bytes)
garage-0   37m          1137Mi
garage-1   33m          1074Mi
garage-2   31m          1286Mi
```

The disk usage is currently considered low: around ~8GB used for the data volumes and ~800MB for the meta volumes.
I'm using the `lmdb` db engine.
I didn't configure the `lmdb_map_size` parameter, though (kept it at the 1TB default).

I don't have many files, only a few hundred.

I guess it is LMDB; if so, how can I keep it smaller? And why does it take 800MB in the first place?
Worth noting that I didn't change any configuration about it, everything was kept at defaults.

The garage cluster is used for Thanos & Quickwit, and we have many write & read operations on those buckets, but the dataset is still relatively small.
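
In case it helps reproduce the numbers, the volume sizes above can be checked with something like the following (the mount paths and container name are placeholders for my setup, not garage defaults):

```
# On-disk size of the data and meta volumes of one node
# (adjust paths/names to your StatefulSet).
kubectl exec garage-0 -c garage -- du -sh /mnt/data /mnt/meta
```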

Author

When I restart the garage instances, the memory usage returns to normal:

```
$ kubectl top po
NAME       CPU(cores)   MEMORY(bytes)
garage-0   14m          74Mi
garage-1   11m          73Mi
garage-2   12m          72Mi
```

About ±65MB of that is the istio sidecars, so garage itself now consumes around 5-10MB.

I wonder why it was stuck at 1GB.

The LMDB db size drops a little but is still around 500-800MB (I don't really care much about the size, just for the info).

Owner

That is not memory used by the garage process itself; it is memory used by the kernel to cache pages of the LMDB data file. It's normal and expected behavior, and the reason why LMDB is so fast. These memory pages are only a cache managed by the kernel, and the kernel can easily free them when it needs memory for other processes. In your case the kernel decided to keep them in RAM, probably because it had nothing better to do with that memory.
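
If you want to convince yourself that this is reclaimable cache rather than memory held by the garage process, a rough check is to drop the clean page cache on the node hosting the pod and watch the reported usage fall. This is only a demonstration, not a fix, and it will make the next reads slower until the cache warms up again:

```
# Run as root on the node, not inside the pod.
# drop_caches=1 frees clean page cache only; garage keeps running normally.
sync
echo 1 > /proc/sys/vm/drop_caches

# Then look at the reported memory again:
kubectl top pods
```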

lx closed this issue 2023-12-29 08:25:13 +00:00
Author

Hi @lx
I've noticed an issue in our Kubernetes (K8s) clusters that doesn't occur with other applications we use.
Typically, in K8s, when applications in containers no longer require memory and release it, this is reflected in the pods, which show a return to lower memory usage. However, this isn't happening with certain applications, e.g. garage & LMDB, and a few others where memory leaks have been detected through profiling.

To provide some context, we're managing hundreds of applications across approximately 20 K8s clusters. The issue isn't universal, as many applications do successfully reclaim memory. Here's a brief overview of some applications in our clusters:

```
argocd                    Active   267d
cert-manager              Active   267d
chaos-testing             Active   224d
code-server               Active   267d
default                   Active   267d
descheduler               Active   267d
dex                       Active   267d
emby                      Active   90d
event-exporter            Active   267d
frigate                   Active   28d
garage                    Active   7d
gatekeeper-constraints    Active   266d
gatekeeper-system         Active   266d
gatekeeper-templates      Active   266d
goldilocks                Active   230d
home-assistant            Active   116d
ingress-nginx             Active   267d
istio-system              Active   267d
keel                      Active   267d
kiali                     Active   267d
kube-downscaler           Active   120d
kube-node-lease           Active   267d
kube-public               Active   267d
kube-system               Active   267d
kubernetes-dashboard      Active   267d
metallb-system            Active   267d
monitoring                Active   231d
mosquitto                 Active   103d
nextcloud                 Active   246d
node-problem-detector     Active   267d
oauth2-proxy              Active   267d
pod-cleanup               Active   211d
polaris                   Active   230d
pvc-autoresizer           Active   267d
quickwit                  Active   6d13h
rbac-manager              Active   267d
reloader                  Active   267d
replicator                Active   267d
sealed-secrets            Active   267d
skooner                   Active   267d
slackgpt                  Active   250d
teamspeak                 Active   27d
testkube                  Active   119d
topolvm-system            Active   267d
tracing                   Active   7d22h
vector                    Active   10d
velero                    Active   27d
vertical-pod-autoscaler   Active   267d
```

This issue has also been observed on bare-metal K8s clusters running Ubuntu 22.04.3 (v1.26), as well as on GKE 1.27 and EKS 1.27 and 1.28, among others.

I believe LMDB's heap may not be designed to reclaim and free memory once it's no longer needed, or maybe it's garage; I don't know which.

kfirfer reopened this issue 2023-12-30 01:35:40 +00:00
Author

This is the output of `ps aux`:

```
$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1000           1  3.2  0.6 976701232 439300 ?    Ssl  Dec29  27:59 /garage server
1000          65  0.0  0.0   4624  3596 pts/0    Ss   01:56   0:00 /bin/bash
1000         110  0.0  0.0   7060  1560 pts/0    R+   02:05   0:00 ps aux
```
Author

This is the output of `cat /proc/1/status`:

```
$ cat /proc/1/status
Name:   garage
Umask:  0022
State:  S (sleeping)
Tgid:   1
Ngid:   0
Pid:    1
PPid:   0
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 256
Groups: 1000 
NStgid: 1
NSpid:  1
NSpgid: 1
NSsid:  1
VmPeak: 976755384 kB
VmSize: 976700820 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    471456 kB
VmRSS:    439188 kB
RssAnon:           10156 kB
RssFile:          429032 kB
RssShmem:              0 kB
VmData:    97800 kB
VmStk:       132 kB
VmExe:     32672 kB
VmLib:         8 kB
VmPTE:      1100 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        36
SigQ:   0/255595
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000000004443
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Seccomp_filters:        0
Speculation_Store_Bypass:       thread vulnerable
SpeculationIndirectBranch:      conditional enabled
Cpus_allowed:   fff
Cpus_allowed_list:      0-11
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        29
nonvoluntary_ctxt_switches:     4
```

This is the `kubectl top pods` output from when I issued the commands above on pod `garage-0`:

```
$ kubectl top pods
NAME       CPU(cores)   MEMORY(bytes)
garage-0   38m          404Mi
garage-1   26m          431Mi
garage-2   19m          430Mi
```
Owner

Hello again,

There is nothing I can do because this is normal and expected behavior of the LMDB storage engine. I can assure you that this is not an issue, because even if the memory is not reclaimed immediately, it will be reclaimed as soon as it is needed for something else.

If you don't like this behavior, please consider configuring garage with another metadata storage engine. SQLite should work quite well.

lx closed this issue 2023-12-30 08:02:35 +00:00
Owner
```
RssAnon:           10156 kB
RssFile:          429032 kB
```

These lines indicate that most of the resident set size, i.e. of the memory currently mapped in garage's virtual address space, is file-backed memory. This memory can be freed by the kernel at any time; it is not garage's job to free it explicitly. The anonymous set size, which corresponds to the heap, is only 10MB, which is normal, so garage is not leaking memory.
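
To connect this with what `kubectl top` reports: to my understanding the kubelet shows the container's working set, i.e. cgroup memory usage minus inactive file cache, so recently touched page cache still counts towards that number. You can see the breakdown yourself with something like this (assumes a cgroup v2 node; the container name is a guess):

```
# "anon" is process-private memory (what an actual leak would grow);
# "file" is page cache backed by the LMDB data file, reclaimable by the kernel.
# The grep runs locally, so the container only needs cat.
kubectl exec garage-0 -c garage -- cat /sys/fs/cgroup/memory.stat \
  | grep -E '^(anon|file|inactive_file|active_file) '
```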

Author

@lx
Hi,
it seems you're right about the application allowing the kernel to handle its memory usage. However, in the Kubernetes (K8s) ecosystem, if an application does not properly release its memory and inform the underlying kernel, it disrupts Kubernetes' ability to efficiently manage memory across the entire cluster. Consequently, this affects the optimal distribution of containers across the nodes. We've faced challenges with nodes not being scheduled effectively due to the cluster's lack of awareness about the kernel's caching mechanism. This issue is particularly important in orchestration solutions like Kubernetes. I'm considering trying SQLite, although I'm uncertain if it has been as rigorously tested as LMDB.

Owner

Is there a workaround you could use in Kubernetes so that it doesn't take into account the quantity reported as "used by Garage", but instead reserves a fixed quantity of memory and makes all scheduling decisions according to that? On the Deuxfleurs infra, we use Nomad as a scheduler for workloads in a cluster and to my knowledge scheduling decisions are made like this, on the basis of fixed memory reservations (the model allows for over-commit, i.e. reserve 500M for a container, and allow it to use up to 1G).

Author

@lx
Nomad and Kubernetes function in a similar way, with both allowing for the setting of resource requests and limits. In Kubernetes, I've configured the request memory at 500MB and set the burstable limit to 2GB. Additionally, I'm utilizing the Vertical Pod Autoscaler (VPA) which dynamically adjusts these requests. It's acceptable for containers to occasionally exceed their reserved memory, but if they persistently occupy more memory without releasing it, the Kubernetes scheduler (kube-scheduler) interprets the nodes as full and stops scheduling new pods on them. To manage this, I've been restarting the so-called "garage pods" every 24 hours to free up memory, enabling the scheduler to place new pods on these nodes. However, I'm looking for a solution to avoid this daily restart process.
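
For reference, the daily workaround currently boils down to something like this, scheduled once a day (namespace and StatefulSet name are placeholders for my deployment):

```
# Rolling restart of the garage StatefulSet so the pods' reported memory
# drops back down; this is the step I'd like to get rid of.
kubectl -n garage rollout restart statefulset/garage
```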

Owner

@maximilien is our local Kubernetes expert, maybe he has an idea?

Owner

> I've configured the request memory at 500MB and set the burstable limit to 2GB. Additionally, I'm utilizing the Vertical Pod Autoscaler (VPA) which dynamically adjusts these requests. It's acceptable for containers to occasionally exceed their reserved memory, but if they persistently occupy more memory without releasing it, the Kubernetes scheduler (kube-scheduler) interprets the nodes as full and stops scheduling new pods on them.

AFAIK this is not a garage problem, and neither is it an LMDB problem. As @lx said, "memory usage" in Linux is significantly more complex than the "this app is using 1GB of memory" falsehood that you're getting out of `kubectl top`. On the topic I highly recommend multiple talks by Chris Down, like [Linux memory management at scale](https://chrisdown.name/2019/07/18/linux-memory-management-at-scale.html) or [7 years of cgroup v2: the future of Linux resource control (FOSDEM 2023)](https://www.youtube.com/watch?v=LX6fMlIYZcg). Both of them will give you a primer on the root cause of the issue you're having.

Now on the more practical side, I would suggest:

- disabling your VPA for garage and setting a fixed memory limit (request=limit), as sketched below
- patching your VPA to use another metric (like the memory PSI from the garage containers) to upsize the container when you reach a significant amount of memory pressure, akin to what [senpai](https://github.com/facebookincubator/senpai) does
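
For the first option, a minimal sketch of pinning garage to a fixed reservation with request=limit (namespace, StatefulSet/container names and the 1Gi figure are assumptions; adjust to your workload):

```
# With request == limit the scheduler always accounts a constant amount for
# garage, regardless of how much page cache the kernel decides to keep around.
kubectl -n garage set resources statefulset/garage \
  --containers=garage \
  --requests=memory=1Gi \
  --limits=memory=1Gi
```

With a hard memory limit in place, the cgroup reclaims its own file cache before hitting the limit, so LMDB's cache simply stays bounded instead of growing with whatever the node has free.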

The way LMDB operates cache-wise is that it relies on the kernel and other applications exercising memory pressure to shrink its cache. This ensures that it always makes the most of the available memory for performance. If you put the application in an environment where it is isolated from other workloads and keep giving it more memory as it leverages what it already has available, it will effectively grow indefinitely...
