web_server.rs: Added bucket domain to observability #608
I understand why this is interesting to have as a metric, but we have to be very careful with cardinality explosion in Prometheus. Adding the bucket's domain as a label on each request multiplies the metric's cardinality by n, where n is an arbitrarily large number (some Garage deployments have thousands or more of buckets enabled for static website serving). The official Prometheus documentation recommends keeping metric cardinality below 100 for best performance, and with a patch like this one we would definitely end up much higher than that.
I'd be interested in having the opinion of a monitoring expert like @maximilien, but basically I think this patch should not be integrated.
Reading on the subject: https://stackoverflow.com/questions/46373442/how-dangerous-are-high-cardinality-labels-in-prometheus
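To make the multiplication concrete, here is a standalone sketch of the effect; every number in it (five methods, five status codes, 2000 website-enabled buckets) is an assumption picked for the example, not a measurement from any real deployment. A Prometheus time series exists for every distinct label combination, so adding a domain label turns each existing series into one series per domain.

```rust
// Standalone illustration of label-cardinality multiplication (not Garage code).
// All counts below are assumptions chosen for the example.
use std::collections::HashSet;

fn main() {
    let methods = ["GET", "HEAD", "PUT", "DELETE", "OPTIONS"]; // ~5 label values (assumed)
    let statuses = ["200", "304", "403", "404", "500"];        // ~5 label values (assumed)
    // Assumed: a deployment with 2000 buckets enabled for static website serving.
    let domains: Vec<String> = (0..2000).map(|i| format!("site-{i}.example.com")).collect();

    let mut without_domain: HashSet<(String, String)> = HashSet::new();
    let mut with_domain: HashSet<(String, String, String)> = HashSet::new();

    for m in &methods {
        for s in &statuses {
            without_domain.insert((m.to_string(), s.to_string()));
            for d in &domains {
                with_domain.insert((m.to_string(), s.to_string(), d.clone()));
            }
        }
    }

    // 5 * 5 = 25 series today; 25 * 2000 = 50000 series once `domain` is a label.
    println!("series without domain label: {}", without_domain.len());
    println!("series with domain label:    {}", with_domain.len());
}
```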
Even with my small deployment, Garage already exposes metrics that are much heavier cardinality-wise than this would be - and the RPC metric is in the top 3 in my monitoring, behind two heavier gRPC ones from thanos.io itself.
And with bucket domains in the low tens, the Grafana web requests panel with this is far more interesting than "my static site hosting is doing something, but I have zero idea which domains are actually popular with users" - on top of which there is currently no way to correlate those requests with other observability signals such as webserver logs.
https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels - the documentation with the "less than 100" recommendation also gives examples of exporters such as node_exporter which not only go hundreds higher in cardinality, but run on Prometheus deployments monitoring tens or hundreds of thousands of physical nodes... I can attest from experience that, as the docs say, this "is fine for Prometheus to handle".

Caddy also wants to add this level of granularity to their metrics: https://github.com/caddyserver/caddy/issues/3784 - and they simply haven't been able to because of how their rule evaluation is done. Knowing how many requests hit a frontend before they reach a backend is very important (especially once you add a cache layer and discover a fun bug in the process).
Hello, SRE by trade here
TLDR: I think adding per-bucket metrics is fine; we can compromise by adding an option to enable/disable that behavior.

The need for a bucket label is real, as it is very challenging to understand which bucket requests are targeting without it - outside of the clunky solution of relying on log parsing. There are multiple layers to what can be considered acceptable as far as cardinality is concerned, and it highly depends on how much memory and storage you are comfortable giving Prometheus (or your local TSDB backend flavor).

To the argument that the RPC metrics have a higher cardinality: that cardinality is tied to the number of nodes and the number of RPC call types, two values that are relatively bounded. The number of buckets, on the other hand, is purely controlled by the user and is not bounded, hence @lx's reluctance to add it as a label. That said, it is a stable value (assuming nobody is creating ephemeral buckets, which could very much be the case in some deployments), so this should have a very limited impact in most deployments. The best approach is likely to add it as an option, so people can turn it on if they wish and we don't impact larger deployments unnecessarily.
So can we make this behavior optional? It can be enabled by default, but we need a way to disable it if it ever causes issues on large deployments.
@jpds are you willing to rebase and implement a configuration parameter to make this optional? Otherwise I can do it.
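For reference, a minimal sketch of what the opt-in could look like - not the actual patch. The configuration field name (add_host_to_metrics) and the metric name are hypothetical, and the calls assume the opentelemetry crate Garage already uses for metrics (exact signatures vary between crate versions).

```rust
// Minimal sketch of an opt-in per-domain label (not the actual patch).
// Assumptions: `add_host_to_metrics` and the metric name are hypothetical;
// the calls target the opentelemetry crate Garage already uses, and exact
// signatures may differ between crate versions.
use opentelemetry::{global, metrics::Counter, KeyValue};

/// Hypothetical web endpoint configuration carrying the opt-in flag.
pub struct WebConfig {
    /// When false (e.g. on deployments with huge numbers of buckets),
    /// the `domain` label is omitted and cardinality stays bounded.
    pub add_host_to_metrics: bool,
}

pub struct WebMetrics {
    request_counter: Counter<u64>,
}

impl WebMetrics {
    pub fn new() -> Self {
        let meter = global::meter("garage_web");
        Self {
            request_counter: meter.u64_counter("web.request_counter").init(),
        }
    }

    /// Record one handled request; the bucket's domain is attached as a label
    /// only when the operator enabled it in the configuration.
    pub fn record_request(&self, config: &WebConfig, method: &str, status: u16, domain: &str) {
        let mut attrs = vec![
            KeyValue::new("method", method.to_string()),
            KeyValue::new("status_code", status.to_string()),
        ];
        if config.add_host_to_metrics {
            attrs.push(KeyValue::new("domain", domain.to_string()));
        }
        self.request_counter.add(1, &attrs);
    }
}
```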