OpenTelemetry part

2022-04-06 16:23:13 +02:00 · 3eb41b73e4
commit 3eb41b73e4
parent 6df348b2a3
4 changed files with 65 additions and 4 deletions
--- a/content/blog/2022-v0.7-released.md
+++ b/content/blog/2022-v0.7-released.md
@@ -13,9 +13,9 @@ Two months ago, we were impressed by the success of our open beta launch at FOSD
 Since this event, we continued to improve Garage, and - 2 months after the initial release - we are happy to announce version 0.7.0.

 But first, we would like to thank the contributors that made this new release possible: Alex, Jill, Max Audron, Maximilien, Quentin, Rune Henrisken, Steam, and trinity-1686a.
-This is also our first time welcoming contributors external to the core team, and as we wish for Garage to be a community-driven project, we encourage it.
+This is also our first time welcoming contributors external to the core team, and as we wish for Garage to be a community-driven project, we encourage it!

-As a noverlty as well, you can get this release using our binaries or the package provided by your distribution.
+You can get this release using our binaries or the package provided by your distribution.
 We ship [statically compiled binaries](https://garagehq.deuxfleurs.fr/download/) for most Linux architectures (amd64, i386, aarch64 and armv6) and associated [Docker containers](https://hub.docker.com/u/dxflrs).
 Garage now is also packaged by third parties on some OS/distributions. We are currently aware of [FreeBSD](https://cgit.freebsd.org/ports/tree/www/garage/Makefile) and [AUR for Arch Linux](https://aur.archlinux.org/packages/garage).
 Feel free to [reach us](mailto:garagehq@deuxfleurs.fr) if you are packaging (or planning to package) Garage, we welcome maintainers and will upstream specific patches if that can help. If you already did package garage, tell us and we'll add it to the documentation.
@@ -31,9 +31,9 @@ In this new version, Garage integrates a method to discover other peers by using
 Garage can self-apply the [Custom Resource Definition](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) (CRD) to your cluster, or you can manage it manually.

 Let's see practically how it works with a minimalistic example (not secured nor suitable for production). 
-You can run it on [minikube](https://minikube.sigs.k8s.io) if you a more interactive reading.
+You can run it on [minikube](https://minikube.sigs.k8s.io) if you want a more interactive reading.

-Start by creating a [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) containg Garage's configuration (let's name it `config.yaml`):
+Start by creating a [ConfigMap](https://kubernetes.io/docs/concepts/configuration/configmap/) containing Garage's configuration (let's name it `config.yaml`):

 ```yaml
 apiVersion: v1
@@ -197,6 +197,67 @@ We have not documented it yet but you can get a look at [our Nomad service](http

 ## OpenTelemetry support

+[OpenTelemetry](https://opentelemetry.io/) standardizes how software generates and collects telemetry data, namely metrics, logs and traces.
+By implementing this standard in Garage, we hope to help you better monitor, manage and tune your cluster.
+Note that to fully leverage this feature, you should already be familiar with monitoring stacks like [Prometheus](https://prometheus.io/)+[Grafana](https://grafana.com/) or [Elasticsearch](https://www.elastic.co/elasticsearch/)+[Kibana](https://www.elastic.co/kibana/).
+
+To activate OpenTelemetry on Garage, add the following entries to your configuration file (assuming that your collector also runs on localhost):
+
+```toml
+[admin]
+api_bind_addr = "127.0.0.1:3903"
+trace_sink = "http://localhost:4317"
+```
+
+We provide [some files](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/script/telemetry) to help you quickly bootstrap a testing monitoring stack.
+They include a docker-compose file and a pre-configured Grafana dashboard.
+You can use them if you want to reproduce the following examples.
+
+Now that your telemetry data is collected and stored, you can visualize it.
+
+Grafana is particularly well suited to understanding how your cluster is performing from a bird's-eye view.
+For example, the following graph shows the S3 API calls sent to your node per time unit.
+You can use it to better understand how your users are interacting with your cluster.
+
+![A screenshot of a plot made by Grafana depicting the number of requests per time unit, grouped by endpoint](/images/blog/api_rate.png)
+
+Thanks to this graph, we know that a large upload started at 14:55.
+This upload is made of many small files, as we see many PutObject calls, which are typically used for small files.
+It also contains some large objects, as we observe some multipart upload requests.
+Conversely, no reads are performed during this period, as the corresponding read endpoints (ListBuckets, ListObjectsV2, etc.) receive 0 requests per time unit.
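
As a rough illustration of how a "requests per time unit" graph can be derived, the sketch below converts cumulative request counters into per-second rates, in the spirit of what a dashboard query does. The sample values, the dictionary layout and the function name `per_second_rate` are hypothetical; they are not taken from Garage's actual metrics.

```python
from typing import Dict, List, Tuple

# Hypothetical samples: (timestamp in seconds, cumulative request count).
Sample = Tuple[float, int]


def per_second_rate(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn cumulative counter samples into (timestamp, requests/second) points."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        # A counter that decreases was reset (e.g. a restart), so count from zero.
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, delta / elapsed))
    return rates


# Made-up data: a busy write endpoint and an idle read endpoint.
counters: Dict[str, List[Sample]] = {
    "PutObject": [(0.0, 0), (10.0, 50), (20.0, 250)],
    "ListBuckets": [(0.0, 3), (10.0, 3), (20.0, 3)],
}

for endpoint, samples in counters.items():
    print(endpoint, per_second_rate(samples))
```

With this made-up data, the write endpoint shows a growing rate while the read endpoint stays flat at zero, which is the same kind of pattern described for the graph above.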
+
+
+Garage also collects metrics from lower-level parts of the system.
+You can use them to better understand how Garage is interacting with your OS and your hardware.
+
+![A screenshot of a plot made by Grafana depicting the write speed (in MB/s) over time.](/images/blog/writes.png)
+
+This plot was captured at the same time as the previous one.
+We do not see a correlation between the writes and the API requests over the whole upload, only at its beginning.
+However, it maps well to the multipart upload requests: this is expected because small files will be throttled by other parts of the system, while large files will be able to saturate the write bandwidth of your disk.
+
+This simple example, done on a test cluster, covers only 2 of the 20+ metrics that we have already defined, yet we were still able to precisely describe our cluster usage and identify where bottlenecks could be.
+We are confident that cleverly using these metrics on a production cluster will give you many more valuable insights into your cluster.
+
+While metrics are good for getting a broad, general overview of your system, they are not suited to pinpointing a specific performance problem on a specific code path.
+Thankfully, we also have a solution for this problem: traces.
+
+Using [Application Performance Monitoring](https://www.elastic.co/observability/application-performance-monitoring) (APM) in conjunction with Kibana,
+we get the following visualization:
+
+![A screenshot of APM depicting the trace of a PutObject call](/images/blog/apm.png)
+
+At the top of the screenshot, we see the latency distribution of all PutObject requests.
+We learn that the selected request took ~1ms to execute, while 95% of all requests took less than 80ms to run.
+Some dispersion between requests is expected, as Garage does not run on a hard real-time system, but in this case, you must also consider that
+a request's duration depends on the size of the object that is sent (a 10B object will be quicker to process than a 10MB one).
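
To make the "95% of all requests took less than 80ms" reading concrete, here is a minimal sketch of a nearest-rank percentile computed over a list of request durations. The latency values and the `percentile` helper are hypothetical and chosen for illustration only; APM tools may use a different estimation method.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of values <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up PutObject durations in milliseconds, with one slow outlier.
latencies_ms = [1, 2, 2, 3, 3, 4, 5, 5, 6, 8, 9, 10, 12, 15, 20, 25, 30, 40, 80, 250]

print(percentile(latencies_ms, 95))  # 95th percentile
print(percentile(latencies_ms, 50))  # median
```

In this made-up sample, the 95th percentile is far above the median, which is the kind of dispersion that a mix of small and large objects would produce.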
+
+Below, you can select the request you want to inspect, and then see its stack trace.
+You can break down these lines into 4 parts: fetching the API key to check authentication (`key get`), fetching the bucket identifier from its name (`bucket_alias get`), fetching the bucket configuration to check authorizations (`bucket_v2 get`), and finally inserting the object into the storage (`object insert`).
+
+With this example, we demonstrated that we can inspect Garage internals to find slow requests, see which code path a request took, and finally identify which part of the code took time.
+
+
 ## And next?

 roadmap: k2v, allocation simulator, s3 compatibility, community feedback, whitepaper
--- a/static/images/blog/api_rate.png
+++ b/static/images/blog/api_rate.png
--- a/static/images/blog/apm.png
+++ b/static/images/blog/apm.png
--- a/static/images/blog/writes.png
+++ b/static/images/blog/writes.png