WIP roadmap

This commit is contained in:
Quentin 2022-04-06 19:11:11 +02:00
parent 0f7861ec98
commit b96dfce719

View file

@ -234,13 +234,14 @@ You can use them to better understand how Garage is interacting with your OS and
This plot has been captured at the same moment than the previous one.
We do not see a correlation between the writes and the API requests for the full upload but only for its beginning.
However, it maps well to Multipart Uploads requests: this is expected because small files will be throttled by other parts of the system, while large files will be able to saturate the writes of your disk.
More precisely, it maps well to Multipart Uploads requests, and this is expected.
Large files (of the Multipart Uploads) will saturate the writes of your disk but the uploading of small files (via the PutObject endpoint) will be throttled by other parts of the system.
This simple example, done on a test cluster, covers only 2 metrics over the 20+ ones that we already defined but we were still able to precisely describe our cluster usage and identifies where bottlenecks could be.
This simple example covers only 2 metrics over the 20+ ones that we already defined, but we were still able to precisely describe our cluster usage and identifies where bottlenecks could be.
We are confident that cleverly using these metrics on a production cluster will give you many more valuable insights on your cluster.
While metrics are good to have a large, general overview of your system, they are however not adapted to dig and pinpoint a specific performance problem on a specific code path.
Thankfully, we also have a solution for this problem: traces.
While metrics are good to have a large, general overview of your system, they are however not adapted to dig and pinpoint a specific performance issue on a specific code path.
Thankfully, we also have a solution for this problem: tracing.
Using [Application Performance Monitoring](https://www.elastic.co/observability/application-performance-monitoring) (APM) in conjunction with Kibana,
we get the following visualization:
@ -251,14 +252,29 @@ On the top of the screenshot, we see the latency distribution of all PutObject r
We learn that the selected request took ~1ms to execute, while 95% of all requests took less than 80ms to run.
Having some dispersion between requests is expected as Garage does not run on a strong real-time system, but in this case, you must also consider that
a request duration is impacted by the size of the object that is sent (a 10B object will be quicker to process than a 10MB one).
Consequently, this request corresponds probably to a very tiny file.
Below, you can select the request you want to inspect, and then see its stacktrace.
You can break down these lines in 4 parts: fetching the API key to check authentication (`key get`), fetching the bucket identifier from its name (`bucket_alias get`), fetching the bucket configuration to check authorizations (`bucket_v2 get`), and finally inserting the object in the storage (`object insert`).
Below this first histogram, you can select the request you want to inspect, and then see its stacktrace on the bottom part.
You can break down this trace in 4 parts: fetching the API key to check authentication (`key get`), fetching the bucket identifier from its name (`bucket_alias get`), fetching the bucket configuration to check authorizations (`bucket_v2 get`), and finally inserting the object in the storage (`object insert`).
With this example, we demonstrated that we can inspect Garage internals to find slow requests, then see which codepath has been taken by a request, to finally identify which part of the code took time.
Keep in mind that this is our first iteration on telemetry for Garage, so things are a bit rough around the edges (step by step documentation is missing, our Grafana dashboard is a work in a progress, etc.).
In all cases, your feedback is welcome on our Matrix channel.
## And next?
roadmap: k2v, allocation simulator, s3 compatibility, community feedback, whitepaper
While we hope that Garage in its current state inspired you, we also understand that you may be curious about what will come next!
Currently, our goal is to reach v1.0, for which we want to work on these three desirable properties: *Feature complete*, *Understandability and manageability*, and *Correctness*.
**Feature complete**. We have already implemented a selected subset of S3 endpoints that works quite well, but we want to work on the corner cases that are not yet solved (eg. [#263](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues), [#248](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/248), [#204](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/204). Based on community feedbacks, we might consider implementing additional endpoints (eg. [#166](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/166) but we can make no promise (sorry!). Finally, we made a serie of observation: 1) the S3 API has a limited semantic, for example it is not adapted for append-only log data structures, 2) many projects require a database additionaly to the object store 3) we already implemented a key value store internally to handle S3 metadata. It leads us to the conclusion that we study the feasibility of providing a simple and totally optional key value interface that we refer as K2V. We are currently writing [an API draft](https://p.adnab.me/code/#/2/code/view/eUNPbfoUrMbCY+CoMXaqed4jmWlmvWALHNDcfuM-O5o/embed/present/) and will try to implement it in the following months. We would like it to be as close as possible as the original [Amazon Dynamo paper](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf), or if you are more familiar with Cassandra, as the most possible minimalistic Cassandra.
**Understandability and manageability**. We want a system that is understood and manageable by the largest possible amount of operators. To achieve such a goal, we can follow 2 paths: sharing knowledge and making better tools. We want to explore both approaches, and we identified specific subjects on which to work: 1) Garage's consistency model of the S3 API and the admin API, 2) Explaining how Garage can take its place in the existing ecosystem, including among the other distributed storage systems, but also in term of uses cases and business
- Well understood, well explained
- Consistency Model
- Deployment Cas Typique
- web interface, rest admin
- Storage density vs reliability, deployment simulator
- Fast, tested and correct
-