Merge pull request 'New article: Bringing theoretical design and observed performances face to face' (#12) from perf into master

Reviewed-on: Deuxfleurs/garagehq.deuxfleurs.fr#12
2022-09-29 13:16:03 +02:00 · 2022-09-29 13:16:03 +02:00 · d549b125dc
commit d549b125dc
parent e95289c483 4f57f6c742
10 changed files with 514 additions and 1 deletions
--- a/content/blog/2022-perf/1million-both.png
+++ b/content/blog/2022-perf/1million-both.png
--- a/content/blog/2022-perf/1million.png
+++ b/content/blog/2022-perf/1million.png
--- a/content/blog/2022-perf/amplification.png
+++ b/content/blog/2022-perf/amplification.png
--- a/content/blog/2022-perf/complexity.png
+++ b/content/blog/2022-perf/complexity.png
--- a/content/blog/2022-perf/db_engine.png
+++ b/content/blog/2022-perf/db_engine.png
--- a/content/blog/2022-perf/index.md
+++ b/content/blog/2022-perf/index.md
@ -0,0 +1,513 @@
+++
+title="Confronting theoretical design with observed performances"
+date=2022-09-26
+++
+
+
+*During the past years, we have thought a lot about possible design decisions and
+their theoretical trade-offs for Garage. In particular, we pondered the impacts
+of data structures, networking methods, and scheduling algorithms.
+Garage worked well enough for our production
+cluster at Deuxfleurs, but we also knew that people started to experience some
+unexpected behaviors, which motivated us to start a round of benchmarks and performance
+measurements to see how Garage behaves compared to our expectations.
+This post presents some of our first results, which cover
+3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
+reflecting the high-level properties we are seeking.*
+
+<!-- more -->
+
+---
+
+## ⚠️ Disclaimer
+
+The results presented in this blog post must be taken with a (critical) grain of salt due to some
+limitations that are inherent to any benchmarking endeavor. We try to reference them as
+exhaustively as possible here, but other limitations might exist.
+
+Most of our tests were made on _simulated_ networks, which by definition cannot represent all the
+diversity of _real_ networks (dynamic drop, jitter, latency, all of which could be
+correlated with throughput or any other external event). We also limited
+ourselves to very small workloads that are not representative of a production
+cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
+our results are not an evaluation of the performance of Garage as a whole.
+
+For some benchmarks, we used Minio as a reference. It must be noted that we did
+not try to optimize its configuration as we have done for Garage, and more
+generally, we have significantly less knowledge of Minio's internals compared to Garage, which could lead
+to underrated performance measurements for Minio. It must also be noted that
+Garage and Minio are systems with different feature sets. For instance, Minio supports
+erasure coding for higher data density and Garage doesn't, Minio implements
+way more S3 endpoints than Garage, etc. Such features necessarily have a cost
+that you must keep in mind when reading the plots we will present. You should consider
+Minio's results as a way to contextualize Garage's numbers, to justify that our improvements
+are not simply artificial in the light of existing object storage implementations.
+
+The impact of the testing environment is also not evaluated (kernel patches,
+configuration, parameters, filesystem, hardware configuration, etc.). Some of
+these parameters could favor one configuration or software product over another.
+Especially, it must be noted that most of the tests were done on a
+consumer-grade PC with only a SSD, which is different from most
+production setups. Finally, our results are also provided without statistical
+tests to validate their significance, and might have insufficient ground
+to be claimed as reliable.
+
+When reading this post, please keep in mind that **we are not making any
+business or technical recommendations here, and this is not a scientific paper
+either**; we only share bits of our development process as honestly as
+possible.
+Make your own tests if you need to take a decision,
+remember to read [benchmarking crimes](https://gernot-heiser.org/benchmarking-crimes.html)
+and to remain supportive and caring with your peers ;)
+
+## About our testing environment
+
+We made a first batch of tests on
+[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
+testbed for experiment-driven research in all areas of computer science,
+which has an
+[open access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
+During our tests, we used part of the following clusters:
+[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
+[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
+[econome](https://www.grid5000.fr/w/Nantes:Hardware#econome), to make a
+geo-distributed topology. We used the Grid5000 testbed only during our
+preliminary tests to identify issues when running Garage on many powerful
+servers. We then  reproduced these issues in a controlled environment
+outside of Grid5000, so don't be
+surprised then if Grid5000 is not always mentioned on our plots.
+
+To reproduce some environments locally, we have a small set of Python scripts
+called [`mknet`](https://git.deuxfleurs.fr/Deuxfleurs/mknet) tailored to our
+needs[^ref1]. Most of the following tests were run locally with `mknet` on a
+single computer: a Dell Inspiron 27" 7775 AIO, with a Ryzen 5 1400, 16GB of
+RAM and a 512GB SSD. In terms of software, NixOS 22.05 with the 5.15.50 kernel is
+used with an ext4 encrypted filesystem. The `vm.dirty_background_ratio` and
+`vm.dirty_ratio` have been reduced to `2` and `1` respectively: with default
+values, the system tends to freeze under heavy I/O load.
+
+## Efficient I/O
+
+The main purpose of an object storage system is to store and retrieve objects
+across the network, and the faster these two functions can be accomplished,
+the more efficient the system as a whole will be. For this analysis, we focus on
+2 aspects of performance. First, since many applications can start processing a file
+before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
+on `GetObject` requests, i.e. the duration between the moment a request is sent
+and the moment where the first bytes of the returned object are received by the client.
+Second, we will evaluate generic throughput, to understand how well
+Garage can leverage the underlying machine's performance.
+
+**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
+web endpoints, with the idea to make it a platform of choice to publish
+static websites. When publishing a website, TTFB can be directly observed
+by the end user, as it will impact the perceived reactivity of the page being loaded.
+
+Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
+This can be explained by the fact that Garage was not able to handle data internally
+at a smaller granularity level than entire data blocks, which are up to 1MB chunks of a given object
+(a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
+Let us take the example of a 4.5MB object, which Garage will split by default into four 1MB blocks and one 0.5MB block.
+With the old design, when you were sending a `GET`
+request, the first block had to be _fully_ retrieved by the gateway node from the
+storage node before it starts to send any data to the client.
+
+With Garage v0.8, we added a data streaming logic that allows the gateway
+to send the beginning of a block without having to wait for the full block to be received from
+the storage node. We can visually represent the difference as follow:
+
+<center>
+<img src="schema-streaming.png" alt="A schema depicting how streaming improves the delivery of a block" />
+</center>
+
+As our default block size is only 1MB, the difference should be marginal on
+fast networks: it takes only 8ms to transfer 1MB on a 1Gbps network,
+adding at most 8ms of latency to a `GetObject` request (assuming no other
+data transfer is happening in parallel). However,
+on a very slow network, or a very congested link with many parallel requests
+handled, the impact can be much more important: on a 5Mbps network, it takes at least 1.6 seconds
+to transfer our 1MB block, and streaming will heavily improve user experience.
+
+We wanted to see if this theory holds in practice: we simulated a low latency
+but slow network using `mknet` and did some requests with block streaming (Garage v0.8 beta) and
+without (Garage v0.7.3). We also added Minio as a reference. To
+benchmark this behavior, we wrote a small test named
+[s3ttfb](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3ttfb),
+whose results are shown on the following figure:
+
+![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
+
+Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
+and 2s, which matches the time required to transfer the full block which we calculated above.
+On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
+streaming feature (the lowest value is 43ms). Minio sits between the two
+Garage versions: we suppose that it does some form of batching, but smaller
+than our initial 1MB default.
+
+**Throughput** - As soon as we publicly released Garage, people started
+benchmarking it, comparing its performances to writing directly on the
+filesystem, and observed that Garage was slower (eg.
+[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
+situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread,
+and many others
+([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
+[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
+version 0.8 "Beta 1". We also noticed that some of the logic we wrote
+to better control resource usage
+and detect errors, including semaphores and timeouts, was artificially limiting
+performances. In another iteration, we made this logic less restrictive at the
+cost of higher resource consumption under load
+([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
+version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
+write a block. We know that this is expensive and did a test build without any
+`fsync` call ([see the
+commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
+that will not be merged, only to assess the impact of `fsync`. We refer to it
+as `no-fsync` in the following plot.
+
+*A note about `fsync`: for performance reasons, operating systems often do not
+write directly to the disk when a process creates or updates a file in your
+filesystem. Instead, the write is kept in memory, and flushed later in a batch
+with other writes. If a power loss occurs before the OS has time to flush
+data to disk, some writes will be lost. To ensure that a write is effectively
+written to disk, the
+[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
+which effectively blocks until the file or directory on which it is called has been flushed from volatile
+memory to the persistent storage device. Additionally, the exact semantic of
+`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
+and, even on battle-tested software like Postgres, it was
+["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
+Note that on Garage, we are still working on our `fsync` policy and thus, for
+now, you should expect limited data durability in case of power loss, as we are
+aware of some inconsistencies on this point (which we describe in the following
+and plan to solve).*
+
+To assess performance improvements, we used the benchmark tool
+[minio/warp](https://github.com/minio/warp) in a non-standard configuration,
+adapted for small-scale tests, and we kept only the aggregated result named
+"cluster total". The goal of this experiment is to get an idea of the cluster
+performance with a standardized and mixed workload.
+
+![Plot showing IO performances of Garage configurations and Minio](io.png)
+
+Minio, our reference point, gives us the best performances in this test.
+Looking at Garage, we observe that each improvement we made had a visible
+impact on performances. We also note that we have a progress margin in
+terms of performances compared to Minio: additional benchmarks, tests, and
+monitoring could help us better understand the remaining gap.
+
+
+## A myriad of objects
+
+Object storage systems do not handle a single object but huge numbers of them:
+Amazon claims to handle trillions of objects on their platform, and Red Hat
+tout Ceph as being able to handle 10 billion objects. All these
+objects must be tracked efficiently in the system to be fetched, listed,
+removed, etc. In Garage, we use a "metadata engine" component to track them.
+For this analysis, we compare different metadata engines in Garage and see how
+well the best one scales to a million objects.
+
+**Testing metadata engines** - With Garage, we chose not to store metadata
+directly on the filesystem, like Minio for example, but in a specialized on-disk
+B-Tree data structure; in other words, in an embedded database engine. Until now,
+the only supported option was [sled](https://sled.rs/), but we started having
+serious issues with it - and we were not alone
+([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
+v0.8, we introduce an abstraction semantic over the features we expect from our
+database, allowing us to switch from one metadata back-end to another without touching
+the rest of our codebase. We added two additional back-ends: LMDB
+(through [heed](https://github.com/meilisearch/heed)) and SQLite
+(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
+are both experimental: contrarily to sled, we have yet to run them in production
+for a significant time.**
+
+Similarly to the impact of `fsync` on block writing, each database engine we use
+has its own `fsync` policy. Sled flushes its writes every 2 seconds by
+default (this is
+[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
+LMDB default to an `fsync` on each write, which on early tests led to
+abysmal performance. We thus added 2 flags,
+[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
+and
+[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
+to deactivate `fsync` entirely. On SQLite, it is also possible to deactivate `fsync` with
+`pragma synchronous = off`, but we have not started any optimization work on it yet:
+our SQLite implementation currently still calls `fsync` for all write operations. Additionally, we are
+using these engines through Rust bindings that do not support async Rust,
+with which Garage is built, which has an impact on performance as well.
+**Our comparison will therefore not reflect the raw performances of
+these database engines, but instead, our integration choices.**
+
+Still, we think it makes sense to evaluate our implementations in their current
+state in Garage. We designed a benchmark that is intensive on the metadata part
+of the software, i.e. handling large numbers of tiny files. We chose again
+`minio/warp` as a benchmark tool, but we
+configured it with the smallest possible object size it supported, 256
+bytes, to put pressure on the metadata engine. We evaluated sled twice:
+with its default configuration, and with a configuration where we set a flush
+interval of 10 minutes (longer than the test) to disable `fsync`.
+
+*Note that S3 has not been designed for workloads that store huge numbers of small objects;
+a regular database, like Cassandra, would be more appropriate. This test has
+only been designed to stress our metadata engine and is not indicative of
+real-world performances.*
+
+![Plot of our metadata engines comparison with Warp](db_engine.png)
+
+Unsurprisingly, we observe abysmal performances with SQLite, as it is the engine we did not put work on yet,
+and that still does an `fsync` for each write. Garage with the `fsync`-disabled LMDB backend performs twice better than
+with sled in its default version and 60% better than the "no `fsync`" sled version in our
+benchmark.  Furthermore, and not depicted on these plots, LMDB uses way less
+disk storage and RAM; we would like to quantify that in the future.  As we are
+only at the very beginning of our work on metadata engines, it is hard to draw
+strong conclusions.  Still, we can say that SQLite is not ready for production
+workloads, and that LMDB looks very promising both in terms of performances and resource
+usage, and is a very good candidate for being Garage's default metadata engine in
+future releases, once we figure out the proper `fsync` tuning. In the future, we will need to define a data policy for Garage to help us
+arbitrate between performance and durability.
+
+*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
+need to better assess the impact of possibly losing a write after it has been validated.
+Because Garage is a distributed system, even if a node loses its write due to a
+power loss, it will fetch it back from the 2 other nodes that store it. But rare
+situations can occur where 1 node is down and the 2 others validate the write and then
+lose power before having time to flush to disk. What is our policy in this case? For storage durability,
+we are already supposing that we never lose the storage of more than 2 nodes,
+so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
+time? What should we do about people hosting all of their nodes at the same
+place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
+some compromises on this side
+([#3536](https://github.com/minio/minio/issues/3536),
+[HN Discussion](https://news.ycombinator.com/item?id=28135533)). Now, they seem to
+use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
+only data and not metadata is persisted on disk - in combination with
+`O_DIRECT` for direct I/O
+([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
+[example in Minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
+
+**Storing a million objects** - Object storage systems are designed not only
+for data durability and availability but also for scalability, so naturally,
+some people asked us how scalable Garage is. If giving a definitive answer to this
+question is out of the scope of this study, we wanted to be sure that our
+metadata engine would be able to scale to a million objects. To put this
+target in context, it remains small compared to other industrial solutions:
+Ceph claims to scale up to [10 billion objects](https://www.redhat.com/en/resources/data-solutions-overview),
+which is 4 orders of magnitude more than our current target. Of course, their
+benchmarking setup has nothing in common with ours, and their tests are way
+more exhaustive.
+
+We wrote our own benchmarking tool for this test,
+[s3billion](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3billion)[^ref2].
+The benchmark procedure consists in 
+concurrently sending a defined number of tiny objects (8192 objects of 16
+bytes by default) and measuring the wall clock time to the last object upload. This step is then repeated a given
+number of times (128 by default) to effectively create a target number of
+objects on the cluster (1M by default). On our local setup with 3
+nodes, both Minio and Garage with LMDB were able to achieve this target. In the
+following plot, we show how much time it took Garage and Minio to handle
+each batch.
+
+Before looking at the plot, **you must keep in mind some important points regarding
+the internals of both Minio and Garage**.
+
+Minio has no metadata engine, it stores its objects directly on the filesystem.
+Sending 1 million objects on Minio results in creating one million inodes on
+the storage server in our current setup. So the performances of the filesystem
+probably have a substantial impact on the observed results.
+In our precise setup, we know that the
+filesystem we used is not adapted at all for Minio (encryption layer, fixed
+number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
+`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
+creation of objects. Finally, object storage is designed for big objects, for which the
+costs measured here are negligible. In the end, again, we use Minio only as a
+reference point to understand what performance budget we have for each part of our
+software.
+
+Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
+not created on the filesystem but the object is directly stored inline in the
+metadata engine.  In the future, we plan to evaluate how Garage behaves at scale with
+objects above 3KB, which we expect to be way closer to Minio, as it will have to create
+at least one inode per object. For now, we limit ourselves to evaluating our
+metadata engine and focus only on 16-byte objects.
+
+![Showing the time to send 128 batches of 8192 objects for Minio and Garage](1million-both.png)
+
+It appears that the performances of our metadata engine are acceptable, as we
+have a comfortable margin compared to Minio (Minio is between 3x and 4x times
+slower per batch).  We also note that, past the 200k objects mark, Minio's
+time to complete a batch of inserts is constant, while on Garage it still increases on the observed range.
+It could be interesting to know if Garage's batch completion time would cross Minio's one
+for a very large number of objects.  If we reason per object, both Minio's and
+Garage's performances remain very good: it takes respectively around 20ms and
+5ms to create an object.  In a real-world scenario, at 100 Mbps, the upload of a 10MB file takes
+800ms, and goes up to 8sec for a 100MB file: in both cases
+handling the object metadata would be only a fraction of the upload time.  The
+only cases where a difference would be noticeable would be when uploading a lot of very
+small files at once, which again would be an unusual usage of the S3 API.
+
+Let us now focus on Garage's metrics only to better see its specific behavior:
+
+![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
+
+Two effects are now more visible: 1., batch completion time increases with the
+number of objects in the bucket and 2., measurements are scattered, at least
+more than for Minio. We expected this batch completion time increase to be logarithmic,
+but we don't have enough data points to conclude confidently it is the case: additional
+measurements are needed. Concerning the observed instability, it could
+be a symptom of what we saw with some other experiments on this setup,
+which sometimes freezes under heavy I/O load.  Such freezes could lead to
+request timeouts and failures. If this occurs on our testing computer, it might
+occur on other servers as well: it would be interesting to better understand this
+issue, document how to avoid it, and potentially change how we handle I/O
+internally in Garage.  But still, this was a very heavy test that will probably not be encountered in
+many setups: we were adding 273 objects per second for 30 minutes straight!
+
+To conclude this part, Garage can ingest 1 million tiny objects while remaining
+usable on our local setup. To put this result in perspective, our production
+cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
+116k objects. This bucket contains real-world production data: it is used by our Matrix instance
+to store people's media files (profile pictures, shared pictures, videos,
+audio files, documents...). Thanks to this benchmark, we have identified two points
+of vigilance: the increase of batch insert time with the number of existing
+objects in the cluster in the observed range, and the volatility in our measured data that
+could be a symptom of our system freezing under the load. Despite these two
+points, we are confident that Garage could scale way above 1M objects, although
+that remains to be proven.
+
+## In an unpredictable world, stay resilient
+
+Supporting a variety of real-world networks and computers, especially ones that
+were not designed for software-defined storage or even for server purposes, is the
+core value proposition of Garage.  For example, our production cluster is
+hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
+behind consumer-grade fiber links across France and Belgium (if you are reading this,
+congratulation, you fetched this webpage from it!). That's why we are very
+careful that our internal protocol (referred to as "RPC protocol" in our documentation)
+remains as lightweight as possible. For this analysis, we quantify how network
+latency and number of nodes in the cluster impact the duration of the most
+important kinds of S3 requests.
+
+**Latency amplification** - With the kind of networks we use (consumer-grade
+fiber links across the EU), the observed latency between nodes is in the 50ms range.
+When latency is not negligible, you will observe that request completion
+time is a factor of the observed latency. That's to be expected: in many cases, the
+node of the cluster you are contacting cannot directly answer your request, and
+has to reach other nodes of the cluster to get the data. Each
+of these sequential remote procedure calls - or RPCs - adds to the final S3 request duration, which can quickly become
+expensive.  This ratio between request duration and network latency is what we
+refer to as *latency amplification*.
+
+For example, on Garage, a `GetObject` request does two sequential calls: first,
+it fetches the descriptor of the requested object from the metadata engine, which contains a reference
+to the first block of data, and then only in a second step it can start retrieving data blocks
+from storage nodes.  We can therefore expect that the
+request duration of a small `GetObject` request will be close to twice the
+network latency.
+
+We tested the latency amplification theory with another benchmark of our own named
+[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
+which does a single request at a time on an endpoint and measures the response
+time. As we are not interested in bandwidth but latency, all our requests
+involving objects are made on tiny files of around 16 bytes. Our benchmark
+tests 5 standard endpoints of the S3 API: ListBuckets, ListObjects, PutObject, GetObject and
+RemoveObject. Here are the results:
+
+
+![Latency amplification](amplification.png)
+
+As Garage has been optimized for this use case from the very beginning, we don't see
+any significant evolution from one version to another (Garage v0.7.3 and Garage
+v0.8.0 Beta 1 here).  Compared to Minio, these values are either similar (for
+ListObjects and ListBuckets) or significantly better (for GetObject, PutObject, and
+RemoveObject).  This can be easily explained by the fact that Minio has not been designed with
+environments with high latencies in mind. Instead, it is expected to run on clusters that are built
+in a singe data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
+[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
+feature.
+
+*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
+but it is even more sensitive to latency: "Multi-site replication has increased
+latency sensitivity, as Minio does not consider an object as replicated until
+it has synchronized to all configured remote targets. Replication latency is
+therefore dictated by the slowest link in the replication mesh."*
+
+
+**A cluster with many nodes** - Whether you already have many compute nodes
+with unused storage, need to store a lot of data, or are experimenting with unusual
+system architectures, you might be interested in deploying over a hundred Garage nodes.
+However, in some distributed systems, the number of nodes in the cluster will
+have an impact on performance. Theoretically, our protocol, which is inspired by distributed
+hash tables (DHT), should scale fairly well, but until now, we never took the time to test it
+with a hundred nodes or more.
+
+This test was run directly on Grid5000 with 6 physical servers spread
+in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
+to 65 instances of Garage simultaneously, for a total of 390 nodes. The
+network between physical servers is the dedicated network provided by
+the Grid5000 community. Nodes on the same physical machine communicate directly
+through the Linux network stack without any limitation. We are aware that this is a
+weakness of this test, but we still think that this test can be relevant as, at
+each step in the test, each instance of Garage has 83% (5/6) of its connections
+that are made over a real network. To measure performances for each cluster size, we used
+[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
+again:
+
+
+![Impact of response time with bigger clusters](complexity.png)
+
+Up to 250 nodes, we observed response times that remain constant. After this threshold,
+results become very noisy.  By looking at the server resource usage, we saw
+that their load started to become non-negligible: it seems that we are not
+hitting a limit on the protocol side, but have simply exhausted the resource
+of our testing nodes. In the future, we would like to run this experiment
+again, but on many more physical nodes, to confirm our hypothesis.  For now, we
+are confident that a Garage cluster with 100+ nodes should work.
+
+
+## Conclusion and Future work
+
+During this work, we identified some sensitive points on Garage,
+on which we will have to continue working: our data durability target and interaction with the
+filesystem (`O_DSYNC`, `fsync`, `O_DIRECT`, etc.) is not yet homogeneous across
+our components; our new metadata engines (LMDB, SQLite) still need some testing
+and tuning; and we know that raw I/O performances (GetObject and PutObject for large objects) have a small
+improvement margin.
+
+At the same time, Garage has never been in better shape: its next version (version 0.8) will
+see drastic improvements in terms of performance and reliability.  We are
+confident that Garage is already able to cover a wide range of deployment needs, up
+to over a hundred nodes and millions of objects.
+
+In the future, on the performance aspect, we would like to evaluate the impact
+of introducing an SRPT scheduler
+([#361](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/361)), define a data
+durability policy and implement it, make a deeper and larger review of the
+state of the art (Minio, Ceph, Swift, OpenIO, Riak CS, SeaweedFS, etc.) to
+learn from them and, lastly, benchmark Garage at scale with possibly multiple
+terabytes of data and billions of objects on long-lasting experiments.
+
+In the meantime, stay tuned: we have released
+[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
+and are already working on several features for the next version.
+For instance, we are working on a new layout that will have enhanced optimality properties,
+as well as a theoretical proof of correctness
+([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
+working on a Python SDK for Garage's administration API
+([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
+soon officially introduce a new API (as a technical preview) named K2V
+([see K2V on our doc for a primer](https://garagehq.deuxfleurs.fr/documentation/reference-manual/k2v/)).
+
+
+## Notes
+
+[^ref1]: Yes, we are aware of [Jepsen](https://github.com/jepsen-io/jepsen)'s
+existence. Jepsen is far more complex than our set of scripts, but
+it is also way more versatile.
+
+[^ref2]: The program name contains the word "billion", although we only tested Garage
+up to 1 million objects: this is not a typo, we were just a little bit too
+enthusiastic when we wrote it ;)
+
+<style>
+.footnote-definition p { display: inline; }
+</style>
--- a/content/blog/2022-perf/io.png
+++ b/content/blog/2022-perf/io.png
--- a/content/blog/2022-perf/schema-streaming.png
+++ b/content/blog/2022-perf/schema-streaming.png
--- a/content/blog/2022-perf/ttfb.png
+++ b/content/blog/2022-perf/ttfb.png
--- a/templates/section.html
+++ b/templates/section.html
@ -42,7 +42,7 @@
          </div>
          <div class="content mt-2">
            <div class="text-gray-700 text-lg not-italic">
-              {{ page.summary | safe | striptags }}
+              {{ page.summary | striptags | safe }}
            </div>
            <a class="group font-semibold p-4 flex items-center space-x-1 text-garage-orange" href='{{ page.permalink }}'>
              <div class="h-0.5 mt-0.5 w-4 group-hover:w-8 group-hover:bg-garage-gray transition-all bg-garage-orange"></div>