New article: Bringing theoretical design and observed performances face to face #12

quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit 6ee5ef82ad


@ -33,7 +33,7 @@ cluster. Furthermore, we only benchmarked some very specific aspects of Garage:
our results are thus not an evaluation of the performance of Garage as a whole. our results are thus not an evaluation of the performance of Garage as a whole.
For some benchmarks, we used Minio as a reference. It must be noted that we did For some benchmarks, we used Minio as a reference. It must be noted that we did
not try to optimize its configuration as we have done on Garage, and more not try to optimize its configuration as we have done for Garage, and more
generally, we have way less knowledge on Minio than on Garage, which can lead generally, we have way less knowledge on Minio than on Garage, which can lead
to underrated performance measurements for Minio. It must also be noted that to underrated performance measurements for Minio. It must also be noted that
Garage and Minio are systems with different feature sets. For instance Minio supports Garage and Minio are systems with different feature sets. For instance Minio supports
@ -41,7 +41,7 @@ erasure coding for higher data density, which Garage doesn't, Minio implements
way more S3 endpoints than Garage, etc. Such features necessarily have a cost way more S3 endpoints than Garage, etc. Such features necessarily have a cost
that you must keep in mind when reading the plots we will present. You should consider that you must keep in mind when reading the plots we will present. You should consider
results on Minio as a way to contextualize our results on Garage, to see that our improvements results on Minio as a way to contextualize our results on Garage, to see that our improvements
are not artificial compared to existing object storage implementations. are not artificial in the light of existing object storage implementations.
The impact of the testing environment is also not evaluated (kernel patches, The impact of the testing environment is also not evaluated (kernel patches,
configuration, parameters, filesystem, hardware configuration, etc.). Some of configuration, parameters, filesystem, hardware configuration, etc.). Some of
@ -66,7 +66,7 @@ We made a first batch of tests on
[Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible [Grid5000](https://www.grid5000.fr/w/Grid5000:Home), a large-scale and flexible
testbed for experiment-driven research in all areas of computer science, testbed for experiment-driven research in all areas of computer science,
which has an which has an
[Open Access program](https://www.grid5000.fr/w/Grid5000:Open-Access). [open access program](https://www.grid5000.fr/w/Grid5000:Open-Access).
During our tests, we used part of the following clusters: During our tests, we used part of the following clusters:
[nova](https://www.grid5000.fr/w/Lyon:Hardware#nova), [nova](https://www.grid5000.fr/w/Lyon:Hardware#nova),
[paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and [paravance](https://www.grid5000.fr/w/Rennes:Hardware#paravance), and
@ -92,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster these two functions can be accomplished, across the network, and the faster these two functions can be accomplished,
the more efficient the system as a whole will be. For this analysis, we focus on the more efficient the system as a whole will be. For this analysis, we focus on
2 aspects of performance. First, since many applications can start processing a file 2 aspects of performance. First, since many applications can start processing a file
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB) before receiving it completely, we will evaluate the time-to-first-byte (TTFB)
on GetObject requests, i.e. the duration between the moment a request is sent on GetObject requests, i.e. the duration between the moment a request is sent
and the moment where the first bytes of the returned object are received by the client. and the moment where the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well Second, we will evaluate generic throughput, to understand how well
@ -103,17 +103,17 @@ web endpoints, with the idea to make it a platform of choice to publish
static websites. When publishing a website, TTFB can be directly observed static websites. When publishing a website, TTFB can be directly observed
by the end user, as it will impact the perceived reactivity of the websites. by the end user, as it will impact the perceived reactivity of the websites.
Up to version 0.7.3, Time-to-First-Byte on Garage used to be relatively high. Up to version 0.7.3, time-to-first-byte on Garage used to be relatively high.
This can be explained by the fact that Garage was not able to handle data internally This can be explained by the fact that Garage was not able to handle data internally
at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object at a smaller granularity level than entire data blocks, which are 1MB chunks of a given object
(this is [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)). (a size which [can be configured](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#block-size)).
Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of Let us take the example of a 4.5MB object, which Garage will split into 4 blocks of
1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET` 1MB and 1 block of 0.5MB. With the old design, when you were sending a `GET`
request, the first block had to be fully retrieved by the gateway node from the request, the first block had to be fully retrieved by the gateway node from the
storage node before starting to send any data to the client. storage node before starting to send any data to the client.
With Garage v0.8, we integrated a block streaming logic that allows the gateway With Garage v0.8, we integrated a data streaming logic that allows the gateway
to send the beginning of a block without having to wait for the full block from to send the beginning of a block without having to wait for the full block to be received from
the storage node. We can visually represent the difference as follows: the storage node. We can visually represent the difference as follows:
<center> <center>
@ -138,23 +138,23 @@ whose results are shown on the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png) ![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
and 2s, which corresponds to the time to transfer the full block which we calculated before. and 2s, which corresponds to the time to transfer the full block which we calculated above.
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
streaming feature (the lowest value is 43ms). Minio sits between the two streaming feature (the lowest value is 43ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but smaller Garage versions: we suppose that it does some form of batching, but smaller
than 1MB. than 1MB.
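The link between block size and TTFB can be checked with a little arithmetic. The sketch below is illustrative only: `split_into_blocks` and `block_transfer_time_s` are hypothetical helper names, 1MB is taken as 10^6 bytes, and the 5 Mbit/s link speed is an assumption, chosen because it reproduces the 1.6s lower bound observed for Garage v0.7.

```python
def split_into_blocks(object_size: int, block_size: int = 1_000_000) -> list[int]:
    """Return the sizes (in bytes) of the blocks an object is split into."""
    full_blocks, remainder = divmod(object_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def block_transfer_time_s(block_size: int, bandwidth_bits_per_s: float) -> float:
    """Time to move one whole block over a link, ignoring latency and protocol overhead."""
    return block_size * 8 / bandwidth_bits_per_s


# The 4.5MB object from the example: four 1MB blocks and one 0.5MB block.
blocks = split_into_blocks(4_500_000)

# Without streaming, the first block must be fully received by the gateway
# before any byte reaches the client (5 Mbit/s is an assumed link speed).
first_block_time = block_transfer_time_s(blocks[0], 5_000_000)

print(blocks)
print(first_block_time)
```

With streaming, this full-block transfer time no longer bounds the TTFB from below, which is consistent with the much lower values measured on v0.8.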
**Throughput** - As soon as we publicly released Garage, people started **Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performances with writing directly on the benchmarking it, comparing its performances to writing directly on the
filesystem, and observed that Garage was slower (e.g. filesystem, and observed that Garage was slower (e.g.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the [#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread,
and many others and many others
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342), ([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to [#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
version 0.8 "Beta 1". We also noticed that some of the logic we wrote version 0.8 "Beta 1". We also noticed that some of the logic we wrote
to better control resource usage to better control resource usage
and detect errors, like semaphores or timeouts, was artificially limiting and detect errors, including semaphores and timeouts, was artificially limiting
performances. In another iteration, we made this logic less restrictive at the performances. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in ([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
@ -170,16 +170,16 @@ write directly to the disk when a process creates or updates a file in your
filesystem. Instead, the write is kept in memory, and flushed later in a batch filesystem. Instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush with other writes. If a power loss occurs before the OS has time to flush
data to disk, some writes will be lost. To ensure that a write is effectively data to disk, some writes will be lost. To ensure that a write is effectively
written on disk, the written to disk, the
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used, [`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
which blocks until the file or directory on which it is called has been written from volatile which blocks until the file or directory on which it is called has been flushed from volatile
memory to the persistent storage device. Additionally, the exact semantics of memory to the persistent storage device. Additionally, the exact semantics of
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/) `fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres, it was and, even on battle-tested software like Postgres, it was
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/). ["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that on Garage, we are currently working on our `fsync` policy and thus, for Note that on Garage, we are currently working on our `fsync` policy and thus, for
now, you should expect limited data durability in case of power loss, as we are now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistency on this point (which we describe in the following aware of some inconsistencies on this point (which we describe in the following
and plan to solve).* and plan to solve).*
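To make the role of `fsync` concrete, here is a minimal sketch of a durable write. It is not Garage's implementation (Garage is built with Rust): `durable_write` is a hypothetical helper, and syncing the parent directory is a common POSIX practice that we assume here, not something described above.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and block until the OS reports it as flushed to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device

    # On POSIX systems, also fsync the parent directory so that the new
    # directory entry itself is persisted.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "block.bin")
    durable_write(target, b"hello")
    with open(target, "rb") as f:
        content = f.read()
```

Skipping the two `os.fsync` calls makes the function return sooner, at the price of possibly losing the write if power is lost before the OS flushes its buffers.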
To assess performance improvements, we used the benchmark tool To assess performance improvements, we used the benchmark tool
@ -191,10 +191,10 @@ performance with a standardized and mixed workload.
![Plot showing IO performances of Garage configurations and Minio](io.png) ![Plot showing IO performances of Garage configurations and Minio](io.png)
Minio, our reference point, gives us the best performances in this test. Minio, our reference point, gives us the best performances in this test.
Looking at Garage, we observe that each improvement we made has a visible Looking at Garage, we observe that each improvement we made had a visible
impact on performances. We also note that we have a progress margin in impact on performances. We also note that we have a progress margin in
terms of performances compared to Minio: additional benchmarks, tests, and terms of performances compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference. monitoring could help us better understand the remaining difference.
## A myriad of objects ## A myriad of objects
@ -205,7 +205,7 @@ communicates about Ceph being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed, objects must be tracked efficiently in the system to be fetched, listed,
removed, etc. In Garage, we use a "metadata engine" component to track them. removed, etc. In Garage, we use a "metadata engine" component to track them.
For this analysis, we compare different metadata engines in Garage and see how For this analysis, we compare different metadata engines in Garage and see how
well the best one scale to a million objects. well the best one scales to a million objects.
**Testing metadata engines** - With Garage, we chose not to store metadata **Testing metadata engines** - With Garage, we chose not to store metadata
directly on the filesystem, like Minio for example, but in a specialized on-disk directly on the filesystem, like Minio for example, but in a specialized on-disk
@ -222,11 +222,11 @@ are both experimental: contrarily to sled, we have never run them in production
for a long time.** for a long time.**
Similarly to the impact of `fsync` on block writing, each database engine we use Similarly to the impact of `fsync` on block writing, each database engine we use
has its own policy with `fsync`. Sled flushes its write every 2 seconds by has its own policy with `fsync`. Sled flushes its writes every 2 seconds by
default, this is default (this is
[configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)). [configurable](https://garagehq.deuxfleurs.fr/documentation/reference-manual/configuration/#sled-flush-every-ms)).
LMDB by default does an `fsync` on each write, which on early tests led to very LMDB by default does an `fsync` on each write, which on early tests led to
slow resynchronizations between nodes. We thus added 2 flags, abysmal performance. We thus added 2 flags,
[MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e) [MDB\_NOSYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5791dd1adb09123f82dd1f331209e12e)
and and
[MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16), [MDB\_NOMETASYNC](http://www.lmdb.tech/doc/group__mdb__env.html#ga5021c4e96ffe9f383f5b8ab2af8e4b16),
@ -234,14 +234,15 @@ to deactivate `fsync`. On SQLite, it is also possible to deactivate `fsync` with
`pragma synchronous = off`, but we have not started any optimization work on it yet: `pragma synchronous = off`, but we have not started any optimization work on it yet:
our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are our SQLite implementation currently calls `fsync` for all write operations. Additionally, we are
using these engines through Rust bindings that do not support async Rust, using these engines through Rust bindings that do not support async Rust,
with which Garage is built. **Our comparison will therefore not reflect the raw performances of with which Garage is built, which has an impact on performance as well.
**Our comparison will therefore not reflect the raw performances of
these database engines, but instead, our integration choices.** these database engines, but instead, our integration choices.**
Still, we think it makes sense to evaluate our implementations in their current Still, we think it makes sense to evaluate our implementations in their current
state in Garage. We designed a benchmark that is intensive on the metadata part state in Garage. We designed a benchmark that is intensive on the metadata part
of the software, i.e. handling large numbers of tiny files. We chose again of the software, i.e. handling large numbers of tiny files. We chose again
`minio/warp` as a benchmark tool but we `minio/warp` as a benchmark tool, but we
configured it with the smallest possible object size it supported, 256 configured it here with the smallest possible object size it supported, 256
bytes, to put some pressure on the metadata engine. We evaluated sled twice: bytes, to put some pressure on the metadata engine. We evaluated sled twice:
with its default configuration, and with a configuration where we set a flush with its default configuration, and with a configuration where we set a flush
interval of 10 minutes to disable `fsync`. interval of 10 minutes to disable `fsync`.
@ -261,19 +262,19 @@ disk storage and RAM; we would like to quantify that in the future. As we are
only at the very beginning of our work on metadata engines, it is hard to draw only at the very beginning of our work on metadata engines, it is hard to draw
strong conclusions. Still, we can say that SQLite is not ready for production strong conclusions. Still, we can say that SQLite is not ready for production
workloads, and that LMDB looks very promising both in terms of performances and resource workloads, and that LMDB looks very promising both in terms of performances and resource
usage, and is a very good candidate for being Garage's default metadata engine in the usage, and is a very good candidate for being Garage's default metadata engine in
future. In the future, we will need to define a data policy for Garage to help us future releases. In the future, we will need to define a data policy for Garage to help us
arbitrate between performances and durability. arbitrate between performance and durability.
*To `fsync` or not to `fsync`? Performance is nothing without reliability, so we *To `fsync` or not to `fsync`? Performance is nothing without reliability, so we
need to better assess the impact of validating a write and then possibly losing it. need to better assess the impact of possibly losing a write after it has been validated.
Because Garage is a distributed system, even if a node loses its write due to a Because Garage is a distributed system, even if a node loses its write due to a
power loss, it will fetch it back from the 2 other nodes storing it. But rare power loss, it will fetch it back from the 2 other nodes that store it. But rare
situations can occur, where 1 node is down and the 2 others validated the write and then situations can occur where 1 node is down and the 2 others validate the write and then
lost power. What is our policy in this case? For storage durability, lose power before having time to flush to disk. What is our policy in this case? For storage durability,
we are already supposing that we never lose the storage of more than 2 nodes, we are already supposing that we never lose the storage of more than 2 nodes,
so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same so should we also make the hypothesis that we won't lose power on more than 2 nodes at the same
time? What should we think about people hosting all their nodes at the same time? What should we do about people hosting all of their nodes at the same
place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted place without an uninterruptible power supply (UPS)? Historically, it seems that Minio developers also accepted
some compromises on this side some compromises on this side
([#3536](https://github.com/minio/minio/issues/3536), ([#3536](https://github.com/minio/minio/issues/3536),
@ -286,7 +287,7 @@ only data and not metadata is persisted on disk - in combination with
**Storing a million objects** - Object storage systems are designed not only **Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally, for data durability and availability but also for scalability, so naturally,
some people asked us how scalable Garage is. While answering this some people asked us how scalable Garage is. While giving a definitive answer to this
question is out of the scope of this study, we wanted to be sure that our question is out of the scope of this study, we wanted to be sure that our
metadata engine would be able to scale to a million objects. To put this metadata engine would be able to scale to a million objects. To put this
target in context, it remains small compared to other industrial solutions: target in context, it remains small compared to other industrial solutions:
@ -300,10 +301,10 @@ We wrote our own benchmarking tool for this test,
The benchmark procedure consists in The benchmark procedure consists in
concurrently sending a defined number of tiny objects (8192 objects of 16 concurrently sending a defined number of tiny objects (8192 objects of 16
bytes by default) and measuring the time it takes. This step is then repeated a given bytes by default) and measuring the time it takes. This step is then repeated a given
number of times (128 by default) to effectively create a certain target number of number of times (128 by default) to effectively create a target number of
objects on the cluster (1M by default). On our local setup with 3 objects on the cluster (1M by default). On our local setup with 3
nodes, both Minio and Garage with LMDB were able to achieve this target. In the nodes, both Minio and Garage with LMDB were able to achieve this target. In the
following plot, we show how much time it took to Garage and Minio to handle following plot, we show how much time it took Garage and Minio to handle
each batch. each batch.
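The procedure described above can be sketched as follows. This is not the code of our benchmarking tool: `run_benchmark`, its `workers` parameter, and the dictionary standing in for the S3 endpoint are all hypothetical, and only the default batch count, batch size, and object size come from the description above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(put_object, batches=128, batch_size=8192, object_size=16, workers=32):
    """Send `batches` batches of tiny objects and return each batch's duration in seconds."""
    payload = b"x" * object_size
    durations = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in range(batches):
            keys = [f"obj-{batch}-{i}" for i in range(batch_size)]
            start = time.perf_counter()
            # list() waits for every upload of the batch and re-raises any error
            list(pool.map(lambda key: put_object(key, payload), keys))
            durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a real PutObject call: objects are stored in a dictionary.
store = {}
durations = run_benchmark(store.__setitem__, batches=4, batch_size=64)

print(len(store), len(durations))
print(128 * 8192)  # number of objects created with the default parameters
```

With the default parameters, 128 batches of 8192 objects amount to 1,048,576 objects, which is where the 1M target comes from.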
Before looking at the plot, **you must keep in mind some important points about Before looking at the plot, **you must keep in mind some important points about
@ -312,14 +313,14 @@ the internals of both Minio and Garage**.
Minio has no metadata engine, it stores its objects directly on the filesystem. Minio has no metadata engine, it stores its objects directly on the filesystem.
Sending 1 million objects on Minio results in creating one million inodes on Sending 1 million objects on Minio results in creating one million inodes on
the storage server in our current setup. So the performances of the filesystem the storage server in our current setup. So the performances of the filesystem
will probably substantially impact the results we will observe. probably have substantial impact on the results we observe.
In our precise setup, we know that the In our precise setup, we know that the
filesystem we used is not adapted at all for Minio (encryption layer, fixed filesystem we used is not adapted at all for Minio (encryption layer, fixed
number of inodes, etc.). Additionally, we mentioned earlier that we deactivated number of inodes, etc.). Additionally, we mentioned earlier that we deactivated
`fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the `fsync` for our metadata engine in Garage, whereas Minio has some `fsync` logic here slowing down the
creation of objects. Finally, object storage is designed for big objects: the creation of objects. Finally, object storage is designed for big objects, for which the
costs measured here are negligible for bigger objects. In the end, again, we use Minio as a costs measured here are negligible. In the end, again, we use Minio as a
reference to understand what is our performance budget for each part of our reference point to understand what performance budget we have for each part of our
software. software.
Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is Conversely, Garage has an optimization for small objects. Below 3KB, a separate file is
@ -334,8 +335,8 @@ metadata engine and thus focus only on 16-byte objects.
It appears that the performances of our metadata engine are acceptable, as we It appears that the performances of our metadata engine are acceptable, as we
have a comfortable margin compared to Minio (Minio is between 3x and 4x have a comfortable margin compared to Minio (Minio is between 3x and 4x
slower per batch). We also note that, past the 200k objects mark, Minio's slower per batch). We also note that, past the 200k objects mark, Minio's
time to complete a batch of inserts is constant, while on Garage it is still increasing on the observed range. time to complete a batch of inserts is constant, while on Garage it still increases on the observed range.
It could be interesting to know if Garage batch's completion time would cross Minio's one It could be interesting to know if Garage's batch completion time would cross Minio's one
for a very large number of objects. If we reason per object, both Minio's and for a very large number of objects. If we reason per object, both Minio's and
Garage's performances remain very good: it takes respectively around 20ms and Garage's performances remain very good: it takes respectively around 20ms and
5ms to create an object. At 100 Mbps, the upload of a 10MB file takes 5ms to create an object. At 100 Mbps, the upload of a 10MB file takes
@ -348,17 +349,17 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png) ![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1., increasing batch completion time increases with the Two effects are now more visible: 1., batch completion time increases with the
number of objects in the bucket and 2., measurements are dispersed, at least number of objects in the bucket and 2., measurements are dispersed, at least
more than for Minio. We expect this batch completion time increase to be logarithmic, more than for Minio. We expect this batch completion time increase to be logarithmic,
but we don't have enough data points to conclude safely: additional but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw with some other experiments on this machine, be a symptom of what we saw with some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will request timeouts and failures. If this occurs on our testing computer, it might
occur on other servers too: it would be interesting to better understand this occur on other servers as well: it would be interesting to better understand this
issue, document how to avoid it, and potentially change how we handle our I/O. At the same issue, document how to avoid it, and potentially change how we handle our I/O
time, this was a very stressful test that will probably not be encountered in internally in Garage. But still, this was a very stressful test that will probably not be encountered in
many setups: we were adding 273 objects per second for 30 minutes straight! many setups: we were adding 273 objects per second for 30 minutes straight!
To conclude this part, Garage can ingest 1 million tiny objects while remaining To conclude this part, Garage can ingest 1 million tiny objects while remaining
@ -366,16 +367,16 @@ usable on our local setup. To put this result in perspective, our production
cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with cluster at [deuxfleurs.fr](https://deuxfleurs) smoothly manages a bucket with
116k objects. This bucket contains real-world production data: it is used by our Matrix instance 116k objects. This bucket contains real-world production data: it is used by our Matrix instance
to store people's media files (profile pictures, shared pictures, videos, to store people's media files (profile pictures, shared pictures, videos,
audios, documents...). Thanks to this benchmark, we have identified two points audio files, documents...). Thanks to this benchmark, we have identified two points
of vigilance: the increase of batch insert time with the number of existing of vigilance: the increase of batch insert time with the number of existing
objects in the cluster in the observed range, and the volatility in our measured data that objects in the cluster in the observed range, and the volatility in our measured data that
could be a symptom of our system freezing under the load. Despite these two could be a symptom of our system freezing under the load. Despite these two
points, we are confident that Garage could scale way above 1M+ objects, although points, we are confident that Garage could scale way above 1M objects, although
that remains to be proven. that remains to be proven.
## In an unpredictable world, stay resilient ## In an unpredictable world, stay resilient
Supporting a variety of network properties and computers, especially ones that Supporting a variety of real-world networks and computers, especially ones that
were not designed for software-defined storage or even for server purposes, is the were not designed for software-defined storage or even for server purposes, is the
core value proposition of Garage. For example, our production cluster is core value proposition of Garage. For example, our production cluster is
hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg) hosted [on refurbished Lenovo Thinkcentre Tiny desktop computers](https://guide.deuxfleurs.fr/img/serv_neptune.jpg)
@ -387,8 +388,8 @@ latency and the number of nodes in the cluster impact the duration of the most
important kinds of S3 requests. important kinds of S3 requests.
**Latency amplification** - With the kind of networks we use (consumer-grade **Latency amplification** - With the kind of networks we use (consumer-grade
fiber links across the EU), the observed latency is in the 50ms range between fiber links across the EU), the observed latency between nodes is in the 50ms range.
nodes. When latency is not negligible, you will observe that request completion When latency is not negligible, you will observe that request completion
time is a factor of the observed latency. That's to be expected: in many cases, the time is a factor of the observed latency. That's to be expected: in many cases, the
node of the cluster you are contacting can not directly answer your request, and node of the cluster you are contacting can not directly answer your request, and
has to reach other nodes of the cluster to get the requested information. Each has to reach other nodes of the cluster to get the requested information. Each
@ -403,7 +404,7 @@ from storage nodes. We can therefore expect that the
request duration of a small GetObject request will be close to twice the request duration of a small GetObject request will be close to twice the
network latency. network latency.
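This expectation can be written as a very simple model, given here as a rough sketch only: `expected_duration_ms` is a hypothetical helper, and the model ignores processing time and bandwidth, so it only gives a lower bound.

```python
def expected_duration_ms(sequential_round_trips: int, latency_ms: float) -> float:
    """Lower bound on a request's duration when each internal round trip
    must complete before the next one can start."""
    return sequential_round_trips * latency_ms


# A small GetObject needs two sequential round trips; with latency in the
# 50ms range, the request cannot complete in much less than this.
small_get_object = expected_duration_ms(sequential_round_trips=2, latency_ms=50.0)
print(small_get_object)
```

Requests that need more sequential exchanges between nodes would see their duration grow in the same proportion.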
We tested this theory with another benchmark of our own named We tested the latency amplification theory with another benchmark of our own named
[s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat) [s3lat](https://git.deuxfleurs.fr/Deuxfleurs/mknet/src/branch/main/benchmarks/s3lat)
which does a single request at a time on an endpoint and measures its response which does a single request at a time on an endpoint and measures its response
time. As we are not interested in bandwidth but latency, all our requests time. As we are not interested in bandwidth but latency, all our requests
@ -420,7 +421,7 @@ v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This can be easily understood by the fact that Minio has not been designed for RemoveObject). This can be easily understood by the fact that Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that are built environments with high latencies. Instead, it expects to run on clusters that are built
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect) [bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature. feature.
@ -443,7 +444,7 @@ This test was ran directly on Grid5000 with 6 physical servers spread
in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up in 3 locations in France: Lyon, Rennes, and Nantes. On each server, we ran up
to 65 instances of Garage simultaneously, for a total of 390 nodes. The to 65 instances of Garage simultaneously, for a total of 390 nodes. The
network between physical servers is the dedicated network provided by network between physical servers is the dedicated network provided by
the Grid5000 operators. Nodes on the same physical machine communicate directly the Grid5000 community. Nodes on the same physical machine communicate directly
through the Linux network stack without any limitation. We are aware that this is a through the Linux network stack without any limitation. We are aware that this is a
weakness of this test, but we still think that this test can be relevant as, at weakness of this test, but we still think that this test can be relevant as, at
each step in the test, each instance of Garage has 83% (5/6) of its connections each step in the test, each instance of Garage has 83% (5/6) of its connections
@ -487,8 +488,10 @@ terabytes of data and billions of objects on long-lasting experiments.
In the meantime, stay tuned: we have released In the meantime, stay tuned: we have released
[a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1), [a first release candidate for Garage v0.8](https://git.deuxfleurs.fr/Deuxfleurs/garage/releases/tag/v0.8.0-rc1),
and we are working on proving and explaining our layout algorithm and are already working on a number of features for the next version.
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)), we are also For instance, we are working on a new layout that will have enhanced optimality properties,
as well as a theoretical proof of correctness
([#296](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/296)). We are also
working on a Python SDK for Garage's administration API working on a Python SDK for Garage's administration API
([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will ([#379](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/379)), and we will
soon officially introduce a new API (as a technical preview) named K2V soon officially introduce a new API (as a technical preview) named K2V