New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit 7edd77d61a - Show all commits

@@ -4,15 +4,16 @@ date=2022-09-26
+++
*During the past years, we have extensively analyzed possible design decisions and
their theoretical trade-offs for Garage, especially concerning networking, data
structures, and scheduling. Garage worked well enough for our production
*During the past years, we have thought a lot about possible design decisions and
their theoretical trade-offs for Garage. In particular, we pondered the impacts
of data structures, networking methods, and scheduling algorithms.
Garage worked well enough for our production
cluster at Deuxfleurs, but we also knew that people started to discover some
unexpected behaviors. We thus started a round of benchmark and performance
unexpected behaviors. We thus started a round of benchmarks and performance
measurements to see how Garage behaves compared to our expectations.
This post presents some of our first results, which cover
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
to reflect the high-level properties we are seeking.*
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
reflecting the high-level properties we are seeking.*
<!-- more -->
@@ -20,8 +21,8 @@ to reflect the high-level properties we are seeking.*
## ⚠️ Disclaimer
The following results must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmark. We try to reference them as
The results presented in this blog post must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmarking endeavour. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist.
Most of our tests were made on simulated networks, which by definition cannot represent all the
@@ -91,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster these two functions can be accomplished,
the more efficient the system as a whole will be. For this analysis, we focus on
2 aspects of performance. First, since many applications can start processing a file
before receiving it completely, we will evaulate the Time-to-First-Byte (TTFB)
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
on GetObject requests, i.e. the duration between the moment a request is sent
and the moment when the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well
@@ -187,7 +188,7 @@ adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)
![Plot showing IO performances of Garage configurations and Minio](io.png)
Minio, our reference point, gives us the best performances in this test.
Looking at Garage, we observe that each improvement we made has a visible
@@ -213,8 +214,8 @@ the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction semantic over the features we expect from our
database, allowing us to switch from one backend to another without touching
the rest of our codebase. We added two additional backends: LMDB
database, allowing us to switch from one back-end to another without touching
the rest of our codebase. We added two additional back-ends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: unlike sled, we have never run them in production
@@ -281,7 +282,7 @@ use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
[example in Minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
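
To make the note above concrete, here is a minimal Python sketch of what combining `O_DSYNC` with `fdatasync` can look like from user space. It is not Minio's code (Minio is written in Go), and the helper name `durable_write` is hypothetical. `O_DIRECT` is deliberately left out, since direct I/O requires aligned buffers that a short example would not handle properly. The sketch assumes a Unix-like system and falls back to `fsync` where `fdatasync` is unavailable.

```python
import os


def durable_write(path: str, payload: bytes) -> int:
    """Write payload to path and return once the data is flushed to disk.

    Hypothetical helper for illustration only.
    """
    # O_DSYNC asks the kernel to make each write() return only after the
    # data has been persisted. The flag is skipped on platforms lacking it.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        written = 0
        while written < len(payload):
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, payload[written:])
        # fdatasync flushes file data, skipping metadata that is not
        # needed to read the data back. Fall back to fsync if missing.
        if hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

With `O_DSYNC` set, the explicit `fdatasync` call is largely redundant. It is kept here only to show both mechanisms named in the note side by side.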
**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally,
@@ -347,11 +348,11 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
Two effects are now more visible: 1increasing batch completion time increases with the
number of objects in the bucket and 2. measurements are dispersed, at least
Two effects are now more visible: 1. batch completion time increases with the
number of objects in the bucket and 2. measurements are dispersed, at least
more than for Minio. We expect this batch completion time increase to be logarithmic,
but we don't have enough datapoint to conclude safety: additional
measurements are needed. Concercning the observed instability, it could
but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw with some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will
@@ -418,14 +419,14 @@ any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
environments with high latencies. Instead, it expects to run on clusters that are buil allt
the same datacenter. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous
environments with high latencies. Instead, it expects to run on clusters that are built
in a single data center. In a multi-DC setup, different clusters could then be interconnected with their asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.
*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as MinIO does not consider an object as replicated until
latency sensitivity, as Minio does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*
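
The "slowest link" statement in the quote above can be illustrated with a very small Python sketch. It is a simplification: it assumes the site synchronizes to all remote targets in parallel and ignores local processing time. The function name, site names, and latency values are all hypothetical.

```python
def replication_latency_ms(link_latencies_ms: dict[str, float]) -> float:
    """Time before an object counts as replicated everywhere.

    Assumes all remote targets are contacted in parallel, so the slowest
    link determines the total (hypothetical model, for illustration).
    """
    return max(link_latencies_ms.values())


# Hypothetical one-way replication latencies to three remote sites.
links = {"site-a": 12.0, "site-b": 25.0, "site-c": 280.0}
print(replication_latency_ms(links))  # 280.0
```

Under this model, adding a single distant site raises the replication latency seen by every write, however fast the other links are.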