New article: Bringing theoretical design and observed performances face to face #12
1 changed file with 21 additions and 20 deletions
@ -4,15 +4,16 @@ date=2022-09-26
+++
*During the past years, we have thought a lot about possible design decisions and
their theoretical trade-offs for Garage. In particular, we pondered the impacts
of data structures, networking methods, and scheduling algorithms.
Garage worked well enough for our production
cluster at Deuxfleurs, but we also knew that people were starting to discover some
unexpected behaviors. We thus started a round of benchmarks and performance
measurements to see how Garage behaves compared to our expectations.
This post presents some of our first results, which cover
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
reflecting the high-level properties we are seeking.*
<!-- more -->
@ -20,8 +21,8 @@ to reflect the high-level properties we are seeking.*
## ⚠️ Disclaimer
The results presented in this blog post must be taken with a critical grain of salt due to some
limitations that are inherent to any benchmarking endeavour. We try to reference them as
exhaustively as possible in this first section, but other limitations might exist.

Most of our tests were made on simulated networks, which by definition cannot represent all the
@ -91,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
across the network, and the faster these two functions can be accomplished,
the more efficient the system as a whole will be. For this analysis, we focus on
2 aspects of performance. First, since many applications can start processing a file
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
on GetObject requests, i.e. the duration between the moment a request is sent
and the moment when the first bytes of the returned object are received by the client.
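As an illustration of the metric itself (and not of the benchmarking tool we
actually used), TTFB can be approximated client-side with a streaming GetObject.
The sketch below assumes boto3 and a hypothetical local endpoint with made-up
credentials:

```python
import time

import boto3  # any S3-compatible endpoint will do, e.g. a local Garage node

# Hypothetical endpoint and credentials, for illustration only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GKxxxxxxxx",
    aws_secret_access_key="xxxxxxxx",
)

def ttfb_ms(bucket: str, key: str) -> float:
    """Approximate Time-to-First-Byte of a GetObject request, in milliseconds."""
    start = time.monotonic()
    response = s3.get_object(Bucket=bucket, Key=key)
    response["Body"].read(1)  # block until the first byte of the body arrives
    return (time.monotonic() - start) * 1000

print(f"TTFB: {ttfb_ms('my-bucket', 'my-object'):.1f} ms")
```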
Second, we will evaluate generic throughput, to understand how well
@ -187,7 +188,7 @@ adapted for small-scale tests, and we kept only the aggregated result named
"cluster total". The goal of this experiment is to get an idea of the cluster
|
||||
performance with a standardized and mixed workload.
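To give a concrete idea of what a "mixed workload" means here, the following
simplified driver issues a randomized mix of S3 calls. It is purely
illustrative (boto3, hypothetical endpoint, made-up bucket name, arbitrary
weights), not the benchmark we ran:

```python
import random

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:3900")  # hypothetical endpoint
BUCKET, N, PAYLOAD = "bench", 256, b"x" * (1 << 20)  # made-up bucket, 1 MiB objects

# Pre-populate the bucket so that reads have something to fetch.
for i in range(N):
    s3.put_object(Bucket=BUCKET, Key=f"obj-{i}", Body=PAYLOAD)

# Issue a randomized mix of reads, writes, and stats (weights are arbitrary).
for _ in range(1000):
    key = f"obj-{random.randrange(N)}"
    op = random.choices(["get", "put", "stat"], weights=[45, 15, 30])[0]
    if op == "get":
        s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    elif op == "put":
        s3.put_object(Bucket=BUCKET, Key=key, Body=PAYLOAD)
    else:
        s3.head_object(Bucket=BUCKET, Key=key)
```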
![Plot showing I/O performance of Garage configurations and Minio](io.png)

Minio, our reference point, gives us the best performance in this test.
Looking at Garage, we observe that each improvement we made has a visible
@ -213,8 +214,8 @@ the only supported option was [sled](https://sled.rs/), but we started having
serious issues with it - and we were not alone
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
v0.8, we introduce an abstraction layer over the features we expect from our
database, allowing us to switch from one back-end to another without touching
the rest of our codebase. We added two additional back-ends: LMDB
(through [heed](https://github.com/meilisearch/heed)) and SQLite
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
are both experimental: contrary to sled, we have never run them in production
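To make the abstraction idea concrete, here is a minimal sketch of such an
interface. Garage's actual abstraction is written in Rust and is richer; all
names below are invented for illustration, and Python stands in for readability:

```python
import sqlite3
from abc import ABC, abstractmethod

class KvBackend(ABC):
    """The few operations the rest of the codebase is allowed to rely on."""

    @abstractmethod
    def get(self, tree: str, key: bytes) -> bytes | None: ...

    @abstractmethod
    def put(self, tree: str, key: bytes, value: bytes) -> None: ...

class SqliteBackend(KvBackend):
    """One interchangeable back-end; an LMDB version would expose the same API."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv"
            " (tree TEXT, k BLOB, v BLOB, PRIMARY KEY (tree, k))"
        )

    def get(self, tree: str, key: bytes) -> bytes | None:
        row = self.db.execute(
            "SELECT v FROM kv WHERE tree = ? AND k = ?", (tree, key)
        ).fetchone()
        return row[0] if row else None

    def put(self, tree: str, key: bytes, value: bytes) -> None:
        with self.db:  # transaction: commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO kv (tree, k, v) VALUES (?, ?, ?)",
                (tree, key, value),
            )
```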
@ -281,7 +282,7 @@ use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
only data and not metadata is persisted on disk - in combination with
`O_DIRECT` for direct I/O
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
[example in Minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
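For readers unfamiliar with these flags, here is a minimal, Linux-only Python
sketch of the two approaches. It is illustrative only and has nothing to do
with Minio's actual Go implementation (it also omits `O_DIRECT`, which
additionally bypasses the page cache but requires aligned buffers):

```python
import os

data = b"x" * 4096  # one block of payload

# Option 1: open with O_DSYNC, so each write() returns only once the data
# (though not necessarily the metadata) has reached stable storage.
fd = os.open("/tmp/dsync-test", os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
os.write(fd, data)
os.close(fd)

# Option 2: write normally, then explicitly flush the data (but not the
# metadata) with fdatasync(), the syscall behind fdatasync(3p).
fd = os.open("/tmp/fdatasync-test", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, data)
os.fdatasync(fd)
os.close(fd)
```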
**Storing a million objects** - Object storage systems are designed not only
for data durability and availability but also for scalability, so naturally,
@ -347,11 +348,11 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
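As a rough sketch of how such a curve can be produced, the following
illustrative driver times successive upload batches (boto3, hypothetical
endpoint and bucket name; our actual benchmark differs):

```python
import concurrent.futures as futures
import time

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:3900")  # hypothetical endpoint
BUCKET, BATCHES, BATCH_SIZE = "myriads", 128, 8192

with futures.ThreadPoolExecutor(max_workers=64) as pool:
    for batch in range(BATCHES):
        start = time.monotonic()
        jobs = [
            pool.submit(
                s3.put_object,
                Bucket=BUCKET,
                Key=f"batch-{batch}/obj-{i}",
                Body=b"",  # tiny objects: we stress metadata, not data
            )
            for i in range(BATCH_SIZE)
        ]
        for job in jobs:
            job.result()  # propagate failures
        print(f"batch {batch}: {time.monotonic() - start:.1f}s")
```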
Two effects are now more visible: 1. batch completion time increases with the
number of objects in the bucket, and 2. measurements are dispersed, at least
more than for Minio. We expect this batch completion time increase to be logarithmic,
but we don't have enough data points to conclude safely: additional
measurements are needed. Concerning the observed instability, it could
be a symptom of what we saw with some other experiments on this machine,
which sometimes freezes under heavy I/O load. Such freezes could lead to
request timeouts and failures. If this occurs on our testing computer, it will
@ -418,14 +419,14 @@ any significant evolution from one version to another (Garage v0.7.3 and Garage
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
RemoveObject). This is easily explained by the fact that Minio has not been designed for
high-latency environments. Instead, it expects to run on clusters that are built
in a single data center. In a multi-DC setup, different clusters could then possibly be
interconnected with their asynchronous
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
feature.
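For completeness, per-endpoint latencies like the ones discussed here can be
collected with a few lines of client code. The sketch below (boto3, hypothetical
endpoint and bucket, a simple mean over repeated calls) illustrates the method
rather than our exact protocol:

```python
import statistics
import time

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:3900")  # hypothetical endpoint

def mean_latency_ms(fn, n=10):
    """Mean wall-clock duration of n calls to fn, in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.monotonic()
        fn()
        samples.append((time.monotonic() - start) * 1000)
    return statistics.mean(samples)

endpoints = {
    "ListBuckets": lambda: s3.list_buckets(),
    "ListObjects": lambda: s3.list_objects_v2(Bucket="bench"),
    "PutObject": lambda: s3.put_object(Bucket="bench", Key="probe", Body=b"x"),
    "GetObject": lambda: s3.get_object(Bucket="bench", Key="probe")["Body"].read(),
    "RemoveObject": lambda: s3.delete_object(Bucket="bench", Key="probe"),
}
for name, call in endpoints.items():
    print(f"{name}: {mean_latency_ms(call):.0f} ms")
```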
*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
but it is even more sensitive to latency: "Multi-site replication has increased
latency sensitivity, as Minio does not consider an object as replicated until
it has synchronized to all configured remote targets. Replication latency is
therefore dictated by the slowest link in the replication mesh."*