New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 21 additions and 20 deletions
|
@@ -4,15 +4,16 @@ date=2022-09-26
|
||||||
+++
|
+++
|
||||||
|
|
||||||
|
|
||||||
*During the past years, we have extensively analyzed possible design decisions and
|
*During the past years, we have thought a lot about possible design decisions and
|
||||||
their theoretical trade-offs for Garage, especially concerning networking, data
|
their theoretical trade-offs for Garage. In particular, we pondered the impacts
|
||||||
structures, and scheduling. Garage worked well enough for our production
|
of data structures, networking methods, and scheduling algorithms.
|
||||||
|
Garage worked well enough for our production
|
||||||
cluster at Deuxfleurs, but we also knew that people started to discover some
|
cluster at Deuxfleurs, but we also knew that people started to discover some
|
||||||
unexpected behaviors. We thus started a round of benchmark and performance
|
unexpected behaviors. We thus started a round of benchmarks and performance
|
||||||
measurements to see how Garage behaves compared to our expectations.
|
measurements to see how Garage behaves compared to our expectations.
|
||||||
This post presents some of our first results, which cover
|
This post presents some of our first results, which cover
|
||||||
3 aspects of performance: efficient I/O, "myriads of objects" and resiliency,
|
3 aspects of performance: efficient I/O, "myriads of objects", and resiliency,
|
||||||
to reflect the high-level properties we are seeking.*
|
reflecting the high-level properties we are seeking.*
|
||||||
|
|
||||||
<!-- more -->
|
<!-- more -->
|
||||||
|
|
||||||
|
@@ -20,8 +21,8 @@ to reflect the high-level properties we are seeking.*
|
||||||
|
|
||||||
## ⚠️ Disclaimer
|
## ⚠️ Disclaimer
|
||||||
|
|
||||||
The following results must be taken with a critical grain of salt due to some
|
The results presented in this blog post must be taken with a critical grain of salt due to some
|
||||||
limitations that are inherent to any benchmark. We try to reference them as
|
limitations that are inherent to any benchmarking endeavour. We try to reference them as
|
||||||
exhaustively as possible in this first section, but other limitations might exist.
|
exhaustively as possible in this first section, but other limitations might exist.
|
||||||
|
|
||||||
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
Most of our tests were made on simulated networks, which by definition cannot represent all the
|
||||||
|
@@ -91,7 +92,7 @@ The main purpose of an object storage system is to store and retrieve objects
|
||||||
across the network, and the faster these two functions can be accomplished,
|
across the network, and the faster these two functions can be accomplished,
|
||||||
the more efficient the system as a whole will be. For this analysis, we focus on
|
the more efficient the system as a whole will be. For this analysis, we focus on
|
||||||
2 aspects of performance. First, since many applications can start processing a file
|
2 aspects of performance. First, since many applications can start processing a file
|
||||||
before receiving it completely, we will evaulate the Time-to-First-Byte (TTFB)
|
before receiving it completely, we will evaluate the Time-to-First-Byte (TTFB)
|
||||||
on GetObject requests, i.e. the duration between the moment a request is sent
|
on GetObject requests, i.e. the duration between the moment a request is sent
|
||||||
and the moment where the first bytes of the returned object are received by the client.
|
and the moment where the first bytes of the returned object are received by the client.
|
||||||
Second, we will evaluate generic throughput, to understand how well
|
Second, we will evaluate generic throughput, to understand how well
|
||||||
|
@@ -187,7 +188,7 @@ adapted for small-scale tests, and we kept only the aggregated result named
|
||||||
"cluster total". The goal of this experiment is to get an idea of the cluster
|
"cluster total". The goal of this experiment is to get an idea of the cluster
|
||||||
performance with a standardized and mixed workload.
|
performance with a standardized and mixed workload.
|
||||||
|
|
||||||
![Plot showing IO perf of Garage configs and Minio](io.png)
|
![Plot showing IO performances of Garage configurations and Minio](io.png)
|
||||||
|
|
||||||
Minio, our reference point, gives us the best performances in this test.
|
Minio, our reference point, gives us the best performances in this test.
|
||||||
Looking at Garage, we observe that each improvement we made has a visible
|
Looking at Garage, we observe that each improvement we made has a visible
|
||||||
|
@@ -213,8 +214,8 @@ the only supported option was [sled](https://sled.rs/), but we started having
|
||||||
serious issues with it - and we were not alone
|
serious issues with it - and we were not alone
|
||||||
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
([#284](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/284)). With Garage
|
||||||
v0.8, we introduce an abstraction semantic over the features we expect from our
|
v0.8, we introduce an abstraction semantic over the features we expect from our
|
||||||
database, allowing us to switch from one backend to another without touching
|
database, allowing us to switch from one back-end to another without touching
|
||||||
the rest of our codebase. We added two additional backends: LMDB
|
the rest of our codebase. We added two additional back-ends: LMDB
|
||||||
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
(through [heed](https://github.com/meilisearch/heed)) and SQLite
|
||||||
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
(using [Rusqlite](https://github.com/rusqlite/rusqlite)). **Keep in mind that they
|
||||||
are both experimental: contrarily to sled, we have never run them in production
|
are both experimental: contrarily to sled, we have never run them in production
|
||||||
|
@@ -281,7 +282,7 @@ use a combination of `O_DSYNC` and `fdatasync(3p)` - a derivative that ensures
|
||||||
only data and not metadata is persisted on disk - in combination with
|
only data and not metadata is persisted on disk - in combination with
|
||||||
`O_DIRECT` for direct I/O
|
`O_DIRECT` for direct I/O
|
||||||
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
|
([discussion](https://github.com/minio/minio/discussions/14339#discussioncomment-2200274),
|
||||||
[example in minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
|
[example in Minio source](https://github.com/minio/minio/blob/master/cmd/xl-storage.go#L1928-L1932)).*
|
||||||
|
|
||||||
**Storing a million objects** - Object storage systems are designed not only
|
**Storing a million objects** - Object storage systems are designed not only
|
||||||
for data durability and availability but also for scalability, so naturally,
|
for data durability and availability but also for scalability, so naturally,
|
||||||
|
@@ -347,11 +348,11 @@ Let us now focus on Garage's metrics only to better see its specific behavior:
|
||||||
|
|
||||||
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
![Showing the time to send 128 batches of 8192 objects for Garage only](1million.png)
|
||||||
|
|
||||||
Two effects are now more visible: 1increasing batch completion time increases with the
|
Two effects are now more visible: 1. batch completion time increases with the
|
||||||
number of objects in the bucket and 2. measurements are dispersed, at least
|
number of objects in the bucket, and 2. measurements are dispersed, at least
|
||||||
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
more than for Minio. We expect this batch completion time increase to be logarithmic,
|
||||||
but we don't have enough datapoint to conclude safety: additional
|
but we don't have enough data points to conclude safely: additional
|
||||||
measurements are needed. Concercning the observed instability, it could
|
measurements are needed. Concerning the observed instability, it could
|
||||||
be a symptom of what we saw with some other experiments in this machine,
|
be a symptom of what we saw with some other experiments in this machine,
|
||||||
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
which sometimes freezes under heavy I/O load. Such freezes could lead to
|
||||||
request timeouts and failures. If this occurs on our testing computer, it will
|
request timeouts and failures. If this occurs on our testing computer, it will
|
||||||
|
@@ -418,14 +419,14 @@ any significant evolution from one version to another (Garage v0.7.3 and Garage
|
||||||
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
v0.8.0 Beta 1 here). Compared to Minio, these values are either similar (for
|
||||||
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
ListObjects and ListBuckets) or way better (for GetObject, PutObject, and
|
||||||
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
RemoveObject). This can be easily understood by the fact that Minio has not been designed for
|
||||||
environments with high latencies. Instead, it expects to run on clusters that are buil allt
|
environments with high latencies. Instead, it expects to run on clusters that are built
|
||||||
the same datacenter. In a multi-DC setup, different clusters could then possibly be interconnected them with their asynchronous
|
in a single data center. In a multi-DC setup, different clusters could then possibly be interconnected with their asynchronous
|
||||||
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
[bucket replication](https://min.io/docs/minio/linux/administration/bucket-replication.html?ref=docs-redirect)
|
||||||
feature.
|
feature.
|
||||||
|
|
||||||
*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
|
*Minio also has a [multi-site active-active replication system](https://blog.min.io/minio-multi-site-active-active-replication/)
|
||||||
but it is even more sensitive to latency: "Multi-site replication has increased
|
but it is even more sensitive to latency: "Multi-site replication has increased
|
||||||
latency sensitivity, as MinIO does not consider an object as replicated until
|
latency sensitivity, as Minio does not consider an object as replicated until
|
||||||
it has synchronized to all configured remote targets. Replication latency is
|
it has synchronized to all configured remote targets. Replication latency is
|
||||||
therefore dictated by the slowest link in the replication mesh."*
|
therefore dictated by the slowest link in the replication mesh."*
|
||||||
|
|
||||||
|
|