New article: Bringing theoretical design and observed performances face to face #12
1 changed files with 28 additions and 27 deletions
|
@ -97,7 +97,7 @@ and the moment where the first bytes of the returned object are received by the
|
||||||
Second, we will evaluate generic throughput, to understand how well
|
Second, we will evaluate generic throughput, to understand how well
|
||||||
Garage can leverage the underlying machine's performances.
|
Garage can leverage the underlying machine's performance.
|
||||||
|
|
||||||
**Time To First Byte** - One specificity of Garage is that we implemented S3
|
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
|
||||||
web endpoints, with the idea to make it a platform of choice to publish
|
web endpoints, with the idea to make it a platform of choice to publish
|
||||||
static websites. When publishing a website, TTFB can be directly observed
|
static websites. When publishing a website, TTFB can be directly observed
|
||||||
by the end user, as it will impact the perceived reactivity of the websites.
|
by the end user, as it will impact the perceived responsiveness of the websites.
|
||||||
|
@ -136,46 +136,47 @@ whose results are shown on the following figure:
|
||||||
|
|
||||||
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
|
||||||
|
|
||||||
Garage v0.7, which does not support block streaming, features TTFB between 1.6s
|
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
|
||||||
and 2s, which corresponds to the theoretical time to transfer the full block.
|
and 2s, which corresponds to the time to transfer the full block that we calculated earlier.
|
||||||
On the other side of the plot, Garage v0.8 has a very low TTFB thanks to the
|
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
|
||||||
streaming feature (the lowest value is 43ms). Minio sits between the two
|
streaming feature (the lowest value is 43ms). Minio sits between the two
|
||||||
Garage versions: we suppose that it does some form of batching, but smaller
|
Garage versions: we suppose that it does some form of batching, but smaller
|
||||||
than 1MB.
|
than 1MB.
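
As a rough illustration of why a non-streaming server shows a TTFB close to the full block transfer time, the arithmetic can be sketched as follows. The 1 MB block size is inferred from the text above, and the 5 Mbit/s link speed is purely an assumption chosen for illustration; it is not taken from the benchmark setup.

```python
def transfer_time_s(size_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Time needed to push `size_bytes` through a link of the given bandwidth."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Assumption: 1 MB blocks, as suggested by the text above.
BLOCK_SIZE_BYTES = 1_000_000
# Assumption: hypothetical 5 Mbit/s link, chosen only for illustration.
ASSUMED_BANDWIDTH_BITS_PER_S = 5_000_000

if __name__ == "__main__":
    # Without streaming, the first byte is only sent once the whole block
    # has been received, so TTFB is at least the full block transfer time.
    t = transfer_time_s(BLOCK_SIZE_BYTES, ASSUMED_BANDWIDTH_BITS_PER_S)
    print(f"full-block transfer time: {t:.2f}s")
```

Under these assumed values the full-block transfer time falls at the low end of the 1.6s to 2s range reported for v0.7; with streaming, only the first small chunk has to arrive before a byte can be returned.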
|
||||||
|
|
||||||
**Throughput** - As soon as we publicly released Garage, people started
|
**Throughput** - As soon as we publicly released Garage, people started
|
||||||
benchmarking it, comparing its performances to writing directly on the
|
benchmarking it, comparing its performance with writing directly on the
|
||||||
filesystem, and observed that Garage was slower (eg.
|
filesystem, and observed that Garage was slower (e.g.
|
||||||
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
|
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
|
||||||
situation, we put costly processing like hashing on a dedicated thread and did
|
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
|
||||||
many compute optimization
|
and many others
|
||||||
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
|
||||||
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which lead us to
|
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)) which led us to
|
||||||
`v0.8 beta 1`. We also noted logic we wrote (to better control resource usage
|
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
|
||||||
and detect errors, like semaphores or timeouts) was artificially limiting
|
to better control resource usage
|
||||||
|
and detect errors, like semaphores or timeouts, was artificially limiting
|
||||||
performances. In another iteration, we made this logic less restrictive at the
|
performance. In another iteration, we made this logic less restrictive at the
|
||||||
cost of higher resource consumption under load
|
cost of higher resource consumption under load
|
||||||
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
|
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
|
||||||
`v0.8 beta 2`. Finally, we currently do multiple `fsync` calls each time we
|
version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
|
||||||
write a block. We know that this is expensive and did a test build without any
|
write a block. We know that this is expensive and did a test build without any
|
||||||
`fsync` call ([see the
|
`fsync` call ([see the
|
||||||
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
|
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
|
||||||
that will not be merged, just to assess the impact of `fsync`. We refer to it
|
that will not be merged, just to assess the impact of `fsync`. We refer to it
|
||||||
as `no-fsync` in the following plot.
|
as `no-fsync` in the following plot.
|
||||||
|
|
||||||
*A note about fsync: for performance reasons, operating systems often do not
|
*A note about `fsync`: for performance reasons, operating systems often do not
|
||||||
write directly to the disk when a process creates or updates a file in your
|
write directly to the disk when a process creates or updates a file in your
|
||||||
filesystem, instead, the write is kept in memory, and flushed later in a batch
|
filesystem. Instead, the write is kept in memory, and flushed later in a batch
|
||||||
with other writes. If a power loss occurs before the OS has time to flush the
|
with other writes. If a power loss occurs before the OS has time to flush
|
||||||
writes on the disk, data will be lost. To ensure that a write is effectively
|
data to disk, some writes will be lost. To ensure that a write is actually
|
||||||
written on disk, you must use the
|
written on disk, the
|
||||||
[fsync(2)](https://man7.org/linux/man-pages/man2/fsync.2.html) system call: it
|
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
|
||||||
will block until your file or directory has been written from your volatile
|
which blocks until the file or directory on which it is called has been written from volatile
|
||||||
memory to your persisting storage device. Additionally, the exact semantic of
|
memory to the persistent storage device. Additionally, the exact semantics of
|
||||||
fsync [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
`fsync` [differs from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
|
||||||
and, even on battle-tested software like Postgres,
|
and, even on battle-tested software like Postgres, it was
|
||||||
[they "did it wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
|
||||||
Note that on Garage, we are currently working on our "fsync" policy and thus, for
|
Note that on Garage, we are currently working on our `fsync` policy and thus, for
|
||||||
now, you should expect limited data durability in case of power loss, as we are
|
now, you should expect limited data durability in case of power loss, as we are
|
||||||
aware of some inconsistency on this point (which we describe in the following
|
aware of some inconsistencies on this point (which we describe below
|
||||||
and plan to solve).*
|
and plan to solve).*
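
To make the note above concrete, here is a minimal sketch of a durable write in Python. It is not Garage's implementation (Garage is not written in Python); the function name `durable_write` is hypothetical, and the sketch assumes a POSIX system where a directory can be opened and passed to `fsync`.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write `data` to `path` and block until the OS reports it as persisted."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush the file to the device

    # Also fsync the parent directory so that the new directory entry
    # itself is persisted (assumes a POSIX system).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each `os.fsync` call blocks until the operating system reports completion, which is why issuing several of them per block is expensive. As the note says, what "completion" guarantees differs from one OS to another.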
|
||||||
|
@ -188,16 +189,16 @@ performance with a standardized and mixed workload.
|
||||||
|
|
||||||
![Plot showing IO perf of Garage configs and Minio](io.png)
|
![Plot showing IO perf of Garage configs and Minio](io.png)
|
||||||
|
|
||||||
Minio, our ground truth, features the best performances in this test.
|
Minio, our reference point, gives us the best performance in this test.
|
||||||
Considering Garage, we observe that each improvement we made has a visible
|
Looking at Garage, we observe that each improvement we made has a visible
|
||||||
impact on its performances. We also note that we have a progress margin in
|
impact on performance. We also note that we have room for improvement in
|
||||||
terms of performances compared to Minio: additional benchmarks, tests, and
|
terms of performance compared to Minio: additional benchmarks, tests, and
|
||||||
monitoring could help better understand the remaining difference.
|
monitoring could help better understand the remaining difference.
|
||||||
|
|
||||||
|
|
||||||
## A myriad of objects
|
## A myriad of objects
|
||||||
|
|
||||||
Object storage systems do not handle a single object but a myriad of them:
|
Object storage systems do not handle a single object but huge numbers of them:
|
||||||
Amazon claims to handle trillions of objects on their platform, and Red Hat
|
Amazon claims to handle trillions of objects on their platform, and Red Hat
|
||||||
communicates about Ceph being able to handle 10 billion objects. All these
|
communicates about Ceph being able to handle 10 billion objects. All these
|
||||||
objects must be tracked efficiently in the system to be fetched, listed,
|
objects must be tracked efficiently in the system to be fetched, listed,
|
||||||
|
|