New article: Bringing theoretical design and observed performances face to face #12
First, we will measure Time-to-First-Byte (TTFB), i.e. the duration between the moment a request is sent
and the moment where the first bytes of the returned object are received by the client.
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performance.
**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea of making it a platform of choice to publish
static websites. When publishing a website, TTFB can be directly observed
by the end user, as it will impact the perceived reactivity of the website.
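
To make the metric concrete, here is a minimal sketch, in Rust with only the standard library, of how TTFB can be measured against a web endpoint. The host, port and path are hypothetical, and a real benchmark would parse the response headers and time the first byte of the object itself rather than the first byte of the raw response:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Hypothetical target: a static website served by a local web endpoint.
    let host = "localhost:3902";
    let mut stream = TcpStream::connect(host)?;

    let start = Instant::now();
    write!(
        stream,
        "GET /index.html HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    )?;

    // TTFB: time elapsed until the first byte of the response arrives.
    let mut first = [0u8; 1];
    stream.read_exact(&mut first)?;
    println!("TTFB: {:?}", start.elapsed());

    // Drain the rest of the response to also get the total transfer time.
    let mut rest = Vec::new();
    stream.read_to_end(&mut rest)?;
    println!("full transfer: {:?}", start.elapsed());
    Ok(())
}
```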
The results of this benchmark are shown on the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)
Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
and 2s, which corresponds to the time to transfer a full block, as we calculated earlier.
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to its
streaming feature (the lowest value is 43ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but with batches smaller
than 1MB.
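
This is not Garage's actual code (which is asynchronous and spans several RPC layers), but a minimal sketch of the principle behind block streaming, assuming synchronous reads and a hypothetical 16KB chunk size: the v0.7-style path buffers a whole 1MB block before answering, while the v0.8-style path forwards each chunk as soon as it is read:

```rust
use std::io::{Cursor, Read, Write};

const BLOCK_SIZE: usize = 1024 * 1024; // objects are stored as 1 MB blocks

// v0.7-style: the whole block is read before the first byte is sent,
// so TTFB includes the full block transfer time.
fn send_buffered<R: Read, W: Write>(block: &mut R, client: &mut W) -> std::io::Result<()> {
    let mut buf = vec![0u8; BLOCK_SIZE];
    block.read_exact(&mut buf)?; // wait for the complete block...
    client.write_all(&buf) // ...before anything leaves for the client
}

// v0.8-style: each chunk is forwarded as soon as it is available,
// so the client sees the first bytes after one small chunk, not one block.
fn send_streamed<R: Read, W: Write>(block: &mut R, client: &mut W) -> std::io::Result<()> {
    let mut chunk = [0u8; 16 * 1024]; // hypothetical chunk size
    loop {
        let n = block.read(&mut chunk)?;
        if n == 0 {
            return Ok(());
        }
        client.write_all(&chunk[..n])?;
    }
}

fn main() -> std::io::Result<()> {
    let block = vec![0u8; BLOCK_SIZE];
    let (mut out1, mut out2) = (Vec::new(), Vec::new());
    send_buffered(&mut Cursor::new(&block[..]), &mut out1)?;
    send_streamed(&mut Cursor::new(&block[..]), &mut out2)?;
    assert_eq!(out1, out2); // same bytes delivered, very different TTFB
    Ok(())
}
```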
**Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performance with writing directly to the
filesystem, and observed that Garage was slower (e.g.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread,
and many others
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
to better control resource usage
and detect errors, like semaphores or timeouts, was artificially limiting
performance. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
write a block. We know that this is expensive and did a test build without any
`fsync` call ([see the
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
that will not be merged, just to assess the impact of `fsync`. We refer to it
as `no-fsync` in the following plot.
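
As an illustration of the kind of resource-control logic mentioned above, here is a hypothetical sketch (not Garage's actual implementation) using `tokio::sync::Semaphore`, assuming the `tokio` crate with its `full` feature set as a dependency. A concurrency limit like this protects the node under load, but if it is set too low it becomes the bottleneck instead of the disk:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Stand-in for a block write; the real path would hit the disk (and fsync).
async fn write_block(block_id: u32) {
    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("block {block_id} written");
}

#[tokio::main]
async fn main() {
    // At most 4 blocks may be written concurrently.
    let permits = Arc::new(Semaphore::new(4));

    let mut tasks = Vec::new();
    for block_id in 0..16 {
        let permits = permits.clone();
        tasks.push(tokio::spawn(async move {
            // acquire() waits here whenever all permits are taken,
            // which is exactly where an over-strict limit costs throughput.
            let _permit = permits.acquire().await.unwrap();
            write_block(block_id).await;
            // the permit is released when `_permit` is dropped
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```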
*A note about `fsync`: for performance reasons, operating systems often do not
write directly to the disk when a process creates or updates a file in your
filesystem. Instead, the write is kept in memory and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush
data to disk, some writes will be lost. To ensure that a write is effectively
written on disk, the
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
which blocks until the file or directory on which it is called has been written from volatile
memory to the persistent storage device. Additionally, the exact semantics of
`fsync` [differ from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres, it was
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that in Garage, we are currently working on our `fsync` policy and thus, for
now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistencies on this point (which we describe below and
plan to solve).*
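
To make this concrete, here is a minimal Rust sketch of a durable write; the file name is just an example. `sync_all()` wraps `fsync(2)`, and note how durably creating a file also involves syncing its parent directory, which hints at why writing a single block can require multiple `fsync` calls:

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut file = File::create("block.bin")?;
    file.write_all(b"some block data")?;

    // Without this, the data may still sit in the OS page cache and be
    // lost on power failure: sync_all() wraps fsync(2) and blocks until
    // the file's contents and metadata reach the storage device.
    file.sync_all()?;

    // On Unix, the new directory entry itself is only durable once the
    // parent directory is fsync'ed too, which is one reason a single
    // block write can involve several fsync calls.
    File::open(".")?.sync_all()?;
    Ok(())
}
```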
The following plot shows performance with a standardized and mixed workload:
![Plot showing IO perf of Garage configs and Minio](io.png)
Minio, our reference point, gives us the best performance in this test.
Looking at Garage, we observe that each improvement we made has a visible
impact on performance. We also note that we still have room for improvement
compared to Minio: additional benchmarks, tests, and
monitoring could help us better understand the remaining difference.
## A myriad of objects
Object storage systems do not handle a single object but huge numbers of them:
Amazon claims to handle trillions of objects on their platform, and Red Hat
communicates about Ceph being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed,