New article: Bringing theoretical design and observed performances face to face #12

Merged
quentin merged 24 commits from perf into master 2022-09-29 11:16:04 +00:00
Showing only changes of commit bacebcfbf1

@ -97,7 +97,7 @@ and the moment where the first bytes of the returned object are received by the
Second, we will evaluate generic throughput, to understand how well
Garage can leverage the underlying machine's performance.

**Time-to-First-Byte** - One specificity of Garage is that we implemented S3
web endpoints, with the idea to make it a platform of choice to publish
static websites. When publishing a website, TTFB can be directly observed
by the end user, as it will impact the perceived reactivity of the websites.
@ -136,46 +136,47 @@ whose results are shown on the following figure:
![Plot showing the TTFB observed on Garage v0.8, v0.7 and Minio](ttfb.png)

Garage v0.7, which does not support block streaming, gives us a TTFB between 1.6s
and 2s, which corresponds to the time to transfer the full block that we calculated earlier.
On the other side of the plot, we can see Garage v0.8 with a very low TTFB thanks to the
streaming feature (the lowest value is 43ms). Minio sits between the two
Garage versions: we suppose that it does some form of batching, but smaller
than 1MB.

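As a rough sanity check of the full-block transfer time mentioned above, the delay before a non-streaming server can forward anything is approximately the block size divided by the bottleneck bandwidth. The sketch below assumes a 1 MB block and a hypothetical 5 Mbit/s link; both values are assumptions chosen for illustration (the bandwidth is not stated in this section), and network latency is ignored.

```python
def full_block_transfer_time(block_size_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Seconds needed to transfer one whole block over a link, ignoring latency."""
    if bandwidth_bits_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return block_size_bytes * 8 / bandwidth_bits_per_s


# Assumed values, for illustration only (not taken from the benchmark setup).
BLOCK_SIZE_BYTES = 1_000_000      # assumed 1 MB block
BANDWIDTH_BITS_PER_S = 5_000_000  # hypothetical 5 Mbit/s bottleneck

ttfb_lower_bound = full_block_transfer_time(BLOCK_SIZE_BYTES, BANDWIDTH_BITS_PER_S)
print(f"{ttfb_lower_bound:.1f} s")  # prints "1.6 s"
```

Under these assumptions the lower bound is 1.6 s, which matches the low end of the range observed without streaming. With streaming, the first bytes can be forwarded as soon as they arrive, so this bound no longer applies.
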
**Throughput** - As soon as we publicly released Garage, people started
benchmarking it, comparing its performance with writing directly on the
filesystem, and observed that Garage was slower (e.g.
[#288](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/288)). To improve the
situation, we did some optimizations, such as putting costly processing like hashing on a dedicated thread
and many others
([#342](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/342),
[#343](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/343)), which led us to
version 0.8 "Beta 1". We also noticed that some of the logic we wrote
to better control resource usage
and detect errors, such as semaphores or timeouts, was artificially limiting
performance. In another iteration, we made this logic less restrictive at the
cost of higher resource consumption under load
([#387](https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/387)), resulting in
version 0.8 "Beta 2". Finally, we currently do multiple `fsync` calls each time we
write a block. We know that this is expensive and did a test build without any
`fsync` call ([see the
commit](https://git.deuxfleurs.fr/Deuxfleurs/garage/commit/432131f5b8c2aad113df3b295072a00756da47e7))
that will not be merged, just to assess the impact of `fsync`. We refer to it
as `no-fsync` in the following plot.

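To illustrate the general idea of moving costly processing such as hashing onto a dedicated thread, here is a minimal Python sketch. It is not Garage's implementation: the choice of SHA-256, the single-worker pool, and the names `hash_block` and `_hash_pool` are all assumptions made for the example.

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical single worker thread reserved for CPU-heavy hashing.
_hash_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="hasher")


def _sha256_hex(block: bytes) -> str:
    # SHA-256 is an assumption; any costly digest would illustrate the point.
    return hashlib.sha256(block).hexdigest()


async def hash_block(block: bytes) -> str:
    """Hash a block on the worker thread so the event loop stays free for I/O."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_hash_pool, _sha256_hex, block)


async def main() -> list[str]:
    blocks = [bytes([i]) * 1_000_000 for i in range(4)]
    # While digests are computed on the worker, the loop can serve other tasks.
    return await asyncio.gather(*(hash_block(b) for b in blocks))


if __name__ == "__main__":
    for digest in asyncio.run(main()):
        print(digest)
```

The design choice is that network and disk tasks scheduled on the event loop are not stalled behind a long hash computation; the cost is one extra thread and a hand-off per block.
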
*A note about `fsync`: for performance reasons, operating systems often do not
write directly to the disk when a process creates or updates a file in your
filesystem. Instead, the write is kept in memory, and flushed later in a batch
with other writes. If a power loss occurs before the OS has time to flush
data to disk, some writes will be lost. To ensure that a write is effectively
written on disk, the
[`fsync(2)`](https://man7.org/linux/man-pages/man2/fsync.2.html) system call must be used,
which blocks until the file or directory on which it is called has been written from volatile
memory to the persistent storage device. Additionally, the exact semantics of
`fsync` [differ from one OS to another](https://mjtsai.com/blog/2022/02/17/apple-ssd-benchmarks-and-f_fullsync/)
and, even on battle-tested software like Postgres, it was
["done wrong for 20 years"](https://archive.fosdem.org/2019/schedule/event/postgresql_fsync/).
Note that on Garage, we are currently working on our `fsync` policy and thus, for
now, you should expect limited data durability in case of power loss, as we are
aware of some inconsistency on this point (which we describe in the following
and plan to solve).*
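
The behaviour described in this note can be sketched in a few lines of Python on a POSIX system. This is only an illustration of the system call, not Garage's write path; the helper name `durable_write` and the file name are hypothetical.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and block until the OS reports it as flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # flush the file's contents from memory to the device
    finally:
        os.close(fd)

    # Also fsync the parent directory so the new directory entry is persisted.
    # Opening a directory this way is POSIX-specific (an assumption here).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "block.bin")  # hypothetical file name
    durable_write(target, b"some block data")
```

Each `os.fsync` call blocks until the kernel reports completion, which is why calling it several times per block is expensive. What "completion" guarantees still depends on the OS and the storage device, as the links above discuss.
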
@ -188,16 +189,16 @@ performance with a standardized and mixed workload.
![Plot showing IO perf of Garage configs and Minio](io.png)

Minio, our reference point, gives us the best performance in this test.
Looking at Garage, we observe that each improvement we made has a visible
impact on performance. We also note that we still have room for improvement
compared to Minio: additional benchmarks, tests, and
monitoring could help better understand the remaining difference.

## A myriad of objects

Object storage systems do not handle a single object but huge numbers of them:
Amazon claims to handle trillions of objects on their platform, and Red Hat
communicates about Ceph being able to handle 10 billion objects. All these
objects must be tracked efficiently in the system to be fetched, listed,